
    In-memory factorization of holographic perceptual representations. (arXiv:2211.05052v1 [cs.ET])
    Disentanglement of the constituent factors of a sensory signal is central to perception and cognition and hence is a critical task for future artificial intelligence systems. In this paper, we present a compute engine capable of efficiently factorizing holographic perceptual representations by exploiting the computation-in-superposition capability of brain-inspired hyperdimensional computing and the intrinsic stochasticity associated with analog in-memory computing based on nanoscale memristive devices. Such an iterative in-memory factorizer is shown to solve problems at least five orders of magnitude larger than those that can be solved otherwise, while also significantly lowering the computational time and space complexity. We present a large-scale experimental demonstration of the factorizer by employing two in-memory compute chips based on phase-change memristive devices. The dominant matrix-vector multiply operations are executed in O(1) time, reducing the computational time complexity to merely the number of iterations. Moreover, we experimentally demonstrate the ability to factorize visual perceptual representations reliably and efficiently.
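    For intuition, the core loop of such a factorizer can be sketched in plain software. Below is a minimal resonator-network-style iteration over random bipolar hypervectors; the dimensions, seed, and sign-based cleanup are illustrative assumptions and stand in for, rather than reproduce, the paper's memristive in-memory implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D, M = 1024, 20                      # hypervector dimension, codebook size

# Random bipolar codebooks for three factors.
X, Y, Z = (rng.choice([-1, 1], size=(M, D)) for _ in range(3))

# Ground-truth product: elementwise binding of one vector per factor.
ix, iy, iz = 3, 7, 11
s = X[ix] * Y[iy] * Z[iz]

def cleanup(v, codebook):
    """Project onto the codebook and threshold back to a bipolar vector."""
    a = codebook.T @ (codebook @ v)   # similarity-weighted superposition
    return np.sign(a + (a == 0))      # break ties away from zero

# Initialize each estimate as the superposition of its whole codebook.
xh, yh, zh = (np.sign(cb.sum(axis=0) + 0.5) for cb in (X, Y, Z))

for it in range(200):
    # Unbind the other two estimates from s, then clean up per codebook.
    xh = cleanup(s * yh * zh, X)
    yh = cleanup(s * xh * zh, Y)
    zh = cleanup(s * xh * yh, Z)
    if (xh == X[ix]).all() and (yh == Y[iy]).all() and (zh == Z[iz]).all():
        print(f"converged after {it + 1} iterations")
        break
else:
    print("did not converge")
```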
    A Characterization of List Learnability. (arXiv:2211.04956v1 [stat.ML])
    A classical result in learning theory shows the equivalence of PAC learnability of binary hypothesis classes and the finiteness of VC dimension. Extending this to the multiclass setting was an open problem, which was settled in a recent breakthrough result characterizing multiclass PAC learnability via the DS dimension introduced earlier by Daniely and Shalev-Shwartz. In this work we consider list PAC learning, where the goal is to output a list of $k$ predictions. List learning algorithms have been developed in several settings before, and indeed, list learning played an important role in the recent characterization of multiclass learnability. Here we ask: when is it possible to $k$-list learn a hypothesis class? We completely characterize $k$-list learnability in terms of a generalization of the DS dimension that we call the $k$-DS dimension. Generalizing the recent characterization of multiclass learnability, we show that a hypothesis class is $k$-list learnable if and only if the $k$-DS dimension is finite.
    Efficient Large-scale Audio Tagging via Transformer-to-CNN Knowledge Distillation. (arXiv:2211.04772v1 [cs.SD])
    Audio Spectrogram Transformer models rule the field of Audio Tagging, outperforming the previously dominant Convolutional Neural Networks (CNNs). Their superiority is based on the ability to scale up and exploit large-scale datasets such as AudioSet. However, Transformers are demanding in terms of model size and computational requirements compared to CNNs. We propose a training procedure for efficient CNNs based on offline Knowledge Distillation (KD) from high-performing yet complex transformers. The proposed training schema and the efficient CNN design based on MobileNetV3 result in models that outperform previous solutions in terms of parameter efficiency, computational efficiency, and prediction performance. We provide models of different complexity levels, scaling from low-complexity models up to a new state-of-the-art performance of .483 mAP on AudioSet. Source code is available at: https://github.com/fschmid56/EfficientAT
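    As a concrete reading of the offline KD recipe for multi-label audio tagging, the following PyTorch sketch trains a student against precomputed teacher logits; the BCE-on-soft-targets form and the weighting lam are assumptions consistent with the description above, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, lam=0.1):
    """Offline KD objective for multi-label tagging: the student matches
    both the ground-truth labels and the teacher's soft predictions."""
    hard = F.binary_cross_entropy_with_logits(student_logits, labels)
    soft = F.binary_cross_entropy_with_logits(
        student_logits, torch.sigmoid(teacher_logits))
    return lam * hard + (1.0 - lam) * soft

# Toy batch: 4 clips, 527 AudioSet classes; teacher logits are precomputed
# offline, so no transformer forward pass is needed during student training.
B, C = 4, 527
student_logits = torch.randn(B, C, requires_grad=True)
teacher_logits = torch.randn(B, C)
labels = torch.randint(0, 2, (B, C)).float()
loss = kd_loss(student_logits, teacher_logits, labels)
loss.backward()
```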
    Holmes: An Efficient and Lightweight Semantic Based Anomalous Email Detector. (arXiv:2104.08044v13 [cs.CR] UPDATED)
    Email threats are a serious issue for enterprise security. The threat can take various malicious forms, such as phishing, fraud, blackmail and malvertisement. The traditional anti-spam gateway often maintains a greylist to filter out unexpected emails based on suspicious vocabulary present in the email's subject and contents. However, this type of signature-based approach cannot effectively discover novel and unknown suspicious emails that utilize various evolving malicious payloads. To address the problem, in this paper we present Holmes, an efficient and lightweight semantic-based engine for anomalous email detection. Holmes converts each email event log into a sentence through word embedding and then identifies abnormalities that deviate from a historical baseline based on those translated sentences. We have evaluated the performance of Holmes in a real-world enterprise environment, where around 5,000 emails are sent/received each day. In our experiments, Holmes shows a high capability to detect email threats, especially those that cannot be handled by the enterprise anti-spam gateway. Our experiments also demonstrate that Holmes can discover concealed malicious emails that evade several commercial detection tools.
    Collaborative Best Arm Identification with Limited Communication on Non-IID Data. (arXiv:2207.08015v2 [cs.LG] UPDATED)
    In this paper, we study the tradeoffs between the time speedup and the round complexity in the collaborative learning model with non-IID data, where multiple agents interact with possibly different environments and they want to learn an objective in the aggregated environment. We use a basic problem in bandit theory called best arm identification in multi-armed bandits as a vehicle to deliver the following conceptual message: collaborative learning on non-IID data is provably more difficult than that on IID data. In particular, we show the following: 1) The learning time speedup in the non-IID data setting can be much smaller than $1$ (that is, a slowdown). When the number of rounds $R = O(1)$, we will need at least a polynomial number of agents (in terms of the number of arms) to achieve a speedup of $\tilde{\Omega}(1)$. This is in stark contrast to the IID data setting, where the speedup is always $\tilde{\Omega}(1)$ regardless of $R$ and the number of agents $K$. 2) Local adaptivity of the agents cannot help much in the non-IID data setting. This is in contrast with the IID data setting, in which, to achieve the same speedup, the best non-adaptive algorithm requires a significantly larger number of rounds than the best adaptive algorithm.
    miCSE: Mutual Information Contrastive Learning for Low-shot Sentence Embeddings. (arXiv:2211.04928v1 [cs.CL])
    This paper presents miCSE, a mutual information-based contrastive learning framework that significantly advances the state of the art in few-shot sentence embedding. The proposed approach imposes alignment between the attention patterns of different views during contrastive learning. Learning sentence embeddings with miCSE entails enforcing syntactic consistency across augmented views for every single sentence, making contrastive self-supervised learning more sample efficient. As a result, the proposed approach shows strong performance in the few-shot learning domain. While it achieves superior results compared to state-of-the-art methods on multiple benchmarks in few-shot learning, it is comparable in the full-shot scenario. The proposed approach is conceptually simple, easy to implement and optimize, yet empirically powerful. This study opens up avenues for efficient self-supervised learning methods that are more robust than current contrastive methods for sentence embedding.
    Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy. (arXiv:2107.02780v5 [econ.EM] UPDATED)
    The US Census Bureau will deliberately corrupt data sets derived from the 2020 US Census in an effort to maintain privacy, suggesting a painful trade-off between the privacy of respondents and the precision of economic analysis. To investigate whether this trade-off is inevitable, we formulate a semiparametric model of causal inference with high dimensional corrupted data. We propose a procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments, with a rate of $n^{-1/2}$ for semiparametric estimands that degrades gracefully for nonparametric estimands. Our key assumption is that the true covariates are approximately low rank, which we interpret as approximate repeated measurements and validate in the Census. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. Calibrated simulations verify the coverage of our data cleaning-adjusted confidence intervals and demonstrate the relevance of our results for 2020 Census data.
    ParGAN: Learning Real Parametrizable Transformations. (arXiv:2211.04996v1 [cs.CV])
    Current methods for image-to-image translation produce compelling results; however, the applied transformation is difficult to control, since existing mechanisms are often limited and non-intuitive. We propose ParGAN, a generalization of the cycle-consistent GAN framework to learn image transformations with simple and intuitive controls. The proposed generator takes as input both an image and a parametrization of the transformation. We train this network to preserve the content of the input image while ensuring that the result is consistent with the given parametrization. Our approach does not require paired data and can learn transformations across several tasks and datasets. We show how, with disjoint image domains with no annotated parametrization, our framework can create smooth interpolations as well as learn multiple transformations simultaneously.
    Acting upon Imagination: when to trust imagined trajectories in model based reinforcement learning. (arXiv:2105.05716v3 [cs.AI] UPDATED)
    Model-based reinforcement learning (MBRL) uses an imperfect model of the world to imagine trajectories of future states and plan the best actions that maximize a given reward. These trajectories are imperfect, and MBRL attempts to overcome this by relying on model predictive control (MPC) to continuously re-imagine trajectories from scratch. Such regeneration of imagined trajectories carries the major computational cost, and the complexity increases in tasks with a longer receding horizon. We investigate how far into the future imagined trajectories can be relied upon while still maintaining acceptable reward. After taking each action, information becomes available about its immediate effect and its impact on the outcomes expected of future actions. Accordingly, we propose four methods for deciding whether to trust and act upon imagined trajectories: i) looking at recent errors with respect to expectations, ii) comparing the confidence in an imagined action against its execution, iii) observing the deviation in projected future states, and iv) observing the deviation in projected future rewards. An experiment analyzing the effects of acting upon imagination shows that our methods reduce computation by at least 20\% and up to 80\%, depending on the environment, while retaining acceptable reward.
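    A toy version of criterion iii) makes the trade-off concrete: keep executing the imagined plan while the observed state tracks the projection, and replan only when it drifts. Everything below (the one-dimensional dynamics, the threshold, the noise level) is illustrative, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(1)

def plan(state, horizon=10):
    """Stand-in for MPC planning: imagine an action sequence and the
    trajectory of states the (toy) dynamics model predicts for it."""
    actions = rng.normal(size=horizon)
    imagined = state + np.cumsum(actions)        # model: s' = s + a
    return actions, imagined

def env_step(state, action):
    return state + action + rng.normal(scale=0.05)  # real env is noisier

state, t, threshold, replans = 0.0, 0, 0.25, 0
actions, imagined = plan(state)
for _ in range(100):
    state = env_step(state, actions[t])
    # Method (iii): keep trusting the remaining imagined trajectory only
    # while the real state stays close to the projected one.
    if t + 1 >= len(actions) or abs(state - imagined[t]) > threshold:
        actions, imagined = plan(state)
        t, replans = 0, replans + 1
    else:
        t += 1
print(f"replanned {replans} times in 100 steps (vs. 100 for vanilla MPC)")
```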
    Characteristic Neural Ordinary Differential Equations. (arXiv:2111.13207v4 [cs.LG] UPDATED)
    We propose Characteristic-Neural Ordinary Differential Equations (C-NODEs), a framework for extending Neural Ordinary Differential Equations (NODEs) beyond ODEs. While NODEs model the evolution of latent variables as the solution to an ODE, C-NODEs model the evolution of the latent variables as the solution of a family of first-order quasi-linear partial differential equations (PDEs) along curves on which the PDEs reduce to ODEs, referred to as characteristic curves. This in turn allows the application of the standard frameworks for solving ODEs, namely the adjoint method. Learning optimal characteristic curves for given tasks improves the performance and computational efficiency compared to state-of-the-art NODE models. We prove that the C-NODE framework extends the classical NODE on classification tasks by demonstrating explicit C-NODE-representable functions not expressible by NODEs. Additionally, we present C-NODE-based continuous normalizing flows, which describe the density evolution of latent variables along multiple dimensions. Empirical results demonstrate the improvements provided by the proposed method for classification and density estimation on the CIFAR-10, SVHN, and MNIST datasets under a computational budget similar to existing NODE methods. The results also provide empirical evidence that the learned curves improve the efficiency of the system through a lower number of parameters and function evaluations compared with baselines.
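    For readers who have not met the method of characteristics, the textbook reduction the framework builds on is the following (standard PDE material, not a contribution of the paper):

```latex
a(x,t,u)\,\partial_t u + b(x,t,u)\,\partial_x u = c(x,t,u)
\quad\Longrightarrow\quad
\frac{dt}{ds}=a,\qquad \frac{dx}{ds}=b,\qquad \frac{du}{ds}=c,
```

    since along a curve satisfying the first two ODEs the chain rule gives $\frac{du}{ds} = u_t\frac{dt}{ds} + u_x\frac{dx}{ds} = a\,u_t + b\,u_x = c$; the PDE thus collapses to an ODE system along each characteristic, to which standard NODE machinery such as the adjoint method applies.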
    Innovations in Integrating Machine Learning and Agent-Based Modeling of Biomedical Systems. (arXiv:2206.01092v2 [q-bio.QM] UPDATED)
    Agent-based modeling (ABM) is a well-established paradigm for simulating complex systems via interactions between constituent entities. Machine learning (ML) refers to approaches whereby statistical algorithms 'learn' from data on their own, without imposing a priori theories of system behavior. Biological systems -- from molecules, to cells, to entire organisms -- consist of vast numbers of entities, governed by complex webs of interactions that span many spatiotemporal scales and exhibit nonlinearity, stochasticity and intricate coupling between entities. The macroscopic properties and collective dynamics of such systems are difficult to capture via continuum modelling and mean-field formalisms. ABM takes a 'bottom-up' approach that obviates these difficulties by enabling one to easily propose and test a set of well-defined 'rules' to be applied to the individual entities (agents) in a system. Evaluating a system and propagating its state over discrete time-steps effectively simulates the system, allowing observables to be computed and system properties to be analyzed. Because the rules that govern an ABM can be difficult to abstract and formulate from experimental data, there is an opportunity to use ML to help infer optimal, system-specific ABM rules. Once such rule-sets are devised, ABM calculations can generate a wealth of data, and ML can be applied there too -- e.g., to probe statistical measures that meaningfully describe a system's stochastic properties. As an example of synergy in the other direction (from ABM to ML), ABM simulations can generate realistic datasets for training ML algorithms (e.g., for regularization, to mitigate overfitting). In these ways, one can envision various synergistic ABM$\rightleftharpoons$ML loops. This review summarizes how ABM and ML have been integrated in contexts that span spatiotemporal scales, from cellular to population-level epidemiology.
    Discrimination and Class Imbalance Aware Online Naive Bayes. (arXiv:2211.04812v1 [cs.LG])
    Fairness-aware mining of massive data streams is a growing and challenging concern in the contemporary domain of machine learning. Many stream learning algorithms are used to replace humans at critical decision-making points, e.g., hiring staff, assessing credit risk, etc. This calls for handling massive incoming information with minimum response delay while ensuring fair and high-quality decisions. Recent discrimination-aware learning methods are optimized based on overall accuracy. However, overall accuracy is biased in favor of the majority class; therefore, state-of-the-art methods mainly diminish discrimination by partially or completely ignoring the minority class. In this context, we propose a novel adaptation of Na\"ive Bayes that mitigates discrimination embedded in the streams while maintaining high predictive performance for both the majority and minority classes. Our proposed algorithm is simple, fast, and attains multi-objective optimization goals. To handle class imbalance and concept drift, a dynamic instance weighting module is proposed, which gives more importance to recent instances and less importance to obsolete instances based on their membership in the minority or majority class. We conducted experiments on a range of streaming and static datasets and found that our proposed methodology outperforms existing state-of-the-art fairness-aware methods in terms of both discrimination score and balanced accuracy.
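    One simple way to realize such a dynamic instance-weighting scheme is an incremental Gaussian naive Bayes with exponential decay and class-dependent instance weights. The sketch below is a generic illustration under those assumptions, not the authors' exact module; the decay rate and minority upweighting factor are placeholders.

```python
import numpy as np

class DecayedNaiveBayes:
    """Gaussian naive Bayes over a stream with exponential instance decay.
    Decay keeps recent instances influential and fades obsolete ones, one
    simple way to track concept drift and rebalance classes."""

    def __init__(self, n_features, n_classes=2, decay=0.999):
        self.decay = decay
        self.w = np.zeros(n_classes)                 # decayed class mass
        self.s1 = np.zeros((n_classes, n_features))  # decayed sum of x
        self.s2 = np.zeros((n_classes, n_features))  # decayed sum of x^2

    def update(self, x, y, weight=1.0):
        self.w *= self.decay
        self.s1 *= self.decay
        self.s2 *= self.decay
        self.w[y] += weight
        self.s1[y] += weight * x
        self.s2[y] += weight * x * x

    def predict(self, x):
        prior = np.log(self.w + 1e-9) - np.log(self.w.sum() + 1e-9)
        mean = self.s1 / (self.w[:, None] + 1e-9)
        var = self.s2 / (self.w[:, None] + 1e-9) - mean**2 + 1e-6
        loglik = -0.5 * (np.log(2 * np.pi * var) + (x - mean)**2 / var).sum(1)
        return np.argmax(prior + loglik)

rng = np.random.default_rng(0)
nb = DecayedNaiveBayes(n_features=2)
for _ in range(2000):
    y = int(rng.random() < 0.1)                  # imbalanced stream
    x = rng.normal(loc=2.0 * y, size=2)
    # Upweight minority-class instances to counter the imbalance.
    nb.update(x, y, weight=5.0 if y == 1 else 1.0)
print(nb.predict(np.array([2.0, 2.0])), nb.predict(np.array([0.0, 0.0])))
```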
    Wide Attention Is The Way Forward For Transformers?. (arXiv:2210.00640v2 [cs.LG] UPDATED)
    The Transformer is an extremely powerful and prominent deep learning architecture. In this work, we challenge the commonly held belief in deep learning that going deeper is better, and show an alternative design approach: building wider attention Transformers. We demonstrate that wide single-layer Transformer models can compete with or outperform deeper ones in a variety of Natural Language Processing (NLP) tasks when both are trained from scratch. We then systematically study the impact of changing the model aspect ratio on Transformers. This ratio balances the number of layers and the number of attention heads per layer, while keeping the total number of attention heads and all other hyperparameters constant. On average, across 4 NLP tasks and 10 attention types, single-layer wide models perform 0.3% better than their deep counterparts. We provide an in-depth evaluation and demonstrate that wide models require a far smaller memory footprint and can run faster on commodity hardware; in addition, these wider models are also more interpretable. For example, a single-layer Transformer on IMDb byte-level text classification has 3.1x faster inference latency on a CPU than its equally accurate deeper counterpart, and is half the size. We therefore put forward wider and shallower models as a viable and desirable alternative for small models on NLP tasks, and as an important area of research for domains beyond this.
    System Safety Engineering for Social and Ethical ML Risks: A Case Study. (arXiv:2211.04602v1 [cs.LG])
    Governments, industry, and academia have undertaken efforts to identify and mitigate harms in ML-driven systems, with a particular focus on the social and ethical risks of ML components in complex sociotechnical systems. However, existing approaches are largely disjointed, ad hoc, and of unknown effectiveness. Systems safety engineering is a well-established discipline with a track record of identifying and managing risks in many complex sociotechnical domains. We adopt the natural hypothesis that tools from this domain could serve to enhance risk analyses of ML in its context of use. To test this hypothesis, we apply a "best of breed" systems safety analysis, Systems Theoretic Process Analysis (STPA), to a specific high-consequence system with an important ML-driven component, namely the Prescription Drug Monitoring Programs (PDMPs) operated by many US States, several of which rely on an ML-derived risk score. We focus in particular on how this analysis can extend to identifying social and ethical risks and developing concrete design-level controls to mitigate them.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v5 [stat.ME] UPDATED)
    Neural density estimators have proven remarkably powerful in performing efficient simulation-based Bayesian inference in various research domains. In particular, the BayesFlow framework uses a two-step approach to enable amortized parameter estimation in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations are poor representations of reality? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the latent data space, and we utilize maximum mean discrepancy (MMD) to detect, during inference, potentially catastrophic misspecifications that undermine the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase as a function of the distance between the true data-generating distribution and the typical set of simulations in the latent summary space. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized Bayesian inference.
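    The detection criterion itself is easy to reproduce in miniature: compute an MMD estimate between latent summaries of simulated and observed data, and flag large values. The Gaussian kernel, bandwidth, and toy distributions below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth=1.0):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy between samples."""
    kxx = gaussian_kernel(x, x, bandwidth).mean()
    kyy = gaussian_kernel(y, y, bandwidth).mean()
    kxy = gaussian_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

rng = np.random.default_rng(0)
# Latent summaries of simulated data are trained to look standard normal;
# summaries of observed data that drift away signal misspecification.
simulated = rng.normal(size=(500, 8))
well_specified = rng.normal(size=(500, 8))
misspecified = rng.normal(loc=1.5, size=(500, 8))
print("in-distribution MMD^2:", mmd2(simulated, well_specified))
print("misspecified   MMD^2:", mmd2(simulated, misspecified))
```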
    Multi-level Domain Adaptation for Lane Detection. (arXiv:2206.10692v2 [cs.CV] UPDATED)
    We focus on bridging domain discrepancy in lane detection among different scenarios to greatly reduce the extra annotation and re-training costs for autonomous driving. A critical factor hindering performance improvement in cross-domain lane detection is that conventional methods only focus on pixel-wise loss while ignoring the shape and position priors of lanes. To address the issue, we propose the Multi-level Domain Adaptation (MLDA) framework, a new perspective for handling cross-domain lane detection at three complementary semantic levels of pixel, instance and category. Specifically, at the pixel level, we propose to apply cross-class confidence constraints in self-training to tackle the imbalanced confidence distribution of lane and background. At the instance level, we go beyond pixels to treat segmented lanes as instances and facilitate discriminative features in the target domain with triplet learning, which effectively rebuilds the semantic context of lanes and helps alleviate feature confusion. At the category level, we propose an adaptive inter-domain embedding module to utilize the position prior of lanes during adaptation. On two challenging datasets, i.e., TuSimple and CULane, our approach improves lane detection performance by a large margin, with gains of 8.8% in accuracy and 7.4% in F1-score respectively, compared with state-of-the-art domain adaptation algorithms.
    Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v4 [math.OC] UPDATED)
    We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $\epsilon$-stationary point is of the order $\mathcal{O}(1/\epsilon^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/\epsilon^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.
    Predicting Shallow Water Dynamics using Echo-State Networks with Transfer Learning. (arXiv:2112.09182v2 [cs.LG] UPDATED)
    In this paper we demonstrate that reservoir computing can be used to learn the dynamics of the shallow-water equations. In particular, while most previous applications of reservoir computing have required training on a particular trajectory to further predict the evolution along that trajectory alone, we show the capability of reservoir computing to predict trajectories of the shallow-water equations with initial conditions not seen in the training process. However, in this setting, we find that the performance of the network deteriorates for initial conditions with ambient conditions (such as total water height and average velocity) that are different from those in the training dataset. To circumvent this deficiency, we introduce a transfer learning approach wherein a small additional training step with the relevant ambient conditions is used to improve the predictions.
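    For readers new to reservoir computing, the following minimal echo-state network (fixed random reservoir, ridge-regression readout) shows the training pattern on a toy one-dimensional signal. The reservoir size, leak rate, and spectral radius are illustrative assumptions; the paper's shallow-water solver and transfer-learning step are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_res, n_in, leak = 300, 1, 0.5

# Fixed random reservoir, rescaled to a spectral radius below 1.
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))

def run_reservoir(u_seq):
    x, states = np.zeros(n_res), []
    for u in u_seq:
        x = (1 - leak) * x + leak * np.tanh(W @ x + W_in @ np.atleast_1d(u))
        states.append(x.copy())
    return np.array(states)

# Teacher signal: predict the next value of a toy wave.
t = np.linspace(0, 60, 3000)
u = np.sin(t) * np.cos(0.3 * t)
X = run_reservoir(u[:-1])
Y = u[1:]

# Ridge-regression readout: the only trained component of the ESN.
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ Y)
pred = X @ W_out
print("train MSE:", np.mean((pred - Y) ** 2))
```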
    Direct multi-modal inversion of geophysical logs using deep learning. (arXiv:2201.01871v3 [physics.geo-ph] UPDATED)
    Geosteering of wells requires fast interpretation of geophysical logs, which is a non-unique inverse problem. This work presents a proof-of-concept approach to multi-modal probabilistic inversion of logs using a single evaluation of an artificial deep neural network (DNN). A mixture density DNN (MDN) is trained using the "multiple-trajectory-prediction" (MTP) loss function, which avoids the mode collapse typical of traditional MDNs and allows multi-modal prediction ahead of data. The proposed approach is verified on the real-time stratigraphic inversion of gamma-ray logs. The multi-modal predictor outputs several likely inverse solutions/predictions, providing more accurate and realistic solutions than deterministic regression using a DNN. For these likely stratigraphic curves, the model simultaneously predicts their probabilities, which are implicitly learned from the training geological data. The stratigraphy predictions and their probabilities, obtained in milliseconds from the MDN, can enable better real-time decisions under geological uncertainties.
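    The key ingredient is a winner-takes-all style loss over several predicted modes. The PyTorch sketch below implements a generic variant consistent with that description: only the mode closest to the target receives the regression gradient, while a classifier head learns the mode probabilities. The exact MTP formulation in the paper may differ in details.

```python
import torch
import torch.nn.functional as F

def mtp_loss(pred_modes, mode_logits, target):
    """Winner-takes-all loss for a multi-modal predictor.

    pred_modes:  (B, M, D)  M candidate predictions per sample
    mode_logits: (B, M)     unnormalized mode probabilities
    target:      (B, D)     ground-truth curve/vector
    """
    err = ((pred_modes - target[:, None, :]) ** 2).mean(dim=-1)  # (B, M)
    best = err.argmin(dim=1)                                     # (B,)
    reg = err.gather(1, best[:, None]).mean()      # regression on best mode
    cls = F.cross_entropy(mode_logits, best)       # learn mode probabilities
    return reg + cls

B, M, D = 8, 4, 16
pred = torch.randn(B, M, D, requires_grad=True)
logits = torch.randn(B, M, requires_grad=True)
target = torch.randn(B, D)
mtp_loss(pred, logits, target).backward()
```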
    Decision-Focused Learning without Differentiable Optimization: Learning Locally Optimized Decision Losses. (arXiv:2203.16067v4 [cs.LG] UPDATED)
    Decision-Focused Learning (DFL) is a paradigm for tailoring a predictive model to a downstream optimization task that uses its predictions in order to perform better on that specific task. The main technical challenge associated with DFL is that it requires being able to differentiate through the optimization problem, which is difficult due to discontinuous solutions and other challenges. Past work has largely gotten around this issue by handcrafting task-specific surrogates for the original optimization problem that provide informative gradients when differentiated through. However, the need to handcraft surrogates for each new task limits the usability of DFL. In addition, there are often no guarantees about the convexity of the resulting surrogates and, as a result, training a predictive model using them can lead to inferior local optima. In this paper, we do away with surrogates altogether and instead learn loss functions that capture task-specific information. To the best of our knowledge, ours is the first approach that entirely replaces the optimization component of decision-focused learning with a loss that is automatically learned. Our approach (a) only requires access to a black-box oracle that can solve the optimization problem and is thus generalizable, and (b) can be convex by construction and so can be easily optimized over. We evaluate our approach on three resource allocation problems from the literature and find that our approach outperforms learning without taking into account task structure in all three domains, and even handcrafted surrogates from the literature.
    Stochastic optimization on matrices and a graphon McKean-Vlasov limit. (arXiv:2210.00422v2 [math.PR] UPDATED)
    We consider stochastic gradient descents on the space of large symmetric matrices of suitable functions that are invariant under permuting the rows and columns using the same permutation. We establish deterministic limits of these random curves as the dimensions of the matrices go to infinity while the entries remain bounded. Under a "small noise" assumption the limit is shown to be the gradient flow of functions on graphons whose existence was established in arXiv:2111.09459. We also consider limits of stochastic gradient descents with added properly scaled reflected Brownian noise. The limiting curve of graphons is characterized by a family of stochastic differential equations with reflections and can be thought of as an extension of the classical McKean-Vlasov limit for interacting diffusions. The proofs introduce a family of infinite-dimensional exchangeable arrays of reflected diffusions and a novel notion of propagation of chaos for large matrices of interacting diffusions.
    Optimal degree of smoothness to exploit in nonparametric regressions. (arXiv:2112.03626v5 [math.ST] UPDATED)
    When the unknown regression function of a single variable is known to have derivatives up to the $(\gamma+1)$th order bounded in absolute values by a common constant everywhere or a.e., the classical minimax optimal rate of the mean integrated squared error (MISE) $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ leads one to conclude that, as $\gamma$ gets larger, the rate gets closer to $\frac{1}{n}$. This paper shows that: (i) if $n\leq\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is roughly $\frac{\log n}{n}$ and the optimal degree of smoothness to exploit is roughly $\left\lceil \frac{\log n}{2}\right\rceil -2$; (ii) if $n>\left(\gamma+1\right)^{2\gamma+3}$, the minimax optimal MISE rate is $\left(\frac{1}{n}\right)^{\frac{2\gamma+2}{2\gamma+3}}$ and the optimal degree of smoothness to exploit is $\gamma+1$. The building blocks of our minimax optimality results are a set of metric entropy bounds we develop in this paper for smooth function classes. Some of our bounds are original, and some of them improve and/or generalize the ones in the literature. Our metric entropy bounds allow us to explore the minimax optimal MISE rates associated with some commonly seen smoothness classes and also several non-standard smoothness classes, and can also be of independent interest even if one does not care about the nonparametric regressions.
    Two-layer neural networks with values in a Banach space. (arXiv:2105.02095v5 [cs.LG] UPDATED)
    We study two-layer neural networks whose domain and range are Banach spaces with separable preduals. In addition, we assume that the image space is equipped with a partial order, i.e. it is a Riesz space. As the nonlinearity we choose the lattice operation of taking the positive part; in case of $\mathbb R^d$-valued neural networks this corresponds to the ReLU activation function. We prove inverse and direct approximation theorems with Monte-Carlo rates for a certain class of functions, extending existing results for the finite-dimensional case. In the second part of the paper, we study, from the regularisation theory viewpoint, the problem of finding optimal representations of such functions via signed measures on a latent space from a finite number of noisy observations. We discuss regularity conditions known as source conditions and obtain convergence rates in a Bregman distance for the representing measure in the regime when both the noise level goes to zero and the number of samples goes to infinity at appropriate rates.
    Finding Second-Order Stationary Points in Nonconvex-Strongly-Concave Minimax Optimization. (arXiv:2110.04814v3 [math.OC] UPDATED)
    We study the smooth minimax optimization problem $\min_{\bf x}\max_{\bf y} f({\bf x},{\bf y})$, where $f$ is $\ell$-smooth, strongly-concave in ${\bf y}$ but possibly nonconvex in ${\bf x}$. Most existing works focus on finding the first-order stationary points of the function $f({\bf x},{\bf y})$ or its primal function $P({\bf x})\triangleq \max_{\bf y} f({\bf x},{\bf y})$, but few consider achieving second-order stationary points. In this paper, we propose a novel approach for minimax optimization, called Minimax Cubic Newton (MCN), which could find an $\big(\varepsilon,\kappa^{1.5}\sqrt{\rho\varepsilon}\,\big)$-second-order stationary point of $P({\bf x})$ with calling ${\mathcal O}\big(\kappa^{1.5}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of second-order oracles and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ times of first-order oracles, where $\kappa$ is the condition number and $\rho$ is the Lipschitz continuous constant for the Hessian of $f({\bf x},{\bf y})$. In addition, we propose an inexact variant of MCN for high-dimensional problems to avoid calling expensive second-order oracles. Instead, our method solves the cubic sub-problem inexactly via gradient descent and matrix Chebyshev expansion. This strategy still obtains the desired approximate second-order stationary point with high probability but only requires $\tilde{\mathcal O}\big(\kappa^{1.5}\ell\varepsilon^{-2}\big)$ Hessian-vector oracle calls and $\tilde{\mathcal O}\big(\kappa^{2}\sqrt{\rho}\varepsilon^{-1.5}\big)$ first-order oracle calls. To the best of our knowledge, this is the first work that considers the non-asymptotic convergence behavior of finding second-order stationary points for minimax problems without the convex-concave assumptions.
    Optimistic No-regret Algorithms for Discrete Caching. (arXiv:2208.06414v2 [cs.LG] UPDATED)
    We take a systematic look at the problem of storing whole files in a cache with limited capacity in the context of optimistic learning, where the caching policy has access to a prediction oracle (provided by, e.g., a neural network). The successive file requests are assumed to be generated by an adversary, and no assumption is made on the accuracy of the oracle. In this setting, we provide a universal lower bound for prediction-assisted online caching and proceed to design a suite of policies with a range of performance-complexity trade-offs. All proposed policies offer sublinear regret bounds commensurate with the accuracy of the oracle. Our results substantially improve upon all recently-proposed online caching policies, which, being unable to exploit the oracle predictions, offer only $O(\sqrt{T})$ regret. In this pursuit, we design, to the best of our knowledge, the first comprehensive optimistic Follow-the-Perturbed-Leader policy, which generalizes beyond the caching problem. We also study the problem of caching files with different sizes and the bipartite network caching problem. Finally, we evaluate the efficacy of the proposed policies through extensive numerical experiments using real-world traces.
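    The optimistic Follow-the-Perturbed-Leader idea can be demonstrated in a few lines: add the oracle's predicted next request to the cumulative counts, perturb, and cache the top-C files. The request distribution, oracle accuracy, and perturbation scaling below are illustrative assumptions, not the paper's tuned policy.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, T = 50, 5, 5000              # catalog size, cache size, horizon
counts = np.zeros(N)               # cumulative request counts
hits = 0

for t in range(T):
    req = rng.zipf(1.3) % N        # adversary stand-in: skewed requests
    # Oracle stand-in: a (sometimes wrong) guess of the next request.
    predicted = req if rng.random() < 0.7 else rng.integers(N)
    optimistic = counts.copy()
    optimistic[predicted] += 1.0   # credit the predicted next request
    # FTPL: perturb the optimistic counts, then cache the top-C files.
    score = optimistic + np.sqrt(t + 1) * rng.normal(size=N)
    cache = np.argpartition(score, -C)[-C:]
    hits += req in cache
    counts[req] += 1.0

print(f"hit rate: {hits / T:.3f}")
```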
    Signed Latent Factors for Spamming Activity Detection. (arXiv:2209.13814v1 [cs.IR] CROSS LISTED)
    Due to the increasing trend of performing spamming activities (e.g., Web spam, deceptive reviews, fake followers, etc.) on various online platforms to gain undeserved benefits, spam detection has emerged as a hot research issue. Previous attempts to combat spam mainly employ features related to metadata, user behaviors, or relational ties. These works have made considerable progress in understanding and filtering spamming campaigns. However, this problem remains far from fully solved. Almost all the proposed features focus on a limited number of observed attributes or explainable phenomena, making it difficult for existing methods to achieve further improvement. To broaden the vision of solving the spam problem and address the long-standing challenges (class imbalance and graph incompleteness) in the spam detection area, we propose a new approach that utilizes signed latent factors to filter fraudulent activities. The spam-contaminated relational datasets of multiple online applications in this scenario are interpreted by a unified signed network. Two competitive and highly dissimilar latent factor mining (LFM) models are designed, based on multi-relational likelihood estimation (LFM-MRLE) and signed pairwise ranking (LFM-SPR), respectively. We then explore how to apply the mined latent factors to spam detection tasks. Experiments on real-world datasets of different kinds of Web applications (social media and Web forums) indicate that LFM models outperform state-of-the-art baselines in detecting spamming activities. By specifically manipulating the experimental data, the effectiveness of our methods in dealing with the incompleteness and imbalance challenges is validated.
    ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention. (arXiv:2211.05109v1 [cs.CV])
    Vision Transformer (ViT) has emerged as a competitive alternative to convolutional neural networks for various computer vision applications. Specifically, ViT multi-head attention layers make it possible to embed information globally across the overall image. Nevertheless, computing and storing such attention matrices incurs a quadratic cost dependency on the number of patches, limiting its achievable efficiency and scalability and prohibiting more extensive real-world ViT applications on resource-constrained devices. Sparse attention has been shown to be a promising direction for improving hardware acceleration efficiency for NLP models. However, a systematic counterpart approach is still missing for accelerating ViT models. To close the above gap, we propose a first-of-its-kind algorithm-hardware codesigned framework, dubbed ViTALiTy, for boosting the inference efficiency of ViTs. Unlike sparsity-based Transformer accelerators for NLP, ViTALiTy unifies both low-rank and sparse components of the attention in ViTs. At the algorithm level, we approximate the dot-product softmax operation via first-order Taylor attention with row-mean centering as the low-rank component to linearize the cost of attention blocks and further boost the accuracy by incorporating a sparsity-based regularization. At the hardware level, we develop a dedicated accelerator to better leverage the resulting workload and pipeline from ViTALiTy's linear Taylor attention which requires the execution of only the low-rank component, to further boost the hardware efficiency. Extensive experiments and ablation studies validate that ViTALiTy offers boosted end-to-end efficiency (e.g., $3\times$ faster and $3\times$ energy-efficient) under comparable accuracy, with respect to the state-of-the-art solution.
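    The algorithmic core is easy to sketch: replacing exp(x) with its first-order Taylor expansion 1 + x lets attention be computed from two global summaries (K^T V and the column sum of K) instead of an n x n matrix. The numpy sketch below illustrates this linearization only; it omits the paper's row-mean centering, sparse component, and hardware pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 512, 64
Q = rng.normal(size=(n, d)) / np.sqrt(d)
K = rng.normal(size=(n, d)) / np.sqrt(d)
V = rng.normal(size=(n, d))

def softmax_attention(Q, K, V):
    A = np.exp(Q @ K.T / np.sqrt(d))          # O(n^2) time and memory
    return (A / A.sum(axis=1, keepdims=True)) @ V

def taylor_linear_attention(Q, K, V):
    """exp(x) ~= 1 + x turns each attention weight into 1 + q.k/sqrt(d),
    so the output needs only two global summaries and the n x n attention
    matrix is never materialized."""
    kv = K.T @ V                               # (d, d) summary, O(n d^2)
    k_sum = K.sum(axis=0)                      # (d,)
    num = V.sum(axis=0)[None, :] + (Q @ kv) / np.sqrt(d)
    den = n + (Q @ k_sum) / np.sqrt(d)         # positive here since n >> |q.k|
    return num / den[:, None]

ref = softmax_attention(Q, K, V)
lin = taylor_linear_attention(Q, K, V)
print("mean |softmax - taylor|:", np.abs(ref - lin).mean())
```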
    Teaching Yourself: Graph Self-Distillation on Neighborhood for Node Classification. (arXiv:2210.02097v4 [cs.LG] UPDATED)
    Recent years have witnessed great success in handling graph-related tasks with Graph Neural Networks (GNNs). Despite their great academic success, Multi-Layer Perceptrons (MLPs) remain the primary workhorse for practical industrial applications. One reason for this academic-industrial gap is the neighborhood-fetching latency incurred by data dependency in GNNs, which makes them hard to deploy for latency-sensitive applications that require fast inference. Conversely, without involving any feature aggregation, MLPs have no data dependency and infer much faster than GNNs, but their performance is less competitive. Motivated by these complementary strengths and weaknesses, we propose a Graph Self-Distillation on Neighborhood (GSDN) framework to reduce the gap between GNNs and MLPs. Specifically, the GSDN framework is based purely on MLPs, where structural information is only implicitly used as a prior to guide knowledge self-distillation between the neighborhood and the target, substituting the explicit neighborhood information propagation of GNNs. As a result, GSDN enjoys the benefits of graph topology-awareness in training but has no data dependency in inference. Extensive experiments have shown that the performance of vanilla MLPs can be greatly improved with self-distillation; e.g., GSDN improves over stand-alone MLPs by 15.54\% on average and outperforms state-of-the-art GNNs on six datasets. Regarding inference speed, GSDN infers 75X-89X faster than existing GNNs and 16X-25X faster than other inference acceleration methods.
    Discover, Explanation, Improvement: Automatic Slice Detection Framework for Natural Language Processing. (arXiv:2211.04476v1 [cs.CL])
    Current natural language processing (NLP) models such as BERT and RoBERTa have achieved high overall performance, but they often make systematic errors due to bias or features that are difficult to learn. Thus, research on slice detection models (SDMs), which automatically identify underperforming groups of datapoints, has gradually attracted more attention; such work aims both at understanding model behaviors and at providing insights for future model training and design. However, there is little systematic research on SDMs or quantitative evaluation of their assessment of NLP models. Our paper fills this gap by proposing the "Discover, Explanation, Improvement" framework, which discovers coherent and underperforming groups of datapoints and unites the datapoints of each slice under human-understandable concepts; it also provides comprehensive evaluation tasks and corresponding quantitative metrics, enabling convenient comparison for future work. Results show that our framework can accurately select error-prone datapoints with informative semantic features that summarize error patterns; based on these, it directly boosts model performance by an average of 2.85 points across multiple datasets, without tuning any parameters of the trained models.
    A Note on Task-Aware Loss via Reweighing Prediction Loss by Decision-Regret. (arXiv:2211.05116v1 [cs.LG])
    In this short technical note we propose a baseline for decision-aware learning for contextual linear optimization, which solves stochastic linear optimization when cost coefficients can be predicted based on context information. We propose a decision-aware version of predict-then-optimize. We reweigh the prediction error by the decision regret incurred by an (unweighted) pilot estimator of costs to obtain a decision-aware predictor, then optimize with cost predictions from the decision-aware predictor. This method can be motivated as a finite-difference, iterate-independent approximation of the gradients of previously proposed end-to-end learning algorithms; it is also consistent with previously suggested intuition for end-to-end learning. This baseline is computationally easy to implement with readily available reweighted prediction oracles and linear optimization, and can be implemented with convex optimization so long as the prediction error minimization is convex. Empirically, we demonstrate that this approach can lead to improvements over a "predict-then-optimize" framework for settings with misspecified models, and is competitive with other end-to-end approaches. Therefore, due to its simplicity and ease of use, we suggest it as a simple baseline for end-to-end and decision-aware learning.
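    The three-step recipe (pilot fit, regret reweighing, re-fit) is simple enough to show end to end on a toy contextual selection problem, where the "optimization" is picking the k cheapest of m items. All problem sizes and the weight floor below are illustrative assumptions; on this well-specified toy both predictors do well, whereas the note's gains appear under misspecification.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, k = 400, 10, 5, 3     # samples, items per decision, features, picks

Z = rng.normal(size=(n, m, p))                    # context for each item
w_true = rng.normal(size=p)
C = Z @ w_true + 0.5 * rng.normal(size=(n, m))    # realized costs

def decide(costs):
    """Toy linear optimization: select the k cheapest of the m items."""
    x = np.zeros_like(costs)
    x[np.argsort(costs)[:k]] = 1.0
    return x

def fit_wls(weights):
    """Weighted least squares for cost prediction, one weight per decision."""
    Zf, cf = Z.reshape(-1, p), C.reshape(-1)
    wf = np.repeat(weights, m)
    return np.linalg.solve(Zf.T @ (wf[:, None] * Zf), Zf.T @ (wf * cf))

def mean_regret(w_hat):
    return np.mean([C[i] @ decide(Z[i] @ w_hat) - C[i] @ decide(C[i])
                    for i in range(n)])

# Step 1: unweighted pilot predictor.
w_pilot = fit_wls(np.ones(n))
# Step 2: reweigh each decision instance by the pilot's decision regret.
regret = np.array([C[i] @ decide(Z[i] @ w_pilot) - C[i] @ decide(C[i])
                   for i in range(n)])
w_aware = fit_wls(regret + 1e-3)   # small floor keeps the fit well-posed
# Step 3: optimize with the decision-aware cost predictions.
print("pilot regret:", mean_regret(w_pilot), "aware:", mean_regret(w_aware))
```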
    AI-Bind: Improving Binding Predictions for Novel Protein Targets and Ligands. (arXiv:2112.13168v5 [q-bio.QM] UPDATED)
    Identifying novel drug-target interactions (DTI) is a critical and rate-limiting step in drug discovery. While deep learning models have been proposed to accelerate the identification process, we show that state-of-the-art models fail to generalize to novel (i.e., never-before-seen) structures. We first unveil the mechanisms responsible for this shortcoming, demonstrating how models rely on shortcuts that leverage the topology of the protein-ligand bipartite network, rather than learning the node features. Then, we introduce AI-Bind, a pipeline that combines network-based sampling strategies with unsupervised pre-training, allowing us to limit the annotation imbalance and improve binding predictions for novel proteins and ligands. We illustrate the value of AI-Bind by predicting drugs and natural compounds with binding affinity to SARS-CoV-2 viral proteins and the associated human proteins. We also validate these predictions via docking simulations and comparison with recent experimental evidence, and advance the interpretation of machine learning predictions of protein-ligand binding by identifying potential active binding sites on the amino acid sequence. Overall, AI-Bind offers a powerful high-throughput approach to identify drug-target combinations, with the potential of becoming a powerful tool in drug discovery.
    A Theoretical Understanding of Neural Network Compression from Sparse Linear Approximation. (arXiv:2206.05604v2 [stat.ML] UPDATED)
    The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance. As a result, computation and memory costs in resource-limited applications may be significantly reduced by dropping redundant weights, neurons, or layers. There have been many model compression algorithms proposed that provide impressive empirical success. However, a theoretical understanding of model compression is still limited. One problem is understanding if a network is more compressible than another of the same structure. Another problem is quantifying how much one can prune a network with theoretically guaranteed accuracy degradation. In this work, we propose to use the sparsity-sensitive $\ell_q$-norm ($0<q<1$) to characterize compressibility and provide a relationship between soft sparsity of the weights in the network and the degree of compression with a controlled accuracy degradation bound. We also develop adaptive algorithms for pruning each neuron in the network informed by our theory. Numerical studies demonstrate the promising performance of the proposed methods compared with standard pruning algorithms.
    StructDiffusion: Object-Centric Diffusion for Semantic Rearrangement of Novel Objects. (arXiv:2211.04604v1 [cs.RO])
    Robots operating in human environments must be able to rearrange objects into semantically-meaningful configurations, even if these objects are previously unseen. In this work, we focus on the problem of building physically-valid structures without step-by-step instructions. We propose StructDiffusion, which combines a diffusion model and an object-centric transformer to construct structures out of a single RGB-D image based on high-level language goals, such as "set the table." Our method shows how diffusion models can be used for complex multi-step 3D planning tasks. StructDiffusion improves success rate on assembling physically-valid structures out of unseen objects by on average 16% over an existing multi-modal transformer model, while allowing us to use one multi-task model to produce a wider range of different structures. We show experiments on held-out objects in both simulation and on real-world rearrangement tasks. For videos and additional results, check out our website: this http URL
    The Best of Both Worlds: a Framework for Combining Degradation Prediction with High Performance Super-Resolution Networks. (arXiv:2211.05018v1 [cs.CV])
    To date, the best-performing blind super-resolution (SR) techniques follow one of two paradigms: A) generate and train a standard SR network on synthetic low-resolution/high-resolution (LR-HR) pairs or B) attempt to predict the degradations an LR image has suffered and use these to inform a customised SR network. Despite significant progress, subscribers to the former miss out on useful degradation information that could be used to improve the SR process. On the other hand, followers of the latter rely on weaker SR networks, which are significantly outperformed by the latest architectural advancements. In this work, we present a framework for combining any blind SR prediction mechanism with any deep SR network, using a metadata insertion block to insert prediction vectors into SR network feature maps. Through comprehensive testing, we prove that state-of-the-art contrastive and iterative prediction schemes can be successfully combined with high-performance SR networks such as RCAN and HAN within our framework. We show that our hybrid models consistently achieve stronger SR performance than both their non-blind and blind counterparts. Furthermore, we demonstrate our framework's robustness by predicting degradations and super-resolving images from a complex pipeline of blurring, noise and compression.
    Hierarchical Bayesian Modelling for Knowledge Transfer Across Engineering Fleets via Multitask Learning. (arXiv:2204.12404v3 [stat.ML] UPDATED)
    A population-level analysis is proposed to address data sparsity when building predictive models for engineering infrastructure. Utilising an interpretable hierarchical Bayesian approach and operational fleet data, domain expertise is naturally encoded (and appropriately shared) between different sub-groups, representing (i) use-type, (ii) component, or (iii) operating condition. Specifically, domain expertise is exploited to constrain the model via assumptions (and prior distributions) allowing the methodology to automatically share information between similar assets, improving the survival analysis of a truck fleet and power prediction in a wind farm. In each asset management example, a set of correlated functions is learnt over the fleet, in a combined inference, to learn a population model. Parameter estimation is improved when sub-fleets share correlated information at different levels of the hierarchy. In turn, groups with incomplete data automatically borrow statistical strength from those that are data-rich. The statistical correlations enable knowledge transfer via Bayesian transfer learning, and the correlations can be inspected to inform which assets share information for which effect (i.e. parameter). Both case studies demonstrate the wide applicability to practical infrastructure monitoring, since the approach is naturally adapted between interpretable fleet models of different in situ examples.
    Almost Tight Error Bounds on Differentially Private Continual Counting. (arXiv:2211.05006v1 [cs.LG])
    The first large-scale deployment of private federated learning uses differentially private counting in the continual release model as a subroutine (Google AI blog titled "Federated Learning with Formal Differential Privacy Guarantees"). In this setting, a concrete bound on the error is highly relevant for reducing the privacy parameter. The standard mechanism for continual counting is the binary mechanism. We present a novel mechanism and show that its mean squared error is both asymptotically optimal and a factor 10 smaller than the error of the binary mechanism. We also show that the constants in our analysis are almost tight by giving non-asymptotic lower and upper bounds that differ only in the constants of lower-order terms. Our algorithm is a matrix mechanism for the counting matrix and takes constant time per release. We also use our explicit factorization of the counting matrix to give an upper bound on the excess risk of the private learning algorithm of Denisov et al. (NeurIPS 2022). Our lower bound for any continual counting mechanism is the first tight lower bound on continual counting under approximate differential privacy. It is achieved using a new lower bound on a certain factorization norm, denoted by $\gamma_F(\cdot)$, in terms of the singular values of the matrix. In particular, we show that for any complex matrix, $A \in \mathbb{C}^{m \times n}$, \[ \gamma_F(A) \geq \frac{1}{\sqrt{m}}\|A\|_1, \] where $\|\cdot\|_1$ denotes the Schatten-1 norm. We believe this technique will be useful in proving lower bounds for a larger class of linear queries. To illustrate the power of this technique, we show the first lower bound on the mean squared error for answering parity queries.
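    For reference, the binary mechanism that the new matrix mechanism is compared against can be sketched as follows: each released prefix sum is assembled from O(log t) noisy dyadic partial sums. This simplified version re-draws node noise on update and leaves sigma uncalibrated to any privacy budget, so it illustrates the error structure only.

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_mechanism(stream, sigma=1.0):
    """Continual counting via the binary-tree mechanism: maintain noisy
    dyadic partial sums, so each prefix sum uses O(log t) noise terms."""
    T = len(stream)
    levels = int(np.ceil(np.log2(T + 1))) + 1
    partial = np.zeros(levels)     # exact dyadic block sums currently open
    noisy = np.zeros(levels)       # their noised counterparts
    out = []
    for t, x in enumerate(stream, start=1):
        partial[0] += x
        noisy[0] = partial[0] + rng.normal(scale=sigma)
        i = 0
        while t % (2 ** (i + 1)) == 0:     # a dyadic block closes: merge up
            partial[i + 1] += partial[i]
            noisy[i + 1] = partial[i + 1] + rng.normal(scale=sigma)
            partial[i] = noisy[i] = 0.0
            i += 1
        # Open blocks mirror the binary representation of t; their noisy
        # values sum to the prefix count plus O(log t) noise terms.
        out.append(noisy.sum())
    return np.array(out)

stream = rng.integers(0, 2, size=1000).astype(float)
est = binary_mechanism(stream)
print("max |error| over all prefixes:", np.abs(est - np.cumsum(stream)).max())
```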
    Deep Reinforcement Learning for Cryptocurrency Trading: Practical Approach to Address Backtest Overfitting. (arXiv:2209.05559v4 [q-fin.ST] UPDATED)
    Designing profitable and reliable trading strategies is challenging in the highly volatile cryptocurrency market. Existing works applied deep reinforcement learning methods and optimistically reported increased profits in backtesting, which may suffer from the \textit{false positive} issue due to overfitting. In this paper, we propose a practical approach to address backtest overfitting for cryptocurrency trading using deep reinforcement learning. First, we formulate the detection of backtest overfitting as a hypothesis test. Then, we train the DRL agents, estimate the probability of overfitting, and reject the overfitted agents, increasing the chance of good trading performance. Finally, on 10 cryptocurrencies over a testing period from 05/01/2022 to 06/27/2022 (during which the crypto market \textbf{crashed two times}), we show that the less overfitted deep reinforcement learning agents have a higher return than more overfitted agents, an equal-weight strategy, and the S\&P DBM Index (market benchmark), offering confidence in possible deployment to a real market.
    Deep W-Networks: Solving Multi-Objective Optimisation Problems With Deep Reinforcement Learning. (arXiv:2211.04813v1 [cs.LG])
    In this paper, we build on advances introduced by the Deep Q-Networks (DQN) approach to extend the multi-objective tabular Reinforcement Learning (RL) algorithm W-learning to large state spaces. The W-learning algorithm can naturally solve the competition between multiple single policies in multi-objective environments. However, the tabular version does not scale well to environments with large state spaces. To address this issue, we replace the underlying Q-tables with DQNs and propose the addition of W-networks as a replacement for the tabular weight (W) representations. We evaluate the resulting Deep W-Networks (DWN) approach on two widely accepted multi-objective RL benchmarks: deep sea treasure and multi-objective mountain car. We show that DWN solves the competition between multiple policies while outperforming the baseline in the form of a DQN solution. Additionally, we demonstrate that the proposed algorithm can find the Pareto front in both tested environments.
    Spiking Neural Network Decision Feedback Equalization. (arXiv:2211.04756v1 [eess.SP])
    In the past years, artificial neural networks (ANNs) have become the de facto standard for solving tasks in communications engineering that are difficult to solve with traditional methods. In parallel, the artificial intelligence community is driving its research toward biology-inspired, brain-like spiking neural networks (SNNs), which promise extremely energy-efficient computing. In this paper, we investigate the use of SNNs in the context of channel equalization for ultra-low complexity receivers. We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE). For the conversion of real-world data into spike signals, we introduce a novel ternary encoding and compare it with traditional log-scale encoding. We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels. We highlight that the conversion of the channel output to spikes is the main source of a small performance penalty. The proposed SNN with a decision feedback structure enables the path to competitive energy-efficient transceivers.
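    One plausible reading of such a ternary scheme, assumed here for illustration rather than taken from the paper, maps each real-valued sample to a spike in {-1, 0, +1} via sign and a dead-zone threshold:

```python
import numpy as np

def ternary_encode(samples, theta):
    """Hypothetical ternary encoder: spike polarity encodes the sign of the
    sample, and magnitudes below theta emit no spike at all."""
    spikes = np.sign(samples)
    spikes[np.abs(samples) < theta] = 0
    return spikes.astype(int)

rng = np.random.default_rng(0)
# Toy channel output: BPSK symbols plus additive noise.
symbols = rng.choice([-1.0, 1.0], size=10)
received = symbols + 0.4 * rng.normal(size=10)
print(received.round(2))
print(ternary_encode(received, theta=0.3))
```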
    Predicting CO$_2$ Absorption in Ionic Liquids with Molecular Descriptors and Explainable Graph Neural Networks. (arXiv:2210.01120v2 [physics.chem-ph] UPDATED)
    Ionic Liquids (ILs) provide a promising solution for CO$_2$ capture and storage to mitigate global warming. However, identifying and designing high-capacity ILs within the giant chemical space requires expensive and exhaustive simulations and experiments. Machine learning (ML) can accelerate the search for desirable ionic molecules through accurate and efficient property predictions in a data-driven manner. However, existing descriptors and ML models for ionic molecules adapt inefficiently to molecular graph structure. Moreover, few works have investigated the explainability of ML models to help understand the learned features that can guide the design of efficient ionic molecules. In this work, we develop both fingerprint-based ML models and Graph Neural Networks (GNNs) to predict CO$_2$ absorption in ILs. Fingerprints operate on the graph structure at the feature extraction stage, while GNNs directly handle the molecular structure in both the feature extraction and model prediction stages. We show that our method outperforms previous ML models, reaching high accuracy (MAE of 0.0137, $R^2$ of 0.9884). Furthermore, we take advantage of the GNN feature representation and develop a substructure-based explanation method that provides insight into how each chemical fragment within IL molecules contributes to the CO$_2$ absorption prediction of ML models. We also show that our explanation results agree with some ground truth from the theoretical reaction mechanism of CO$_2$ absorption in ILs, which can advise on the design of novel and efficient functional ILs in the future.
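    The fingerprint-based branch follows a standard pattern: hash local graph substructures into a fixed-length bit vector and fit a regressor on top. The sketch below uses RDKit Morgan fingerprints and a random forest on a hypothetical three-molecule dataset with made-up targets; it stands in for, rather than reproduces, the paper's models and data.

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestRegressor

def fingerprint(smiles, n_bits=2048):
    """Morgan (circular) fingerprint: a fixed-length bit vector summarizing
    local graph substructures around each atom."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

# Hypothetical mini-dataset: cation/anion pairs as SMILES with made-up
# absorption targets, standing in for the paper's IL dataset.
smiles = ["CCn1cc[n+](C)c1.F[B-](F)(F)F",     # EMIM BF4
          "CCCCn1cc[n+](C)c1.F[B-](F)(F)F",   # BMIM BF4
          "CCn1cc[n+](C)c1.[Cl-]"]            # EMIM Cl
targets = [0.30, 0.35, 0.25]

X = np.stack([fingerprint(s) for s in smiles])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, targets)
print(model.predict(X[:1]))
```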
    Smoothness Analysis for Probabilistic Programs with Application to Optimised Variational Inference. (arXiv:2208.10530v2 [cs.PL] UPDATED)
    We present a static analysis for discovering differentiable or more generally smooth parts of a given probabilistic program, and show how the analysis can be used to improve the pathwise gradient estimator, one of the most popular methods for posterior inference and model learning. Our improvement increases the scope of the estimator from differentiable models to non-differentiable ones without requiring manual intervention of the user; the improved estimator automatically identifies differentiable parts of a given probabilistic program using our static analysis, and applies the pathwise gradient estimator to the identified parts while using a more general but less efficient estimator, called score estimator, for the rest of the program. Our analysis has a surprisingly subtle soundness argument, partly due to the misbehaviours of some target smoothness properties when viewed from the perspective of program analysis designers. For instance, some smoothness properties are not preserved by function composition, and this makes it difficult to analyse sequential composition soundly without heavily sacrificing precision. We formulate five assumptions on a target smoothness property, prove the soundness of our analysis under those assumptions, and show that our leading examples satisfy these assumptions. We also show that by using information from our analysis instantiated for differentiability, our improved gradient estimator satisfies an important differentiability requirement and thus computes the correct estimate on average (i.e., returns an unbiased estimate) under a regularity condition. Our experiments with representative probabilistic programs in the Pyro language show that our static analysis is capable of identifying smooth parts of those programs accurately, and making our improved pathwise gradient estimator exploit all the opportunities for high performance in those programs.
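    The pathwise/score distinction that the analysis routes between is illustrated below on a one-dimensional Gaussian expectation, where both estimators target the same derivative but the pathwise one needs a differentiable integrand and typically has far lower variance (a textbook example, independent of the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 1.0, 100_000

f = lambda z: z ** 2                       # smooth test integrand
# Objective: d/dmu E_{z ~ N(mu, sigma^2)}[f(z)]  (true value: 2*mu = 1.0)

eps = rng.normal(size=n)
z = mu + sigma * eps

# Pathwise (reparameterization) estimator: differentiate through the
# sampling path z = mu + sigma * eps. Requires f to be differentiable.
df = 2.0 * z                               # f'(z), and dz/dmu = 1
pathwise = df.mean()

# Score-function estimator: weight f(z) by d/dmu log N(z; mu, sigma^2).
# Works for non-differentiable f, but with much higher variance.
score_terms = f(z) * (z - mu) / sigma ** 2
score = score_terms.mean()

print(f"pathwise: {pathwise:.4f}  score: {score:.4f}  (true: {2 * mu})")
print("stderr  pathwise:", df.std() / np.sqrt(n),
      " score:", score_terms.std() / np.sqrt(n))
```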
    Imbalanced Data Classification via Generative Adversarial Network with Application to Anomaly Detection in Additive Manufacturing Process. (arXiv:2210.17274v2 [cs.LG] UPDATED)
    Supervised classification methods have been widely used for quality assurance in advanced manufacturing processes, such as anomaly (defect) detection in additive manufacturing (AM). However, since abnormal states (with defects) occur much less frequently than normal ones (without defects), the number of sensor data samples collected from normal states far outweighs that from abnormal states. This imbalance in the training data deteriorates the performance of classification models at detecting abnormal states. It is therefore beneficial to generate effective artificial samples of the abnormal states to obtain a more balanced training set. To achieve this goal, this paper proposes a novel data augmentation method based on a generative adversarial network (GAN) using additive manufacturing process image sensor data. The novelty of our approach is that a standard GAN and a classifier are jointly optimized, with techniques to stabilize the GAN's learning process. The diverse, high-quality generated samples provide balanced training data to the classifier, and the iterative optimization between GAN and classifier yields a high-performance classifier. The effectiveness of the proposed method is validated on both open-source data and real-world case studies in polymer and metal AM processes.
    Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning. (arXiv:2211.04583v1 [cs.LG])
    Offline reinforcement-learning (RL) algorithms learn to make decisions using a given, fixed training dataset without the possibility of additional online data collection. This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment. However, this promise is also the source of the setting's main drawback: the restricted dataset induces subjective uncertainty, because the agent can encounter unfamiliar sequences of states and actions that the training data did not cover. Moreover, inherent system stochasticity further increases uncertainty and aggravates the offline RL problem, preventing the agent from learning an optimal policy. To mitigate the destructive effects of uncertainty, we need to balance the aspiration to take reward-maximizing actions against the risk incurred by incorrect ones. In financial economics, modern portfolio theory (MPT) is a method that risk-averse investors can use to construct diversified portfolios that maximize returns without unacceptable levels of risk. We integrate MPT into the agent's decision-making process to obtain a simple yet highly effective risk-aware planning algorithm for offline RL. Our algorithm systematically accounts for the \emph{estimated quality} of specific actions and their \emph{estimated risk} due to the uncertainty. We show that our approach can be coupled with the Transformer architecture to yield a state-of-the-art planner for offline RL tasks, maximizing the return while significantly reducing the variance.
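    For orientation, the textbook mean-variance objective behind MPT is shown below; how the paper maps estimated action values and their uncertainty onto the return vector and covariance is the authors' contribution and is not reproduced here.

```latex
% Markowitz mean-variance selection: weights w over assets (here, candidate
% actions), expected returns mu, covariance Sigma, risk-aversion lambda.
\max_{w}\; w^{\top}\mu \;-\; \lambda\, w^{\top}\Sigma\, w
\qquad \text{s.t.} \qquad \textstyle\sum_i w_i = 1,\quad w_i \ge 0
```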
    FedDef: Defense Against Gradient Leakage in Federated Learning-based Network Intrusion Detection Systems. (arXiv:2210.04052v2 [cs.CR] UPDATED)
    Deep learning (DL) methods have been widely applied to anomaly-based network intrusion detection systems (NIDS) to detect malicious traffic. To expand the usage scenarios of DL-based methods, the federated learning (FL) framework allows multiple users to train a global model while respecting individual data privacy. However, it has not yet been systematically evaluated how robust FL-based NIDSs are against existing privacy attacks under existing defenses. To address this issue, we propose two privacy evaluation metrics designed for FL-based NIDSs: (1) a privacy score that evaluates the similarity between the original and recovered traffic features using reconstruction attacks, and (2) an evasion rate against NIDSs using a Generative Adversarial Network-based adversarial attack with the reconstructed benign traffic. Our experiments show that existing defenses provide so little protection that the corresponding adversarial traffic can even evade the SOTA NIDS Kitsune. To defend against such attacks and build a more robust FL-based NIDS, we further propose FedDef, a novel optimization-based input perturbation defense strategy with a theoretical guarantee. It achieves both high utility, by minimizing the gradient distance, and strong privacy protection, by maximizing the input distance. We experimentally evaluate four existing defenses on four datasets and show that our defense outperforms all the baselines in terms of privacy protection, with up to 7 times higher privacy score, while keeping model accuracy loss within 3% under the optimal parameter combination.
    Implicit Graphon Neural Representation. (arXiv:2211.03329v2 [cs.LG] UPDATED)
    Graphons are general and powerful models for generating graphs of varying size. In this paper, we propose to directly model graphons using neural networks, obtaining Implicit Graphon Neural Representation (IGNR). Existing work on modeling and reconstructing graphons often approximates a target graphon by a fixed-resolution piecewise-constant representation. Our IGNR has the benefit that it can represent graphons up to arbitrary resolution, and it enables natural and efficient generation of arbitrary-sized graphs with desired structure once the model is learned. Furthermore, we allow the input graph data to be unaligned and of different sizes by leveraging the Gromov-Wasserstein distance. We first demonstrate the effectiveness of our model through its superior performance on a graphon learning task. We then propose an extension of IGNR that can be incorporated into an auto-encoder framework, and demonstrate its good performance in a more general graphon learning setting. We also show that our model is suitable for graph representation learning and graph generation.
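    The core idea lends itself to a compact sketch: a small network plays the role of the graphon W(u, v), and graphs of any size can be sampled from it. This is a minimal illustration under assumed architecture choices, not the authors' IGNR implementation (which also involves Gromov-Wasserstein training).

```python
# Minimal sketch of an implicit (neural) graphon: a network maps coordinates
# (u, v) in [0,1]^2 to an edge probability, so after training, graphs of any
# size can be sampled. Architecture and training are assumed, not IGNR's.
import torch
import torch.nn as nn

class ImplicitGraphon(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1), nn.Sigmoid(),   # edge probability in [0, 1]
        )

    def forward(self, u, v):
        return self.net(torch.stack([u, v], dim=-1)).squeeze(-1)

def sample_graph(model, n):
    """Draw n latent positions and sample Bernoulli edges from W(u, v)."""
    x = torch.rand(n)
    u, v = torch.meshgrid(x, x, indexing="ij")
    p = model(u.reshape(-1), v.reshape(-1)).reshape(n, n)
    p = (p + p.T) / 2                             # enforce symmetry
    upper = torch.bernoulli(torch.triu(p, diagonal=1))
    return upper + upper.T                        # symmetric adjacency matrix

adj = sample_graph(ImplicitGraphon(), n=50)       # any n, same trained model
```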
    Disentangling Aesthetic and Technical Effects for Video Quality Assessment of User Generated Content. (arXiv:2211.04894v1 [cs.CV])
    User-generated-content (UGC) videos have dominated the Internet in recent years. While many methods attempt to objectively assess the quality of these UGC videos, the mechanisms of human quality perception in the UGC-VQA problem remain largely unexplored. To better explain these mechanisms and learn more robust representations, we aim to disentangle the effects of aesthetic quality issues and technical quality issues arising from the complicated video generation processes in the UGC-VQA problem. To overcome the absence of separate supervision for each factor during disentanglement, we propose the Limited View Biased Supervisions (LVBS) scheme, in which two separate evaluators are trained with decomposed views specifically designed for each issue. Composed of an Aesthetic Quality Evaluator (AQE) and a Technical Quality Evaluator (TQE) under the LVBS scheme, the proposed Disentangled Objective Video Quality Evaluator (DOVER) reaches excellent performance (0.91 SRCC on KoNViD-1k, 0.89 SRCC on LSVQ, 0.88 SRCC on YouTube-UGC) in the UGC-VQA problem. More importantly, our blind subjective studies show that the separate evaluators in DOVER can effectively match human perception of the respective disentangled quality issues. Codes and demos are released at https://github.com/teowu/dover.  ( 2 min )
    RIGID: Robust Linear Regression with Missing Data. (arXiv:2205.13635v3 [cs.LG] UPDATED)
    We present a robust framework for performing linear regression with missing entries in the features. By considering an elliptical data distribution, and specifically a multivariate normal model, we can conditionally formulate a distribution for the missing entries and present a robust framework that minimizes the worst-case error caused by the uncertainty about the missing data. We show that the proposed formulation, which naturally takes into account the dependency between different variables, ultimately reduces to a convex program for which a customized and scalable solver can be delivered. In addition to a detailed analysis that delivers such a solver, we analyze the asymptotic behavior of the proposed framework and present technical discussions on estimating the required input parameters. We complement our analysis with experiments on synthetic, semi-synthetic, and real data, and show how the proposed formulation improves prediction accuracy and robustness and outperforms the competing techniques. More broadly, missing data is a common problem in machine learning datasets, and with the significant increase in the use of robust optimization techniques to train machine learning models, the proposed approach allows training models with incomplete data while minimizing the impact of the uncertainty associated with the unavailable data. The ideas developed in this paper can be generalized beyond linear models and elliptical data distributions.
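    The conditioning step that the framework builds on is standard multivariate-normal algebra; a minimal numpy sketch is below (the robust min-max program layered on top is the paper's contribution and is not shown).

```python
# Standard multivariate-normal conditioning used to model missing features:
# x[mis] | x[obs] is Gaussian with the mean/covariance computed below.
import numpy as np

def conditional_gaussian(mu, Sigma, obs_idx, mis_idx, x_obs):
    """Mean and covariance of x[mis_idx] given x[obs_idx] = x_obs."""
    S_oo = Sigma[np.ix_(obs_idx, obs_idx)]
    S_mo = Sigma[np.ix_(mis_idx, obs_idx)]
    S_mm = Sigma[np.ix_(mis_idx, mis_idx)]
    K = S_mo @ np.linalg.inv(S_oo)                # regression coefficients
    cond_mean = mu[mis_idx] + K @ (x_obs - mu[obs_idx])
    cond_cov = S_mm - K @ S_mo.T
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 2.0, 0.3],
                  [0.2, 0.3, 1.5]])
# Feature 1 is missing; features 0 and 2 are observed.
m, C = conditional_gaussian(mu, Sigma, [0, 2], [1], np.array([0.4, -0.9]))
```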
    Automated Learning: An Implementation of The A* Search Algorithm over The Random Base Functions. (arXiv:2211.05085v1 [physics.data-an])
    This letter describes an algorithm for finding a set of base functions that capture the leading behavior of a dataset. An implementation of the A* search finds these functions, while gradient descent optimizes the parameters of the functions at each search step. We show the resulting plots, comparing the extrapolation against unseen data.
    Curriculum generation using Autoencoder based continuous optimization. (arXiv:2106.08569v2 [cs.LG] UPDATED)
    Research in Curriculum Learning has shown that optimizing the sequence of the training data improves task performance. Recent works have focused on using complex reinforcement learning techniques to find the optimal data-ordering strategy that maximizes learning for a given network. In this paper, we present a simple yet efficient technique based on continuous optimization trained with an auto-encoding procedure. We call this new approach Training Sequence Optimization (TSO). With a standard encoder-decoder setup, we learn a continuous latent-space representation of the training strategy, and a predictor network is used on this representation to predict the accuracy of the strategy on a fixed network architecture. The performance predictor and encoder enable us to perform gradient-based optimization, gradually moving towards latent representations of training data orderings with potentially better accuracy. We show an empirical gain of 2 AP with our generated optimal curriculum strategy over the random strategy on the CIFAR-100 and CIFAR-10 datasets, with larger boosts than the existing state-of-the-art CL algorithms.
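    The gradient-based search in latent space can be sketched as follows, assuming encoder, predictor, and decoder are already-trained PyTorch modules; this illustrates the mechanism only, not the authors' TSO code.

```python
# Sketch of the latent-space search step (mechanism only, not the authors'
# TSO code); encoder, predictor, and decoder are assumed trained nn.Modules.
import torch

def improve_strategy(encoder, predictor, decoder, strategy, steps=50, lr=0.1):
    z = encoder(strategy).detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        (-predictor(z)).backward()     # ascend the predicted accuracy
        opt.step()
    return decoder(z)                  # decode the improved data ordering
```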
    Hyper-GST: Predict Metro Passenger Flow Incorporating GraphSAGE, Hypergraph, Social-meaningful Edge Weights and Temporal Exploitation. (arXiv:2211.04988v1 [cs.LG])
    Predicting metro passenger flow precisely is of great importance for dynamic traffic planning. Deep learning algorithms have been widely applied due to their robust performance in modelling non-linear systems. However, traditional deep learning algorithms completely discard the inherent graph structure within the metro system. Graph-based deep learning algorithms can utilise the graph structure but raise new challenges, such as how to determine the edge weights and the shallow receptive field caused by the over-smoothing issue. To address these challenges, this study proposes a model based on GraphSAGE with a learned edge-weighting scheme, in which the edge weights learner utilises socially meaningful features to generate edge weights. Hypergraph and temporal exploitation modules are also constructed as add-ons for better performance. A comparison study against other state-of-the-art graph neural networks shows that the proposed algorithm improves prediction performance.
    Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task. (arXiv:2211.05039v1 [cs.LG])
    We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. Further, we propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.
    Comparative analysis of machine learning methods for active flow control. (arXiv:2202.11664v3 [physics.flu-dyn] UPDATED)
    Machine learning frameworks such as Genetic Programming (GP) and Reinforcement Learning (RL) are gaining popularity in flow control. This work presents a comparative analysis of the two, benchmarking some of their most representative algorithms against global optimization techniques such as Bayesian Optimization (BO) and Lipschitz global optimization (LIPO). First, we review the general framework of the model-free control problem, bringing together all methods as black-box optimization problems. Then, we test the control algorithms on three test cases: (1) the stabilization of a nonlinear dynamical system featuring frequency cross-talk, (2) wave cancellation in a Burgers' flow, and (3) drag reduction in a cylinder wake flow. We present a comprehensive comparison to illustrate their differences in exploration versus exploitation and their balance between 'model capacity' in the control law definition and 'required complexity'. We believe that such a comparison paves the way toward the hybridization of the various methods, and we offer some perspective on their future development in the literature on flow control problems.  ( 2 min )
    Variants of SGD for Lipschitz Continuous Loss Functions in Low-Precision Environments. (arXiv:2211.04655v1 [math.OC])
    Motivated by neural network training in low-bit floating and fixed-point environments, this work studies the convergence of variants of SGD with computational error. Considering a general stochastic Lipschitz continuous loss function, a novel convergence result to a Clarke stationary point is presented assuming that only an approximation of its stochastic gradient can be computed as well as error in computing the SGD step itself. Different variants of SGD are then tested empirically in a variety of low-precision arithmetic environments, with improved test set accuracy achieved compared to SGD for two image recognition tasks.
    Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models. (arXiv:2211.05105v1 [cs.CV])
    Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed, Inappropriate Image Prompts (I2P), containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment.
    A framework for online, stabilizing reinforcement learning. (arXiv:2207.08730v8 [eess.SY] UPDATED)
    Online reinforcement learning is concerned with training an agent on the fly via dynamic interaction with the environment. Here, due to the specifics of the application, it is generally not possible to perform long pre-training, as is commonly done in offline, model-free approaches akin to dynamic programming. Such applications are found more often in industry than in purely digital fields such as cloud services, video games, or database management, where reinforcement learning has been demonstrating success. Online reinforcement learning, in contrast, is closer to classical control, which utilizes some model knowledge about the environment. Stability of the closed loop (agent plus environment) is a major challenge for such online approaches. In this paper, we tackle this problem through a special fusion of online reinforcement learning with elements of classical control, based on the Lyapunov theory of stability. The idea is to start the agent at once, without pre-training, and learn an approximately optimal policy under specially designed constraints that guarantee stability. The resulting approach was tested in an extensive experimental study with a mobile robot, using a nominal parking controller as a baseline. The suggested agent always parked the robot successfully while significantly improving the cost. While many approaches may be exploited for mobile robot control, the experiments show the promising potential of online reinforcement learning agents based on Lyapunov-like constraints. The presented methodology may be utilized in safety-critical, industrial applications where stability is necessary.
    Towards Global Crop Maps with Transfer Learning. (arXiv:2211.04755v1 [cs.CV])
    The continuous increase in global population and the impact of climate change on crop production are expected to affect the food sector significantly. In this context, there is a need for timely, large-scale and precise mapping of crops for evidence-based decision making. A key enabler in this direction is the new satellite missions that freely offer big remote sensing data of high spatio-temporal resolution and global coverage. During the previous decade, and because of this surge of big Earth observations, deep learning methods have dominated the remote sensing and crop mapping literature. Nevertheless, deep learning models require large amounts of annotated data, which are scarce and hard to acquire. To address this problem, transfer learning methods can exploit available annotations and enable crop mapping for other regions, crop types and years of inspection. In this work, we have developed and trained a deep learning model for paddy rice detection in South Korea using Sentinel-1 VH time-series. We then fine-tune the model for i) paddy rice detection in France and Spain and ii) barley detection in the Netherlands. Additionally, we propose a modification of the pre-trained weights in order to incorporate extra input features (Sentinel-1 VV). Our approach shows excellent performance when transferring to different areas for the same crop type, and rather promising results when transferring to a different area and crop type.
    Efficiently Scaling Transformer Inference. (arXiv:2211.05102v1 [cs.LG])
    We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapidly throughout application areas. We develop a simple analytical model for inference efficiency to select the best multi-dimensional partitioning techniques optimized for TPU v4 slices based on the application requirements. We combine these with a suite of low-level optimizations to achieve a new Pareto frontier on the latency and model FLOPS utilization (MFU) tradeoffs on 500B+ parameter models that outperforms the FasterTransformer suite of benchmarks. We further show that with appropriate partitioning, the lower memory requirements of multiquery attention (i.e., multiple query heads share a single key/value head) enable scaling up to 32x larger context lengths. Finally, we achieve a low-batch-size latency of 29ms per token during generation (using int8 weight quantization) and a 76% MFU during large-batch-size processing of input tokens, while supporting a long 2048-token context length on the PaLM 540B parameter model.
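    A back-of-the-envelope sketch of the multiquery-attention memory argument: sharing one key/value head across all query heads divides the KV cache by the head count. The dimensions below are illustrative assumptions, not the exact PaLM configuration.

```python
# Back-of-the-envelope KV-cache accounting behind the multiquery claim.
# Dimensions are illustrative assumptions, not the exact PaLM configuration.
n_layers, n_heads, d_head = 118, 48, 384
batch, seq_len, bytes_per_val = 1, 2048, 2          # bf16 activations

def kv_cache_bytes(kv_heads):
    # K and V tensors per layer, each of shape [batch, kv_heads, seq_len, d_head]
    return 2 * n_layers * batch * kv_heads * seq_len * d_head * bytes_per_val

mha = kv_cache_bytes(n_heads)   # multi-head: one K/V per query head
mqa = kv_cache_bytes(1)         # multiquery: one K/V head shared by all queries
print(f"MHA: {mha / 2**30:.2f} GiB, MQA: {mqa / 2**30:.3f} GiB "
      f"({n_heads}x smaller)")   # the saved memory buys longer contexts
```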
    Knowledge Distillation for Federated Learning: a Practical Guide. (arXiv:2211.04742v1 [cs.LG])
    Federated Learning (FL) enables the training of Deep Learning models without centrally collecting possibly sensitive raw data. This paves the way for stronger privacy guarantees when building predictive models. The most used algorithms for FL are parameter-averaging based schemes (e.g., Federated Averaging) that, however, have well-known limits: (i) clients must implement the same model architecture; (ii) transmitting model weights and model updates implies high communication cost, which scales up with the number of model parameters; (iii) in the presence of non-IID data distributions, parameter-averaging aggregation schemes perform poorly due to client model drift. Federated adaptations of regular Knowledge Distillation (KD) can solve or mitigate the weaknesses of parameter-averaging FL algorithms while possibly introducing other trade-offs. In this article, we provide a review of KD-based algorithms tailored for specific FL issues.
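    For orientation, the response-based distillation objective that most federated KD variants start from is the temperature-scaled teacher-student KL shown below; this is the generic formulation, not any specific algorithm from the review.

```python
# Generic response-based distillation objective that federated KD variants
# build on: temperature-scaled teacher-student KL plus the supervised loss.
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # restore gradient scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```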
    An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. (arXiv:2211.04468v1 [q-bio.QM])
    Virtual, make-on-demand chemical libraries have transformed early-stage drug discovery by unlocking vast, synthetically accessible regions of chemical space. Recent years have witnessed rapid growth in these libraries from millions to trillions of compounds, hiding undiscovered, potent hits for a variety of therapeutic targets. However, they are quickly approaching a size beyond that which permits explicit enumeration, presenting new challenges for virtual screening. To overcome these challenges, we propose the Combinatorial Synthesis Library Variational Auto-Encoder (CSLVAE). The proposed generative model represents such libraries as a differentiable, hierarchically-organized database. Given a compound from the library, the molecular encoder constructs a query for retrieval, which is utilized by the molecular decoder to reconstruct the compound by first decoding its chemical reaction and subsequently decoding its reactants. Our design minimizes autoregression in the decoder, facilitating the generation of large, valid molecular graphs. Our method performs fast and parallel batch inference for ultra-large synthesis libraries, enabling a number of important applications in early-stage drug discovery. Compounds proposed by our method are guaranteed to be in the library, and thus synthetically and cost-effectively accessible. Importantly, CSLVAE can encode out-of-library compounds and search for in-library analogues. In experiments, we demonstrate the capabilities of the proposed method in the navigation of massive combinatorial synthesis libraries.
    Utilising Bayesian Networks to combine multimodal data and expert opinion for the robust prediction of depression and its symptoms. (arXiv:2211.04924v1 [cs.LG])
    Predicting the presence of major depressive disorder (MDD) using behavioural and cognitive signals is a highly non-trivial task. The heterogeneous clinical profile of MDD means that any given speech, facial expression and/or observed cognitive pattern may be associated with a unique combination of depressive symptoms. Conventional discriminative machine learning models potentially lack the complexity to robustly model this heterogeneity. Bayesian networks, however, may instead be well-suited to such a scenario. These networks are probabilistic graphical models that efficiently describe the joint probability distribution over a set of random variables by explicitly capturing their conditional dependencies. This framework provides further advantages over standard discriminative modelling by offering the possibility to incorporate expert opinion in the graphical structure of the models, generating explainable model predictions, informing about the uncertainty of predictions, and naturally handling missing data. In this study, we apply a Bayesian framework to capture the relationships between depression, depression symptoms, and features derived from speech, facial expression and cognitive game data collected at thymia.  ( 2 min )
    Physics-informed inference of aerial animal movements from weather radar data. (arXiv:2211.04539v1 [cs.LG])
    Studying animal movements is essential for effective wildlife conservation and conflict mitigation. For aerial movements, operational weather radars have become an indispensable data source in this respect. However, partial measurements, incomplete spatial coverage, and poor understanding of animal behaviours make it difficult to reconstruct complete spatio-temporal movement patterns from available radar data. We tackle this inverse problem by learning a mapping from high-dimensional radar measurements to low-dimensional latent representations using a convolutional encoder. Under the assumption that the latent system dynamics are well approximated by a locally linear Gaussian transition model, we perform efficient posterior estimation using the classical Kalman smoother. A convolutional decoder maps the inferred latent system states back to the physical space in which the known radar observation model can be applied, enabling fully unsupervised training. To encourage physical consistency, we additionally introduce a physics-informed loss term that leverages known mass conservation constraints. Our experiments on synthetic radar data show promising results in terms of reconstruction quality and data-efficiency.  ( 2 min )
    Graph representation learning for street networks. (arXiv:2211.04984v1 [stat.ML])
    Street networks provide an invaluable source of information about the different temporal and spatial patterns emerging in our cities. These networks are often represented as graphs, where intersections are modelled as nodes and streets as links between them. Previous work has shown that a learning algorithm can operate on raster representations of the original data to obtain low-dimensional representations of street networks, and that models capturing high-level urban network metrics can be trained with convolutional neural networks. However, detailed topological information is lost through the rasterisation of the street network; such models cannot recover this information from the image alone, and thus fail to capture complex street network features. This paper proposes a model capable of inferring good representations directly from the street network. Specifically, we use a variational autoencoder with graph convolutional layers and a decoder that outputs a probabilistic fully-connected graph to learn latent representations that encode both local network structure and the spatial distribution of nodes. We train the model on thousands of street network segments and use the learnt representations to generate synthetic street configurations. Finally, we propose a possible application: classifying the urban morphology of different network segments by investigating their common characteristics in the learnt space.  ( 2 min )
    Accountable and Explainable Methods for Complex Reasoning over Text. (arXiv:2211.04946v1 [cs.LG])
    A major concern of Machine Learning (ML) models is their opacity. They are deployed in an increasing number of applications where they often operate as black boxes that do not provide explanations for their predictions. Among others, the potential harms associated with the lack of understanding of the models' rationales include privacy violations, adversarial manipulations, and unfair discrimination. As a result, the accountability and transparency of ML models have been posed as critical desiderata by works in policy and law, philosophy, and computer science. In computer science, the decision-making process of ML models has been studied by developing accountability and transparency methods. Accountability methods, such as adversarial attacks and diagnostic datasets, expose vulnerabilities of ML models that could lead to malicious manipulations or systematic faults in their predictions. Transparency methods explain the rationales behind models' predictions gaining the trust of relevant stakeholders and potentially uncovering mistakes and unfairness in models' decisions. To this end, transparency methods have to meet accountability requirements as well, e.g., being robust and faithful to the underlying rationales of a model. This thesis presents my research that expands our collective knowledge in the areas of accountability and transparency of ML models developed for complex reasoning tasks over text.
    Recovering Unbalanced Communities in the Stochastic Block Model With Application to Clustering with a Faulty Oracle. (arXiv:2202.08522v2 [cs.LG] UPDATED)
    The stochastic block model (SBM) is a fundamental model for studying graph clustering and community detection in networks. It has received great attention in the last decade, and the balanced case, i.e., assuming all clusters have large size, has been well studied. However, our understanding of the SBM with unbalanced communities (arguably more relevant in practice) is still very limited. In this paper, we provide a simple SVD-based algorithm for recovering the communities in the SBM with communities of varying sizes. We improve upon a result of Ailon, Chen and Xu [ICML 2013] by removing the assumption that there is a large interval in which the cluster sizes do not fall. Under the planted clique conjecture, the size of the clusters that can be recovered by our algorithm is nearly optimal (up to polylogarithmic factors) when the probability parameters are constant. As a byproduct, we obtain a polynomial-time algorithm with sublinear query complexity for a clustering problem with a faulty oracle, which finds all clusters of size larger than $\tilde{\Omega}({\sqrt{n}})$ even if $\Omega(n)$ small clusters co-exist in the graph. In contrast, all previous efficient algorithms that make a sublinear number of queries cannot recover any large cluster if there are more than $\tilde{\Omega}(n^{2/5})$ small clusters.
    Duality for Neural Networks through Reproducing Kernel Banach Spaces. (arXiv:2211.05020v1 [math.FA])
    Reproducing Kernel Hilbert Spaces (RKHS) have been a very successful tool in various areas of machine learning. Recently, Barron spaces have been used to prove bounds on the generalisation error for neural networks. Unfortunately, Barron spaces cannot be understood in terms of RKHS due to the strong nonlinear coupling of the weights. We show that this can be solved by using the more general Reproducing Kernel Banach Spaces (RKBS). This class of integral RKBS can be understood as an infinite union of RKHS. As an RKBS is not a Hilbert space, it is not its own dual space. However, we show that its dual space is again an RKBS in which the roles of the data and parameters are interchanged, forming an adjoint pair of RKBS including a reproducing property in the dual space. This allows us to construct the saddle point problem for neural networks, which can be approached with the whole field of primal-dual optimisation techniques.
    Profiling and Improving the PyTorch Dataloader for high-latency Storage: A Technical Report. (arXiv:2211.04908v1 [cs.LG])
    A growing number of Machine Learning frameworks recently made Deep Learning accessible to a wider audience of engineers, scientists, and practitioners by allowing straightforward use of complex neural network architectures and algorithms. However, since deep learning is rapidly evolving, not only through theoretical advancements but also with respect to hardware and software engineering, ML frameworks often lose backward compatibility and introduce technical debt that can lead to bottlenecks and sub-optimal resource utilization. Moreover, the focus is in most cases not on deep learning engineering, but rather on new models and theoretical advancements. In this work, however, we focus on engineering, more specifically on the data loading pipeline in the PyTorch framework. We designed a series of benchmarks that outline performance issues of certain steps in the data loading process. Our findings show that for classification tasks involving loading many files, such as images, the training wall-time can be significantly improved. With our new, modified ConcurrentDataloader we improve GPU utilization and significantly reduce batch loading time, by up to 12x. This allows the use of cloud-based, S3-like object storage for datasets while achieving training times comparable to storing datasets on local drives.
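    The ConcurrentDataloader itself is the authors' modification, but the stock PyTorch knobs such benchmarks revolve around look like the sketch below; the dataset path and values are placeholders.

```python
# Stock PyTorch data-loading knobs that such benchmarks revolve around; the
# ConcurrentDataloader is the authors' modification and is not shown here.
# The dataset path is a placeholder.
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

dataset = datasets.ImageFolder("/path/to/images", transform=transforms.ToTensor())
loader = DataLoader(
    dataset,
    batch_size=256,
    num_workers=8,            # parallel workers hide per-file storage latency
    pin_memory=True,          # faster host-to-GPU copies
    prefetch_factor=4,        # batches each worker keeps in flight
    persistent_workers=True,  # avoid worker restart cost between epochs
)
```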
    Leveraging Offline Data in Online Reinforcement Learning. (arXiv:2211.04974v1 [cs.LG])
    Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment, and must interact with it in order to find an $\epsilon$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset to learn from, but is unable to otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some set of offline data and, in addition, may also interact with the environment, how can we best use the offline data to minimize the number of online interactions necessary to learn an $\epsilon$-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the necessary number of online samples needed in this setting given access to some offline dataset, and develop an algorithm, \textsc{FTPedel}, which is provably optimal. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between \emph{verifiable} learning, the typical setting considered in online RL, and \emph{unverifiable} learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.
    Perceived personality state estimation in dyadic and small group interaction with deep learning methods. (arXiv:2211.04979v1 [cs.HC])
    Dyadic and small-group collaboration is an evolutionarily advantageous behaviour, and the need for such collaboration is a regular occurrence in day-to-day life. In this paper we estimate the perceived personality traits of individuals in dyadic and small groups over thin slices of interaction on four multimodal datasets. We find that our transformer-based predictive model performs similarly to human annotators tasked with predicting the perceived big-five personality traits of participants. Using this model, we analyse the estimated perceived personality traits of individuals performing tasks in small groups and dyads. Permutation analysis shows that in small groups undergoing collaborative tasks, the perceived personality of group members clusters; this is also observed for dyads in a collaborative problem-solving task, but not for dyads under non-collaborative task settings. Additionally, we find that the group-level average perceived personality traits are a better predictor of group performance than the group-level average self-reported personality traits.
    Reducing Down(stream)time: Pretraining Molecular GNNs using Heterogeneous AI Accelerators. (arXiv:2211.04598v1 [cs.LG])
    The demonstrated success of transfer learning has popularized approaches that involve pretraining models from massive data sources and subsequent finetuning towards a specific task. While such approaches have become the norm in fields such as natural language processing, implementation and evaluation of transfer learning approaches for chemistry are in the early stages. In this work, we demonstrate finetuning for downstream tasks on a graph neural network (GNN) trained over a molecular database containing 2.7 million water clusters. The use of Graphcore IPUs as an AI accelerator for training molecular GNNs reduces training time from a reported 2.7 days on 0.5M clusters to 1.2 hours on 2.7M clusters. Finetuning the pretrained model for downstream tasks of molecular dynamics and transfer to a different potential energy surface took only 8.3 hours and 28 minutes, respectively, on a single GPU.
    Code-Switching without Switching: Language Agnostic End-to-End Speech Translation. (arXiv:2210.01512v2 [cs.CL] UPDATED)
    We propose a) a Language Agnostic end-to-end Speech Translation model (LAST), and b) a data augmentation strategy to increase code-switching (CS) performance. With increasing globalization, multiple languages are increasingly used interchangeably during fluent speech. Such CS complicates traditional speech recognition and translation, as we must recognize which language was spoken first and then apply a language-dependent recognizer and subsequent translation component to generate the desired target language output. Such a pipeline introduces latency and errors. In this paper, we eliminate the need for that, by treating speech recognition and translation as one unified end-to-end speech translation problem. By training LAST with both input languages, we decode speech into one target language, regardless of the input language. LAST delivers comparable recognition and speech translation accuracy in monolingual usage, while reducing latency and error rate considerably when CS is observed.  ( 2 min )
    Constrained Update Projection Approach to Safe Policy Optimization. (arXiv:2209.07089v2 [cs.LG] UPDATED)
    Safe reinforcement learning (RL) studies problems where an intelligent agent must not only maximize reward but also avoid exploring unsafe areas. In this study, we propose CUP, a novel policy optimization method based on the Constrained Update Projection framework, which enjoys a rigorous safety guarantee. Central to our development of CUP are the newly proposed surrogate functions along with the performance bound. Compared to previous safe RL methods, CUP enjoys the following benefits: 1) CUP generalizes the surrogate functions to the generalized advantage estimator (GAE), leading to strong empirical performance; 2) CUP unifies performance bounds, providing a better understanding of and interpretability for some existing algorithms; 3) CUP provides a non-convex implementation via only first-order optimizers, which does not require any strong approximation of the convexity of the objectives. To validate our CUP method, we compare CUP against a comprehensive list of safe RL baselines on a wide range of tasks. Experiments show the effectiveness of CUP both in terms of reward and safety constraint satisfaction. We have open-sourced CUP at https://github.com/zmsn-2077/CUP-safe-rl.
    Artificial intelligence for improved fitting of trajectories of elementary particles in inhomogeneous dense materials immersed in a magnetic field. (arXiv:2211.04890v1 [physics.data-an])
    In this article, we use artificial intelligence algorithms to show how to enhance the resolution of the elementary particle track fitting in inhomogeneous dense detectors, such as plastic scintillators. We use deep learning to replace more traditional Bayesian filtering methods, drastically improving the reconstruction of the interacting particle kinematics. We show that a specific form of neural network, inherited from the field of natural language processing, is very close to the concept of a Bayesian filter that adopts a hyper-informative prior. Such a paradigm change can influence the design of future particle physics experiments and their data exploitation.
    Clinical Contrastive Learning for Biomarker Detection. (arXiv:2211.05092v1 [cs.CV])
    This paper presents a novel positive and negative set selection strategy for contrastive learning of medical images based on labels that can be extracted from clinical data. In the medical field, there exists a variety of labels for data that serve different purposes at different stages of a diagnostic and treatment process. Clinical labels and biomarker labels are two examples. In general, clinical labels are easier to obtain in larger quantities because they are regularly collected during routine clinical care, while biomarker labels require expert analysis and interpretation to obtain. Within the field of ophthalmology, previous work has shown that clinical values exhibit correlations with biomarker structures that manifest within optical coherence tomography (OCT) scans. We exploit this relationship between clinical and biomarker data to improve performance for biomarker classification. This is accomplished by leveraging the larger amount of clinical data as pseudo-labels for our data without biomarker labels in order to choose positive and negative instances for training a backbone network with a supervised contrastive loss. In this way, a backbone network learns a representation space that aligns with the available clinical data distribution. Afterwards, we fine-tune the network trained in this manner with the smaller amount of biomarker-labeled data using a cross-entropy loss in order to classify these key indicators of disease directly from OCT scans. Our method is shown to outperform state-of-the-art self-supervised methods by as much as 5% in terms of accuracy on individual biomarker detection.  ( 3 min )
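    The supervised contrastive step keyed on clinical pseudo-labels can be sketched with a SupCon-style loss as below; this is a generic formulation under assumed inputs, not the paper's exact positive/negative selection strategy.

```python
# SupCon-style loss keyed on (clinical) pseudo-labels: same-label embeddings
# are pulled together, others pushed apart. Generic formulation, not the
# paper's exact positive/negative selection strategy.
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    sim = sim.masked_fill(self_mask, float("-inf"))   # drop self-similarity
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    log_prob = log_prob.masked_fill(self_mask, 0.0)   # avoid -inf * 0 below
    pos_counts = pos_mask.sum(1).clamp(min=1)
    loss = -(log_prob * pos_mask).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()               # anchors with positives
```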
    On the use of learning-based forecasting methods for ameliorating fashion business processes: A position paper. (arXiv:2211.04798v1 [cs.CV])
    The fashion industry is one of the most active and competitive markets in the world, manufacturing millions of products and reaching large audiences every year. A plethora of business processes are involved in this large-scale industry, but due to the generally short life-cycle of clothing items, supply-chain management and retailing strategies are crucial for good market performance. Correctly understanding the wants and needs of clients, managing logistic issues and marketing the correct products are high-level problems with a lot of uncertainty associated with them, given the number of influencing factors, but most importantly due to the unpredictability often associated with the future. It follows that forecasting methods, which generate predictions of the future, are indispensable for ameliorating the various business processes that deal with the true purpose and meaning of fashion: having many people wear a particular product or style, rendering these items, people and consequently brands fashionable. In this paper, we provide an overview of three concrete forecasting tasks that any fashion company can apply in order to improve their industrial and market impact. We underline advances and issues in all three tasks and argue for their importance and the impact they can have at an industrial level. Finally, we highlight issues and directions of future work, reflecting on how learning-based forecasting methods can further aid the fashion industry.
    Semi-Equivariant Continuous Normalizing Flows for Target-Aware Molecule Generation. (arXiv:2211.04754v1 [cs.LG])
    We propose an algorithm for learning a conditional generative model of a molecule given a target. Specifically, given a receptor molecule that one wishes to bind to, the conditional model generates candidate ligand molecules that may bind to it. The distribution should be invariant to rigid body transformations that act $\textit{jointly}$ on the ligand and the receptor; it should also be invariant to permutations of either the ligand or receptor atoms. Our learning algorithm is based on a continuous normalizing flow. We establish semi-equivariance conditions on the flow which guarantee the aforementioned invariance conditions on the conditional distribution. We propose a graph neural network architecture which implements this flow, and which is designed to learn effectively despite the vast differences in size between the ligand and receptor. We evaluate our method on the CrossDocked2020 dataset, attaining a significant improvement in binding affinity over competing methods.
    Resource frugal optimizer for quantum machine learning. (arXiv:2211.04965v1 [quant-ph])
    Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers. Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data. However, training these algorithms can be challenging and calls for tailored optimization procedures. Specifically, QML applications can require a large shot-count overhead due to the large datasets involved. In this work, we advocate for simultaneous random sampling over both the dataset as well as the measurement operators that define the loss function. We consider a highly general loss function that encompasses many QML applications, and we show how to construct an unbiased estimator of its gradient. This allows us to propose a shot-frugal gradient descent optimizer called Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). Our numerics indicate that Refoqus can save several orders of magnitude in shot cost, even relative to optimizers that sample over measurement operators alone.
    Machine-Learned Exclusion Limits without Binning. (arXiv:2211.04806v1 [hep-ph])
    Machine-Learned Likelihoods (MLL) is a method that combines modern machine-learning classification techniques with likelihood-based inference tests to estimate the experimental sensitivity of high-dimensional data sets. We extend the MLL method to include exclusion hypothesis tests and show that the addition of Kernel Density Estimators avoids the need to bin the classifier output in order to extract the resulting one-dimensional signal and background probability density functions. We first test our method on toy models generated with multivariate Gaussian distributions, where the true probability distribution functions are known. We then apply it to a case of interest in the search for new physics at the HL-LHC, in which a $Z^\prime$ boson decays into lepton pairs, comparing the performance of our method for estimating 95\% CL exclusion limits to the results obtained by applying a binned likelihood to the machine-learning classifier output.
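    The unbinned ingredient can be illustrated in a few lines: fit kernel density estimates to the signal and background classifier scores and evaluate per-event log-likelihood ratios, with no binning choice involved. The score distributions below are synthetic stand-ins, not the paper's data.

```python
# The unbinned ingredient in miniature: fit KDEs to signal and background
# classifier scores (synthetic stand-ins here) and evaluate per-event
# log-likelihood ratios without choosing any binning.
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
sig_scores = rng.beta(5, 2, size=5000)    # classifier output for signal
bkg_scores = rng.beta(2, 5, size=5000)    # classifier output for background

p_sig = gaussian_kde(sig_scores)          # smooth 1D pdf, no histogram
p_bkg = gaussian_kde(bkg_scores)

x = np.array([0.9, 0.5, 0.1])             # new events' classifier scores
llr = np.log(p_sig(x)) - np.log(p_bkg(x)) # per-event log-likelihood ratio
```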
    A physics-aware deep learning model for energy localization in multiscale shock-to-detonation simulations of heterogeneous energetic materials. (arXiv:2211.04561v1 [cond-mat.mtrl-sci])
    Predictive simulations of the shock-to-detonation transition (SDT) in heterogeneous energetic materials (EM) are vital to the design and control of their energy release and sensitivity. Due to the complexity of the thermo-mechanics of EM during the SDT, both macro-scale response and sub-grid mesoscale energy localization must be captured accurately. This work proposes an efficient and accurate multiscale framework for SDT simulations of EM. We employ deep learning to model the mesoscale energy localization of shock-initiated EM microstructures upon which prediction results are used to supply reaction progress rate information to the macroscale SDT simulation. The proposed multiscale modeling framework is divided into two stages. First, a physics-aware recurrent convolutional neural network (PARC) is used to model the mesoscale energy localization of shock-initiated heterogeneous EM microstructures. PARC is trained using direct numerical simulations (DNS) of hotspot ignition and growth within microstructures of pressed HMX material subjected to different input shock strengths. After training, PARC is employed to supply hotspot ignition and growth rates for macroscale SDT simulations. We show that PARC can play the role of a surrogate model in a multiscale simulation framework, while drastically reducing the computation cost and providing improved representations of the sub-grid physics. The proposed multiscale modeling approach will provide a new tool for material scientists in designing high-performance and safer energetic materials.
    Interpretable Deep Reinforcement Learning for Green Security Games with Real-Time Information. (arXiv:2211.04987v1 [cs.LG])
    Green Security Games with real-time information (GSG-I) add the real-time information about the agents' movement to the typical GSG formulation. Prior works on GSG-I have used deep reinforcement learning (DRL) to learn the best policy for the agent in such an environment without any need to store the huge number of state representations for GSG-I. However, the decision-making process of DRL methods is largely opaque, which results in a lack of trust in their predictions. To tackle this issue, we present an interpretable DRL method for GSG-I that generates visualization to explain the decisions taken by the DRL algorithm. We also show that this approach performs better and works well with a simpler training regimen compared to the existing method.
    Combining Contrastive Learning and Knowledge Graph Embeddings to develop medical word embeddings for the Italian language. (arXiv:2211.05035v1 [cs.CL])
    Word embeddings play a significant role in today's Natural Language Processing tasks and applications. While pre-trained models may be directly employed and integrated into existing pipelines, they are often fine-tuned to better fit specific languages or domains. In this paper, we attempt to improve available embeddings in the under-covered niche of the Italian medical domain through the combination of Contrastive Learning (CL) and Knowledge Graph Embedding (KGE). The main objective is to improve the accuracy of semantic similarity between medical terms, which is also used as an evaluation task. Since the Italian language lacks medical texts and controlled vocabularies, we have developed a specific solution by combining preexisting CL methods (multi-similarity loss, contextualization, dynamic sampling) with the integration of KGEs, creating a new variant of the loss. Although it does not outperform the state-of-the-art, represented by multilingual models, the obtained results are encouraging, providing a significant leap in performance compared to the starting model while using a significantly smaller amount of data.
    Directional Privacy for Deep Learning. (arXiv:2211.04686v1 [cs.LG])
    Differentially Private Stochastic Gradient Descent (DP-SGD) is a key method for applying privacy in the training of deep learning models. DP-SGD applies isotropic Gaussian noise to gradients during training, which can perturb these gradients in any direction, damaging utility. Metric DP, however, can provide alternative mechanisms based on arbitrary metrics that might be more suitable. In this paper we apply \textit{directional privacy}, via a mechanism based on the von Mises-Fisher (VMF) distribution, to perturb gradients in terms of \textit{angular distance} so that gradient direction is broadly preserved. We show that this provides $\epsilon d$-privacy for deep learning training, rather than the $(\epsilon, \delta)$-privacy of the Gaussian mechanism, and that experimentally, on key datasets, the VMF mechanism can outperform the Gaussian in the utility-privacy trade-off.
    SUPRA: Superpixel Guided Loss for Improved Multi-modal Segmentation in Endoscopy. (arXiv:2211.04658v1 [cs.CV])
    Domain shift is a well-known problem in the medical imaging community. In particular, for endoscopic image analysis, where the data can have different modalities, the performance of deep learning (DL) methods is adversely affected: methods developed on one modality cannot be used on a different modality. However, in real clinical settings, endoscopists switch between modalities for better mucosal visualisation. In this paper, we explore domain generalisation techniques to enable DL methods to be used in such scenarios. To this end, we propose to use superpixels generated with Simple Linear Iterative Clustering (SLIC), which we refer to as "SUPRA" for SUPeRpixel Augmented method. SUPRA first generates a preliminary segmentation mask making use of our new loss, "SLICLoss", which encourages both accurate and color-consistent segmentation. We demonstrate that SLICLoss, when combined with the Binary Cross Entropy (BCE) loss, can improve the model's generalisability on data that presents significant domain shift. We validate this novel compound loss on a vanilla U-Net using the EndoUDA dataset, which contains images of Barrett's esophagus and polyps from two modalities. We show that our method yields an improvement of nearly 25% in the target domain set compared to the baseline.
    Hyper-Parameter Auto-Tuning for Sparse Bayesian Learning. (arXiv:2211.04847v1 [eess.SP])
    Choosing the values of hyper-parameters in sparse Bayesian learning (SBL) can significantly impact performance. However, the hyper-parameters are normally tuned manually, which is often a difficult task. Most recently, effective automatic hyper-parameter tuning was achieved by using an empirical auto-tuner. In this work, we address the issue of hyper-parameter auto-tuning using neural network (NN)-based learning. Inspired by the empirical auto-tuner, we design and learn a NN-based auto-tuner, and show that considerable improvement in convergence rate and recovery performance can be achieved.
    Accelerating Adversarial Perturbation by 50% with Semi-backward Propagation. (arXiv:2211.04973v1 [cs.LG])
    Adversarial perturbation plays a significant role in the field of adversarial robustness, which solves a maximization problem over the input data. We show that the backward propagation of such optimization can be accelerated by $2\times$ (and thus the overall optimization, including the forward propagation, by $1.5\times$), without any utility drop, if we only compute the output gradient but not the parameter gradient during the backward propagation.
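    In autograd terms the observation is simple: ask only for the input gradient, so the backward pass never executes the matrix multiplications that produce parameter gradients. A minimal PyTorch sketch with a placeholder model:

```python
# The observation in autograd terms: request the gradient with respect to the
# input only, so the backward pass skips the matmuls that would produce
# parameter gradients. Model and data are placeholders.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
x = torch.randn(32, 784, requires_grad=True)
y = torch.randint(0, 10, (32,))

loss = nn.functional.cross_entropy(model(x), y)
# Unlike loss.backward(), autograd.grad with inputs=[x] computes only dL/dx
# and never accumulates .grad for the model parameters.
(g_x,) = torch.autograd.grad(loss, x)
x_adv = x + 0.01 * g_x.sign()       # e.g. one FGSM-style ascent step
```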
    ARMOR: A Model-based Framework for Improving Arbitrary Baseline Policies with Offline Data. (arXiv:2211.04538v1 [cs.LG])
    We propose a new model-based offline RL framework, called Adversarial Models for Offline Reinforcement Learning (ARMOR), which can robustly learn policies to improve upon an arbitrary baseline policy regardless of data coverage. Based on the concept of relative pessimism, ARMOR is designed to optimize for the worst-case relative performance when facing uncertainty. In theory, we prove that the learned policy of ARMOR never degrades the performance of the baseline policy with any admissible hyperparameter, and can learn to compete with the best policy within data coverage when the hyperparameter is well tuned, and the baseline policy is supported by the data. Such a robust policy improvement property makes ARMOR especially suitable for building real-world learning systems, because in practice ensuring no performance degradation is imperative before considering any benefit learning can bring.
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v2 [cs.LG] UPDATED)
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a {\em Markov chain}, and we can only choose them by selecting a {\em policy} controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- \textsc{markov-design} -- that efficiently selects policies whose measurement allocation \emph{provably converges to the optimal one}. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.  ( 2 min )
    LiCo-Net: Linearized Convolution Network for Hardware-efficient Keyword Spotting. (arXiv:2211.04635v1 [cs.LG])
    This paper proposes a hardware-efficient architecture, Linearized Convolution Network (LiCo-Net), for keyword spotting. It is optimized specifically for low-power processor units like microcontrollers. ML operators exhibit heterogeneous efficiency profiles on power-efficient hardware. For the same theoretical computation cost, int8 operators are more effective than float operators, and linear layers are often more efficient than other layers. The proposed LiCo-Net is a dual-phase system that uses efficient int8 linear operators at the inference phase and applies streaming convolutions at the training phase to maintain a high model capacity. The experimental results show that LiCo-Net outperforms the singular value decomposition filter (SVDF) on hardware efficiency with on-par detection performance. Compared to SVDF, LiCo-Net reduces cycles by 40% on HiFi4 DSP.  ( 2 min )
    Share the Tensor Tea: How Databases can Leverage the Machine Learning Ecosystem. (arXiv:2209.04579v1 [cs.DB] CROSS LISTED)
    We demonstrate Tensor Query Processor (TQP): a query processor that automatically compiles relational operators into tensor programs. By leveraging tensor runtimes such as PyTorch, TQP is able to: (1) integrate with ML tools (e.g., Pandas for data ingestion, Tensorboard for visualization); (2) target different hardware (e.g., CPU, GPU) and software (e.g., browser) backends; and (3) end-to-end accelerate queries containing both relational and ML operators. TQP is generic enough to support the TPC-H benchmark, and it provides performance that is comparable to, and often better than, that of specialized CPU and GPU query processors.  ( 2 min )
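    As a flavour of the idea, a relational selection plus aggregation can be expressed directly as tensor operations; the sketch below is illustrative (TQP's actual compiler and operator coverage are far more general):
    ```python
    import torch

    # A tiny "table" stored column-wise as tensors.
    price = torch.tensor([10., 25., 7., 40.])
    qty   = torch.tensor([3.,  1.,  8., 2.])

    # SELECT SUM(price * qty) WHERE price > 8   -- compiled to tensor ops
    mask = price > 8.0                   # selection -> boolean mask
    revenue = (price * qty)[mask].sum()  # projection + aggregation
    print(revenue.item())                # the same code runs on CPU or GPU
    ```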
    Minimalist Data Wrangling with Python. (arXiv:2211.04630v1 [cs.LG])
    Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimensionality reduction, identifying naturally occurring data clusters, modelling patterns in data, comparing data between groups, and reporting the results. This textbook is a non-profit project. Its online and PDF versions are freely available at https://datawranglingpy.gagolewski.com/.
    First principles physics-informed neural network for quantum wavefunctions and eigenvalue surfaces. (arXiv:2211.04607v1 [cs.LG])
    Physics-informed neural networks have been widely applied to learn general parametric solutions of differential equations. Here, we propose a neural network to discover parametric eigenvalue and eigenfunction surfaces of quantum systems. We apply our method to solve the hydrogen molecular ion. This is an ab-initio deep learning method that solves the Schrödinger equation with the Coulomb potential, yielding realistic wavefunctions that include a cusp at the ion positions. The neural solutions are continuous and differentiable functions of the interatomic distance, and their derivatives are analytically calculated by applying automatic differentiation. Such a parametric and analytical form of the solutions is useful for further calculations such as the determination of force fields.
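    A minimal sketch of the core mechanism, a network with a trainable eigenvalue trained on the equation residual via automatic differentiation, here for a 1D harmonic oscillator rather than the paper's 3D molecular ion:
    ```python
    import torch

    net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                              torch.nn.Linear(64, 1))
    E = torch.nn.Parameter(torch.tensor(0.3))        # trainable eigenvalue
    opt = torch.optim.Adam(list(net.parameters()) + [E], lr=1e-3)

    for step in range(5000):
        x = torch.linspace(-5, 5, 128, requires_grad=True).unsqueeze(1)
        psi = torch.exp(-x**2 / 2) * net(x)          # decaying ansatz
        dpsi, = torch.autograd.grad(psi.sum(), x, create_graph=True)
        d2psi, = torch.autograd.grad(dpsi.sum(), x, create_graph=True)
        residual = -0.5 * d2psi + 0.5 * x**2 * psi - E * psi
        norm = (psi**2).mean()
        loss = (residual**2).mean() / norm + (norm - 1.0)**2  # avoid psi = 0
        opt.zero_grad(); loss.backward(); opt.step()

    print(float(E))   # ideally approaches the ground-state energy 0.5
    ```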
    Large Language Models with Controllable Working Memory. (arXiv:2211.05110v1 [cs.CL])
    Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model with an informational context to aid its performance on the underlying task, how the model's world knowledge interacts with the factual information presented in the context remains underexplored. As a desirable behavior, an LLM should give precedence to the context whenever it contains task-relevant information that conflicts with the model's memorized knowledge. This enables model predictions to be grounded in the context, which can then be used to update or correct specific model predictions without frequent retraining. By contrast, when the context is irrelevant to the task, the model should ignore it and fall back on its internal knowledge. In this paper, we undertake a first joint study of the aforementioned two properties, namely controllability and robustness, in the context of LLMs. We demonstrate that state-of-the-art T5 and PaLM (both pretrained and finetuned) could exhibit poor controllability and robustness, which do not scale with increasing model size. As a solution, we propose a novel method - Knowledge Aware FineTuning (KAFT) - to strengthen both controllability and robustness by incorporating counterfactual and irrelevant contexts to standard supervised datasets. Our comprehensive evaluation showcases the utility of KAFT across model architectures and sizes.
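    A minimal sketch of the kind of data augmentation KAFT describes, assuming a simple string substitution for the counterfactual answers; the paper's construction and prompt format will differ:
    ```python
    import random

    def kaft_examples(question, context, answer, entity_pool, distractors):
        cf_answer = random.choice(entity_pool)       # counterfactual entity
        return [
            # relevant context: supervision follows the context
            {"input": f"context: {context} question: {question}",
             "target": answer},
            # counterfactual context: the answer entity is swapped in the
            # context and the model is trained to follow the edited context
            {"input": f"context: {context.replace(answer, cf_answer)} "
                      f"question: {question}",
             "target": cf_answer},
            # irrelevant context: the model should ignore it and fall back
            # on its parametric knowledge
            {"input": f"context: {random.choice(distractors)} "
                      f"question: {question}",
             "target": answer},
        ]
    ```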
    Framework Construction of an Adversarial Federated Transfer Learning Classifier. (arXiv:2211.04734v1 [cs.LG])
    As the Internet grows in popularity, more and more classification tasks, in areas such as IoT, finance, and healthcare, rely on mobile edge computing to advance machine learning. In the medical industry, however, good diagnostic accuracy requires large amounts of labeled data to train the model, which is difficult and expensive to collect and risks jeopardizing patients' privacy. In this paper, we offer a novel medical diagnostic framework that employs a federated learning platform to ensure patient data privacy by transferring classification algorithms acquired in a labeled domain to a domain with sparse or missing labeled data. Rather than using a generative adversarial network, our framework uses a discriminative model to build multiple classification loss functions with the goal of improving diagnostic accuracy. It also avoids the difficulty of collecting large amounts of labeled data and the high cost of generating large amounts of sample data. Experiments on real-world image datasets demonstrate that the suggested adversarial federated transfer learning method is promising for real-world medical diagnosis applications that use image classification.
    Continual learning autoencoder training for a particle-in-cell simulation via streaming. (arXiv:2211.04770v1 [cs.LG])
    The upcoming exascale era will provide a new generation of physics simulations. These simulations will have a high spatiotemporal resolution, which will impact the training of machine learning models, since storing such large amounts of simulation data on disk is nearly impossible. Therefore, we need to rethink the training of machine learning models for simulations in the upcoming exascale era. This work presents an approach that trains a neural network concurrently with a running simulation, without storing data on disk: the training pipeline accesses the training data by in-memory streaming. Furthermore, we apply methods from the domain of continual learning to enhance the generalization of the model. We test our pipeline by training a 3D autoencoder concurrently with a laser wakefield acceleration particle-in-cell simulation, and experiment with various continual learning methods and their effect on generalization.
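    A minimal sketch of such a training loop, with a generator standing in for the running simulation and a tiny rehearsal buffer standing in for the continual-learning machinery:
    ```python
    import torch

    def simulation_stream(steps=1000, shape=(1, 32, 32, 32)):
        for _ in range(steps):
            yield torch.randn(*shape)    # stand-in for one simulation snapshot

    class AE(torch.nn.Module):
        def __init__(self):
            super().__init__()
            self.enc = torch.nn.Sequential(torch.nn.Flatten(),
                                           torch.nn.Linear(32**3, 128))
            self.dec = torch.nn.Linear(128, 32**3)
        def forward(self, x):
            return self.dec(self.enc(x)).view_as(x)

    model = AE()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    replay = []                           # small in-memory rehearsal buffer
    for snap in simulation_stream():
        batch = torch.stack([snap] + replay[-7:])
        loss = torch.nn.functional.mse_loss(model(batch), batch)
        opt.zero_grad(); loss.backward(); opt.step()
        replay = (replay + [snap])[-64:]  # nothing is ever written to disk
    ```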
    A survey of some recent developments in measures of association. (arXiv:2211.04702v1 [stat.ME])
    This paper surveys some recent developments in measures of association related to a new coefficient of correlation introduced by the author. A straightforward extension of this coefficient to standard Borel spaces (which includes all Polish spaces), overlooked in the literature so far, is proposed at the end of the survey.
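    The coefficient in question appears to be the rank-based correlation xi; a minimal sketch in the no-ties case:
    ```python
    import numpy as np

    def xi_coefficient(x, y):
        """xi = 1 - 3 * sum|r_{i+1} - r_i| / (n^2 - 1), pairs sorted by x."""
        n = len(x)
        order = np.argsort(x)
        ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y, x-sorted
        return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n**2 - 1)

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    print(xi_coefficient(x, np.sin(x)))              # near 1: y = f(x)
    print(xi_coefficient(x, rng.normal(size=1000)))  # near 0: independent
    ```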
    Understanding Benign Overfitting in Gradient-Based Meta Learning. (arXiv:2206.13482v2 [cs.LG] UPDATED)
    Meta learning has demonstrated tremendous success in few-shot learning with limited supervised data. In those settings, the meta model is usually overparameterized. While the conventional statistical learning theory suggests that overparameterized models tend to overfit, empirical evidence reveals that overparameterized meta learning methods still work well -- a phenomenon often called "benign overfitting." To understand this phenomenon, we focus on the meta learning settings with a challenging bilevel structure that we term gradient-based meta learning, and analyze its generalization performance under an overparameterized meta linear regression model. While our analysis uses the relatively tractable linear models, our theory contributes to understanding the delicate interplay among data heterogeneity, model adaptation and benign overfitting in gradient-based meta learning tasks. We corroborate our theoretical claims through numerical simulations.
    Extragradient with Positive Momentum is Optimal for Games with Cross-Shaped Jacobian Spectrum. (arXiv:2211.04659v1 [cs.LG])
    The extragradient method has recently gained increasing attention, due to its convergence behavior on smooth games. In $n$-player differentiable games, the eigenvalues of the Jacobian of the vector field are distributed on the complex plane, exhibiting more convoluted dynamics compared to classical (i.e., single player) minimization. In this work, we provide a polynomial-based analysis of the extragradient method with momentum for optimizing games with \emph{cross-shaped} Jacobian spectrum on the complex plane. We show two results. First, depending on the hyperparameter setup, the extragradient method with momentum exhibits three different modes of convergence: when the eigenvalues are distributed $i)$ on the real line, $ii)$ on the real line along with complex conjugates, and $iii)$ only as complex conjugates. Then, we focus on case $ii)$, i.e., when the eigenvalues of the Jacobian have a \emph{cross-shaped} structure, as observed in training generative adversarial networks. For this problem class, we derive the optimal hyperparameters of the momentum extragradient method, and show that it achieves an accelerated convergence rate.
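    A minimal sketch of the extragradient step with positive momentum on a bilinear min-max game; the hyperparameters are illustrative, not the optimal ones derived in the paper:
    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.normal(size=(5, 5))
    A /= np.linalg.norm(A, 2)             # normalize for a stable step size
    x, y = np.ones(5), np.ones(5)         # min over x, max over y of x^T A y
    mx, my = np.zeros(5), np.zeros(5)     # momentum buffers
    eta, beta = 0.3, 0.1

    for _ in range(5000):
        # extrapolation ("look ahead") step
        x_, y_ = x - eta * (A @ y), y + eta * (A.T @ x)
        # update uses gradients at the extrapolated point, plus momentum
        mx, my = beta * mx + A @ y_, beta * my - A.T @ x_
        x, y = x - eta * mx, y - eta * my

    print(np.linalg.norm(x), np.linalg.norm(y))  # should drift toward 0
    ```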
    Towards Algorithmic Fairness in Space-Time: Filling in Black Holes. (arXiv:2211.04568v1 [stat.AP])
    New technologies and the availability of geospatial data have drawn attention to spatio-temporal biases present in society. For example: the COVID-19 pandemic highlighted disparities in the availability of broadband service and its role in the digital divide; the environmental justice movement in the United States has raised awareness to health implications for minority populations stemming from historical redlining practices; and studies have found varying quality and coverage in the collection and sharing of open-source geospatial data. Despite the extensive literature on machine learning (ML) fairness, few algorithmic strategies have been proposed to mitigate such biases. In this paper we highlight the unique challenges for quantifying and addressing spatio-temporal biases, through the lens of use cases presented in the scientific literature and media. We envision a roadmap of ML strategies that need to be developed or adapted to quantify and overcome these challenges -- including transfer learning, active learning, and reinforcement learning techniques. Further, we discuss the potential role of ML in providing guidance to policy makers on issues related to spatial fairness.
    Learning to Price Supply Chain Contracts against a Learning Retailer. (arXiv:2211.04586v1 [cs.LG])
    The rise of big data analytics has automated the decision-making of companies and increased supply chain agility. In this paper, we study the supply chain contract design problem faced by a data-driven supplier who needs to respond to the inventory decisions of the downstream retailer. Both the supplier and the retailer are uncertain about the market demand and need to learn about it sequentially. The goal for the supplier is to develop data-driven pricing policies with sublinear regret bounds under a wide range of possible retailer inventory policies for a fixed time horizon. To capture the dynamics induced by the retailer's learning policy, we first make a connection to non-stationary online learning by following the notion of variation budget. The variation budget quantifies the impact of the retailer's learning strategy on the supplier's decision-making. We then propose dynamic pricing policies for the supplier for both discrete and continuous demand. We also note that our proposed pricing policy only requires access to the support of the demand distribution, but critically, does not require the supplier to have any prior knowledge about the retailer's learning policy or the demand realizations. We examine several well-known data-driven policies for the retailer, including sample average approximation, distributionally robust optimization, and parametric approaches, and show that our pricing policies lead to sublinear regret bounds in all these cases. At the managerial level, we answer affirmatively that there is a pricing policy with a sublinear regret bound under a wide range of retailer's learning policies, even though she faces a learning retailer and an unknown demand distribution. Our work also provides a novel perspective in data-driven operations management where the principal has to learn to react to the learning policies employed by other agents in the system.
    Distributional Shift Adaptation using Domain-Specific Features. (arXiv:2211.04670v1 [cs.LG])
    Machine learning algorithms typically assume that the training and test samples come from the same distributions, i.e., in-distribution. However, in open-world scenarios, streaming big data can be Out-Of-Distribution (OOD), rendering these algorithms ineffective. Prior solutions to the OOD challenge seek to identify invariant features across different training domains. The underlying assumption is that these invariant features should also work reasonably well in the unlabeled target domain. By contrast, this work is interested in the domain-specific features that include both invariant features and features unique to the target domain. We propose a simple yet effective approach that relies on correlations in general regardless of whether the features are invariant or not. Our approach uses the most confidently predicted samples identified by an OOD base model (teacher model) to train a new model (student model) that effectively adapts to the target domain. Empirical evaluations on benchmark datasets show that the performance improves over the state of the art (SOTA) by roughly 10-20%.
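    A minimal sketch of the teacher-to-student step, assuming scikit-learn style models and an illustrative confidence threshold; the paper's selection rule may differ:
    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def adapt_to_target(teacher, X_target, threshold=0.9):
        proba = teacher.predict_proba(X_target)
        keep = proba.max(axis=1) >= threshold       # most confident samples
        pseudo_labels = proba.argmax(axis=1)[keep]
        student = LogisticRegression(max_iter=1000)
        student.fit(X_target[keep], pseudo_labels)  # learns target-domain
        return student                              # (incl. specific) features
    ```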
    Differentiable Quantum Programming with Unbounded Loops. (arXiv:2211.04507v1 [quant-ph])
    The emergence of variational quantum applications has led to the development of automatic differentiation techniques in quantum computing. Recently, Zhu et al. (PLDI 2020) have formulated differentiable quantum programming with bounded loops, providing a framework for scalable gradient calculation by quantum means for training quantum variational applications. However, promising parameterized quantum applications, e.g., quantum walk and unitary implementation, cannot be trained in the existing framework due to the natural involvement of unbounded loops. To fill in the gap, we provide the first differentiable quantum programming framework with unbounded loops, including a newly designed differentiation rule, code transformation, and their correctness proof. Technically, we introduce a randomized estimator for derivatives to deal with the infinite sum in the differentiation of unbounded loops, whose applicability in classical and probabilistic programming is also discussed. We implement our framework with Python and Q#, and demonstrate a reasonable sample efficiency. Through extensive case studies, we showcase an exciting application of our framework in automatically identifying close-to-optimal parameters for several parameterized quantum applications.
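    The infinite sum can be handled with the classical single-sample randomized estimator: draw a random truncation index and reweight by its probability. A minimal sketch of that building block (the paper's estimator for loop derivatives is more elaborate):
    ```python
    import random

    def unbiased_infinite_sum(a, p=0.5):
        """Sample K ~ Geometric(p); a(K) / P(K=k) is unbiased for sum_k a(k)."""
        k, pk = 0, p                    # P(K=0) = p
        while random.random() > p:      # keep flipping until success
            k += 1
            pk *= (1 - p)               # P(K=k) = p * (1-p)^k
        return a(k) / pk

    # Example: sum_{k>=0} 2^{-k} = 2 (here the estimator is even zero-variance)
    n = 100000
    est = sum(unbiased_infinite_sum(lambda k: 0.5**k) for _ in range(n)) / n
    print(est)    # ~2
    ```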
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v2 [stat.ML] UPDATED)
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We begin to address this problem in the context of multi-class classification by developing a novel training algorithm producing models with more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method can lead to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.
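    For context, the calibration step that such training targets is the standard split-conformal construction; a minimal sketch (the paper's contribution is the training-time loss, not this step):
    ```python
    import numpy as np

    def conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
        n = len(y_cal)
        scores = 1.0 - probs_cal[np.arange(n), y_cal]    # 1 - p(true class)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        q = np.quantile(scores, level, method="higher")  # finite-sample quantile
        # boolean mask over classes; rows are prediction sets with
        # marginal coverage >= 1 - alpha under exchangeability
        return probs_test >= 1.0 - q
    ```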
    Secure and Private Source Coding with Private Key and Decoder Side Information. (arXiv:2205.05068v3 [cs.IT] UPDATED)
    The problem of secure source coding with multiple terminals is extended by considering a remote source whose noisy measurements are the correlated random variables used for secure source reconstruction. The main additions to the problem include 1) all terminals noncausally observe a noisy measurement of the remote source; 2) a private key is available to all legitimate terminals; 3) the public communication link between the encoder and decoder is rate-limited; and 4) the secrecy leakage to the eavesdropper is measured with respect to the encoder input, whereas the privacy leakage is measured with respect to the remote source. Exact rate regions are characterized for a lossy source coding problem with a private key, remote source, and decoder side information under security, privacy, communication, and distortion constraints. By replacing the distortion constraint with a reliability constraint, we obtain the exact rate region also for the lossless case. Furthermore, the lossy rate region for scalar discrete-time Gaussian sources and measurement channels is established.
    Neural Networks with Divisive normalization for image segmentation with application in cityscapes dataset. (arXiv:2203.13558v2 [cs.CV] UPDATED)
    One of the key problems in computer vision is adaptation: models are too rigid to follow the variability of the inputs. The canonical computation that explains adaptation in sensory neuroscience is divisive normalization, and it has appealing effects on image manifolds. In this work we show that including divisive normalization in current deep networks makes them more invariant to non-informative changes in the images. In particular, we focus on U-Net architectures for image segmentation. Experiments show that the inclusion of divisive normalization in the U-Net architecture leads to better segmentation results with respect to conventional U-Net. The gain increases steadily when dealing with images acquired in bad weather conditions. In addition to the results on the Cityscapes and Foggy Cityscapes datasets, we explain these advantages through visualization of the responses: the equalization induced by the divisive normalization leads to more invariant features to local changes in contrast and illumination.
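    A minimal sketch of a divisive normalization layer for a CNN, with an illustrative parameterization (the paper's version, e.g. its pooling kernel and learned constants, will differ):
    ```python
    import torch

    class DivisiveNorm(torch.nn.Module):
        """y = x / (sigma + pooled local activity), per channel."""
        def __init__(self, channels, kernel_size=5, sigma=0.1):
            super().__init__()
            self.sigma = torch.nn.Parameter(torch.full((1, channels, 1, 1), sigma))
            self.pool = torch.nn.Conv2d(channels, channels, kernel_size,
                                        padding=kernel_size // 2, bias=False)
            torch.nn.init.constant_(self.pool.weight,
                                    1.0 / (channels * kernel_size**2))

        def forward(self, x):
            # each response is divided by a measure of its neighbours' activity,
            # equalizing local contrast and illumination
            return x / (self.sigma.abs() + self.pool(x.abs()))
    ```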
    Learning to Learn Domain-invariant Parameters for Domain Generalization. (arXiv:2211.04582v1 [cs.LG])
    Due to domain shift, deep neural networks (DNNs) usually fail to generalize well on unknown test data in practice. Domain generalization (DG) aims to overcome this issue by capturing domain-invariant representations from source domains. Motivated by the insight that only partial parameters of DNNs are optimized to extract domain-invariant representations, we seek a general model that is capable of identifying and preferentially updating such domain-invariant parameters. In this paper, we propose two modules, Domain Decoupling and Combination (DDC) and Domain-invariance-guided Backpropagation (DIGB), which encourage such a general model to focus on the parameters that have a unified optimization direction between pairs of contrastive samples. Our extensive experiments on two benchmarks demonstrate that our proposed method achieves state-of-the-art performance with strong generalization capability.
    Finite Sample Identification of Wide Shallow Neural Networks with Biases. (arXiv:2211.04589v1 [cs.LG])
    Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Although the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and very little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.
    Detecting and Accommodating Novel Types and Concepts in an Embodied Simulation Environment. (arXiv:2211.04555v1 [cs.LG])
    In this paper, we present methods for two types of metacognitive tasks in an AI system: rapidly expanding a neural classification model to accommodate a new category of object, and recognizing when a novel object type is observed instead of misclassifying the observation as a known class. Our methods take numerical data drawn from an embodied simulation environment, which describes the motion and properties of objects when interacted with, and we demonstrate that this type of representation is important for the success of novel type detection. We present a suite of experiments in rapidly accommodating the introduction of new categories and concepts and in novel type detection, and an architecture to integrate the two in an interactive system.
    Creating a Safety Assurance Case for an ML Satellite-Based Wildfire Detection and Alert System. (arXiv:2211.04530v1 [cs.LG])
    Wildfires are a common problem in many areas of the world with often catastrophic consequences. A number of systems have been created to provide early warnings of wildfires, including those that use satellite data to detect fires. The increased availability of small satellites, such as CubeSats, allows the wildfire detection response time to be reduced by deploying constellations of multiple satellites over regions of interest. By using machine-learned components on board the satellites, constraints which limit the amount of data that can be processed and sent back to ground stations can be overcome. There are hazards associated with wildfire alert systems, such as failing to detect the presence of a wildfire, or detecting a wildfire in the incorrect location. It is therefore necessary to be able to create a safety assurance case for the wildfire alert ML component that demonstrates it is sufficiently safe for use. This paper describes in detail how a safety assurance case for an ML wildfire alert system is created. This represents the first fully developed safety case for an ML component, containing explicit argument and evidence as to the safety of the machine learning.
    Learning to Follow Instructions in Text-Based Games. (arXiv:2211.04591v1 [cs.LG])
    Text-based games present a unique class of sequential decision making problem in which agents interact with a partially observable, simulated environment via actions and observations conveyed through natural language. Such observations typically include instructions that, in a reinforcement learning (RL) setting, can directly or indirectly guide a player towards completing reward-worthy tasks. In this work, we study the ability of RL agents to follow such instructions. We conduct experiments that show that the performance of state-of-the-art text-based game agents is largely unaffected by the presence or absence of such instructions, and that these agents are typically unable to execute tasks to completion. To further study and address the task of instruction following, we equip RL agents with an internal structured representation of natural language instructions in the form of Linear Temporal Logic (LTL), a formal language that is increasingly used for temporally extended reward specification in RL. Our framework both supports and highlights the benefit of understanding the temporal semantics of instructions and in measuring progress towards achievement of such a temporally extended behaviour. Experiments with 500+ games in TextWorld demonstrate the superior performance of our approach.
    Cold Start Streaming Learning for Deep Networks. (arXiv:2211.04624v1 [cs.LG])
    The ability to dynamically adapt neural networks to newly-available data without performance deterioration would revolutionize deep learning applications. Streaming learning (i.e., learning from one data example at a time) has the potential to enable such real-time adaptation, but current approaches i) freeze a majority of network parameters during streaming and ii) are dependent upon offline, base initialization procedures over large subsets of data, which damages performance and limits applicability. To mitigate these shortcomings, we propose Cold Start Streaming Learning (CSSL), a simple, end-to-end approach for streaming learning with deep networks that uses a combination of replay and data augmentation to avoid catastrophic forgetting. Because CSSL updates all model parameters during streaming, the algorithm is capable of beginning streaming from a random initialization, making base initialization optional. Going further, the algorithm's simplicity allows theoretical convergence guarantees to be derived using analysis of the Neural Tangent Random Feature (NTRF). In experiments, we find that CSSL outperforms existing baselines for streaming learning in experiments on CIFAR100, ImageNet, and Core50 datasets. Additionally, we propose a novel multi-task streaming learning setting and show that CSSL performs favorably in this domain. Put simply, CSSL performs well and demonstrates that the complicated, multi-step training pipelines adopted by most streaming methodologies can be replaced with a simple, end-to-end learning approach without sacrificing performance.
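    A minimal sketch of the replay-plus-augmentation update at the core of such a streaming learner; buffer size, batch composition, and augmentation are illustrative:
    ```python
    import random
    import torch

    class ReservoirBuffer:
        def __init__(self, capacity):
            self.capacity, self.seen, self.data = capacity, 0, []

        def add(self, example):                 # classic reservoir sampling:
            self.seen += 1                      # every example seen so far is
            if len(self.data) < self.capacity:  # kept with equal probability
                self.data.append(example)
            elif (j := random.randrange(self.seen)) < self.capacity:
                self.data[j] = example

        def sample(self, k):
            return random.sample(self.data, min(k, len(self.data)))

    def streaming_step(model, opt, x, y, buffer, augment):
        batch = [(x, y)] + buffer.sample(31)    # new example + replayed ones
        xs = torch.stack([augment(bx) for bx, _ in batch])
        ys = torch.tensor([by for _, by in batch])
        loss = torch.nn.functional.cross_entropy(model(xs), ys)
        opt.zero_grad(); loss.backward(); opt.step()  # all parameters update
        buffer.add((x, y))
    ```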
    Bayesian Learning with Wasserstein Barycenters. (arXiv:1805.10833v5 [stat.ML] UPDATED)
    We introduce and study a novel model-selection strategy for Bayesian learning, based on optimal transport, along with its associated predictive posterior law: the Wasserstein population barycenter of the posterior law over models. We first show how this estimator, termed Bayesian Wasserstein barycenter (BWB), arises naturally in a general, parameter-free Bayesian model-selection framework, when the considered Bayesian risk is the Wasserstein distance. Examples are given, illustrating how the BWB extends some classic parametric and non-parametric selection strategies. Furthermore, we also provide explicit conditions granting the existence and statistical consistency of the BWB, and discuss some of its general and specific properties, providing insights into its advantages compared to usual choices, such as the model average estimator. Finally, we illustrate how this estimator can be computed using the stochastic gradient descent (SGD) algorithm in Wasserstein space introduced in a companion paper arXiv:2201.04232v2 [math.OC], and provide a numerical example for experimental validation of the proposed method.
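    In one dimension the Wasserstein barycenter has a closed form, the weighted average of quantile functions, which gives a quick feel for the object being studied; a minimal sketch (the BWB itself is defined over general model spaces):
    ```python
    import numpy as np

    def wasserstein_barycenter_1d(samples_per_model, weights, grid=1000):
        u = (np.arange(grid) + 0.5) / grid           # quantile levels
        Q = np.stack([np.quantile(s, u) for s in samples_per_model])
        return np.tensordot(weights, Q, axes=1)      # barycenter quantiles

    rng = np.random.default_rng(0)
    models = [rng.normal(-1, 1, 5000), rng.normal(2, 0.5, 5000)]
    bary = wasserstein_barycenter_1d(models, weights=[0.5, 0.5])
    print(bary.mean())   # for these Gaussians, roughly 0.5
    ```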
    The Dice loss in the context of missing or empty labels: Introducing $\Phi$ and $\epsilon$. (arXiv:2207.09521v2 [cs.CV] UPDATED)
    Although the Dice loss is one of the dominant loss functions in medical image segmentation, most research omits a closer look at its derivative, i.e., the real motor of the optimization when using gradient descent. In this paper, we highlight the peculiar action of the Dice loss in the presence of missing or empty labels. First, we formulate a theoretical basis that gives a general description of the Dice loss and its derivative. It turns out that the choice of the reduction dimensions $\Phi$ and the smoothing term $\epsilon$ is non-trivial and greatly influences its behavior. We find and propose heuristic combinations of $\Phi$ and $\epsilon$ that work in a segmentation setting with either missing or empty labels. Second, we empirically validate these findings in a binary and multiclass segmentation setting using two publicly available datasets. We confirm that the choice of $\Phi$ and $\epsilon$ is indeed pivotal. With $\Phi$ chosen such that the reductions happen over a single batch (and class) element and with a negligible $\epsilon$, the Dice loss deals with missing labels naturally and performs similarly compared to recent adaptations specific for missing labels. With $\Phi$ chosen such that the reductions happen over multiple batch elements or with a heuristic value for $\epsilon$, the Dice loss handles empty labels correctly. We believe that this work highlights some essential perspectives and hope that it encourages researchers to better describe their exact implementation of the Dice loss in future work.
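    A minimal sketch making the two choices explicit, assuming a (batch, class, height, width) layout; exact values of $\Phi$ and $\epsilon$ should follow the paper's recommendations:
    ```python
    import torch

    def dice_loss(pred, target, phi, eps):
        """phi: dims reduced before the division; eps: smoothing term."""
        inter = (pred * target).sum(dim=phi)
        denom = pred.sum(dim=phi) + target.sum(dim=phi)
        dice = (2 * inter + eps) / (denom + eps)
        return 1 - dice.mean()

    pred = torch.rand(4, 3, 64, 64)                       # probabilities
    target = torch.randint(0, 2, (4, 3, 64, 64)).float()
    # reduce per batch element and class (suits missing labels)
    l_missing = dice_loss(pred, target, phi=(2, 3), eps=1e-7)
    # reduce across the batch as well (suits empty labels)
    l_empty = dice_loss(pred, target, phi=(0, 2, 3), eps=1e-7)
    ```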
    Optimal Graph Filters for Clustering Attributed Graphs. (arXiv:2211.04634v1 [cs.LG])
    Many real-world systems can be represented as graphs where the different entities are represented by nodes and their interactions by edges. An important task in studying large datasets is graph clustering. While there has been a lot of work on graph clustering using the connectivity between the nodes, many real-world networks also have node attributes. Clustering attributed graphs requires joint modeling of graph structure and node attributes. Recent work has focused on graph convolutional networks and graph convolutional filters to combine structural and content information. However, these methods are mostly limited to lowpass filtering and do not explicitly optimize the filters for the clustering task. In this paper, we introduce a graph signal processing based approach, where we design polynomial graph filters optimized for clustering. The proposed approach is formulated as a two-step iterative optimization problem where graph filters that are interpretable and optimal for the given data are learned while maximizing the separation between different clusters. The proposed approach is evaluated on attributed networks and compared to the state-of-the-art graph convolutional network approaches.
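    A minimal sketch of applying a polynomial graph filter $H = \sum_k h_k L^k$ to node attributes; in the paper the coefficients are learned for the clustering objective, whereas here they are fixed for illustration:
    ```python
    import numpy as np

    def polynomial_filter(L, X, h):
        out, power = np.zeros_like(X), np.eye(L.shape[0])
        for hk in h:                      # accumulate h_k * L^k @ X
            out += hk * (power @ X)
            power = power @ L
        return out

    # normalized Laplacian of a toy graph, plus node attributes X
    A = np.array([[0, 1, 1, 0], [1, 0, 1, 0],
                  [1, 1, 0, 1], [0, 0, 1, 0]], dtype=float)
    d = A.sum(axis=1)
    L = np.eye(4) - A / np.sqrt(np.outer(d, d))
    X = np.random.default_rng(0).normal(size=(4, 8))
    X_filtered = polynomial_filter(L, X, h=[1.0, -0.5, 0.1])  # then cluster
    ```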
    Foundation Models for Semantic Novelty in Reinforcement Learning. (arXiv:2211.04878v1 [cs.LG])
    Effectively exploring the environment is a key challenge in reinforcement learning (RL). We address this challenge by defining a novel intrinsic reward based on a foundation model, such as contrastive language image pretraining (CLIP), which can encode a wealth of domain-independent semantic visual-language knowledge about the world. Specifically, our intrinsic reward is defined based on pre-trained CLIP embeddings without any fine-tuning or learning on the target RL task. We demonstrate that CLIP-based intrinsic rewards can drive exploration towards semantically meaningful states and outperform state-of-the-art methods in challenging sparse-reward procedurally-generated environments.
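    One simple way to realize such a reward is novelty in frozen CLIP embedding space; a minimal sketch using the open-source `clip` package (the paper's exact reward definition may differ):
    ```python
    import torch
    import clip

    model, preprocess = clip.load("ViT-B/32")
    model.eval()                                 # frozen: no fine-tuning
    memory = []                                  # embeddings of visited states

    @torch.no_grad()
    def intrinsic_reward(frame_pil):
        z = model.encode_image(preprocess(frame_pil).unsqueeze(0)).float()
        z = torch.nn.functional.normalize(z, dim=-1)
        if not memory:
            memory.append(z)
            return 1.0
        sims = (torch.cat(memory) @ z.T).max()   # cosine sim to past states
        memory.append(z)
        return float(1.0 - sims)                 # dissimilar => novel => reward
    ```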
    Deep Explainable Learning with Graph Based Data Assessing and Rule Reasoning. (arXiv:2211.04693v1 [cs.AI])
    Learning an explainable classifier often results in a low-accuracy model or a huge rule set, while learning a deep model is usually more capable of handling noisy data at scale, but at the cost of results that are hard to explain and weak generalization. To bridge this gap, we propose an end-to-end deep explainable learning approach that combines the advantage of deep models in noise handling with expert rule-based interpretability. Specifically, we propose to learn a deep data assessing model which models the data as a graph to represent the correlations among different observations, whose output is used to extract key data features. The key features are then fed into a rule network constructed from predefined noisy expert rules with trainable parameters. As these models are correlated, we propose an end-to-end training framework, utilizing the rule classification loss to optimize the rule learning model and the data assessing model at the same time. As the rule-based computation is non-differentiable, we propose a gradient linking search module to carry the gradient information from the rule learning model to the data assessing model. The proposed method is tested in an industrial production system, showing comparable prediction accuracy, much higher generalization stability, and better interpretability when compared with a decent deep ensemble baseline, and shows much better fitting power than a pure rule-based approach.
    On the Robustness of Explanations of Deep Neural Network Models: A Survey. (arXiv:2211.04780v1 [cs.LG])
    Explainability has been widely stated as a cornerstone of the responsible and trustworthy use of machine learning models. With the ubiquitous use of Deep Neural Network (DNN) models expanding to risk-sensitive and safety-critical domains, many methods have been proposed to explain the decisions of these models. Recent years have also seen concerted efforts that have shown how such explanations can be distorted (attacked) by minor input perturbations. While there have been many surveys that review explainability methods themselves, there has been no effort hitherto to assimilate the different methods and metrics proposed to study the robustness of explanations of DNN models. In this work, we present a comprehensive survey of methods that study, understand, attack, and defend explanations of DNN models. We also present a detailed review of different metrics used to evaluate explanation methods, as well as describe attributional attack and defense methods. We conclude with lessons and take-aways for the community towards ensuring robust explanations of DNN model predictions.
    OutlierDetection.jl: A modular outlier detection ecosystem for the Julia programming language. (arXiv:2211.04550v1 [cs.LG])
    OutlierDetection.jl is an open-source ecosystem for outlier detection in Julia. It provides a range of high-performance outlier detection algorithms implemented directly in Julia. In contrast to previous packages, our ecosystem enables the development of highly-scalable outlier detection algorithms using a high-level programming language. Additionally, it provides a standardized, yet flexible, interface for future outlier detection algorithms and allows for model composition unseen in previous packages. Best practices such as unit testing, continuous integration, and code coverage reporting are enforced across the ecosystem. The most recent version of OutlierDetection.jl is available at https://github.com/OutlierDetectionJL/OutlierDetection.jl.
    MP-SeizNet: A Multi-Path CNN Bi-LSTM Network for Seizure-Type Classification Using EEG. (arXiv:2211.04628v1 [eess.SP])
    Seizure type identification is essential for the treatment and management of epileptic patients. However, it is a difficult process known to be time-consuming and labor-intensive. Automated diagnosis systems, with the advancement of machine learning algorithms, have the potential to accelerate the classification process, alert patients, and support physicians in making quick and accurate decisions. In this paper, we present a novel multi-path seizure-type classification deep learning network (MP-SeizNet), consisting of a convolutional neural network (CNN) and a bidirectional long short-term memory neural network (Bi-LSTM) with an attention mechanism. The objective of this study was to classify specific types of seizures, including complex partial, simple partial, absence, tonic, and tonic-clonic seizures, using only electroencephalogram (EEG) data. The EEG data is fed to our proposed model in two different representations: the CNN is fed with wavelet-based features extracted from the EEG signals, while the Bi-LSTM is fed with raw EEG signals, letting MP-SeizNet jointly learn from different representations of seizure data for more accurate learning. The proposed MP-SeizNet was evaluated using the largest available EEG epilepsy database, the Temple University Hospital EEG Seizure Corpus, TUSZ v1.5.2. We evaluated our proposed model across different patient data using three-fold cross-validation and across seizure data using five-fold cross-validation, achieving F1 scores of 87.6% and 98.1%, respectively.
    Graph Neural Networks with Adaptive Readouts. (arXiv:2211.04952v1 [cs.LG])
    An effective aggregation of node features into a graph-level representation via readout functions is an essential step in numerous learning tasks involving graph neural networks. Typically, readouts are simple and non-adaptive functions designed such that the resulting hypothesis space is permutation invariant. Prior work on deep sets indicates that such readouts might require complex node embeddings that can be difficult to learn via standard neighborhood aggregation schemes. Motivated by this, we investigate the potential of adaptive readouts given by neural networks that do not necessarily give rise to permutation invariant hypothesis spaces. We argue that in some problems such as binding affinity prediction where molecules are typically presented in a canonical form it might be possible to relax the constraints on permutation invariance of the hypothesis space and learn a more effective model of the affinity by employing an adaptive readout function. Our empirical results demonstrate the effectiveness of neural readouts on more than 40 datasets spanning different domains and graph characteristics. Moreover, we observe a consistent improvement over standard readouts (i.e., sum, max, and mean) relative to the number of neighborhood aggregation iterations and different convolutional operators.
    Final infarct prediction in acute ischemic stroke. (arXiv:2211.04850v1 [eess.IV])
    This article focuses on the control center of each human body: the brain. We will point out the pivotal role of the cerebral vasculature and how its complex mechanisms may vary between subjects. We then emphasize a specific acute pathological state, i.e., acute ischemic stroke, and show how medical imaging and its analysis can be used to define the treatment. We show how the core-penumbra concept is used in practice using mismatch criteria and how machine learning can be used to make predictions of the final infarct, either via deconvolution or convolutional neural networks.
    Care for the Mind Amid Chronic Diseases: An Interpretable AI Approach Using IoT. (arXiv:2211.04509v1 [cs.AI])
    Health sensing for chronic disease management creates immense benefits for social welfare. Existing health sensing studies primarily focus on the prediction of physical chronic diseases. Depression, a widespread complication of chronic diseases, is however understudied. We draw on the medical literature to support depression prediction using motion sensor data. To connect human expertise in the decision-making, safeguard trust for this high-stake prediction, and ensure algorithm transparency, we develop an interpretable deep learning model: Temporal Prototype Network (TempPNet). TempPNet is built upon the emergent prototype learning models. To accommodate the temporal characteristic of sensor data and the progressive property of depression, TempPNet differs from existing prototype learning models in its capability of capturing the temporal progression of depression. Extensive empirical analyses using real-world motion sensor data show that TempPNet outperforms state-of-the-art benchmarks in depression prediction. Moreover, TempPNet interprets its predictions by visualizing the temporal progression of depression and its corresponding symptoms detected from sensor data. We further conduct a user study to demonstrate its superiority over the benchmarks in interpretability. This study offers an algorithmic solution for impactful social good - collaborative care of chronic diseases and depression in health sensing. Methodologically, it contributes to extant literature with a novel interpretable deep learning model for depression prediction from sensor data. Patients, doctors, and caregivers can deploy our model on mobile devices to monitor patients' depression risks in real-time. Our model's interpretability also allows human experts to participate in the decision-making by reviewing the interpretation of prediction outcomes and making informed interventions.
  • Open

    A Unified Analysis of Multi-task Functional Linear Regression Models with Manifold Constraint and Composite Quadratic Penalty. (arXiv:2211.04874v1 [math.ST])
    This work studies the multi-task functional linear regression models where both the covariates and the unknown regression coefficients (called slope functions) are curves. For slope function estimation, we employ penalized splines to balance bias, variance, and computational complexity. The power of multi-task learning is brought in by imposing additional structures over the slope functions. We propose a general model with double regularization over the spline coefficient matrix: i) a matrix manifold constraint, and ii) a composite penalty as a summation of quadratic terms. Many multi-task learning approaches can be treated as special cases of this proposed model, such as a reduced-rank model and a graph Laplacian regularized model. We show the composite penalty induces a specific norm, which helps to quantify the manifold curvature and determine the corresponding proper subset in the manifold tangent space. The complexity of the tangent space subset is then bridged to the complexity of the geodesic neighborhood via generic chaining. A unified convergence upper bound is obtained and specifically applied to the reduced-rank model and the graph Laplacian regularized model. The phase transition behaviors for the estimators are examined as we vary the configurations of model parameters.  ( 2 min )
    Leveraging Offline Data in Online Reinforcement Learning. (arXiv:2211.04974v1 [cs.LG])
    Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment, and must interact with it in order to find an $\epsilon$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset to learn from, but is unable to otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some set of offline data and, in addition, may also interact with the environment, how can we best use the offline data to minimize the number of online interactions necessary to learn an $\epsilon$-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the necessary number of online samples needed in this setting given access to some offline dataset, and develop an algorithm, \textsc{FTPedel}, which is provably optimal. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between \emph{verifiable} learning, the typical setting considered in online RL, and \emph{unverifiable} learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.  ( 2 min )
    Conformal Frequency Estimation with Sketched Data. (arXiv:2204.04270v2 [stat.ME] UPDATED)
    A flexible conformal inference method is developed to construct confidence intervals for the frequencies of queried objects in very large data sets, based on a much smaller sketch of those data. The approach is data-adaptive and requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals under the sole assumption of data exchangeability. Although our solution is broadly applicable, this paper focuses on applications involving the count-min sketch algorithm and a non-linear variation thereof. The performance is compared to that of frequentist and Bayesian alternatives through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
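    A minimal sketch of the count-min sketch together with a conformal lower bound computed from calibration queries with known true counts; hash functions and calibration protocol here are illustrative:
    ```python
    import numpy as np

    class CountMin:
        def __init__(self, width=2048, depth=5, seed=0):
            self.seeds = np.random.default_rng(seed).integers(0, 2**31, depth)
            self.table = np.zeros((depth, width), dtype=np.int64)
            self.width = width

        def _cols(self, item):
            return [hash((int(s), item)) % self.width for s in self.seeds]

        def add(self, item):
            for r, c in enumerate(self._cols(item)):
                self.table[r, c] += 1

        def query(self, item):          # classic deterministic upper bound
            return min(self.table[r, c] for r, c in enumerate(self._cols(item)))

    def conformal_lower_bound(cms, cal_items, cal_counts, item, alpha=0.05):
        # overestimation errors on exchangeable calibration queries
        errs = np.array([cms.query(i) - c for i, c in zip(cal_items, cal_counts)])
        n = len(errs)
        level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        q = np.quantile(errs, level, method="higher")
        return cms.query(item) - q      # valid lower bound w.p. >= 1 - alpha
    ```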
    Stochastic optimization on matrices and a graphon McKean-Vlasov limit. (arXiv:2210.00422v2 [math.PR] UPDATED)
    We consider stochastic gradient descents on the space of large symmetric matrices of suitable functions that are invariant under permuting the rows and columns using the same permutation. We establish deterministic limits of these random curves as the dimensions of the matrices go to infinity while the entries remain bounded. Under a "small noise" assumption the limit is shown to be the gradient flow of functions on graphons whose existence was established in arXiv:2111.09459. We also consider limits of stochastic gradient descents with added properly scaled reflected Brownian noise. The limiting curve of graphons is characterized by a family of stochastic differential equations with reflections and can be thought of as an extension of the classical McKean-Vlasov limit for interacting diffusions. The proofs introduce a family of infinite-dimensional exchangeable arrays of reflected diffusions and a novel notion of propagation of chaos for large matrices of interacting diffusions.
    A Note on Task-Aware Loss via Reweighing Prediction Loss by Decision-Regret. (arXiv:2211.05116v1 [cs.LG])
    In this short technical note we propose a baseline for decision-aware learning for contextual linear optimization, which solves stochastic linear optimization when cost coefficients can be predicted based on context information. We propose a decision-aware version of predict-then-optimize. We reweigh the prediction error by the decision regret incurred by an (unweighted) pilot estimator of costs to obtain a decision-aware predictor, then optimize with cost predictions from the decision-aware predictor. This method can be motivated as a finite-difference, iterate-independent approximation of the gradients of previously proposed end-to-end learning algorithms; it is also consistent with previously suggested intuition for end-to-end learning. This baseline is computationally easy to implement with readily available reweighted prediction oracles and linear optimization, and can be implemented with convex optimization so long as the prediction error minimization is convex. Empirically, we demonstrate that this approach can lead to improvements over a "predict-then-optimize" framework for settings with misspecified models, and is competitive with other end-to-end approaches. Therefore, due to its simplicity and ease of use, we suggest it as a simple baseline for end-to-end and decision-aware learning.
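    A minimal sketch of the baseline, assuming a bounded feasible polytope $\{w \ge 0,\; A w \le b\}$ and off-the-shelf regression and LP solvers:
    ```python
    import numpy as np
    from scipy.optimize import linprog
    from sklearn.linear_model import LinearRegression

    def solve(c, A_ub, b_ub):                   # downstream linear optimization
        return linprog(c, A_ub=A_ub, b_ub=b_ub).x

    def decision_aware_fit(X, C, A_ub, b_ub):
        pilot = LinearRegression().fit(X, C)    # unweighted pilot estimator
        C_hat = pilot.predict(X)
        # decision regret of the pilot's decisions under the true costs
        regret = np.array([c @ solve(ch, A_ub, b_ub) - c @ solve(c, A_ub, b_ub)
                           for c, ch in zip(C, C_hat)])
        weights = 1e-6 + regret                 # reweigh the prediction loss
        return LinearRegression().fit(X, C, sample_weight=weights)
    ```
    The fitted model's cost predictions are then fed to `solve` at decision time, i.e., the usual predict-then-optimize pipeline with a regret-aware predictor.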
    Constrained Stochastic Nonconvex Optimization with State-dependent Markov Data. (arXiv:2206.11346v4 [math.OC] UPDATED)
    We study stochastic optimization algorithms for constrained nonconvex stochastic optimization problems with Markovian data. In particular, we focus on the case when the transition kernel of the Markov chain is state-dependent. Such stochastic optimization problems arise in various machine learning problems including strategic classification and reinforcement learning. For this problem, we study both projection-based and projection-free algorithms. In both cases, we establish that the number of calls to the stochastic first-order oracle to obtain an appropriately defined $\epsilon$-stationary point is of the order $\mathcal{O}(1/\epsilon^{2.5})$. In the projection-free setting we additionally establish that the number of calls to the linear minimization oracle is of order $\mathcal{O}(1/\epsilon^{5.5})$. We also empirically demonstrate the performance of our algorithm on the problem of strategic classification with neural networks.
    Conformal Frequency Estimation with Sketched Data under Relaxed Exchangeability. (arXiv:2211.04612v1 [stat.ME])
    A flexible method is developed to construct a confidence interval for the frequency of a queried object in a very large data set, based on a much smaller sketch of the data. The approach requires no knowledge of the data distribution or of the details of the sketching algorithm; instead, it constructs provably valid frequentist confidence intervals for random queries using a conformal inference approach. After achieving marginal coverage for random queries under the assumption of data exchangeability, the proposed method is extended to provide stronger inferences accounting for possibly heterogeneous frequencies of different random queries, redundant queries, and distribution shifts. While the presented methods are broadly applicable, this paper focuses on use cases involving the count-min sketch algorithm and a non-linear variation thereof, to facilitate comparison to prior work. In particular, the developed methods are compared empirically to frequentist and Bayesian alternatives, through simulations and experiments with data sets of SARS-CoV-2 DNA sequences and classic English literature.
    Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy. (arXiv:2107.02780v5 [econ.EM] UPDATED)
    The US Census Bureau will deliberately corrupt data sets derived from the 2020 US Census in an effort to maintain privacy, suggesting a painful trade-off between the privacy of respondents and the precision of economic analysis. To investigate whether this trade-off is inevitable, we formulate a semiparametric model of causal inference with high dimensional corrupted data. We propose a procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments, with a rate of $n^{-1/2}$ for semiparametric estimands that degrades gracefully for nonparametric estimands. Our key assumption is that the true covariates are approximately low rank, which we interpret as approximate repeated measurements and validate in the Census. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. Calibrated simulations verify the coverage of our data cleaning-adjusted confidence intervals and demonstrate the relevance of our results for 2020 Census data.
    Resource frugal optimizer for quantum machine learning. (arXiv:2211.04965v1 [quant-ph])
    Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers. Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data. However, training these algorithms can be challenging and calls for tailored optimization procedures. Specifically, QML applications can require a large shot-count overhead due to the large datasets involved. In this work, we advocate for simultaneous random sampling over both the dataset as well as the measurement operators that define the loss function. We consider a highly general loss function that encompasses many QML applications, and we show how to construct an unbiased estimator of its gradient. This allows us to propose a shot-frugal gradient descent optimizer called Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). Our numerics indicate that Refoqus can save several orders of magnitude in shot cost, even relative to optimizers that sample over measurement operators alone.
    Graph representation learning for street networks. (arXiv:2211.04984v1 [stat.ML])
    Street networks provide an invaluable source of information about the different temporal and spatial patterns emerging in our cities. These streets are often represented as graphs where intersections are modelled as nodes and streets as links between them. Previous work has shown that raster representations of the original data can be created through a learning algorithm on low-dimensional representations of the street networks. In contrast, models that capture high-level urban network metrics can be trained through convolutional neural networks. However, the detailed topological data is lost through the rasterisation of the street network, and the models cannot recover this information from the image alone, failing to capture complex street network features. This paper proposes a model capable of inferring good representations directly from the street network. Specifically, we use a variational autoencoder with graph convolutional layers and a decoder that outputs a probabilistic fully-connected graph to learn latent representations that encode both local network structure and the spatial distribution of nodes. We train the model on thousands of street network segments and use the learnt representations to generate synthetic street configurations. Finally, we propose a possible application: classifying the urban morphology of different network segments by investigating their common characteristics in the learnt space.
    Enhanced Bayesian Neural Networks for Macroeconomics and Finance. (arXiv:2211.04752v1 [econ.EM])
    We develop Bayesian neural networks (BNNs) that make it possible to model generic nonlinearities and time variation for (possibly large sets of) macroeconomic and financial variables. From a methodological point of view, we allow for a general specification of networks that can be applied to either dense or sparse datasets and that combines various activation functions, a possibly very large number of neurons, and stochastic volatility (SV) for the error term. From a computational point of view, we develop fast and efficient estimation algorithms for the general BNNs we introduce. From an empirical point of view, we show, both with simulated data and with a set of common macro and financial applications, that our BNNs can be of practical use, particularly so for observations in the tails of the cross-sectional or time series distributions of the target variables.
    Hierarchical Bayesian Modelling for Knowledge Transfer Across Engineering Fleets via Multitask Learning. (arXiv:2204.12404v3 [stat.ML] UPDATED)
    A population-level analysis is proposed to address data sparsity when building predictive models for engineering infrastructure. Utilising an interpretable hierarchical Bayesian approach and operational fleet data, domain expertise is naturally encoded (and appropriately shared) between different sub-groups, representing (i) use-type, (ii) component, or (iii) operating condition. Specifically, domain expertise is exploited to constrain the model via assumptions (and prior distributions) allowing the methodology to automatically share information between similar assets, improving the survival analysis of a truck fleet and power prediction in a wind farm. In each asset management example, a set of correlated functions is learnt over the fleet, in a combined inference, to learn a population model. Parameter estimation is improved when sub-fleets share correlated information at different levels of the hierarchy. In turn, groups with incomplete data automatically borrow statistical strength from those that are data-rich. The statistical correlations enable knowledge transfer via Bayesian transfer learning, and the correlations can be inspected to inform which assets share information for which effect (i.e. parameter). Both case studies demonstrate the wide applicability to practical infrastructure monitoring, since the approach is naturally adapted between interpretable fleet models of different in situ examples.
    Finite Sample Identification of Wide Shallow Neural Networks with Biases. (arXiv:2211.04589v1 [cs.LG])
    Artificial neural networks are functions depending on a finite number of parameters typically encoded as weights and biases. The identification of the parameters of the network from finite samples of input-output pairs is often referred to as the \emph{teacher-student model}, and this model has represented a popular framework for understanding training and generalization. Even though the problem is NP-complete in the worst case, a rapidly growing literature -- after adding suitable distributional assumptions -- has established finite sample identification of two-layer networks with a number of neurons $m=\mathcal O(D)$, $D$ being the input dimension. For the range $D<m<D^2$ the problem becomes harder, and very little is known for networks parametrized by biases as well. This paper fills the gap by providing constructive methods and theoretical guarantees of finite sample identification for such wider shallow networks with biases. Our approach is based on a two-step pipeline: first, we recover the direction of the weights, by exploiting second order information; next, we identify the signs by suitable algebraic evaluations, and we recover the biases by empirical risk minimization via gradient descent. Numerical results demonstrate the effectiveness of our approach.
    Detecting Model Misspecification in Amortized Bayesian Inference with Neural Networks. (arXiv:2112.08866v5 [stat.ME] UPDATED)
    Neural density estimators have proven remarkably powerful in performing efficient simulation-based Bayesian inference in various research domains. In particular, the BayesFlow framework uses a two-step approach to enable amortized parameter estimation in settings where the likelihood function is implicitly defined by a simulation program. But how faithful is such inference when simulations are poor representations of reality? In this paper, we conceptualize the types of model misspecification arising in simulation-based inference and systematically investigate the performance of the BayesFlow framework under these misspecifications. We propose an augmented optimization objective which imposes a probabilistic structure on the latent data space and utilize maximum mean discrepancy (MMD) to detect potentially catastrophic misspecifications during inference that undermine the validity of the obtained results. We verify our detection criterion on a number of artificial and realistic misspecifications, ranging from toy conjugate models to complex models of decision making and disease outbreak dynamics applied to real data. Further, we show that posterior inference errors increase as a function of the distance between the true data-generating distribution and the typical set of simulations in the latent summary space. Thus, we demonstrate the dual utility of MMD as a method for detecting model misspecification and as a proxy for verifying the faithfulness of amortized Bayesian inference.  ( 2 min )
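    For intuition, a generic unbiased estimator of the squared MMD between observed and simulated summaries (Gaussian kernel, illustrative bandwidth) can be sketched as follows; the paper's exact criterion may differ:

        import numpy as np

        def mmd2_unbiased(x, y, bandwidth=1.0):
            """Unbiased estimate of squared maximum mean discrepancy between
            samples x (n, d) and y (m, d) under a Gaussian kernel."""
            def k(a, b):
                d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
                return np.exp(-d2 / (2 * bandwidth ** 2))
            kxx, kyy, kxy = k(x, x), k(y, y), k(x, y)
            n, m = len(x), len(y)
            term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))
            term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
            return term_x + term_y - 2 * kxy.mean()

    A large MMD between the latent summaries of real data and those of simulations is then a flag that the simulator may be misspecified.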
    Extragradient with Positive Momentum is Optimal for Games with Cross-Shaped Jacobian Spectrum. (arXiv:2211.04659v1 [cs.LG])
    The extragradient method has recently gained increasing attention due to its convergence behavior on smooth games. In $n$-player differentiable games, the eigenvalues of the Jacobian of the vector field are distributed on the complex plane, exhibiting more convoluted dynamics compared to classical (i.e., single-player) minimization. In this work, we give a polynomial-based analysis of the extragradient method with momentum for optimizing games with \emph{cross-shaped} Jacobian spectrum on the complex plane. We show two results. First, depending on the hyperparameter setup, the extragradient method with momentum exhibits three different modes of convergence: when the eigenvalues are distributed $i)$ on the real line, $ii)$ on the real line along with complex conjugates, and $iii)$ only as complex conjugates. Then, we focus on case $ii)$, i.e., when the eigenvalues of the Jacobian have a \emph{cross-shaped} structure, as observed in training generative adversarial networks. For this problem class, we derive the optimal hyperparameters of the momentum extragradient method and show that it achieves an accelerated convergence rate.
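    As a sketch of the iteration shape only (the step size and momentum here are placeholders, not the optimal values derived in the paper), extragradient with heavy-ball momentum on a game vector field reads:

        import numpy as np

        def extragradient_momentum(field, x0, lr=0.1, beta=0.1, steps=100):
            """Extragradient with positive heavy-ball momentum for a game
            vector field `field`: extrapolate, evaluate the field at the
            extrapolated point, then update with a momentum term."""
            x, x_prev = x0.copy(), x0.copy()
            for _ in range(steps):
                x_half = x - lr * field(x)                            # extrapolation step
                x_new = x - lr * field(x_half) + beta * (x - x_prev)  # update + momentum
                x_prev, x = x, x_new
            return x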
    Learning to Price Supply Chain Contracts against a Learning Retailer. (arXiv:2211.04586v1 [cs.LG])
    The rise of big data analytics has automated the decision-making of companies and increased supply chain agility. In this paper, we study the supply chain contract design problem faced by a data-driven supplier who needs to respond to the inventory decisions of the downstream retailer. Both the supplier and the retailer are uncertain about the market demand and need to learn about it sequentially. The goal for the supplier is to develop data-driven pricing policies with sublinear regret bounds under a wide range of possible retailer inventory policies for a fixed time horizon. To capture the dynamics induced by the retailer's learning policy, we first make a connection to non-stationary online learning by following the notion of variation budget. The variation budget quantifies the impact of the retailer's learning strategy on the supplier's decision-making. We then propose dynamic pricing policies for the supplier for both discrete and continuous demand. We also note that our proposed pricing policy only requires access to the support of the demand distribution, but critically, does not require the supplier to have any prior knowledge about the retailer's learning policy or the demand realizations. We examine several well-known data-driven policies for the retailer, including sample average approximation, distributionally robust optimization, and parametric approaches, and show that our pricing policies lead to sublinear regret bounds in all these cases. At the managerial level, we answer affirmatively that there is a pricing policy with a sublinear regret bound under a wide range of retailer's learning policies, even though she faces a learning retailer and an unknown demand distribution. Our work also provides a novel perspective in data-driven operations management where the principal has to learn to react to the learning policies employed by other agents in the system.
    Maximum likelihood recursive state estimation in state-space models: A new approach based on statistical analysis of incomplete data. (arXiv:2211.04631v1 [stat.ME])
    This paper revisits the work of Rauch et al. (1965) and develops a novel method for recursive maximum likelihood particle filtering for general state-space models. The new method is based on statistical analysis of incomplete observations of the system. The score function and conditional observed information of the incomplete observations/data are introduced, and their distributional properties are discussed. Some identities concerning the score function and information matrices of the incomplete data are derived. Maximum likelihood estimation of the state vector is presented in terms of the score function and observed information matrices. In particular, to deal with nonlinear state-space models, a sequential Monte Carlo method is developed. It is given recursively by an EM-gradient particle filter, which extends the work of Lange (1995) to state estimation. To derive the covariance matrix of state-estimation errors, an explicit form of the observed information matrix is proposed. It extends the general formula of Louis (1982) for the same matrix to state-vector estimation. Under (Neumann) boundary conditions of the state transition probability distribution, the inverse of this matrix coincides with the Cramer-Rao lower bound on the covariance matrix of estimation errors of unbiased state-estimators. In the case of linear models, the method shows that the Kalman filter is a fully efficient state estimator whose covariance matrix of estimation error coincides with the Cramer-Rao lower bound. Some numerical examples are discussed to exemplify the main results.  ( 3 min )
    Training Uncertainty-Aware Classifiers with Conformalized Deep Learning. (arXiv:2205.05878v2 [stat.ML] UPDATED)
    Deep neural networks are powerful tools to detect hidden patterns in data and leverage them to make predictions, but they are not designed to understand uncertainty and estimate reliable probabilities. In particular, they tend to be overconfident. We begin to address this problem in the context of multi-class classification by developing a novel training algorithm producing models with more dependable uncertainty estimates, without sacrificing predictive power. The idea is to mitigate overconfidence by minimizing a loss function, inspired by advances in conformal inference, that quantifies model uncertainty by carefully leveraging hold-out data. Experiments with synthetic and real data demonstrate this method can lead to smaller conformal prediction sets with higher conditional coverage, after exact calibration with hold-out data, compared to state-of-the-art alternatives.  ( 2 min )
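    For context, the calibration step of standard split-conformal classification, which the evaluation above builds on, can be sketched as follows (a generic recipe, not the authors' training algorithm):

        import numpy as np

        def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
            """Split-conformal classification: calibrate a score threshold on
            hold-out data so prediction sets cover the true label with
            probability about 1 - alpha (marginally)."""
            n = len(cal_labels)
            scores = 1.0 - cal_probs[np.arange(n), cal_labels]      # nonconformity
            level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
            q = np.quantile(scores, level)
            return [np.where(1.0 - p <= q)[0] for p in test_probs]

    The paper's contribution is upstream of this step: training the classifier so that the resulting sets are smaller and closer to conditionally valid.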
    Active Acquisition for Multimodal Temporal Data: A Challenging Decision-Making Task. (arXiv:2211.05039v1 [cs.LG])
    We introduce a challenging decision-making task that we call active acquisition for multimodal temporal data (A2MT). In many real-world scenarios, input features are not readily available at test time and must instead be acquired at significant cost. With A2MT, we aim to learn agents that actively select which modalities of an input to acquire, trading off acquisition cost and predictive performance. A2MT extends a previous task called active feature acquisition to temporal decision making about high-dimensional inputs. Further, we propose a method based on the Perceiver IO architecture to address A2MT in practice. Our agents are able to solve a novel synthetic scenario requiring practically relevant cross-modal reasoning skills. On two large-scale, real-world datasets, Kinetics-700 and AudioSet, our agents successfully learn cost-reactive acquisition behavior. However, an ablation reveals they are unable to learn adaptive acquisition strategies, emphasizing the difficulty of the task even for state-of-the-art models. Applications of A2MT may be impactful in domains like medicine, robotics, or finance, where modalities differ in acquisition cost and informativeness.  ( 2 min )
    Understanding Benign Overfitting in Gradient-Based Meta Learning. (arXiv:2206.13482v2 [cs.LG] UPDATED)
    Meta learning has demonstrated tremendous success in few-shot learning with limited supervised data. In those settings, the meta model is usually overparameterized. While conventional statistical learning theory suggests that overparameterized models tend to overfit, empirical evidence reveals that overparameterized meta learning methods still work well -- a phenomenon often called "benign overfitting." To understand this phenomenon, we focus on meta learning settings with a challenging bilevel structure, which we term gradient-based meta learning, and analyze its generalization performance under an overparameterized meta linear regression model. While our analysis uses relatively tractable linear models, our theory contributes to understanding the delicate interplay among data heterogeneity, model adaptation and benign overfitting in gradient-based meta learning tasks. We corroborate our theoretical claims through numerical simulations.  ( 2 min )
    Machine-Learned Exclusion Limits without Binning. (arXiv:2211.04806v1 [hep-ph])
    Machine-Learned Likelihoods (MLL) is a method that, by combining modern machine-learning classification techniques with likelihood-based inference tests, makes it possible to estimate the experimental sensitivity of high-dimensional data sets. We extend the MLL method by including exclusion hypothesis tests and show that the addition of Kernel Density Estimators avoids the need to bin the classifier output in order to extract the resulting one-dimensional signal and background probability density functions. We first test our method on toy models generated with multivariate Gaussian distributions, where the true probability distribution functions are known. We then apply it to a case of interest in the search for new physics at the HL-LHC, in which a $Z^\prime$ boson decays into lepton pairs, comparing the performance of our method for estimating 95\% CL exclusion limits to the results obtained applying a binned likelihood to the machine-learning classifier output.  ( 2 min )
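    To illustrate the unbinned ingredient, a KDE-based sketch of the signal and background densities of a classifier output might look like the following; the scores here are synthetic stand-ins, not the paper's data:

        import numpy as np
        from scipy.stats import gaussian_kde

        # Stand-ins for classifier outputs s(x) on signal and background events.
        sig_scores = np.random.beta(5, 2, size=5000)
        bkg_scores = np.random.beta(2, 5, size=5000)

        # Unbinned 1-D densities of the classifier output, used in place of
        # histograms when building the likelihood.
        f_sig = gaussian_kde(sig_scores)
        f_bkg = gaussian_kde(bkg_scores)

        grid = np.linspace(0, 1, 200)
        ratio = f_sig(grid) / np.maximum(f_bkg(grid), 1e-12)  # density ratio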
    Active Exploration via Experiment Design in Markov Chains. (arXiv:2206.14332v2 [cs.LG] UPDATED)
    A key challenge in science and engineering is to design experiments to learn about some unknown quantity of interest. Classical experimental design optimally allocates the experimental budget to maximize a notion of utility (e.g., reduction in uncertainty about the unknown quantity). We consider a rich setting, where the experiments are associated with states in a {\em Markov chain}, and we can only choose them by selecting a {\em policy} controlling the state transitions. This problem captures important applications, from exploration in reinforcement learning to spatial monitoring tasks. We propose an algorithm -- \textsc{markov-design} -- that efficiently selects policies whose measurement allocation \emph{provably converges to the optimal one}. The algorithm is sequential in nature, adapting its choice of policies (experiments) informed by past measurements. In addition to our theoretical analysis, we showcase our framework on applications in ecological surveillance and pharmacology.  ( 2 min )
    Sparse Bayesian Lasso via a Variable-Coefficient $\ell_1$ Penalty. (arXiv:2211.05089v1 [stat.ME])
    Modern statistical learning algorithms are capable of amazing flexibility, but struggle with interpretability. One possible solution is sparsity: making inference such that many of the parameters are estimated as being identically 0, which may be imposed through the use of nonsmooth penalties such as the $\ell_1$ penalty. However, the $\ell_1$ penalty introduces significant bias when high sparsity is desired. In this article, we retain the $\ell_1$ penalty, but define learnable penalty weights $\lambda_p$ endowed with hyperpriors. We start the article by investigating the optimization problem this poses, developing a proximal operator associated with the $\ell_1$ norm. We then study the theoretical properties of this variable-coefficient $\ell_1$ penalty in the context of penalized likelihood. Next, we investigate application of this penalty to Variational Bayes, developing a model we call the Sparse Bayesian Lasso which allows behavior qualitatively like Lasso regression to be applied to arbitrary variational models. In simulation studies, this gives us the uncertainty quantification and low-bias properties of simulation-based approaches with an order of magnitude less computation. Finally, we apply our methodology to a Bayesian lagged spatiotemporal regression model of internal displacement that occurred during the Iraqi Civil War of 2013-2017.  ( 2 min )
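    For fixed penalty weights, the proximal operator of a weighted $\ell_1$ penalty reduces to coordinate-wise soft-thresholding; the sketch below shows that textbook baseline, not the paper's operator for jointly learned $\lambda_p$:

        import numpy as np

        def prox_weighted_l1(x, lam, step=1.0):
            """Proximal operator of sum_p lam[p] * |x[p]|: coordinate-wise
            soft-thresholding with a per-coordinate threshold step * lam[p]."""
            return np.sign(x) * np.maximum(np.abs(x) - step * lam, 0.0)

    Coordinates whose magnitude falls below their threshold are set exactly to zero, which is how the penalty produces sparse estimates.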
    Bayesian Learning with Wasserstein Barycenters. (arXiv:1805.10833v5 [stat.ML] UPDATED)
    We introduce and study a novel model-selection strategy for Bayesian learning, based on optimal transport, along with its associated predictive posterior law: the Wasserstein population barycenter of the posterior law over models. We first show how this estimator, termed Bayesian Wasserstein barycenter (BWB), arises naturally in a general, parameter-free Bayesian model-selection framework, when the considered Bayesian risk is the Wasserstein distance. Examples are given, illustrating how the BWB extends some classic parametric and non-parametric selection strategies. Furthermore, we also provide explicit conditions granting the existence and statistical consistency of the BWB, and discuss some of its general and specific properties, providing insights into its advantages compared to usual choices, such as the model average estimator. Finally, we illustrate how this estimator can be computed using the stochastic gradient descent (SGD) algorithm in Wasserstein space introduced in a companion paper arXiv:2201.04232v2 [math.OC], and provide a numerical example for experimental validation of the proposed method.  ( 2 min )
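    In the notation suggested by the abstract (our paraphrase, with $\mu_\theta$ the model indexed by parameter $\theta$ and $\Pi(\cdot \mid x_{1:n})$ the posterior law over models), the BWB solves

        $$ \hat{\nu}_{\mathrm{BWB}} \in \operatorname*{arg\,min}_{\nu} \; \mathbb{E}_{\theta \sim \Pi(\cdot \mid x_{1:n})} \big[ W_2^2(\nu, \mu_\theta) \big], $$

    i.e., it replaces the usual model-average predictive with the distribution closest to the posterior in expected squared Wasserstein distance.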
    An efficient graph generative model for navigating ultra-large combinatorial synthesis libraries. (arXiv:2211.04468v1 [q-bio.QM])
    Virtual, make-on-demand chemical libraries have transformed early-stage drug discovery by unlocking vast, synthetically accessible regions of chemical space. Recent years have witnessed rapid growth in these libraries from millions to trillions of compounds, hiding undiscovered, potent hits for a variety of therapeutic targets. However, they are quickly approaching a size beyond that which permits explicit enumeration, presenting new challenges for virtual screening. To overcome these challenges, we propose the Combinatorial Synthesis Library Variational Auto-Encoder (CSLVAE). The proposed generative model represents such libraries as a differentiable, hierarchically-organized database. Given a compound from the library, the molecular encoder constructs a query for retrieval, which is utilized by the molecular decoder to reconstruct the compound by first decoding its chemical reaction and subsequently decoding its reactants. Our design minimizes autoregression in the decoder, facilitating the generation of large, valid molecular graphs. Our method performs fast and parallel batch inference for ultra-large synthesis libraries, enabling a number of important applications in early-stage drug discovery. Compounds proposed by our method are guaranteed to be in the library, and thus synthetically and cost-effectively accessible. Importantly, CSLVAE can encode out-of-library compounds and search for in-library analogues. In experiments, we demonstrate the capabilities of the proposed method in the navigation of massive combinatorial synthesis libraries.  ( 2 min )
    Flexible variable selection in the presence of missing data. (arXiv:2202.12989v3 [stat.ME] UPDATED)
    In many applications, it is of interest to identify a parsimonious set of features, or panel, from multiple candidates that achieves a desired level of performance in predicting a response. This task is often complicated in practice by missing data arising from the sampling design or other random mechanisms. Most recent work on variable selection in missing data contexts relies in some part on a finite-dimensional statistical model, e.g., a generalized or penalized linear model. In cases where this model is misspecified, the selected variables may not all be truly scientifically relevant and can result in panels with suboptimal classification performance. To address this limitation, we propose a nonparametric variable selection algorithm combined with multiple imputation to develop flexible panels in the presence of missing-at-random data. We outline strategies based on the proposed algorithm that achieve control of commonly used error rates. Through simulations, we show that our proposal has good operating characteristics and results in panels with higher classification and variable selection performance compared to several existing penalized regression approaches in cases where a generalized linear model is misspecified. Finally, we use the proposed method to develop biomarker panels for separating pancreatic cysts with differing malignancy potential in a setting where complicated missingness in the biomarkers arose due to limited specimen volumes.  ( 3 min )
    A Theoretical Understanding of Neural Network Compression from Sparse Linear Approximation. (arXiv:2206.05604v2 [stat.ML] UPDATED)
    The goal of model compression is to reduce the size of a large neural network while retaining a comparable performance. As a result, computation and memory costs in resource-limited applications may be significantly reduced by dropping redundant weights, neurons, or layers. There have been many model compression algorithms proposed that provide impressive empirical success. However, a theoretical understanding of model compression is still limited. One problem is understanding if a network is more compressible than another of the same structure. Another problem is quantifying how much one can prune a network with theoretically guaranteed accuracy degradation. In this work, we propose to use the sparsity-sensitive $\ell_q$-norm ($0<q<1$) to characterize compressibility and provide a relationship between soft sparsity of the weights in the network and the degree of compression with a controlled accuracy degradation bound. We also develop adaptive algorithms for pruning each neuron in the network informed by our theory. Numerical studies demonstrate the promising performance of the proposed methods compared with standard pruning algorithms.  ( 2 min )
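    As a rough illustration of a sparsity-sensitive measure and the magnitude pruning it could inform (the paper's precise quantity and its adaptive per-neuron rule may differ), consider:

        import numpy as np

        def lq_compressibility(w, q=0.5):
            """Sparsity-sensitive l_q 'norm' (0 < q < 1) of a weight vector,
            normalised by l_2; smaller values indicate softer sparsity and
            hence a more compressible layer."""
            w = np.abs(w.ravel())
            return (np.sum(w ** q) ** (1.0 / q)) / np.linalg.norm(w)

        def prune_smallest(w, keep_frac):
            """Magnitude pruning: zero all but the largest keep_frac weights."""
            k = int(np.ceil(keep_frac * w.size))
            thresh = np.sort(np.abs(w.ravel()))[-k]
            return np.where(np.abs(w) >= thresh, w, 0.0)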
  • Open

    Free cloud services with access to GPUs?
    Hi everyone! I'm taking a course where I am using ML, and after hours and hours of tuning my Rainbow DQN agent IT FINALLY WORKS. But now I need to train it :(:( I would like to know if anyone knows of cloud computing free trials where you have access to GPUs? The Google Cloud and AWS trials don't. submitted by /u/Secret-Toe-8185 [link] [comments]  ( 52 min )

  • Open

    Is there any art AI that mixes images?
    So like, I've been trying to find some sort of art AI that can take two or more images as references to make new art from (as opposed to the ones that create art from text prompts). There is Artbreeder, but that one is kinda limiting with uploads and everything, and I haven't been able to find anything good. Can you guys help me out by chance? submitted by /u/XCanadienGamerX [link] [comments]  ( 46 min )
    Has model distillation fallen off?
    Back around the GPT-2 and BERT days, model distillation was a big topic and almost always followed the release of a new model. These days however, it seems to be less prevalent in newer projects. Have groups largely abandoned model distillation? submitted by /u/holamyeung [link] [comments]  ( 46 min )
    Any AI art generators with no restrictions that are free to use with decent speed and image quality?
    I've been playing around with AI art generators and was wondering if there are any with no restrictions that are free to use with decent speed and image quality. Please tell me about any AI art generators that have no restrictions or payment, are free to use, and have good picture quality and generating speed, because I'd like to know. submitted by /u/DravinRaven [link] [comments]  ( 46 min )
    Kermit isn’t just a frog, but your one stop ML data management platform
    Hey Machine Learning Engineers 👋, we are Kermit.ai and our mission is to become the world’s “go to” platform for NLP researchers and practitioners to accelerate and simplify their machine learning data management process. You struggle with data annotation? We got you covered. You suffer from biased data? We got you covered. You have trouble versioning your data? Guess what... we got you covered. Kermit.ai supports you throughout the entire ML data management process: managing data, annotating data, preparing data, identifying outliers in data, and visualizing, curating and searching data. Visit us at kermit.ai to sign up for the waitlist and learn more. Keep engineering 🧑🏽‍💻, Kermit submitted by /u/kermitai [link] [comments]  ( 45 min )
    I need to make an assignment about AI
    Hi redditors, I need your help with an assignment for my university. It should be 10 to 20 pages long, with 50% theoretical and 50% practical content (statistics, case studies, dissemination of knowledge). How would you structure the work? submitted by /u/MorWa01 [link] [comments]  ( 46 min )
    This video footage is from around the 1960s. Look at the way it is being restored by CodeFormer, an incredible Transformer-based prediction network. Check out the paper and code in the comments below ---->
    submitted by /u/ai-lover [link] [comments]  ( 44 min )
    Artificial Intelligence Takes On Near-Earth Objects (NEOs)
    submitted by /u/SamuelSmith1416 [link] [comments]  ( 44 min )
    Just an honest question: biology uses pain as the ultimate teaching tool; what is used in the world of AI that simulates pain to direct a desired result? Or has this ever been attempted?
    submitted by /u/stickafugginit [link] [comments]  ( 48 min )
    AI Dream 59 - EPIC TRIPPY INFINITE ZOOM - Deep Space Journey by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 44 min )
    Does anyone have an AI for stock market prediction, please?
    I heard that AI can predict the stock market but I have no idea how to get there, can someone please help me? submitted by /u/SandraPlugged [link] [comments]  ( 44 min )
    “Too easy“—Midjourney tests dramatic new version of its AI image generator
    submitted by /u/Tao_Dragon [link] [comments]  ( 48 min )
    AI helps optimize power electronic converters
    A new and more efficient way of modeling and designing power electronic converters using artificial intelligence (AI) has been created by a team of experts from Cardiff University and the Compound Semiconductor Applications (CSA) Catapult. The method has reduced design times for technology by up to 78% compared to traditional approaches and was used to create a device with an efficiency of over 98%. The team's findings have been published in the IEEE Open Journal of Power Electronics and IEEE Transactions on Power Electronics. A power converter is an electrical device for converting electrical energy. It can convert alternating current (AC) into direct current (DC) and vice versa or change the voltage or frequency of a current. Power converters are used in a wide range of technologies, …  ( 46 min )
    Neural vocoder and its application in speech recognition
    Nowadays, most people are familiar with AI voice assistants and use this technology in their day-to-day life. Recently, at Google's I/O 2018 event, they demonstrated a phone call conversation between the Google Assistant and a real person. The reason behind the realistic conversation is, of course, the intelligence, but also the voice of the AI. The technology used to create such a realistic voice is called speech synthesis. In this article, we will be discussing one of the speech synthesis algorithms and try to learn about the architecture and working of the neural vocoder. Following are the topics to be covered. https://analyticsindiamag.com/neural-vocoder-and-its-application-in-speech-recognition/ submitted by /u/analyticsindiam [link] [comments]  ( 49 min )
    Autumn Women | Made with Artificial Intelligence 🎉
    submitted by /u/AubreBrumfield [link] [comments]  ( 43 min )
    I fed the new Midjourney AI to create some Burning Man images "in the style" of my name! Solipsistic, yes indeed. :)
    submitted by /u/treyratcliff [link] [comments]  ( 44 min )
    Join us for a usability testing & receive a $100 AUD gift voucher!
    Hey everyone, PI.EXCHANGE is developing an AutoML tool to help our users identify which of their customers are likely to leave their business. Ideally you are: familiar with tabular data for customer transactions; have a basic understanding of customer relationship management in the banking, retail, ecommerce, etc. industries. If you're interested, please sign up via this link: https://www.pi.exchange/usability-testing-churn-prototype submitted by /u/PIEXCHANGE [link] [comments]  ( 49 min )
    Have you ever imagined what a watermelon-like lamp or a tiger-like rabbit would look like? In this work, ByteDance researchers attempt to answer these questions by exploring a new task called semantic mixing, aiming at mixing two different semantics to create a new concept
    submitted by /u/ai-lover [link] [comments]  ( 47 min )
    Latest Artificial Intelligence (AI) Research Proposes A Method To Transform Faces Through Time
    submitted by /u/ai-lover [link] [comments]  ( 46 min )
  • Open

    [P] deodel - a mixed attributes classifier
    Deodel implements in Python a classifier algorithm with native support for mixed attributes data. It features good accuracy and versatility. It is available at: https://github.com/c4pub/deodel submitted by /u/usrideas [link] [comments]  ( 59 min )
    [Discussion] Can we train with multiple sources of data, some very reliable, others less so?
    Hi all, I will be careful not to use the term "confidence", to keep the goal clear and avoid confusion with confidence intervals or predictions. I have two sources of data: one very reliable (experimental), and another less so, but which still carries useful information. Is it possible to feed the entirety of the data to an algorithm while specifying a certain "trust" or "reliability" for each data source? The goal is to put more weight on the reliable source while still picking up some hidden patterns from the second source. submitted by /u/DreamyPen [link] [comments]  ( 61 min )
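    One simple, widely used option is per-sample weighting of the loss, which most libraries expose directly; a minimal sketch with synthetic stand-in data and an illustrative weight of 0.3 for the noisy source:

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        # Stand-ins for the two sources: reliable experimental data and a
        # larger, noisier auxiliary set (labels flipped 20% of the time).
        X_rel = rng.normal(size=(200, 5))
        y_rel = (X_rel[:, 0] > 0).astype(int)
        X_aux = rng.normal(size=(1000, 5))
        y_aux = (X_aux[:, 0] > 0).astype(int) ^ (rng.random(1000) < 0.2)

        X = np.vstack([X_rel, X_aux])
        y = np.concatenate([y_rel, y_aux])
        # Down-weight the unreliable source; 0.3 is a knob to tune, e.g. by
        # validating on a held-out split of the reliable data only.
        w = np.concatenate([np.ones(len(y_rel)), np.full(len(y_aux), 0.3)])

        model = LogisticRegression().fit(X, y, sample_weight=w)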
    [Discussion] Binary classifier False-positive rate estimate
    I have a binary classification problem, where I have 2 methods. Method A: this method is 100% precise, but has low recall. Method B: it has a much higher recall, but lower precision. I was thinking of using the samples from method A to estimate a false-positive rate for method B, but how would I do that? Is there any way to estimate the false positive rate here? submitted by /u/pedromnasc [link] [comments]  ( 61 min )
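    One hedged way to proceed: treat A's positives as verified true positives, estimate B's recall on them, and then solve $P(B{+}) = \pi \cdot TPR + (1-\pi) \cdot FPR$ for the FPR given an assumed prevalence $\pi$. This rests on two strong assumptions: A's positives are representative of all positives (questionable, since A's low recall may mean it only finds an easy subset), and the prevalence is known from elsewhere. A sketch:

        import numpy as np

        def estimate_fpr(b_flags_on_a_pos, b_flag_rate_overall, prevalence):
            """Rough FPR estimate for method B.
            - b_flags_on_a_pos: B's 0/1 predictions on A's verified positives
              (assumed representative of all true positives).
            - b_flag_rate_overall: fraction of all samples that B flags.
            - prevalence: assumed base rate of true positives.
            Solves P(B+) = prev * TPR + (1 - prev) * FPR for FPR."""
            tpr = np.mean(b_flags_on_a_pos)
            fpr = (b_flag_rate_overall - prevalence * tpr) / (1.0 - prevalence)
            return float(np.clip(fpr, 0.0, 1.0))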
    [R] A relabelling of the COCO 2017 dataset
    Hi everybody, here is the complete relabelling of the COCO 2017 dataset for segmentation. This is all free of charge, un-gated, and was done by Sama, a labelling company for CV data. The dataset is available on the Sama website under a Creative Commons license: https://www.sama.com/sama-coco-dataset/. I would also love to hear any feedback. Disclaimer: I work for Sama submitted by /u/iknowjerome [link] [comments]  ( 57 min )
    [R] An optimal control perspective on diffusion-based generative modeling
    Explore the connection between diffusion models and optimal control 🔥 📖 Paper 🎙 Come to our oral at the NeurIPS workshop on score-based methods and let’s discuss how one field can benefit from the other. Highlights: 1️⃣ Log-density of the underlying SDE satisfies a HJB equation. 2️⃣ ELBO follows directly from the verification theorem. 3️⃣ Diffusion-based approach to sample from (unnormalized) densities. ...and more to come! submitted by /u/julbern [link] [comments]  ( 62 min )
    [D] Alternatives to padding
    I am working on a project which requires me to consolidate the representations of several vectors of different shapes. Is there any alternative method to padding that can be used for this? submitted by /u/GrammarPaparazzi [link] [comments]  ( 66 min )
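    If the vectors differ in length (rather than dimensionality), one common alternative is to pad only for batching and aggregate with a mask, so pad positions never influence the representation; attention pooling or projecting each vector to a common size are other options. A minimal masked-mean sketch:

        import torch

        def masked_mean(vectors, lengths):
            """Average variable-length sequences of vectors into fixed-size
            representations: pad for batching only, and mask so that pad
            positions never contribute to the mean."""
            b, t, d = vectors.shape                        # (batch, max_len, dim)
            idx = torch.arange(t, device=vectors.device)
            mask = idx[None, :] < lengths[:, None]         # (batch, max_len) validity
            summed = (vectors * mask[..., None]).sum(dim=1)
            return summed / lengths[:, None].clamp(min=1).to(vectors.dtype)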
    [R] Vector database for dense vector embeddings in a nutshell
    Hi folks! When we created this brand new category 3 years ago, nobody knew what the heck a vector database is. Now that it is becoming a less strange term and gaining more interest every day, we hear many more interested people asking about what it is and how it's different. So here it is, (dramatic pause) the ultimate guide, served with the latest paper! In the tutorial, we answered the 3 (power of three!) most commonly asked questions about vector databases: What is a vector database? How does it compare to vector search libraries / ANN libraries like FAISS, ScaNN, and HNSW? How does it compare to vector search plugins, such as ClickHouse and Elasticsearch? Link to tutorial: https://zilliz.com/learn/what-is-vector-database Link to paper for architectural scoop: https://arxiv.org/abs/2206.13843 If you like what you see and fancy more, check out the Vector Database 101 series that's being updated on a weekly basis. Enjoy! submitted by /u/claireyuw [link] [comments]  ( 58 min )
    [R] LiBai: a large-scale open-source model training toolbox
    Glad to share our open-source work: LiBai, a large-scale open-source model training toolbox based on OneFlow. Its biggest feature is that it allows users to easily train any model in parallel. GitHub link: https://github.com/Oneflow-Inc/libai. LiBai documentation: https://libai.readthedocs.io/en/latest/tutorials/get_started/Installation.html. Model zoo: supports 3D-parallel (data parallel + tensor parallel + pipeline parallel) models: Bert, GPT2, T5, Vision Transformer, Swin Transformer, ResMLP, Roberta; and there are more projects in LiBai. Characteristics of LiBai: LiBai achieves better throughput than Megatron; refer to the Benchmark for more details, e.g. 3-D parallel BERT (nl24_fp16_2x2x4_ac_mb128_gb2048_2n8g): LiBai 267.39 samples/s vs. Megatron 233.7 samples…  ( 58 min )
  • Open

    How Prodege saved $1.5 million in annual human review costs using low-code computer vision AI
    This post was co-authored by Arun Gupta, the Director of Business Intelligence at Prodege, LLC. Prodege is a data-driven marketing and consumer insights platform comprised of consumer brands—Swagbucks, MyPoints, Tada, ySense, InboxDollars, InboxPounds, DailyRewards, PollFish, and Upromise—along with a complementary suite of business solutions for marketers and researchers. Prodege has 120 million users and has […]  ( 7 min )
    Identifying and avoiding common data issues while building no code ML models with Amazon SageMaker Canvas
    Business analysts work with data and like to analyze, explore, and understand data to achieve effective business outcomes. To address business problems, they often rely on machine learning (ML) practitioners such as data scientists to assist with techniques such as utilizing ML to build models using existing data and generate predictions. However, it isn’t always […]  ( 8 min )
  • Open

    Learning backpropogation from end to end
    Hi, I want to learn the backpropagation algorithm end to end, that is, to understand all the steps along the way, including: the Taylor series approximation of a function, at least at a high level; SGD and why stepping in the direction opposite to the gradient is the right decision; and the different stages of the propagation. Does anyone have a recommended resource for that? Thanks submitted by /u/yehuda1033 [link] [comments]  ( 46 min )
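    As a companion to any resource, a from-scratch sketch helps: a one-hidden-layer network where the chain rule is applied by hand and every parameter steps against its gradient (the direction of steepest descent):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(64, 2))
        y = (X[:, :1] * X[:, 1:2] > 0).astype(float)     # XOR-like target

        W1, b1 = rng.normal(size=(2, 8)) * 0.5, np.zeros(8)
        W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)

        for step in range(2000):
            # forward pass
            h = np.tanh(X @ W1 + b1)
            p = 1 / (1 + np.exp(-(h @ W2 + b2)))         # sigmoid output
            # backward pass: chain rule applied layer by layer
            # (for sigmoid + cross-entropy, the output error is simply p - y)
            d_out = (p - y) / len(X)
            dW2, db2 = h.T @ d_out, d_out.sum(0)
            d_h = d_out @ W2.T * (1 - h ** 2)            # tanh' = 1 - tanh^2
            dW1, db1 = X.T @ d_h, d_h.sum(0)
            # SGD: step against the gradient
            for param, grad in ((W1, dW1), (b1, db1), (W2, dW2), (b2, db2)):
                param -= 0.5 * grad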
  • Open

    Ensuring AI works with the right dose of curiosity
    Researchers make headway in solving a longstanding problem of balancing curious “exploration” versus “exploitation” of known pathways in reinforcement learning.  ( 10 min )
  • Open

    Cloud Intelligence/AIOps – Infusing AI into Cloud Computing Systems
    When legendary computer scientist Jim Gray accepted the Turing Award in 1999, he laid out a dozen long-range information technology research goals. One of those goals called for the creation of trouble-free server systems or, in Gray’s words, to “build a system used by millions of people each day and yet administered and managed by […] The post Cloud Intelligence/AIOps – Infusing AI into Cloud Computing Systems appeared first on Microsoft Research.  ( 14 min )
  • Open

    Give the Gift of Gaming With GeForce NOW Gift Cards
    The holiday season is approaching, and GeForce NOW has everyone covered. This GFN Thursday brings an easy way to give the gift of gaming with GeForce NOW gift cards, for yourself or for a gamer in your life. Plus, stream 10 new games from the cloud this week, including the first story downloadable content (DLC) Read article > The post Give the Gift of Gaming With GeForce NOW Gift Cards appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Straight on a map or straight on a globe?
    Straight lines on a globe are not straight on a map, and straight lines on a map are not straight on a globe. A straight line on a globe is an arc of a great circle, the shortest path between two points. When projected onto a map, a straight path looks curved. Here’s an image […] Straight on a map or straight on a globe? first appeared on John D. Cook.  ( 5 min )
  • Open

    US Universities doing research in RL
    I am looking for US Ph.D. programs to apply to for Fall 2023. Does anyone know which universities have labs that are actively doing research in RL? Of course I am not talking about Unis such as CMU or Stanford, because my CV is probably not good enough for them. P.S. There is a post on the subject, but it's 4 years old: https://www.reddit.com/r/reinforcementlearning/comments/9cx9xw/grad_schools_with_good_programs_for_someone/?utm_source=share&utm_medium=android_app&utm_name=androidcss&utm_term=1&utm_content=share_button submitted by /u/QuestHunter123 [link] [comments]  ( 53 min )
    n-step return in stable baseline PPO algorithm
    Is there a way to define n-step returns in the Stable Baselines PPO algorithm? For instance, I want to choose 3-step, 5-step, etc.; how do I input these parameters in the model itself before training? Thank you EDIT: Is it done by controlling the value of lambda? For instance, if lambda equals 0 it will weight more towards the 1-step return, and if lambda equals 1, it will weight more towards long-term returns? submitted by /u/Playful_Shop_8165 [link] [comments]  ( 53 min )
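    For reference: Stable Baselines3's PPO does not expose an explicit n-step return; it uses generalized advantage estimation, where gae_lambda interpolates between one-step TD targets (lambda = 0) and full Monte Carlo returns (lambda = 1), and n_steps sets the rollout length per update. A minimal sketch (hyperparameter values illustrative):

        from stable_baselines3 import PPO

        # gae_lambda trades off bias (towards 0.0, one-step TD) against
        # variance (towards 1.0, Monte Carlo); n_steps is the rollout length.
        model = PPO("MlpPolicy", "CartPole-v1", n_steps=256, gae_lambda=0.9)
        model.learn(total_timesteps=50_000)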
    "Mysteries of mode collapse due to RLHF" tuning of GPT-3, Janus (why is InstructGPT-3 so boring?)
    submitted by /u/gwern [link] [comments]  ( 51 min )
  • Open

    Want Greater Business Efficiency? Map Your Smart Data Transformation Journey
    To survive, let alone thrive, in the competitive marketplace, companies need to collect the huge volumes of data they produce and extract valuable insights from them via data analysis. Data is the lifeblood of most business decisions, functions, and processes. But the process of data collection can be highly challenging for many organizations. The post Want Greater Business Efficiency? Map Your Smart Data Transformation Journey appeared first on Data Science Central.  ( 21 min )
  • Open

    From fat droplets to floating forests: cross-domain transfer learning using a PatchGAN-based segmentation model. (arXiv:2211.03937v1 [cs.LG])
    Many scientific domains gather sufficient labels to train machine learning algorithms through human-in-the-loop techniques provided by the Zooniverse.org citizen science platform. As the range of projects, task types and data rates increases, accelerating model training is of paramount concern so that volunteer effort can be focused where it is most needed. The application of Transfer Learning (TL) between Zooniverse projects holds promise as a solution. However, understanding the effectiveness of TL approaches that pretrain on large-scale generic image sets vs. images with similar characteristics, possibly from similar tasks, is an open challenge. We apply a generative segmentation model on two Zooniverse project-based data sets: (1) to identify fat droplets in liver cells (FatChecker; FC) and (2) to identify kelp beds in satellite images (Floating Forests; FF) through transfer learning from the first project. We compare and contrast its performance with a TL model based on the COCO image set, and subsequently with baseline counterparts. We find that both the FC and COCO TL models perform better than the baseline cases when using >75% of the original training sample size. The COCO-based TL model generally performs better than the FC-based one, likely due to its generalized features. Our investigations provide important insights into usage of TL approaches on multi-domain data hosted across different Zooniverse projects, enabling future projects to accelerate task completion.  ( 3 min )
    Doubly Inhomogeneous Reinforcement Learning. (arXiv:2211.03983v1 [stat.ML])
    This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, which would result in sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the "best data chunks" that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties when compared to applying the clustering algorithm per time or the change point detection algorithm per subject. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.  ( 2 min )
    Centaur: Federated Learning for Constrained Edge Devices. (arXiv:2211.04175v1 [cs.LG])
    Federated learning (FL) on deep neural networks facilitates new applications at the edge, especially for wearable and Internet-of-Thing devices. Such devices capture a large and diverse amount of data, but they have memory, compute, power, and connectivity constraints which hinder their participation in FL. We propose Centaur, a multitier FL framework, enabling ultra-constrained devices to efficiently participate in FL on large neural nets. Centaur combines two major ideas: (i) a data selection scheme to choose a portion of samples that accelerates the learning, and (ii) a partition-based training algorithm that integrates both constrained and powerful devices owned by the same user. Evaluations, on four benchmark neural nets and three datasets, show that Centaur gains ~10% higher accuracy than local training on constrained devices with ~58% energy saving on average. Our experimental results also demonstrate the superior efficiency of Centaur when dealing with imbalanced data, client participation heterogeneity, and various network connection probabilities.  ( 2 min )
    The Hypervolume Indicator Hessian Matrix: Analytical Expression, Computational Time Complexity, and Sparsity. (arXiv:2211.04171v1 [math.OC])
    The problem of approximating the Pareto front of a multiobjective optimization problem can be reformulated as the problem of finding a set that maximizes the hypervolume indicator. This paper establishes the analytical expression of the Hessian matrix of the mapping from a (fixed size) collection of $n$ points in the $d$-dimensional decision space (or $m$-dimensional objective space) to the scalar hypervolume indicator value. To define the Hessian matrix, the input set is vectorized, and the matrix is derived by analytical differentiation of the mapping from a vectorized set to the hypervolume indicator. The Hessian matrix plays a crucial role in second-order methods, such as the Newton-Raphson optimization method, and it can be used for the verification of local optimal sets. So far, the full analytical expression was only established and analyzed for the relatively simple bi-objective case. This paper will derive the full expression for arbitrary dimensions ($m\geq2$ objective functions). For the practically important three-dimensional case, we also provide an asymptotically efficient algorithm with time complexity in $O(n\log n)$ for the exact computation of the Hessian matrix's non-zero entries. We establish a sharp bound of $12m-6$ for the number of non-zero entries. For the general $m$-dimensional case, a compact recursive analytical expression is established, and its algorithmic implementation is discussed. Moreover, for the general case, some sparsity results can be established; these results are implied by the recursive expression. To validate and illustrate the analytically derived algorithms and results, we provide a few numerical examples using Python and Mathematica implementations. Open-source implementations of the algorithms and testing data are made available as a supplement to this paper.  ( 3 min )
    Improving Graph Neural Networks at Scale: Combining Approximate PageRank and CoreRank. (arXiv:2211.04248v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved great successes in many learning tasks performed on graph structures. Nonetheless, to propagate information GNNs rely on a message passing scheme which can become prohibitively expensive when working with industrial-scale graphs. Inspired by the PPRGo model, we propose the CorePPR model, a scalable solution that utilises a learnable convex combination of the approximate personalised PageRank and the CoreRank to diffuse multi-hop neighbourhood information in GNNs. Additionally, we incorporate a dynamic mechanism to select the most influential neighbours for a particular node which reduces training time while preserving the performance of the model. Overall, we demonstrate that CorePPR outperforms PPRGo, particularly on large graphs where selecting the most influential nodes is particularly relevant for scalability. Our code is publicly available at: https://github.com/arielramos97/CorePPR.  ( 2 min )
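    As a sketch of the core combination (names and dense shapes are illustrative; in practice such diffusion matrices are sparse, and the model also selects top influential neighbours), the diffusion step could look like:

        import torch

        def combined_diffusion(ppr_weights, corerank_weights, logits, alpha):
            """Diffuse node logits with a convex combination of (approximate)
            personalised-PageRank weights and CoreRank-based weights; `alpha`
            is a learnable scalar squashed to (0, 1) so the combination stays
            convex."""
            a = torch.sigmoid(alpha)
            diffusion = a * ppr_weights + (1 - a) * corerank_weights  # (n, n)
            return diffusion @ logits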
    Causal Discovery in Linear Latent Variable Models Subject to Measurement Error. (arXiv:2211.03984v1 [cs.LG])
    We focus on causal discovery in the presence of measurement error in linear systems where the mixing matrix, i.e., the matrix indicating the independent exogenous noise terms pertaining to the observed variables, is identified up to permutation and scaling of the columns. We demonstrate a somewhat surprising connection between this problem and causal discovery in the presence of unobserved parentless causes, in the sense that there is a mapping, given by the mixing matrix, between the underlying models to be inferred in these problems. Consequently, any identifiability result based on the mixing matrix for one model translates to an identifiability result for the other model. We characterize to what extent the causal models can be identified under a two-part faithfulness assumption. Under only the first part of the assumption (corresponding to the conventional definition of faithfulness), the structure can be learned up to the causal ordering among an ordered grouping of the variables but not all the edges across the groups can be identified. We further show that if both parts of the faithfulness assumption are imposed, the structure can be learned up to a more refined ordered grouping. As a result of this refinement, for the latent variable model with unobserved parentless causes, the structure can be identified. Based on our theoretical results, we propose causal structure learning methods for both models, and evaluate their performance on synthetic data.  ( 3 min )
    Reinforcement Learning with Stepwise Fairness Constraints. (arXiv:2211.03994v1 [cs.LG])
    AI methods are used in societally important settings, ranging from credit to employment to housing, and it is crucial to provide fairness in regard to algorithmic decision making. Moreover, many settings are dynamic, with populations responding to sequential decision policies. We introduce the study of reinforcement learning (RL) with stepwise fairness constraints, requiring group fairness at each time step. Our focus is on tabular episodic RL, and we provide learning algorithms with strong theoretical guarantees in regard to policy optimality and fairness violation. Our framework provides useful tools to study the impact of fairness constraints in sequential settings and brings up new challenges in RL.  ( 2 min )
    Pretraining in Deep Reinforcement Learning: A Survey. (arXiv:2211.03959v1 [cs.LG])
    The past few years have seen rapid progress in combining reinforcement learning (RL) with deep learning. Various breakthroughs ranging from games to robotics have spurred the interest in designing sophisticated RL algorithms and systems. However, the prevailing workflow in RL is to learn tabula rasa, which may incur computational inefficiency. This precludes continuous deployment of RL algorithms and potentially excludes researchers without large-scale computing resources. In many other areas of machine learning, the pretraining paradigm has shown to be effective in acquiring transferable knowledge, which can be utilized for a variety of downstream tasks. Recently, we saw a surge of interest in Pretraining for Deep RL with promising results. However, much of the research has been based on different experimental settings. Due to the nature of RL, pretraining in this field is faced with unique challenges and hence requires new design principles. In this survey, we seek to systematically review existing works in pretraining for deep reinforcement learning, provide a taxonomy of these methods, discuss each sub-field, and bring attention to open problems and future directions.  ( 2 min )
    Efficacy of MRI data harmonization in the age of machine learning. A multicenter study across 36 datasets. (arXiv:2211.04125v1 [cs.LG])
    Pooling publicly-available MRI data from multiple sites allows to assemble extensive groups of subjects, increase statistical power, and promote data reuse with machine learning techniques. The harmonization of multicenter data is necessary to reduce the confounding effect associated with non-biological sources of variability in the data. However, when applied to the entire dataset before machine learning, the harmonization leads to data leakage, because information outside the training set may affect model building and potentially falsely overestimate performance. We propose 1) a measurement of the efficacy of data harmonization; 2) a harmonizer transformer, i.e., an implementation of ComBat harmonization that allows its encapsulation among the preprocessing steps of a machine learning pipeline, avoiding data leakage. We tested these tools using brain T1-weighted MRI data from 1740 healthy subjects acquired at 36 sites. After harmonization, the site effect was removed or reduced, and we measured the data leakage effect in predicting individual age from MRI data, highlighting that introducing the harmonizer transformer into a machine learning pipeline allows for avoiding data leakage.
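    A leakage-safe encapsulation can follow the scikit-learn fit/transform pattern: estimate site-effect parameters on the training fold only, then apply them to any fold. In the sketch below, combat_fit and combat_apply are hypothetical stand-ins for a ComBat backend, and the site label is assumed to sit in column 0 of X:

        from sklearn.base import BaseEstimator, TransformerMixin

        class HarmonizerTransformer(BaseEstimator, TransformerMixin):
            """Leakage-safe harmonization sketch: site-effect parameters are
            estimated on the training fold only (fit) and then applied to any
            fold (transform), so test data never influences the correction.
            combat_fit / combat_apply are hypothetical helpers standing in
            for whatever ComBat implementation is available."""
            def fit(self, X, y=None):
                sites, feats = X[:, 0], X[:, 1:]
                self.params_ = combat_fit(feats, sites)   # estimated on train only
                return self

            def transform(self, X):
                sites, feats = X[:, 0], X[:, 1:]
                return combat_apply(feats, sites, self.params_)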
    Online Prediction in Sub-linear Space. (arXiv:2207.07974v2 [cs.DS] UPDATED)
    We provide the first sub-linear space and sub-linear regret algorithm for online learning with expert advice (against an oblivious adversary), addressing an open question raised recently by Srinivas, Woodruff, Xu and Zhou (STOC 2022). We also demonstrate a separation between oblivious and (strong) adaptive adversaries by proving a linear memory lower bound of any sub-linear regret algorithm against an adaptive adversary. Our algorithm is based on a novel pool selection procedure that bypasses the traditional wisdom of leader selection for online learning, and a generic reduction that transforms any weakly sub-linear regret $o(T)$ algorithm to $T^{1-\alpha}$ regret algorithm, which may be of independent interest. Our lower bound utilizes the connection of no-regret learning and equilibrium computation in zero-sum games, leading to a proof of a strong lower bound against an adaptive adversary.
    Linear Self-Attention Approximation via Trainable Feedforward Kernel. (arXiv:2211.04076v1 [cs.LG])
    In pursuit of faster computation, Efficient Transformers demonstrate an impressive variety of approaches -- models attaining sub-quadratic attention complexity can utilize a notion of sparsity or a low-rank approximation of inputs to reduce the number of attended keys; other ways to reduce complexity include locality-sensitive hashing, key pooling, additional memory to store information in compacted form, or hybridization with other architectures, such as CNNs. Often based on a strong mathematical basis, kernelized approaches allow for the approximation of attention with linear complexity while retaining high accuracy. Therefore, in the present paper, we aim to expand the idea of trainable kernel methods to approximate the self-attention mechanism of the Transformer architecture.  ( 2 min )
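    A minimal sketch of the general recipe (a trainable feedforward feature map phi with an ELU+1 positivity trick, then the associativity trick phi(Q)(phi(K)^T V) for O(N) cost in sequence length; the paper's specific kernel parametrization may differ):

        import torch
        import torch.nn as nn

        class FeedforwardKernelAttention(nn.Module):
            """Linear-complexity attention: map queries/keys through a small
            trainable feedforward feature map, then compute attention via
            phi(Q) (phi(K)^T V), avoiding the N x N attention matrix."""
            def __init__(self, dim, feat_dim=64):
                super().__init__()
                self.phi = nn.Sequential(nn.Linear(dim, feat_dim), nn.ELU())

            def forward(self, q, k, v):               # shapes: (batch, seq, dim)
                q_f = self.phi(q) + 1.0               # +1 keeps features positive
                k_f = self.phi(k) + 1.0
                kv = torch.einsum('bsf,bsd->bfd', k_f, v)         # sum over sequence
                z = 1.0 / (torch.einsum('bsf,bf->bs', q_f, k_f.sum(dim=1)) + 1e-6)
                return torch.einsum('bsf,bfd,bs->bsd', q_f, kv, z)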
    Adaptive Semantic Communications: Overfitting the Source and Channel for Profit. (arXiv:2211.04339v1 [cs.IT])
    Most semantic communication systems leverage deep learning models to provide end-to-end transmission performance surpassing the established source and channel coding approaches. So far, research has mainly focused on architecture and model improvements, but a model trained over a full dataset and ergodic channel responses is unlikely to be optimal for every test instance. Due to limitations on the model capacity and imperfect optimization and generalization, such learned models will be suboptimal, especially when the testing data distribution or channel response differs from that in the training phase, as is likely to be the case in practice. To tackle this, in this paper, we propose a novel semantic communication paradigm by leveraging the deep learning model's overfitting property. Our model can, for instance, be updated after deployment, which can further lead to substantial gains in terms of the transmission rate-distortion (RD) performance. This new system is named adaptive semantic communication (ASC). In our ASC system, the ingredients of the wireless transmitted stream include both the semantic representations of source data and the adapted decoder model parameters. Specifically, we take the overfitting concept to the extreme, proposing a series of ingenious methods to adapt the semantic codec or representations to an individual data or channel state instance. The whole ASC system design is formulated as an optimization problem whose goal is to minimize the loss function, which is a tripartite tradeoff among the data rate, model rate, and distortion terms. The experiments (including a user study) verify the effectiveness and efficiency of our ASC system. Notably, the substantial gain of our overfitted coding paradigm can catalyze semantic communication upgrading to a new era.
    A Novel Multi-Layer Modular Approach for Real-Time Gravitational-Wave Detection. (arXiv:2206.06004v2 [gr-qc] UPDATED)
    Advanced LIGO and Advanced Virgo ground-based interferometers are poised to probe an unprecedentedly large volume of space, extending the discovery power of the observations even to new sources of gravitational-wave emission. In this scenario, the development of highly optimized gravitational-wave detection algorithms is crucial. We propose a novel layered framework for real-time detection of gravitational waves inspired by speech processing techniques and, in the present implementation, based on a state-of-the-art machine learning approach involving a hybridization of genetic programming and neural networks. The key aspects of the newly proposed framework are the well-structured, layered approach and the low computational complexity. The paper describes the basic concepts of the framework and the derivation of the first three layers. Although, in the present implementation, the layers are based on models derived using a machine learning approach, the proposed layered structure is universal in nature. To train and test the models, we used simulated binary black hole gravitational waveforms in synthetic Gaussian noise representative of the Advanced LIGO design sensitivity. Compared to more complex approaches, such as convolutional neural networks, our framework, even using the simple ground model described in the paper, has only slightly lower performance, but with a much lower computational complexity and a higher degree of modularity. Furthermore, the underlying exploitation of short-term features makes the results of the new framework virtually independent of the time position of gravitational-wave signals, simplifying its future exploitation in real-time multi-layer pipelines for gravitational-wave detection with second-generation interferometers.
    Efficient probabilistic reconciliation of forecasts for real-valued and count time series. (arXiv:2210.02286v2 [stat.ML] UPDATED)
    Hierarchical time series are common in several applied fields. Forecasts are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.
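    A minimal sketch of the importance-sampling idea on a toy two-series hierarchy follows: bottom-level samples are weighted by how plausible their aggregate is under the (generally incoherent) base forecast of the upper level, then resampled. The Gaussian base forecasts and the resampling step are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Toy hierarchy: total = b1 + b2. Base forecasts are assumed Gaussian
# (illustrative choices, not the paper's experimental setup).
b1 = stats.norm(10, 2)   # bottom series 1
b2 = stats.norm(5, 1)    # bottom series 2
tot = stats.norm(18, 2)  # (incoherent) base forecast for the total

# Importance sampling: draw bottom-level samples, weight them by how
# plausible their sum is under the base forecast of the upper level.
n = 100_000
s = np.column_stack([b1.rvs(n, random_state=rng), b2.rvs(n, random_state=rng)])
w = tot.pdf(s.sum(axis=1))
w /= w.sum()

# Resample to obtain (approximate) draws from the reconciled distribution.
idx = rng.choice(n, size=n, p=w)
reconciled = s[idx]
print("reconciled bottom means:", reconciled.mean(axis=0))
print("reconciled total mean:", reconciled.sum(axis=1).mean())
```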
    Ensemble Consensus-based Representation Deep Reinforcement Learning for Hybrid FSO/RF Communication Systems. (arXiv:2108.02551v2 [cs.LG] UPDATED)
    A hybrid FSO/RF system requires an efficient FSO and RF link switching mechanism to improve the system capacity by realizing the complementary benefits of both links. The dynamics of network conditions, such as fog, dust, and sand storms, compound the link switching problem and control complexity. To address this problem, we initiate the study of deep reinforcement learning (DRL) for link switching in hybrid FSO/RF systems. Specifically, in this work, we focus on an actor-critic approach, called Actor/Critic-FSO/RF, and a deep Q-network (DQN), called DQN-FSO/RF, for FSO/RF link switching under atmospheric turbulence. To formulate the problem, we define the state, action, and reward function of a hybrid FSO/RF system. DQN-FSO/RF frequently updates the deployed policy that interacts with the environment in a hybrid FSO/RF system, resulting in high switching costs. To overcome this, we lift this problem to ensemble consensus-based representation learning for deep reinforcement learning, called DQNEnsemble-FSO/RF. The proposed DQNEnsemble-FSO/RF DRL approach uses consensus-learned feature representations, based on an ensemble of asynchronous threads, to update the deployed policy. Experimental results corroborate that the proposed DQNEnsemble-FSO/RF's consensus-learned feature switching achieves better performance than Actor/Critic-FSO/RF, DQN-FSO/RF, and MyOpic for FSO/RF link switching while keeping the switching cost significantly low.
    Unsupervised Reward Shaping for a Robotic Sequential Picking Task from Visual Observations in a Logistics Scenario. (arXiv:2209.12350v2 [cs.RO] UPDATED)
    We focus on an unloading problem, typical of the logistics sector, modeled as a sequential pick-and-place task. In this type of task, modern machine learning techniques have been shown to work better than classic systems, since they are more adaptable to stochasticity and better able to cope with large uncertainties. More specifically, supervised and imitation learning have achieved outstanding results in this regard, with the shortcoming of requiring some form of supervision, which is not always obtainable in all settings. On the other hand, reinforcement learning (RL) requires a much milder form of supervision but remains impractical due to its inefficiency. In this paper, we propose and theoretically motivate a novel Unsupervised Reward Shaping algorithm from expert observations, which relaxes the level of supervision required by the agent and improves RL performance on our task.
    State Advantage Weighting for Offline RL. (arXiv:2210.04251v2 [cs.LG] UPDATED)
    We present state advantage weighting for offline reinforcement learning (RL). In contrast to the action advantage $A(s,a)$ commonly adopted in QSA learning, we leverage the state advantage $A(s,s^\prime)$ and QSS learning for offline RL, hence decoupling actions from values. We expect the agent to reach high-reward states, with the action determined by how the agent can get to the corresponding state. Experiments on D4RL datasets show that our proposed method achieves remarkable performance against common baselines. Furthermore, our method shows good generalization capability when transferring from offline to online.
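    The weighting idea can be sketched in a few lines: transitions whose state advantage $A(s,s^\prime) = Q(s,s^\prime) - V(s)$ is large receive exponentially larger weight when extracting the policy. The temperature, the clipping, and the stand-in regression loss below are illustrative assumptions, not the paper's exact objective.

```python
import torch

def state_advantage_weights(q_ss, v_s, beta=1.0, w_max=20.0):
    """Weights exp(A(s, s')/beta) with A(s, s') = Q(s, s') - V(s).
    q_ss, v_s: tensors of shape (batch,). Clipping is an assumption."""
    adv = q_ss - v_s
    return torch.clamp(torch.exp(adv / beta), max=w_max)

# Hypothetical use inside a policy-extraction step: weighted regression
# towards the dataset transition (the loss below is a stand-in).
q_ss = torch.randn(256)
v_s = torch.randn(256)
w = state_advantage_weights(q_ss, v_s)
per_sample_bc_loss = torch.randn(256).abs()  # stand-in for ||pi(s) - a||^2
loss = (w.detach() * per_sample_bc_loss).mean()
print(loss)
```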
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v2 [cs.LG] UPDATED)
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.
    Concise and interpretable multi-label rule sets. (arXiv:2210.01533v2 [cs.LG] UPDATED)
    Multi-label classification is becoming increasingly ubiquitous, but not much attention has been paid to interpretability. In this paper, we develop a multi-label classifier that can be represented as a concise set of simple "if-then" rules and thus offers better interpretability compared to black-box models. Notably, our method is able to find a small set of relevant patterns that lead to accurate multi-label classification, while existing rule-based classifiers are myopic and wasteful in searching rules, requiring a large number of rules to achieve high accuracy. In particular, we formulate the problem of choosing multi-label rules to maximize a target function, which considers not only discrimination ability with respect to labels, but also diversity. Accounting for diversity helps to avoid redundancy and thus to control the number of rules in the solution set. To tackle the said maximization problem we propose a 2-approximation algorithm, which relies on a novel technique to sample high-quality rules. In addition to our theoretical analysis, we provide a thorough experimental evaluation, which indicates that our approach offers a trade-off between predictive performance and interpretability that is unmatched in previous work.
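    A hedged sketch of the selection step follows: rules are picked greedily by the marginal gain of an objective that rewards discrimination and penalizes redundancy. The toy quality and redundancy functions are illustrative stand-ins for the paper's objective, and the paper's 2-approximation additionally relies on its rule-sampling technique.

```python
import numpy as np

def greedy_rule_selection(rules, k, quality, redundancy):
    """Greedily pick up to k rules maximizing quality minus redundancy.
    `quality(r)` and `redundancy(r1, r2)` are illustrative stand-ins for
    the paper's discrimination and diversity terms."""
    chosen = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for r in rules:
            if r in chosen:
                continue
            gain = quality(r) - sum(redundancy(r, c) for c in chosen)
            if gain > best_gain:
                best, best_gain = r, gain
        if best is None or best_gain <= 0:
            break  # no remaining rule improves the objective
        chosen.append(best)
    return chosen

# Toy rules: each covers a set of examples; quality = coverage size,
# redundancy = overlap between covered sets.
rules = [frozenset({1, 2, 3}), frozenset({3, 4}), frozenset({1, 2}), frozenset({5})]
picked = greedy_rule_selection(
    rules, k=3,
    quality=len,
    redundancy=lambda a, b: len(a & b),
)
print(picked)
```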
    Generating counterfactual explanations of tumor spatial proteomes to discover effective, combinatorial therapies that enhance cancer immunotherapy. (arXiv:2211.04020v1 [q-bio.QM])
    Recent advances in spatial omics methods enable the molecular composition of human tumors to be imaged at micron-scale resolution across hundreds of patients and tens to thousands of molecular imaging channels. Large-scale molecular imaging datasets offer a new opportunity to understand how the spatial organization of proteins and cell types within a tumor modulates a patient's response to different therapeutic strategies, and offer potential insights into the design of novel therapies that increase patient response. However, spatial omics datasets require computational analysis methods that can scale to hundreds or thousands of imaging channels (i.e., colors) while enabling the extraction of molecular patterns that correlate with treatment responses across large numbers of patients with potentially heterogeneous tumor presentations. Here, we develop a machine learning strategy for the identification and design of signaling molecule combinations that predict the degree of immune system engagement with a specific patient's tumor. We first train a classifier to predict T cell distribution in patient tumors using images from 30-40 molecular imaging channels. Second, we apply a gradient descent based counterfactual reasoning strategy to the classifier and discover combinations of signaling molecules predicted to increase T cell infiltration. Applied to spatial proteomics data of melanoma tumors, our model predicts that increasing the levels of CXCL9, CXCL10, CXCL12, and CCL19 and decreasing the level of CCL8 in melanoma tumors will increase T cell infiltration 10-fold across a cohort of 69 patients. The model predicts that the combination is many-fold more effective than single-target perturbations. Our work provides a paradigm for machine learning based prediction and design of cancer therapeutics based on classification of immune system activity in spatial omics data.
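    The counterfactual step can be sketched as plain gradient ascent on the classifier's prediction with respect to a perturbation of the input channels, with a sparsity penalty so that few signaling molecules are touched. The stand-in classifier, the tensor shapes, and the penalty weight below are illustrative assumptions, not the trained model from the paper.

```python
import torch

# Hedged sketch: given a trained classifier f mapping per-cell signaling
# channels to predicted T cell infiltration, ascend the prediction with
# respect to a shared perturbation delta on the input channels.
channels, cells = 36, 500
f = torch.nn.Sequential(torch.nn.Linear(channels, 64), torch.nn.ReLU(),
                        torch.nn.Linear(64, 1))  # stand-in model

x = torch.rand(cells, channels)                  # observed tumor composition
delta = torch.zeros(1, channels, requires_grad=True)
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(200):
    opt.zero_grad()
    pred = f(x + delta).mean()                   # predicted infiltration
    # Maximize the prediction while keeping the perturbation sparse,
    # so the counterfactual touches few signaling molecules.
    loss = -pred + 0.1 * delta.abs().sum()
    loss.backward()
    opt.step()

print("suggested channel shifts:", delta.detach().squeeze())
```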
    Synthesis of separation processes with reinforcement learning. (arXiv:2211.04327v1 [cs.LG])
    This paper shows the implementation of reinforcement learning (RL) in the commercial flowsheet simulator software Aspen Plus V12 for designing and optimising a distillation sequence. The aim of the SAC agent was to separate a hydrocarbon mixture into its individual components by distillation, while maximising the profit produced by the distillation sequence. All actions were set by the SAC agent in Python and communicated to Aspen Plus via an API, where the distillation column was simulated using the built-in RADFRAC column. With this connection established for data transfer between Python and Aspen, the agent succeeded in showing learning behaviour while increasing profit. Although results were generated, the use of Aspen was slow (190 hours) and Aspen was found unsuitable for parallelisation, which makes it poorly suited to solving RL problems. Code and thesis are available at https://github.com/lollcat/Aspen-RL
    DiffPhase: Generative Diffusion-based STFT Phase Retrieval. (arXiv:2211.04332v1 [eess.AS])
    Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis. As a generative approach, diffusion models have been shown to be especially suitable for imputation problems, where missing data is generated based on existing data. Phase retrieval is inherently an imputation problem, where phase information has to be generated based on the given magnitude. In this work we build upon previous work in the speech domain, adapting a speech enhancement diffusion model specifically for STFT phase retrieval. Evaluation using speech quality and intelligibility metrics shows the diffusion approach is well-suited to the phase retrieval task, with performance surpassing both classical and modern methods.
    Causal Fourier Analysis on Directed Acyclic Graphs and Posets. (arXiv:2209.07970v2 [eess.SP] UPDATED)
    We present a novel form of Fourier analysis, and associated signal processing concepts, for signals (or data) indexed by edge-weighted directed acyclic graphs (DAGs). This means that our Fourier basis yields an eigendecomposition of a suitable notion of shift and convolution operators that we define. DAGs are the common model to capture causal relationships between data values and in this case our proposed Fourier analysis relates data with its causes under a linearity assumption that we define. The definition of the Fourier transform requires the transitive closure of the weighted DAG for which several forms are possible depending on the interpretation of the edge weights. Examples include level of influence, distance, or pollution distribution. Our framework is different from prior GSP: it is specific to DAGs and leverages, and extends, the classical theory of Moebius inversion from combinatorics. For a prototypical application we consider DAGs modeling dynamic networks in which edges change over time. Specifically, we model the spread of an infection on such a DAG obtained from real-world contact tracing data and learn the infection signal from samples assuming sparsity in the Fourier domain.
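    A minimal numerical sketch of the Moebius-inversion backbone on a toy DAG: the transitive closure plays the role of a zeta-like transform relating a hidden "cause" signal to the observed signal, and inverting it recovers the causes. The unit edge weights and the convention that observations accumulate ancestral causes are illustrative assumptions, not the paper's exact definitions.

```python
import numpy as np

n = 4
A = np.zeros((n, n))        # adjacency: edge i -> j means i influences j
A[0, 1] = A[1, 2] = A[0, 3] = 1.0

# Transitive closure including self-loops: W = sum_k A^k (A is nilpotent
# for a DAG, so the series terminates and W is invertible).
W = np.eye(n)
P = np.eye(n)
for _ in range(n):
    P = P @ A
    W += P

c = np.array([1.0, 0.0, 2.0, 0.0])  # hidden cause signal on the nodes
s = W.T @ c                          # each node accumulates ancestral causes
c_rec = np.linalg.solve(W.T, s)      # Moebius-type inversion
print(np.allclose(c, c_rec))         # True
```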
    The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization. (arXiv:2206.02768v2 [stat.ML] UPDATED)
    The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.
    Learning Causal Representations of Single Cells via Sparse Mechanism Shift Modeling. (arXiv:2211.03553v2 [q-bio.GN] UPDATED)
    Latent variable models such as the Variational Auto-Encoder (VAE) have become a go-to tool for analyzing biological data, especially in the field of single-cell genomics. One remaining challenge is the interpretability of latent variables as biological processes that define a cell's identity. Outside of biological applications, this problem is commonly referred to as learning disentangled representations. Although several disentanglement-promoting variants of the VAE have been introduced and applied to single-cell genomics data, this task has been shown to be infeasible from independent and identically distributed measurements, without additional structure. Instead, recent methods propose to leverage non-stationary data, as well as the sparse mechanism shift assumption, in order to learn disentangled representations with a causal semantic. Here, we extend the application of these methodological advances to the analysis of single-cell genomics data with genetic or chemical perturbations. More precisely, we propose a deep generative model of single-cell gene expression data for which each perturbation is treated as a stochastic intervention targeting an unknown, but sparse, subset of latent variables. We benchmark these methods on simulated single-cell data to evaluate their performance at latent unit recovery, causal target identification, and out-of-domain generalization. Finally, we apply these approaches to two real-world large-scale gene perturbation data sets and find that models that exploit the sparse mechanism shift hypothesis surpass contemporary methods on a transfer learning task. We implement our new model and benchmarks using the scvi-tools library, and release it as open-source software at \url{https://github.com/Genentech/sVAE}.
    Over-The-Air Clustered Wireless Federated Learning. (arXiv:2211.03363v2 [cs.LG] UPDATED)
    Privacy, security, and bandwidth constraints have led to federated learning (FL) in wireless systems, where training a machine learning (ML) model is accomplished collaboratively without sharing raw data. Often, such collaborative FL strategies necessitate model aggregation at a server. On the other hand, decentralized FL necessitates that participating clients reach a consensus ML model by exchanging parameter updates. In this work, we propose the over-the-air clustered wireless FL (CWFL) strategy, which eliminates the need for a strong central server and yet achieves an accuracy similar to the server-based strategy, while using fewer channel uses compared to decentralized FL. We theoretically show that the convergence rate of CWFL per cluster is O(1/T) while mitigating the impact of noise. Using the MNIST and CIFAR datasets, we demonstrate the accuracy performance of CWFL for different numbers of clusters across communication rounds.
    Physics-Constrained Backdoor Attacks on Power System Fault Localization. (arXiv:2211.04445v1 [cs.CR])
    The advances in deep learning (DL) techniques have the potential to deliver transformative technological breakthroughs to numerous complex tasks in modern power systems that suffer from increasing uncertainty and nonlinearity. However, the vulnerability of DL has yet to be thoroughly explored in power system tasks under various physical constraints. This work, for the first time, proposes a novel physics-constrained backdoor poisoning attack, which embeds the undetectable attack signal into the learned model and only performs the attack when it encounters the corresponding signal. The paper illustrates the proposed attack on the real-time fault line localization application. Furthermore, the simulation results on the 68-bus power system demonstrate that DL-based fault line localization methods are not robust to our proposed attack, indicating that backdoor poisoning attacks pose real threats to DL implementations in power systems. The proposed attack pipeline can be easily generalized to other power system tasks.
    Dynamics of (mis)information flow and engaging power of narratives. (arXiv:2207.12264v2 [physics.soc-ph] UPDATED)
    The debate around misinformation and its potentially detrimental effects on public opinion is complex and multifaceted, to the extent that even the relevant academic research has not found unanimity on the prevalence and consumption of misinformation compared with mainstream content. The methodological framework presented here emphasises the importance of considering data representative of the complexity of the phenomenon and metrics that control for possible scale effects. By combining statistical, econometric and machine learning models, we shed light on the real impact of misinformation about a subject of general interest and social relevance, such as vaccines, on both the information available to citizens and their news diet. Our results show the prominent role achieved by misinformation sources in the news ecosystem, but also - and above all - the inability of mainstream media to drive the public debate over time on issues that are particularly sensitive and emotional. Properly accounting for the temporal dynamics of public debate seems crucial to prevent the latter from moving into uncontrolled spaces where false narratives are more easily conveyed and entrenched.
    ABC: Adversarial Behavioral Cloning for Offline Mode-Seeking Imitation Learning. (arXiv:2211.04005v1 [cs.LG])
    Given a dataset of expert agent interactions with an environment of interest, a viable method to extract an effective agent policy is to estimate the maximum likelihood policy indicated by this data. This approach is commonly referred to as behavioral cloning (BC). In this work, we describe a key disadvantage of BC that arises due to the maximum likelihood objective function; namely that BC is mean-seeking with respect to the state-conditional expert action distribution when the learner's policy is represented with a Gaussian. To address this issue, we introduce a modified version of BC, Adversarial Behavioral Cloning (ABC), that exhibits mode-seeking behavior by incorporating elements of GAN (generative adversarial network) training. We evaluate ABC on toy domains and a domain based on Hopper from the DeepMind Control suite, and show that it outperforms standard BC by being mode-seeking in nature.  ( 2 min )
    A Neural Network Subgrid Model of the Early Stages of Planet Formation. (arXiv:2211.04160v1 [astro-ph.EP])
    Planet formation is a multi-scale process in which the coagulation of $\mathrm{\mu m}$-sized dust grains in protoplanetary disks is strongly influenced by the hydrodynamic processes on scales of astronomical units ($\approx 1.5\times 10^8 \,\mathrm{km}$). Studies are therefore dependent on subgrid models to emulate the micro physics of dust coagulation on top of a large scale hydrodynamic simulation. Numerical simulations which include the relevant physical effects are complex and computationally expensive. Here, we present a fast and accurate learned effective model for dust coagulation, trained on data from high resolution numerical coagulation simulations. Our model captures details of the dust coagulation process that were so far not tractable with other dust coagulation prescriptions with similar computational efficiency.  ( 2 min )
    An Incremental Phase Mapping Approach for X-ray Diffraction Patterns using Binary Peak Representations. (arXiv:2211.04011v1 [cs.LG])
    Despite the huge advancement in knowledge discovery and data mining techniques, the X-ray diffraction (XRD) analysis process has largely remained untouched and still involves manual investigation, comparison, and verification. Due to the large volume of XRD samples from high-throughput XRD experiments, it has become impossible for domain scientists to process them manually. Recently, they have started leveraging standard clustering techniques to reduce the XRD pattern representations requiring manual efforts for labeling and verification. Nevertheless, these standard clustering techniques do not handle problem-specific aspects such as peak shifting, adjacent peaks, background noise, and mixed phases; hence, they result in incorrect composition-phase diagrams that complicate further steps. Here, we leverage data mining techniques along with domain expertise to handle these issues. In this paper, we introduce an incremental phase mapping approach based on binary peak representations using a new threshold-based fuzzy dissimilarity measure. The proposed approach first applies an incremental phase computation algorithm to discrete binary peak representations of XRD samples, followed by hierarchical clustering or manual merging of similar pure phases to obtain the final composition-phase diagram. We evaluate our method on the composition space of two ternary alloy systems, Co-Ni-Ta and Co-Ti-Ta. Our results are verified by domain scientists and closely resemble the manually computed ground-truth composition-phase diagrams. The proposed approach takes us closer towards achieving the goal of complete end-to-end automated XRD analysis.  ( 3 min )
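    The flavor of a threshold-based fuzzy dissimilarity between binary peak representations can be sketched as follows: a peak counts as matched if the other pattern has a peak within a small shift tolerance. The tolerance window and the symmetrization are illustrative assumptions; the paper's measure may differ in detail.

```python
import numpy as np

def fuzzy_peak_dissimilarity(p, q, tol=2):
    """Hedged sketch: a peak in `p` is 'matched' if `q` has a peak within
    `tol` positions (tolerating small peak shifts), and vice versa."""
    def match_fraction(a, b):
        peaks = np.flatnonzero(a)
        if len(peaks) == 0:
            return 1.0
        hits = sum(b[max(0, i - tol): i + tol + 1].any() for i in peaks)
        return hits / len(peaks)
    # Symmetrize and convert the match rate to a dissimilarity in [0, 1].
    return 1.0 - 0.5 * (match_fraction(p, q) + match_fraction(q, p))

p = np.zeros(50, dtype=bool); p[[5, 20, 40]] = True
q = np.zeros(50, dtype=bool); q[[6, 21, 35]] = True   # shifted / extra peaks
print(fuzzy_peak_dissimilarity(p, q, tol=2))
```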
    CLEAR: Generative Counterfactual Explanations on Graphs. (arXiv:2210.08443v2 [cs.LG] UPDATED)
    Counterfactual explanations promote explainability in machine learning models by answering the question "how should an input instance be perturbed to obtain a desired predicted label?". The comparison of this instance before and after perturbation can enhance human interpretation. Most existing studies on counterfactual explanations are limited in tabular data or image data. In this work, we study the problem of counterfactual explanation generation on graphs. A few studies have explored counterfactual explanations on graphs, but many challenges of this problem are still not well-addressed: 1) optimizing in the discrete and disorganized space of graphs; 2) generalizing on unseen graphs; and 3) maintaining the causality in the generated counterfactuals without prior knowledge of the causal model. To tackle these challenges, we propose a novel framework CLEAR which aims to generate counterfactual explanations on graphs for graph-level prediction models. Specifically, CLEAR leverages a graph variational autoencoder based mechanism to facilitate its optimization and generalization, and promotes causality by leveraging an auxiliary variable to better identify the underlying causal model. Extensive experiments on both synthetic and real-world graphs validate the superiority of CLEAR over the state-of-the-art methods in different aspects.
    A Scalable and Extensible Approach to Benchmarking NL2Code for 18 Programming Languages. (arXiv:2208.08227v3 [cs.LG] UPDATED)
    Large language models have demonstrated the ability to condition on and generate both natural language and programming language text. Such models open up the possibility of multi-language code generation: could code generation models generalize knowledge from one language to another? Although contemporary code generation models can generate semantically correct Python code, little is known about their abilities with other languages. We facilitate the exploration of this topic by proposing MultiPL-E, the first multi-language parallel benchmark for natural-language-to-code generation. MultiPL-E extends the HumanEval benchmark (Chen et al., 2021) to support 18 more programming languages, encompassing a range of programming paradigms and popularity. We evaluate two state-of-the-art code generation models on MultiPL-E: Codex and InCoder. We find that on several languages, Codex matches and even exceeds its performance on Python. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible. We describe a general approach for easily adding support for new benchmarks and languages to MultiPL-E.
    A jet tagging algorithm of graph network with HaarPooling message passing. (arXiv:2210.13869v2 [hep-ex] CROSS LISTED)
    Recently, graph neural network (GNN) methods have been applied to problems in high energy physics (HEP) and have shown great potential for quark-gluon tagging with graph representations of jet events. In this paper, we introduce an approach that combines GNNs with a HaarPooling operation to analyze the events, called HaarPooling Message Passing neural network (HMPNet). In HMPNet, HaarPooling not only extracts graph features, but also embeds additional information obtained by k-means clustering of different particle observables. We construct HaarPooling from three different observables: absolute energy $\log E$, transverse momentum $\log p_T$, and relative coordinates $(\Delta\eta,\Delta\phi)$, then discuss their impact on the tagging and compare the results with those obtained via MPNN and ParticleNet (PN). The results show that an appropriate selection of information for HaarPooling enhances the accuracy of quark-gluon tagging: adding the extra $\log p_T$ information to HMPNet outperforms all the others, whereas adding the relative coordinate information $(\Delta\eta,\Delta\phi)$ is not very beneficial.
    Occlusion-Aware Crowd Navigation Using People as Sensors. (arXiv:2210.00552v2 [cs.RO] UPDATED)
    Autonomous navigation in crowded spaces poses a challenge for mobile robots due to the highly dynamic, partially observable environment. Occlusions are highly prevalent in such settings due to a limited sensor field of view and obstructing human agents. Previous work has shown that observed interactive behaviors of human agents can be used to estimate potential obstacles despite occlusions. We propose integrating such social inference techniques into the planning pipeline. We use a variational autoencoder with a specially designed loss function to learn representations that are meaningful for occlusion inference. This work adopts a deep reinforcement learning approach to incorporate the learned representation for occlusion-aware planning. In simulation, our occlusion-aware policy achieves comparable collision avoidance performance to fully observable navigation by estimating agents in occluded spaces. We demonstrate successful policy transfer from simulation to the real-world Turtlebot 2i. To the best of our knowledge, this work is the first to use social occlusion inference for crowd navigation.
    Adaptive Data Depth via Multi-Armed Bandits. (arXiv:2211.03985v1 [stat.CO])
    Data depth, introduced by Tukey (1975), is an important tool in data science, robust statistics, and computational geometry. One chief barrier to its broader practical utility is that many common measures of depth are computationally intensive, requiring on the order of $n^d$ operations to exactly compute the depth of a single point within a data set of $n$ points in $d$-dimensional space. Often however, we are not directly interested in the absolute depths of the points, but rather in their \textit{relative ordering}. For example, we may want to find the most central point in a data set (a generalized median), or to identify and remove all outliers (points on the fringe of the data set with low depth). With this observation, we develop a novel and instance-adaptive algorithm for adaptive data depth computation by reducing the problem of exactly computing $n$ depths to an $n$-armed stochastic multi-armed bandit problem which we can efficiently solve. We focus our exposition on simplicial depth, developed by \citet{liu1990notion}, which has emerged as a promising notion of depth due to its interpretability and asymptotic properties. We provide general instance-dependent theoretical guarantees for our proposed algorithms, which readily extend to many other common measures of data depth including majority depth, Oja depth, and likelihood depth. When specialized to the case where the gaps in the data follow a power law distribution with parameter $\alpha<2$, we show that we can reduce the complexity of identifying the deepest point in the data set (the simplicial median) from $O(n^d)$ to $\tilde{O}(n^{d-(d-1)\alpha/2})$, where $\tilde{O}$ suppresses logarithmic factors. We corroborate our theoretical results with numerical experiments on synthetic data, showing the practical utility of our proposed methods.
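    A hedged sketch of the bandit reduction in 2-D: each point's simplicial depth is an arm whose pulls are random triangles that do or do not contain the point, and arms are successively eliminated by confidence bounds. The confidence radius and elimination schedule below are illustrative, not the paper's instance-adaptive algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_triangle(p, a, b, c):
    # Sign-based point-in-triangle test in 2D.
    def side(u, v, w):
        return (v[0] - u[0]) * (w[1] - u[1]) - (v[1] - u[1]) * (w[0] - u[0])
    d1, d2, d3 = side(a, b, p), side(b, c, p), side(c, a, p)
    has_neg = (d1 < 0) or (d2 < 0) or (d3 < 0)
    has_pos = (d1 > 0) or (d2 > 0) or (d3 > 0)
    return not (has_neg and has_pos)

X = rng.normal(size=(60, 2))
alive = list(range(len(X)))
pulls = np.zeros(len(X))
wins = np.zeros(len(X))
for rnd in range(1, 201):
    for i in alive:
        a, b, c = X[rng.choice(len(X), size=3, replace=False)]
        wins[i] += in_triangle(X[i], a, b, c)
        pulls[i] += 1
    mu = wins[alive] / pulls[alive]
    rad = np.sqrt(np.log(1 + rnd) / rnd)   # shared radius: equal pull counts
    alive = [i for i, m in zip(alive, mu) if m + rad >= mu.max() - rad]
    if len(alive) == 1:
        break

best = alive[int(np.argmax(wins[alive] / pulls[alive]))]
print("estimated simplicial median:", X[best])
```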
    Simulator-based explanation and debugging of hazard-triggering events in DNN-based safety-critical systems. (arXiv:2204.00480v4 [cs.SE] UPDATED)
    When Deep Neural Networks (DNNs) are used in safety-critical systems, engineers should determine the safety risks associated with failures (i.e., erroneous outputs) observed during testing. For DNNs processing images, engineers visually inspect all failure-inducing images to determine common characteristics among them. Such characteristics correspond to hazard-triggering events (e.g., low illumination) that are essential inputs for safety analysis. Though informative, such activity is expensive and error-prone. To support such safety analysis practices, we propose SEDE, a technique that generates readable descriptions for commonalities in failure-inducing, real-world images and improves the DNN through effective retraining. SEDE leverages the availability of simulators, which are commonly used for cyber-physical systems. It relies on genetic algorithms to drive simulators towards the generation of images that are similar to failure-inducing, real-world images in the test set; it then employs rule learning algorithms to derive expressions that capture commonalities in terms of simulator parameter values. The derived expressions are then used to generate additional images to retrain and improve the DNN. With DNNs performing in-car sensing tasks, SEDE successfully characterized hazard-triggering events leading to a DNN accuracy drop. Also, SEDE enabled retraining leading to significant improvements in DNN accuracy, up to 18 percentage points.
    Fashion-Specific Attributes Interpretation via Dual Gaussian Visual-Semantic Embedding. (arXiv:2210.17417v2 [cs.CV] UPDATED)
    Several techniques to map various types of components, such as words, attributes, and images, into an embedded space have been studied. Most of them estimate the embedded representation of a target entity as a point in the projective space. Some models, such as Word2Gauss, assume a probability distribution behind the embedded representation, which enables the spread or variance of the meaning of embedded target components to be captured and considered in more detail. We examine the method of estimating embedded representations as probability distributions for the interpretation of fashion-specific abstract and difficult-to-understand terms. Terms such as "casual," "adult-casual," "beauty-casual," and "formal" are extremely subjective and abstract and are difficult for both experts and non-experts to understand, which discourages users from trying new fashion. We propose an end-to-end model called dual Gaussian visual-semantic embedding, which maps images and attributes into the same projective space and enables the interpretation of the meaning of these terms through its broad applications. We demonstrate the effectiveness of the proposed method through multifaceted experiments involving image and attribute mapping, image retrieval and re-ordering techniques, and a detailed theoretical/analytical discussion of the distance measure included in the loss function.
    Sample Complexity of Forecast Aggregation. (arXiv:2207.13126v2 [cs.LG] UPDATED)
    We consider a Bayesian forecast aggregation model where $n$ experts, after observing private signals about an unknown binary event, report their posterior beliefs about the event to a principal, who then aggregates the reports into a single prediction for the event. The signals of the experts and the outcome of the event follow a joint distribution that is unknown to the principal, but the principal has access to i.i.d. "samples" from the distribution, where each sample is a tuple of experts' reports (not signals) and the realization of the event. Using these samples, the principal aims to find an $\varepsilon$-approximately optimal aggregator, where optimality is measured in terms of the expected squared distance between the aggregated prediction and the realization of the event. We show that the sample complexity of this problem is at least $\tilde \Omega(m^{n-2} / \varepsilon)$ for arbitrary discrete distributions, where $m$ is the size of each expert's signal space. This sample complexity grows exponentially in the number of experts $n$. But, if experts' signals are independent conditioned on the realization of the event, then the sample complexity is significantly reduced, to $\tilde O(1 / \varepsilon^2)$, which does not depend on $n$. Finally, we generalize our model to non-binary events and obtain sample complexity bounds that depend on the event space size.
    Complex-to-Real Random Features for Polynomial Kernels. (arXiv:2202.02031v3 [stat.ML] UPDATED)
    Polynomial kernels are among the most popular kernels in machine learning, since their feature maps model the interactions between the dimensions of the input data. However, these features correspond to tensor products of the input with itself, which makes their dimension grow exponentially with the polynomial degree. We address this issue by proposing Complex-to-Real (CtR) sketches for tensor products that can be used as random feature approximations of polynomial kernels. These sketches leverage intermediate complex random projections, leading to better theoretical guarantees and potentially much lower variances than analogs using real projections. Our sketches are simple to construct and their final output is real-valued, which makes their downstream use straightforward. Finally, we show that they achieve state-of-the-art performance in terms of accuracy and speed.
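    The construction can be sketched as follows for the homogeneous polynomial kernel $k(x,y) = (x^\top y)^p$: take entrywise products of $p$ independent complex Gaussian projections and recover a real-valued kernel estimate via the real part. The scaling and this specific variant are illustrative assumptions; the paper analyzes several refinements.

```python
import numpy as np

rng = np.random.default_rng(0)

def ctr_features(X, degree, D):
    """Hedged sketch of a complex random feature for (x.y)^degree:
    products of independent circularly-symmetric complex Gaussian
    projections, so E[phi(x) conj(phi(y))] = (x.y)^degree."""
    n, d = X.shape
    Z = np.ones((n, D), dtype=complex)
    for _ in range(degree):
        # Complex Gaussian with E[w w^H] = I (real/imag parts N(0, 1/2)).
        W = (rng.normal(size=(d, D)) + 1j * rng.normal(size=(d, D))) / np.sqrt(2)
        Z *= X @ W.conj()
    return Z / np.sqrt(D)

X = rng.normal(size=(5, 8))
Z = ctr_features(X, degree=3, D=20000)
K_hat = np.real(Z @ Z.conj().T)      # real-valued kernel estimate
K = (X @ X.T) ** 3                   # exact polynomial kernel
print(np.max(np.abs(K_hat - K) / (np.abs(K) + 1e-9)))  # approximation error
```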
    An Efficient and Reliable Asynchronous Federated Learning Scheme for Smart Public Transportation. (arXiv:2208.07194v3 [cs.LG] UPDATED)
    Since the traffic conditions change over time, machine learning models that predict traffic flows must be updated continuously and efficiently in smart public transportation. Federated learning (FL) is a distributed machine learning scheme that allows buses to receive model updates without waiting for model training on the cloud. However, FL is vulnerable to poisoning or DDoS attacks since buses travel in public. Some work introduces blockchain to improve reliability, but the additional latency from the consensus process reduces the efficiency of FL. Asynchronous Federated Learning (AFL) is a scheme that reduces the latency of aggregation to improve efficiency, but the learning performance is unstable due to unreasonably weighted local models. To address the above challenges, this paper offers a blockchain-based asynchronous federated learning scheme with a dynamic scaling factor (DBAFL). Specifically, the novel committee-based consensus algorithm for blockchain improves reliability at the lowest possible cost of time. Meanwhile, the devised dynamic scaling factor allows AFL to assign reasonable weights to stale local models. Extensive experiments conducted on heterogeneous devices validate the superior learning performance, efficiency, and reliability of DBAFL.
    FewSOL: A Dataset for Few-Shot Object Learning in Robotic Environments. (arXiv:2207.03333v2 [cs.CV] UPDATED)
    We introduce the Few-Shot Object Learning (FewSOL) dataset for object recognition with a few images per object. We captured 336 real-world objects with 9 RGB-D images per object from different views. Object segmentation masks, object poses and object attributes are provided. In addition, synthetic images generated using 330 3D object models are used to augment the dataset. We investigated (i) few-shot object classification and (ii) joint object segmentation and few-shot classification with the state-of-the-art methods for few-shot learning and meta-learning using our dataset. The evaluation results show that there is still a large margin to be improved for few-shot object classification in robotic environments. Our dataset can be used to study a set of few-shot object recognition problems such as classification, detection and segmentation, shape reconstruction, pose estimation, keypoint correspondences and attribute recognition. The dataset and code are available at https://irvlutd.github.io/FewSOL.
    Linear algebra with transformers. (arXiv:2112.01898v2 [cs.LG] UPDATED)
    Transformers can learn to perform numerical computations from examples only. I study nine problems of linear algebra, from basic matrix operations to eigenvalue decomposition and inversion, and introduce and discuss four encoding schemes to represent real numbers. On all problems, transformers trained on sets of random matrices achieve high accuracies (over 90%). The models are robust to noise, and can generalize out of their training distribution. In particular, models trained to predict Laplace-distributed eigenvalues generalize to different classes of matrices: Wigner matrices or matrices with positive eigenvalues. The reverse is not true.
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v2 [stat.ML] UPDATED)
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose \textsc{Pref-SHAP}, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain \emph{context specific} information, such as the surface type in a tennis game. To demonstrate the utility of \textsc{Pref-SHAP}, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.
    Causal Discovery in Linear Structural Causal Models with Deterministic Relations. (arXiv:2111.00341v2 [cs.LG] UPDATED)
    Linear structural causal models (SCMs) -- in which each observed variable is generated by a subset of the other observed variables as well as a subset of the exogenous sources -- are pervasive in causal inference and causal discovery. However, for the task of causal discovery, existing work almost exclusively focuses on the submodel where each observed variable is associated with a distinct source with non-zero variance. This results in the restriction that no observed variable can deterministically depend on other observed variables or latent confounders. In this paper, we extend the results on structure learning by focusing on a subclass of linear SCMs that do not have this property, i.e., models in which observed variables can be causally affected by any subset of the sources and are allowed to be a deterministic function of other observed variables or latent confounders. This allows for more realistic modeling of influence or information propagation in systems. We focus on the task of causal discovery from observational data generated by a member of this subclass. We derive a set of necessary and sufficient conditions for unique identifiability of the causal structure. To the best of our knowledge, this is the first work that gives identifiability results for causal discovery under both latent confounding and deterministic relationships. Further, we propose an algorithm for recovering the underlying causal structure when the aforementioned conditions are satisfied. We validate our theoretical results on both synthetic and real datasets.
    Training Fully Connected Neural Networks is $\exists\mathbb{R}$-Complete. (arXiv:2204.01368v2 [cs.CC] UPDATED)
    We consider the algorithmic problem of finding the optimal weights and biases for a two-layer fully connected neural network to fit a given set of data points. This problem is known as empirical risk minimization in the machine learning community. We show that the problem is $\exists\mathbb{R}$-complete. This complexity class can be defined as the set of algorithmic problems that are polynomial-time equivalent to finding real roots of a polynomial with integer coefficients. Furthermore, we show that arbitrary algebraic numbers are required as weights to be able to train some instances to optimality, even if all data points are rational. Our results hold even if the following restrictions are all added simultaneously. $\bullet$ There are exactly two output neurons. $\bullet$ There are exactly two input neurons. $\bullet$ The data has only 13 different labels. $\bullet$ The number of hidden neurons is a constant fraction of the number of data points. $\bullet$ The target training error is zero. $\bullet$ The ReLU activation function is used. This shows that even very simple networks are difficult to train. The result explains why typical methods for $\mathsf{NP}$-complete problems, like mixed-integer programming or SAT-solving, cannot train neural networks to global optimality, unless $\mathsf{NP}=\exists\mathbb{R}$. We strengthen a recent result by Abrahamsen, Kleist and Miltzow [NeurIPS 2021].
    Robust Manifold Nonnegative Tucker Factorization for Tensor Data Representation. (arXiv:2211.03934v1 [cs.AI])
    Nonnegative Tucker Factorization (NTF) minimizes the Euclidean distance or Kullback-Leibler divergence between the original data and its low-rank approximation, which often suffers from gross corruptions or outliers and neglects the manifold structure of the data. In particular, NTF suffers from rotational ambiguity: solutions with and without rotation transformations are equally valid in the sense of yielding the maximum likelihood. In this paper, we propose three Robust Manifold NTF algorithms that handle outliers by incorporating structural knowledge about them. They first apply a half-quadratic optimization algorithm to transform the problem into a general weighted NTF, where the weights are influenced by the outliers. Then, we introduce the correntropy-induced metric, the Huber function, and the Cauchy function, respectively, to handle the outliers via the weights. Finally, we introduce a manifold regularization to overcome the rotational ambiguity of NTF. We have compared the proposed method with a number of representative references covering the major branches of NTF on a variety of real-world image databases. Experimental results illustrate the effectiveness of the proposed method under two evaluation metrics (accuracy and NMI).  ( 2 min )
    Geometry-aware Transformer for molecular property prediction. (arXiv:2106.15516v2 [cs.LG] UPDATED)
    Recently, graph neural networks (GNNs) have achieved remarkable performance on quantum mechanical problems. However, a graph convolution can only cover a localized region and cannot capture long-range interactions between atoms. This behavior is contrary to theoretical interatomic potentials and is a fundamental limitation of spatial-based GNNs. In this work, we propose a novel attention-based framework for molecular property prediction tasks. We represent a molecular conformation as a discrete atomic sequence combined with atom-atom distance attributes, named Geometry-aware Transformer (GeoT). In particular, we adopt a Transformer architecture, which has been widely used for sequential data. Our proposed model trains sequential representations of molecular graphs based on globally constructed attention, maintaining all spatial arrangements of atom pairs. Our method does not suffer from cost-intensive computations, such as angle calculations. Experimental results on several public benchmarks and visualization maps verify that keeping the long-range interatomic attributes can significantly improve model predictability.
    Spoofing Attack Detection in the Physical Layer with Commutative Neural Networks. (arXiv:2211.04269v1 [cs.LG])
    In a spoofing attack, an attacker impersonates a legitimate user to access or tamper with data intended for or produced by the legitimate user. In wireless communication systems, these attacks may be detected by relying on features of the channel and transmitter radios. In this context, a popular approach is to exploit the dependence of the received signal strength (RSS) at multiple receivers or access points with respect to the spatial location of the transmitter. Existing schemes rely on long-term estimates, which makes it difficult to distinguish spoofing from movement of a legitimate user. This limitation is here addressed by means of a deep neural network that implicitly learns the distribution of pairs of short-term RSS vector estimates. The adopted network architecture imposes the invariance to permutations of the input (commutativity) that the decision problem exhibits. The merits of the proposed algorithm are corroborated on a data set that we collected.
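    The commutativity constraint amounts to a DeepSets-style architecture: encode each RSS estimate with a shared network and pool with a symmetric operation, so the decision cannot depend on input order. The sketch below uses sum pooling and illustrative layer sizes, which are assumptions rather than the paper's exact network.

```python
import torch
import torch.nn as nn

class CommutativeDetector(nn.Module):
    """Permutation-invariant detector for pairs of short-term RSS vector
    estimates: a shared encoder plus symmetric (sum) pooling guarantees
    invariance to swapping the two inputs."""
    def __init__(self, n_receivers, hidden=64):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_receivers, hidden), nn.ReLU())
        self.head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                                  nn.Linear(hidden, 1))

    def forward(self, rss_a, rss_b):
        # Sum pooling makes the output invariant to the input order.
        return self.head(self.enc(rss_a) + self.enc(rss_b))

model = CommutativeDetector(n_receivers=8)
a, b = torch.randn(4, 8), torch.randn(4, 8)
print(torch.allclose(model(a, b), model(b, a)))  # True
```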
    One-shot learning for solution operators of partial differential equations. (arXiv:2104.05512v2 [cs.LG] UPDATED)
    Discovering governing equations of a physical system, represented by partial differential equations (PDEs), from data is a central challenge in a variety of areas of science and engineering. Current methods require either some prior knowledge (e.g., candidate PDE terms) to discover the PDE form, or a large dataset to learn a surrogate model of the PDE solution operator. Here, we propose the first solution operator learning method that only needs one PDE solution, i.e., one-shot learning. We first decompose the entire computational domain into small domains, where we learn a local solution operator, and then we find the coupled solution via either mesh-based fixed-point iteration or meshfree local-solution-operator informed neural networks. We demonstrate the effectiveness of our method on different PDEs, and our method exhibits a strong generalization property.
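    A 1-D sketch of the one-shot idea, assuming a linear local operator as a stand-in for the learned one: fit a map from a 3-point stencil to the centre value using a single known solution of $u'' = f$ with zero boundaries, then solve a new instance by fixed-point iteration.

```python
import numpy as np

n = 65
x = np.linspace(0, 1, n)
h = x[1] - x[0]

# One known solution of u'' = f with u(0) = u(1) = 0 (the "one shot").
f1 = np.sin(2 * np.pi * x)
u1 = -f1 / (2 * np.pi) ** 2

# Fit the local relation u_i ~ a*u_{i-1} + b*u_{i+1} + c*h^2*f_i by least
# squares; for this problem the fit lands near the Jacobi stencil
# [0.5, 0.5, -0.5].
A = np.column_stack([u1[:-2], u1[2:], h * h * f1[1:-1]])
coef, *_ = np.linalg.lstsq(A, u1[1:-1], rcond=None)

# Apply the learned local operator to a NEW right-hand side via
# Jacobi-style fixed-point iteration.
f2 = np.sin(4 * np.pi * x)
u = np.zeros(n)
for _ in range(20000):
    u[1:-1] = coef[0] * u[:-2] + coef[1] * u[2:] + coef[2] * h * h * f2[1:-1]

u_exact = -f2 / (4 * np.pi) ** 2
print("max error:", np.max(np.abs(u - u_exact)))
```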
    Individualized PATE: Differentially Private Machine Learning with Individual Privacy Guarantees. (arXiv:2202.10517v4 [cs.LG] UPDATED)
    Applying machine learning (ML) to sensitive domains requires privacy protection of the underlying training data through formal privacy frameworks, such as differential privacy (DP). Yet, usually, the privacy of the training data comes at the cost of the resulting ML models' utility. One reason for this is that DP uses one uniform privacy budget epsilon for all training data points, which has to align with the strictest privacy requirement encountered among all data holders. In practice, different data holders have different privacy requirements and data points of data holders with lower requirements can contribute more information to the training process of the ML models. To account for this need, we propose two novel methods based on the Private Aggregation of Teacher Ensembles (PATE) framework to support the training of ML models with individualized privacy guarantees. We formally describe the methods, provide a theoretical analysis of their privacy bounds, and experimentally evaluate their effect on the final model's utility using the MNIST, SVHN, and Adult income datasets. Our empirical results show that the individualized privacy methods yield ML models of higher accuracy than the non-individualized baseline. Thereby, we improve the privacy-utility trade-off in scenarios in which different data holders consent to contribute their sensitive data at different individual privacy levels.
    A Penalty Based Method for Communication-Efficient Decentralized Bilevel Programming. (arXiv:2211.04088v1 [cs.LG])
    Bilevel programming has recently received attention in the literature due to a wide range of applications, including reinforcement learning and hyper-parameter optimization. However, it is widely assumed that the underlying bilevel optimization problem is solved either by a single machine or by multiple machines connected in a star-shaped network, i.e., the federated learning setting. The latter approach suffers from a high communication cost on the central node (e.g., parameter server) and exhibits privacy vulnerabilities. Hence, it is of interest to develop methods that solve bilevel optimization problems in a communication-efficient decentralized manner. To that end, this paper introduces a penalty function based decentralized algorithm with theoretical guarantees for this class of optimization problems. Specifically, a distributed alternating gradient-type algorithm for solving consensus bilevel programming over a decentralized network is developed. A key feature of the proposed algorithm is the estimation of the hyper-gradient of the penalty function via decentralized computation of matrix-vector products and a few vector communications, which is then integrated within our alternating algorithm to give a finite-time convergence analysis under different convexity assumptions. Owing to the generality of this complexity analysis, our result yields convergence rates for a wide variety of consensus problems, including minimax and compositional optimization. Empirical results on both synthetic and real datasets demonstrate that the proposed method works well in practice.  ( 2 min )
    Semantic Information Retrieval in Wireless Networks. (arXiv:2204.13366v2 [cs.IT] UPDATED)
    Motivated by the recent success of Machine Learning (ML) tools in wireless communications, the idea of semantic communication by Weaver from 1949 has received considerable attention. It breaks with the classic design paradigm of Shannon by aiming to transmit the meaning of a message, i.e., its semantics, rather than its exact copy, and thus allows for savings in channel uses or information rate. In this work, we extend the fundamental approach of Basu et al. for modeling semantics from logical to probabilistic entailment relations between meaning and messages. Thus, we model semantics by means of a hidden random variable and define the task of semantic communication as the transmission of messages over a communication channel such that semantics is best preserved. We formulate the semantic communication design either as an Information Maximization or as an Information Bottleneck optimization problem. Finally, we propose the ML-based semantic communication system SINFONI for a distributed multipoint scenario: SINFONI communicates the meaning behind multiple messages that are observed at different senders to a single receiver for semantic retrieval. We analyze SINFONI by processing images as an example of messages. Numerical results reveal a tremendous rate-normalized SNR shift of up to 20 dB compared to classically designed communication systems.
    Quantization-Based Optimization: Alternative Stochastic Approximation of Global Optimization. (arXiv:2211.03972v1 [cs.LG])
    In this study, we propose a global optimization algorithm based on quantizing the energy level of an objective function in an NP-hard problem. According to the white noise hypothesis for a quantization error with a dense and uniform distribution, we can regard the quantization error as i.i.d. white noise. From stochastic analysis, the proposed algorithm converges weakly under a Lipschitz continuity condition alone, rather than requiring local convergence properties such as a Hessian constraint on the objective function. This shows that the proposed algorithm ensures global optimization by Laplace's condition. Numerical experiments show that the proposed algorithm outperforms conventional learning methods in solving NP-hard optimization problems such as the traveling salesman problem.  ( 2 min )
    Word Order Matters when you Increase Masking. (arXiv:2211.04427v1 [cs.CL])
    Word order, an essential property of natural languages, is injected into Transformer-based neural language models using position encoding. However, recent experiments have shown that explicit position encoding is not always useful, since some models without such a feature have managed to achieve state-of-the-art performance on some tasks. To understand this phenomenon better, we examine the effect of removing position encodings on the pre-training objective itself (i.e., masked language modelling), to test whether models can reconstruct position information from co-occurrences alone. We do so by controlling the amount of masked tokens in the input sentence, as a proxy for the importance of position information for the task. We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task. These findings point towards a direct relationship between the amount of masking and the ability of Transformers to capture order-sensitive aspects of language using position encoding.
    Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of $\ell_2$ Regularization. (arXiv:2211.04409v1 [stat.ML])
    While $\ell_2$ regularization is widely used in training gradient boosted trees, popular individualized feature attribution methods for trees such as Saabas and TreeSHAP overlook the training procedure. We propose Prediction Decomposition Attribution (PreDecomp), a novel individualized feature attribution for gradient boosted trees when they are trained with $\ell_2$ regularization. Theoretical analysis shows that the inner product between PreDecomp and labels on in-sample data is essentially the total gain of a tree, and that it can faithfully recover additive models in the population case when features are independent. Inspired by the connection between PreDecomp and total gain, we also propose TreeInner, a family of debiased global feature attributions defined in terms of the inner product between any individualized feature attribution and labels on out-sample data for each tree. Numerical experiments on a simulated dataset and a genomic ChIP dataset show that TreeInner has state-of-the-art feature selection performance. Code reproducing experiments is available at https://github.com/nalzok/TreeInner .
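    As a rough illustration of the TreeInner recipe (an inner product between an individualized attribution and labels on out-sample data), here is a sketch that substitutes TreeSHAP for PreDecomp and aggregates over the whole ensemble rather than per tree; the data and hyperparameters are synthetic placeholders.

```python
import numpy as np
import shap
import xgboost as xgb

# Synthetic regression data standing in for a real task.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
X_tr, X_out, y_tr, y_out = X[:400], X[400:], y[:400], y[400:]

model = xgb.XGBRegressor(reg_lambda=1.0, n_estimators=100).fit(X_tr, y_tr)

# Individualized attributions on held-out data, then an inner product with
# centered labels to obtain a debiased global score per feature.
phi = shap.TreeExplainer(model).shap_values(X_out)   # (n_samples, n_features)
global_score = phi.T @ (y_out - y_out.mean()) / len(y_out)
print(np.argsort(-np.abs(global_score)))             # feature ranking
```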
    Fast and Credible Likelihood-Free Cosmology with Truncated Marginal Neural Ratio Estimation. (arXiv:2111.08030v2 [astro-ph.CO] UPDATED)
    Sampling-based inference techniques are central to modern cosmological data analysis; these methods, however, scale poorly with dimensionality and typically require approximate or intractable likelihoods. In this paper we describe how Truncated Marginal Neural Ratio Estimation (TMNRE) (a new approach in so-called simulation-based inference) naturally evades these issues, improving the $(i)$ efficiency, $(ii)$ scalability, and $(iii)$ trustworthiness of the inferred posteriors. Using measurements of the Cosmic Microwave Background (CMB), we show that TMNRE can achieve converged posteriors using orders of magnitude fewer simulator calls than conventional Markov Chain Monte Carlo (MCMC) methods. Remarkably, the required number of samples is effectively independent of the number of nuisance parameters. In addition, a property called \emph{local amortization} allows the performance of rigorous statistical consistency checks that are not accessible to sampling-based methods. TMNRE promises to become a powerful tool for cosmological data analysis, particularly in the context of extended cosmologies, where the timescale required for conventional sampling-based inference methods to converge can greatly exceed that of simple cosmological models such as $\Lambda$CDM. To perform these computations, we use an implementation of TMNRE via the open-source code \texttt{swyft}.
    Efficient Compressed Ratio Estimation using Online Sequential Learning for Edge Computing. (arXiv:2211.04284v1 [cs.LG])
    Owing to the widespread adoption of the Internet of Things, a vast amount of sensor information is being acquired in real time, and the communication cost of transmitting data from edge devices is increasing accordingly. Compressed sensing (CS), a data compression method that can be used on edge devices, has been attracting attention as a way to reduce communication costs. In CS, estimating an appropriate compression ratio is important. There is a method that adaptively estimates the compression ratio for the acquired data using reinforcement learning, but the computational costs of existing reinforcement learning methods that can run on edge devices are high. In this study, we developed an efficient reinforcement learning method for edge devices, referred to as the actor-critic online sequential extreme learning machine (AC-OSELM), and a system that compresses data by estimating an appropriate compression ratio on the edge using AC-OSELM. The performance of the proposed method in estimating the compression ratio is evaluated by comparing it with other reinforcement learning methods for edge devices. The experimental results show that AC-OSELM achieved the same or better compression performance and faster compression ratio estimation than the existing methods.
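    The OS-ELM building block underlying AC-OSELM admits a compact recursive least-squares implementation. The sketch below shows only the generic regression variant (the actor-critic wrapper is omitted), with made-up sizes and data.

```python
import numpy as np

class OSELM:
    # Minimal online sequential extreme learning machine for regression.
    def __init__(self, n_in, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(n_in, n_hidden))   # random, fixed input weights
        self.b = rng.normal(size=n_hidden)

    def _h(self, X):
        return np.tanh(X @ self.W + self.b)          # random hidden features

    def init_fit(self, X, T):
        H = self._h(X)
        self.P = np.linalg.inv(H.T @ H + 1e-6 * np.eye(H.shape[1]))
        self.beta = self.P @ H.T @ T                 # initial least-squares solve

    def partial_fit(self, X, T):
        # Recursive least-squares update: no stored history, O(hidden^2) cost.
        H = self._h(X)
        K = np.linalg.inv(np.eye(len(X)) + H @ self.P @ H.T)
        self.P -= self.P @ H.T @ K @ H @ self.P
        self.beta += self.P @ H.T @ (T - H @ self.beta)

    def predict(self, X):
        return self._h(X) @ self.beta

rng = np.random.default_rng(1)
X0, X1 = rng.normal(size=(100, 4)), rng.normal(size=(20, 4))
f = lambda X: np.sin(X).sum(axis=1)
net = OSELM(n_in=4, n_hidden=50)
net.init_fit(X0, f(X0))        # initial batch
net.partial_fit(X1, f(X1))     # cheap sequential update as new data arrives
```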
    The Interpolated MVU Mechanism For Communication-efficient Private Federated Learning. (arXiv:2211.03942v1 [cs.LG])
    We consider private federated learning (FL), where a server aggregates differentially private gradient updates from a large number of clients in order to train a machine learning model. The main challenge is balancing privacy with both the classification accuracy of the learned model and the amount of communication between the clients and server. In this work, we build on a recently proposed method for communication-efficient private FL -- the MVU mechanism -- by introducing a new interpolation procedure that admits a more efficient privacy analysis. The result is the new Interpolated MVU mechanism, which provides state-of-the-art results for communication-efficient private FL on a variety of datasets.  ( 2 min )
    Estimating Treatment Effects using Neurosymbolic Program Synthesis. (arXiv:2211.04370v1 [cs.AI])
    Estimating treatment effects from observational data is a central problem in causal inference. Methods that solve this problem exploit inductive biases and heuristics from causal inference to design multi-head neural network architectures and regularizers. In this work, we propose to use neurosymbolic program synthesis, a data-efficient and interpretable technique, to solve the treatment effect estimation problem. We theoretically show that neurosymbolic programming can solve the treatment effect estimation problem. By designing a Domain Specific Language (DSL) for the treatment effect estimation problem based on the inductive biases used in the literature, we argue that neurosymbolic programming is a better alternative for treatment effect estimation than traditional methods. Our empirical study reveals that our method, which implicitly encodes inductive biases in a DSL, achieves better performance on benchmark datasets than state-of-the-art methods.
    Fairness-aware Regression Robust to Adversarial Attacks. (arXiv:2211.04449v1 [cs.CR])
    In this paper, we take a first step towards answering the question of how to design fair machine learning algorithms that are robust to adversarial attacks. Using a minimax framework, we aim to design an adversarially robust fair regression model that achieves optimal performance in the presence of an attacker who is able to add a carefully designed adversarial data point to the dataset or perform a rank-one attack on the dataset. By solving the proposed nonsmooth nonconvex-nonconcave minimax problem, the optimal adversary as well as the robust fairness-aware regression model are obtained. For both synthetic data and real-world datasets, numerical results illustrate that the proposed adversarially robust fair models have better performance on poisoned datasets than other fair machine learning models in both prediction accuracy and group-based fairness measure.
    Black Box Lie Group Preconditioners for SGD. (arXiv:2211.04422v1 [stat.ML])
    A matrix-free and a low-rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or from finite differences of parameters and gradients, similar to the BFGS algorithm. Both preconditioners are fitted in an online manner by minimizing a criterion that is free of line search and robust to stochastic gradient noise, and they are further constrained to lie on certain connected Lie groups to preserve the corresponding symmetry or invariance, e.g., the orientation of coordinates under the connected general linear group with positive determinant. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property obviates the need for damping, which is common in second-order optimizers but difficult to tune. The learning rate for parameter updating and the step size for preconditioner fitting are naturally normalized, and their default values work well in most situations.
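    Sampling the curvature information mentioned above is straightforward in an autograd framework. Below is a hedged sketch of both probes -- an exact Hessian-vector product via double backpropagation, and its finite-difference counterpart -- not the paper's preconditioner-fitting criterion itself; `grad_fn` is a hypothetical helper.

```python
import torch

def hvp(loss, params, v):
    # Exact Hessian-vector product via double backprop.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    gv = sum((g * vi).sum() for g, vi in zip(grads, v))
    return torch.autograd.grad(gv, params)

def hvp_fd(grad_fn, w, v, eps=1e-3):
    # Finite difference of gradients: a derivative-free curvature probe.
    # `grad_fn` is assumed to return the gradient at a given point.
    return (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)

# Toy check on f(w) = sum(w^4), whose Hessian-vector product is 12 * w^2 * v.
w = torch.randn(5, requires_grad=True)
v = torch.randn(5)
(hv,) = hvp((w ** 4).sum(), [w], [v])
assert torch.allclose(hv, 12 * w ** 2 * v, atol=1e-5)
```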
    Gradient-enhanced deep neural network approximations. (arXiv:2211.04226v1 [cs.LG])
    In this work we propose a gradient-enhanced deep neural network (DNN) approach for function approximation and uncertainty quantification. More precisely, the proposed approach uses both function evaluations and the associated gradient information to yield enhanced approximation accuracy. In particular, the gradient information is included as a regularization term in the gradient-enhanced DNN approach, for which we present posterior estimates (for two-layer neural networks) similar to those for path-norm regularized DNN approximations. We also discuss the application of this approach to gradient-enhanced uncertainty quantification, and present several numerical experiments showing that the proposed approach can outperform the traditional DNN approach in many cases of interest.
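    The regularized objective described here is simple to write down. A minimal sketch, assuming paired observations of function values y and gradients g at the sample points; the weight `lam` is a placeholder, not a value from the paper.

```python
import torch
from torch import nn

def gradient_enhanced_loss(model, x, y, g, lam=0.1):
    # Fit the function values and penalize mismatch with observed gradients;
    # the second term is the gradient-information regularizer.
    x = x.clone().requires_grad_(True)
    pred = model(x).squeeze(-1)
    (grad_pred,) = torch.autograd.grad(pred.sum(), x, create_graph=True)
    return ((pred - y) ** 2).mean() + lam * ((grad_pred - g) ** 2).mean()

model = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.randn(64, 2)
y = (x ** 2).sum(dim=1)   # target f(x) = |x|^2
g = 2 * x                 # its gradient at the samples
loss = gradient_enhanced_loss(model, x, y, g)
loss.backward()
```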
    Expressing linear equality constraints in feedforward neural networks. (arXiv:2211.04395v1 [cs.LG])
    We seek to impose linear, equality constraints in feedforward neural networks. As top layer predictors are usually nonlinear, this is a difficult task if we seek to deploy standard convex optimization methods and strong duality. To overcome this, we introduce a new saddle-point Lagrangian with auxiliary predictor variables on which constraints are imposed. Elimination of the auxiliary variables leads to a dual minimization problem on the Lagrange multipliers introduced to satisfy the linear constraints. This minimization problem is combined with the standard learning problem on the weight matrices. From this theoretical line of development, we obtain the surprising interpretation of Lagrange parameters as additional, penultimate layer hidden units with fixed weights stemming from the constraints. Consequently, standard minimization approaches can be used despite the inclusion of Lagrange parameters -- a very satisfying, albeit unexpected, discovery. Examples ranging from multi-label classification to constrained autoencoders are envisaged in the future.
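    One generic saddle-point form consistent with this description -- a hedged sketch, not the paper's exact Lagrangian -- introduces an auxiliary predictor $z$ that carries the linear constraints $Az = b$ and is tied to the network output $f_W(x)$:

```latex
\min_{W}\ \max_{\lambda}\ \min_{z}\quad
  \ell(z, y) \;+\; \lambda^{\top}\!\left(A z - b\right)
  \;+\; \tfrac{\mu}{2}\,\lVert z - f_W(x) \rVert^2
```

    Eliminating $z$ then leaves a minimization over $\lambda$ coupled with the usual training problem on $W$, which is where the interpretation of the multipliers as fixed-weight penultimate units arises.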
    Clustered Federated Learning based on Nonconvex Pairwise Fusion. (arXiv:2211.04218v1 [cs.LG])
    This study investigates clustered federated learning (FL), one of the formulations of FL with non-i.i.d. data, where the devices are partitioned into clusters and each cluster optimally fits its data with a localized model. We propose a novel clustered FL framework, which applies a nonconvex penalty to pairwise differences of parameters. This framework can automatically identify clusters without a priori knowledge of the number of clusters and the set of devices in each cluster. To implement the proposed framework, we develop a novel clustered FL method called FPFC. Advancing from the standard ADMM, our method is implemented in parallel, updates only a subset of devices at each communication round, and allows each participating device to perform a variable amount of work. This greatly reduces the communication cost while simultaneously preserving privacy, making it practical for FL. We also propose a new warmup strategy for hyperparameter tuning under FL settings and consider the asynchronous variant of FPFC (asyncFPFC). Theoretically, we provide convergence guarantees of FPFC for general nonconvex losses and establish the statistical convergence rate under a linear model with squared loss. Our extensive experiments demonstrate the advantages of FPFC over existing methods.
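    In generic form, the nonconvex pairwise-fusion objective reads as below, where $f_i$ is device $i$'s local loss and $\rho$ is a nonconvex penalty such as MCP or SCAD; this is a hedged reading of the abstract rather than the paper's exact formulation.

```latex
\min_{w_1,\dots,w_m}\ \sum_{i=1}^{m} f_i(w_i)
  \;+\; \lambda \sum_{1 \le i < j \le m} \rho\!\left(\lVert w_i - w_j \rVert\right)
```

    Devices whose parameters are fused to equality by the penalty end up sharing a model, which is how clusters emerge without specifying their number in advance.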
    Private Set Generation with Discriminative Information. (arXiv:2211.04446v1 [cs.CR])
    Differentially private data generation techniques have become a promising solution to the data privacy challenge -- they enable the sharing of data while complying with rigorous privacy guarantees, which is essential for scientific progress in sensitive domains. Unfortunately, restricted by the inherent complexity of modeling high-dimensional distributions, existing private generative models struggle to produce synthetic samples of high utility. In contrast to existing works that aim at fitting the complete data distribution, we directly optimize for a small set of samples that are representative of the distribution under the supervision of discriminative information from downstream tasks, which is generally an easier task and more suitable for private training. Our work provides an alternative view of differentially private generation of high-dimensional data and introduces a simple yet effective method that greatly improves the sample utility of state-of-the-art approaches.
    Reduced Order Probabilistic Emulation for Physics-Based Thermosphere Models. (arXiv:2211.04392v1 [physics.space-ph])
    The geospace environment is volatile and highly driven. Space weather has effects on Earth's magnetosphere that cause a dynamic and enigmatic response in the thermosphere, particularly in the evolution of neutral mass density. Many models exist that use space weather drivers to produce a density response, but these models are typically computationally expensive or inaccurate for certain space weather conditions. In response, this work aims to employ a probabilistic machine learning (ML) method to create an efficient surrogate for the Thermosphere Ionosphere Electrodynamics General Circulation Model (TIE-GCM), a physics-based thermosphere model. Our method leverages principal component analysis to reduce the dimensionality of TIE-GCM and recurrent neural networks to model the dynamic behavior of the thermosphere much more quickly than the numerical model. The newly developed reduced order probabilistic emulator (ROPE) uses Long Short-Term Memory neural networks to perform time-series forecasting in the reduced state and provide distributions for future density. We show that across the available data, TIE-GCM ROPE has similar error to previous linear approaches while improving storm-time modeling. We also conduct a satellite propagation study for the significant November 2003 storm, which shows that TIE-GCM ROPE can capture the position resulting from TIE-GCM density with less than 5 km of bias, whereas linear approaches provide point estimates that can result in biases of 7-18 km.
    FLock: Defending Malicious Behaviors in Federated Learning with Blockchain. (arXiv:2211.04344v1 [cs.CR])
    Federated learning (FL) is a promising way to allow multiple data owners (clients) to collaboratively train machine learning models without compromising data privacy. Yet, existing FL solutions usually rely on a centralized aggregator for model weight aggregation, while assuming clients are honest. Even if data privacy can still be preserved, the problem of single-point failure and data poisoning attack from malicious clients remains unresolved. To tackle this challenge, we propose to use distributed ledger technology (DLT) to achieve FLock, a secure and reliable decentralized Federated Learning system built on blockchain. To guarantee model quality, we design a novel peer-to-peer (P2P) review and reward/slash mechanism to detect and deter malicious clients, powered by on-chain smart contracts. The reward/slash mechanism, in addition, serves as incentives for participants to honestly upload and review model parameters in the FLock system. FLock thus improves the performance and the robustness of FL systems in a fully P2P manner.
    From Causal Pairs to Causal Graphs. (arXiv:2211.04312v1 [cs.LG])
    Causal structure learning from observational data remains a non-trivial task due to various factors such as finite sampling, unobserved confounding factors, and measurement errors. Constraint-based and score-based methods tend to suffer from high computational complexity due to the combinatorial nature of estimating the directed acyclic graph (DAG). Motivated by the `Cause-Effect Pair' NIPS 2013 Workshop on Causality Challenge, in this paper, we take a different approach and generate a probability distribution over all possible graphs informed by the cause-effect pair features proposed in response to the workshop challenge. The goal of the paper is to propose new methods based on this probabilistic information and compare their performance with traditional and state-of-the-art approaches. Our experiments, on both synthetic and real datasets, show that our proposed methods not only have statistically similar or better performances than some traditional approaches but also are computationally faster.
    Quantum Persistent Homology for Time Series. (arXiv:2211.04465v1 [quant-ph])
    Persistent homology, a powerful mathematical tool for data analysis, summarizes the shape of data by tracking topological features across changes in scale. Classical algorithms for persistent homology are often constrained by running times and memory requirements that grow exponentially with the number of data points. To overcome this problem, two quantum algorithms for persistent homology have been developed based on two different approaches. However, both of these quantum algorithms consider a data set in the form of a point cloud, which can be restrictive considering that many data sets come in the form of time series. In this paper, we alleviate this issue by establishing a quantum Takens's delay embedding algorithm, which turns a time series into a point cloud via a pertinent embedding into a higher-dimensional space. With this quantum transformation from time series to point clouds, one may then use a quantum persistent homology algorithm to extract the topological features of the point cloud associated with the original time series.
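    The classical step being "quantized" here is easy to state. A minimal sketch of the ordinary Takens delay embedding, with illustrative dimension and delay parameters:

```python
import numpy as np

def takens_embedding(series, dim=3, delay=2):
    # Map a 1-D time series to a point cloud in R^dim by stacking
    # delayed copies: x_t -> (x_t, x_{t+delay}, ..., x_{t+(dim-1)*delay}).
    n = len(series) - (dim - 1) * delay
    return np.stack(
        [series[i : i + n] for i in range(0, dim * delay, delay)], axis=1)

t = np.linspace(0, 8 * np.pi, 400)
cloud = takens_embedding(np.sin(t))   # a loop-like cloud; persistent homology
                                      # would recover its 1-dimensional hole
```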
    Machine Learning-Aided Operations and Communications of Unmanned Aerial Vehicles: A Contemporary Survey. (arXiv:2211.04324v1 [cs.RO])
    The ongoing amalgamation of UAV and ML techniques is creating a significant synergy and empowering UAVs with unprecedented intelligence and autonomy. This survey aims to provide a timely and comprehensive overview of ML techniques used in UAV operations and communications and identify the potential growth areas and research gaps. We emphasise the four key components of UAV operations and communications to which ML can significantly contribute, namely, perception and feature extraction, feature interpretation and regeneration, trajectory and mission planning, and aerodynamic control and operation. We classify the latest popular ML tools based on their applications to the four components and conduct gap analyses. This survey also takes a step forward by pointing out significant challenges in the upcoming realm of ML-aided automated UAV operations and communications. It is revealed that different ML techniques dominate the applications to the four key modules of UAV operations and communications. While there is an increasing trend of cross-module designs, little effort has been devoted to an end-to-end ML framework, from perception and feature extraction to aerodynamic control and operation. It is also unveiled that the reliability and trust of ML in UAV operations and applications require significant attention before full automation of UAVs and potential cooperation between UAVs and humans come to fruition.
    Learning Spatio-Temporal Model of Disease Progression with NeuralODEs from Longitudinal Volumetric Data. (arXiv:2211.04234v1 [cs.CV])
    Robust forecasting of the future anatomical changes inflicted by an ongoing disease is an extremely challenging task that is out of grasp even for experienced healthcare professionals. Such a capability, however, is of great importance since it can improve patient management by providing information on the speed of disease progression already at the admission stage, or it can enrich clinical trials with fast progressors and avoid the need for control arms by means of digital twins. In this work, we develop a deep learning method that models the evolution of age-related disease by processing a single medical scan and providing a segmentation of the target anatomy at a requested future point in time. Our method represents a time-invariant physical process and solves the large-scale problem of modeling temporal pixel-level changes using NeuralODEs. In addition, we demonstrate how to incorporate prior domain-specific constraints into our method and define a temporal Dice loss for learning temporal objectives. To evaluate the applicability of our approach across different age-related diseases and imaging modalities, we developed and tested the proposed method on datasets with 967 retinal OCT volumes of 100 patients with Geographic Atrophy and 2823 brain MRI volumes of 633 patients with Alzheimer's Disease. For Geographic Atrophy, the proposed method outperformed the related baseline models in atrophy growth prediction. For Alzheimer's Disease, the proposed method demonstrated remarkable performance in predicting the brain ventricle changes induced by the disease, achieving the state-of-the-art result on the TADPOLE challenge.
    GOOD-D: On Unsupervised Graph Out-Of-Distribution Detection. (arXiv:2211.04208v1 [cs.LG])
    Most existing deep learning models are trained under the closed-world assumption, where the test data is assumed to be drawn i.i.d. from the same distribution as the training data, known as in-distribution (ID). However, when models are deployed in an open-world scenario, test samples can be out-of-distribution (OOD) and therefore should be handled with caution. To detect such OOD samples drawn from unknown distributions, OOD detection has received increasing attention lately. However, current endeavors mostly focus on grid-structured data, and OOD detection for graph-structured data remains under-explored. Considering that data labeling on graphs is commonly time-consuming and labor-intensive, in this work we study the problem of unsupervised graph OOD detection, aiming at detecting OOD graphs solely based on unlabeled ID data. To achieve this goal, we develop a new graph contrastive learning framework, GOOD-D, for detecting OOD graphs without using any ground-truth labels. By performing hierarchical contrastive learning on the augmented graphs generated by our perturbation-free graph data augmentation method, GOOD-D is able to capture the latent ID patterns and accurately detect OOD graphs based on semantic inconsistency at different granularities (i.e., node-level, graph-level, and group-level). As a pioneering work in unsupervised graph-level OOD detection, we build a comprehensive benchmark to compare our proposed approach with different state-of-the-art methods. The experimental results demonstrate the superiority of our approach over different methods on various datasets.
    FedGrad: Optimisation in Decentralised Machine Learning. (arXiv:2211.04254v1 [cs.LG])
    Federated Learning is a machine learning paradigm in which machine learning models are trained in a distributed fashion: many clients/edge devices collaborate to train a single model on a central server, without sharing their datasets with each other, thereby decoupling computation from the data residing on each device. In this paper, we propose yet another adaptive federated optimization method, along with some other ideas in the field of federated learning. We also perform experiments using these methods and showcase the improvement in the overall performance of federated learning.
    Federated Learning Using Three-Operator ADMM. (arXiv:2211.04152v1 [cs.LG])
    Federated learning (FL) has emerged as an instance of the distributed machine learning paradigm that avoids the transmission of data generated on the users' side. Although data are not transmitted, edge devices have to deal with limited communication bandwidths, data heterogeneity, and straggler effects due to the limited computational resources of users' devices. A prominent approach to overcoming such difficulties is FedADMM, which is based on the classical two-operator consensus alternating direction method of multipliers (ADMM). The common assumption of FL algorithms, including FedADMM, is that they learn a global model using data only on the users' side and not on the edge server. However, in edge learning, the server is expected to be near the base station and to have direct access to rich datasets. In this paper, we argue that leveraging the rich data on the edge server is much more beneficial than utilizing only user datasets. Specifically, we show that the mere application of FL with an additional virtual user node representing the data on the edge server is inefficient. We propose FedTOP-ADMM, which generalizes FedADMM and is based on a three-operator ADMM-type technique that exploits a smooth cost function on the edge server to learn a global model in parallel with the edge devices. Our numerical experiments indicate that FedTOP-ADMM achieves a substantial gain of up to 33\% in communication efficiency to reach a desired test accuracy relative to FedADMM with a virtual user on the edge server.
    Theoretical analysis and experimental validation of volume bias of soft Dice optimized segmentation maps in the context of inherent uncertainty. (arXiv:2211.04161v1 [cs.CV])
    The clinical interest is often to measure the volume of a structure, which is typically derived from a segmentation. In order to evaluate and compare segmentation methods, the similarity between a segmentation and a predefined ground truth is measured using popular discrete metrics, such as the Dice score. Recent segmentation methods use a differentiable surrogate metric, such as soft Dice, as part of the loss function during the learning phase. In this work, we first briefly describe how to derive volume estimates from a segmentation that is, potentially, inherently uncertain or ambiguous. This is followed by a theoretical analysis and an experimental validation linking the inherent uncertainty to common loss functions for training CNNs, namely cross-entropy and soft Dice. We find that, even though soft Dice optimization leads to an improved performance with respect to the Dice score and other measures, it may introduce a volume bias for tasks with high inherent uncertainty. These findings indicate some of the method's clinical limitations and suggest doing a closer ad-hoc volume analysis with an optional re-calibration step.
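    For reference, the differentiable surrogate in question is a one-liner. A minimal PyTorch sketch of the soft Dice loss on probability maps (binary case, illustrative shapes):

```python
import torch

def soft_dice_loss(probs, target, eps=1e-6):
    # Differentiable Dice surrogate; optimizing it can push probabilities
    # toward 0/1, which is the source of the volume bias analyzed above.
    inter = (probs * target).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + target.sum() + eps)

probs = torch.rand(1, 1, 64, 64)                   # predicted foreground probabilities
target = (torch.rand(1, 1, 64, 64) > 0.5).float()  # binary ground truth
loss = soft_dice_loss(probs, target)
```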
    Learning advisor networks for noisy image classification. (arXiv:2211.04177v1 [cs.CV])
    In this paper, we introduce the novel concept of an advisor network to address the problem of noisy labels in image classification. Deep neural networks (DNNs) are prone to performance reduction and overfitting on training data with noisy annotations. Loss-weighting methods aim to mitigate the influence of noisy labels during training by completely removing their contribution. This discarding process prevents DNNs from learning wrong associations between images and their correct labels, but it reduces the amount of data used, especially when most of the samples have noisy labels. Differently, our method weighs the features extracted directly by the classifier, without altering the loss value of each sample. The advisor helps the classifier focus only on the useful part of the information present in mislabeled examples, allowing it to leverage that data as well. We train the advisor with a meta-learning strategy so that it can adapt throughout the training of the main model. We tested our method on CIFAR10 and CIFAR100 with synthetic noise, and on Clothing1M, which contains real-world noise, reporting state-of-the-art results.
    UAV-Aided Multi-Community Federated Learning. (arXiv:2206.02043v2 [cs.IT] UPDATED)
    In this work, we investigate the problem of online trajectory design for an Unmanned Aerial Vehicle (UAV) in a Federated Learning (FL) setting where several different communities exist, each defined by a unique task to be learned. In this setting, spatially distributed devices belonging to each community collaboratively contribute towards training their community model via wireless links provided by the UAV. Accordingly, the UAV acts as a mobile orchestrator, coordinating the transmissions and the learning schedule among the devices in each community with the aim of accelerating the learning process of all tasks. We propose a heuristic metric as a proxy for the training performance of the different tasks. Capitalizing on this metric, a surrogate objective is defined which enables us to jointly optimize the UAV trajectory and the scheduling of the devices by employing convex optimization techniques and graph theory. The simulations illustrate that our solution outperforms other handpicked static and mobile UAV deployment baselines.  ( 2 min )
    Time-Varying Correlation Networks for Interpretable Change Point Detection. (arXiv:2211.03991v1 [cs.LG])
    Change point detection (CPD) methods aim to detect abrupt changes in time-series data. Recent CPD methods have demonstrated their potential in identifying changes in underlying statistical distributions but often fail to capture complex changes in the correlation structure in time-series data. These methods also fail to generalize effectively, as even within the same time-series, different kinds of change points (CPs) may arise that are best characterized by different types of time-series perturbations. To address this issue, we propose TiVaCPD, a CPD methodology that uses a time-varying graphical lasso based method to identify changes in correlation patterns between features over time, and combines that with an aggregate Kernel Maximum Mean Discrepancy (MMD) test to identify subtle changes in the underlying statistical distributions of dynamically established time windows. We evaluate the performance of TiVaCPD in identifying and characterizing various types of CPs in time-series and show that our method outperforms current state-of-the-art CPD methods for all categories of CPs.
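    The distribution-shift half of TiVaCPD rests on a standard two-sample statistic. A hedged sketch of a kernel MMD estimate between two sliding windows (the paper aggregates such tests; the kernel and bandwidth here are placeholders):

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=1.0):
    # Squared Maximum Mean Discrepancy with an RBF kernel (biased V-statistic);
    # larger values indicate a likely change in distribution between windows.
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

rng = np.random.default_rng(0)
before = rng.normal(size=(100, 3))           # window before a candidate CP
after = rng.normal(loc=0.5, size=(100, 3))   # window after it
stat = mmd2_rbf(before, after)
```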
    Reconstruct from BEV: A 3D Lane Detection Approach based on Geometry Structure Prior. (arXiv:2206.10098v2 [cs.CV] UPDATED)
    In this paper, we propose an advanced approach to monocular 3D lane detection that leverages the geometry structure underlying the process of 2D-to-3D lane reconstruction. Inspired by previous methods, we first analyze the geometric relationship between a 3D lane and its 2D representation on the ground, and propose to impose explicit supervision based on this structure prior, which makes it possible to build inter-lane and intra-lane relationships that facilitate the reconstruction of 3D lanes from local to global. Second, to reduce the structural loss in the 2D lane representation, we directly extract BEV lane information from front-view images, which greatly eases the confusion of distant lane features present in previous methods. Furthermore, we propose a novel task-specific data augmentation method that synthesizes new training data for both the segmentation and reconstruction tasks in our pipeline, to counter the imbalanced distribution of camera pose and ground slope and improve generalization on unseen data. Our work marks the first attempt to incorporate geometric prior information into DNN-based 3D lane detection, enabling lane detection at extra-long distances and doubling the original detection range. The proposed method can be smoothly adopted by other frameworks without extra cost. Experimental results show that our work outperforms state-of-the-art approaches by 3.8% F-Score on the Apollo 3D synthetic dataset at a real-time speed of 82 FPS without introducing extra parameters.
    A Semiparametric Efficient Approach To Label Shift Estimation and Quantification. (arXiv:2211.04274v1 [cs.LG])
    Transfer Learning is an area of statistics and machine learning research that seeks answers to the following question: how do we build successful learning algorithms when the data available for training our model is qualitatively different from the data we hope the model will perform well on? In this thesis, we focus on a specific area of Transfer Learning called label shift, also known as quantification. In quantification, the aforementioned discrepancy is isolated to a shift in the distribution of the response variable. In such a setting, accurately inferring the response variable's new distribution is both an important estimation task in its own right and a crucial step for ensuring that the learning algorithm can adapt to the new data. We make two contributions to this field. First, we present a new procedure called SELSE which estimates the shift in the response variable's distribution. Second, we prove that SELSE is semiparametric efficient among a large family of quantification algorithms, i.e., SELSE's normalized error has the smallest possible asymptotic variance matrix compared to any other algorithm in that family. This family includes nearly all existing algorithms, including ACC/PACC quantifiers and maximum likelihood based quantifiers such as EMQ and MLLS. Empirical experiments reveal that SELSE is competitive with, and in many cases outperforms, existing state-of-the-art quantification methods, and that this improvement is especially large when the number of test samples is far greater than the number of train samples.
    Toward Human-AI Co-creation to Accelerate Material Discovery. (arXiv:2211.04257v1 [cs.LG])
    There is an increasing need in our society to achieve faster advances in science to tackle urgent problems, such as climate change, environmental hazards, sustainable energy systems, and pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing risks of the proposed novel solutions before moving to the experimental stage. Despite several recent advances in Machine Learning and AI to address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates into low viability of the generated candidates. In this work, we propose a workbench framework that aims at enabling human-AI co-creation to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and on user-interaction components to acquire knowledge and advise the SMEs. Currently, the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment.
    TimeKit: A Time-series Forecasting-based Upgrade Kit for Collaborative Filtering. (arXiv:2211.04266v1 [cs.IR])
    Recommender systems are a long-standing research problem in data mining and machine learning. They are incremental in nature, as new user-item interaction logs arrive. In real-world applications, we need to periodically train a collaborative filtering algorithm to extract user/item embedding vectors, and therefore a time-series of embedding vectors can be naturally defined. We present a time-series forecasting-based upgrade kit (TimeKit), which works in the following way: it i) first decides on a base collaborative filtering algorithm, ii) extracts user/item embedding vectors with the base algorithm from user-item interaction logs incrementally, e.g., every month, iii) trains our time-series forecasting model on the extracted time-series of embedding vectors, and then iv) forecasts the future embedding vectors and recommends using their dot-product scores, exploiting a recent breakthrough in processing complicated time-series data, i.e., neural controlled differential equations (NCDEs). Our experiments with four real-world benchmark datasets show that the proposed time-series forecasting-based upgrade kit can significantly enhance existing popular collaborative filtering algorithms.
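    Steps iii)-iv) reduce to forecasting an embedding trajectory and scoring by dot products. The sketch below substitutes a naive linear extrapolation for the NCDE forecaster and uses random data, purely to show the data flow:

```python
import numpy as np

rng = np.random.default_rng(0)
U_hist = rng.normal(size=(12, 100, 32))   # 12 monthly snapshots of user embeddings
V_hist = rng.normal(size=(12, 500, 32))   # ... and of item embeddings

# Placeholder forecast: linear extrapolation of the last step
# (TimeKit trains a neural CDE here instead).
U_next = U_hist[-1] + (U_hist[-1] - U_hist[-2])
V_next = V_hist[-1] + (V_hist[-1] - V_hist[-2])

scores = U_next @ V_next.T                    # forecasted dot-product scores
top10 = np.argsort(-scores, axis=1)[:, :10]   # per-user top-10 recommendations
```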
    Simulation-Based Parallel Training. (arXiv:2211.04119v1 [cs.AI])
    Numerical simulations are ubiquitous in science and engineering. Machine learning for science investigates how artificial neural architectures can learn from these simulations to speed up scientific discovery and engineering processes. Most of these architectures are trained in a supervised manner, requiring tremendous amounts of data from simulations that are slow to generate and memory-intensive. In this article, we present our ongoing work to design a training framework that alleviates these bottlenecks by generating data in parallel with the training process. Such simultaneity induces a bias in the data available during training, and we present a strategy to mitigate this bias with a memory buffer. We test our framework on the multi-parametric Lorenz attractor, showing the benefit of our framework compared to offline training and the success of our data bias mitigation strategy in capturing the complex chaotic dynamics of the system.
    Graph Summarization via Node Grouping: A Spectral Algorithm. (arXiv:2211.04169v1 [cs.SI])
    Graph summarization via node grouping is a popular method to build concise graph representations by grouping nodes from the original graph into supernodes and encoding edges into superedges such that the loss of adjacency information is minimized. Such summaries have immense applications in large-scale graph analytics due to their small size and high query processing efficiency. In this paper, we reformulate the loss minimization problem for summarization into an equivalent integer maximization problem. By initially allowing relaxed (fractional) solutions for integer maximization, we analytically expose the underlying connections to the spectral properties of the adjacency matrix. Consequently, we design an algorithm called SpecSumm that consists of two phases. In the first phase, motivated by spectral graph theory, we apply k-means clustering on the k largest (in magnitude) eigenvectors of the adjacency matrix to assign nodes to supernodes. In the second phase, we propose a greedy heuristic that updates the initial assignment to further improve summary quality. Finally, via extensive experiments on 11 datasets, we show that SpecSumm efficiently produces high-quality summaries compared to state-of-the-art summarization algorithms and scales to graphs with millions of nodes.
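    The first phase of SpecSumm is a few lines with standard sparse linear algebra. A hedged sketch (the greedy refinement phase is omitted, and the example graph is arbitrary):

```python
import networkx as nx
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_supernodes(A, k):
    # Phase 1 of SpecSumm: k-means on the k largest-magnitude eigenvectors
    # of the adjacency matrix to assign nodes to supernodes.
    _, vecs = eigsh(A, k=k, which="LM")
    return KMeans(n_clusters=k, n_init=10).fit_predict(vecs)

A = nx.to_scipy_sparse_array(nx.karate_club_graph(), dtype=float)
labels = spectral_supernodes(A, k=4)   # supernode assignment per node
```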
    The Technological Emergence of AutoML: A Survey of Performant Software and Applications in the Context of Industry. (arXiv:2211.04148v1 [cs.LG])
    With most technical fields, there exists a delay between fundamental academic research and practical industrial uptake. Whilst some sciences have robust and well-established processes for commercialisation, such as the pharmaceutical practice of regimented drug trials, other fields face transitory periods in which fundamental academic advancements diffuse gradually into the space of commerce and industry. For the still relatively young field of Automated/Autonomous Machine Learning (AutoML/AutonoML), that transitory period is under way, spurred on by a burgeoning interest from broader society. Yet, to date, little research has been undertaken to assess the current state of this dissemination and its uptake. Thus, this review makes two primary contributions to knowledge around this topic. Firstly, it provides the most up-to-date and comprehensive survey of existing AutoML tools, both open-source and commercial. Secondly, it motivates and outlines a framework for assessing whether an AutoML solution designed for real-world application is 'performant'; this framework extends beyond the limitations of typical academic criteria, considering a variety of stakeholder needs and the human-computer interactions required to service them. Thus, additionally supported by an extensive assessment and comparison of academic and commercial case-studies, this review evaluates mainstream engagement with AutoML in the early 2020s, identifying obstacles and opportunities for accelerating future uptake.
    Hyperbolic Graph Representation Learning: A Tutorial. (arXiv:2211.04050v1 [cs.LG])
    Graph-structured data are widespread in real-world applications, such as social networks, recommender systems, knowledge graphs, and chemical molecules. Despite the success of Euclidean space for graph-related learning tasks, its ability to model complex patterns is essentially constrained by its polynomially growing capacity. Recently, hyperbolic spaces have emerged as a promising alternative for processing graph data with tree-like structure or power-law distributions. Unlike Euclidean space, which expands polynomially, hyperbolic space grows exponentially, giving it a natural advantage in abstracting tree-like or scale-free graphs with hierarchical organization. In this tutorial, we aim to give an introduction to this emerging field of graph representation learning with the express purpose of being accessible to all audiences. We first give a brief introduction to graph representation learning as well as some preliminary Riemannian and hyperbolic geometry. We then comprehensively revisit hyperbolic embedding techniques, including hyperbolic shallow models and hyperbolic neural networks. In addition, we introduce the technical details of current hyperbolic graph neural networks by unifying them into a general framework and summarizing the variants of each component. Moreover, we further introduce a series of related applications in a variety of fields. In the last part, we discuss several advanced topics about hyperbolic geometry for graph representation learning, which potentially serve as guidelines for further flourishing of the non-Euclidean graph learning community.
    Hyperparameter optimization in deep multi-target prediction. (arXiv:2211.04362v1 [cs.LG])
    As a result of the ever increasing complexity of configuring and fine-tuning machine learning models, the field of automated machine learning (AutoML) has emerged over the past decade. However, software implementations like Auto-WEKA and Auto-sklearn typically focus on classical machine learning (ML) tasks such as classification and regression. Our work can be seen as the first attempt at offering a single AutoML framework for most problem settings that fall under the umbrella of multi-target prediction, which includes popular ML settings such as multi-label classification, multivariate regression, multi-task learning, dyadic prediction, matrix completion, and zero-shot learning. Automated problem selection and model configuration are achieved by extending DeepMTP, a general deep learning framework for MTP problem settings, with popular hyperparameter optimization (HPO) methods. Our extensive benchmarking across different datasets and MTP problem settings identifies cases where specific HPO methods outperform others.
    Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning. (arXiv:2211.04325v1 [cs.LG])
    We analyze the growth of dataset sizes used in machine learning for natural language processing and computer vision, and extrapolate these using two methods; using the historical growth rate and estimating the compute-optimal dataset size for future predicted compute budgets. We investigate the growth in data usage by estimating the total stock of unlabeled data available on the internet over the coming decades. Our analysis indicates that the stock of high-quality language data will be exhausted soon; likely before 2026. By contrast, the stock of low-quality language data and image data will be exhausted only much later; between 2030 and 2050 (for low-quality language) and between 2030 and 2060 (for images). Our work suggests that the current trend of ever-growing ML models that rely on enormous datasets might slow down if data efficiency is not drastically improved or new sources of data become available.
    Fine-grained Population Mapping from Coarse Census Counts and Open Geodata. (arXiv:2211.04039v1 [cs.LG])
    Fine-grained population maps are needed in several domains, like urban planning, environmental monitoring, public health, and humanitarian operations. Unfortunately, in many countries only aggregate census counts over large spatial units are collected; moreover, these are not always up-to-date. We present POMELO, a deep learning model that employs coarse census counts and open geodata to estimate fine-grained population maps with 100m ground sampling distance. Moreover, the model can also estimate population numbers when no census counts at all are available, by generalizing across countries. In a series of experiments for several countries in sub-Saharan Africa, the maps produced with POMELO are in good agreement with the most detailed available reference counts: disaggregation of coarse census counts reaches $R^2$ values of 85-89%; unconstrained prediction in the absence of any counts reaches 48-69%.
    DetAIL : A Tool to Automatically Detect and Analyze Drift In Language. (arXiv:2211.04250v1 [cs.LG])
    Machine learning and deep learning-based decision making has become part of today's software. The goal of this work is to ensure that machine learning and deep learning-based systems are as trusted as traditional software. Traditional software is made dependable by following rigorous practices like static analysis, testing, debugging, verifying, and repairing throughout the development and maintenance life-cycle. Similarly, for machine learning systems we need to keep the models up to date so that their performance is not compromised. Current systems rely on scheduled re-training of these models as new data arrives. In this work, we instead propose to measure the data drift that takes place as new data arrives, so that the models can be adaptively re-trained whenever re-training is actually required, irrespective of schedules. In addition, we generate explanations at the sentence level and dataset level to capture why a given payload text has drifted.
    A review of TinyML. (arXiv:2211.04448v1 [cs.LG])
    In today's technological world, the application of machine learning is becoming ubiquitous. Incorporating machine learning algorithms on extremely low-power and inexpensive embedded devices at the edge is now possible due to the combination of the Internet of Things (IoT) and edge computing. Traditional machine learning demands vast amounts of resources to estimate an outcome. The TinyML concept for embedded machine learning attempts to push this capability from the usual high-end approaches to low-end applications. TinyML is a rapidly expanding interdisciplinary topic at the convergence of machine learning, software, and hardware, centered on deploying deep neural network models on embedded (micro-controller-driven) systems. TinyML will pave the way for novel edge-level services and applications that rely on distributed edge inference and independent decision-making rather than server computation. In this paper, we explore TinyML's methodology, how TinyML can benefit a few specific industrial fields, its obstacles, and its future scope.
    Self-conditioned Embedding Diffusion for Text Generation. (arXiv:2211.04236v1 [cs.CL])
    Can continuous diffusion models bring the same performance breakthrough to natural language that they did for image generation? To circumvent the discrete nature of text data, we can simply project tokens into a continuous space of embeddings, as is standard in language modeling. We propose Self-conditioned Embedding Diffusion, a continuous diffusion mechanism that operates on token embeddings and allows learning flexible and scalable diffusion models for both conditional and unconditional text generation. Through qualitative and quantitative evaluation, we show that our text diffusion models generate samples comparable with those produced by standard autoregressive language models -- while being in theory more efficient on accelerator hardware at inference time. Our work paves the way for scaling up diffusion models for text, similarly to autoregressive models, and for improving performance with recent refinements to continuous diffusion.
    Parameter and Data Efficient Continual Pre-training for Robustness to Dialectal Variance in Arabic. (arXiv:2211.03966v1 [cs.CL])
    The use of multilingual language models for tasks in low- and high-resource languages has been a success story in deep learning. In recent times, Arabic has been receiving widespread attention on account of its dialectal variance. While prior research studies have tried to adapt these multilingual models for dialectal variants of Arabic, this remains a challenging problem owing to the lack of sufficient monolingual dialectal data and parallel translation data for such dialectal variants. Whether limited dialectal data can be used to improve models trained on Arabic for its dialectal variants remains an open problem. First, we show that multilingual BERT (mBERT) incrementally pretrained on Arabic monolingual data takes less training time and yields comparable accuracy when compared to our custom monolingual Arabic model, and beats existing models (by an average of +$6.41$ across metrics). We then explore two continual pre-training methods -- (1) using small amounts of dialectal data for continual finetuning and (2) using parallel Arabic-to-English data with a Translation Language Modeling loss function. We show that both approaches help improve performance on dialectal classification tasks ($+4.64$ avg. gain) when used on monolingual models.  ( 2 min )
    Selective compression learning of latent representations for variable-rate image compression. (arXiv:2211.04104v1 [eess.IV])
    Recently, many neural network-based image compression methods have shown promising results superior to existing tool-based conventional codecs. However, most of them are often trained as separate models for different target bit rates, thus increasing model complexity. Therefore, several studies have been conducted on learned compression that supports variable rates with a single model, but they require additional network modules, layers, or inputs that often lead to complexity overhead, or do not provide sufficient coding efficiency. In this paper, we first propose a selective compression method that partially encodes the latent representations in a fully generalized manner for deep learning-based variable-rate image compression. The proposed method adaptively determines the essential representation elements for compression at different target quality levels. For this, we first generate a 3D importance map, reflecting the nature of the input content, to represent the underlying importance of the representation elements. The 3D importance map is then adjusted for different target quality levels using importance adjustment curves. The adjusted 3D importance map is finally converted into a 3D binary mask that determines the essential representation elements for compression. The proposed method can be easily integrated with existing compression models with a negligible amount of overhead, and it can also enable continuously variable-rate compression via simple interpolation of the importance adjustment curves among different quality levels. The extensive experimental results show that the proposed method can achieve comparable compression efficiency to that of separately trained reference compression models and can reduce decoding time owing to the selective compression. The sample codes are publicly available at https://github.com/JooyoungLeeETRI/SCR.
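    The map-adjust-binarize pipeline can be pictured with a few lines of array code. This sketch is purely schematic: the adjustment curve, threshold, and tensor shapes are placeholders, not the learned components from the paper.

```python
import numpy as np

def selection_mask(importance, quality, curve, thresh=0.5):
    # Adjust a 3D importance map for a target quality level, then binarize
    # it to decide which latent elements get encoded.
    adjusted = curve(importance, quality)
    return (adjusted > thresh).astype(np.uint8)

importance = np.random.rand(16, 8, 8)            # toy 3D importance map
mask = selection_mask(importance, quality=3,
                      curve=lambda m, q: m ** (1.0 / q))   # made-up curve
kept = int(mask.sum())                           # latent elements to encode
```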
    ToDD: Topological Compound Fingerprinting in Computer-Aided Drug Discovery. (arXiv:2211.03808v1 [cs.LG])
    In computer-aided drug discovery (CADD), virtual screening (VS) is used for identifying the drug candidates that are most likely to bind to a molecular target in a large library of compounds. Most VS methods to date have focused on using canonical compound representations (e.g., SMILES strings, Morgan fingerprints) or generating alternative fingerprints of the compounds by training progressively more complex variational autoencoders (VAEs) and graph neural networks (GNNs). Although VAEs and GNNs led to significant improvements in VS performance, these methods suffer from reduced performance when scaling to large virtual compound datasets. The performance of these methods has shown only incremental improvements in the past few years. To address this problem, we developed a novel method using multiparameter persistence (MP) homology that produces topological fingerprints of the compounds as multidimensional vectors. Our primary contribution is framing the VS process as a new topology-based graph ranking problem by partitioning a compound into chemical substructures informed by the periodic properties of its atoms and extracting their persistent homology features at multiple resolution levels. We show that the margin loss fine-tuning of pretrained Triplet networks attains highly competitive results in differentiating between compounds in the embedding space and ranking their likelihood of becoming effective drug candidates. We further establish theoretical guarantees for the stability properties of our proposed MP signatures, and demonstrate that our models, enhanced by the MP signatures, outperform state-of-the-art methods on benchmark datasets by a wide and highly statistically significant margin (e.g., 93% gain for Cleves-Jain and 54% gain for DUD-E Diverse dataset).
    Privacy Meets Explainability: A Comprehensive Impact Benchmark. (arXiv:2211.04110v1 [cs.LG])
    Since the mid-2010s, the era of Deep Learning (DL) has continued to this day, bringing forth new superlatives and innovations each year. Nevertheless, the speed with which these innovations translate into real applications lags behind this fast pace. Safety-critical applications, in particular, are subject to strict regulatory and ethical requirements which need to be taken care of and are still active areas of debate. eXplainable AI (XAI) and privacy-preserving machine learning (PPML) are both crucial research fields, aiming at mitigating some of the drawbacks of prevailing data-hungry black-box models in DL. Despite brisk research activity in the respective fields, no attention has yet been paid to their interaction. This work is the first to investigate the impact of private learning techniques on generated explanations for DL-based models. In an extensive experimental analysis covering various image and time series datasets from multiple domains, as well as varying privacy techniques, XAI methods, and model architectures, the effects of private training on generated explanations are studied. The findings suggest non-negligible changes in explanations through the introduction of privacy. Apart from reporting the individual effects of PPML on XAI, the paper gives clear recommendations for the choice of techniques in real applications. By unveiling the interdependencies of these pivotal technologies, this work is a first step towards overcoming the remaining hurdles to practically applicable AI in safety-critical domains.
    Progress and summary of reinforcement learning on energy management of MPS-EV. (arXiv:2211.04001v1 [cs.LG])
    The high emissions and low energy efficiency caused by internal combustion engines (ICEs) have become unacceptable under environmental regulations and the energy crisis. As a promising alternative solution, multi-power source electric vehicles (MPS-EVs) introduce different clean energy systems to improve powertrain efficiency. The energy management strategy (EMS) is a critical technology for MPS-EVs to maximize efficiency, fuel economy, and range. Reinforcement learning (RL) has become an effective methodology for the development of EMS. RL has received continuous attention and research, but there is still a lack of systematic analysis of the design elements of RL-based EMS. To this end, this paper presents an in-depth analysis of the current research on RL-based EMS (RL-EMS) and summarizes the design elements of RL-based EMS. The paper first summarizes previous applications of RL in EMS from five aspects: algorithm, perception scheme, decision scheme, reward function, and innovative training method. The contribution of advanced algorithms to the training effect is shown, the perception and control schemes in the literature are analyzed in detail, different reward function settings are classified, and innovative training methods and their roles are elaborated. By comparing the development routes of RL and RL-EMS, this paper identifies the gap between advanced RL solutions and existing RL-EMS. Finally, it suggests potential development directions for implementing advanced artificial intelligence (AI) solutions in EMS.  ( 2 min )
    Proactive Detractor Detection Framework Based on Message-Wise Sentiment Analysis Over Customer Support Interactions. (arXiv:2211.03923v1 [cs.CL])
    In this work, we propose a framework relying solely on chat-based customer support (CS) interactions for predicting the recommendation decision of individual users. For our case study, we analyzed a total of 16.4k users and 48.7k customer support conversations within the financial vertical of a large e-commerce company in Latin America. Our main contribution is to use Natural Language Processing (NLP) to assess and predict recommendation behavior: in addition to static sentiment analysis, we exploit the predictive power of each user's sentiment dynamics. Our results show that, with interpretable features, it is possible to predict the likelihood that a user will recommend a product or service, based solely on the message-wise sentiment evolution of their CS conversations, in a fully automated way.  ( 2 min )
    Quantum-probabilistic Hamiltonian learning for generative modelling & anomaly detection. (arXiv:2211.03803v1 [quant-ph])
    The Hamiltonian of an isolated quantum mechanical system determines its dynamics and physical behaviour. This study investigates the possibility of learning and utilising a system's Hamiltonian and its variational thermal state estimation for data analysis techniques. For this purpose, we employ the method of Quantum Hamiltonian-Based Models for the generative modelling of simulated Large Hadron Collider data and demonstrate the representability of such data as a mixed state. In a further step, we use the learned Hamiltonian for anomaly detection, showing that different sample types can form distinct dynamical behaviours once treated as a quantum many-body system. We exploit these characteristics to quantify the difference between sample types. Our findings show that the methodologies designed for field theory computations can be utilised in machine learning applications to employ theoretical approaches in data analysis techniques.  ( 2 min )
    Posterior samples of source galaxies in strong gravitational lenses with score-based priors. (arXiv:2211.03812v1 [astro-ph.IM])
    Inferring accurate posteriors for high-dimensional representations of the brightness of gravitationally-lensed sources is a major challenge, in part due to the difficulties of accurately quantifying the priors. Here, we report the use of a score-based model to encode the prior for the inference of undistorted images of background galaxies. This model is trained on a set of high-resolution images of undistorted galaxies. By adding the likelihood score to the prior score and using a reverse-time stochastic differential equation solver, we obtain samples from the posterior. Our method produces independent posterior samples and models the data almost down to the noise level. We show how the balance between the likelihood and the prior meets our expectations in an experiment with out-of-distribution data.  ( 2 min )
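    As a toy illustration of combining a prior score with a likelihood score, the sketch below runs Langevin dynamics on a linear-Gaussian inverse problem. An analytic standard-normal prior stands in for the trained score network, and plain Langevin sampling stands in for the paper's reverse-time SDE solver; all sizes and noise levels are assumptions.

```python
import numpy as np

# Toy posterior sampling: posterior score = prior score + likelihood score,
# integrated with unadjusted Langevin dynamics. Observation model: y = A x + noise.
rng = np.random.default_rng(0)
d, sigma_y = 2, 0.1
A = rng.normal(size=(d, d))
x_true = rng.normal(size=d)
y = A @ x_true + sigma_y * rng.normal(size=d)

prior_score = lambda x: -x                             # score of N(0, I)
lik_score = lambda x: A.T @ (y - A @ x) / sigma_y**2   # Gaussian likelihood score

x = rng.normal(size=d)
dt = 1e-3
for _ in range(5000):  # Langevin steps targeting the posterior
    score = prior_score(x) + lik_score(x)
    x = x + dt * score + np.sqrt(2 * dt) * rng.normal(size=d)

print("posterior sample:", x, "truth:", x_true)
```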
    Exploration of Convolutional Neural Network Architectures for Large Region Map Automation. (arXiv:2211.03854v1 [cs.CV])
    Deep learning semantic segmentation algorithms have provided improved frameworks for the automated production of Land-Use and Land-Cover (LULC) maps, which significantly increases the frequency of map generation as well as consistency of production quality. In this research, a total of 28 different model variations were examined to improve the accuracy of LULC maps. The experiments were carried out using Landsat 5/7 or Landsat 8 satellite images with the North American Land Change Monitoring System labels. The performance of various CNN and extension combinations was assessed, where VGGNet with an output stride of 4 and a modified U-Net architecture provided the best results. Additional expanded analysis of the generated LULC maps was also provided. Using a deep neural network, this work achieved 92.4% accuracy for 13 LULC classes within southern Manitoba, representing a 15.8% improvement over published results for the NALCMS. Based on the large regions of interest, the higher radiometric resolution of Landsat 8 data resulted in better overall accuracies (88.04%) compared to Landsat 5/7 (80.66%) for 16 LULC classes. This represents an 11.44% and 4.06% increase in overall accuracy compared to previously published NALCMS results, while covering a larger land area and incorporating more LULC classes into the models than other published LULC map automation methods.  ( 2 min )
    Regimes of charged particle dynamics in current sheets: the machine learning approach. (arXiv:2211.03787v1 [physics.plasm-ph])
    Current sheets are spatially localized, almost-1D structures with intense plasma currents. They play a key role in storing magnetic field energy and they separate different plasma populations in planetary magnetospheres, the solar wind, and the solar corona. Current sheets are primary regions for the magnetic field line reconnection responsible for plasma heating and charged particle acceleration. One of the most interesting and widely observed types of 1D current sheets is the rotational discontinuity, which can be force-free or include plasma compression. Theoretical models of such 1D current sheets are based on the assumption of adiabatic motion of ions, i.e. ion adiabatic invariants are conserved. We focus on three current sheet configurations, widely observed in the Earth's magnetopause and magnetotail and in the near-Earth solar wind. The magnetic field in such current sheets is supported by currents carried by transient ions, which exist only when there is a sufficient number of invariants. In this paper, we apply a novel machine learning approach, AI Poincaré, to determine the parametrical domains where adiabatic invariants are conserved. For all three current sheet configurations, these domains are quite narrow and do not cover the entire parametrical range of observed current sheets. We discuss possible interpretations of the obtained results, indicating that 1D current sheets are dynamical rather than static plasma equilibria.  ( 2 min )
    Automatic Change-Point Detection in Time Series via Deep Learning. (arXiv:2211.03860v1 [stat.ML])
    Detecting change-points in data is challenging because of the range of possible types of change and types of behaviour of data when there is no change. Statistically efficient methods for detecting a change will depend on both of these features, and it can be difficult for a practitioner to develop an appropriate detection method for their application of interest. We show how to automatically generate new detection methods based on training a neural network. Our approach is motivated by the fact that many existing tests for the presence of a change-point can be represented by a simple neural network, so a neural network trained with sufficient data should have performance at least as good as these methods. We present theory that quantifies the error rate for such an approach, and how it depends on the amount of training data. Empirical results show that, even with limited training data, its performance is competitive with the standard CUSUM test for detecting a change in mean when the noise is independent and Gaussian, and can substantially outperform it in the presence of auto-correlated or heavy-tailed noise. Our method also shows strong results in detecting and localising changes in activity based on accelerometer data.  ( 2 min )
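    For reference, the CUSUM baseline mentioned above can be written in a few lines; the neural approach replaces this hand-crafted statistic with a trained network applied to the same data. This is a generic textbook form of the statistic, not necessarily the paper's exact normalization.

```python
import numpy as np

# CUSUM statistic for a change in mean: the maximum absolute deviation of
# the centered cumulative sum, normalized by the noise scale.
def cusum_stat(x):
    n = len(x)
    s = np.cumsum(x - x.mean())
    return np.abs(s[:-1]).max() / (x.std(ddof=1) * np.sqrt(n))

rng = np.random.default_rng(1)
no_change = rng.normal(0, 1, 200)
with_change = np.concatenate([rng.normal(0, 1, 100), rng.normal(1.0, 1, 100)])
print(cusum_stat(no_change), cusum_stat(with_change))  # second value is much larger
```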
    EEG-Fest: Few-shot based Attention Network for Driver's Vigilance Estimation with EEG Signals. (arXiv:2211.03878v1 [cs.LG])
    A lack of driver vigilance is the main cause of most vehicle crashes. Electroencephalography (EEG) has been a reliable and efficient tool for estimating driver drowsiness. Even though previous studies have developed accurate and robust vigilance detection algorithms, these methods still face challenges in the following areas: (a) small-sample-size training, (b) anomaly signal detection, and (c) subject-independent classification. In this paper, we propose a generalized few-shot model, namely EEG-Fest, to address the aforementioned drawbacks. The EEG-Fest model can (a) classify a query sample's drowsiness with a few samples, (b) identify whether a query sample is an anomalous signal or not, and (c) achieve subject-independent classification. The proposed algorithm achieves state-of-the-art results on the SEED-VIG dataset and the SADT dataset. The accuracy for the drowsy class reaches 92% and 94% with 1-shot and 5-shot support samples on the SEED-VIG dataset, and 62% and 78% with 1-shot and 5-shot support samples on the SADT dataset.  ( 2 min )
    Uncertainty Quantification for Atlas-Level Cell Type Transfer. (arXiv:2211.03793v1 [q-bio.GN])
    Single-cell reference atlases are large-scale, cell-level maps that capture cellular heterogeneity within an organ using single cell genomics. Given their size and cellular diversity, these atlases serve as high-quality training data for the transfer of cell type labels to new datasets. Such label transfer, however, must be robust to domain shifts in gene expression due to measurement technique, lab specifics and more general batch effects. This requires methods that provide uncertainty estimates on the cell type predictions to ensure correct interpretation. Here, for the first time, we introduce uncertainty quantification methods for cell type classification on single-cell reference atlases. We benchmark four model classes and show that currently used models lack calibration, robustness, and actionable uncertainty scores. Furthermore, we demonstrate how models that quantify uncertainty are better suited to detect unseen cell types in the setting of atlas-level cell type transfer.  ( 2 min )
    Comparative layer-wise analysis of self-supervised speech models. (arXiv:2211.03929v1 [cs.CL])
    Many self-supervised speech models, varying in their pre-training objective, input modality, and pre-training data, have been proposed in the last few years. Despite impressive empirical successes on downstream tasks, we still have a limited understanding of the properties encoded by the models and the differences across models. In this work, we examine the intermediate representations for a variety of recent models. Specifically, we measure acoustic, phonetic, and word-level properties encoded in individual layers, using a lightweight analysis tool based on canonical correlation analysis (CCA). We find that these properties evolve across layers differently depending on the model, and the variations relate to the choice of pre-training objective. We further investigate the utility of our analyses for downstream tasks by comparing the property trends with performance on speech recognition and spoken language understanding tasks. We discover that CCA trends provide reliable guidance to choose layers of interest for downstream tasks and that single-layer performance often matches or improves upon using all layers, suggesting implications for more efficient use of pre-trained models.  ( 2 min )
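    A minimal version of such a CCA-based probe, using synthetic stand-ins for a layer's frame-level representations and the target acoustic or phonetic features (the paper's actual analysis tool and feature sets differ), might look like the sketch below; comparing the resulting score across layers gives the kind of trend the paper studies.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Measure CCA similarity between layer representations X (n_frames x d_layer)
# and target features Y (n_frames x d_target). Data here is synthetic.
rng = np.random.default_rng(0)
n, d_layer, d_target = 500, 64, 8
Y = rng.normal(size=(n, d_target))
X = Y @ rng.normal(size=(d_target, d_layer)) + 0.5 * rng.normal(size=(n, d_layer))

cca = CCA(n_components=d_target).fit(X, Y)
Xc, Yc = cca.transform(X, Y)
# Mean canonical correlation as the layer's similarity score
score = np.mean([np.corrcoef(Xc[:, i], Yc[:, i])[0, 1] for i in range(d_target)])
print(f"CCA similarity: {score:.3f}")
```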
    On the Algorithmic Stability and Generalization of Adaptive Optimization Methods. (arXiv:2211.03970v1 [cs.LG])
    Despite their popularity in deep learning and machine learning in general, the theoretical properties of adaptive optimizers such as Adagrad, RMSProp, Adam or AdamW are not yet fully understood. In this paper, we develop a novel framework to study the stability and generalization of these optimization methods. Based on this framework, we show provable guarantees about such properties that depend heavily on a single parameter $\beta_2$. Our empirical experiments support our claims and provide practical insights into the stability and generalization properties of adaptive optimization methods.  ( 2 min )
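    For readers who want to see where $\beta_2$ enters, here is the standard Adam update written out in NumPy; the paper's stability analysis concerns exactly this second-moment coefficient. The hyperparameter values below are the common defaults, not values from the paper.

```python
import numpy as np

# One Adam step, with beta2 controlling the second-moment averaging that
# scales each coordinate's effective step size.
def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g        # first moment (momentum)
    v = beta2 * v + (1 - beta2) * g**2     # second moment
    m_hat = m / (1 - beta1**t)             # bias correction
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):                    # descend f(w) = ||w - 1||^2
    g = 2 * (w - 1.0)
    w, m, v = adam_step(w, g, m, v, t)
print(w)  # approaches [1, 1, 1]
```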
    FED-CD: Federated Causal Discovery from Interventional and Observational Data. (arXiv:2211.03846v1 [cs.LG])
    Causal discovery, the inference of causal relations from data, is a core task of fundamental importance in all scientific domains, and several new machine learning methods for addressing the causal discovery problem have been proposed recently. However, existing machine learning methods for causal discovery typically require that the data used for inference is pooled and available in a centralized location. In many domains of high practical importance, such as in healthcare, data is only available at local data-generating entities (e.g. hospitals in the healthcare context), and cannot be shared across entities due to, among others, privacy and regulatory reasons. In this work, we address the problem of inferring causal structure - in the form of a directed acyclic graph (DAG) - from a distributed data set that contains both observational and interventional data in a privacy-preserving manner by exchanging updates instead of samples. To this end, we introduce a new federated framework, FED-CD, that enables the discovery of global causal structures both when the set of intervened covariates is the same across decentralized entities, and when the sets of intervened covariates are potentially disjoint. We perform a comprehensive experimental evaluation on synthetic data that demonstrates that FED-CD enables effective aggregation of decentralized data for causal discovery without direct sample sharing, even when the contributing distributed data sets cover disjoint sets of interventions. Effective methods for causal discovery in distributed data sets could significantly advance scientific discovery and knowledge sharing in important settings, for instance, healthcare, in which sharing of data across local sites is difficult or prohibited.  ( 3 min )
    Inferring Class Label Distribution of Training Data from Classifiers: An Accuracy-Augmented Meta-Classifier Attack. (arXiv:2211.04157v1 [cs.LG])
    Property inference attacks against machine learning (ML) models aim to infer properties of the training data that are unrelated to the primary task of the model, and have so far been formulated as binary decision problems, i.e., whether or not the training data have a certain property. However, in industrial and healthcare applications, the proportion of labels in the training data is quite often also considered sensitive information. In this paper we introduce a new type of property inference attack that, unlike the binary decision problems in the literature, aims at inferring the class label distribution of the training data from the parameters of ML classifier models. We propose a method based on shadow training and a meta-classifier trained on the parameters of the shadow classifiers, augmented with the accuracy of the classifiers on auxiliary data. We evaluate the proposed approach for ML classifiers with fully connected neural network architectures. We find that the proposed meta-classifier attack provides a maximum relative improvement of $52\%$ over the state of the art.
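    The shadow-training recipe can be sketched end to end on a toy problem: train many shadow classifiers on data with known label proportions, then regress the proportion from their flattened parameters plus auxiliary accuracy. Everything below (model family, sizes, data generator) is an illustrative assumption rather than the paper's exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

# Toy shadow training for label-distribution inference.
rng = np.random.default_rng(0)
def make_data(p1, n=400):
    y = (rng.random(n) < p1).astype(int)          # hidden class proportion p1
    X = rng.normal(size=(n, 5)) + y[:, None] * 1.5
    return X, y

X_aux, y_aux = make_data(0.5)                     # shared auxiliary set
feats, targets = [], []
for _ in range(200):
    p1 = rng.uniform(0.1, 0.9)                    # hidden property
    clf = LogisticRegression().fit(*make_data(p1))
    acc = clf.score(X_aux, y_aux)                 # accuracy augmentation
    feats.append(np.r_[clf.coef_.ravel(), clf.intercept_, acc])
    targets.append(p1)

meta = Ridge().fit(np.array(feats), np.array(targets))
# meta now predicts a target model's training label distribution from its
# parameters plus its accuracy on the auxiliary set.
```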
    Significance-Based Categorical Data Clustering. (arXiv:2211.03956v1 [cs.LG])
    Although numerous algorithms have been proposed to solve the categorical data clustering problem, how to assess the statistical significance of a set of categorical clusters remains unaddressed. To fill this void, we employ the likelihood ratio test to derive a test statistic that can serve as a significance-based objective function in categorical data clustering. Consequently, a new clustering algorithm is proposed in which the significance-based objective function is optimized via a Monte Carlo search procedure. As a by-product, we can further calculate an empirical $p$-value to assess the statistical significance of a set of clusters and develop an improved gap statistic for estimating the cluster number. Extensive experimental studies suggest that our method is able to achieve comparable performance to state-of-the-art categorical data clustering algorithms. Moreover, the effectiveness of such a significance-based formulation on statistical cluster validation and cluster number estimation is demonstrated through comprehensive empirical results.  ( 2 min )
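    To make the likelihood-ratio idea concrete, a generic G-test-style statistic for a single categorical attribute compares per-cluster category frequencies against pooled frequencies; the paper's actual statistic and its Monte Carlo optimization are more elaborate, so treat this purely as a miniature illustration.

```python
import numpy as np
from collections import Counter

# Multinomial log-likelihood of a set of category counts.
def log_lik(counts):
    n = sum(counts.values())
    return sum(c * np.log(c / n) for c in counts.values() if c > 0)

# Likelihood-ratio statistic: per-cluster fit vs. pooled fit.
def lr_statistic(values, labels):
    pooled = log_lik(Counter(values))
    split = sum(log_lik(Counter(v for v, l in zip(values, labels) if l == k))
                for k in set(labels))
    return 2 * (split - pooled)   # larger => clusters differ more

values = ["a"] * 30 + ["b"] * 10 + ["a"] * 10 + ["b"] * 30
labels = [0] * 40 + [1] * 40
print(lr_statistic(values, labels))  # clearly positive: the clusters differ
```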
    AutoML-based Almond Yield Prediction and Projection in California. (arXiv:2211.03925v1 [cs.LG])
    Almonds are one of the most lucrative products of California, but are also among the most sensitive to climate change. In order to better understand the relationship between climatic factors and almond yield, an automated machine learning framework is used to build a collection of machine learning models. The prediction skill is assessed using historical records. Future projections are derived using 17 downscaled climate outputs. The ensemble mean projection displays almond yield changes under two different climate scenarios, along with two technology development scenarios, where the role of technology development is highlighted. The mean projections and distributions provide insightful results to stakeholders and can be utilized by policymakers for climate adaptation.  ( 2 min )
    NSNet: A General Neural Probabilistic Framework for Satisfiability Problems. (arXiv:2211.03880v1 [cs.AI])
    We present the Neural Satisfiability Network (NSNet), a general neural framework that models satisfiability problems as probabilistic inference and meanwhile exhibits proper explainability. Inspired by the Belief Propagation (BP), NSNet uses a novel graph neural network (GNN) to parameterize BP in the latent space, where its hidden representations maintain the same probabilistic interpretation as BP. NSNet can be flexibly configured to solve both SAT and #SAT problems by applying different learning objectives. For SAT, instead of directly predicting a satisfying assignment, NSNet performs marginal inference among all satisfying solutions, which we empirically find is more feasible for neural networks to learn. With the estimated marginals, a satisfying assignment can be efficiently generated by rounding and executing a stochastic local search. For #SAT, NSNet performs approximate model counting by learning the Bethe approximation of the partition function. Our evaluations show that NSNet achieves competitive results in terms of inference accuracy and time efficiency on multiple SAT and #SAT datasets.  ( 2 min )
    Unsupervised vocal dereverberation with diffusion-based generative models. (arXiv:2211.04124v1 [eess.AS])
    Removing reverb from reverberant music is a necessary technique to clean up audio for downstream music manipulations. Reverberation in music falls into two categories: natural reverb and artificial reverb. Artificial reverb has a wider diversity than natural reverb due to its various parameter setups and reverberation types. However, recent supervised dereverberation methods may fail because they rely on sufficiently diverse and numerous pairs of reverberant observations and retrieved data for training in order to be generalizable to unseen observations during inference. To resolve these problems, we propose an unsupervised method that can remove a general kind of artificial reverb for music without requiring pairs of data for training. The proposed method is based on diffusion models: it initializes the unknown reverberation operator with a conventional signal processing technique and simultaneously refines the estimate with the help of diffusion models. We show through objective and perceptual evaluations that our method outperforms the current leading vocal dereverberation benchmarks.
    Astronomia ex machina: a history, primer, and outlook on neural networks in astronomy. (arXiv:2211.03796v1 [astro-ph.IM])
    In recent years, deep learning has infiltrated every field it has touched, reducing the need for specialist knowledge and automating the process of knowledge discovery from data. This review argues that astronomy is no different, and that we are currently in the midst of a deep learning revolution that is transforming the way we do astronomy. We trace the history of astronomical connectionism from the early days of multilayer perceptrons, through the second wave of convolutional and recurrent neural networks, to the current third wave of self-supervised and unsupervised deep learning. We then predict that we will soon enter a fourth wave of astronomical connectionism, in which finetuned versions of an all-encompassing 'foundation' model will replace expertly crafted deep learning models. We argue that such a model can only be brought about through a symbiotic relationship between astronomy and connectionism, whereby astronomy provides high quality multimodal data to train the foundation model, and in turn the foundation model is used to advance astronomical research.  ( 2 min )
    Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models. (arXiv:2211.03827v1 [cs.LG])
    Tensor decomposition serves as a powerful primitive in statistics and machine learning. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufficient for the power method to recover one of the tensor components. In this paper, we analyze the dynamics of tensor power iteration from random initialization in the overcomplete regime. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true components, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters, even though it takes at least polynomially many steps to converge. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path. Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to the non-proportional limit and polynomially many iterations of the power method.  ( 3 min )
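    The iteration being analyzed is simple to state: starting from a random unit vector, repeatedly apply the tensor and renormalize. A NumPy sketch for a symmetric third-order tensor in the overcomplete regime follows (the dimensions are chosen arbitrarily for illustration).

```python
import numpy as np

# Power iteration on a symmetric 3-tensor T = sum_i a_i (x) a_i (x) a_i:
# repeat x <- T(I, x, x) / ||T(I, x, x)|| from a random start.
rng = np.random.default_rng(0)
d, k = 20, 30                                   # overcomplete: k > d
A = rng.normal(size=(k, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit-norm components
T = np.einsum("ia,ib,ic->abc", A, A, A)

x = rng.normal(size=d)
x /= np.linalg.norm(x)
for t in range(200):
    x = np.einsum("abc,b,c->a", T, x, x)        # apply T(I, x, x)
    x /= np.linalg.norm(x)

# Correlation with the best-aligned true component
print(np.max(np.abs(A @ x)))
```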
    Polite Teacher: Semi-Supervised Instance Segmentation with Mutual Learning and Pseudo-Label Thresholding. (arXiv:2211.03850v1 [cs.CV])
    We present Polite Teacher, a simple yet effective method for the task of semi-supervised instance segmentation. The proposed architecture relies on the Teacher-Student mutual learning framework. To filter out noisy pseudo-labels, we use confidence thresholding for bounding boxes and mask scoring for masks. The approach has been tested with CenterMask, a single-stage anchor-free detector. Tested on the COCO 2017 val dataset, our architecture significantly (approx. +8 pp. in mask AP) outperforms the baseline at different supervision regimes. To the best of our knowledge, this is one of the first works tackling the problem of semi-supervised instance segmentation and the first one devoted to an anchor-free detector.  ( 2 min )
    CoNMix for Source-free Single and Multi-target Domain Adaptation. (arXiv:2211.03876v1 [cs.LG])
    This work introduces the novel task of Source-free Multi-target Domain Adaptation and proposes an adaptation framework comprising Consistency with Nuclear-Norm Maximization and MixUp knowledge distillation (CoNMix) as a solution to this problem. The main motive of this work is to solve Single and Multi-target Domain Adaptation (SMTDA) in the source-free paradigm, which enforces the constraint that labeled source data is not available during target adaptation due to various privacy-related restrictions on data sharing. The source-free approach leverages target pseudo labels, which can be noisy, to improve target adaptation. We introduce consistency between label-preserving augmentations and utilize pseudo label refinement methods to reduce noisy pseudo labels. Further, we propose novel MixUp Knowledge Distillation (MKD) for better generalization on multiple target domains using various source-free STDA models. We also show that the Vision Transformer (VT) backbone gives better feature representation with improved domain transferability and class discriminability. Our proposed framework achieves the state-of-the-art (SOTA) results in various paradigms of source-free STDA and MTDA settings on popular domain adaptation datasets like Office-Home, Office-Caltech, and DomainNet. Project Page: https://sites.google.com/view/conmix-vcl  ( 2 min )
    Challenges and Opportunities in Deep Reinforcement Learning with Graph Neural Networks: A Comprehensive review of Algorithms and Applications. (arXiv:2206.07922v2 [cs.LG] UPDATED)
    Deep reinforcement learning (DRL) has empowered a variety of artificial intelligence fields, including pattern recognition, robotics, recommendation systems, and gaming. Similarly, graph neural networks (GNN) have also demonstrated their superior performance in supervised learning for graph-structured data. In recent times, the fusion of GNN with DRL for graph-structured environments has attracted a lot of attention. This paper provides a comprehensive review of these hybrid works. These works can be classified into two categories: (1) algorithmic enhancement, where DRL and GNN complement each other for better utility; (2) application-specific enhancement, where DRL and GNN support each other. This fusion effectively addresses various complex problems in engineering and life sciences. Based on the review, we further analyze the applicability and benefits of fusing these two domains, especially in terms of increasing generalizability and reducing computational complexity. Finally, the key challenges in integrating DRL and GNN, and potential future research directions are highlighted, which will be of interest to the broader machine learning community.  ( 2 min )
    Complexity of High-Dimensional Identity Testing with Coordinate Conditional Sampling. (arXiv:2207.09102v2 [cs.DS] UPDATED)
    We study the identity testing problem for high-dimensional distributions. Given as input an explicit distribution $\mu$, an $\varepsilon>0$, and access to sampling oracle(s) for a hidden distribution $\pi$, the goal in identity testing is to distinguish whether the two distributions $\mu$ and $\pi$ are identical or are at least $\varepsilon$-far apart. When there is only access to full samples from the hidden distribution $\pi$, it is known that exponentially many samples (in the dimension) may be needed for identity testing, and hence previous works have studied identity testing with additional access to various "conditional" sampling oracles. We consider a significantly weaker conditional sampling oracle, which we call the $\mathsf{Coordinate\ Oracle}$, and provide a computational and statistical characterization of the identity testing problem in this new model. We prove that if an analytic property known as approximate tensorization of entropy holds for an $n$-dimensional visible distribution $\mu$, then there is an efficient identity testing algorithm for any hidden distribution $\pi$ using $\tilde{O}(n/\varepsilon)$ queries to the $\mathsf{Coordinate\ Oracle}$. Approximate tensorization of entropy is a pertinent condition as recent works have established it for a large class of high-dimensional distributions. We also prove a computational phase transition: for a well-studied class of $n$-dimensional distributions, specifically sparse antiferromagnetic Ising models over $\{+1,-1\}^n$, we show that in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless $\mathsf{RP}=\mathsf{NP}$. We complement our results with a matching $\Omega(n/\varepsilon)$ statistical lower bound for the sample complexity of identity testing in the $\mathsf{Coordinate\ Oracle}$ model.  ( 3 min )
    On-Device Domain Generalization. (arXiv:2209.07521v2 [cs.CV] UPDATED)
    We present a systematic study of domain generalization (DG) for tiny neural networks. This problem is critical to on-device machine learning applications but has been overlooked in the literature where research has been merely focused on large models. Tiny neural networks have much fewer parameters and lower complexity and therefore should not be trained the same way as their large counterparts for DG applications. By conducting extensive experiments, we find that knowledge distillation (KD), a well-known technique for model compression, is much better for tackling the on-device DG problem than conventional DG methods. Another interesting observation is that the teacher-student gap on out-of-distribution data is bigger than that on in-distribution data, which highlights the capacity mismatch issue as well as the shortcoming of KD. We further propose a method called out-of-distribution knowledge distillation (OKD) where the idea is to teach the student how the teacher handles out-of-distribution data synthesized via disruptive data augmentation. Without adding any extra parameter to the model -- hence keeping the deployment cost unchanged -- OKD significantly improves DG performance for tiny neural networks in a variety of on-device DG scenarios for image and speech applications. We also contribute a scalable approach for synthesizing visual domain shifts, along with a new suite of DG datasets to complement existing testbeds.  ( 2 min )
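    Plain knowledge distillation, the baseline that OKD builds on, reduces to a temperature-softened KL term mixed with the usual cross-entropy; OKD additionally applies distillation on synthesized out-of-distribution inputs, which this sketch omits. The temperature and mixing weight below are illustrative, not values from the paper.

```python
import torch
import torch.nn.functional as F

# Standard knowledge-distillation loss: softened teacher/student KL plus
# hard-label cross-entropy. T and alpha are illustrative hyperparameters.
def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
kd_loss(student_logits, teacher_logits, labels).backward()
```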
    Do-Operation Guided Causal Representation Learning with Reduced Supervision Strength. (arXiv:2206.01802v2 [cs.LG] UPDATED)
    Causal representation learning has been proposed to encode relationships between factors present in high-dimensional data. However, existing methods suffer from merely using a large amount of labeled data and ignore the fact that samples generated by the same causal mechanism follow the same causal relationships. In this paper, we seek to explore such information by leveraging do-operation to reduce supervision strength. We propose a framework that implements do-operation by swapping latent cause and effect factors encoded from a pair of inputs. Moreover, we also identify the inadequacy of existing causal representation metrics empirically and theoretically and introduce new metrics for better evaluation. Experiments conducted on both synthetic and real datasets demonstrate the superiority of our method compared with state-of-the-art methods.  ( 2 min )
    Adaptive Asynchronous Control using Meta-learned Neural Ordinary Differential Equations. (arXiv:2207.12062v2 [cs.LG] UPDATED)
    Model-based Reinforcement Learning and Control have demonstrated great potential in various sequential decision making problem domains, including in robotics settings. However, real-world robotics systems often present challenges that limit the applicability of those methods. In particular, we note two problems that jointly happen in many industrial systems: 1) Irregular/asynchronous observations and actions and 2) Dramatic changes in environment dynamics from an episode to another (e.g. varying payload inertial properties). We propose a general framework that overcomes those difficulties by meta-learning adaptive dynamics models for continuous-time prediction and control. We evaluate the proposed approach on two robotic simulations, the first of which is inspired by a real-world industrial robot.  ( 2 min )
    Automatic Semantic Segmentation of the Lumbar Spine: Clinical Applicability in a Multi-parametric and Multi-centre Study on Magnetic Resonance Images. (arXiv:2111.08712v3 [eess.IV] UPDATED)
    One of the major difficulties in medical image segmentation is the high variability of these images, which is caused by their origin (multi-centre), the acquisition protocols (multi-parametric), as well as the variability of human anatomy, the severity of the illness, the effect of age and gender, among others. The problem addressed in this work is the automatic semantic segmentation of lumbar spine Magnetic Resonance images using convolutional neural networks. The purpose is to assign a class label to each pixel of an image. Classes were defined by radiologists and correspond to different structural elements like vertebrae, intervertebral discs, nerves, blood vessels, and other tissues. The proposed network topologies are variants of the U-Net architecture. Several complementary blocks were used to define the variants: Three types of convolutional blocks, spatial attention models, deep supervision and multilevel feature extractor. This document describes the topologies and analyses the results of the neural network designs that obtained the most accurate segmentations. Several of the proposed designs outperform the standard U-Net used as baseline, especially when used in ensembles where the output of multiple neural networks is combined according to different strategies.  ( 3 min )
    RobustLR: Evaluating Robustness to Logical Perturbation in Deductive Reasoning. (arXiv:2205.12598v2 [cs.CL] UPDATED)
    Transformers have been shown to be able to perform deductive reasoning on a logical rulebase containing rules and statements written in English natural language. While the progress is promising, it is currently unclear if these models indeed perform logical reasoning by understanding the underlying logical semantics in the language. To this end, we propose RobustLR, a suite of evaluation datasets that evaluate the robustness of these models to minimal logical edits in rulebases and some standard logical equivalence conditions. In our experiments with RoBERTa and T5, we find that the models trained in prior works do not perform consistently on the different perturbations in RobustLR, thus showing that the models are not robust to the proposed logical perturbations. Further, we find that the models find it especially hard to learn logical negation and disjunction operators. Overall, using our evaluation sets, we demonstrate some shortcomings of the deductive reasoning-based language models, which can eventually help towards designing better models for logical reasoning over natural language. All the datasets and code base have been made publicly available.  ( 2 min )
    Sparse Graph Learning from Spatiotemporal Time Series. (arXiv:2205.13492v2 [cs.LG] UPDATED)
    Outstanding achievements of graph neural networks for spatiotemporal time series analysis show that relational constraints introduce an effective inductive bias into neural forecasting architectures. Often, however, the relational information characterizing the underlying data-generating process is unavailable and the practitioner is left with the problem of inferring from data which relational graph to use in the subsequent processing stages. We propose novel, principled - yet practical - probabilistic score-based methods that learn the relational dependencies as distributions over graphs while maximizing end-to-end the performance at task. The proposed graph learning framework is based on consolidated variance reduction techniques for Monte Carlo score-based gradient estimation, is theoretically grounded, and, as we show, effective in practice. In this paper, we focus on the time series forecasting problem and show that, by tailoring the gradient estimators to the graph learning problem, we are able to achieve state-of-the-art performance while controlling the sparsity of the learned graph and the computational scalability. We empirically assess the effectiveness of the proposed method on synthetic and real-world benchmarks, showing that the proposed solution can be used as a stand-alone graph identification procedure as well as a graph learning component of an end-to-end forecasting architecture.  ( 2 min )
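    The score-based gradient at the heart of this family of methods can be demonstrated on a toy graph distribution: independent Bernoulli edges, with the REINFORCE estimator loss * grad log p driving the edge probabilities. The paper adds variance-reduction baselines on top, which this bare, high-variance sketch omits; the task loss here is a placeholder.

```python
import numpy as np

# Score-function (REINFORCE) gradient for a distribution over graphs:
# each edge is an independent Bernoulli(theta_ij).
rng = np.random.default_rng(0)
n = 4
theta = np.full((n, n), 0.5)
target = (rng.random((n, n)) < 0.3).astype(float)    # toy "good" graph

def loss(adj):                                       # toy task loss
    return np.abs(adj - target).sum()

lr = 0.05
for step in range(2000):
    adj = (rng.random((n, n)) < theta).astype(float)  # sample a graph
    # gradient of log Bernoulli pmf w.r.t. theta
    grad_logp = adj / theta - (1 - adj) / (1 - theta)
    theta -= lr * loss(adj) * grad_logp / (n * n)     # descent on E[loss]
    theta = np.clip(theta, 0.01, 0.99)
# theta drifts toward edge probabilities matching the target graph.
```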
    Sparse Mixture-of-Experts are Domain Generalizable Learners. (arXiv:2206.04046v5 [cs.CV] UPDATED)
    Human visual perception can easily generalize to out-of-distributed visual data, which is far beyond the capability of modern machine learning models. Domain generalization (DG) aims to close this gap, with existing DG methods mainly focusing on the loss function design. In this paper, we propose to explore an orthogonal direction, i.e., the design of the backbone architecture. It is motivated by an empirical finding that transformer-based models trained with empirical risk minimization (ERM) outperform CNN-based models employing state-of-the-art (SOTA) DG algorithms on multiple DG datasets. We develop a formal framework to characterize a network's robustness to distribution shifts by studying its architecture's alignment to the correlations in the dataset. This analysis guides us to propose a novel DG model built upon vision transformers, namely Generalizable Mixture-of-Experts (GMoE). Extensive experiments on DomainBed demonstrate that GMoE trained with ERM outperforms SOTA DG baselines by a large margin. Moreover, GMoE is complementary to existing DG methods and its performance is substantially improved when trained with DG algorithms.  ( 2 min )
    Joint Continuous and Discrete Model Selection via Submodularity. (arXiv:2102.09029v3 [math.OC] UPDATED)
    In model selection problems for machine learning, the desire for a well-performing model with meaningful structure is typically expressed through a regularized optimization problem. In many scenarios, however, the meaningful structure is specified in some discrete space, leading to difficult nonconvex optimization problems. In this paper, we connect the model selection problem with structure-promoting regularizers to submodular function minimization with continuous and discrete arguments. In particular, we leverage the theory of submodular functions to identify a class of these problems that can be solved exactly and efficiently with an agnostic combination of discrete and continuous optimization routines. We show how simple continuous or discrete constraints can also be handled for certain problem classes and extend these ideas to a robust optimization framework. We also show how some problems outside of this class can be embedded within the class, further extending the class of problems our framework can accommodate. Finally, we numerically validate our theoretical results with several proof-of-concept examples with synthetic and real-world data, comparing against state-of-the-art algorithms.  ( 2 min )
    Policy evaluation from a single path: Multi-step methods, mixing and mis-specification. (arXiv:2211.03899v1 [stat.ML])
    We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture its dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.
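    For readers unfamiliar with the TD(λ) family analyzed here, a tabular version with accumulating eligibility traces on a single trajectory takes a few lines of NumPy; the paper's kernel-based estimators generalize this beyond the tabular setting. The MRP below is randomly generated, and all step-size choices are illustrative.

```python
import numpy as np

# Tabular TD(lambda) on a single trajectory from a small Markov reward process.
rng = np.random.default_rng(0)
n_states, gamma, lam, alpha = 5, 0.9, 0.5, 0.05
P = rng.dirichlet(np.ones(n_states), size=n_states)   # random transition matrix
r = rng.normal(size=n_states)                          # state rewards

V = np.zeros(n_states)
e = np.zeros(n_states)
s = 0
for t in range(50000):                                 # one long trajectory
    s_next = rng.choice(n_states, p=P[s])
    delta = r[s] + gamma * V[s_next] - V[s]            # TD error
    e = gamma * lam * e
    e[s] += 1.0                                        # accumulating trace
    V += alpha * delta * e
    s = s_next

V_true = np.linalg.solve(np.eye(n_states) - gamma * P, r)
print(np.abs(V - V_true).max())                        # estimation error
```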
    EGRU: Event-based GRU for activity-sparse inference and learning. (arXiv:2206.06178v2 [cs.LG] UPDATED)
    The scalability of recurrent neural networks (RNNs) is hindered by the sequential dependence of each time step's computation on the previous time step's output. Therefore, one way to speed up and scale RNNs is to reduce the computation required at each time step independent of model size and task. In this paper, we propose a model that reformulates Gated Recurrent Units (GRU) as an event-based activity-sparse model that we call the Event-based GRU (EGRU), where units compute updates only on receipt of input events (event-based) from other units. When combined with having only a small fraction of the units active at a time (activity-sparse), this model has the potential to be vastly more compute efficient than current RNNs. Notably, activity-sparsity in our model also translates into sparse parameter updates during gradient descent, extending this compute efficiency to the training phase. We show that the EGRU demonstrates competitive performance compared to state-of-the-art recurrent network models in real-world tasks, including language modeling while maintaining high activity sparsity naturally during inference and training. This sets the stage for the next generation of recurrent networks that are scalable and more suitable for novel neuromorphic hardware.  ( 2 min )
    Much Easier Said Than Done: Falsifying the Causal Relevance of Linear Decoding Methods. (arXiv:2211.04367v1 [cs.LG])
    Linear classifier probes are frequently utilized to better understand how neural networks function. Researchers have approached the problem of determining unit importance in neural networks by probing their learned, internal representations. Linear classifier probes identify highly selective units as the most important for network function. Whether or not a network actually relies on high selectivity units can be tested by removing them from the network using ablation. Surprisingly, when highly selective units are ablated they only produce small performance deficits, and even then only in some cases. In spite of the absence of ablation effects for selective neurons, linear decoding methods can be effectively used to interpret network function, leaving their effectiveness a mystery. To falsify the exclusive role of selectivity in network function and resolve this contradiction, we systematically ablate groups of units in subregions of activation space. Here, we find a weak relationship between neurons identified by probes and those identified by ablation. More specifically, we find that an interaction between selectivity and the average activity of the unit better predicts ablation performance deficits for groups of units in AlexNet, VGG16, MobileNetV2, and ResNet101. Linear decoders are likely somewhat effective because they overlap with those units that are causally important for network function. Interpretability methods could be improved by focusing on causally important units.  ( 3 min )
    Algorithmic Bias in Machine Learning Based Delirium Prediction. (arXiv:2211.04442v1 [cs.LG])
    Although prediction models for delirium, a commonly occurring condition during general hospitalization or post-surgery, have not gained huge popularity, their algorithmic bias evaluation is crucial due to the existing association between social determinants of health and delirium risk. In this context, using MIMIC-III and another academic hospital dataset, we present some initial experimental evidence showing how sociodemographic features such as sex and race can impact the model performance across subgroups. With this work, our intent is to initiate a discussion about the intersectionality effects of old age, race and socioeconomic factors on the early-stage detection and prevention of delirium using ML.  ( 2 min )
    Accurate detection of sepsis at ED triage using machine learning with clinical natural language processing. (arXiv:2204.07657v4 [cs.LG] UPDATED)
    Sepsis is a life-threatening condition with organ dysfunction and is a leading cause of death and critical illness worldwide. Accurate detection of sepsis during emergency department triage would allow early initiation of lab analysis, antibiotic administration, and other sepsis treatment protocols. The purpose of this study was to determine whether EHR data can be extracted and synthesized with the latest machine learning algorithms (KATE Sepsis) and clinical natural language processing to produce accurate sepsis models, and compare KATE Sepsis performance with existing sepsis screening protocols, such as SIRS and qSOFA. A machine learning model (KATE Sepsis) was developed using patient encounters with triage data from 16 participating hospitals. KATE Sepsis, SIRS, standard screening (SIRS with source of infection) and qSOFA were tested in three settings. Cohort-A was a retrospective analysis on medical records from a single Site 1. Cohort-B was a prospective analysis of Site 1. Cohort-C was a retrospective analysis on Site 1 with 15 additional sites. Across all cohorts, KATE Sepsis demonstrates an AUC of 0.94-0.963 with 73-74.87% TPR and 3.76-7.17% FPR. Standard screening demonstrates an AUC of 0.682-0.726 with 39.39-51.19% TPR and 2.9-6.02% FPR. The qSOFA protocol demonstrates an AUC of 0.544-0.56, with 10.52-13.18% TPR and 1.22-1.68% FPR. For severe sepsis, across all cohorts, KATE Sepsis demonstrates an AUC of 0.935-0.972 with 70-82.26% TPR and 4.64-8.62% FPR. For septic shock, across all cohorts, KATE Sepsis demonstrates an AUC of 0.96-0.981 with 85.71-89.66% TPR and 4.85-8.8% FPR. SIRS, standard screening, and qSOFA demonstrate low AUC and TPR for severe sepsis and septic shock detection. KATE Sepsis provided substantially better sepsis detection performance in triage than commonly used screening protocols.  ( 3 min )
    Sensitivity Estimation for Dark Matter Subhalos in Synthetic Gaia DR2 using Deep Learning. (arXiv:2203.08161v2 [astro-ph.GA] UPDATED)
    The abundance of dark matter (DM) subhalos orbiting a host galaxy is a generic prediction of the cosmological framework, and is a promising way to constrain the nature of DM. In this paper, we investigate the use of machine learning-based tools to quantify the magnitude of phase-space perturbations caused by the passage of DM subhalos. A simple binary classifier and an anomaly detection model are proposed to estimate if stars or star particles close to DM subhalos are statistically detectable in simulations. The simulated datasets are three Milky Way-like galaxies and nine synthetic Gaia DR2 surveys derived from these. Firstly, we find that the anomaly detection algorithm, trained on a simulated galaxy with full 6D kinematic observables and applied on another galaxy, is nontrivially sensitive to the DM subhalo population. On the other hand, the classification-based approach is not sufficiently sensitive due to the extremely low statistics of signal stars for supervised training. Finally, the sensitivity of both algorithms in the Gaia-like surveys is negligible. The enormous size of the Gaia dataset motivates the further development of scalable and accurate data analysis methods that could be used to select potential regions of interest for DM searches to ultimately constrain the Milky Way's subhalo mass function, as well as simulations where to study the sensitivity of such methods under different signal hypotheses.  ( 3 min )
    Tensor-based Intrinsic Subspace Representation Learning for Multi-view Clustering. (arXiv:2010.09193v7 [cs.LG] UPDATED)
    As a hot research topic, many multi-view clustering approaches have been proposed over the past few years. Nevertheless, most existing algorithms merely take the consensus information among different views into consideration for clustering. In practice, this may hinder multi-view clustering performance in real-life applications, since different views usually contain diverse statistic properties. To address this problem, we propose a novel Tensor-based Intrinsic Subspace Representation Learning (TISRL) for multi-view clustering in this paper. Concretely, the rank preserving decomposition is proposed firstly to effectively deal with the diverse statistic information contained in different views. Then, to achieve the intrinsic subspace representation, the tensor-singular value decomposition based low-rank tensor constraint is also utilized in our method. It can be seen that specific information contained in different views is fully investigated by the rank preserving decomposition, and the high-order correlations of multi-view data are also mined by the low-rank tensor constraint. The objective function can be optimized by an augmented Lagrangian multiplier based alternating direction minimization algorithm. Experimental results on nine commonly used real-world multi-view datasets illustrate the superiority of TISRL.  ( 3 min )
    Stochastic Coded Federated Learning: Theoretical Analysis and Incentive Mechanism Design. (arXiv:2211.04132v1 [cs.DC])
    Federated learning (FL) has achieved great success as a privacy-preserving distributed training paradigm, where many edge devices collaboratively train a machine learning model by sharing the model updates instead of the raw data with a server. However, the heterogeneous computational and communication resources of edge devices give rise to stragglers that significantly decelerate the training process. To mitigate this issue, we propose a novel FL framework named stochastic coded federated learning (SCFL) that leverages coded computing techniques. In SCFL, before the training process starts, each edge device uploads a privacy-preserving coded dataset to the server, which is generated by adding Gaussian noise to the projected local dataset. During training, the server computes gradients on the global coded dataset to compensate for the missing model updates of the straggling devices. We design a gradient aggregation scheme to ensure that the aggregated model update is an unbiased estimate of the desired global update. Moreover, this aggregation scheme enables periodical model averaging to improve the training efficiency. We characterize the tradeoff between the convergence performance and privacy guarantee of SCFL. In particular, a more noisy coded dataset provides stronger privacy protection for edge devices but results in learning performance degradation. We further develop a contract-based incentive mechanism to coordinate such a conflict. The simulation results show that SCFL learns a better model within the given time and achieves a better privacy-performance tradeoff than the baseline methods. In addition, the proposed incentive mechanism grants better training performance than the conventional Stackelberg game approach.  ( 3 min )
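    The coded-dataset step can be sketched in isolation: each device uploads a random linear combination of its local samples with added Gaussian noise. The projection width and noise scale below are illustrative knobs for the privacy-utility tradeoff the paper formalizes, not values from the paper.

```python
import numpy as np

# Sketch of a privacy-preserving coded dataset: random linear coding of the
# local data plus Gaussian noise, uploaded once before training starts.
rng = np.random.default_rng(0)
n_local, d, n_coded, noise_std = 1000, 20, 50, 0.5

X_local = rng.normal(size=(n_local, d))
y_local = rng.normal(size=n_local)

G = rng.normal(size=(n_coded, n_local)) / np.sqrt(n_local)  # coding matrix
X_coded = G @ X_local + noise_std * rng.normal(size=(n_coded, d))
y_coded = G @ y_local + noise_std * rng.normal(size=n_coded)

# The server can compute surrogate gradients on (X_coded, y_coded) to fill
# in for straggling devices; more noise => stronger privacy, worse utility.
```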
    Motif-guided Time Series Counterfactual Explanations. (arXiv:2211.04411v1 [cs.LG])
    With the rising need for interpretable machine learning methods, increasing human effort is required to provide diverse explanations of the factors influencing model decisions. To improve the trust and transparency of AI-based systems, the EXplainable Artificial Intelligence (XAI) field has emerged. The XAI paradigm is bifurcated into two main categories: feature attribution and counterfactual explanation methods. While feature attribution methods are based on explaining the reason behind a model decision, counterfactual explanation methods discover the smallest input changes that will result in a different decision. In this paper, we aim at building trust and transparency in time series models by using motifs to generate counterfactual explanations. We propose Motif-Guided Counterfactual Explanation (MG-CF), a novel model that generates intuitive post-hoc counterfactual explanations that make full use of important motifs to provide interpretive information in decision-making processes. To the best of our knowledge, this is the first effort that leverages motifs to guide counterfactual explanation generation. We validated our model using five real-world time-series datasets from the UCR repository. Our experimental results show the superiority of MG-CF in balancing all the desirable counterfactual explanation properties in comparison with other competing state-of-the-art baselines.  ( 2 min )
    MIMO Channel Estimation using Score-Based Generative Models. (arXiv:2204.07122v2 [eess.SP] UPDATED)
    Channel estimation is a critical task in multiple-input multiple-output (MIMO) digital communications that substantially affects end-to-end system performance. In this work, we introduce a novel approach for channel estimation using deep score-based generative models. A model is trained to estimate the gradient of the logarithm of a distribution and is used to iteratively refine estimates given measurements of a signal. We introduce a framework for training score-based generative models for wireless MIMO channels and performing channel estimation based on posterior sampling at test time. We derive theoretical robustness guarantees for channel estimation with posterior sampling in single-input single-output scenarios, and experimentally verify performance in the MIMO setting. Our results in simulated channels show competitive in-distribution performance, and robust out-of-distribution performance, with gains of up to $5$ dB in end-to-end coded communication performance compared to supervised deep learning methods. Simulations on the number of pilots show that high fidelity channel estimation with $25$% pilot density is possible for MIMO channel sizes of up to $64 \times 256$. Complexity analysis reveals that model size can efficiently trade performance for estimation latency, and that the proposed approach is competitive with compressed sensing in terms of floating-point operation (FLOP) count.  ( 2 min )
    SLATE: A Sequence Labeling Approach for Task Extraction from Free-form Inked Content. (arXiv:2211.04454v1 [cs.CL])
    We present SLATE, a sequence labeling approach for extracting tasks from free-form content such as digitally handwritten (or "inked") notes on a virtual whiteboard. Our approach allows us to create a single, low-latency model to simultaneously perform sentence segmentation and classification of these sentences into task/non-task sentences. SLATE greatly outperforms a baseline two-model (sentence segmentation followed by classification model) approach, achieving a task F1 score of 84.4%, a sentence segmentation (boundary similarity) score of 88.4% and three times lower latency compared to the baseline. Furthermore, we provide insights into tackling challenges of performing NLP on the inking domain. We release both our code and dataset for this novel task.  ( 2 min )
  • Open

    Causal Discovery in Linear Structural Causal Models with Deterministic Relations. (arXiv:2111.00341v2 [cs.LG] UPDATED)
    Linear structural causal models (SCMs) -- in which each observed variable is generated by a subset of the other observed variables as well as a subset of the exogenous sources -- are pervasive in causal inference and causal discovery. However, for the task of causal discovery, existing work almost exclusively focuses on the submodel where each observed variable is associated with a distinct source with non-zero variance. This results in the restriction that no observed variable can deterministically depend on other observed variables or latent confounders. In this paper, we extend the results on structure learning by focusing on a subclass of linear SCMs which do not have this property, i.e., models in which observed variables can be causally affected by any subset of the sources, and are allowed to be a deterministic function of other observed variables or latent confounders. This allows for a more realistic modeling of influence or information propagation in systems. We focus on the task of causal discovery from observational data generated by a member of this subclass. We derive a set of necessary and sufficient conditions for unique identifiability of the causal structure. To the best of our knowledge, this is the first work that gives identifiability results for causal discovery under both latent confounding and deterministic relationships. Further, we propose an algorithm for recovering the underlying causal structure when the aforementioned conditions are satisfied. We validate our theoretical results both on synthetic and real datasets.  ( 3 min )
    A Hypergraph-Based Machine Learning Ensemble Network Intrusion Detection System. (arXiv:2211.03933v1 [cs.CR])
    Network intrusion detection systems (NIDS) for detecting malicious attacks continue to face challenges. NIDS are vulnerable to auto-generated port scan infiltration attempts, and NIDS are often developed offline, resulting in a time lag that allows infiltration to spread to other parts of a network. To address these challenges, we use hypergraphs to capture evolving patterns of port scan attacks via the set of internet protocol addresses and destination ports, thereby deriving a set of hypergraph-based metrics to train a robust and resilient ensemble machine learning (ML) NIDS that effectively monitors and detects port scanning activities and adversarial intrusions while evolving intelligently in real-time. Through the combination of (1) intrusion examples, (2) NIDS update rules, (3) attack threshold choices to trigger NIDS retraining requests, and (4) a production environment with no prior knowledge of the nature of network traffic, 40 scenarios were auto-generated to evaluate the ML ensemble NIDS comprising three tree-based models. Results show that under the model settings of an Update-ALL-NIDS rule (namely, retrain and update all three models upon the same NIDS retraining request) the proposed ML ensemble NIDS produced the best results, with nearly 100% detection performance throughout the simulation, exhibiting robustness in the complex dynamics of the simulated cyber-security scenario.  ( 2 min )
    Gaining Outlier Resistance with Progressive Quantiles: Fast Algorithms and Theoretical Studies. (arXiv:2112.08471v2 [stat.ME] UPDATED)
    Outliers widely occur in big-data applications and may severely affect statistical estimation and inference. In this paper, a framework of outlier-resistant estimation is introduced to robustify an arbitrarily given loss function. It has a close connection to the method of trimming and includes explicit outlyingness parameters for all samples, which in turn facilitates computation, theory, and parameter tuning. To tackle the issues of nonconvexity and nonsmoothness, we develop scalable algorithms that are easy to implement and have guaranteed fast convergence. In particular, a new technique is proposed to alleviate the requirement on the starting point such that on regular datasets, the number of data resamplings can be substantially reduced. Based on combined statistical and computational treatments, we are able to perform nonasymptotic analysis beyond M-estimation. The obtained resistant estimators, though not necessarily globally or even locally optimal, enjoy minimax rate optimality in both low dimensions and high dimensions. Experiments in regression, classification, and neural networks show excellent performance of the proposed methodology in the presence of gross outliers.
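    The trimming connection can be illustrated with a simple alternating scheme: fit on the currently trusted points, then re-flag the samples with the largest residuals as outlying. This is a hedged stand-in for the paper's outlyingness-parameter framework, not its actual algorithm:

        import numpy as np

        def trimmed_least_squares(X, y, n_outliers, n_iter=20):
            # Alternate between (1) OLS on the trusted points and
            # (2) flagging the n_outliers largest residuals as outlying.
            keep = np.ones(len(y), dtype=bool)
            for _ in range(n_iter):
                beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
                r = np.abs(y - X @ beta)
                keep = np.zeros(len(y), dtype=bool)
                keep[np.argsort(r)[: len(y) - n_outliers]] = True
            return beta, keep

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 3))
        beta_true = np.array([2.0, -1.0, 0.5])
        y = X @ beta_true + 0.1 * rng.normal(size=200)
        y[:20] += 10.0                      # gross outliers
        beta, keep = trimmed_least_squares(X, y, n_outliers=20)
        print(beta)                         # close to beta_true despite the outliers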
    Sensitivity Estimation for Dark Matter Subhalos in Synthetic Gaia DR2 using Deep Learning. (arXiv:2203.08161v2 [astro-ph.GA] UPDATED)
    The abundance of dark matter (DM) subhalos orbiting a host galaxy is a generic prediction of the cosmological framework, and is a promising way to constrain the nature of DM. In this paper, we investigate the use of machine learning-based tools to quantify the magnitude of phase-space perturbations caused by the passage of DM subhalos. A simple binary classifier and an anomaly detection model are proposed to estimate if stars or star particles close to DM subhalos are statistically detectable in simulations. The simulated datasets are three Milky Way-like galaxies and nine synthetic Gaia DR2 surveys derived from these. Firstly, we find that the anomaly detection algorithm, trained on a simulated galaxy with full 6D kinematic observables and applied on another galaxy, is nontrivially sensitive to the DM subhalo population. On the other hand, the classification-based approach is not sufficiently sensitive due to the extremely low statistics of signal stars for supervised training. Finally, the sensitivity of both algorithms in the Gaia-like surveys is negligible. The enormous size of the Gaia dataset motivates the further development of scalable and accurate data analysis methods that could be used to select potential regions of interest for DM searches, to ultimately constrain the Milky Way's subhalo mass function, as well as simulations with which to study the sensitivity of such methods under different signal hypotheses.  ( 3 min )
    Explaining Preferences with Shapley Values. (arXiv:2205.13662v2 [stat.ML] UPDATED)
    While preference modelling is becoming one of the pillars of machine learning, the problem of preference explanation remains challenging and underexplored. In this paper, we propose \textsc{Pref-SHAP}, a Shapley value-based model explanation framework for pairwise comparison data. We derive the appropriate value functions for preference models and further extend the framework to model and explain \emph{context specific} information, such as the surface type in a tennis game. To demonstrate the utility of \textsc{Pref-SHAP}, we apply our method to a variety of synthetic and real-world datasets and show that richer and more insightful explanations can be obtained over the baseline.  ( 2 min )
    Complex-to-Real Random Features for Polynomial Kernels. (arXiv:2202.02031v3 [stat.ML] UPDATED)
    Polynomial kernels are among the most popular kernels in machine learning, since their feature maps model the interactions between the dimensions of the input data. However, these features correspond to tensor products of the input with itself, which makes their dimension grow exponentially with the polynomial degree. We address this issue by proposing Complex-to-Real (CtR) sketches for tensor products that can be used as random feature approximations of polynomial kernels. These sketches leverage intermediate complex random projections, leading to better theoretical guarantees and potentially much lower variances than analogs using real projections. Our sketches are simple to construct and their final output is real-valued, which makes their downstream use straightforward. Finally, we show that they achieve state-of-the-art performance in terms of accuracy and speed.  ( 2 min )
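    The core idea -- complex projections whose products yield a real-valued, unbiased kernel estimate -- can be sketched with the plain product of unit-modulus complex projections. The paper's actual CtR sketches are more structured; everything below is an illustrative simplification:

        import numpy as np

        def ctr_poly_features(X, degree, D, rng):
            # Each feature is a product of `degree` independent complex projections
            # with unit-modulus entries; since E[w w^H] = I, averaging D such
            # features estimates <x, y>^degree.
            n, d = X.shape
            Z = np.ones((n, D), dtype=complex)
            for _ in range(degree):
                W = np.exp(2j * np.pi * rng.random((d, D)))  # random phases
                Z *= X @ W
            return Z / np.sqrt(D)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(5, 20)) / np.sqrt(20)
        Z = ctr_poly_features(X, degree=3, D=20_000, rng=rng)
        K_hat = np.real(Z @ Z.conj().T)        # real-valued downstream output
        K_true = (X @ X.T) ** 3
        print(np.max(np.abs(K_hat - K_true)))  # shrinks as D grows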
    Observing how deep neural networks understand physics through the energy spectrum of one-dimensional quantum mechanics. (arXiv:2201.06676v2 [physics.comp-ph] UPDATED)
    We investigate how neural networks (NNs) understand physics using 1D quantum mechanics. After training an NN to accurately predict energy eigenvalues from potentials, we used it to confirm the NN's understanding of physics from four different aspects. The trained NN could predict energy eigenvalues for kinds of potentials different from those it was trained on, predict the probability distribution of the existence of particles not used during training, reproduce untrained physical phenomena, and predict the energy eigenvalues of potentials with an unknown matter effect. These results show that NNs can learn physical laws from experimental data, predict the results of experiments under conditions different from those used for training, and predict physical quantities of types not provided during training. Because NNs understand physics in a different way than humans, they will be a powerful tool for advancing physics by complementing the human way of understanding.  ( 2 min )
    The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization. (arXiv:2206.02768v2 [stat.ML] UPDATED)
    The logit outputs of a feedforward neural network at initialization are conditionally Gaussian, given a random covariance matrix defined by the penultimate layer. In this work, we study the distribution of this random matrix. Recent work has shown that shaping the activation function as network depth grows large is necessary for this covariance matrix to be non-degenerate. However, the current infinite-width-style understanding of this shaping method is unsatisfactory for large depth: infinite-width analyses ignore the microscopic fluctuations from layer to layer, but these fluctuations accumulate over many layers. To overcome this shortcoming, we study the random covariance matrix in the shaped infinite-depth-and-width limit. We identify the precise scaling of the activation function necessary to arrive at a non-trivial limit, and show that the random covariance matrix is governed by a stochastic differential equation (SDE) that we call the Neural Covariance SDE. Using simulations, we show that the SDE closely matches the distribution of the random covariance matrix of finite networks. Additionally, we recover an if-and-only-if condition for exploding and vanishing norms of large shaped networks based on the activation function.  ( 2 min )
    Semantic Information Retrieval in Wireless Networks. (arXiv:2204.13366v2 [cs.IT] UPDATED)
    Motivated by recent success of Machine Learning (ML) tools in wireless communications, the idea of semantic communication by Weaver from 1949 has received considerable attention. It breaks with the classic design paradigm of Shannon by aiming to transmit the meaning of a message, i.e., semantics, rather than its exact copy and thus allows for savings in channel uses or information rate. In this work, we extend the fundamental approach from Basu et al. for modeling semantics from logical to probabilistic entailment relations between meaning and messages. Thus, we model semantics by means of a hidden random variable and define the task of semantic communication as transmission of messages over a communication channel such that semantics is best preserved. We formulate the semantic communication design either as an Information Maximization or as an Information Bottleneck optimization problem. Finally, we propose the ML-based semantic communication system SINFONI for a distributed multipoint scenario: SINFONI communicates the meaning behind multiple messages that are observed at different senders to a single receiver for semantic retrieval. We analyze SINFONI by processing images as an example of messages. Numerical results reveal a rate-normalized SNR gain of up to 20 dB compared to classically designed communication systems.  ( 2 min )
    Lower Bounds for the Convergence of Tensor Power Iteration on Random Overcomplete Models. (arXiv:2211.03827v1 [cs.LG])
    Tensor decomposition serves as a powerful primitive in statistics and machine learning. In this paper, we focus on using power iteration to decompose an overcomplete random tensor. Past work studying the properties of tensor power iteration either requires a non-trivial data-independent initialization, or is restricted to the undercomplete regime. Moreover, several papers implicitly suggest that logarithmically many iterations (in terms of the input dimension) are sufficient for the power method to recover one of the tensor components. In this paper, we analyze the dynamics of tensor power iteration from random initialization in the overcomplete regime. Surprisingly, we show that polynomially many steps are necessary for convergence of tensor power iteration to any of the true components, which refutes the previous conjecture. On the other hand, our numerical experiments suggest that tensor power iteration successfully recovers tensor components for a broad range of parameters, even though it takes at least polynomially many steps to converge. To further complement our empirical evidence, we prove that a popular objective function for tensor decomposition is strictly increasing along the power iteration path. Our proof is based on the Gaussian conditioning technique, which has been applied to analyze the approximate message passing (AMP) algorithm. The major ingredient of our argument is a conditioning lemma that allows us to generalize AMP-type analysis to non-proportional limit and polynomially many iterations of the power method.  ( 3 min )
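    The dynamics being analyzed are simply the normalized map x -> T(I, x, x). A toy NumPy version in the overcomplete regime (the sizes and iteration count are ours, for illustration):

        import numpy as np

        rng = np.random.default_rng(0)
        d, k = 50, 100                         # overcomplete: more components than dimensions
        U = rng.normal(size=(k, d))
        U /= np.linalg.norm(U, axis=1, keepdims=True)
        T = np.einsum("ia,ib,ic->abc", U, U, U)   # T = sum_i u_i (x) u_i (x) u_i

        x = rng.normal(size=d)
        x /= np.linalg.norm(x)
        for _ in range(200):
            x = np.einsum("abc,b,c->a", T, x, x)  # power-iteration map T(I, x, x)
            x /= np.linalg.norm(x)

        # Correlation with the best-matching true component; consistent with the
        # abstract, this improves only slowly over many iterations.
        print(np.max(np.abs(U @ x)))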
    Causal Discovery in Linear Latent Variable Models Subject to Measurement Error. (arXiv:2211.03984v1 [cs.LG])
    We focus on causal discovery in the presence of measurement error in linear systems where the mixing matrix, i.e., the matrix indicating the independent exogenous noise terms pertaining to the observed variables, is identified up to permutation and scaling of the columns. We demonstrate a somewhat surprising connection between this problem and causal discovery in the presence of unobserved parentless causes, in the sense that there is a mapping, given by the mixing matrix, between the underlying models to be inferred in these problems. Consequently, any identifiability result based on the mixing matrix for one model translates to an identifiability result for the other model. We characterize to what extent the causal models can be identified under a two-part faithfulness assumption. Under only the first part of the assumption (corresponding to the conventional definition of faithfulness), the structure can be learned up to the causal ordering among an ordered grouping of the variables but not all the edges across the groups can be identified. We further show that if both parts of the faithfulness assumption are imposed, the structure can be learned up to a more refined ordered grouping. As a result of this refinement, for the latent variable model with unobserved parentless causes, the structure can be identified. Based on our theoretical results, we propose causal structure learning methods for both models, and evaluate their performance on synthetic data.  ( 3 min )
    Improving Graph Neural Networks at Scale: Combining Approximate PageRank and CoreRank. (arXiv:2211.04248v1 [cs.LG])
    Graph Neural Networks (GNNs) have achieved great successes in many learning tasks performed on graph structures. Nonetheless, to propagate information, GNNs rely on a message-passing scheme which can become prohibitively expensive when working with industrial-scale graphs. Inspired by the PPRGo model, we propose the CorePPR model, a scalable solution that utilises a learnable convex combination of the approximate personalised PageRank and the CoreRank to diffuse multi-hop neighbourhood information in GNNs. Additionally, we incorporate a dynamic mechanism to select the most influential neighbours for a particular node, which reduces training time while preserving the performance of the model. Overall, we demonstrate that CorePPR outperforms PPRGo, particularly on large graphs where selecting the most influential nodes is particularly relevant for scalability. Our code is publicly available at: https://github.com/arielramos97/CorePPR.  ( 2 min )
    Policy evaluation from a single path: Multi-step methods, mixing and mis-specification. (arXiv:2211.03899v1 [stat.ML])
    We study non-parametric estimation of the value function of an infinite-horizon $\gamma$-discounted Markov reward process (MRP) using observations from a single trajectory. We provide non-asymptotic guarantees for a general family of kernel-based multi-step temporal difference (TD) estimates, including canonical $K$-step look-ahead TD for $K = 1, 2, \ldots$ and the TD$(\lambda)$ family for $\lambda \in [0,1)$ as special cases. Our bounds capture their dependence on Bellman fluctuations, mixing time of the Markov chain, any mis-specification in the model, as well as the choice of weight function defining the estimator itself, and reveal some delicate interactions between mixing time and model mis-specification. For a given TD method applied to a well-specified model, its statistical error under trajectory data is similar to that of i.i.d. sample transition pairs, whereas under mis-specification, temporal dependence in data inflates the statistical error. However, any such deterioration can be mitigated by increased look-ahead. We complement our upper bounds by proving minimax lower bounds that establish optimality of TD-based methods with appropriately chosen look-ahead and weighting, and reveal some fundamental differences between value function estimation and ordinary non-parametric regression.  ( 2 min )
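    For readers unfamiliar with the estimators being analyzed, here is the tabular special case of $K$-step TD from a single trajectory -- a toy stand-in for the paper's kernel-based setting, with all sizes illustrative:

        import numpy as np

        rng = np.random.default_rng(0)
        S, gamma, K = 5, 0.9, 3
        P = rng.dirichlet(np.ones(S), size=S)        # transition matrix
        r = rng.normal(size=S)                       # per-state rewards
        V_true = np.linalg.solve(np.eye(S) - gamma * P, r)

        T = 50_000                                   # one long trajectory
        traj = np.zeros(T, dtype=int)
        for t in range(1, T):
            traj[t] = rng.choice(S, p=P[traj[t - 1]])

        V, counts = np.zeros(S), np.zeros(S)
        for t in range(T - K):
            s = traj[t]
            # K-step look-ahead target: discounted rewards plus a bootstrap term
            G = sum(gamma**k * r[traj[t + k]] for k in range(K)) + gamma**K * V[traj[t + K]]
            counts[s] += 1
            V[s] += (G - V[s]) / counts[s]

        print(np.abs(V - V_true).max())              # shrinks as T grows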
    Doubly Inhomogeneous Reinforcement Learning. (arXiv:2211.03983v1 [stat.ML])
    This paper studies reinforcement learning (RL) in doubly inhomogeneous environments under temporal non-stationarity and subject heterogeneity. In a number of applications, it is commonplace to encounter datasets generated by system dynamics that may change over time and population, challenging high-quality sequential decision making. Nonetheless, most existing RL solutions require either temporal stationarity or subject homogeneity, which would result in sub-optimal policies if both assumptions were violated. To address both challenges simultaneously, we propose an original algorithm to determine the ``best data chunks'' that display similar dynamics over time and across individuals for policy learning, which alternates between most recent change point detection and cluster identification. Our method is general, and works with a wide range of clustering and change point detection algorithms. It is multiply robust in the sense that it takes multiple initial estimators as input and only requires one of them to be consistent. Moreover, by borrowing information over time and population, it allows us to detect weaker signals and has better convergence properties when compared to applying the clustering algorithm per time or the change point detection algorithm per subject. Empirically, we demonstrate the usefulness of our method through extensive simulations and a real data application.  ( 2 min )
    Efficient probabilistic reconciliation of forecasts for real-valued and count time series. (arXiv:2210.02286v2 [stat.ML] UPDATED)
    Hierarchical time series are common in several applied fields. Forecasts are required to be coherent, that is, to satisfy the constraints given by the hierarchy. The most popular technique to enforce coherence is called reconciliation, which adjusts the base forecasts computed for each time series. However, recent works on probabilistic reconciliation present several limitations. In this paper, we propose a new approach based on conditioning to reconcile any type of forecast distribution. We then introduce a new algorithm, called Bottom-Up Importance Sampling, to efficiently sample from the reconciled distribution. It can be used for any base forecast distribution: discrete, continuous, or in the form of samples, providing a major speedup compared to the current methods. Experiments on several temporal hierarchies show a significant improvement over base probabilistic forecasts.  ( 2 min )
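    The conditioning idea can be sketched on the smallest possible hierarchy, total = b1 + b2: sample from the bottom-level base forecasts, weight each sample by how plausible its aggregate is under the total-level base forecast, and resample. The distributions and numbers below are illustrative, and the real algorithm proceeds bottom-up through deeper hierarchies:

        import numpy as np

        rng = np.random.default_rng(0)
        N = 100_000

        b1 = rng.normal(10.0, 2.0, size=N)     # base forecast samples, leaf 1
        b2 = rng.normal(5.0, 1.0, size=N)      # base forecast samples, leaf 2

        # Importance weights: likelihood of each aggregated sample under the
        # (incoherent) total-level base forecast N(18, 2^2)
        w = np.exp(-0.5 * ((b1 + b2 - 18.0) / 2.0) ** 2)
        w /= w.sum()
        idx = rng.choice(N, size=N, p=w)       # resample -> reconciled samples
        b1_rec, b2_rec = b1[idx], b2[idx]

        print((b1 + b2).mean(), (b1_rec + b2_rec).mean())  # pulled toward 18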
    Statistical Learning for Individualized Asset Allocation. (arXiv:2201.07998v3 [stat.ML] UPDATED)
    We establish a high-dimensional statistical learning framework for individualized asset allocation. Our proposed methodology addresses continuous-action decision-making with a large number of characteristics. We develop a discretization approach to model the effect of continuous actions and allow the discretization frequency to be large and diverge with the number of observations. The continuous-action value function is estimated using penalized regression with our proposed generalized penalties, which are imposed on linear transformations of the model coefficients. We show that our proposed Discretization and Regression with generalized fOlded concaVe penalty on Effect discontinuity (DROVE) approach enjoys desirable theoretical properties and allows for statistical inference of the optimal value associated with optimal decision-making. Empirically, the proposed framework is exercised with the Health and Retirement Study data in finding individualized optimal asset allocation. The results show that our individualized optimal strategy improves population financial well-being.  ( 2 min )
    Beyond spectral gap: The role of the topology in decentralized learning. (arXiv:2206.03093v2 [cs.LG] UPDATED)
    In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. We consider the setting in which all workers sample from the same dataset, and communicate over a sparse graph (decentralized). In this setting, current theory fails to capture important aspects of real-world behavior. First, the 'spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization when workers share the same data distribution. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies.  ( 2 min )
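    The setting is easy to reproduce in a few lines: every worker takes a local stochastic gradient step on the shared data, then gossips with its neighbours on the communication graph (here a ring; all sizes are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        n_workers, d, lr, steps = 8, 10, 0.1, 300

        # Ring gossip matrix: each worker averages with its two neighbours
        W = np.zeros((n_workers, n_workers))
        for i in range(n_workers):
            W[i, i] = W[i, (i - 1) % n_workers] = W[i, (i + 1) % n_workers] = 1 / 3

        A = rng.normal(size=(200, d))
        b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=200)  # shared dataset
        x = rng.normal(size=(n_workers, d))                      # one iterate per worker

        for _ in range(steps):
            idx = rng.integers(0, 200, size=(n_workers, 16))     # local minibatches
            grads = np.stack([
                A[idx[i]].T @ (A[idx[i]] @ x[i] - b[idx[i]]) / 16
                for i in range(n_workers)
            ])
            x = W @ (x - lr * grads)       # local SGD step, then gossip averaging

        x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
        print(np.linalg.norm(x.mean(axis=0) - x_star))  # small, up to gradient noise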
    A new BART prior for flexible modeling with categorical predictors. (arXiv:2211.04459v1 [stat.ME])
    Default implementations of Bayesian Additive Regression Trees (BART) represent categorical predictors using several binary indicators, one for each level of each categorical predictor. Regression trees built with these indicators partition the levels using a ``remove one at a time'' strategy. Unfortunately, the vast majority of partitions of the levels cannot be built with this strategy, severely limiting BART's ability to ``borrow strength'' across groups of levels. We overcome this limitation with a new class of regression tree and a new decision rule prior that can assign multiple levels to both the left and right child of a decision node. Motivated by spatial applications with areal data, we introduce a further decision rule prior that partitions the areas into spatially contiguous regions by deleting edges from random spanning trees of a suitably defined network. We implemented our new regression tree priors in the flexBART package, which, compared to existing implementations, often yields improved out-of-sample predictive performance without much additional computational burden. We demonstrate the efficacy of flexBART using examples from baseball and the spatiotemporal modeling of crime.  ( 2 min )
    Black Box Lie Group Preconditioners for SGD. (arXiv:2211.04422v1 [stat.ML])
    A matrix-free and a low-rank approximation preconditioner are proposed to accelerate the convergence of stochastic gradient descent (SGD) by exploiting curvature information sampled from Hessian-vector products or finite differences of parameters and gradients, similar to the BFGS algorithm. Both preconditioners are fitted in an online manner by minimizing a criterion that is free of line search and robust to stochastic gradient noise, and are further constrained to lie on certain connected Lie groups to preserve their corresponding symmetry or invariance, e.g., orientation of coordinates by the connected general linear group with positive determinants. The Lie group's equivariance property facilitates preconditioner fitting, and its invariance property removes any need for damping, which is common in second-order optimizers but difficult to tune. The learning rate for parameter updating and the step size for preconditioner fitting are naturally normalized, and their default values work well in most situations.  ( 2 min )
    A local approach to parameter space reduction for regression and classification tasks. (arXiv:2107.10867v2 [stat.ML] UPDATED)
    Parameter space reduction has been proved to be a crucial tool to speed up the execution of many numerical tasks such as optimization, inverse problems, sensitivity analysis, and surrogate models' design, especially in the presence of high-dimensional parametrized systems. In this work we propose a new method called local active subspaces (LAS), which explores the synergies of active subspaces with supervised clustering techniques in order to carry out a more efficient dimension reduction in the parameter space. The clustering is performed without losing the input-output relations by introducing a distance metric induced by the global active subspace. We present two possible clustering algorithms: K-medoids and a hierarchical top-down approach, which is able to impose a variety of subdivision criteria specifically tailored for parameter space reduction tasks. This method is particularly useful for the community working on surrogate modelling. Frequently, the parameter space presents subdomains where the objective function of interest varies less on average along different directions. So, it could be approximated more accurately if restricted to those subdomains and studied separately. We tested the new method on several numerical experiments of increasing complexity; we also show how to deal with vectorial outputs, and how to classify the different regions with respect to the local active subspace dimension. Employing this classification technique as a preprocessing step in the parameter space, or output space in case of vectorial outputs, brings remarkable results for the purpose of surrogate modelling.  ( 3 min )
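    The global ingredient that LAS builds on is the classical active subspace: eigendirections of the averaged outer product of gradients. A minimal sketch of that ingredient alone (the clustering step that makes the method local is omitted, and the toy function is ours):

        import numpy as np

        rng = np.random.default_rng(0)
        d = 10
        a = rng.normal(size=d)

        def grad_f(x):                   # gradient of f(x) = sin(x . a) + 0.01 x_0
            g = np.cos(x @ a)[..., None] * a
            g[..., 0] += 0.01
            return g

        X = rng.uniform(-1, 1, size=(2000, d))
        G = grad_f(X)
        C = G.T @ G / len(X)             # C ~ E[grad f grad f^T]
        eigvals, eigvecs = np.linalg.eigh(C)
        print(eigvals[::-1][:3])         # sharp drop after the first eigenvalue
        W1 = eigvecs[:, -1:]             # the (global) active subspace direction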
    Automatic Change-Point Detection in Time Series via Deep Learning. (arXiv:2211.03860v1 [stat.ML])
    Detecting change-points in data is challenging because of the range of possible types of change and types of behaviour of data when there is no change. Statistically efficient methods for detecting a change will depend on both of these features, and it can be difficult for a practitioner to develop an appropriate detection method for their application of interest. We show how to automatically generate new detection methods based on training a neural network. Our approach is motivated by the fact that many existing tests for the presence of a change-point can be represented by a simple neural network, and thus a neural network trained with sufficient data should have performance at least as good as these methods. We present theory that quantifies the error rate for such an approach, and how it depends on the amount of training data. Empirical results show that, even with limited training data, its performance is competitive with the standard CUSUM test for detecting a change in mean when the noise is independent and Gaussian, and can substantially outperform it in the presence of auto-correlated or heavy-tailed noise. Our method also shows strong results in detecting and localising changes in activity based on accelerometer data.  ( 2 min )
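    The CUSUM test used as the baseline here is itself only a few lines, which helps explain why a small network can represent tests of this type (variance is assumed known and equal to one in this sketch):

        import numpy as np

        def cusum(x):
            # Max over candidate change points tau of the standardized
            # difference in means before and after tau.
            n, c = len(x), np.cumsum(x)
            stats = [np.sqrt(tau * (n - tau) / n)
                     * abs(c[tau - 1] / tau - (c[-1] - c[tau - 1]) / (n - tau))
                     for tau in range(1, n)]
            return max(stats), int(np.argmax(stats)) + 1

        rng = np.random.default_rng(0)
        x = np.concatenate([rng.normal(0, 1, 100), rng.normal(1, 1, 100)])
        print(cusum(x))   # large statistic, estimated change point near 100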
    Individualized and Global Feature Attributions for Gradient Boosted Trees in the Presence of $\ell_2$ Regularization. (arXiv:2211.04409v1 [stat.ML])
    While $\ell_2$ regularization is widely used in training gradient boosted trees, popular individualized feature attribution methods for trees such as Saabas and TreeSHAP overlook the training procedure. We propose Prediction Decomposition Attribution (PreDecomp), a novel individualized feature attribution for gradient boosted trees when they are trained with $\ell_2$ regularization. Theoretical analysis shows that the inner product between PreDecomp and labels on in-sample data is essentially the total gain of a tree, and that it can faithfully recover additive models in the population case when features are independent. Inspired by the connection between PreDecomp and total gain, we also propose TreeInner, a family of debiased global feature attributions defined in terms of the inner product between any individualized feature attribution and labels on out-of-sample data for each tree. Numerical experiments on a simulated dataset and a genomic ChIP dataset show that TreeInner has state-of-the-art feature selection performance. Code reproducing experiments is available at https://github.com/nalzok/TreeInner .  ( 2 min )

  • Open

    The Tourist Who Astral Projected To Another Galaxy - A Fictional AI Story.
    submitted by /u/AstralTourist  ( 46 min )
    how
    So I came to know about Hotpot's colourize-old-photo tool, and I started playing with it, using random black-and-white photos from different websites. One photo had a white girl, and the colourised photo gave her golden blonde hair with slightly whitish-brown skin. I was intrigued by how it chooses colour. So I added a lion, and the colourised photo was close to real, a tawny brown colour. Then I added a photo of Indian boys playing with water, and they had dark skin with black hair; they could have been slightly lighter-skinned, but still pretty close to accurate. Then I added a black woman with Afro hair and she got dark skin; it was not coloured that well, but pretty close. And then I lastly added a Japanese woman and she got light skin. And then I added an Indonesian man, who got dark skin. Does the AI know what ethnicity the person in the image belongs to? If so, that's pretty intriguing and scary at the same time. Thanks for your responses, have a great day/night. submitted by /u/Awkward-Insurance-59  ( 47 min )
    FUTURE IN AI [survey]
    submitted by /u/elitta23  ( 46 min )
    DREAMBOOTH LOCAL Training Inside Stable Diffusion Tutorial
    submitted by /u/PuppetHere  ( 46 min )
    AI Professor Says Sex With Dolls Will Be No Different Than Real People In 50 Years
    submitted by /u/Educational_Sector98  ( 45 min )
    How can system-wide collaboration fix system-wide problems? | Caroline Gorski, CEO, R² Factory at Rolls-Royce
    submitted by /u/chelsea_bear  ( 49 min )
    The highest funded plagiarist is probably this AI ethicist
    This academic plagiarist has gotten hundreds of thousands of dollars for AI ethics research (https://www.umu.se/globalassets/qbank/andreastheodorouvirginiadignum-26433crop001155650resize1154649autoorientquality90density150stripextensionjpgid16.jpg?format=webp&mode=crop&width=1280): $300,000 for EXPLAIN (2022-25), where he is the primary investigator for Umeå University, funded by Vinnova under the Eureka ITEA/4 AI Cluster; $140,000 for the project "RAIN" (which appears to be just a dubious presentation long after the funder has forgotten about their money); and several hundred thousand dollars more for other "research projects" related to AI ethics and transparency. The funniest part is that he claimed on a number of occasions that he has been cleared of the accusations by NPOF (a Swedish institution investigating academic misconduct), but NPOF writes the following: "Thank you for your inquiry. We received a report regarding Andreas Theodorou and his dissertation in 2020. Theodorou made the dissertation in the UK. The dissertation is not made within Swedish jurisdiction and can not be investigated by Npof. Kind regards, Registrator" and his alma mater for obvious reasons refuses to investigate him. All the while he is being aggressively promoted by one Virginia Dignum, a self-promoting AI-ethics professor who among academics is considered highly proficient in academic politics. For more info go here: https://andreasplagiarism.wordpress.com/2020/12/02/andreas-theodorou-committed-plagiarism-in-his-phd-thesis/ or google "Theodorou plagiarism". submitted by /u/lexquests  ( 46 min )
    Venom - Eminem (Unofficial AI Music Video) with Captions
    submitted by /u/Available_Tadpole829  ( 44 min )
    New open-source GA (genetic algorithm) library
    Hey guys! This is on here cause there is no genetic algorithm subreddit and it is a similar field to AI. A while ago I noticed there were no good genetic algorithm libraries. So these last few weeks have been busy... Anyways, here is the GitHub link: dadukhankevin/Finch: A Keras style GA genetic algorithm library (github.com) Here is a basic example of a GA that finds solutions to equations. It evolves and presents the best answer it can come up with. There's more info on the GitHub page ^

        from Finch.FinchGA import generic
        from Finch.FinchGA import FitnessFunctions as ff
        from Finch.FinchGA import Environments, GenePools
        from Finch.FinchGA import Layers as l

        desired = 10
        expression = """((x**y+z)*h)/j+t+l*k/x"""
        eq = generic.Equation(["x", "y", "z", "h", "j", "t", "l", "k"], expression, desired=desired)
        fitness = ff.EquationFitness(desired_result=desired, equation=eq)
        pool = GenePools.GenePool(list(range(1, 150)), fitness.func, mx=100, mn=0.001)

        env = Environments.SequentialEnvironment(layers=[
            l.GenerateData(pool, population=30, array_length=8),
            l.SortFitness(),
            l.Mutate(pool, select_percent=50, likelihood=10),
            l.NarrowGRN(pool, delay=1, method="best", amount=1, reward=.2, penalty=.05, mn=.1, mx=200, every=1),
            l.UpdateWeights(pool),
            l.Parents(pool, gene_size=1, family_size=6, percent=100, every=4, method="best", amount=2),  # parents the best ones
            l.KeepLength(100),  # keeps a low population
        ])

        env.compile(epochs=100, every=10, fitness=fitness.func, stop_threshold=.99)
        hist, data = env.simulate_env()
        info = env.display_history()
        print("best percent: " + str(env.best))
        print("best individual: " + str(env.best_ind.chromosome.get_raw()))
        print(info)
        env.plot()

    Please play around with it and give me feedback! submitted by /u/danieljames10000  ( 45 min )
  • Open

    How should I approach this?
    So I'm interested in Multi-Level Models and am looking at building a simulator using some data that I have acquired (with the eventual goal to do some hierarchical RL using the simulator). I have two primary decisions that I'm analyzing and a ton of samples of actions being taken/the outcome being recorded re: these two decisions. Is there a direction you guys would suggest heading in to build this simulator? I know that it obviously won't be exact because there are actions within certain states that my sample set does not contain, but I'd like to estimate that in my simulator so that when I apply hierarchical RL, I can have a proof-of-concept of it actually working. Any help would be greatly appreciated! submitted by /u/darth_catnip  ( 51 min )
  • Open

    What Is Denoising?
    Anyone who’s taken a photo with a digital camera is likely familiar with a “noisy” image: discolored spots that make the photo lose clarity and sharpness. Many photographers have tips and tricks to reduce noise in images, including fixing the settings on the camera lens or taking photos in different lighting. But it isn’t just…  ( 6 min )
    NVIDIA AI Turbocharges Industrial Research, Scientific Discovery in the Cloud on Rescale HPC-as-a-Service Platform
    Just like many businesses, the world of industrial scientific computing has a data problem. Solving seemingly intractable challenges — from developing new energy sources and creating new modes of transportation, to addressing mission-critical issues such as driving operational efficiencies and improving customer support — requires massive amounts of high performance computing. Instead of having to…  ( 5 min )
    NVIDIA Hopper, Ampere GPUs Sweep Benchmarks in AI Training
    Two months after their debut sweeping MLPerf inference benchmarks, NVIDIA H100 Tensor Core GPUs set world records across enterprise AI workloads in the industry group’s latest tests of AI training. Together, the results show H100 is the best choice for users who demand utmost performance when creating and deploying advanced AI models. MLPerf is the…  ( 5 min )
    New Volvo EX90 SUV Heralds AI Era for Swedish Automaker, Built on NVIDIA DRIVE
    It’s a new age for safety. Volvo Cars unveiled the Volvo EX90 SUV today in Stockholm, marking the beginning of a new era of electrification, technology and safety for the automaker. The flagship vehicle is redesigned from tip to tail — with a new powertrain, branding and software-defined AI compute — powered by the centralized…  ( 4 min )
    HORN Free! Roaming Rhinos Could Be Guarded by AI Drones
    Call it the ultimate example of a job that’s sometimes best done remotely. Wildlife researchers say rhinos are magnificent beasts, but they like to be left alone, especially when they’re with their young. In the latest example of how researchers are using the latest technologies to track animals less invasively, a team of researchers has…  ( 4 min )
  • Open

    [R] Neurosymbolic Programming for Science
    Neurosymbolic Programming (NP) techniques have the potential to accelerate scientific discovery. These models combine neural and symbolic components to learn complex patterns and representations from data, using high-level concepts or known constraints. NP techniques can interface with symbolic domain knowledge from scientists, such as prior knowledge and experimental context, to produce interpretable outputs. We identify opportunities and challenges between current NP models and scientific workflows, with real-world examples from behavior analysis in science, to enable the use of NP broadly for workflows across the natural and social sciences. Paper: https://arxiv.org/abs/2210.05050 submitted by /u/insider_7  ( 56 min )
    [D] Best learning rate for fine tuning a pretrained CNN
    Hey, I was wondering if there are rules on how to choose the base learning rate for finetuning a pre-trained CNN. For example, for a pretrained MobileNet, ResNeXt, or ConvNeXt, what learning rate should one use for finetuning? Thanks! submitted by /u/Meddhouib10  ( 58 min )
    [D] Is there an advantage in learning when taking the average Gradient compared to the Gradient of just one point
    Hello, I'm wondering if during learning tasks there is an advantage in taking the average gradient of a small volume in our parameter space compared to just taking the gradient of the one center point in that parameter-space volume. Edit: I noticed that I have some people confused, and rightly so, because I didn't mention some aspects. The computational overhead of averaging over many points is not important; why is beside the point, but it has to do with the fact that this can be extremely cheap on a quantum computer. My question, more precisely: given that computational overhead can be ignored, are there in general theoretical advantages in taking the average over a volume in comparison to just at a point? I am a physicist and have only some experience in ML. My thesis is at the moment in Quantum Machine Learning and this question has been central to my research for some weeks. But unfortunately I can't find anything online regarding this question. I was wondering if someone here would have some insights into it. submitted by /u/CPOOCPOS  ( 66 min )
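    For concreteness, the question compares a point gradient with a Monte-Carlo average of gradients over a small ball, i.e. the gradient of a locally smoothed loss. A toy classical-hardware illustration (one parameter, toy gradient; everything here is made up for the example):

        import numpy as np

        def grad(theta):                  # gradient of a toy non-convex loss
            return np.cos(theta) + 0.2 * theta

        def averaged_grad(theta, radius, n_samples, rng):
            # Average the gradient over points sampled uniformly from a ball
            # around theta -- the "small volume" in the question.
            u = rng.normal(size=(n_samples, theta.size))
            u *= (radius * rng.random((n_samples, 1)) ** (1 / theta.size)
                  / np.linalg.norm(u, axis=1, keepdims=True))
            return grad(theta + u).mean(axis=0)

        rng = np.random.default_rng(0)
        theta = np.array([3.0])
        print(grad(theta), averaged_grad(theta, 0.5, 10_000, rng))
        # Averaging equals differentiating a smoothed loss, which damps sharp,
        # high-frequency structure in the landscape.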
    [Discussion] Could someone explain the math behind the number of distinct images that can be generated with a latent diffusion model?
    I'm interested in understanding how "deep" the pool of images is that can be brought up with e.g. the model used by Stable Diffusion (as it is deterministic). So how many distinct 512x512 images (feel free to disregard off-aspect-ratio generation) can be constructed given, e.g., the default training set used in Stable Diffusion 1.5 (a subset of LAION-5B) and the 860M UNet and 123M CLIP ViT-L/14 text encoder? If it makes any difference, feel free to disregard the "image to image" capabilities as well. submitted by /u/spcrngr  ( 58 min )
    [D] Video: The New AI Model Licenses have a Legal Loophole (OpenRAIL-M of BLOOM, Stable Diffusion, etc.)
    https://youtu.be/W5M-dvzpzSQ So-called responsible AI licenses are stupid, counterproductive, and have a dangerous legal loophole in them. OpenRAIL++ License here: https://www.ykilcher.com/license OUTLINE: 0:00 - Introduction 0:40 - Responsible AI Licenses (RAIL) of BLOOM and Stable Diffusion 3:35 - Open source software's dilemma of bad usage and restrictions 8:45 - Good applications, bad applications 12:45 - A dangerous legal loophole 15:50 - OpenRAIL++ License 16:50 - This has nothing to do with copyright 26:00 - Final thoughts submitted by /u/ykilcher  ( 61 min )
    [P] Serverless Jupyter Lab with GPUs and persistent storage
    I've been building unweave.io as a zero setup way of running your machine learning code serverlessly on GPUs. We get asked whether we support Notebooks on Unweave all the time. So, we put together a serverless Jupyter Lab Portal built on top of Unweave: playground.unweave.io You can choose from 5 different GPUs or use CPU only. The lab comes with an `uwstore` folder at the root of the repository. Any files you add there will be persisted across multiple sessions. We've pre-loaded all labs with a set of commonly used ML dependencies. If you'd like to add additional dependencies, you can install them directly in the Jupyter Lab and save them to a `requirements.txt` or `environment.yml` file. Unweave will automatically reinstall these dependencies the next time you start a lab. Since we let you run on GPUs (which are very expensive) unfortunately we can't make it free. However, we've credited each account with $5 to start off with. Billing is by the minute and cheaper than AWS and GCP :) I'd love to hear what you think! P.S. Here's a demo: https://www.loom.com/share/0a06836e9832472d853b75bcae334020 submitted by /u/doyougitme  ( 59 min )
    [R] Astronomia ex machina: a history, primer, and outlook on neural networks in astronomy
    Author here -- here is a review paper that we have been working on for a while; I was hoping to get a little discussion going on it! In it we survey deep learning within astronomy, and identify three historical waves. We plot astronomical connectionism's course from the early days of training multilayer perceptrons on expertly derived emergent parameters, through the second wave of training recurrent and convolutional neural networks on raw data, and the third wave of self-supervised, unsupervised and generative learning. Along the way we try to build up a solid theoretical foundation and intuition for deep learning frameworks as a primer for someone new to the field. We also have some cool conclusions in section 9 about where the field is going (foundation models!), and a surprise at the end of section 1. https://arxiv.org/abs/2211.03796 submitted by /u/Smith4242  ( 55 min )
    [D] Improve machine learning with same number of images
    Hi, will slightly modifying the same images improve the algorithm's capacity to recognize? I have a fixed set of images and want to maximize the variations without adding new images. submitted by /u/Stemvid  ( 58 min )
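    What the question describes is standard data augmentation: applying label-preserving transforms so the model sees many variants of each fixed image. A minimal NumPy sketch (the specific transforms are chosen for illustration):

        import numpy as np

        def augment(img, rng):
            # Label-preserving transforms: flips, 90-degree rotations,
            # and a small brightness jitter.
            if rng.random() < 0.5:
                img = img[:, ::-1]                    # horizontal flip
            img = np.rot90(img, k=rng.integers(0, 4))
            return np.clip(img * rng.uniform(0.8, 1.2), 0.0, 1.0)

        rng = np.random.default_rng(0)
        img = rng.random((32, 32))
        variants = [augment(img, rng) for _ in range(8)]  # 8 extra training views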
    [R] Bigscience releases BLOOMZ and mT0 - LLMs that follow multilingual instructions
    submitted by /u/C0hentheBarbarian  ( 56 min )
    [D] Can anyone explain the MinTrace method for reconciliation of Hierarchical Time Series Forecast?
    I have gone through a lot of material that explains the MinTrace reconciliation method but haven't found anything substantial. Any sort of help would be highly appreciated! submitted by /u/Impossible_Special92  ( 56 min )
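    The core of MinTrace (MinT) is the closed form y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat, where S is the summing matrix and W an estimate of the base-forecast error covariance; the named variants (OLS, WLS, shrinkage) differ only in how W is estimated. A tiny worked example on a two-leaf hierarchy (the numbers are made up):

        import numpy as np

        # Hierarchy: total = b1 + b2; S maps bottom series to [total, b1, b2]
        S = np.array([[1.0, 1.0],
                      [1.0, 0.0],
                      [0.0, 1.0]])

        y_hat = np.array([105.0, 40.0, 55.0])   # incoherent base forecasts
        W = np.diag([4.0, 1.0, 1.0])            # error covariance estimate

        Winv = np.linalg.inv(W)
        P = np.linalg.solve(S.T @ Winv @ S, S.T @ Winv)
        y_tilde = S @ P @ y_hat
        print(y_tilde)   # coherent: first entry equals the sum of the other two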
    [D] What are some exciting and challenging RL environments you know of? and how do you choose what RL environment to test your algorithm in?
    Let me start by sharing this list: https://github.com/clvrai/awesome-rl-envs Are there other awesome RL environments you know of? And when you have an algorithm or an idea, how do you choose what environment to test it on? Do you simply use the same environments as everyone else? I have seen a lot of papers do maze navigation and kitchen manipulation, but maybe my sample is biased towards those papers. Is there a standard benchmark environment for RL? Of course, I know it also depends whether you are doing online vs offline RL, dense vs sparse reward. So, I don't expect a concrete answer, but I'd like to hear some guiding principles. submitted by /u/carlml  ( 55 min )
    [D] How to parallelize training over devices?
    Assuming my loss is of the form sum_i |NN(x_i) - y_i|^2, I can take minibatches to train my neural network. I've observed that smaller batches lead to better performance (with bigger batches the loss plateaus quickly; with smaller batches I am able to get a smaller loss). Now I want to parallelize training over multiple devices. If I just sent batches to different devices, took the gradient w.r.t. each batch (of say size 16) on each device (say 4), and then applied all the gradients simultaneously, that would be equivalent to just using a larger batch (in this case 16x4 = 64). Is there any way to parallelize the training without increasing the batch size? submitted by /u/ButterscotchLost421  ( 57 min )
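    One standard answer is local SGD (periodic model averaging): each device keeps its own small batch and its own iterate, and weights are averaged only every H steps, so the per-step gradient noise stays that of a batch of 16. A single-process simulation of the idea (toy linear model; all sizes are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, workers, batch, H, lr = 4096, 20, 4, 16, 8, 0.05
        X = rng.normal(size=(n, d))
        y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

        w = np.zeros((workers, d))          # one model copy per device
        for step in range(500):
            for i in range(workers):        # each device keeps batch size 16
                idx = rng.integers(0, n, size=batch)
                w[i] -= lr * X[idx].T @ (X[idx] @ w[i] - y[idx]) / batch
            if step % H == 0:
                w[:] = w.mean(axis=0)       # average models every H steps
        print(np.linalg.norm(X @ w[0] - y) / np.sqrt(n))   # near the noise level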
    [D] Modern Forecasting in Practice with Jan Gasthaus (AWS) and Tim Januschowski (Zalando)
    Just wanted to give a heads up that we’ve got an upcoming course on Time Series & Forecasting. The goal is to help you solve complex business problems by making more accurate predictions with modern forecasting techniques. This course will be led by two industry leaders: Jan Gasthaus (AWS) and Tim Januschowski (ex-AWS, Zalando). In the past, Tim and his team built multiple AI services for AWS such as SageMaker, Forecast, Lookout for Metrics, and DevOps Guru. Jan was part of the teams pushing these projects forward, and also co-created the open-source deep learning forecasting library Gluon TS. Plus, like all of our courses, Time Series & Forecasting qualifies for coverage from your org’s L&D budget or personal learning stipend. Come join Tim and Jan live for 5-days of hands-on training. You can learn more about the course by clicking here: https://www.getsphere.com/cohorts/modern-forecasting-in-practice?source=Sphere-Communities-r-MachineLearning submitted by /u/lorenzo_1999  ( 58 min )
    [P] Open source project using ML to help you create more efficient search algorithms 1k+ Github Stars
    Source code: https://search-the-way-you-th.ink/3FNq2lG Marqo is a super cool new open-source initiative that enables developers to integrate user-focused search and analytics. Marqo uses deep-learning algorithms like CLIP to pull semantic meaning from images, meaning that it can seamlessly handle image-to-image, image-to-text and text-to-image search and analytics. Think command-f on your computer. We have a very comprehensive guide that allows even beginners to grasp the main concepts and experiment themselves. Can be found here: https://docs.marqo.ai/ If you are even remotely interested in implementing Marqo or playing around with it, check out Marqo.ai and star our GitHub for any future releases we bring out! P.S. If you have any feedback at all please shoot me a message, we want to make this the best possible solution for developers :) happy hacking! submitted by /u/asxyolo123  ( 58 min )
  • Open

    How to Explore Historical Data Patterns with Machine Learning
    The possibility to render historical data into comprehensive patterns has added soundness to many areas, and when it comes to trading, it has left a giant mark and keeps growing. It is as if you had numerous eyes and decades of hard-earned experience to execute wonders and gains. However, mining the data alone isn’t enough to produce…  ( 20 min )
  • Open

    Brain tumor segmentation at scale using AWS Inferentia
    Medical imaging is an important tool for the diagnosis and localization of disease. Over the past decade, collections of medical images have grown rapidly, and open repositories such as The Cancer Imaging Archive and Imaging Data Commons have democratized access to this vast imaging data. Computational tools such as machine learning (ML) and artificial intelligence […]  ( 6 min )
    Serve multiple models with Amazon SageMaker and Triton Inference Server
    Amazon SageMaker is a fully managed service for data science and machine learning (ML) workflows. It helps data scientists and developers prepare, build, train, and deploy high-quality ML models quickly by bringing together a broad set of capabilities purpose-built for ML. In 2021, AWS announced the integration of NVIDIA Triton Inference Server in SageMaker. You […]  ( 10 min )
    Model Hosting Patterns in SageMaker: Best practices in testing and updating models on SageMaker
    Amazon SageMaker is a fully managed service that provides developers and data scientists the ability to quickly build, train, and deploy machine learning (ML) models. With SageMaker, you can deploy your ML models on hosted endpoints and get inference results in real time. You can easily view the performance metrics for your endpoints in Amazon […]  ( 12 min )
  • Open

    Elliptic coordinates and Laplace’s equation
    In rectangular coordinates, constant values of x are vertical lines and constant values of y are horizontal lines. In polar coordinates, constant values of r are circles and constant values of θ are lines from the origin. In elliptic coordinates, the position of a point is specified by two numbers, μ and ν. Constant values […]  ( 6 min )
    Computing arccos
    Suppose you take two numbers, a and b, and repeatedly take their arithmetic mean and their geometric mean. That is, suppose we set a0 = a, b0 = b, and then a1 = (a0 + b0)/2, b1 = √(a0 b0), and repeat this process, each new a becoming the arithmetic mean of the previous a and […]  ( 6 min )
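    The iteration itself fits in a few lines; both sequences converge quadratically to the common arithmetic-geometric mean. This sketch implements just the iteration from the excerpt, not the arccos formula the full post builds from it:

        from math import sqrt

        def agm(a, b, tol=1e-15):
            # Replace (a, b) by their arithmetic and geometric means until
            # the two sequences meet.
            while abs(a - b) > tol * a:
                a, b = (a + b) / 2, sqrt(a * b)
            return a

        print(agm(1.0, 0.5))   # the common limit M(1, 0.5)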
    Three diagrams
    This post will give examples of three similar diagrams that occur in three dissimilar areas: design of experiments, finite difference methods for PDEs, and numerical integration. Central Composite Design (CCD) The most popular design for fitting a second-order response surface is the central composite design or CCD. When there are two factors being tested, the […]  ( 5 min )
  • Open

    A Universal Trade-off Between the Model Size, Test Loss, and Training Loss of Linear Predictors. (arXiv:2207.11621v2 [stat.ML] UPDATED)
    In this work we establish an algorithm and distribution independent non-asymptotic trade-off between the model size, excess test loss, and training loss of linear predictors. Specifically, we show that models that perform well on the test data (have low excess loss) are either "classical" -- have training loss close to the noise level, or are "modern" -- have a much larger number of parameters compared to the minimum needed to fit the training data exactly. We also provide a more precise asymptotic analysis when the limiting spectral distribution of the whitened features is Marchenko-Pastur. Remarkably, while the Marchenko-Pastur analysis is far more precise near the interpolation peak, where the number of parameters is just enough to fit the training data, in settings of most practical interest it differs from the distribution independent bound by only a modest multiplicative constant.  ( 2 min )
    Efficient Deep Learning-based Estimation of the Vital Signs on Smartphones. (arXiv:2204.08989v2 [eess.SP] UPDATED)
    Nowadays, due to the widespread use of smartphones in everyday life and the improvement of the computational capabilities of these devices, many complex tasks can be deployed on them. Concerning the need for continuous monitoring of vital signs, especially for the elderly or those with certain types of diseases, the development of algorithms that can estimate vital signs using smartphones has attracted researchers worldwide. Such algorithms estimate vital signs (heart rate and oxygen saturation level) by processing an input PPG signal. These methods often apply multiple pre-processing steps to the input signal before the prediction step. This can increase the computational complexity of these methods, meaning only a limited number of mobile devices can run them. Furthermore, multiple pre-processing steps also require the design of several hand-crafted stages to obtain an optimal result. This research proposes a novel end-to-end solution to mobile-based vital sign estimation using deep learning. The proposed method does not require any pre-processing. Due to the use of a fully convolutional architecture, the parameter count of our proposed model is, on average, a quarter of that of ordinary architectures that use fully-connected layers as prediction heads. As a result, the proposed model is less prone to over-fitting and has lower computational complexity. A public dataset for vital sign estimation, including 62 videos collected from 35 men and 27 women, is also provided. The experimental results demonstrate state-of-the-art estimation accuracy.  ( 3 min )
    Semantic Segmentation of Legal Documents via Rhetorical Roles. (arXiv:2112.01836v2 [cs.CL] UPDATED)
    Legal documents are unstructured, use legal jargon, and have considerable length, making them difficult to process automatically via conventional text processing techniques. A legal document processing system would benefit substantially if the documents could be segmented into coherent information units. This paper proposes a new corpus of legal documents annotated (with the help of legal experts) with a set of 13 semantically coherent units labels (referred to as Rhetorical Roles), e.g., facts, arguments, statute, issue, precedent, ruling, and ratio. We perform a thorough analysis of the corpus and the annotations. For automatically segmenting the legal documents, we experiment with the task of rhetorical role prediction: given a document, predict the text segments corresponding to various roles. Using the created corpus, we experiment extensively with various deep learning-based baseline models for the task. Further, we develop a multitask learning (MTL) based deep model with document rhetorical role label shift as an auxiliary task for segmenting a legal document. The proposed model shows superior performance over the existing models. We also experiment with model performance in the case of domain transfer and model distillation techniques to see the model performance in limited data conditions.
    Momentum-based Weight Interpolation of Strong Zero-Shot Models for Continual Learning. (arXiv:2211.03186v1 [cs.LG])
    Large pre-trained, zero-shot capable models have shown considerable success both for standard transfer and adaptation tasks, with particular robustness towards distribution shifts. In addition, subsequent fine-tuning can considerably improve performance on a selected downstream task. However, through naive fine-tuning, these zero-shot models lose their generalizability and robustness towards distribution shifts. This is a particular problem for tasks such as Continual Learning (CL), where continuous adaptation has to be performed as new task distributions are introduced sequentially. In this work, we showcase that where fine-tuning falls short in adapting such zero-shot capable models, simple momentum-based weight interpolation can provide consistent improvements for CL tasks in both memory-free and memory-based settings. In particular, we find improvements of over $+4\%$ on standard CL benchmarks, while in parts more than halving the error gap to the upper limit of jointly training on all tasks at once, allowing the continual learner to inch closer to the joint-training limit.
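    The interpolation mechanism itself is tiny: keep a slowly moving copy of the weights that trails the fine-tuned ones. A hedged sketch with a dict-of-arrays "model" (the value of tau and the update loop are illustrative, not the paper's exact schedule):

        import numpy as np

        def interpolate_(slow, fast, tau=0.99):
            # Momentum-style weight interpolation: the slow weights trail
            # the fine-tuned ("fast") weights.
            for k in slow:
                slow[k] = tau * slow[k] + (1 - tau) * fast[k]

        rng = np.random.default_rng(0)
        fast = {"w1": rng.normal(size=(8, 4)), "w2": rng.normal(size=(4, 1))}
        slow = {k: v.copy() for k, v in fast.items()}   # start at zero-shot weights

        for step in range(100):
            for k in fast:                  # stand-in for fine-tuning updates
                fast[k] -= 0.01 * rng.normal(size=fast[k].shape)
            interpolate_(slow, fast)
        # `slow` retains much of the zero-shot solution while tracking `fast`.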
    Optimal Diagonal Preconditioning. (arXiv:2209.00809v2 [math.OC] UPDATED)
    Preconditioning has long been a staple technique in optimization, often applied to reduce the condition number of a matrix and speed up the convergence of algorithms. Although there are many popular preconditioning techniques in practice, most lack guarantees on reductions in condition number. Moreover, the degree to which we can improve over existing heuristic preconditioners remains an important practical question. In this paper, we study the problem of optimal diagonal preconditioning that achieves maximal reduction in the condition number of any full-rank matrix by scaling its rows and/or columns. We first reformulate the problem as a quasi-convex problem and provide a simple algorithm based on bisection. Then we develop an interior point algorithm with $O(\log(1/\epsilon))$ iteration complexity, where each iteration consists of a Newton update based on the Nesterov-Todd direction. Next, we specialize to one-sided optimal diagonal preconditioning problems, and demonstrate that they can be formulated as standard dual SDP problems. We then develop efficient customized solvers and study the empirical performance of our optimal diagonal preconditioning procedures through extensive experiments on large matrices. Our findings suggest that optimal diagonal preconditioners can significantly improve upon existing heuristics-based diagonal preconditioners at reducing condition numbers and speeding up iterative methods. Moreover, our implementation of customized solvers, combined with a random row/column sampling step, can find near-optimal diagonal preconditioners for matrices up to size 200,000 in reasonable time, demonstrating their practical appeal.
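    For intuition about what is being optimized, here is the effect of a heuristic one-sided diagonal preconditioner (Jacobi-style column scaling) on the condition number -- the kind of baseline the paper's optimal procedure is designed to improve on (matrix sizes and scalings are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        # Badly scaled matrix: columns multiplied by lognormal factors
        A = rng.normal(size=(200, 200)) * rng.lognormal(0.0, 2.0, size=200)

        def cond(M):
            s = np.linalg.svd(M, compute_uv=False)
            return s[0] / s[-1]

        D = np.diag(1.0 / np.linalg.norm(A, axis=0))  # scale columns by their norms
        print(cond(A), cond(A @ D))   # the condition number typically drops sharply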
    ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts. (arXiv:2206.14786v2 [cs.LG] UPDATED)
    Post-processing ensemble prediction systems can improve the reliability of weather forecasting, especially for extreme event prediction. In recent years, different machine learning models have been developed to improve the quality of weather post-processing. However, these models require a comprehensive dataset of weather simulations to produce high-accuracy results, which comes at a high computational cost to generate. This paper introduces the ENS-10 dataset, consisting of ten ensemble members spanning 20 years (1998-2017). The ensemble members are generated by perturbing numerical weather simulations to capture the chaotic behavior of the Earth. To represent the three-dimensional state of the atmosphere, ENS-10 provides the most relevant atmospheric variables at 11 distinct pressure levels and the surface at 0.5-degree resolution for forecast lead times T=0, 24, and 48 hours (two data points per week). We propose the ENS-10 prediction correction task for improving the forecast quality at a 48-hour lead time through ensemble post-processing. We provide a set of baselines and compare their skill at correcting the predictions of three important atmospheric variables. Moreover, we measure the baselines' skill at improving predictions of extreme weather events using our dataset. The ENS-10 dataset is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
    Unsupervised Machine Learning for Explainable Medicare Fraud Detection. (arXiv:2211.02927v1 [cs.CY])
    The US federal government spends more than a trillion dollars per year on health care, largely provided by private third parties and reimbursed by the government. A major concern in this system is overbilling, waste and fraud by providers, who face incentives to misreport on their claims in order to receive higher payments. In this paper, we develop novel machine learning tools to identify providers that overbill Medicare, the US federal health insurance program for elderly adults and the disabled. Using large-scale Medicare claims data, we identify patterns consistent with fraud or overbilling among inpatient hospitalizations. Our proposed approach for Medicare fraud detection is fully unsupervised, not relying on any labeled training data, and is explainable to end users, providing reasoning and interpretable insights into the potentially suspicious behavior of the flagged providers. Data from the Department of Justice on providers facing anti-fraud lawsuits and several case studies validate our approach and findings both quantitatively and qualitatively.
    GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation. (arXiv:2207.14467v2 [cs.CL] UPDATED)
    The Transformer architecture, stacked from a sequence of encoder and decoder layers, has driven significant progress in neural machine translation. However, the vanilla Transformer mainly exploits the top-layer representation, assuming the lower layers provide trivial or redundant information and thus ignoring bottom-layer features that are potentially valuable. In this work, we propose the Group-Transformer model (GTrans) that flexibly divides multi-layer representations of both the encoder and decoder into different groups and then fuses these group features to generate target words. To corroborate the effectiveness of the proposed method, extensive experiments and analyses are conducted on three bilingual translation benchmarks and two multilingual translation tasks, including the IWSLT-14, IWSLT-17, LDC, WMT-14, and OPUS-100 benchmarks. Experimental and analytical results demonstrate that our model consistently outperforms its Transformer counterparts. Furthermore, it can be successfully scaled up to 60 encoder layers and 36 decoder layers.
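    As an illustrative sketch only (not the paper's exact GTrans module), grouping and fusing layer outputs might look as follows in PyTorch: the stack of layer representations is split into groups, averaged within each group, and combined with learned fusion weights.

    ```python
    import torch
    import torch.nn as nn

    class GroupFusion(nn.Module):
        """Sketch of group-wise layer fusion: divide layer outputs into groups,
        average within each group, then fuse groups with learned weights."""
        def __init__(self, n_layers=6, n_groups=3):
            super().__init__()
            assert n_layers % n_groups == 0
            self.n_groups = n_groups
            self.group_size = n_layers // n_groups
            self.gate = nn.Parameter(torch.zeros(n_groups))  # learned fusion weights

        def forward(self, layer_outputs):            # list of [batch, seq, d_model]
            h = torch.stack(layer_outputs)           # [n_layers, batch, seq, d]
            groups = h.reshape(self.n_groups, self.group_size, *h.shape[1:]).mean(1)
            w = torch.softmax(self.gate, dim=0)      # convex combination of groups
            return (w.view(-1, 1, 1, 1) * groups).sum(0)

    fusion = GroupFusion(n_layers=6, n_groups=3)
    outs = [torch.randn(2, 5, 512) for _ in range(6)]  # fake per-layer outputs
    fused = fusion(outs)                                # [2, 5, 512]
    ```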
    Robust Testing in High-Dimensional Sparse Models. (arXiv:2205.07488v2 [cs.IT] UPDATED)
    We consider the problem of robustly testing the norm of a high-dimensional sparse signal vector under two different observation models. In the first model, we are given $n$ i.i.d. samples from the distribution $\mathcal{N}\left(\theta,I_d\right)$ (with unknown $\theta$), of which a small fraction has been arbitrarily corrupted. Under the promise that $\|\theta\|_0\le s$, we want to correctly distinguish whether $\|\theta\|_2=0$ or $\|\theta\|_2>\gamma$, for some input parameter $\gamma>0$. We show that any algorithm for this task requires $n=\Omega\left(s\log\frac{ed}{s}\right)$ samples, which is tight up to logarithmic factors. We also extend our results to other common notions of sparsity, namely, $\|\theta\|_q\le s$ for any $0 < q < 2$. In the second observation model that we consider, the data is generated according to a sparse linear regression model, where the covariates are i.i.d. Gaussian and the regression coefficient (signal) is known to be $s$-sparse. Here too we assume that an $\epsilon$-fraction of the data is arbitrarily corrupted. We show that any algorithm that reliably tests the norm of the regression coefficient requires at least $n=\Omega\left(\min(s\log d,{1}/{\gamma^4})\right)$ samples. Our results show that the complexity of testing in these two settings significantly increases under robustness constraints. This is in line with the recent observations made in robust mean testing and robust covariance testing.
    Federated Multi-Task Learning under a Mixture of Distributions. (arXiv:2108.10252v4 [cs.LG] UPDATED)
    The increasing size of data generated by smartphones and IoT devices motivated the development of Federated Learning (FL), a framework for on-device collaborative training of machine learning models. First efforts in FL focused on learning a single global model with good average performance across clients, but the global model may be arbitrarily bad for a given client, due to the inherent heterogeneity of local data distributions. Federated multi-task learning (MTL) approaches can learn personalized models by formulating an opportune penalized optimization problem. The penalization term can capture complex relations among personalized models, but eschews clear statistical assumptions about local data distributions. In this work, we propose to study federated MTL under the flexible assumption that each local data distribution is a mixture of unknown underlying distributions. This assumption encompasses most of the existing personalized FL approaches and leads to federated EM-like algorithms for both client-server and fully decentralized settings. Moreover, it provides a principled way to serve personalized models to clients not seen at training time. The algorithms' convergence is analyzed through a novel federated surrogate optimization framework, which can be of general interest. Experimental results on FL benchmarks show that our approach provides models with higher accuracy and fairness than state-of-the-art methods.
    Quantitative Assessment of Drought Impacts Using XGBoost based on the Drought Impact Reporter. (arXiv:2211.02768v1 [cs.LG])
    Under climate change, the increasing frequency, intensity, and spatial extent of drought events lead to higher socio-economic costs. However, the relationships between hydro-meteorological indicators and drought impacts are not yet well identified because of complexity and data scarcity. In this paper, we propose a framework based on extreme gradient boosting (XGBoost) for Texas to predict multi-category drought impacts, connecting a typical drought indicator, the Standardized Precipitation Index (SPI), to the text-based impacts from the Drought Impact Reporter (DIR). The preliminary results of this study show an outstanding performance of the well-trained models in assessing drought impacts on agriculture, fire, society & public health, plants & wildlife, as well as relief, response & restrictions in Texas. The study also demonstrates the possibility of appraising drought impacts from hydro-meteorological indicators with the proposed framework across the United States, which could aid drought risk management by providing additional information and improving the update frequency of drought impacts. Our interpretation results using the Shapley additive explanation (SHAP) interpretability technique reveal that the rules guiding the predictions of XGBoost comply with domain expertise about the role that SPI indicators play in drought impacts.
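    A minimal sketch of such a pipeline, using the real xgboost and shap libraries but entirely synthetic stand-in data (SPI-like features and a made-up binary impact label, simplified from the paper's multi-category setting), would look as follows.

    ```python
    import numpy as np
    import xgboost as xgb
    import shap  # SHAP interpretability, as used in the paper

    # Hypothetical stand-in data: SPI-like indicator features -> impact label.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((1000, 6))               # e.g. SPI at several time scales
    y = (X[:, 0] + 0.5 * X[:, 1] < -1).astype(int)   # synthetic "impact occurred" label

    model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
    model.fit(X, y)

    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)           # per-feature attribution per sample
    print(np.abs(shap_values).mean(axis=0))          # global importance of each indicator
    ```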
    ON-DEMAND-FL: A Dynamic and Efficient Multi-Criteria Federated Learning Client Deployment Scheme. (arXiv:2211.02906v1 [cs.AI])
    In this paper, we increase the availability and integration of devices in the learning process to enhance the convergence of federated learning (FL) models. FL addresses the issue of gathering all data in one location: it maintains the ability to learn over decentralized datasets while preserving privacy, with the server aggregating the weight updates obtained from each client over a number of rounds until the model converges. The majority of the literature has suggested client selection techniques to accelerate convergence and boost accuracy. However, none of the existing proposals have focused on the flexibility to deploy and select clients as needed, wherever and whenever that may be. Due to highly dynamic environments, some devices are in fact unavailable to serve as FL clients, which affects the availability of data for learning and the applicability of existing client selection solutions. In this paper, we address these limitations by introducing On-Demand-FL, a client deployment approach for FL that offers more volume and heterogeneity of data in the learning process. We make use of containerization technology such as Docker to build efficient environments using IoT and mobile devices serving as volunteers, and Kubernetes is used for orchestration. A genetic algorithm (GA) is used to solve the resulting multi-objective optimization problem, owing to its evolutionary strategy. Experiments using the Mobile Data Challenge (MDC) dataset and the Localfed framework illustrate the relevance of the proposed approach and the efficiency of the on-the-fly deployment of clients whenever and wherever needed, with fewer discarded rounds and more available data.
    On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs. (arXiv:2209.15141v2 [cs.LG] UPDATED)
    We show that two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi, Bertsekas, & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require that the solution set of the average-reward optimality equation has only one degree of freedom, which is not necessarily true for weakly communicating MDPs. To the best of our knowledge, our results are the first to show that average-reward off-policy control algorithms converge in weakly communicating MDPs. As a direct extension, we show that the average-reward options algorithms for temporal abstraction introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by the options is weakly communicating.
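    For reference, a single tabular Differential Q-learning update, the first of the two algorithms whose convergence is analyzed, can be sketched as follows; the step sizes alpha and eta are placeholder values.

    ```python
    import numpy as np

    def differential_q_step(Q, r_bar, s, a, r, s_next, alpha=0.1, eta=1.0):
        """One tabular Differential Q-learning update: the learned reward-rate
        estimate r_bar replaces discounting in the TD error."""
        delta = r - r_bar + np.max(Q[s_next]) - Q[s, a]  # average-reward TD error
        Q[s, a] += alpha * delta
        r_bar += eta * alpha * delta                     # update reward-rate estimate
        return Q, r_bar

    # Toy usage on a 3-state, 2-action table.
    Q, r_bar = np.zeros((3, 2)), 0.0
    Q, r_bar = differential_q_step(Q, r_bar, s=0, a=1, r=1.0, s_next=2)
    ```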
    GFlowOut: Dropout with Generative Flow Networks. (arXiv:2210.12928v2 [cs.LG] UPDATED)
    Bayesian inference offers principled tools to tackle many critical problems with modern neural networks, such as poor calibration, poor generalization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way to perform approximate inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal, which can be difficult to approximate with standard variational inference, and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provides uncertainty estimates that lead to better performance in downstream tasks.
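    For context, the fixed-distribution baseline that GFlowOut improves upon is standard Monte Carlo Dropout, sketched below: masks are sampled independently from a fixed Bernoulli at test time rather than inferred per sample.

    ```python
    import torch
    import torch.nn as nn

    net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(),
                        nn.Dropout(p=0.2), nn.Linear(64, 1))

    def mc_dropout_predict(net, x, n_samples=50):
        """Monte Carlo Dropout baseline: keep dropout active at test time and
        average predictions over independently sampled masks."""
        net.train()  # keeps dropout stochastic (masks from a fixed Bernoulli)
        with torch.no_grad():
            preds = torch.stack([net(x) for _ in range(n_samples)])
        return preds.mean(0), preds.std(0)  # predictive mean and uncertainty

    mean, std = mc_dropout_predict(net, torch.randn(8, 16))
    ```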
    Off-Policy Evaluation with Policy-Dependent Optimization Response. (arXiv:2202.12958v2 [cs.LG] UPDATED)
    The intersection of causal inference and machine learning for decision-making is rapidly expanding, but the default decision criterion remains an \textit{average} of individual causal outcomes across a population. In practice, various operational restrictions ensure that a decision-maker's utility is not realized as an \textit{average} but rather as an \textit{output} of a downstream decision-making problem (such as matching, assignment, network flow, minimizing predictive risk). In this work, we develop a new framework for off-policy evaluation with \textit{policy-dependent} linear optimization responses: causal outcomes introduce stochasticity in objective function coefficients. Under this framework, a decision-maker's utility depends on the policy-dependent optimization, which introduces a fundamental challenge of \textit{optimization} bias even for the case of policy evaluation. We construct unbiased estimators for the policy-dependent estimand by a perturbation method, and discuss asymptotic variance properties for a set of adjusted plug-in estimators. Lastly, attaining unbiased policy evaluation allows for policy optimization: we provide a general algorithm for optimizing causal interventions. We corroborate our theoretical results with numerical simulations.
    Plasticity Neural Network Based on Astrocytic effects at Critical Period, Synaptic Competition and Strength Rebalance by Current and Mnemonic Brain Plasticity and Synapse Formation. (arXiv:2203.11740v7 [cs.NE] UPDATED)
    Besides the weights of synaptic connections, forward and backward propagation in the Plasticity Neural Network (PNN) also involve weights of synaptic ranges [15,16,19-22]. PNN captures synaptic strength balance both in the dynamics of synapse phagocytosis and in the static constraint of a constant total synapse length [15]; the lead behavior of a school of fish is also well embodied in our PNN. In experiments, synapse formation inhibits dendrite generation to a certain extent; in our simulations, synapse formation inhibits the function of dendrites [16]. Closing the critical period causes neurological disorders in experiments, and correspondingly worse results in PNN simulations [19]. The memory-persistence gradient information of the backward circuit is analogous to enforcing resilience in a Spring Boot application, and the relatively good and inferior gradient information in the synapse formation of the backward circuit resembles the folds of the brain. Considering the persistence of both negative and positive memories helps the synapse lengths change with iterations better than considering positive memories alone, so we use memories of fear learning and improvements in synaptic activity to make this clearly observable [20]. The memory-persistence factor also inhibits local synaptic accumulation, and the relatively good and inferior solutions can likewise be introduced to update particle velocities in PSO. By simulation, astrocytic phagocytosis avoids the local accumulation of synapses (in experiments, a lack of astrocytic phagocytosis causes excitatory and functionally impaired synapses to accumulate, destroying cognition; in PNN simulations it yields locally longer synapses and worse results) [21]. This relates human intelligence to cortical thickness and individual differences in the brain [22]. The simplest PNN has only synaptic phagocytosis, without the gradient update. Is it therefore possible to simulate and plan the factors of biological experiments through modeling?
    antGLasso: An Efficient Tensor Graphical Lasso Algorithm. (arXiv:2211.02920v1 [stat.ML])
    The class of bigraphical lasso algorithms (and, more broadly, 'tensor'-graphical lasso algorithms) has been used to estimate dependency structures within matrix and tensor data. However, all current methods to do so take prohibitively long on modestly sized datasets. We present a novel tensor-graphical lasso algorithm that analytically estimates the dependency structure, unlike its iterative predecessors. This provides a speedup of multiple orders of magnitude, allowing this class of algorithms to be used on large, real-world datasets.
    Deep Q-learning: a robust control approach. (arXiv:2201.08610v2 [cs.LG] UPDATED)
    In this paper, we place deep Q-learning into a control-oriented perspective and study its learning dynamics with well-established techniques from robust control. We formulate an uncertain linear time-invariant model by means of the neural tangent kernel to describe learning. We show the instability of learning and analyze the agent's behavior in the frequency domain. Then, we ensure convergence via robust controllers acting as dynamical rewards in the loss function. We synthesize three controllers: a state-feedback gain-scheduled $H_2$ controller, a dynamic $H_\infty$ controller, and a constant-gain $H_\infty$ controller. Setting up the learning agent with a control-oriented tuning methodology is more transparent and has a better-established literature than the heuristics used in reinforcement learning. In addition, our approach does not use a target network or a randomized replay memory. The role of the target network is taken over by the control input, which also exploits the temporal dependency of samples (as opposed to a randomized memory buffer). Numerical simulations in different OpenAI Gym environments suggest that the $H_\infty$-controlled learning performs slightly better than double deep Q-learning.
    Information in Infinite Ensembles of Infinitely-Wide Neural Networks. (arXiv:1911.09189v3 [cs.LG] UPDATED)
    In this preliminary work, we study the generalization properties of infinite ensembles of infinitely-wide neural networks. Amazingly, this model family admits tractable calculations for many information-theoretic quantities. We report analytical and empirical investigations in the search for signals that correlate with generalization.
    Random initialisations performing above chance and how to find them. (arXiv:2209.07509v2 [cs.LG] UPDATED)
    Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes.
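    A toy numpy sketch of the core operation, permuting one network's hidden units to match another before interpolating, is given below for a single weight matrix; the full algorithm aligns all layers of real SGD solutions, which this toy case (a network and its own shuffled copy) only caricatures.

    ```python
    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def align_hidden_units(Wa, Wb):
        """Find the permutation of A's hidden units that best matches B's
        (single weight matrix; rows index hidden units)."""
        cost = -Wa @ Wb.T                          # negative similarity A-unit vs B-unit
        rows, cols = linear_sum_assignment(cost)   # optimal one-to-one matching
        perm = np.empty_like(cols)
        perm[cols] = rows                          # slot j gets A's unit matched to B's j
        return perm

    # Toy case: network A is network B with shuffled hidden units.
    rng = np.random.default_rng(0)
    Wb = rng.standard_normal((32, 8))
    Wa = Wb[rng.permutation(32)]

    perm = align_hidden_units(Wa, Wb)
    W_mid = 0.5 * Wa[perm] + 0.5 * Wb              # interpolation after permutation
    print(np.allclose(W_mid, Wb))                  # True: barrier vanishes here
    ```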
    Near-optimal multiple testing in Bayesian linear models with finite-sample FDR control. (arXiv:2211.02778v1 [math.ST])
    In high dimensional variable selection problems, statisticians often seek to design multiple testing procedures controlling the false discovery rate (FDR) and simultaneously discovering more relevant variables. Model-X methods, such as Knockoffs and conditional randomization tests, achieve the first goal of finite-sample FDR control under the assumption of known covariates distribution. However, it is not clear whether these methods can concurrently achieve the second goal of maximizing the number of discoveries. In fact, designing procedures to discover more relevant variables with finite-sample FDR control is a largely open question, even in the arguably simplest linear models. In this paper, we derive near-optimal testing procedures in high dimensional Bayesian linear models with isotropic covariates. We propose a Model-X multiple testing procedure, PoEdCe, which provably controls the frequentist FDR from finite samples even under model misspecification, and conjecturally achieves near-optimal power when the data follow the Bayesian linear model with a known prior. PoEdCe has three important ingredients: Posterior Expectation, distilled Conditional randomization test (dCRT), and the Benjamini-Hochberg procedure with e-values (eBH). The optimality conjecture of PoEdCe is based on a heuristic calculation of its asymptotic true positive proportion (TPP) and false discovery proportion (FDP), which is supported by methods from statistical physics as well as extensive numerical simulations. Furthermore, when the prior is unknown, we show that an empirical Bayes variant of PoEdCe still has finite-sample FDR control and achieves near-optimal power.
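    Of the three ingredients, the e-BH step is simple enough to sketch directly: given e-values for the $n$ hypotheses, it rejects the $k$ largest, for the largest $k$ whose $k$-th largest e-value clears $n/(\alpha k)$. A minimal implementation:

    ```python
    import numpy as np

    def ebh(e_values, alpha=0.1):
        """Benjamini-Hochberg with e-values (e-BH): reject the hypotheses with
        the k largest e-values, where k is the largest index such that the
        k-th largest e-value is >= n / (alpha * k)."""
        e = np.asarray(e_values, dtype=float)
        n = len(e)
        order = np.argsort(-e)                      # indices by decreasing e-value
        thresh = n / (alpha * np.arange(1, n + 1))  # n/(alpha*k) for k = 1..n
        ok = e[order] >= thresh
        k = np.max(np.nonzero(ok)[0]) + 1 if ok.any() else 0
        return order[:k]                            # indices of rejected hypotheses

    print(ebh([120.0, 3.0, 45.0, 0.5, 80.0], alpha=0.1))
    ```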
    Decentralized Multi-Target Cross-Domain Recommendation for Multi-Organization Collaborations. (arXiv:2110.13340v3 [cs.IR] UPDATED)
    Recommender Systems (RSs) are operated locally by different organizations in many realistic scenarios. If various organizations can fully share their data and perform computation in a centralized manner, they may significantly improve the accuracy of recommendations. However, collaborations among multiple organizations in enhancing the performance of recommendations are primarily limited due to the difficulty of sharing data and models. To address this challenge, we propose Decentralized Multi-Target Cross-Domain Recommendation (DMTCDR) with Multi-Target Assisted Learning (MTAL) and Assisted AutoEncoder (AAE). Our method can help multiple organizations collaboratively improve their recommendation performance in a decentralized manner without sharing sensitive assets. Consequently, it allows decentralized organizations to collaborate and form a community of shared interest. We conduct extensive experiments to demonstrate that the new method can significantly outperform locally trained RSs and mitigate the cold start problem.
    A Survey on Influence Maximization: From an ML-Based Combinatorial Optimization. (arXiv:2211.03074v1 [cs.SI])
    Influence Maximization (IM) is a classical combinatorial optimization problem with wide applications in mobile networks, social computing, and recommendation systems. It aims to select a small number of users so as to maximize the influence spread across an online social network. Because of its potential commercial and academic value, many researchers have studied the IM problem from different perspectives. The main challenge comes from the NP-hardness of the IM problem and the \#P-hardness of estimating the influence spread; traditional algorithms for tackling it fall into two classes: heuristic algorithms and approximation algorithms. However, heuristic algorithms come with no theoretical guarantee, and the theoretical design of approximation algorithms is close to its limit, so it is almost impossible to further improve their performance. With the rapid development of artificial intelligence, Machine Learning (ML) based technology has achieved remarkable results in many fields. In view of this, a number of new methods have emerged in recent years that solve combinatorial optimization problems using ML-based techniques. These methods have the advantages of fast solving speed and strong generalization to unknown graphs, providing a brand-new direction for solving combinatorial optimization problems. We therefore set aside the traditional algorithms based on iterative search and review the recent development of ML-based methods, especially Deep Reinforcement Learning, for solving the IM problem and its variants in social networks. We focus on summarizing the relevant background knowledge, basic principles, common methods, and applied research. Finally, we point out the challenges that future IM research urgently needs to address.
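    For orientation, the classical baseline that the ML-based methods surveyed here aim to replace is the Monte Carlo greedy algorithm under the Independent Cascade model, sketched below on a toy directed graph.

    ```python
    import random

    def simulate_ic(graph, seeds, p=0.1):
        """One Monte Carlo run of the Independent Cascade model;
        returns the number of activated nodes."""
        active, frontier = set(seeds), list(seeds)
        while frontier:
            nxt = []
            for u in frontier:
                for v in graph.get(u, []):
                    if v not in active and random.random() < p:
                        active.add(v)
                        nxt.append(v)
            frontier = nxt
        return len(active)

    def greedy_im(graph, k, runs=200):
        """Classical (1 - 1/e)-approximate greedy baseline: repeatedly add the
        node with the largest estimated marginal spread."""
        seeds = []
        nodes = set(graph) | {v for nbrs in graph.values() for v in nbrs}
        for _ in range(k):
            best = max((n for n in nodes if n not in seeds),
                       key=lambda n: sum(simulate_ic(graph, seeds + [n])
                                         for _ in range(runs)))
            seeds.append(best)
        return seeds

    g = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5]}  # toy directed graph
    print(greedy_im(g, k=2))
    ```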
    Clustering with Tangles: Algorithmic Framework and Theoretical Guarantees. (arXiv:2006.14444v3 [cs.LG] UPDATED)
    Originally, tangles were invented as an abstract tool in mathematical graph theory to prove the famous graph minor theorem. In this paper, we showcase the practical potential of tangles in machine learning applications. Given a collection of cuts of any dataset, tangles aggregate these cuts to point in the direction of a dense structure. As a result, a cluster is softly characterized by a set of consistent pointers. This highly flexible approach can solve clustering problems in various setups, ranging from questionnaires over community detection in graphs to clustering points in metric spaces. The output of our proposed framework is hierarchical and induces the notion of a soft dendrogram, which can help explore the cluster structure of a dataset. The computational complexity of aggregating the cuts is linear in the number of data points. Thus the bottleneck of the tangle approach is to generate the cuts, for which simple and fast algorithms form a sufficient basis. In our paper we construct the algorithmic framework for clustering with tangles, prove theoretical guarantees in various settings, and provide extensive simulations and use cases. Python code is available on github.
    Shapes of Emotions: Multimodal Emotion Recognition in Conversations via Emotion Shifts. (arXiv:2112.01938v2 [cs.CL] UPDATED)
    Emotion Recognition in Conversations (ERC) is an important and active research area. Recent work has shown the benefits of using multiple modalities (e.g., text, audio, and video) for the ERC task. In a conversation, participants tend to maintain a particular emotional state unless some stimulus evokes a change; there is a continuous ebb and flow of emotions in a conversation. Inspired by this observation, we propose a multimodal ERC model and augment it with an emotion-shift component that improves performance. The proposed emotion-shift component is modular and can be added to any existing multimodal ERC model (with a few modifications). We experiment with different variants of the model, and results show that the inclusion of the emotion-shift signal helps the model outperform existing models for ERC on the MOSEI and IEMOCAP datasets.
    Disentangled and Side-aware Unsupervised Domain Adaptation for Cross-dataset Subjective Tinnitus Diagnosis. (arXiv:2205.03230v2 [eess.SP] UPDATED)
    EEG-based tinnitus classification is a valuable tool for tinnitus diagnosis, research, and treatment. Most current works are limited to a single dataset where data patterns are similar. However, EEG signals are highly non-stationary, resulting in poor model generalization to new users, sessions, or datasets. Designing a model that can generalize to new datasets is therefore beneficial and indispensable. To mitigate the distribution discrepancy across datasets, we propose Disentangled and Side-aware Unsupervised Domain Adaptation (DSUDA) for cross-dataset tinnitus diagnosis. A disentangled auto-encoder is developed to decouple class-irrelevant information from the EEG signals to improve classification ability. The side-aware unsupervised domain adaptation module adapts the class-irrelevant information as domain variance to a new dataset and excludes that variance to obtain class-distilled features for classifying the new dataset. It also aligns signals from the left and right ears to overcome inherent EEG pattern differences. We compare DSUDA with state-of-the-art methods, and our model achieves significant improvements over competitors on comprehensive evaluation criteria. The results demonstrate that our model can successfully generalize to a new dataset and effectively diagnose tinnitus.
    Toward Neural Network Simulation of Variational Quantum Algorithms. (arXiv:2211.02929v1 [quant-ph])
    Variational quantum algorithms (VQAs) utilize a hybrid quantum-classical architecture to recast problems of high-dimensional linear algebra as ones of stochastic optimization. Despite the promise of leveraging near- to intermediate-term quantum resources to accelerate this task, the computational advantage of VQAs over wholly classical algorithms has not been firmly established. For instance, while the variational quantum eigensolver (VQE) has been developed to approximate low-lying eigenmodes of high-dimensional sparse linear operators, analogous classical optimization algorithms exist in the variational Monte Carlo (VMC) literature, utilizing neural networks in place of quantum circuits to represent quantum states. In this paper we ask if classical stochastic optimization algorithms can be constructed paralleling other VQAs, focusing on the example of the variational quantum linear solver (VQLS). We find that such a construction can be applied to the VQLS, yielding a paradigm that could theoretically extend to other VQAs of similar form.
    An Evolutionary Approach to Dynamic Introduction of Tasks in Large-scale Multitask Learning Systems. (arXiv:2205.12755v4 [cs.LG] UPDATED)
    Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on high customization for each task and leverage size and data scale rather than scaling the number of tasks. Also, continual learning, which adds a temporal aspect to multitask learning, often focuses on common pitfalls such as catastrophic forgetting instead of being studied at large scale as a critical component for building the next generation of artificial intelligence. We propose an evolutionary method capable of generating large-scale multitask models that support the dynamic addition of new tasks. The generated multitask models are sparsely activated and integrate task-based routing that guarantees bounded compute cost and fewer added parameters per task as the model expands. The proposed method relies on a knowledge compartmentalization technique to achieve immunity against catastrophic forgetting and other common pitfalls such as gradient interference and negative transfer. We demonstrate empirically that the proposed method can jointly solve and achieve competitive results on 69 public image classification tasks, for example improving the state of the art on a competitive benchmark such as CIFAR-10 with a 15% relative error reduction compared to the best model trained on public data.
    Provable and Efficient Continual Representation Learning. (arXiv:2203.02026v2 [cs.LG] UPDATED)
    In continual learning (CL), the goal is to design models that can learn a sequence of tasks without catastrophic forgetting. While there is a rich set of techniques for CL, relatively little understanding exists of how representations built by previous tasks benefit new tasks that are added to the network. To address this, we study the problem of continual representation learning (CRL) where we learn an evolving representation as new tasks arrive. Focusing on zero-forgetting methods where tasks are embedded in subnetworks (e.g., PackNet), we first provide experiments demonstrating that CRL can significantly boost sample efficiency when learning new tasks. To explain this, we establish theoretical guarantees for CRL by providing sample complexity and generalization error bounds for new tasks, formalizing the statistical benefits of previously-learned representations. Our analysis and experiments also highlight the importance of the order in which we learn the tasks. Specifically, we show that CL benefits if the initial tasks have large sample size and high "representation diversity". Diversity ensures that adding new tasks incurs a small representation mismatch and that new tasks can be learned with few samples while training only a few additional nonzero weights. Finally, we ask whether one can ensure that each task subnetwork is efficient at inference time while retaining the benefits of representation learning. To this end, we propose an inference-efficient variation of PackNet called Efficient Sparse PackNet (ESPN) which employs joint channel & weight pruning. ESPN embeds tasks in channel-sparse subnets requiring up to 80% fewer FLOPs to compute while approximately retaining accuracy, and is very competitive with a variety of baselines. In summary, this work takes a step towards data- and compute-efficient CL with a representation learning perspective. GitHub page: https://github.com/ucr-optml/CtRL
    List-Mode PET Image Reconstruction Using Deep Image Prior. (arXiv:2204.13404v2 [physics.med-ph] UPDATED)
    List-mode positron emission tomography (PET) image reconstruction is an important tool for PET scanners with many lines of response and additional information such as time-of-flight and depth-of-interaction. Deep learning is one possible solution for enhancing the quality of PET image reconstruction. However, the application of deep learning techniques to list-mode PET image reconstruction has not progressed because list data is a sequence of bit codes unsuitable for processing by convolutional neural networks (CNNs). In this study, we propose a novel list-mode PET image reconstruction method using an unsupervised CNN called deep image prior (DIP), which is the first attempt to integrate list-mode PET image reconstruction and CNNs. The proposed list-mode DIP reconstruction (LM-DIPRecon) method alternately iterates the regularized list-mode dynamic row action maximum likelihood algorithm (LM-DRAMA) and magnetic resonance imaging conditioned DIP (MR-DIP) using an alternating direction method of multipliers. We evaluated LM-DIPRecon using both simulation and clinical data, and it achieved sharper images and better tradeoff curves between contrast and noise than the LM-DRAMA, MR-DIP, and sinogram-based DIPRecon methods. These results indicate that LM-DIPRecon is useful for quantitative PET imaging with limited events while keeping accurate raw-data information. In addition, as list data has finer temporal information than dynamic sinograms, list-mode deep image prior reconstruction is expected to be useful for 4D PET imaging and motion correction.
    Making Intelligence: Ethics, IQ, and ML Benchmarks. (arXiv:2209.00692v2 [cs.LG] UPDATED)
    The ML community recognizes the importance of anticipating and mitigating the potential negative impacts of benchmark research. In this position paper, we argue that more attention must be paid to areas of ethical risk at the technical and scientific core of ML benchmarks. We identify overlooked structural similarities between human IQ and ML benchmarks. These share similarities in setting standards for describing, evaluating, and comparing performance on tasks relevant to intelligence. Drawing on prior research on IQ benchmarks from feminist philosophy of science, we argue that values need to be considered when creating ML benchmarks and datasets, and that it is not possible to avoid this choice by creating benchmarks that are value-neutral. Finally, we outline practical recommendations for benchmark research ethics and ethics review.
    Synthetic Data for Feature Selection. (arXiv:2211.03035v1 [cs.LG])
    Feature selection is an important and active field of research in machine learning and data science. Our goal in this paper is to propose a collection of synthetic datasets that can be used as a common reference point for feature selection algorithms. Synthetic datasets allow for precise evaluation of selected features and control of the data parameters for comprehensive assessment. The proposed datasets are based on applications from electronics in order to mimic real life scenarios. To illustrate the utility of the proposed data we employ one of the datasets to test several popular feature selection algorithms. The datasets are made publicly available on GitHub and can be used by researchers to evaluate feature selection algorithms.
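    The premise, that synthetic data with known relevant features allows exact scoring of a selector, can be illustrated with scikit-learn (this mimics the idea, not the paper's electronics-based datasets):

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import RFE

    # Known ground truth: only the first 5 features are informative
    # (shuffle=False keeps the informative columns first).
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                               n_redundant=0, shuffle=False, random_state=0)

    selector = RFE(RandomForestClassifier(n_estimators=100, random_state=0),
                   n_features_to_select=5).fit(X, y)
    selected = [i for i, kept in enumerate(selector.support_) if kept]
    print(selected, "-> recall of true features:",
          len(set(selected) & set(range(5))) / 5)
    ```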
    XG-BoT: An Explainable Deep Graph Neural Network for Botnet Detection and Forensics. (arXiv:2207.09088v2 [cs.CR] UPDATED)
    In this paper, we propose XG-BoT, an explainable deep graph neural network model for botnet node detection. The proposed model is mainly composed of a botnet detector and an explainer for automatic forensics. The XG-BoT detector can effectively detect malicious botnet nodes under large-scale networks. Specifically, it utilizes a grouped reversible residual connection with a graph isomorphism network to learn expressive node representations from the botnet communication graphs. The explainer in XG-BoT can perform automatic network forensics by highlighting suspicious network flows and related botnet nodes. We evaluated XG-BoT based on real-world, large-scale botnet network graph datasets. Overall, XG-BoT is able to outperform the state-of-the-art in terms of key evaluation metrics. In addition, we show that the XG-BoT explainer can generate useful explanations based on GNNExplainer and saliency map for automatic network forensics.
    Lipschitz regularized gradient flows and latent generative particles. (arXiv:2210.17230v2 [stat.ML] UPDATED)
    Lipschitz regularized f-divergences are constructed by imposing a bound on the Lipschitz constant of the discriminator in the variational representation. They interpolate between the Wasserstein metric and f-divergences and provide a flexible family of loss functions for non-absolutely continuous (e.g. empirical) distributions, possibly with heavy tails. We construct Lipschitz regularized gradient flows on the space of probability measures based on these divergences. Examples of such gradient flows are Lipschitz regularized Fokker-Planck and porous medium partial differential equations (PDEs) for the Kullback-Leibler and alpha-divergences, respectively. The regularization corresponds to imposing a Courant-Friedrichs-Lewy numerical stability condition on the PDEs. For empirical measures, the Lipschitz regularization on gradient flows induces a numerically stable transporter/discriminator particle algorithm, where the generative particles are transported along the gradient of the discriminator. The gradient structure leads to a regularized Fisher information (particle kinetic energy) used to track the convergence of the algorithm. The Lipschitz regularized discriminator can be implemented via neural network spectral normalization and the particle algorithm generates approximate samples from possibly high-dimensional distributions known only from data. Notably, our particle algorithm can generate synthetic data even in small sample size regimes. A new data processing inequality for the regularized divergence allows us to combine our particle algorithm with representation learning, e.g. autoencoder architectures. The resulting algorithm yields markedly improved generative properties in terms of efficiency and quality of the synthetic samples. From a statistical mechanics perspective the encoding can be interpreted dynamically as learning a better mobility for the generative particles.
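    A bare-bones PyTorch sketch of the general recipe (not the paper's implementation): the discriminator's Lipschitz constant is controlled with spectral normalization, and the generative particles are transported along its gradient; the step size and training schedule are assumed values.

    ```python
    import torch
    import torch.nn as nn

    # Discriminator with a bounded Lipschitz constant via spectral normalization.
    disc = nn.Sequential(
        nn.utils.spectral_norm(nn.Linear(2, 64)), nn.ReLU(),
        nn.utils.spectral_norm(nn.Linear(64, 1)),
    )

    target = torch.randn(512, 2) + torch.tensor([3.0, 0.0])  # data samples
    particles = torch.randn(512, 2)                           # generative particles

    opt = torch.optim.Adam(disc.parameters(), lr=1e-3)
    for _ in range(200):
        # Train the discriminator to assign higher scores to data than particles.
        loss = disc(particles.detach()).mean() - disc(target).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

        # Transport particles along the discriminator's gradient (toward data).
        particles.requires_grad_(True)
        grad = torch.autograd.grad(disc(particles).sum(), particles)[0]
        particles = (particles + 0.1 * grad).detach()         # step size assumed
    ```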
    Prototypical quadruplet for few-shot class incremental learning. (arXiv:2211.02947v1 [cs.CV])
    Many modern computer vision algorithms suffer from two major bottlenecks: scarcity of data and the need to learn new tasks incrementally. When trained on new batches of data, a model loses its ability to classify previous data judiciously, which is termed catastrophic forgetting. Conventional methods have tried to mitigate catastrophic forgetting of previously learned data at the cost of compromising training in the current session. State-of-the-art generative-replay-based approaches use complicated structures such as generative adversarial networks (GANs) to deal with catastrophic forgetting; additionally, training a GAN with few samples may lead to instability. In this work, we present a novel method to deal with these two major hurdles. Our method identifies a better embedding space with an improved contrastive loss to make classification more robust. Moreover, our approach is able to retain previously acquired knowledge in the embedding space even when trained with new classes. We update previous session class prototypes during training in such a way that they are able to represent the true class means. This is of prime importance, as our classification rule is based on the nearest-class-mean classification strategy. We demonstrate our results by showing that the embedding space remains intact after training the model with new classes, and that our method performs better than existing state-of-the-art algorithms in terms of accuracy across different sessions.
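    The classification rule the method builds on, nearest class mean over session-wise prototypes, can be sketched as follows; the paper's prototype updating is more involved than this plain per-session mean.

    ```python
    import numpy as np

    class NearestClassMean:
        """Nearest-class-mean classifier in an embedding space; prototypes can
        be added as new sessions arrive without revisiting old data."""
        def __init__(self):
            self.prototypes = {}                     # class id -> mean embedding

        def add_session(self, Z, y):
            for c in np.unique(y):
                self.prototypes[c] = Z[y == c].mean(axis=0)

        def predict(self, Z):
            classes = list(self.prototypes)
            protos = np.stack([self.prototypes[c] for c in classes])
            d = np.linalg.norm(Z[:, None] - protos[None], axis=-1)
            return np.array(classes)[d.argmin(axis=1)]

    rng = np.random.default_rng(0)
    clf = NearestClassMean()
    clf.add_session(rng.standard_normal((100, 16)), rng.integers(0, 5, 100))
    print(clf.predict(rng.standard_normal((3, 16))))
    ```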
    The Importance of Suppressing Complete Reconstruction in Autoencoders for Unsupervised Outlier Detection. (arXiv:2211.03054v1 [stat.ML])
    Autoencoders are widely used in outlier detection due to their superiority in handling high-dimensional and nonlinear datasets. The reconstruction of any dataset by an autoencoder can be considered a complex regression process. In regression analysis, outliers can usually be divided into high leverage points and influential points. Although the autoencoder has shown good results for the identification of influential points, there are still some problems when detecting high leverage points. Through theoretical derivation, we find that most outliers are detected in the direction corresponding to the worst-recovered principal component, while in the directions of the well-recovered principal components, anomalies are often ignored. We propose a new loss function that solves these deficiencies in outlier detection. The core idea of our scheme is that, in order to better detect high leverage points, we should suppress the complete reconstruction of the dataset so as to convert high leverage points into influential points, while also ensuring that the differences between the eigenvalues of the covariance matrix of the original dataset and their corresponding reconstructed results along each principal component are equal. In addition, we justify our scheme through rigorous theoretical derivation. Finally, our experiments on multiple datasets confirm that our scheme significantly improves the accuracy of outlier detection.
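    For contrast, the vanilla reconstruction-error detector whose blind spot on high leverage points motivates the proposed loss looks roughly like this (stand-in data and an assumed simple threshold):

    ```python
    import torch
    import torch.nn as nn

    # Vanilla autoencoder trained to (fully) reconstruct the data.
    ae = nn.Sequential(nn.Linear(30, 8), nn.ReLU(), nn.Linear(8, 30))
    opt = torch.optim.Adam(ae.parameters(), lr=1e-3)

    X = torch.randn(1024, 30)                     # stand-in training data
    for _ in range(200):
        loss = nn.functional.mse_loss(ae(X), X)
        opt.zero_grad()
        loss.backward()
        opt.step()

    with torch.no_grad():
        scores = ((ae(X) - X) ** 2).mean(dim=1)   # per-sample reconstruction error
    outliers = scores > scores.mean() + 3 * scores.std()  # assumed simple threshold
    ```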
    Neural multi-event forecasting on spatio-temporal point processes using probabilistically enriched transformers. (arXiv:2211.02922v1 [cs.LG])
    Predicting discrete events in time and space has many scientific applications, such as predicting hazardous earthquakes and outbreaks of infectious diseases. History-dependent spatio-temporal Hawkes processes are often used to mathematically model these point events. However, previous approaches have faced numerous challenges, particularly when attempting to forecast one or multiple future events. In this work, we propose a new neural architecture for multi-event forecasting of spatio-temporal point processes, utilizing transformers, augmented with normalizing flows and probabilistic layers. Our network makes batched predictions of complex history-dependent spatio-temporal distributions of future discrete events, achieving state-of-the-art performance on a variety of benchmark datasets including the South California Earthquakes, Citibike, Covid-19, and Hawkes synthetic pinwheel datasets. More generally, we illustrate how our network can be applied to any dataset of discrete events with associated markers, even when no underlying physics is known.
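    For reference, the parametric model that the neural architecture generalizes is the Hawkes conditional intensity; a one-dimensional exponential-kernel version, with placeholder parameters, is sketched below.

    ```python
    import numpy as np

    def hawkes_intensity(t, history, mu=0.2, alpha=0.8, beta=1.5):
        """Conditional intensity of a 1-D exponential Hawkes process:
        lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
        The paper's network replaces this parametric form with a learned,
        history-dependent spatio-temporal distribution."""
        history = np.asarray(history, dtype=float)
        past = history[history < t]
        return mu + alpha * np.exp(-beta * (t - past)).sum()

    print(hawkes_intensity(5.0, history=[1.0, 2.5, 4.8]))
    ```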
    A Neural Tangent Kernel Perspective of GANs. (arXiv:2106.05566v5 [cs.LG] UPDATED)
    We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We reveal a fundamental flaw of previous analyses which, by incorrectly modeling GANs' training scheme, are subject to ill-defined discriminator gradients. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network. From this, we derive new insights about the convergence of the generated distribution, advancing our understanding of GANs' training dynamics. We empirically corroborate these results via an analysis toolkit based on our framework, unveiling intuitions that are consistent with GAN practice.
    A Continual Development Methodology for Large-scale Multitask Dynamic ML Systems. (arXiv:2209.07326v3 [cs.LG] UPDATED)
    The traditional Machine Learning (ML) methodology requires fragmenting the development and experimental process into disconnected iterations whose feedback is used to guide design or tuning choices. This methodology has multiple efficiency and scalability disadvantages, such as spending significant resources on the creation of multiple trial models that do not contribute to the final solution. The presented work is based on the intuition that defining ML models as modular and extensible artefacts makes it possible to introduce a novel ML development methodology enabling the integration of multiple design and evaluation iterations into the continuous enrichment of a single unbounded intelligent system. We define a novel method for the generation of dynamic multitask ML models as a sequence of extensions and generalizations. We first analyze the capabilities of the proposed method using the standard ML empirical evaluation methodology. Finally, we propose a novel continuous development methodology that makes it possible to dynamically extend a pre-existing multitask large-scale ML system while analyzing the properties of the proposed method's extensions. This results in the generation of an ML model capable of jointly solving 124 image classification tasks, achieving state-of-the-art quality with improved size and compute cost.
    On Output Activation Functions for Adversarial Losses: A Theoretical Analysis via Variational Divergence Minimization and An Empirical Study on MNIST Classification. (arXiv:1901.08753v3 [cs.LG] UPDATED)
    Recent years have seen adversarial losses being applied to many fields. Their applications extend beyond the originally proposed generative modeling to conditional generative and discriminative settings. While prior work has proposed various output activation functions and regularization approaches, some open questions still remain unanswered. In this paper, we aim to study the following two research questions: 1) What types of output activation functions form a well-behaved adversarial loss? 2) How do different combinations of output activation functions and regularization approaches perform empirically against one another? To answer the first question, we adopt the perspective of variational divergence minimization and consider an adversarial loss well-behaved if it behaves as a divergence-like measure between the data and model distributions. Using a generalized formulation for adversarial losses, we derive the necessary and sufficient conditions of a well-behaved adversarial loss. Our analysis reveals a large class of theoretically valid adversarial losses. For the second question, we propose a simple comparative framework for adversarial losses using discriminative adversarial networks. The proposed framework allows us to efficiently evaluate adversarial losses using a standard evaluation metric such as the classification accuracy. With the proposed framework, we evaluate a comprehensive set of 168 combinations of twelve output activation functions and fourteen regularization approaches on the handwritten digit classification problem to decouple their effects. Our empirical findings suggest that there is no single winning combination of output activation functions and regularization approaches across all settings. Our theoretical and empirical results may together serve as a reference for choosing or designing adversarial losses in future research.
    Deliberation Networks and How to Train Them. (arXiv:2211.03217v1 [cs.CL])
    Deliberation networks are a family of sequence-to-sequence models, which have achieved state-of-the-art performance in a wide range of tasks such as machine translation and speech synthesis. A deliberation network consists of multiple standard sequence-to-sequence models, each one conditioned on the initial input and the output of the previous model. During training, there are several key questions: whether to apply Monte Carlo approximation to the gradients or the loss, whether to train the standard models jointly or separately, whether to run an intermediate model in teacher forcing or free running mode, whether to apply task-specific techniques. Previous work on deliberation networks typically explores one or two training options for a specific task. This work introduces a unifying framework, covering various training options, and addresses the above questions. In general, it is simpler to approximate the gradients. When parallel training is essential, separate training should be adopted. Regardless of the task, the intermediate model should be in free running mode. For tasks where the output is continuous, a guided attention loss can be used to prevent degradation into a standard model.
    An Interpretable Probabilistic Model for Short-Term Solar Power Forecasting Using Natural Gradient Boosting. (arXiv:2108.04058v2 [stat.AP] UPDATED)
    PV power forecasting models are predominantly based on machine learning algorithms that do not provide any insight into or explanation of their predictions (black boxes). This hinders their direct implementation in environments where transparency is required, and the trust associated with their predictions may be questioned. To this end, we propose a two-stage probabilistic forecasting framework able to generate highly accurate, reliable, and sharp forecasts while offering full transparency on both the point forecasts and the prediction intervals (PIs). In the first stage, we exploit natural gradient boosting (NGBoost) to yield probabilistic forecasts; in the second stage, we calculate the Shapley additive explanation (SHAP) values in order to fully comprehend why a prediction was made. To highlight the performance and applicability of the proposed framework, real data from two PV parks located in Southern Germany are employed. Comparative results with two state-of-the-art algorithms, namely Gaussian process and lower-upper bound estimation, show a significant increase in point forecast accuracy and in overall probabilistic performance. Most importantly, a detailed analysis of the model's complex nonlinear relationships and the interaction effects between the various features is presented. This allows interpreting the model, identifying learned physical properties, explaining individual predictions, reducing the computational requirements for training without jeopardizing model accuracy, detecting possible bugs, and gaining trust in the model. Finally, we conclude that the model was able to develop complex nonlinear relationships that follow known physical properties as well as human logic and intuition.
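    A minimal sketch of the first stage using the ngboost package (with its default Normal output distribution) on made-up stand-in data follows; the attribute names are taken from the package's documented interface, and the SHAP stage is omitted.

    ```python
    import numpy as np
    from ngboost import NGBRegressor

    # Hypothetical stand-in for PV data: weather features -> power output.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((2000, 4))            # e.g. irradiance, temperature, ...
    y = 2.0 * X[:, 0] + 0.3 * rng.standard_normal(2000)

    ngb = NGBRegressor(n_estimators=300).fit(X[:1500], y[:1500])
    point = ngb.predict(X[1500:])                 # point forecasts
    dists = ngb.pred_dist(X[1500:])               # full predictive distributions
    # ~95% prediction intervals from the Normal parameters.
    lower = dists.params["loc"] - 1.96 * dists.params["scale"]
    upper = dists.params["loc"] + 1.96 * dists.params["scale"]
    ```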
    Calibration Matters: Tackling Maximization Bias in Large-scale Advertising Recommendation Systems. (arXiv:2205.09809v4 [cs.LG] UPDATED)
    Calibration is defined as the ratio of the average predicted click rate to the true click rate. The optimization of calibration is essential to many online advertising recommendation systems because it directly affects the downstream bids in ads auctions and the amount of money charged to advertisers. Despite its importance, calibration optimization often suffers from a problem called "maximization bias". Maximization bias refers to the phenomenon that the maximum of predicted values overestimates the true maximum. The problem arises because the calibration is computed on the set selected by the prediction model itself. It persists even if unbiased predictions can be achieved on every datapoint and worsens when covariate shifts exist between the training and test sets. To mitigate this problem, we theoretically quantify maximization bias and propose a variance-adjusting debiasing (VAD) meta-algorithm in this paper. The algorithm is efficient, robust, and practical as it is able to mitigate maximization bias problems under covariate shifts, neither incurring additional online serving costs nor compromising the ranking performance. We demonstrate the effectiveness of the proposed algorithm using a state-of-the-art recommendation neural network model on a large-scale real-world dataset.
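    A toy numpy demonstration of maximization bias: each per-ad prediction below is unbiased, yet on the set the model itself selects, the average prediction overshoots the average true click rate, so calibration exceeds 1.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    true_ctr = rng.uniform(0.01, 0.05, size=10_000)          # true click rates of ads
    predicted = true_ctr + rng.normal(0, 0.01, size=10_000)  # unbiased per-ad predictions

    top = np.argsort(predicted)[-100:]                       # ads the model itself selects
    calibration = predicted[top].mean() / true_ctr[top].mean()
    print(f"calibration on the selected set: {calibration:.2f}")  # > 1: overestimation
    ```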
    Personalizing Sustainable Agriculture with Causal Machine Learning. (arXiv:2211.03179v1 [cs.LG])
    To fight climate change and accommodate the increasing population, global crop production has to be strengthened. To achieve the "sustainable intensification" of agriculture, transforming it from carbon emitter to carbon sink is a priority, and understanding the environmental impact of agricultural management practices is a fundamental prerequisite to that. At the same time, the global agricultural landscape is deeply heterogeneous, with differences in climate, soil, and land use inducing variations in how agricultural systems respond to farmer actions. The "personalization" of sustainable agriculture with the provision of locally adapted management advice is thus a necessary condition for the efficient uplift of green metrics, and an integral development in imminent policies. Here, we formulate personalized sustainable agriculture as a Conditional Average Treatment Effect estimation task and use Causal Machine Learning for tackling it. Leveraging climate data, land use information and employing Double Machine Learning, we estimate the heterogeneous effect of sustainable practices on the field-level Soil Organic Carbon content in Lithuania. We thus provide a data-driven perspective for targeting sustainable practices and effectively expanding the global carbon sink.
    WeakIdent: Weak formulation for Identifying Differential Equations using Narrow-fit and Trimming. (arXiv:2211.03134v1 [math.NA])
    Data-driven identification of differential equations is an interesting but challenging problem, especially when the given data are corrupted by noise. When the governing differential equation is a linear combination of various differential terms, the identification problem can be formulated as solving a linear system, with the feature matrix consisting of linear and nonlinear terms multiplied by a coefficient vector. This product is equal to the time derivative term, and thus generates dynamical behaviors. The goal is to identify the correct terms that form the equation to capture the dynamics of the given data. We propose a general and robust framework to recover differential equations using a weak formulation, for both ordinary and partial differential equations (ODEs and PDEs). The weak formulation facilitates an efficient and robust way to handle noise. For a robust recovery against noise and the choice of hyper-parameters, we introduce two new mechanisms, narrow-fit and trimming, for the coefficient support and value recovery, respectively. For each sparsity level, Subspace Pursuit is utilized to find an initial set of support from the large dictionary. Then, we focus on highly dynamic regions (rows of the feature matrix), and error normalize the feature matrix in the narrow-fit step. The support is further updated via trimming of the terms that contribute the least. Finally, the support set of features with the smallest Cross-Validation error is chosen as the result. A comprehensive set of numerical experiments are presented for both systems of ODEs and PDEs with various noise levels. The proposed method gives a robust recovery of the coefficients, and a significant denoising effect which can handle up to $100\%$ noise-to-signal ratio for some equations. We compare the proposed method with several state-of-the-art algorithms for the recovery of differential equations.
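    To fix ideas, here is a bare-bones sparse-regression identification loop in numpy: build a feature dictionary, solve least squares, and hard-threshold small coefficients (in the spirit of SINDy-style strong-form identification; the paper's weak formulation, narrow-fit, and trimming are considerably more robust to noise than this).

    ```python
    import numpy as np

    # Toy data from dx/dt = -2 x: x(t) = x0 * exp(-2 t), plus a little noise.
    t = np.linspace(0.0, 2.0, 400)
    x = 1.5 * np.exp(-2.0 * t) + 1e-4 * np.random.default_rng(0).standard_normal(400)
    dxdt = np.gradient(x, t)                      # time derivative (noise-amplifying)

    # Feature dictionary of candidate terms and least-squares fit.
    Theta = np.column_stack([x, x**2, x**3])      # candidate right-hand-side terms
    coef, *_ = np.linalg.lstsq(Theta, dxdt, rcond=None)

    # Hard-threshold small coefficients to enforce sparsity.
    coef[np.abs(coef) < 0.1] = 0.0
    print(dict(zip(["x", "x^2", "x^3"], coef.round(3))))  # expect roughly {-2, 0, 0}
    ```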
    Comparison of Data Representations and Machine Learning Architectures for User Identification on Arbitrary Motion Sequences. (arXiv:2210.00527v2 [cs.LG] UPDATED)
    Reliable and robust user identification and authentication are important and often necessary requirements for many digital services. It becomes paramount in social virtual reality (VR) to ensure trust, specifically in digital encounters with lifelike realistic-looking avatars as faithful replications of real persons. Recent research has shown that the movements of users in extended reality (XR) systems carry user-specific information and can thus be used to verify their identities. This article compares three different potential encodings of the motion data from head and hands (scene-relative, body-relative, and body-relative velocities), and the performances of five different machine learning architectures (random forest, multi-layer perceptron, fully recurrent neural network, long-short term memory, gated recurrent unit). We use the publicly available dataset "Talking with Hands" and publish all code to allow reproducibility and to provide baselines for future work. After hyperparameter optimization, the combination of a long-short term memory architecture and body-relative data outperformed competing combinations: the model correctly identifies any of the 34 subjects with an accuracy of 100% within 150 seconds. Altogether, our approach provides an effective foundation for behaviometric-based identification and authentication to guide researchers and practitioners. Data and code are published under https://go.uniwue.de/58w1r.
    Networked Federated Learning. (arXiv:2105.12769v2 [cs.LG] UPDATED)
    We develop the theory and algorithmic toolbox for networked federated learning in decentralized collections of local datasets with an intrinsic network structure. This network structure arises from domain-specific notions of similarity between local datasets. Different notions of similarity are induced by spatio-temporal proximity, statistical dependencies or functional relations. Our main conceptual contribution is to formulate networked federated learning using a generalized total variation minimization. This formulation unifies and considerably extends existing federated multi-task learning methods. It is highly flexible and can be combined with a broad range of parametric models including Lasso or deep neural networks. Our main algorithmic contribution is a novel networked federated learning algorithm which is well suited for distributed computing environments such as edge computing over wireless networks. This algorithm is robust against inexact computations arising from limited computational resources including processing time or bandwidth. For local models resulting in convex problems, we derive precise conditions on the local models and their network structure such that our algorithm learns nearly optimal local models. Our analysis reveals an interesting interplay between the convex geometry of local models and the (cluster-) geometry of their network structure.
    Simulation-guided Beam Search for Neural Combinatorial Optimization. (arXiv:2207.06190v2 [cs.LG] UPDATED)
    Neural approaches for combinatorial optimization (CO) equip a learning mechanism to discover powerful heuristics for solving complex real-world problems. While neural approaches capable of producing high-quality solutions in a single shot are emerging, state-of-the-art approaches are often unable to take full advantage of the solving time available to them. In contrast, hand-crafted heuristics perform highly effective search and exploit the computation time given to them, but rely on rules that are difficult to adapt to the dataset being solved. With the goal of providing a powerful search procedure to neural CO approaches, we propose simulation-guided beam search (SGBS), which examines candidate solutions within a fixed-width tree search that both a neural net-learned policy and a simulation (rollout) identify as promising. We further hybridize SGBS with efficient active search (EAS), where SGBS enhances the quality of solutions backpropagated in EAS, and EAS improves the quality of the policy used in SGBS. We evaluate our methods on well-known CO benchmarks and show that SGBS significantly improves the quality of the solutions found under reasonable runtime assumptions.
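    One expansion level of such a search might look as follows (an illustrative sketch under assumed policy/rollout callables; the paper's method additionally interleaves this with EAS updates):
        def sgbs_step(beam, policy, rollout, width=4, expand=2):
            # beam: list of partial solutions (tuples of actions)
            # policy: partial -> {action: probability} (learned, assumed given)
            # rollout: partial -> objective value of greedily completing it
            candidates = []
            for partial in beam:
                probs = policy(partial)
                top = sorted(probs, key=probs.get, reverse=True)[:expand]
                for a in top:                            # expand per the policy ...
                    child = partial + (a,)
                    candidates.append((rollout(child), child))
            candidates.sort(key=lambda rc: rc[0], reverse=True)   # ... rank by rollout
            return [c for _, c in candidates[:width]]    # keep a fixed-width beam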
    Predicting Treatment Adherence of Tuberculosis Patients at Scale. (arXiv:2211.02943v1 [cs.LG])
    Tuberculosis (TB), an infectious bacterial disease, is a significant cause of death, especially in low-income countries, with an estimated ten million new cases reported globally in $2020$. While TB is treatable, non-adherence to the medication regimen is a significant cause of morbidity and mortality. Thus, proactively identifying patients at risk of dropping off their medication regimen enables corrective measures to mitigate adverse outcomes. Using a proxy measure of extreme non-adherence and a dataset of nearly $700,000$ patients from four states in India, we formulate and solve the machine learning (ML) problem of early prediction of non-adherence based on a custom rank-based metric. We train ML models and evaluate against baselines, achieving a $\sim 100\%$ lift over rule-based baselines and $\sim 214\%$ over a random classifier, taking into account country-wide large-scale future deployment. We deal with various issues in the process, including data quality, high-cardinality categorical data, low target prevalence, distribution shift, variation across cohorts, algorithmic fairness, and the need for robustness and explainability. Our findings indicate that risk stratification of non-adherent patients is a viable, deployable-at-scale ML solution.
    Evaluating Digital Tools for Sustainable Agriculture using Causal Inference. (arXiv:2211.03195v1 [cs.LG])
    In contrast to the rapid digitalization of several industries, agriculture suffers from low adoption of climate-smart farming tools. Even though AI-driven digital agriculture can offer high-performing predictive functionalities, it lacks tangible quantitative evidence on its benefits to the farmers. Field experiments can derive such evidence, but are often costly and time-consuming. To this end, we propose an observational causal inference framework for the empirical evaluation of the impact of digital tools on target farm performance indicators. This way, we can increase farmers' trust by enhancing the transparency of the digital agriculture market, and in turn accelerate the adoption of technologies that aim to increase productivity and secure a sustainable and resilient agriculture against a changing climate. As a case study, we perform an empirical evaluation of a recommendation system for optimal cotton sowing, which was used by a farmers' cooperative during the growing season of 2021. We leverage agricultural knowledge to develop a causal graph of the farm system, use the back-door criterion to identify the impact of recommendations on the yield, and subsequently estimate it using several methods on observational data. The results show that a field sown according to our recommendations enjoyed a significant increase in yield (12% to 17%).
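    A toy back-door adjustment by stratification on a single confounder, with synthetic records standing in for the cooperative's data (the paper estimates the effect with several such methods):
        import numpy as np
        import pandas as pd

        rng = np.random.default_rng(0)
        soil = rng.integers(0, 3, 5000)                      # confounder
        tool = (rng.random(5000) < 0.2 + 0.2 * soil).astype(int)
        yld = 3.0 + 0.5 * soil + 0.4 * tool + rng.normal(0, 0.3, 5000)
        df = pd.DataFrame({"soil": soil, "tool": tool, "yield": yld})

        # Back-door formula: average the per-stratum treated/control contrast,
        # weighted by the marginal distribution of the confounder.
        ate = sum(
            (g.loc[g.tool == 1, "yield"].mean() - g.loc[g.tool == 0, "yield"].mean())
            * len(g) / len(df)
            for _, g in df.groupby("soil")
        )
        print(f"adjusted effect ~ {ate:.2f}")                # close to the true 0.4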
    Confidence Intervals for Unobserved Events. (arXiv:2211.03052v1 [math.ST])
    Consider a finite sample from an unknown distribution over a countable alphabet. Unobserved events are alphabet symbols which do not appear in the sample. Estimating the probabilities of unobserved events is a basic problem in statistics and related fields, which was extensively studied in the context of point estimation. In this work we introduce a novel interval estimation scheme for unobserved events. Our proposed framework applies selective inference, as we construct confidence intervals (CIs) for the desired set of parameters. Interestingly, we show that the obtained CIs are dimension-free, as they do not grow with the alphabet size. Further, we show that these CIs are (almost) tight, in the sense that they cannot be further improved without violating the prescribed coverage rate. We demonstrate the performance of our proposed scheme in synthetic and real-world experiments, showing a significant improvement over the alternatives. Finally, we apply our proposed scheme to large alphabet modeling. We introduce a novel simultaneous CI scheme for large alphabet distributions which outperforms currently known methods while maintaining the prescribed coverage rate.
    Inferring subhalo effective density slopes from strong lensing observations with neural likelihood-ratio estimation. (arXiv:2208.13796v2 [astro-ph.CO] UPDATED)
    Strong gravitational lensing has emerged as a promising approach for probing dark matter models on sub-galactic scales. Recent work has proposed the subhalo effective density slope as a more reliable observable than the commonly used subhalo mass function. The subhalo effective density slope is a measurement independent of assumptions about the underlying density profile and can be inferred for individual subhalos through traditional sampling methods. To go beyond individual subhalo measurements, we leverage recent advances in machine learning and introduce a neural likelihood-ratio estimator to infer an effective density slope for populations of subhalos. We demonstrate that our method is capable of harnessing the statistical power of multiple subhalos (within and across multiple images) to distinguish between characteristics of different subhalo populations. The computational efficiency warranted by the neural likelihood-ratio estimator over traditional sampling enables statistical studies of dark matter perturbers and is particularly useful as we expect an influx of strong lensing systems from upcoming surveys.
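    The underlying classifier-based likelihood-ratio trick, in one dimension with Gaussians standing in for lensing summaries (the paper trains a neural classifier on simulated images):
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        # Label 1: samples simulated under parameter theta; label 0: reference set.
        x1 = np.random.normal(0.5, 1.0, (4000, 1))
        x0 = np.random.normal(0.0, 1.0, (4000, 1))
        X, y = np.vstack([x1, x0]), np.r_[np.ones(4000), np.zeros(4000)]

        clf = LogisticRegression().fit(X, y)
        p = clf.predict_proba([[0.25]])[0, 1]
        log_ratio = np.log(p / (1 - p))   # per-subhalo log likelihood ratio;
        print(log_ratio)                  # sums across subhalos and images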
    Towards real-time 6D pose estimation of objects in single-view cone-beam X-ray. (arXiv:2211.03211v1 [cs.CV])
    Deep learning-based pose estimation algorithms can successfully estimate the pose of objects in an image, especially in the field of color images. 6D object pose estimation for X-ray images based on deep learning models often uses custom architectures that employ extensive CAD models and simulated data for training purposes. Recent RGB-based methods opt to solve pose estimation problems using small datasets, making them more attractive for the X-ray domain where medical data is scarcely available. We refine an existing RGB-based model (SingleShotPose) to estimate the 6D pose of a marked cube from grayscale X-ray images by creating a generic solution trained on only real X-ray data and adjusted for X-ray acquisition geometry. The model regresses 2D control points and calculates the pose through 2D/3D correspondences using Perspective-n-Point (PnP), allowing a single trained model to be used across all supporting cone-beam-based X-ray geometries. Since modern X-ray systems continuously adjust acquisition parameters during a procedure, it is essential for such a pose estimation network to consider these parameters in order to be deployed successfully and find a real use case. With a 5-cm/5-degree accuracy of 93% and an average 3D rotation error of 2.2 degrees, the results of the proposed approach are comparable with state-of-the-art alternatives, while requiring significantly fewer real training examples and being applicable in real-time applications.
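    The 2D/3D-correspondence step reduces to a standard PnP solve; a self-contained OpenCV sketch with made-up intrinsics and pose, where the regressed control points are simulated by projecting a known pose:
        import numpy as np
        import cv2

        object_pts = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                               [1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]], np.float32)
        K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], np.float32)

        rvec_true = np.float32([0.1, -0.2, 0.05])
        tvec_true = np.float32([0.2, -0.1, 5.0])
        image_pts, _ = cv2.projectPoints(object_pts, rvec_true, tvec_true, K, None)

        ok, rvec, tvec = cv2.solvePnP(object_pts, image_pts, K, None)
        R, _ = cv2.Rodrigues(rvec)   # (R, tvec) is the recovered 6D pose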
    Applying Association Rules Mining to Investigate Pedestrian Fatal and Injury Crash Patterns Under Different Lighting Conditions. (arXiv:2211.03187v1 [stat.ML])
    The pattern of pedestrian crashes varies greatly depending on lighting circumstances, emphasizing the need to examine pedestrian crashes under various lighting conditions. Using Louisiana pedestrian fatal and injury crash data (2010-2019), this study applied Association Rules Mining (ARM) to identify the hidden patterns of crash risk factors according to three different lighting conditions (daylight, dark-with-streetlight, and dark-no-streetlight). Based on the generated rules, the results show that daylight pedestrian crashes are associated with children (less than 15 years), senior pedestrians (greater than 64 years), older drivers (>64 years), and driving behaviors such as failure to yield, inattention/distraction, and illness/fatigue/falling asleep. Additionally, young drivers (15-24 years) are involved in severe pedestrian crashes in daylight conditions. This study also found pedestrian alcohol/drug involvement to be the most frequent item in the dark-with-streetlight condition. This crash type is particularly associated with pedestrian action (crossing intersection/midblock), driver age (55-64 years), speed limit (30-35 mph), and specific area types (business with mixed residential area). Fatal pedestrian crashes are found to be associated with roadways with high speed limits (>50 mph) in the dark-without-streetlight condition. Other risk factors linked with high-speed-limit crashes are pedestrians walking with/against traffic, dark pedestrian clothing, and pedestrian alcohol/drug involvement. The research findings are expected to provide an improved understanding of the underlying relationships between pedestrian crash risk factors and specific lighting conditions. Highway safety experts can utilize these findings to select effective countermeasures that strategically reduce pedestrian crashes.
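    The rule-mining machinery itself is standard; a small sketch with one-hot crash attributes (hypothetical column names mirroring the study's factors), using the mlxtend implementation of Apriori:
        import pandas as pd
        from mlxtend.frequent_patterns import apriori, association_rules

        crashes = pd.DataFrame([
            {"dark_streetlight": 1, "ped_alcohol": 1, "speed_30_35": 1, "severe": 1},
            {"dark_streetlight": 1, "ped_alcohol": 1, "speed_30_35": 0, "severe": 1},
            {"dark_streetlight": 0, "ped_alcohol": 0, "speed_30_35": 1, "severe": 0},
            {"dark_streetlight": 1, "ped_alcohol": 0, "speed_30_35": 1, "severe": 0},
        ] * 50).astype(bool)

        itemsets = apriori(crashes, min_support=0.2, use_colnames=True)
        rules = association_rules(itemsets, metric="confidence", min_threshold=0.7)
        print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])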
    Reducing the dimensionality of data using tempered distributions. (arXiv:1903.05083v2 [math.ST] UPDATED)
    We reformulate unsupervised dimension reduction problem (UDR) in the language of tempered distributions, i.e. as a problem of approximating an empirical probability density function by another tempered distribution, supported in a $k$-dimensional subspace. We show that this task is connected with another classical problem of data science -- the sufficient dimension reduction problem (SDR). In fact, an algorithm for the first problem induces an algorithm for the second and vice versa. In order to reduce an optimization problem over distributions to an optimization problem over ordinary functions we introduce a nonnegative penalty function that ``forces'' the support of the model distribution to be $k$-dimensional. Then we present an algorithm for the minimization of the penalized objective, based on the infinite-dimensional low-rank optimization, which we call the alternating scheme. Also, we design an efficient approximate algorithm for a special case of the problem, where the distance between the empirical distribution and the model distribution is measured by Maximum Mean Discrepancy defined by a Mercer kernel of a certain type. We test our methods on four examples (three UDR and one SDR) using synthetic data and standard datasets.
    Understanding the properties and limitations of contrastive learning for Out-of-Distribution detection. (arXiv:2211.03183v1 [cs.LG])
    A recent popular approach to out-of-distribution (OOD) detection is based on a self-supervised learning technique referred to as contrastive learning. There are two main variants of contrastive learning, namely instance and class discrimination, targeting features that can discriminate between different instances for the former, and different classes for the latter. In this paper, we aim to understand the effectiveness and limitations of existing contrastive learning methods for OOD detection. We approach this in three ways. First, we systematically study the performance difference between the instance discrimination and supervised contrastive learning variants in different OOD detection settings. Second, we study which in-distribution (ID) classes OOD data tend to be classified into. Finally, we study the spectral decay property of the different contrastive learning approaches and examine how it correlates with OOD detection performance. In scenarios where the ID and OOD datasets are sufficiently different from one another, we see that instance discrimination, in the absence of fine-tuning, is competitive with supervised approaches in OOD detection. We see that OOD samples tend to be classified into classes that have a distribution similar to the distribution of the entire dataset. Furthermore, we show that contrastive learning learns a feature space whose singular value spectrum contains several high-variance directions, which can be detrimental or beneficial to OOD detection depending on the inference approach used.
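    The spectral decay property can be probed directly from a feature matrix; a minimal sketch:
        import numpy as np

        def spectral_decay(features):
            # Normalized singular value spectrum of an (N, d) feature matrix:
            # a flat spectrum = many high-variance directions, fast decay = few.
            f = features - features.mean(axis=0)
            s = np.linalg.svd(f, compute_uv=False)
            return s / s.sum()

        print(spectral_decay(np.random.randn(1000, 128))[:5])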
    On Image Segmentation With Noisy Labels: Characterization and Volume Properties of the Optimal Solutions to Accuracy and Dice. (arXiv:2206.06484v2 [cs.CV] UPDATED)
    We study two of the most popular performance metrics in medical image segmentation, Accuracy and Dice, when the target labels are noisy. For both metrics, several statements related to characterization and volume properties of the set of optimal segmentations are proved, and associated experiments are provided. Our main insights are: (i) the volume of the solutions to both metrics may deviate significantly from the expected volume of the target, (ii) the volume of a solution to Accuracy is always less than or equal to the volume of a solution to Dice and (iii) the optimal solutions to both of these metrics coincide when the set of feasible segmentations is constrained to the set of segmentations with the volume equal to the expected volume of the target.
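    For reference, with predicted segmentation $S$, target $T$, and $n$ voxels, the two metrics under study are (standard definitions, not the paper's notation): $$\mathrm{Accuracy}(S,T) = \frac{|S \cap T| + |S^{c} \cap T^{c}|}{n}, \qquad \mathrm{Dice}(S,T) = \frac{2\,|S \cap T|}{|S| + |T|}.$$ Under noisy labels, one compares segmentations maximizing the expectations of these scores over the label distribution; insight (ii) states that the volume of an Accuracy maximizer never exceeds that of a Dice maximizer.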
    Learning Product Graphs from Spectral Templates. (arXiv:2211.02893v1 [cs.LG])
    Graph Learning (GL) is at the core of inference and analysis of connections in data mining and machine learning (ML). By observing a dataset of graph signals, and considering specific assumptions, Graph Signal Processing (GSP) tools can provide practical constraints in the GL approach. One applicable constraint can infer a graph with desired frequency signatures, i.e., spectral templates. However, a severe computational burden is a challenging barrier, especially for inference from high-dimensional graph signals. To address this issue, in the case of an underlying graph with product structure, we propose learning the product (high-dimensional) graph from product spectral templates with significantly reduced complexity, rather than learning it directly from high-dimensional graph signals; to the best of our knowledge, this has not been addressed in the related areas. In contrast to the few existing approaches, our approach can learn all types of product graphs (with more than two factors) without knowing the type of graph product, and has fewer parameters. Experimental results on both synthetic and real-world data, i.e., brain signal analysis and multi-view object images, illustrate explainable and meaningful factor graphs supported by expert-related research, as well as outperformance of the few existing, more restricted approaches.
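    A concrete instance of why spectral templates of the factors suffice, for a Cartesian product (a sketch of one product type; the paper handles general products):
        import numpy as np

        L1 = np.array([[1., -1.], [-1., 1.]])                # path graph P2
        L2 = 3 * np.eye(3) - np.ones((3, 3))                 # cycle graph C3

        # Cartesian product Laplacian: L = kron(L1, I) + kron(I, L2), so its
        # spectrum is exactly all pairwise sums of the factor spectra.
        L = np.kron(L1, np.eye(3)) + np.kron(np.eye(2), L2)
        lam = np.sort(np.linalg.eigvalsh(L))
        sums = np.sort((np.linalg.eigvalsh(L1)[:, None] + np.linalg.eigvalsh(L2)).ravel())
        print(np.allclose(lam, sums))                        # True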
    Pseudo-Riemannian Graph Convolutional Networks. (arXiv:2106.03134v3 [cs.LG] UPDATED)
    Graph convolutional networks (GCNs) are powerful frameworks for learning embeddings of graph-structured data. GCNs are traditionally studied through the lens of Euclidean geometry. Recent works find that non-Euclidean Riemannian manifolds provide specific inductive biases for embedding hierarchical or spherical data. However, they cannot align well with data of mixed graph topologies. We consider a larger class of pseudo-Riemannian manifolds that generalize hyperboloid and sphere. We develop new geodesic tools that allow for extending neural network operations into geodesically disconnected pseudo-Riemannian manifolds. As a consequence, we derive a pseudo-Riemannian GCN that models data in pseudo-Riemannian manifolds of constant nonzero curvature in the context of graph neural networks. Our method provides a geometric inductive bias that is sufficiently flexible to model mixed heterogeneous topologies like hierarchical graphs with cycles. We demonstrate the representational capabilities of this method by applying it to the tasks of graph reconstruction, node classification and link prediction on a series of standard graphs with mixed topologies. Empirical results demonstrate that our method outperforms Riemannian counterparts when embedding graphs of complex topologies.
    Graph Attention Retrospective. (arXiv:2202.13060v4 [cs.LG] UPDATED)
    Graph-based learning is a rapidly growing sub-field of machine learning with applications in social networks, citation networks, and bioinformatics. One of the most popular models is graph attention networks. They were introduced to allow a node to aggregate information from features of neighbor nodes in a non-uniform way, in contrast to simple graph convolution which does not distinguish the neighbors of a node. In this paper, we theoretically study this expected behaviour of graph attention networks. We prove multiple results on the performance of the graph attention mechanism for the problem of node classification for a contextual stochastic block model. Here the node features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We show that in an "easy" regime, where the distance between the means of the Gaussians is large enough, graph attention is able to distinguish inter-class from intra-class edges, and thus it maintains the weights of important edges and significantly reduces the weights of unimportant edges. Consequently, we show that this implies perfect node classification. In the "hard" regime, we show that every attention mechanism fails to distinguish intra-class from inter-class edges. We evaluate our theoretical results on synthetic and real-world data.
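    For concreteness, the per-node attention computation the analysis targets (one GAT head, numpy sketch):
        import numpy as np

        def gat_attention(h, W, a, i, neighbors):
            # h: (n, d) features, W: (d, d') projection, a: (2*d',) attention vector
            z = h @ W
            s = np.array([np.concatenate([z[i], z[j]]) @ a for j in neighbors])
            s = np.maximum(s, 0.2 * s)            # LeakyReLU with slope 0.2
            e = np.exp(s - s.max())
            return e / e.sum()                    # alpha_ij over j in N(i)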
    ProtoX: Explaining a Reinforcement Learning Agent via Prototyping. (arXiv:2211.03162v1 [cs.LG])
    While deep reinforcement learning has proven to be successful in solving control tasks, the "black-box" nature of an agent has received increasing concern. We propose a prototype-based post-hoc policy explainer, ProtoX, that explains a black-box agent by prototyping the agent's behaviors into scenarios, each represented by a prototypical state. When learning prototypes, ProtoX considers both visual similarity and scenario similarity. The latter is unique to the reinforcement learning context, since it explains why the same action is taken in visually different states. To teach ProtoX about visual similarity, we pre-train an encoder with self-supervised contrastive learning to recognize states as similar if they occur close together in time and receive the same action from the black-box agent. We then add an isometry layer to allow ProtoX to adapt scenario similarity to the downstream task. ProtoX is trained via imitation learning using behavior cloning, and thus requires no access to the environment or agent. In addition to explanation fidelity, we design different prototype shaping terms in the objective function to encourage better interpretability. We conduct various experiments to test ProtoX. Results show that ProtoX achieved high fidelity to the original black-box agent while providing meaningful and understandable explanations.
    Design Process is a Reinforcement Learning Problem. (arXiv:2211.03136v1 [cs.LG])
    While reinforcement learning has been used widely in research during the past few years, it has found fewer real-world applications than supervised learning due to weaknesses of RL algorithms, such as performance degradation in transitioning from the simulator to the real world. Here, we argue that the design process is a reinforcement learning problem and can potentially be a proper application for RL algorithms, as it is an offline process and is conventionally done in CAD software - a sort of simulator. This creates opportunities for using RL methods and, at the same time, raises challenges. While design processes are very diverse, here we focus on space layout planning (SLP), frame it as an RL problem under the Markov Decision Process, and use PPO to address the layout design problem. To do so, we developed an environment named RLDesigner to simulate the SLP. The RLDesigner is an OpenAI Gym compatible environment that can be easily customized to define a diverse range of design scenarios. We publicly share the environment to encourage both the RL and architecture communities to use it for testing different RL algorithms or in their design practice. The code is available in the following GitHub repository: https://github.com/RezaKakooee/rldesigner/tree/Second_Paper
    Adversarial Causal Augmentation for Graph Covariate Shift. (arXiv:2211.02843v1 [cs.LG])
    Out-of-distribution (OOD) generalization on graphs is drawing widespread attention. However, existing efforts mainly focus on the OOD issue of correlation shift, while another type, covariate shift, remains largely unexplored and is the focus of this work. From a data generation view, causal features are stable substructures in data, which play key roles in OOD generalization, while their complementary parts, environments, are unstable features that often lead to various distribution shifts. Correlation shift establishes spurious statistical correlations between environments and labels. In contrast, covariate shift means that there exist unseen environmental features in test data. Existing strategies of graph invariant learning and data augmentation suffer from limited environments or unstable causal features, which greatly limits their generalization ability under covariate shift. In view of that, we propose a novel graph augmentation strategy, Adversarial Causal Augmentation (AdvCA), to alleviate the covariate shift. Specifically, it adversarially augments the data to explore diverse distributions of the environments, while keeping the causal features invariant across these environments. It maintains environmental diversity while ensuring the invariance of the causal features, thereby effectively alleviating the covariate shift. Extensive experimental results with in-depth analyses demonstrate that AdvCA can outperform 14 baselines on synthetic and real-world datasets with various covariate shifts.
    Characterizing the Efficiency of Graph Neural Network Frameworks with a Magnifying Glass. (arXiv:2211.03021v1 [cs.LG])
    Graph neural networks (GNNs) have received great attention due to their success in various graph-related learning tasks. Several GNN frameworks have since been developed for fast and easy implementation of GNN models. Despite their popularity, they are not well documented, and their implementations and system performance have not been well understood. In particular, unlike traditional GNNs that are trained on the entire graph in a full-batch manner, recent GNNs have been developed with different graph sampling techniques for mini-batch training on large graphs. While they improve scalability, their training times still depend on the implementations in the frameworks, as sampling and its associated operations can introduce non-negligible overhead and computational cost. In addition, it is unknown how 'eco-friendly' the frameworks are from a green computing perspective. In this paper, we provide an in-depth study of two mainstream GNN frameworks along with three state-of-the-art GNNs to analyze their performance in terms of runtime and power/energy consumption. We conduct extensive benchmark experiments at several different levels and present detailed analysis results and observations, which could be helpful for further improvement and optimization.
    Discovering and Explaining the Representation Bottleneck of DNNs. (arXiv:2111.06236v4 [cs.LG] UPDATED)
    This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple interactions and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and human beings, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck. Furthermore, we propose a loss to encourage/penalize the learning of interactions of specific complexities, and analyze the representation capacities of interactions of different complexities.
    Unsupervised dynamic modeling of medical image transformation. (arXiv:2103.00930v2 [physics.med-ph] UPDATED)
    Spatiotemporal imaging has applications in e.g. cardiac diagnostics, surgical guidance, and radiotherapy monitoring. In this paper, we explain the temporal motion by identifying the underlying dynamics, based only on the sequential images. Our dynamical model maps the observed high-dimensional sequential images to a low-dimensional latent space wherein a linear relationship between a hidden state process and the lower-dimensional representation of the inputs holds. For this, we use a conditional variational auto-encoder (CVAE) to nonlinearly map the higher-dimensional image to a lower-dimensional space, wherein we model the dynamics with a linear Gaussian state-space model (LG-SSM). The model, a modified version of the Kalman variational auto-encoder, is end-to-end trainable, and the weights, both in the CVAE and LG-SSM, are simultaneously updated by maximizing the evidence lower bound of the marginal likelihood. In contrast to the original model, we explain the motion with a spatial transformation from one image to another. This results in sharper reconstructions and the possibility of transferring auxiliary information, such as segmentation, through the image sequence. Our experiments, on cardiac ultrasound time series, show that the dynamic model outperforms traditional image registration in execution time while achieving similar performance. Further, our model offers the possibility to impute and extrapolate for missing samples.
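    The latent dynamics amount to standard Kalman predict/update steps on the CVAE encodings; a generic sketch with illustrative dimensions:
        import numpy as np

        A = 0.95 * np.eye(4)          # latent transition
        C = np.random.randn(2, 4)     # latent -> encoding observation map
        Q, R = 0.01 * np.eye(4), 0.1 * np.eye(2)

        def kalman_step(mu, P, y):    # y: current low-dimensional encoding
            mu_p, P_p = A @ mu, A @ P @ A.T + Q                 # predict
            K = P_p @ C.T @ np.linalg.inv(C @ P_p @ C.T + R)    # Kalman gain
            return mu_p + K @ (y - C @ mu_p), (np.eye(4) - K @ C) @ P_p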
    Discovering ordinary differential equations that govern time-series. (arXiv:2211.02830v1 [cs.LG])
    Natural laws are often described through differential equations, yet finding a differential equation that describes the governing law underlying observed data is a challenging and still mostly manual task. In this paper we take a step towards the automation of this process: we propose a transformer-based sequence-to-sequence model that recovers scalar autonomous ordinary differential equations (ODEs) in symbolic form from time-series data of a single observed solution of the ODE. Our method is efficiently scalable: after one-time pretraining on a large set of ODEs, we can infer the governing laws of a new observed solution in a few forward passes of the model. We show that our model performs better than or on par with existing methods in various test cases in terms of accurate symbolic recovery of the ODE, especially for more complex expressions.
    Deep fusion of gray level co-occurrence matrices for lung nodule classification. (arXiv:2205.05123v2 [eess.IV] UPDATED)
    Lung cancer is a severe menace to human health; millions of people die because of late diagnosis, so it is vital to detect the disease as early as possible. Computerized tomography (CT) chest scan analysis is considered one of the most effective approaches for detecting and classifying lung nodules. The need for highly accurate analysis of CT scan images of the lung is one of the crucial challenges in detecting and classifying lung cancer. A new long short-term memory (LSTM) based deep fusion structure is introduced, where the texture features computed from lung nodules through new volumetric grey-level co-occurrence matrix (GLCM) computations are applied to classify the nodules into benign, malignant, and ambiguous. An improved Otsu segmentation method combined with the water strider optimization algorithm (WSA) is proposed to detect the lung nodules. Otsu-WSA thresholding can overcome the restrictions present in previous thresholding methods. Extended experiments are run to assess this fusion structure by considering a 2D-GLCM computation-based 2D-slice fusion structure, and an approximation of the full 3D-GLCM with a volumetric 2.5D-GLCM computation-based LSTM fusion structure. The proposed methods are trained and assessed on the LIDC-IDRI dataset, obtaining accuracy, sensitivity, and specificity of 94.4%, 91.6%, and 95.8%, respectively, for 2D-GLCM fusion and 97.33%, 96%, and 98%, respectively, for 2.5D-GLCM fusion. The corresponding figures for 3D-GLCM fusion are 98.7%, 98%, and 99%. The obtained results and analysis indicate that the WSA-Otsu method requires less execution time and yields a more accurate thresholding process, and that the 3D-GLCM-based LSTM outperforms its counterparts.
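    The per-slice texture features at the heart of the pipeline can be computed with standard tooling; a sketch on a random patch (scikit-image >= 0.19 function names):
        import numpy as np
        from skimage.feature import graycomatrix, graycoprops

        patch = (np.random.rand(32, 32) * 255).astype(np.uint8)   # toy nodule slice
        glcm = graycomatrix(patch, distances=[1], angles=[0, np.pi / 2],
                            levels=256, symmetric=True, normed=True)
        features = np.r_[graycoprops(glcm, "contrast").ravel(),
                         graycoprops(glcm, "homogeneity").ravel(),
                         graycoprops(glcm, "energy").ravel()]
        print(features.shape)      # one texture vector per slice, fed to the LSTM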
    Active-Learning-as-a-Service: An Automatic and Efficient MLOps System for Data-Centric AI. (arXiv:2207.09109v2 [cs.LG] UPDATED)
    The success of today's AI applications requires not only model training (Model-centric) but also data engineering (Data-centric). In data-centric AI, active learning (AL) plays a vital role, but current AL tools 1) require users to manually select AL strategies, and 2) cannot perform AL tasks efficiently. To this end, this paper presents an automatic and efficient MLOps system for AL, named ALaaS (Active-Learning-as-a-Service). Specifically, 1) ALaaS implements an AL agent, including a performance predictor and a workflow controller, to decide the most suitable AL strategies given users' datasets and budgets. We call this a predictive-based successive halving early-stop (PSHEA) procedure. 2) ALaaS adopts a server-client architecture to support an AL pipeline and implements stage-level parallelism for high efficiency. Meanwhile, caching and batching techniques are employed to further accelerate the AL process. In addition to efficiency, ALaaS ensures accessibility with the help of the design philosophy of configuration-as-a-service. Extensive experiments show that ALaaS outperforms all other baselines in terms of latency and throughput. Also, guided by the AL agent, ALaaS can automatically select and run AL strategies for non-expert users under different datasets and budgets. Our code is available at \url{https://github.com/MLSysOps/Active-Learning-as-a-Service}.
    Modelling Technical and Biological Effects in scRNA-seq data with Scalable GPLVMs. (arXiv:2209.06716v2 [cs.LG] UPDATED)
    Single-cell RNA-seq datasets are growing in size and complexity, enabling the study of cellular composition changes in various biological/clinical contexts. Scalable dimensionality reduction techniques are needed to disentangle biological variation in them, while accounting for technical and biological confounders. In this work, we extend a popular approach for probabilistic non-linear dimensionality reduction, the Gaussian process latent variable model, to scale to massive single-cell datasets while explicitly accounting for technical and biological confounders. The key idea is to use an augmented kernel which preserves the factorisability of the lower bound, allowing for fast stochastic variational inference. We demonstrate its ability to reconstruct latent signatures of innate immunity recovered in Kumasaka et al. (2021) with 9x lower training time. We further analyze a COVID dataset and demonstrate, across a cohort of 130 individuals, that this framework enables data integration while capturing interpretable signatures of infection. Specifically, we explore COVID severity as a latent dimension to refine patient stratification and capture disease-specific gene expression.
    UATTA-ENS: Uncertainty Aware Test Time Augmented Ensemble for PIRC Diabetic Retinopathy Detection. (arXiv:2211.03148v1 [cs.CV])
    Deep ensemble convolutional neural networks have become a methodology of choice for analyzing medical images with a diagnostic performance comparable to a physician, including the diagnosis of Diabetic Retinopathy. However, commonly used techniques are deterministic and are therefore unable to provide any estimate of predictive uncertainty. Quantifying model uncertainty is crucial for reducing the risk of misdiagnosis. A reliable architecture should be well-calibrated to avoid over-confident predictions. To address this, we propose UATTA-ENS: an Uncertainty-Aware Test-Time Augmented Ensemble technique for 5-class PIRC Diabetic Retinopathy classification that produces reliable and well-calibrated predictions.
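    A generic test-time-augmentation ensemble with an entropy-based uncertainty score (a sketch of the general technique; the `augment` callable is any stochastic transform and is assumed given):
        import torch

        def tta_predict(model, image, augment, n=8):
            model.eval()
            with torch.no_grad():
                probs = torch.stack([
                    torch.softmax(model(augment(image)), dim=-1) for _ in range(n)
                ]).mean(dim=0)
            entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
            return probs, entropy      # high entropy -> refer for human review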
    Differentiable Neural Computers with Memory Demon. (arXiv:2211.02987v1 [cs.LG])
    A Differentiable Neural Computer (DNC) is a neural network with an external memory which allows for iterative content modification via read, write and delete operations. We show that information theoretic properties of the memory contents play an important role in the performance of such architectures. We introduce a novel concept of memory demon to DNC architectures which modifies the memory contents implicitly via additive input encoding. The goal of the memory demon is to maximize the expected sum of mutual information of the consecutive external memory contents.
    Developing Decentralised Resilience to Malicious Influence in Collective Perception Problem. (arXiv:2211.03063v1 [cs.MA])
    In collective decision-making, designing algorithms that use only local information to effect swarm-level behaviour is a non-trivial problem. We used machine learning techniques to teach swarm members to map their local perceptions of the environment to an optimal action. A curriculum inspired by Machine Education approaches was designed to facilitate this learning process and teach the members the skills required for optimal performance in the collective perception problem. We extended previous approaches by creating a curriculum that taught agents resilience to malicious influence. The experimental results show that well-designed rules-based algorithms can produce effective agents. When performing opinion fusion, we implemented decentralised resilience by having agents dynamically weight received opinions. We found a non-significant difference between constant and dynamic weights, suggesting that momentum-based opinion fusion is perhaps already a resilience mechanism.
    A Machine Learning-based Framework for Predictive Maintenance of Semiconductor Laser for Optical Communication. (arXiv:2211.02842v1 [cs.LG])
    Semiconductor lasers, one of the key components of optical communication systems, have been rapidly evolving to meet the requirements of next-generation optical networks with respect to high speed, low power consumption, small form factor, etc. However, these demands have brought severe challenges to semiconductor laser reliability. Therefore, a great deal of attention has been devoted to improving it and thereby ensuring reliable transmission. In this paper, a predictive maintenance framework using machine learning techniques is proposed for real-time health monitoring and prognosis of semiconductor lasers, thus enhancing their reliability. The proposed approach is composed of three stages: i) real-time performance degradation prediction, ii) degradation detection, and iii) remaining useful life (RUL) prediction. First, an attention-based gated recurrent unit (GRU) model is adopted for real-time prediction of performance degradation. Then, a convolutional autoencoder is used to detect degradation or abnormal behavior of a laser, given the predicted degradation performance values. Once an abnormal state is detected, an attention-based deep learning RUL prediction model is utilized. Afterwards, the estimated RUL is input for decision making and maintenance planning. The proposed framework is validated using experimental data derived from accelerated aging tests conducted on semiconductor tunable lasers. The proposed approach achieves very good degradation performance prediction capability with a small root mean square error (RMSE) of 0.01, a good anomaly detection accuracy of 94.24%, and better RUL estimation capability than existing ML-based laser RUL prediction models.
    Utilizing Language-Image Pretraining for Efficient and Robust Bilingual Word Alignment. (arXiv:2205.11616v2 [cs.CL] UPDATED)
    Word translation without parallel corpora has become feasible, rivaling the performance of supervised methods. Recent findings have shown that the accuracy and robustness of unsupervised word translation (UWT) can be improved by making use of visual observations, which are universal representations across languages. In this work, we investigate the potential of using not only visual observations but also pretrained language-image models for enabling a more efficient and robust UWT. Specifically, we develop a novel UWT method dubbed Word Alignment using Language-Image Pretraining (WALIP), which leverages visual observations via the shared embedding space of images and texts provided by CLIP models (Radford et al., 2021). WALIP has a two-step procedure. First, we retrieve word pairs with high similarity confidence, computed using our proposed image-based fingerprints, which define the initial pivot for the word alignment. Second, we apply our robust Procrustes algorithm to estimate the linear mapping between the two embedding spaces, which iteratively corrects and refines the estimated alignment. Our extensive experiments show that WALIP improves upon the state-of-the-art performance of bilingual word alignment for a few language pairs across different word embeddings and displays great robustness to the dissimilarity of language pairs or training corpora for the two word embeddings.
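    The classical (non-robust) Procrustes step at the core of the second stage has a closed form; a numpy sketch with synthetic paired embeddings:
        import numpy as np

        def procrustes(X, Y):
            # Orthogonal W minimizing ||X W - Y||_F for paired embeddings.
            U, _, Vt = np.linalg.svd(X.T @ Y)
            return U @ Vt

        X = np.random.randn(500, 300)                  # source-language vectors
        W_true = np.linalg.qr(np.random.randn(300, 300))[0]
        print(np.allclose(procrustes(X, X @ W_true), W_true))   # recovers the map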
    Multilayer Perceptron Network Discriminates Larval Zebrafish Genotype using Behaviour. (arXiv:2211.03051v1 [q-bio.QM])
    Zebrafish are a common model organism used to identify new disease therapeutics. High-throughput drug screens can be performed on larval zebrafish in multi-well plates by observing changes in behaviour following a treatment. Analysis of this behaviour can be difficult, however, due to the high dimensionality of the data obtained. Statistical analysis of individual statistics (such as the distance travelled) is generally not powerful enough to detect meaningful differences between treatment groups. Here, we propose a method for classifying zebrafish models of Parkinson's disease by genotype at 5 days old. Using a set of 2D behavioural features, we train a multi-layer perceptron neural network. We further show that the use of integrated gradients can give insight into the impact of each behaviour feature on genotype classifications by the model. In this way, we provide a novel pipeline for classifying zebrafish larvae, beginning with feature preparation and ending with an impact analysis of said features.
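    A plain integrated-gradients implementation for such a feature-based classifier (libraries like captum provide the same; the model, baseline, and target class are assumed given):
        import torch

        def integrated_gradients(model, x, baseline, target, steps=64):
            alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * x.dim()))
            path = baseline + alphas * (x - baseline)    # straight-line path
            path.requires_grad_(True)
            logits = model(path)[:, target]
            grads = torch.autograd.grad(logits.sum(), path)[0]
            return (x - baseline) * grads.mean(dim=0)    # per-feature attribution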
    Robust Fine-Tuning of Deep Neural Networks with Hessian-based Generalization Guarantees. (arXiv:2206.02659v3 [cs.LG] UPDATED)
    We consider transfer learning approaches that fine-tune a pretrained deep neural network on a target task. We study generalization properties of fine-tuning to understand the problem of overfitting, which commonly occurs in practice. Previous works have shown that constraining the distance from the initialization of fine-tuning improves generalization. Using a PAC-Bayesian analysis, we observe that besides distance from initialization, Hessians affect generalization through the noise stability of deep neural networks against noise injections. Motivated by the observation, we develop Hessian distance-based generalization bounds for a wide range of fine-tuning methods. Additionally, we study the robustness of fine-tuning in the presence of noisy labels. Motivated by our theory, we design an algorithm that incorporates consistent losses and distance-based regularization for fine-tuning, along with a generalization error guarantee under class conditional independent noise in the training set labels. We perform a detailed empirical study of our algorithm on various noisy environments and architectures. On six image classification tasks whose training labels are generated with programmatic labeling, we find a 3.26% accuracy gain over prior fine-tuning methods. Meanwhile, the Hessian distance measure of the fine-tuned model decreases by six times more than existing approaches.
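    The distance-based regularizer the analysis motivates is simple to write down (a sketch; the paper's full algorithm also adds consistent losses for label noise):
        import torch

        def finetune_loss(model, init_params, logits, labels, lam=0.01):
            task = torch.nn.functional.cross_entropy(logits, labels)
            dist = sum(((p - p0) ** 2).sum()
                       for p, p0 in zip(model.parameters(), init_params))
            return task + lam * dist   # penalize drift from the pretrained weights

        # init_params = [p.detach().clone() for p in pretrained.parameters()]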
    Small Language Models for Tabular Data. (arXiv:2211.02941v1 [cs.LG])
    Supervised deep learning is most commonly applied to difficult problems defined on large and often extensively curated datasets. Here we demonstrate the ability of deep representation learning to address problems of classification and regression from small and poorly formed tabular datasets by encoding input information as abstracted sequences composed of a fixed number of characters per input field. We find that small models have sufficient capacity for approximation of various functions and achieve record classification benchmark accuracy. Such models are shown to form useful embeddings of various input features in their hidden layers, even if the learned task does not explicitly require knowledge of those features. These models are also amenable to input attribution, allowing for an estimation of the importance of each input element to the model output as well as of which inputs features are effectively embedded in the model. We present a proof-of-concept for the application of small language models to mixed tabular data without explicit feature engineering, cleaning, or preprocessing, relying on the model to perform these tasks as part of the representation learning process.
    A Note on "Assessing Generalization of SGD via Disagreement". (arXiv:2202.01851v2 [cs.LG] UPDATED)
    Several recent works find empirically that the average test error of deep neural networks can be estimated via the prediction disagreement of models, which does not require labels. In particular, Jiang et al. (2022) show for the disagreement between two separately trained networks that this `Generalization Disagreement Equality' follows from the well-calibrated nature of deep ensembles under the notion of a proposed `class-aggregated calibration.' In this reproduction, we show that the suggested theory might be impractical because a deep ensemble's calibration can deteriorate as prediction disagreement increases, which is precisely when the coupling of test error and disagreement is of interest, while labels are needed to estimate the calibration on new datasets. Further, we simplify the theoretical statements and proofs, showing them to be straightforward within a probabilistic context, unlike the original hypothesis space view employed by Jiang et al. (2022).
    HAQJSK: Hierarchical-Aligned Quantum Jensen-Shannon Kernels for Graph Classification. (arXiv:2211.02904v1 [cs.LG])
    In this work, we propose a family of novel quantum kernels, namely the Hierarchical Aligned Quantum Jensen-Shannon Kernels (HAQJSK), for un-attributed graphs. Different from most existing classical graph kernels, the proposed HAQJSK kernels can incorporate hierarchically aligned structure information between graphs and transform graphs of arbitrary sizes into fixed-sized aligned graph structures, i.e., the Hierarchical Transitive Aligned Adjacency Matrix of vertices and the Hierarchical Transitive Aligned Density Matrix of the Continuous-Time Quantum Walk (CTQW). For a given pair of graphs, the resulting HAQJSK kernels are defined by measuring the Quantum Jensen-Shannon Divergence (QJSD) between their transitive aligned graph structures. We show that the proposed HAQJSK kernels not only reflect richer intrinsic global graph characteristics in terms of the CTQW, but also address the drawback of neglecting structural correspondence information that arises in most existing R-convolution kernels. Furthermore, unlike the previous Quantum Jensen-Shannon Kernels associated with the QJSD and the CTQW, the proposed HAQJSK kernels can simultaneously guarantee the properties of permutation invariance and positive definiteness, explaining the theoretical advantages of the HAQJSK kernels. Experiments indicate the effectiveness of the proposed kernels.
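    The QJSD itself is the von Neumann analogue of the Jensen-Shannon divergence; a numpy sketch on density matrices:
        import numpy as np

        def S(rho):                       # von Neumann entropy
            lam = np.linalg.eigvalsh(rho)
            lam = lam[lam > 1e-12]
            return float(-(lam * np.log2(lam)).sum())

        def qjsd(rho, sigma):
            return S(0.5 * (rho + sigma)) - 0.5 * (S(rho) + S(sigma))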
    Inside Out: Transforming Images of Lab-Grown Plants for Machine Learning Applications in Agriculture. (arXiv:2211.02972v1 [cs.CV])
    Machine learning tasks often require a significant amount of training data for the resultant network to perform suitably for a given problem in any domain. In agriculture, dataset sizes are further limited by phenotypical differences between two plants of the same genotype, often as a result of differing growing conditions. Synthetically-augmented datasets have shown promise in improving existing models when real data is not available. In this paper, we employ a contrastive unpaired translation (CUT) generative adversarial network (GAN) and simple image processing techniques to translate indoor plant images to appear as field images. While we train our network to translate an image containing only a single plant, we show that our method is easily extendable to produce multiple-plant field images. Furthermore, we use our synthetic multi-plant images to train several YoloV5 nano object detection models to perform the task of plant detection and measure the accuracy of the model on real field data images. Including training data generated by the CUT-GAN leads to better plant detection performance compared to a network trained solely on real data.
    TEN: Twin Embedding Networks for the Jigsaw Puzzle Problem with Eroded Boundaries. (arXiv:2203.06488v2 [cs.CV] UPDATED)
    This paper introduces the novel CNN-based encoder Twin Embedding Network (TEN), for the jigsaw puzzle problem (JPP), which represents a puzzle piece with respect to its boundary in a latent embedding space. Combining this latent representation with a simple distance measure, we demonstrate improved accuracy levels of our newly proposed pairwise compatibility measure (CM), compared to that of various classical methods, for degraded puzzles with eroded tile boundaries. We focus on this problem instance for our case study, as it serves as an appropriate testbed for real-world scenarios. Specifically, we demonstrate an improvement of up to 8.5% and 16.8% in reconstruction accuracy for the so-called Type-1 and Type-2 problem variants, respectively. Furthermore, we also demonstrate that TEN is faster by a few orders of magnitude, on average, than a typical deep neural network (NN) model, i.e., it is as fast as the classical methods. In this regard, the paper makes a significant first attempt at bridging the gap between the relatively low accuracy of classical methods and the intensive computational complexity of NN models, for practical, real-world puzzle-like problems.
    Conformal Isometry of Lie Group Representation in Recurrent Network of Grid Cells. (arXiv:2210.02684v2 [q-bio.NC] UPDATED)
    The activity of the grid cell population in the medial entorhinal cortex (MEC) of the mammalian brain forms a vector representation of the self-position of the animal. Recurrent neural networks have been proposed to explain the properties of the grid cells by updating the neural activity vector based on the velocity input of the animal. In doing so, the grid cell system effectively performs path integration. In this paper, we investigate the algebraic, geometric, and topological properties of grid cells using recurrent network models. Algebraically, we study the Lie group and Lie algebra of the recurrent transformation as a representation of self-motion. Geometrically, we study the conformal isometry of the Lie group representation where the local displacement of the activity vector in the neural space is proportional to the local displacement of the agent in the 2D physical space. Topologically, the compact abelian Lie group representation automatically leads to the torus topology commonly assumed and observed in neuroscience. We then focus on a simple non-linear recurrent model that underlies the continuous attractor neural networks of grid cells. Our numerical experiments show that conformal isometry leads to hexagon periodic patterns in the grid cell responses and our model is capable of accurate path integration. Code is available at \url{https://github.com/DehongXu/grid-cell-rnn}.
    NS3: Neuro-Symbolic Semantic Code Search. (arXiv:2205.10674v2 [cs.LG] UPDATED)
    Semantic code search is the task of retrieving a code snippet given a textual description of its functionality. Recent work has been focused on using similarity metrics between neural embeddings of text and code. However, current language models are known to struggle with longer, compositional text, and multi-step reasoning. To overcome this limitation, we propose supplementing the query sentence with a layout of its semantic structure. The semantic layout is used to break down the final reasoning decision into a series of lower-level decisions. We use a Neural Module Network architecture to implement this idea. We compare our model - NS3 (Neuro-Symbolic Semantic Search) - to a number of baselines, including state-of-the-art semantic code retrieval methods, and evaluate on two datasets - CodeSearchNet and Code Search and Question Answering. We demonstrate that our approach results in more precise code retrieval, and we study the effectiveness of our modular design when handling compositional queries.
    TINYCD: A (Not So) Deep Learning Model For Change Detection. (arXiv:2207.13159v2 [cs.CV] UPDATED)
    In this paper, we present a lightweight and effective change detection model, called TinyCD. This model has been designed to be faster and smaller than current state-of-the-art change detection models due to industrial needs. Despite being from 13 to 140 times smaller than the compared change detection models, and exposing at least a third of the computational complexity, our model outperforms the current state-of-the-art models by at least $1\%$ on both F1 score and IoU on the LEVIR-CD dataset, and more than $8\%$ on the WHU-CD dataset. To reach these results, TinyCD uses a Siamese U-Net architecture exploiting low-level features in a globally temporal and locally spatial way. In addition, it adopts a new strategy to mix features in the space-time domain both to merge the embeddings obtained from the Siamese backbones, and, coupled with an MLP block, it forms a novel space-semantic attention mechanism, the Mix and Attention Mask Block (MAMB). Source code, models and results are available here: https://github.com/AndreaCodegoni/Tiny_model_4_CD
    Distributed Online Learning Algorithm With Differential Privacy Strategy for Convex Nondecomposable Global Objectives. (arXiv:2206.07944v2 [math.OC] UPDATED)
    In this paper, we deal with a general distributed constrained online learning problem with privacy over time-varying networks, where a class of nondecomposable objective functions is considered. In this setting, each node controls only a part of the global decision variable, and the goal of all nodes is to collaboratively minimize the global objective over a time horizon $T$ while guaranteeing the security of the transmitted information. For such problems, we first design a novel generic algorithm framework, named DPSDA, for differentially private distributed online learning using the Laplace mechanism and stochastic variants of the dual averaging method. Then, we propose two algorithms, named DPSDA-C and DPSDA-PS, under this framework. Theoretical results show that both algorithms attain an expected regret upper bound of $\mathcal{O}( \sqrt{T} )$ when the objective function is convex, which matches the best utility achievable by cutting-edge algorithms. Finally, numerical experiment results on both real-world and randomly generated datasets verify the effectiveness of our algorithms.
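    The privacy ingredient is the standard Laplace mechanism applied to each transmitted quantity (the generic mechanism, sketched; DPSDA applies it inside dual averaging):
        import numpy as np

        rng = np.random.default_rng()

        def privatize(update, sensitivity, epsilon):
            # Noise with scale sensitivity/epsilon gives epsilon-DP per release.
            return update + rng.laplace(0.0, sensitivity / epsilon, size=update.shape)

        noisy = privatize(np.ones(5), sensitivity=1.0, epsilon=0.5)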
    Learning the shape of protein micro-environments with a holographic convolutional neural network. (arXiv:2211.02936v1 [physics.bio-ph])
    Proteins play a central role in biology from immune recognition to brain activity. While major advances in machine learning have improved our ability to predict protein structure from sequence, determining protein function from structure remains a major challenge. Here, we introduce Holographic Convolutional Neural Network (H-CNN) for proteins, which is a physically motivated machine learning approach to model amino acid preferences in protein structures. H-CNN reflects physical interactions in a protein structure and recapitulates the functional information stored in evolutionary data. H-CNN accurately predicts the impact of mutations on protein function, including stability and binding of protein complexes. Our interpretable computational model for protein structure-function maps could guide design of novel proteins with desired function.
    Learning Riemannian Stable Dynamical Systems via Diffeomorphisms. (arXiv:2211.03169v1 [cs.RO])
    Dexterous and autonomous robots should be capable of executing elaborated dynamical motions skillfully. Learning techniques may be leveraged to build models of such dynamic skills. To accomplish this, the learning model needs to encode a stable vector field that resembles the desired motion dynamics. This is challenging as the robot state does not evolve on a Euclidean space, and therefore the stability guarantees and vector field encoding need to account for the geometry arising from, for example, the orientation representation. To tackle this problem, we propose learning Riemannian stable dynamical systems (RSDS) from demonstrations, allowing us to account for different geometric constraints resulting from the dynamical system state representation. Our approach provides Lyapunov-stability guarantees on Riemannian manifolds that are enforced on the desired motion dynamics via diffeomorphisms built on neural manifold ODEs. We show that our Riemannian approach makes it possible to learn stable dynamical systems displaying complicated vector fields on both illustrative examples and real-world manipulation tasks, where Euclidean approximations fail.
    Neighborhood Attention Transformer. (arXiv:2204.07143v3 [cs.CV] UPDATED)
    We present Neighborhood Attention (NA), the first efficient and scalable sliding-window attention mechanism for vision. NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding-window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. To support more research based on sliding-window attention, we open source our project and release our checkpoints at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
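To make the sliding-window idea concrete, here is a naive single-head sketch of neighborhood attention using `torch.nn.functional.unfold` to gather each pixel's local window of keys and values. NATTEN implements this with fused C++/CUDA kernels and clamps windows at image borders rather than zero-padding, so this version is illustrative only.

```python
import torch
import torch.nn.functional as F

def neighborhood_attention(q, k, v, window=7):
    """Naive single-head neighborhood attention.

    q, k, v: (B, C, H, W) feature maps; each query attends only to the
    window x window patch of keys/values centered on its own pixel.
    Note: real NA shifts the window at borders instead of zero-padding.
    """
    B, C, H, W = q.shape
    pad = window // 2
    # Gather every pixel's neighborhood: (B, C, window*window, H*W)
    k_n = F.unfold(k, window, padding=pad).view(B, C, window * window, H * W)
    v_n = F.unfold(v, window, padding=pad).view(B, C, window * window, H * W)
    q_flat = q.view(B, C, 1, H * W)
    attn = (q_flat * k_n).sum(dim=1) / C ** 0.5    # (B, window*window, H*W)
    attn = attn.softmax(dim=1)                      # softmax over the neighborhood
    out = (v_n * attn.unsqueeze(1)).sum(dim=2)      # (B, C, H*W)
    return out.view(B, C, H, W)

x = torch.randn(1, 32, 16, 16)
y = neighborhood_attention(x, x, x)   # self-attention case
```

The cost is linear in the number of pixels because each query only touches window*window keys, in contrast to the quadratic cost of global self-attention.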
    Physics-Informed CNNs for Super-Resolution of Sparse Observations on Dynamical Systems. (arXiv:2210.17319v2 [physics.flu-dyn] UPDATED)
    In the absence of high-resolution samples, super-resolution of sparse observations on dynamical systems is a challenging problem with wide-reaching applications in experimental settings. We showcase the application of physics-informed convolutional neural networks for super-resolution of sparse observations on grids. Results are shown for the chaotic-turbulent Kolmogorov flow, demonstrating the potential of this method for resolving finer scales of turbulence when compared with classic interpolation methods, and thus effectively reconstructing missing physics.
    Integrating Physics-Based Modeling with Machine Learning for Lithium-Ion Batteries. (arXiv:2112.12979v2 [cs.CE] UPDATED)
    Mathematical modeling of lithium-ion batteries (LiBs) is a primary challenge in advanced battery management. This paper proposes two new frameworks to integrate physics-based models with machine learning to achieve high-precision modeling for LiBs. The frameworks are characterized by informing the machine learning model of the state information of the physical model, enabling a deep integration between physics and machine learning. Based on the frameworks, a series of hybrid models are constructed, through combining an electrochemical model and an equivalent circuit model, respectively, with a feedforward neural network. The hybrid models are relatively parsimonious in structure and can provide considerable voltage predictive accuracy under a broad range of C-rates, as shown by extensive simulations and experiments. The study further expands to conduct aging-aware hybrid modeling, leading to the design of a hybrid model conscious of the state-of-health to make prediction. The experiments show that the model has high voltage predictive accuracy throughout a LiB's cycle life.
    A New Family of Dual-norm regularized $p$-Wasserstein Metrics. (arXiv:2011.05001v5 [cs.LG] UPDATED)
    We develop a novel family of metrics over measures, using $p$-Wasserstein style optimal transport (OT) formulation with dual-norm based regularized marginal constraints. Our study is motivated by the observation that existing works have only explored $\phi$-divergence regularized Wasserstein metrics like the Generalized Wasserstein metrics or the Gaussian-Hellinger-Kantorovich metrics. It is an open question if Wasserstein style metrics can be defined using regularizers that are not $\phi$-divergence based. Our work provides an affirmative answer by proving that the proposed formulation, under mild conditions, indeed induces valid metrics for any dual norm. The proposed regularized metrics seem to achieve the best of both worlds by inheriting useful properties from the parent metrics, viz., the $p$-Wasserstein and the dual-norm involved. For example, when the dual norm is Maximum Mean Discrepancy (MMD), we prove that the proposed regularized metrics inherit the dimension-free sample complexity from the MMD regularizer; while preserving/enhancing other useful properties of the $p$-Wasserstein metric. Further, when $p=1$, we derive a Fenchel dual, which enables proving that the proposed metrics actually induce novel norms over measures. Also, in this case, we show that the mixture geodesic, which is a common geodesic for the parent metrics, remains a geodesic. We empirically study various properties of the proposed metrics and show their utility in diverse applications.
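Schematically, and in our own notation rather than the paper's exact formulation, the construction replaces the hard marginal constraints of optimal transport with dual-norm penalties:

$$ \mathcal{W}_{p,\lambda}(\mu,\nu) \;=\; \inf_{\pi \ge 0} \; \Big( \int c(x,y)^p \, \mathrm{d}\pi(x,y) \Big)^{1/p} \;+\; \lambda_1 \, \|\pi_1 - \mu\|_{\ast} \;+\; \lambda_2 \, \|\pi_2 - \nu\|_{\ast}, $$

where $\pi_1, \pi_2$ are the marginals of the transport plan $\pi$, $c$ is the ground cost, and $\|\cdot\|_{\ast}$ is the chosen dual norm (for instance MMD, the dual norm of an RKHS ball). Letting $\lambda_1, \lambda_2 \to \infty$ recovers the usual $p$-Wasserstein metric with exact marginal constraints.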
    Learning Based Joint Coding-Modulation for Digital Semantic Communication Systems. (arXiv:2208.05704v2 [cs.IT] UPDATED)
In learning-based semantic communications, neural networks have replaced different building blocks in traditional communication systems. However, digital modulation still remains a challenge for neural networks. The intrinsic difficulty of neural-network-based digital modulation is mapping the continuous output of the neural network encoder into discrete constellation symbols, which is a non-differentiable function that cannot be trained with existing gradient descent algorithms. To overcome this challenge, in this paper we develop a joint coding-modulation scheme for digital semantic communications with BPSK modulation. In our method, the neural network outputs the likelihood of each constellation point, instead of having a concrete mapping. A random code rather than a deterministic code is hence used, which preserves more information for the symbols with a close likelihood on each constellation point. The joint coding-modulation design can match the modulation process with channel states, and hence improve the performance of digital semantic communications. Experiment results show that our method outperforms existing digital modulation methods in semantic communications over a wide range of SNRs, and outperforms neural-network-based analog modulation in the low-SNR regime.
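One standard way to train through such a discrete mapping is a straight-through Gumbel-softmax relaxation over the per-symbol likelihoods; this is a sketch of the generic technique, not necessarily the paper's exact training procedure, and all names are ours.

```python
import torch
import torch.nn.functional as F

constellation = torch.tensor([1.0, -1.0])   # the two BPSK points

def modulate(logits, tau=1.0):
    """Sample BPSK symbols from per-symbol likelihoods, differentiably.

    logits: (batch, n_symbols, 2) unnormalized log-likelihoods of the two
    constellation points; hard=True emits discrete symbols on the forward
    pass while gradients flow through the soft relaxation.
    """
    one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
    return one_hot @ constellation              # (batch, n_symbols)

logits = torch.randn(4, 128, 2, requires_grad=True)
symbols = modulate(logits)
noisy = symbols + 0.5 * torch.randn_like(symbols)   # AWGN channel
loss = noisy.pow(2).mean()                          # stand-in for a semantic loss
loss.backward()                                     # gradients reach the encoder
```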
    Into-TTS : Intonation Template Based Prosody Control System. (arXiv:2204.01271v2 [eess.AS] UPDATED)
Intonations play an important role in delivering the intention of a speaker. However, current end-to-end TTS systems often fail to model proper intonations. To alleviate this problem, we propose a novel, intuitive method to synthesize speech in different intonations using predefined intonation templates. Prior to TTS model training, speech data are grouped into intonation templates in an unsupervised manner. Two proposed modules are added to the end-to-end TTS framework: an intonation predictor and an intonation encoder. The intonation predictor recommends a suitable intonation template for the given text. The intonation encoder, attached to the text encoder output, synthesizes speech abiding by the requested intonation template. The main contributions of our paper are: (a) an easy-to-use intonation control system covering a wide range of users; (b) better performance in wrapping speech in a requested intonation, with improved objective and subjective evaluations; and (c) incorporation of a pre-trained language model for intonation modelling. Audio samples are available at https://srtts.github.io/IntoTTS.
    CgAT: Center-Guided Adversarial Training for Deep Hashing-Based Retrieval. (arXiv:2204.10779v2 [cs.CV] UPDATED)
Deep hashing has been extensively utilized in massive image retrieval because of its efficiency and effectiveness. However, deep hashing models are vulnerable to adversarial examples, making it essential to develop adversarial defense methods for image retrieval. Existing solutions achieve limited defense performance because they use weak adversarial samples for training and lack discriminative optimization objectives for learning robust features. In this paper, we present a min-max based Center-guided Adversarial Training, namely CgAT, to improve the robustness of deep hashing networks through worst-case adversarial examples. Specifically, we first formulate the center code as a semantically discriminative representative of the input image content, which preserves the semantic similarity with positive samples and dissimilarity with negative examples. We prove that the center code can be obtained immediately from a closed-form mathematical formula. After obtaining the center codes in each optimization iteration of the deep hashing network, they are adopted to guide the adversarial training process. On the one hand, CgAT generates the worst-case adversarial examples as augmented data by maximizing the Hamming distance between the hash codes of the adversarial examples and the center codes. On the other hand, CgAT learns to mitigate the effects of adversarial samples by minimizing the Hamming distance to the center codes. Extensive experiments on the benchmark datasets demonstrate the effectiveness of our adversarial training algorithm in defending against adversarial attacks on deep hashing-based retrieval. Compared with the current state-of-the-art defense method, we significantly improve the defense performance by an average of 18.61%, 12.35%, and 11.56% on FLICKR-25K, NUS-WIDE, and MS-COCO, respectively.
    KGTN-ens: Few-Shot Image Classification with Knowledge Graph Ensembles. (arXiv:2211.03199v1 [cs.CV])
    We propose KGTN-ens, a framework extending the recent Knowledge Graph Transfer Network (KGTN) in order to incorporate multiple knowledge graph embeddings at a small cost. We evaluate it with different combinations of embeddings in a few-shot image classification task. We also construct a new knowledge source - Wikidata embeddings - and evaluate it with KGTN and KGTN-ens. Our approach outperforms KGTN in terms of the top-5 accuracy on the ImageNet-FS dataset for the majority of tested settings.
    Decentralized Policy Optimization. (arXiv:2211.03032v1 [cs.LG])
The study of decentralized learning or independent learning in cooperative multi-agent reinforcement learning has a history of decades. Recent empirical studies show that independent PPO (IPPO) can obtain good performance, close to or even better than the methods of centralized training with decentralized execution, in several benchmarks. However, a decentralized actor-critic algorithm with a convergence guarantee is still open. In this paper, we propose \textit{decentralized policy optimization} (DPO), a decentralized actor-critic algorithm with monotonic improvement and a convergence guarantee. We derive a novel decentralized surrogate for policy optimization such that the monotonic improvement of the joint policy is guaranteed by each agent \textit{independently} optimizing the surrogate. In practice, this decentralized surrogate can be realized by two adaptive coefficients for policy optimization at each agent. Empirically, we compare DPO with IPPO in a variety of cooperative multi-agent tasks, covering discrete and continuous action spaces and fully and partially observable environments. The results show that DPO outperforms IPPO in most tasks, corroborating our theoretical results.
    Direct deduction of chemical class from NMR spectra. (arXiv:2211.03173v1 [physics.chem-ph])
This paper presents a proof-of-concept method for classifying chemical compounds directly from NMR data without performing structure elucidation. This can help to reduce the time needed to find good structure candidates, since in most cases matching must be done by a human engineer, or at the very least the matching process must be meaningfully interpreted by one. Automation in the area of NMR has therefore long been actively sought. The method identified as suitable for the classification is a convolutional neural network (CNN). Other methods, including clustering and image registration, were not found suitable for the task in a comparative analysis. The results show that deep learning can offer solutions to automation problems in cheminformatics.
    Graph Neural Networks for Multimodal Single-Cell Data Integration. (arXiv:2203.01884v3 [cs.LG] UPDATED)
Recent advances in multimodal single-cell technologies have enabled simultaneous acquisition of multiple omics data from the same cell, providing deeper insights into cellular states and dynamics. However, it is challenging to learn the joint representations from the multimodal data, model the relationship between modalities, and, more importantly, incorporate the vast amount of single-modality datasets into the downstream analyses. To address these challenges and correspondingly facilitate multimodal single-cell data analyses, three key tasks have been introduced: $\textit{modality prediction}$, $\textit{modality matching}$ and $\textit{joint embedding}$. In this work, we present a general Graph Neural Network framework, $\textit{scMoGNN}$, to tackle these three tasks, and show that $\textit{scMoGNN}$ demonstrates superior results in all three tasks compared with state-of-the-art and conventional approaches. Our method is an official winner in the overall ranking of the $\textit{modality prediction}$ task of the NeurIPS 2021 Competition, and all implementations of our methods have been integrated into the DANCE package~\url{https://github.com/OmicsML/dance}.
    Multi-Objective Evolutionary for Object Detection Mobile Architectures Search. (arXiv:2211.02791v1 [cs.CV])
Recently, neural architecture search has achieved great success on classification tasks for mobile devices. The backbone network for object detection is usually obtained via the image classification task. However, an architecture searched through the classification task is sub-optimal because of the gap between image classification and object detection. Meanwhile, work focusing on backbone architecture search for mobile-device object detection is limited, mainly because the backbone always requires expensive ImageNet pre-training. Accordingly, it is necessary to study network architecture search for mobile-device object detection without expensive pre-training. In this work, we propose a mobile object detection backbone search algorithm: an evolutionary method based on non-dominated sorting for NAS scenarios (see the sketch below). It can quickly find backbone architectures within given constraints, and it avoids the suboptimality of scalarizing accuracy and computational cost into a single linear combination. The proposed approach can search backbone networks with different depths, widths, or expansion sizes via a weight-mapping technique, making NAS for mobile-device detection tasks far more efficient. In our experiments, we verify the effectiveness of the proposed approach on YoloX-Lite, a lightweight version of the target detection framework. Under similar computational complexity, the accuracy of the searched backbone architecture is 2.0% mAP higher than MobileDet. Our improved backbone network reduces computational effort while improving the accuracy of the object detection network. To prove its effectiveness, a series of ablation studies have been carried out and the working mechanism has been analyzed in detail.
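Non-dominated sorting, the core of such multi-objective evolutionary search, can be sketched in a few lines. Here each candidate architecture is scored on (error, FLOPs), both to be minimized, and the function returns the Pareto fronts in rank order; this is illustrative code under our own conventions, not the authors' implementation.

```python
def non_dominated_sort(points):
    """Partition candidates into Pareto fronts.

    points: list of (error, flops) tuples, both minimized.
    Returns a list of fronts, each a list of indices; front 0 is Pareto-optimal.
    """
    n = len(points)
    dominates = lambda a, b: all(x <= y for x, y in zip(a, b)) and a != b
    dominated_by = [set() for _ in range(n)]   # indices that dominate i
    for i in range(n):
        for j in range(n):
            if dominates(points[j], points[i]):
                dominated_by[i].add(j)
    fronts, assigned = [], set()
    while len(assigned) < n:
        # A candidate joins the next front once all its dominators are assigned.
        front = [i for i in range(n)
                 if i not in assigned and dominated_by[i] <= assigned]
        fronts.append(front)
        assigned.update(front)
    return fronts

candidates = [(0.24, 1.1), (0.20, 2.3), (0.30, 0.8), (0.25, 1.5)]
print(non_dominated_sort(candidates))   # [[0, 1, 2], [3]]
```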
    Nix-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation. (arXiv:2203.15643v2 [cs.SD] UPDATED)
Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size or use neural architecture search and thus suffer from high training costs. We present Nix-TTS, a lightweight TTS obtained via knowledge distillation from a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller, with only 5.23M parameters, up to an 89.34% reduction relative to the teacher model; it also achieves over 3.04x and 8.36x inference speedup on an Intel i7 CPU and a Raspberry Pi 3B respectively, and still retains fair voice naturalness and intelligibility compared to the teacher model. We provide pretrained models and audio samples of Nix-TTS.
    Prediction of superconducting properties of materials based on machine learning models. (arXiv:2211.03075v1 [cond-mat.supr-con])
The application of superconducting materials is becoming more and more widespread. Traditionally, the discovery of new superconducting materials relies on the experience of experts and a large number of trial-and-error experiments, which not only increases the cost of the experiments but also prolongs the period of discovering new superconducting materials. In recent years, machine learning has been increasingly applied to materials science. On this basis, this manuscript proposes the use of an XGBoost model to identify superconductors; the first application of a deep forest model to predict the critical temperature of superconductors; the first application of deep forest to predict the band gap of materials; and the application of a new sub-network model to predict the Fermi energy level of materials. Compared with similar literature known to us, all of the above algorithms reach state-of-the-art performance. Finally, this manuscript uses the above models to search the COD public dataset and identifies 50 candidate superconducting materials with possible critical temperatures greater than 90 K.
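The first step, superconductor identification, reduces to a standard binary classification once composition-derived descriptors are available; here is a minimal sketch with synthetic stand-in features, where the variable names and model settings are our own assumptions.

```python
import numpy as np
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

# X: composition-derived descriptors; y: 1 = superconductor, 0 = not.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 40))                     # stand-in for real descriptors
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)       # synthetic labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = XGBClassifier(n_estimators=300, max_depth=6, learning_rate=0.1,
                    eval_metric="logloss")
clf.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, clf.predict(X_te)))
```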
    Stochastic Halpern Iteration with Variance Reduction for Stochastic Monotone Inclusions. (arXiv:2203.09436v3 [math.OC] UPDATED)
    We study stochastic monotone inclusion problems, which widely appear in machine learning applications, including robust regression and adversarial learning. We propose novel variants of stochastic Halpern iteration with recursive variance reduction. In the cocoercive -- and more generally Lipschitz-monotone -- setup, our algorithm attains $\epsilon$ norm of the operator with $\mathcal{O}(\frac{1}{\epsilon^3})$ stochastic operator evaluations, which significantly improves over state of the art $\mathcal{O}(\frac{1}{\epsilon^4})$ stochastic operator evaluations required for existing monotone inclusion solvers applied to the same problem classes. We further show how to couple one of the proposed variants of stochastic Halpern iteration with a scheduled restart scheme to solve stochastic monotone inclusion problems with ${\mathcal{O}}(\frac{\log(1/\epsilon)}{\epsilon^2})$ stochastic operator evaluations under additional sharpness or strong monotonicity assumptions.
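For intuition, a simplified deterministic-gradient Halpern iteration anchors every step to the initial point with a vanishing weight; the paper's stochastic variants replace the exact operator evaluation with a recursively variance-reduced estimator. The cocoercive example below is our own simplification for illustration.

```python
import numpy as np

def halpern(F, x0, steps=2000, eta=0.5):
    """Halpern iteration: x_{k+1} = beta_k * x0 + (1 - beta_k) * (x_k - eta * F(x_k)).

    beta_k = 1/(k+2) is the classic anchoring schedule; for a cocoercive
    operator F and a small enough step eta, ||F(x_k)|| tends to zero.
    """
    x = x0.copy()
    for k in range(steps):
        beta = 1.0 / (k + 2)
        x = beta * x0 + (1 - beta) * (x - eta * F(x))
    return x

# Cocoercive example: F is the gradient of the convex quadratic 0.5*||Ax - b||^2.
rng = np.random.default_rng(0)
A = rng.normal(size=(8, 5))
b = rng.normal(size=8)
F = lambda x: A.T @ (A @ x - b)
eta = 1.0 / np.linalg.norm(A, 2) ** 2     # step below 1/L with L = ||A||_2^2
x = halpern(F, np.zeros(5), eta=eta)
print(np.linalg.norm(F(x)))               # operator norm driven close to zero
```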
    6D Rotation Representation For Unconstrained Head Pose Estimation. (arXiv:2202.12555v2 [cs.CV] UPDATED)
In this paper, we present a method for unconstrained end-to-end head pose estimation. We address the problem of ambiguous rotation labels by introducing the rotation matrix formalism for our ground truth data and propose a continuous 6D rotation matrix representation for efficient and robust direct regression. This way, our method can learn the full rotation appearance, in contrast to previous approaches that restrict pose prediction to a narrow angle range to obtain satisfactory results. In addition, we propose a geodesic distance-based loss to penalize our network with respect to the SO(3) manifold geometry. Experiments on the public AFLW2000 and BIWI datasets demonstrate that our proposed method significantly outperforms other state-of-the-art methods by up to 20\%. We open-source our training and testing code along with our pre-trained models: https://github.com/thohemp/6DRepNet.
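The continuous 6D representation referred to above (in the Gram-Schmidt formulation popularized by Zhou et al.) maps two free 3-vectors to a valid rotation matrix; a minimal NumPy sketch:

```python
import numpy as np

def rotation_from_6d(r6):
    """Map a 6D vector (two stacked 3-vectors) to a rotation matrix in SO(3).

    Gram-Schmidt makes the map continuous and surjective onto SO(3),
    avoiding the discontinuities of Euler-angle and quaternion targets.
    """
    a1, a2 = r6[:3], r6[3:]
    b1 = a1 / np.linalg.norm(a1)
    b2 = a2 - (b1 @ a2) * b1            # remove the component along b1
    b2 = b2 / np.linalg.norm(b2)
    b3 = np.cross(b1, b2)               # completes a right-handed frame
    return np.stack([b1, b2, b3], axis=1)   # columns form an orthonormal basis

R = rotation_from_6d(np.array([1.0, 0.2, 0.0, 0.0, 1.0, 0.3]))
assert np.allclose(R.T @ R, np.eye(3), atol=1e-8)
assert np.isclose(np.linalg.det(R), 1.0)
```

A network regressing these six numbers never has to satisfy orthogonality constraints itself; the map above enforces them after the fact.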
    Speaker Normalization for Self-supervised Speech Emotion Recognition. (arXiv:2202.01252v2 [cs.LG] UPDATED)
Large speech emotion recognition datasets are hard to obtain, and small datasets may contain biases. Deep-net-based classifiers, in turn, are prone to exploit those biases and find shortcuts such as speaker characteristics. These shortcuts usually harm a model's ability to generalize. To address this challenge, we propose a gradient-based adversarial learning framework that learns a speech emotion recognition task while normalizing speaker characteristics from the feature representation. We demonstrate the efficacy of our method in both speaker-independent and speaker-dependent settings and obtain new state-of-the-art results on the challenging IEMOCAP dataset.
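A common way to implement such gradient-based adversarial normalization is a gradient reversal layer: a speaker classifier trains normally, while the sign-flipped gradient pushes the shared encoder toward speaker-invariant features. This is a sketch of the general technique, not necessarily the authors' exact architecture, and the layer sizes are placeholders.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

encoder = nn.Sequential(nn.Linear(80, 128), nn.ReLU())
emotion_head = nn.Linear(128, 4)      # emotion classes
speaker_head = nn.Linear(128, 10)     # speaker IDs (the adversary)

x = torch.randn(32, 80)               # stand-in acoustic features
h = encoder(x)
emotion_logits = emotion_head(h)
speaker_logits = speaker_head(GradReverse.apply(h, 1.0))
# Minimizing both losses trains the speaker head normally, while the
# reversed gradient drives the encoder to discard speaker information.
loss = (nn.functional.cross_entropy(emotion_logits, torch.randint(4, (32,)))
        + nn.functional.cross_entropy(speaker_logits, torch.randint(10, (32,))))
loss.backward()
```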
    A Filtering-based General Approach to Learning Rational Constraints of Epistemic Graphs. (arXiv:2211.02918v1 [cs.AI])
Epistemic graphs generalize the epistemic approach to probabilistic argumentation and tackle the uncertainties in and between arguments. A previous framework generates epistemic constraints from data using a two-way generalization method, considering only the beliefs of participants and not the nature of the relations represented in an epistemic graph. The deficiency of the original framework is that it is unable to learn rules with tighter constraints, and the learnt rules might be counterintuitive. Moreover, when dealing with more restricted values, the filtering computational complexity increases sharply and the running time becomes unreasonable. This paper introduces a filtering-based approach using a multiple-way generalization step to generate a set of rational rules, based on both the beliefs of each agent about different arguments and the epistemic graph corresponding to the epistemic constraints. This approach is able to generate rational rules with multiple restricted values more efficiently. We also propose a standard for analyzing the rationality of a dataset based on the postulates for deciding rational rules. We evaluate the filtering-based approach on two suitable datasets. The empirical results show that the filtering-based approach performs with better efficiency than the original framework, and the rules generated by the improved approach are guaranteed to be rational.
    Exposing Surveillance Detection Routes via Reinforcement Learning, Attack Graphs, and Cyber Terrain. (arXiv:2211.03027v1 [cs.LG])
Reinforcement learning (RL) operating on attack graphs and leveraging cyber terrain principles is used to develop the reward and state formulations associated with the determination of surveillance detection routes (SDR). This work extends previous efforts on developing RL methods for path analysis within enterprise networks. It focuses on building SDR in which the routes explore network services while trying to evade risk. RL supports the development of these routes through a reward mechanism that helps realize these paths. The RL algorithm is modified with a novel warm-up phase that decides, during the initial exploration, which areas of the network are safe to explore based on the rewards and a penalty scale factor.
    EventEA: Benchmarking Entity Alignment for Event-centric Knowledge Graphs. (arXiv:2211.02817v1 [cs.CL])
Entity alignment is the task of finding identical entities in different knowledge graphs (KGs) that refer to the same real-world object. Embedding-based entity alignment techniques have been drawing a lot of attention recently because they can help solve the issue of symbolic heterogeneity between different KGs. However, in this paper, we show that the progress made in the past was due to biased and unchallenging evaluation. We highlight two major flaws in existing datasets that favor embedding-based entity alignment techniques, i.e., the isomorphic graph structures in relation triples and the weak heterogeneity in attribute triples. Towards a critical evaluation of embedding-based entity alignment methods, we construct a new dataset with heterogeneous relations and attributes based on event-centric KGs. We conduct extensive experiments to evaluate existing popular methods and find that they fail to achieve promising performance. As a new approach to this difficult problem, we propose a time-aware literal encoder for entity alignment. The dataset and source code are publicly available to foster future research. Our work calls for more effective and practical embedding-based solutions to entity alignment.
    PAEDID: Patch Autoencoder Based Deep Image Decomposition For Pixel-level Defective Region Segmentation. (arXiv:2203.14457v3 [cs.CV] UPDATED)
Unsupervised pixel-level defective region segmentation is an important task in image-based anomaly detection for various industrial applications. The state-of-the-art methods have their own advantages and limitations: matrix-decomposition-based methods are robust to noise but lack the capability to model complex backgrounds; representation-based methods are good at defective region localization but lack accuracy in extracting the shape contours of defective regions; reconstruction-based methods detect defective regions that match the ground-truth shape contours well but are noisy. To combine the strengths of these approaches, we present an unsupervised patch autoencoder based deep image decomposition (PAEDID) method for defective region segmentation. In the training stage, we learn the common background as a deep image prior using a patch autoencoder (PAE) network. In the inference stage, we formulate anomaly detection as an image decomposition problem with the deep image prior and domain-specific regularizations. By adopting the proposed approach, the defective regions in an image can be accurately extracted in an unsupervised fashion. We demonstrate the effectiveness of the PAEDID method in simulation studies and in a case study on an industrial dataset.
    The Shape of Learning Curves: a Review. (arXiv:2103.10948v2 [cs.LG] UPDATED)
    Learning curves provide insight into the dependence of a learner's generalization performance on the training set size. This important tool can be used for model selection, to predict the effect of more training data, and to reduce the computational complexity of model training and hyperparameter tuning. This review recounts the origins of the term, provides a formal definition of the learning curve, and briefly covers basics such as its estimation. Our main contribution is a comprehensive overview of the literature regarding the shape of learning curves. We discuss empirical and theoretical evidence that supports well-behaved curves that often have the shape of a power law or an exponential. We consider the learning curves of Gaussian processes, the complex shapes they can display, and the factors influencing them. We draw specific attention to examples of learning curves that are ill-behaved, showing worse learning performance with more training data. To wrap up, we point out various open problems that warrant deeper empirical and theoretical investigation. All in all, our review underscores that learning curves are surprisingly diverse and no universal model can be identified.
    CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP. (arXiv:2110.11316v4 [cs.LG] UPDATED)
CLIP yielded impressive results on zero-shot transfer learning tasks and is considered a foundation model like BERT or GPT-3. CLIP vision models, which have a rich representation, are pre-trained using the InfoNCE objective and natural language supervision before they are fine-tuned on particular tasks. Though CLIP excels at zero-shot transfer learning, it suffers from an explaining-away problem, that is, it focuses on one or a few features while neglecting other relevant features. This problem is caused by insufficiently extracting the covariance structure in the original multi-modal data. We suggest using modern Hopfield networks to tackle the problem of explaining away. Their retrieved embeddings have an enriched covariance structure derived from co-occurrences of features in the stored embeddings. However, modern Hopfield networks increase the saturation effect of the InfoNCE objective, which hampers learning. We propose using the InfoLOOB objective to mitigate this saturation effect. We introduce the novel "Contrastive Leave One Out Boost" (CLOOB), which uses modern Hopfield networks for covariance enrichment together with the InfoLOOB objective. In experiments we compare CLOOB to CLIP after pre-training on the Conceptual Captions and YFCC datasets with respect to their zero-shot transfer learning performance on other datasets. CLOOB consistently outperforms CLIP at zero-shot transfer learning across all considered architectures and datasets.
    Forecasting User Interests Through Topic Tag Predictions in Online Health Communities. (arXiv:2211.02789v1 [cs.LG])
The increasing reliance on online communities for healthcare information by patients and caregivers has led to an increase in the spread of misinformation, or subjective, anecdotal, inaccurate, or non-specific recommendations which, if acted on, could cause serious harm to patients. Hence, there is an urgent need to connect users with accurate and tailored health information in a timely manner to prevent such harm. This paper proposes an innovative approach to suggesting reliable information to participants in online communities as they move through different stages of their disease or treatment. We hypothesize that patients with similar histories of disease progression or courses of treatment will have similar information needs at comparable stages. Specifically, we pose the problem of predicting the topic tags or keywords that describe the future information needs of users, based on their profiles and the traces of their online interactions within the community (past posts, replies), together with the profiles and interaction traces of other users with similar profiles and similar past interactions with the target users. The result is a variant of collaborative information filtering, or a recommendation system, tailored to the needs of users of online health communities. We report the results of experiments on an expert-curated dataset, which demonstrate the superiority of the proposed approach over state-of-the-art baselines with respect to accurate and timely prediction of topic tags (and hence information sources of interest).
    A Robust and Low Complexity Deep Learning Model for Remote Sensing Image Classification. (arXiv:2211.02820v1 [cs.CV])
In this paper, we present a robust and low-complexity deep learning model for Remote Sensing Image Classification (RSIC), the task of identifying the scene of a remote sensing image. In particular, we first evaluate different low-complexity benchmark deep neural networks: MobileNetV1, MobileNetV2, NASNetMobile, and EfficientNetB0, each with fewer than 5 million (M) trainable parameters. After identifying the best network architecture, we further improve the network performance by applying attention schemes to multiple feature maps extracted from middle layers of the network. To counter the increase in model footprint caused by the attention schemes, we apply quantization to keep the number of trainable parameters below 5 M. By conducting extensive experiments on the benchmark dataset NWPU-RESISC45, we achieve a robust and low-complexity model that is very competitive with state-of-the-art systems and has potential for real-life applications on edge devices.
    Unifying Approaches in Active Learning and Active Sampling via Fisher Information and Information-Theoretic Quantities. (arXiv:2208.00549v2 [cs.LG] UPDATED)
    Recently proposed methods in data subset selection, that is active learning and active sampling, use Fisher information, Hessians, similarity matrices based on gradients, and gradient lengths to estimate how informative data is for a model's training. Are these different approaches connected, and if so, how? We revisit the fundamentals of Bayesian optimal experiment design and show that these recently proposed methods can be understood as approximations to information-theoretic quantities: among them, the mutual information between predictions and model parameters, known as expected information gain or BALD in machine learning, and the mutual information between predictions of acquisition candidates and test samples, known as expected predictive information gain. We develop a comprehensive set of approximations using Fisher information and observed information and derive a unified framework that connects seemingly disparate literature. Although Bayesian methods are often seen as separate from non-Bayesian ones, the sometimes fuzzy notion of "informativeness" expressed in various non-Bayesian objectives leads to the same couple of information quantities, which were, in principle, already known by Lindley (1956) and MacKay (1992).
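The expected information gain (BALD) mentioned here has a simple Monte-Carlo estimator from sampled predictive distributions, for instance from MC-dropout or an ensemble. A sketch, where `probs` is a hypothetical array of per-draw class probabilities:

```python
import numpy as np

def bald_scores(probs, eps=1e-12):
    """BALD = H[E_theta p(y|x,theta)] - E_theta H[p(y|x,theta)].

    probs: (n_draws, n_points, n_classes) predictive probabilities from
    n_draws posterior samples (dropout masks, ensemble members, ...).
    Returns one mutual-information score per candidate point.
    """
    mean = probs.mean(axis=0)                                   # marginal predictive
    h_mean = -(mean * np.log(mean + eps)).sum(axis=-1)          # total uncertainty
    mean_h = -(probs * np.log(probs + eps)).sum(axis=-1).mean(axis=0)  # aleatoric part
    return h_mean - mean_h                                      # epistemic part only

probs = np.random.dirichlet([1, 1, 1], size=(20, 100))          # 20 draws, 100 points
scores = bald_scores(probs)
query = np.argsort(scores)[-10:]    # acquire the 10 most informative points
```

Points where the draws disagree (high mean entropy gap) score highest, which is exactly the mutual information between the prediction and the model parameters.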
    Can Ensemble of Classifiers Provide Better Recognition Results in Packaging Activity?. (arXiv:2211.02965v1 [cs.CV])
Skeleton-based Motion Capture (MoCap) systems have long been used in the game and film industries to mimic complex human actions. MoCap data has also proved its effectiveness in human activity recognition tasks. However, it is quite a challenging task for smaller datasets, and the lack of such data for industrial activities further adds to the difficulty. In this work, we propose an ensemble-based machine learning methodology targeted at working better on MoCap datasets. The experiments were performed on the MoCap data from the Bento Packaging Activity Recognition Challenge 2021 (bento is the Japanese word for a lunch box). After processing the raw MoCap data, we achieve an accuracy of 98% on 10-fold cross-validation and 82% on leave-one-out cross-validation using the proposed ensemble model.
    SHARP: Environment and Person Independent Activity Recognition with Commodity IEEE 802.11 Access Points. (arXiv:2103.09924v2 [eess.SP] UPDATED)
In this article we present SHARP, an original approach for obtaining human activity recognition (HAR) through the use of commercial IEEE 802.11 (Wi-Fi) devices. SHARP makes it possible to discern the activities of different persons, across different time-spans and environments. To achieve this, we devise a new technique to clean and process the channel frequency response (CFR) phase of the Wi-Fi channel, obtaining an estimate of the Doppler shift at a radio monitor device. The Doppler shift reveals the presence of moving scatterers in the environment, while not being affected by (environment-specific) static objects. SHARP is trained on data collected as a person performs seven different activities in a single environment. It is then tested on different setups, to assess its performance as the person, the day and/or the environment change with respect to those considered at training time. In the worst-case scenario, it reaches an average accuracy higher than 95%, validating the effectiveness of the extracted Doppler information, used in conjunction with a learning algorithm based on a neural network, in recognizing human activities in a subject and environment independent way. The collected CFR dataset and the code are publicly available for replicability and benchmarking purposes.
    Bayesian learning of Causal Structure and Mechanisms with GFlowNets and Variational Bayes. (arXiv:2211.02763v1 [cs.LG])
    Bayesian causal structure learning aims to learn a posterior distribution over directed acyclic graphs (DAGs), and the mechanisms that define the relationship between parent and child variables. By taking a Bayesian approach, it is possible to reason about the uncertainty of the causal model. The notion of modelling the uncertainty over models is particularly crucial for causal structure learning since the model could be unidentifiable when given only a finite amount of observational data. In this paper, we introduce a novel method to jointly learn the structure and mechanisms of the causal model using Variational Bayes, which we call Variational Bayes-DAG-GFlowNet (VBG). We extend the method of Bayesian causal structure learning using GFlowNets to learn not only the posterior distribution over the structure, but also the parameters of a linear-Gaussian model. Our results on simulated data suggest that VBG is competitive against several baselines in modelling the posterior over DAGs and mechanisms, while offering several advantages over existing methods, including the guarantee to sample acyclic graphs, and the flexibility to generalize to non-linear causal mechanisms.
    Lightweight 3D Convolutional Neural Network for Schizophrenia diagnosis using MRI Images and Ensemble Bagging Classifier. (arXiv:2211.02868v1 [eess.IV])
Structural alterations during the early onset of schizophrenia (SCZ) have been thoroughly investigated with the development of neuroimaging methods. The objective of this paper is the efficient classification of subjects into two classes, cognitively normal (CN) and SCZ, using magnetic resonance imaging (MRI) images. This paper proposes a lightweight 3D convolutional neural network (CNN) based framework for SCZ diagnosis using MRI images. In the proposed model, a lightweight 3D CNN is used to extract both spatial and spectral features simultaneously from 3D volume MRI scans, and classification is done using an ensemble bagging classifier. The ensemble bagging classifier helps prevent overfitting, reduces variance, and improves the model's accuracy. The proposed algorithm is tested on datasets taken from three open-source benchmark databases: MCICShare, COBRE, and fBIRNPhase-II. These datasets underwent preprocessing steps to register all the MRI images to a standard template and reduce artifacts. The model achieves an accuracy of 92.22%, sensitivity of 94.44%, specificity of 90%, precision of 90.43%, recall of 94.44%, F1-score of 92.39% and G-mean of 92.19%, improving on current state-of-the-art techniques. These performance metrics support the use of this model to assist clinicians in the automatic, accurate diagnosis of SCZ.
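The ensemble step can be sketched with scikit-learn, assuming the 3D CNN has already produced an embedding matrix; the variable names and the synthetic data below are our own placeholders, not the paper's pipeline.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier   # `estimator` kwarg per sklearn >= 1.2
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# features: embeddings from the 3D CNN; labels: 0 = CN, 1 = SCZ.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))            # stand-in for CNN embeddings
labels = (features[:, :2].sum(axis=1) > 0).astype(int)

# Bagging trains each base learner on a bootstrap resample, which lowers
# variance and curbs overfitting on small medical datasets.
clf = BaggingClassifier(estimator=DecisionTreeClassifier(max_depth=5),
                        n_estimators=50, max_samples=0.8, random_state=0)
print(cross_val_score(clf, features, labels, cv=10).mean())
```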
    1-D Convolutional Graph Convolutional Networks for Fault Detection in Distributed Energy Systems. (arXiv:2211.02930v1 [eess.SY])
This paper presents a 1-D convolutional graph neural network for fault detection in microgrids. The combination of 1-D convolutional neural networks (1D-CNN) and graph convolutional networks (GCN) helps extract both spatial and temporal correlations from the voltage measurements in microgrids. The fault detection scheme includes fault event detection, fault type and phase classification, and fault location. Five neural network models are trained to handle these tasks. Transfer learning and fine-tuning are applied to reduce training efforts. The combined 1-D convolutional graph convolutional network (1D-CGCN) is compared with the traditional ANN structure on the Potsdam 13-bus microgrid dataset. It achieves accuracies of 99.27%, 98.1%, 98.75%, and 95.6% for fault detection, fault type classification, fault phase identification, and fault location, respectively.
    "Seeing Sound": Audio Classification with the Wigner-Wille Distribution and Convolutional Neural Networks. (arXiv:2211.03202v1 [cs.SD])
With big data becoming increasingly available, IoT hardware becoming widely adopted, and AI capabilities becoming more powerful, organizations are continuously investing in sensing. Data coming from sensor networks are currently combined with sensor fusion and AI algorithms to drive innovation in fields such as self-driving cars. Data from these sensors can be utilized in numerous use cases, including alerts in safety systems of urban settings for events such as gunshots and explosions. Moreover, diverse types of sensors, such as sound sensors, can be utilized in low-light conditions or at locations where a camera is not available. This paper investigates the potential of utilizing sound-sensor data in an urban context. Technically, we propose a novel approach to classifying sound data using the Wigner-Ville distribution and convolutional neural networks, and we report on the performance of the approach on open-source datasets. The concept and work presented are based on my doctoral thesis, which was performed as part of the Engineering Doctorate program in Data Science at the University of Eindhoven, in collaboration with the Dutch National Police. Additional work on real-world datasets was performed during the thesis but is not presented here due to confidentiality.
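For readers unfamiliar with the transform, here is a naive NumPy/SciPy sketch of the discrete Wigner-Ville distribution; practical implementations window the lag axis to suppress cross-terms, and the variable names are ours. The resulting time-frequency image is what would be fed to a CNN classifier.

```python
import numpy as np
from scipy.signal import hilbert

def wigner_ville(x):
    """Discrete Wigner-Ville distribution of a real 1-D signal.

    Returns an (n, n) time-frequency image; frequency bin k corresponds
    to k*fs/(2n) Hz because of the transform's implicit factor of two.
    """
    z = hilbert(x)                     # analytic signal suppresses mirror terms
    n = len(z)
    wvd = np.zeros((n, n))
    for t in range(n):
        taumax = min(t, n - 1 - t)
        tau = np.arange(-taumax, taumax + 1)
        r = np.zeros(n, dtype=complex)
        # instantaneous autocorrelation r(tau) = z(t+tau) * conj(z(t-tau))
        r[tau % n] = z[t + tau] * np.conj(z[t - tau])
        wvd[t] = np.real(np.fft.fft(r))
    return wvd.T                       # rows = frequency bins, columns = time

fs = 1000
t = np.arange(0, 1, 1 / fs)
chirp = np.cos(2 * np.pi * (50 * t + 100 * t ** 2))   # 50 -> 250 Hz chirp
tfr = wigner_ville(chirp)   # "image" of the chirp's rising frequency track
```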
    Active Learning for Saddle Point Calculation. (arXiv:2108.04698v2 [stat.ML] UPDATED)
The saddle point (SP) calculation is a grand challenge for computationally intensive energy functions in computational chemistry, where a saddle point may represent a transition state (TS). Traditional methods need to evaluate the gradients of the energy function at a very large number of locations. To reduce the number of expensive computations of the true gradients, we propose an active learning framework consisting of a statistical surrogate model, Gaussian process regression (GPR) for the energy function, and a single-walker dynamics method, gentlest ascent dynamics (GAD), for the saddle-type transition states. The SP is detected by applying GAD to the GPR surrogate for the gradient vector and the Hessian matrix. Our key ingredient for efficiency improvements is an active learning method that sequentially designs the most informative locations and evaluates the original model at these locations to train the GPR. We formulate this active learning task as an optimal experimental design problem and propose a very efficient sample-based sub-optimal criterion to construct the optimal locations. We show that the new method significantly decreases the required number of energy or force evaluations of the original model.
Incompleteness of graph neural networks for point clouds in three dimensions. (arXiv:2201.07136v4 [stat.ML] UPDATED)
    Graph neural networks (GNN) are very popular methods in machine learning and have been applied very successfully to the prediction of the properties of molecules and materials. First-order GNNs are well known to be incomplete, i.e., there exist graphs that are distinct but appear identical when seen through the lens of the GNN. More complicated schemes have thus been designed to increase their resolving power. Applications to molecules (and more generally, point clouds), however, add a geometric dimension to the problem. The most straightforward and prevalent approach to construct graph representation for molecules regards atoms as vertices in a graph and draws a bond between each pair of atoms within a chosen cutoff. Bonds can be decorated with the distance between atoms, and the resulting "distance graph NNs" (dGNN) have empirically demonstrated excellent resolving power and are widely used in chemical ML, with all known indistinguishable configurations being resolved in the fully-connected limit, which is equivalent to infinite or sufficiently large cutoff. Here we present a counterexample that proves that dGNNs are not complete even for the restricted case of fully-connected graphs induced by 3D atom clouds. We construct pairs of distinct point clouds whose associated graphs are, for any cutoff radius, equivalent based on a first-order Weisfeiler-Lehman test. This class of degenerate structures includes chemically-plausible configurations, both for isolated structures and for infinite structures that are periodic in 1, 2, and 3 dimensions. The existence of indistinguishable configurations sets an ultimate limit to the expressive power of some of the well-established GNN architectures for atomistic machine learning. Models that explicitly use angular or directional information in the description of atomic environments can resolve this class of degeneracies.
    Neural Stein critics with staged $L^2$-regularization. (arXiv:2207.03406v2 [stat.ML] UPDATED)
    Learning to differentiate model distributions from observed data is a fundamental problem in statistics and machine learning, and high-dimensional data remains a challenging setting for such problems. Metrics that quantify the disparity in probability distributions, such as the Stein discrepancy, play an important role in statistical testing in high dimensions. In this paper, we investigate the role of $L^2$ regularization in training a neural network Stein critic so as to distinguish between data sampled from an unknown probability distribution and a nominal model distribution. Motivated by the Neural Tangent Kernel (NTK) theory, we develop a novel staging procedure for the weight of regularization over training time. This leverages the advantages of highly-regularized training at early times while also empirically delaying overfitting. Theoretically, we prove the approximation of the training dynamic by the kernel optimization, namely the ``lazy training'', when the $L^2$ regularization weight is large. The result provides a guarantee of learning the optimal critic assuming sufficient alignment with the leading eigen-modes of the zero-time NTK. The benefit of the staged $L^2$ regularization is demonstrated on simulated high dimensional distribution drift data and an application to evaluating generative models of image data.
    Recent Developments in Structure-Based Virtual Screening Approaches. (arXiv:2211.03208v1 [q-bio.BM])
Drug development is a wide scientific field that faces many challenges these days. Among them are extremely high development costs, long development times, and the small number of new drugs that are approved each year. To solve these problems, new and innovative technologies are needed that make the drug discovery process for small molecules more time- and cost-efficient, and that make it possible to target previously undruggable target classes such as protein-protein interactions. Structure-based virtual screenings have become a leading contender in this context. In this review, we give an introduction to the foundations of structure-based virtual screenings and survey their progress in the past few years. We outline key principles, recent success stories, new methods, available software, and promising future research directions. Virtual screenings have an enormous potential for the development of new small-molecule drugs, and are already starting to transform early-stage drug discovery.
    Degradation Prediction of Semiconductor Lasers using Conditional Variational Autoencoder. (arXiv:2211.02847v1 [cs.LG])
Semiconductor lasers have been rapidly evolving to meet the demands of next-generation optical networks. This imposes much more stringent requirements on laser reliability, which is dominated by degradation mechanisms (e.g., sudden degradation) limiting the semiconductor laser lifetime. Physics-based approaches are often used to characterize the degradation behavior analytically, yet explicit domain knowledge and accurate mathematical models are required. Building such models can be very challenging due to a lack of full understanding of the complex physical processes inducing the degradation under various operating conditions. To overcome these limitations, we propose a new data-driven approach that extracts useful insights from operational monitoring data to predict the degradation trend without requiring any specific knowledge or physical model. The proposed approach is based on an unsupervised technique, a conditional variational autoencoder, and is validated using vertical-cavity surface-emitting laser (VCSEL) and tunable edge-emitting laser reliability data. The experimental results confirm that our model (i) achieves good degradation prediction and generalization performance, yielding an F1 score of 95.3%, (ii) outperforms several baseline ML-based anomaly detection techniques, and (iii) helps to shorten aging tests by predicting failed devices early, before the end of the test, thereby saving costs.
    GRIMGEP: Learning Progress for Robust Goal Sampling in Visual Deep Reinforcement Learning. (arXiv:2008.04388v3 [cs.LG] UPDATED)
Designing agents capable of autonomously learning a wide range of skills is critical to increasing the scope of reinforcement learning. It would both increase the diversity of learned skills and reduce the burden of manually designing reward functions for each skill. Self-supervised agents, setting their own goals and trying to maximize the diversity of those goals, have shown great promise towards this end. However, a known limitation of agents trying to maximize the diversity of sampled goals is that they tend to get attracted to noise or, more generally, to parts of the environment that cannot be controlled (distractors). When agents have access to predefined goal features or expert knowledge, absolute Learning Progress (ALP) provides a way to distinguish between regions that can be controlled and those that cannot. However, those methods often fall short when the agents are only provided with raw sensory inputs such as images. In this work we extend those concepts to unsupervised image-based goal exploration. We propose a framework that allows agents to autonomously identify and ignore noisy distracting regions while searching for novelty in the learnable regions, both to improve overall performance and to avoid catastrophic forgetting. Our framework can be combined with any state-of-the-art novelty-seeking goal exploration approach. We construct a rich 3D image-based environment with distractors. Experiments in this environment show that agents using our framework successfully identify interesting regions of the environment, resulting in drastically improved performance. The source code is available at https://sites.google.com/view/grimgep.
    Domain Generalization -- A Causal Perspective. (arXiv:2209.15177v2 [cs.LG] UPDATED)
Machine learning models rely on various assumptions to attain high accuracy. One of the preliminary assumptions of these models is the independent and identical distribution assumption, which suggests that the train and test data are sampled from the same distribution. However, this assumption seldom holds in the real world due to distribution shifts. As a result, models that rely on this assumption exhibit poor generalization capabilities. Over recent years, dedicated efforts have been made to improve the generalization capabilities of these models, collectively known as \textit{domain generalization methods}. The primary idea behind these methods is to identify stable features or mechanisms that remain invariant across different distributions. Many generalization approaches employ causal theories to describe invariance, since causality and invariance are inextricably intertwined. However, current surveys deal with causality-aware domain generalization methods at a very high level. Furthermore, we argue that it is possible to categorize the methods based on how causality is leveraged in the method and in which part of the model pipeline it is used. To this end, we categorize the causal domain generalization methods into three categories, namely, (i) Invariance via Causal Data Augmentation methods, which are applied during the data pre-processing stage, (ii) Invariance via Causal Representation Learning methods, which are utilized during the representation learning stage, and (iii) Invariance via Transferring Causal Mechanisms methods, which are applied during the classification stage of the pipeline. Furthermore, this survey includes in-depth insights into benchmark datasets and code repositories for domain generalization methods. We conclude the survey with insights and discussions on future directions.
    Feature Selection for Classification with QAOA. (arXiv:2211.02861v1 [cs.IR])
    Feature selection is of great importance in Machine Learning, where it can be used to reduce the dimensionality of classification, ranking and prediction problems. The removal of redundant and noisy features can improve both the accuracy and scalability of the trained models. However, feature selection is a computationally expensive task with a solution space that grows combinatorically. In this work, we consider in particular a quadratic feature selection problem that can be tackled with the Quantum Approximate Optimization Algorithm (QAOA), already employed in combinatorial optimization. First we represent the feature selection problem with the QUBO formulation, which is then mapped to an Ising spin Hamiltonian. Then we apply QAOA with the goal of finding the ground state of this Hamiltonian, which corresponds to the optimal selection of features. In our experiments, we consider seven different real-world datasets with dimensionality up to 21 and run QAOA on both a quantum simulator and, for small datasets, the 7-qubit IBM (ibm-perth) quantum computer. We use the set of selected features to train a classification model and evaluate its accuracy. Our analysis shows that it is possible to tackle the feature selection problem with QAOA and that currently available quantum devices can be used effectively. Future studies could test a wider range of classification models as well as improve the effectiveness of QAOA by exploring better performing optimizers for its classical step.
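A QUBO for feature selection can be assembled, for instance, from mutual-information estimates: diagonal terms reward relevance of a feature to the label, off-diagonal terms penalize redundancy between features, and the binary vector minimizing x^T Q x encodes the selected subset. This is an illustrative construction under our own assumptions, not necessarily the paper's exact objective, and brute force stands in for QAOA at this tiny scale.

```python
import numpy as np
from itertools import product
from sklearn.feature_selection import mutual_info_classif
from sklearn.metrics import mutual_info_score

def build_qubo(X, y, alpha=2.0):
    """Q[i,i] = -relevance(i); Q[i,j] = alpha/2 * redundancy(i,j) for i != j."""
    d = X.shape[1]
    relevance = mutual_info_classif(X, y, random_state=0)
    Q = -np.diag(relevance)
    Xb = (X > np.median(X, axis=0)).astype(int)   # binarize for pairwise MI
    for i in range(d):
        for j in range(i + 1, d):
            r = mutual_info_score(Xb[:, i], Xb[:, j])
            Q[i, j] = Q[j, i] = 0.5 * alpha * r   # pair contributes alpha*r total
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=300)    # feature 1 redundant with 0
y = (X[:, 0] + X[:, 2] > 0).astype(int)
Q = build_qubo(X, y)
best = min(product([0, 1], repeat=8),
           key=lambda x: np.array(x) @ Q @ np.array(x))
print(best)   # typically keeps features 0 and 2 and drops the redundant 1
```

Mapping this QUBO to an Ising Hamiltonian (via x = (1 - s)/2) then gives the cost operator whose ground state QAOA searches for.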
    New Definitions and Evaluations for Saliency Methods: Staying Intrinsic, Complete and Sound. (arXiv:2211.02912v1 [stat.ML])
    Saliency methods compute heat maps that highlight portions of an input that were most {\em important} for the label assigned to it by a deep net. Evaluations of saliency methods convert this heat map into a new {\em masked input} by retaining the $k$ highest-ranked pixels of the original input and replacing the rest with \textquotedblleft uninformative\textquotedblright\ pixels, and checking if the net's output is mostly unchanged. This is usually seen as an {\em explanation} of the output, but the current paper highlights reasons why this inference of causality may be suspect. Inspired by logic concepts of {\em completeness \& soundness}, it observes that the above type of evaluation focuses on completeness of the explanation, but ignores soundness. New evaluation metrics are introduced to capture both notions, while staying in an {\em intrinsic} framework -- i.e., using the dataset and the net, but no separately trained nets, human evaluations, etc. A simple saliency method is described that matches or outperforms prior methods in the evaluations. Experiments also suggest new intrinsic justifications, based on soundness, for popular heuristic tricks such as TV regularization and upsampling.
    Leveraging Siamese Networks for One-Shot Intrusion Detection Model. (arXiv:2006.15343v3 [cs.CR] UPDATED)
The use of supervised Machine Learning (ML) to enhance Intrusion Detection Systems has been the subject of significant research. Supervised ML is based upon learning by example, demanding significant volumes of representative instances for effective training, and the model must be re-trained for every unseen cyber-attack class. However, retraining the models in situ renders the network susceptible to attacks owing to the time window required to acquire a sufficient volume of data. Although anomaly detection systems provide a coarse-grained defence against unseen attacks, these approaches are significantly less accurate and suffer from high false-positive rates. Here, a complementary approach referred to as 'One-Shot Learning', whereby a limited number of examples of a new attack class is used to identify that class (out of many), is detailed. The model grants a new cyber-attack classification without retraining. A Siamese Network is trained to differentiate between classes based on pair similarities, rather than features, allowing it to identify new and previously unseen attacks. The performance of a pre-trained model in classifying attack classes based only on one example is evaluated using three datasets. Results confirm the adaptability of the model in classifying unseen attacks and the trade-off between performance and the need for distinctive class representation.
    Hyperspectral Pixel Unmixing with Latent Dirichlet Variational Autoencoder. (arXiv:2203.01327v2 [eess.IV] UPDATED)
    Hyperspectral pixel intensities result from a mixing of reflectances from several materials. This paper develops a method of hyperspectral pixel {\it unmixing} that aims to recover the "pure" spectral signal of each material (hereafter referred to as {\it endmembers}) together with the mixing ratios ({\it abundances}) given the spectrum of a single pixel. The unmixing problem is particularly relevant in the case of low-resolution hyperspectral images captured in a remote sensing setting, where individual pixels can cover large regions of the scene. Under the assumptions that (1) a multivariate Normal distribution can represent the spectra of an endmember and (2) a Dirichlet distribution can encode abundances of different endmembers, we develop a Latent Dirichlet Variational Autoencoder for hyperspectral pixel unmixing. Our approach achieves state-of-the-art results on standard benchmarks and on synthetic data generated using United States Geological Survey spectral library.
    Adversarial Attacks on Transformers-Based Malware Detectors. (arXiv:2210.00008v2 [cs.CR] UPDATED)
    Signature-based malware detectors have proven to be insufficient, as even a small change in malignant executable code can bypass them. Many machine learning-based models have been proposed to efficiently detect a wide variety of malware. Many of these models are found to be susceptible to adversarial attacks - attacks that work by generating intentionally designed inputs that can force these models to misclassify. Our work aims to explore the vulnerabilities of current state-of-the-art malware detectors to adversarial attacks. We train a Transformers-based malware detector, carry out adversarial attacks resulting in a misclassification rate of 23.9%, and propose defenses that cut this misclassification rate in half. An implementation of our work can be found at https://github.com/yashjakhotiya/Adversarial-Attacks-On-Transformers.
    Modeling Multi-Dimensional Datasets via a Fast Scale-Free Network Model. (arXiv:2211.02811v1 [cs.SI])
    Compared with network datasets, multi-dimensional data are much more common nowadays. If we can model multi-dimensional datasets into networks with accurate network properties while preserving the original dataset features, we can not only explore the dataset's dynamics but also acquire abundant synthetic network data. This paper proposes a fast scale-free network model for large-scale multi-dimensional data not limited to the network domain. The proposed network model is dynamic and able to generate scale-free graphs within linear time, regardless of the scale or field of the modeled dataset. We further argue that in a dynamic network where edge-generation probability represents influence, that influence decays as the network evolves. We demonstrate how this influence decay phenomenon is reflected in our model and provide a case study using the Global Terrorism Database.
    Discovering Closed-Loop Failures of Vision-Based Controllers via Reachability Analysis. (arXiv:2211.02736v1 [cs.RO])
    Machine learning driven image-based controllers allow robotic systems to take intelligent actions based on the visual feedback from their environment. Understanding when these controllers might lead to system safety violations is important for their integration in safety-critical applications and engineering corrective safety measures for the system. Existing methods leverage simulation-based testing (or falsification) to find the failures of vision-based controllers, i.e., the visual inputs that lead to closed-loop safety violations. However, these techniques do not scale well to the scenarios involving high-dimensional and complex visual inputs, such as RGB images. In this work, we cast the problem of finding closed-loop vision failures as a Hamilton-Jacobi (HJ) reachability problem. Our approach blends simulation-based analysis with HJ reachability methods to compute an approximation of the backward reachable tube (BRT) of the system, i.e., the set of unsafe states for the system under vision-based controllers. Utilizing the BRT, we can tractably and systematically find the system states and corresponding visual inputs that lead to closed-loop failures. These visual inputs can be subsequently analyzed to find the input characteristics that might have caused the failure. Besides its scalability to high-dimensional visual inputs, an explicit computation of BRT allows the proposed approach to capture non-trivial system failures that are difficult to expose via random simulations. We demonstrate our framework on two case studies involving an RGB image-based neural network controller for (a) autonomous indoor navigation, and (b) autonomous aircraft taxiing.
    Accurate and Reliable Methods for 5G UAV Jamming Identification With Calibrated Uncertainty. (arXiv:2211.02924v1 [cs.AI])
    Only increasing accuracy without considering uncertainty may negatively impact Deep Neural Network (DNN) decision-making and decrease its reliability. This paper proposes five combined preprocessing and post-processing methods for time-series binary classification problems that simultaneously increase the accuracy and reliability of DNN outputs, applied to a 5G UAV security dataset. These techniques use DNN outputs as input parameters and process them in different ways. Two methods use a well-known Machine Learning (ML) algorithm as a complement, and the other three use only confidence values that the DNN estimates. We compare seven metrics, namely the Expected Calibration Error (ECE), Maximum Calibration Error (MCE), Mean Confidence (MC), Mean Accuracy (MA), Normalized Negative Log Likelihood (NLL), Brier Score Loss (BSL), and Reliability Score (RS), and the tradeoffs between them to evaluate the proposed hybrid algorithms. First, we show that the eXtreme Gradient Boosting (XGB) classifier might not be reliable for binary classification under the conditions this work presents. Second, we demonstrate that at least one of the proposed methods can achieve better results than the classification in the DNN softmax layer. Finally, we show that the proposed methods may improve accuracy and reliability with better uncertainty calibration, based on the assumption that the RS captures the difference between the MC and MA metrics, and that this difference should be zero to increase reliability. For example, Method 3 presents the best RS of 0.65, even when compared to the XGB classifier, which achieves an RS of 7.22.
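    Of the calibration metrics listed, ECE is the most widely used; a minimal numpy sketch of its standard equal-width-bin form (the bin count and binning scheme are conventional choices, not taken from the paper):

        import numpy as np

        def expected_calibration_error(conf, correct, n_bins=10):
            """Weighted average gap between accuracy and mean confidence,
            computed over equal-width confidence bins."""
            bins = np.linspace(0.0, 1.0, n_bins + 1)
            ece = 0.0
            for lo, hi in zip(bins[:-1], bins[1:]):
                in_bin = (conf > lo) & (conf <= hi)
                if in_bin.any():
                    gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
                    ece += in_bin.mean() * gap
            return ece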
    Predicting User-specific Future Activities using LSTM-based Multi-label Classification. (arXiv:2211.03100v1 [cs.LG])
    User-specific future activity prediction in the healthcare domain based on previous activities can drastically improve the services provided by nurses. It is challenging because, unlike other domains, activities in healthcare involve both nurses and patients, and they also vary from hour to hour. In this paper, we employ various data processing techniques to organize and modify the data structure, and train an LSTM-based multi-label classifier with a novel 2-stage approach (user-agnostic pre-training and user-specific fine-tuning). Our experiment achieves a validation accuracy of 31.58%, precision of 57.94%, recall of 68.31%, and F1 score of 60.38%. We conclude that proper data pre-processing and a 2-stage training process result in better performance. This experiment is a part of the "Fourth Nurse Care Activity Recognition Challenge" by our team "Not A Fan of Local Minima".
    Efficient Traffic State Forecasting using Spatio-Temporal Network Dependencies: A Sparse Graph Neural Network Approach. (arXiv:2211.03033v1 [cs.LG])
    Traffic state prediction in a transportation network is paramount for effective traffic operations and management, as well as informed user and system-level decision-making. However, long-term traffic prediction (beyond 30 minutes into the future) remains challenging in current research. In this work, we integrate the spatio-temporal dependencies in the transportation network from network modeling, together with the graph convolutional network (GCN) and graph attention network (GAT). To further tackle the dramatic computation and memory cost caused by the giant model size (i.e., number of weights) of multiple cascaded layers, we propose sparse training to mitigate the training cost while preserving prediction accuracy: a process of training using a fixed number of nonzero weights in each layer in each iteration. We consider the problem of long-term traffic speed forecasting for real large-scale transportation network data from the California Department of Transportation (Caltrans) Performance Measurement System (PeMS). Experimental results show that the proposed GCN-STGT and GAT-STGT models achieve low prediction errors on short-, mid- and long-term prediction horizons of 15, 30 and 45 minutes, respectively. Using our sparse training, we can train from scratch with high sparsity (e.g., up to 90%), equivalent to a 10x reduction in floating-point operations (FLOPs), using the same number of epochs as dense training, and arrive at a model with very small accuracy loss compared with the original dense training.
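    A minimal PyTorch sketch of the fixed-nonzero-count idea; the magnitude-based mask and its per-step reapplication are one common way to realize such sparse training, and may differ from the paper's exact scheme:

        import torch

        def apply_topk_mask(weight, density=0.1):
            """Keep only the largest-magnitude entries so the layer holds
            a fixed number of nonzeros (e.g., 90% sparsity)."""
            n = weight.numel()
            k = max(1, int(density * n))
            thresh = weight.abs().flatten().kthvalue(n - k + 1).values
            mask = (weight.abs() >= thresh).float()
            with torch.no_grad():
                weight.mul_(mask)     # reapply after each optimizer step
            return mask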
    FedSL: Federated Split Learning on Distributed Sequential Data in Recurrent Neural Networks. (arXiv:2011.03180v3 [cs.LG] UPDATED)
    Federated Learning (FL) and Split Learning (SL) are privacy-preserving Machine-Learning (ML) techniques that enable training ML models over data distributed among clients without requiring direct access to their raw data. Existing FL and SL approaches work on horizontally or vertically partitioned data and cannot handle sequentially partitioned data, where segments of multiple-segment sequential data are distributed across clients. In this paper, we propose a novel federated split learning framework, FedSL, to train models on distributed sequential data. The most common ML models for sequential data are Recurrent Neural Networks (RNNs). Since the proposed framework is privacy-preserving, segments of multiple-segment sequential data cannot be shared between clients or between clients and the server. To circumvent this limitation, we propose a novel SL approach tailored for RNNs. An RNN is split into sub-networks, and each sub-network is trained on the client holding a single segment of the multiple-segment training sequences. During local training, the sub-networks on different clients communicate with each other to capture latent dependencies between consecutive segments on different clients, but without sharing raw data or complete model parameters. After training local sub-networks on local sequential data segments, all clients send their sub-networks to a federated server, where the sub-networks are aggregated to generate a global model. The experimental results on simulated and real-world datasets demonstrate that the proposed method successfully trains models on distributed sequential data while preserving privacy, and outperforms previous FL and centralized learning approaches by achieving higher accuracy in fewer communication rounds.
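    A toy PyTorch sketch of splitting a recurrent model across two clients, where only the hidden state crosses the boundary; the GRU sub-networks, dimensions, and two-client setup are illustrative assumptions:

        import torch
        import torch.nn as nn

        client_a = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
        client_b = nn.GRU(input_size=8, hidden_size=16, batch_first=True)

        seg_a = torch.randn(4, 5, 8)   # client A's segment (batch, time, features)
        seg_b = torch.randn(4, 5, 8)   # client B's segment of the same sequences

        _, h = client_a(seg_a)         # summary of A's segment; raw data stays local
        out, _ = client_b(seg_b, h)    # B continues the recurrence from A's state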
    Generative models and Bayesian inversion using Laplace approximation. (arXiv:2203.07755v2 [stat.ML] UPDATED)
    The Bayesian approach to solving inverse problems relies on the choice of a prior. This critical ingredient allows the formulation of expert knowledge or physical constraints in a probabilistic fashion and plays an important role in the success of the inference. Recently, Bayesian inverse problems were solved using generative models as highly informative priors. Generative models are a popular tool in machine learning to generate data whose properties closely resemble those of a given database. Typically, the generated distribution of data is embedded in a low-dimensional manifold. For the inverse problem, a generative model is trained on a database that reflects the properties of the sought solution, such as typical structures of the tissue in the human brain in magnetic resonance (MR) imaging. The inference is carried out in the low-dimensional manifold determined by the generative model, which strongly reduces the dimensionality of the inverse problem. However, this procedure produces a posterior that admits no Lebesgue density in the actual variables, and the accuracy reached can strongly depend on the quality of the generative model. For linear Gaussian models, we explore an alternative Bayesian inference based on probabilistic generative models which is carried out in the original high-dimensional space. A Laplace approximation is employed to analytically derive the required prior probability density function induced by the generative model. Properties of the resulting inference are investigated. Specifically, we show that derived Bayes estimates are consistent, in contrast to the approach employing the low-dimensional manifold of the generative model. The MNIST data set is used to construct numerical experiments which confirm our theoretical findings.
    Diagnostic Tool for Out-of-Sample Model Evaluation. (arXiv:2206.10982v2 [stat.ML] UPDATED)
    Assessment of model fitness is a key part of machine learning. The standard paradigm is to learn models by minimizing a chosen loss function averaged over training data, with the aim of achieving small losses on future data. In this paper, we consider the use of a finite calibration data set to characterize the future, out-of-sample losses of a model. We propose a simple model diagnostic tool that provides finite-sample guarantees under weak assumptions. The tool is simple to compute and to interpret. Several numerical experiments are presented to show how the proposed method quantifies the impact of distribution shifts, aids the analysis of regression, and enables model selection as well as hyper-parameter tuning.
    Tuning Language Models as Training Data Generators for Augmentation-Enhanced Few-Shot Learning. (arXiv:2211.03044v1 [cs.CL])
    Recent studies have revealed the intriguing few-shot learning ability of pretrained language models (PLMs): They can quickly adapt to a new task when fine-tuned on a small amount of labeled data formulated as prompts, without requiring abundant task-specific annotations. Despite their promising performance, most existing few-shot approaches that only learn from the small training set still underperform fully supervised training by nontrivial margins. In this work, we study few-shot learning with PLMs from a different perspective: We first tune an autoregressive PLM on the few-shot samples and then use it as a generator to synthesize a large amount of novel training samples which augment the original training set. To encourage the generator to produce label-discriminative samples, we train it via weighted maximum likelihood where the weight of each token is automatically adjusted based on a discriminative meta-learning objective. A classification PLM can then be fine-tuned on both the few-shot and the synthetic samples with regularization for better generalization and stability. Our approach FewGen achieves an overall better result across seven classification tasks of the GLUE benchmark than existing few-shot learning methods, improving no-augmentation methods by 5+ average points, and outperforming augmentation methods by 3+ average points.
    Grassmann Manifold Flow. (arXiv:2211.02900v1 [cs.LG])
    Recently, studies on machine learning have focused on methods that use the symmetry implicit in a specific manifold as an inductive bias. In particular, approaches using Grassmann manifolds have been found to exhibit effective performance in fields such as point cloud and image set analysis. However, there is a lack of research on the construction of general learning models to learn distributions on the Grassmann manifold. In this paper, we lay the theoretical foundations for learning distributions on the Grassmann manifold via continuous normalizing flows. Experimental results show that the proposed method can generate high-quality samples by capturing the data structure, and that it significantly outperforms state-of-the-art methods in terms of log-likelihood or evidence lower bound. These results are expected to usher in further research in this field of study.
    Improved Techniques for the Conditional Generative Augmentation of Clinical Audio Data. (arXiv:2211.02874v1 [cs.LG])
    Data augmentation is a valuable tool for the design of deep learning systems to overcome data limitations and stabilize the training process. Especially in the medical domain, where the collection of large-scale data sets is challenging and expensive due to limited access to patient data, relevant environments, and strict regulations, community-curated large-scale public datasets, pretrained models, and advanced data augmentation methods are the main factors for developing reliable systems to improve patient care. However, for the development of medical acoustic sensing systems, an emerging field of research, the community lacks large-scale publicly available data sets and pretrained models. To address the problem of limited data, we propose a conditional generative adversarial neural network-based augmentation method which is able to synthesize mel spectrograms from a learned data distribution of a source data set. In contrast to previously proposed fully convolutional models, the proposed model implements residual Squeeze and Excitation modules in the generator architecture. We show that our method outperforms all classical audio augmentation techniques and previously published generative methods in terms of generated sample quality, and yields a 2.84% improvement in Macro F1-score for a classifier trained on the augmented data set, an enhancement of 1.14% over previous work. By analyzing the correlation of intermediate feature spaces, we show that the residual Squeeze and Excitation modules help the model to reduce redundancy in the latent features. Therefore, the proposed model advances the state of the art in the augmentation of clinical audio data and improves the data bottleneck for the design of clinical acoustic sensing systems.
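    The residual Squeeze-and-Excitation building block is standard; a minimal PyTorch sketch (the reduction ratio and residual wiring follow common practice and are assumptions, since the paper's generator details are not reproduced here):

        import torch
        import torch.nn as nn

        class ResidualSqueezeExcite(nn.Module):
            """Channel attention: global-average-pool, bottleneck MLP,
            sigmoid gate, then rescale channels with a residual path."""
            def __init__(self, channels, reduction=16):
                super().__init__()
                self.fc = nn.Sequential(
                    nn.Linear(channels, channels // reduction), nn.ReLU(),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid())

            def forward(self, x):                       # x: (B, C, H, W)
                scale = self.fc(x.mean(dim=(2, 3)))     # squeeze spatial dims
                return x + x * scale[:, :, None, None]  # excite + residual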
    Pitfalls of Climate Network Construction: A Statistical Perspective. (arXiv:2211.02888v1 [cs.LG])
    Network-based analyses of dynamical systems have become increasingly popular in climate science. Here we address network construction from a statistical perspective and highlight the often ignored fact that the calculated correlation values are only empirical estimates. To measure spurious behaviour as deviation from a ground truth network, we simulate time-dependent isotropic random fields on the sphere and apply common network construction techniques. We find several ways in which the uncertainty stemming from the estimation procedure has a major impact on network characteristics. When the data has locally coherent correlation structure, spurious link bundle teleconnections and spurious high-degree clusters have to be expected. Anisotropic estimation variance can also induce severe biases into empirical networks. We validate our findings with ERA5 reanalysis data. Moreover, we explain why commonly applied resampling procedures are inappropriate for significance evaluation and propose a statistically more meaningful ensemble construction framework. By communicating which difficulties arise in estimation from scarce data and by presenting which design decisions increase robustness, we hope to contribute to more reliable climate network construction in the future.
    On learning history based policies for controlling Markov decision processes. (arXiv:2211.03011v1 [cs.LG])
    Reinforcement learning (RL) folklore suggests that history-based function approximation methods, such as recurrent neural nets or history-based state abstraction, perform better than their memory-less counterparts, due to the fact that function approximation in Markov decision processes (MDPs) can be viewed as inducing a partially observable MDP. However, there has been little formal analysis of such history-based algorithms, as most existing frameworks focus exclusively on memory-less features. In this paper, we introduce a theoretical framework for studying the behaviour of RL algorithms that learn to control an MDP using history-based feature abstraction mappings. Furthermore, we use this framework to design a practical RL algorithm and we numerically evaluate its effectiveness on a set of continuous control tasks.
    Projection Robust Wasserstein Distance and Riemannian Optimization. (arXiv:2006.07458v9 [cs.LG] UPDATED)
    Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high dimensions. However, it has been ruled out for practical application because the optimization model is essentially non-convex and non-smooth, which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and non-smoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantees on their complexity bounds (one in the appendix), and demonstrate their effectiveness and efficiency by conducting extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.
    Contrastive Weighted Learning for Near-Infrared Gaze Estimation. (arXiv:2211.03073v1 [cs.CV])
    Appearance-based gaze estimation has been very successful with the use of deep learning, and many subsequent works have improved domain generalization for gaze estimation. However, most of this recent work has focused on cross-dataset performance -- accounting for different distributions of illumination, head pose, and lighting. Although improving gaze estimation across different distributions of RGB images is important, near-infrared image based gaze estimation is also critical for gaze estimation in dark settings. There are also inherent limitations in relying solely on supervised learning for regression tasks. This paper contributes to solving these problems and proposes GazeCWL, a novel framework for gaze estimation with near-infrared images using contrastive learning. GazeCWL leverages adversarial attack techniques for data augmentation and a novel contrastive loss function, specifically for regression tasks, that effectively clusters the features of different samples in the latent space. Our model outperforms previous domain generalization models in infrared image based gaze estimation, beating the baseline by 45.6% and improving on the state of the art by 8.6%, demonstrating the efficacy of our method.
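    One plausible form of a contrastive loss weighted for regression, where pair weights decay with the distance between gaze labels; this is an illustrative PyTorch sketch, not the exact GazeCWL objective:

        import torch
        import torch.nn.functional as F

        def weighted_contrastive_loss(z, y, temperature=0.1):
            """z: (B, d) embeddings; y: (B, 2) gaze angles. Pairs with close
            labels act as soft positives via exponentially decaying weights."""
            z = F.normalize(z, dim=1)
            sim = z @ z.t() / temperature               # cosine similarities
            w = torch.exp(-torch.cdist(y, y))           # label-distance weights
            w.fill_diagonal_(0)
            log_p = sim - torch.logsumexp(sim, dim=1, keepdim=True)
            return -(w * log_p).sum() / w.sum()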
    On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems. (arXiv:2107.08225v3 [math.OC] UPDATED)
    We introduce a class of first-order methods for smooth constrained optimization that are based on an analogy to non-smooth dynamical systems. Two distinctive features of our approach are that (i) projections or optimizations over the entire feasible set are avoided, in stark contrast to projected gradient methods or the Frank-Wolfe method, and (ii) iterates are allowed to become infeasible, which differs from active set or feasible direction methods, where the descent motion stops as soon as a new constraint is encountered. The resulting algorithmic procedure is simple to implement even when constraints are nonlinear, and is suitable for large-scale constrained optimization problems in which the feasible set fails to have a simple structure. The key underlying idea is that constraints are expressed in terms of velocities instead of positions, which has the algorithmic consequence that optimizations over feasible sets at each iteration are replaced with optimizations over local, sparse convex approximations. In particular, this means that at each iteration only constraints that are violated are taken into account. The result is a simplified suite of algorithms and an expanded range of possible applications in machine learning.
    ReLU Neural Networks of Polynomial Size for Exact Maximum Flow Computation. (arXiv:2102.06635v4 [cs.LG] UPDATED)
    This paper studies the expressive power of artificial neural networks with rectified linear units. In order to study them as a model of real-valued computation, we introduce the concept of Max-Affine Arithmetic Programs and show equivalence between them and neural networks concerning natural complexity measures. We then use this result to show that two fundamental combinatorial optimization problems can be solved with polynomial-size neural networks. First, we show that for any undirected graph with $n$ nodes, there is a neural network (with fixed weights and biases) of size $\mathcal{O}(n^3)$ that takes the edge weights as input and computes the value of a minimum spanning tree of the graph. Second, we show that for any directed graph with $n$ nodes and $m$ arcs, there is a neural network of size $\mathcal{O}(m^2n^2)$ that takes the arc capacities as input and computes a maximum flow. Our results imply that these two problems can be solved with strongly polynomial time algorithms that solely use affine transformations and maxima computations, but no comparison-based branchings.  ( 2 min )
    Unlearning Nonlinear Graph Classifiers in the Limited Training Data Regime. (arXiv:2211.03216v1 [cs.LG])
    As the demand for user privacy grows, controlled data removal (machine unlearning) is becoming an important feature of machine learning models for data-sensitive Web applications such as social networks and recommender systems. Nevertheless, at this point it is still largely unknown how to perform efficient machine unlearning of graph neural networks (GNNs); this is especially the case when the number of training samples is small, since unlearning can then seriously compromise the performance of the model. To address this issue, we initiate the study of unlearning the Graph Scattering Transform (GST), a mathematical framework that is efficient, provably stable under feature or graph topology perturbations, and offers graph classification performance comparable to that of GNNs. Our main contribution is the first known nonlinear approximate graph unlearning method based on GSTs. Our second contribution is a theoretical analysis of the computational complexity of the proposed unlearning mechanism, which is hard to replicate for deep neural networks. Our third contribution is extensive simulation results which show that, compared to complete retraining of GNNs after each removal request, the new GST-based approach offers, on average, a $10.38$x speed-up and leads to a $2.6$% increase in test accuracy during unlearning of $90$ out of $100$ training graphs from the IMDB dataset ($10$% training ratio).
    Towards a methodology for addressing missingness in datasets, with an application to demographic health datasets. (arXiv:2211.02856v1 [cs.LG])
    Missing data is a common concern in health datasets, and its impact on good decision-making processes is well documented. Our study's contribution is a methodology for tackling missing data problems using a combination of synthetic dataset generation, missing data imputation and deep learning methods. Specifically, we conducted a series of experiments with these objectives: $a)$ generating a realistic synthetic dataset, $b)$ simulating data missingness, $c)$ recovering the missing data, and $d)$ analyzing imputation performance. Our methodology used a Gaussian mixture model, whose parameters were learned from a cleaned subset of a real demographic and health dataset, to generate the synthetic data. We simulated missingness degrees of 10%, 20%, 30%, and 40% under the missing completely at random (MCAR) scheme. We used an integrated performance analysis framework involving clustering, classification and direct imputation analysis. Our results show that models trained on synthetic and imputed datasets could make predictions with an accuracy of 83% and 80% on $a)$ an unseen real dataset and $b)$ an unseen reserved synthetic test dataset, respectively. Moreover, the models that used the DAE method for imputation yielded the lowest log loss, an indication of good performance, even though the accuracy measures were slightly lower. In conclusion, our work demonstrates that, using our methodology, one can reverse engineer a solution to resolve missingness on an unseen dataset. Moreover, though we used a health dataset, our methodology can be utilized in other contexts.
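    Simulating MCAR missingness is straightforward; a minimal numpy sketch (the array shape and rates are illustrative):

        import numpy as np

        def simulate_mcar(data, missing_rate, seed=0):
            """Blank out entries completely at random at the given rate."""
            rng = np.random.default_rng(seed)
            mask = rng.random(data.shape) < missing_rate
            corrupted = data.astype(float).copy()
            corrupted[mask] = np.nan
            return corrupted, mask

        X = np.random.default_rng(1).normal(size=(100, 5))
        versions = {r: simulate_mcar(X, r) for r in (0.1, 0.2, 0.3, 0.4)}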
    Ensemble Conformalized Quantile Regression for Probabilistic Time Series Forecasting. (arXiv:2202.08756v2 [cs.LG] UPDATED)
    This paper presents a novel probabilistic forecasting method called ensemble conformalized quantile regression (EnCQR). EnCQR constructs distribution-free and approximately marginally valid prediction intervals (PIs), which are suitable for nonstationary and heteroscedastic time series data. EnCQR can be applied on top of a generic forecasting model, including deep learning architectures. EnCQR exploits a bootstrap ensemble estimator, which enables the use of conformal predictors for time series by removing the requirement of data exchangeability. The ensemble learners are implemented as generic machine learning algorithms performing quantile regression, which allow the length of the PIs to adapt to local variability in the data. In the experiments, we predict time series characterized by a different amount of heteroscedasticity. The results demonstrate that EnCQR outperforms models based only on quantile regression or conformal prediction, and it provides sharper, more informative, and valid PIs.  ( 2 min )
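    A split-conformal numpy sketch of the interval-widening step; EnCQR's bootstrap ensemble (which removes the exchangeability requirement for time series) is omitted here, so this shows only the generic conformalized-quantile-regression adjustment:

        import numpy as np

        def conformalize(lo_cal, hi_cal, y_cal, lo_test, hi_test, alpha=0.1):
            """Widen quantile-regression intervals by the (1 - alpha)
            quantile of calibration conformity scores."""
            scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
            n = len(y_cal)
            level = min(1.0, np.ceil((1 - alpha) * (n + 1)) / n)
            q = np.quantile(scores, level)
            return lo_test - q, hi_test + q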
    Clustering above Exponential Families with Tempered Exponential Measures. (arXiv:2211.02765v1 [cs.LG])
    The link with exponential families has allowed $k$-means clustering to be generalized to a wide variety of data generating distributions in exponential families and clustering distortions among Bregman divergences. Getting the framework to work above exponential families is important to lift roadblocks like the lack of robustness of some population minimizers carved in their axiomatization. Current generalisations of exponential families like $q$-exponential families or even deformed exponential families fail at achieving the goal. In this paper, we provide a new attempt at getting the complete framework, grounded in a new generalisation of exponential families that we introduce, tempered exponential measures (TEM). TEMs keep the maximum entropy axiomatization framework of $q$-exponential families, but instead of normalizing the measure, normalize a dual called a co-distribution. Numerous interesting properties arise for clustering such as improved and controllable robustness for population minimizers, that keep a simple analytic form.  ( 2 min )
    Machine Learning Workflow to Explain Black-box Models for Early Alzheimer's Disease Classification Evaluated for Multiple Datasets. (arXiv:2205.05907v2 [cs.LG] UPDATED)
    Purpose: Hard-to-interpret black-box Machine Learning (ML) models are often used for early Alzheimer's Disease (AD) detection. Methods: To interpret eXtreme Gradient Boosting (XGBoost), Random Forest (RF), and Support Vector Machine (SVM) black-box models, a workflow based on Shapley values was developed. All models were trained on the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset and evaluated on an independent ADNI test set, as well as on the external Australian Imaging and Lifestyle flagship study of Ageing (AIBL) and Open Access Series of Imaging Studies (OASIS) datasets. Shapley values were compared to intuitively interpretable Decision Trees (DTs) and Logistic Regression (LR), as well as natural and permutation feature importances. To avoid the reduction of explanation validity caused by correlated features, forward selection and aspect consolidation were implemented. Results: Some black-box models outperformed DTs and LR. The forward-selected features correspond to brain areas previously associated with AD. Shapley values identified biologically plausible associations with moderate to strong correlations with feature importances. The most important RF features for predicting AD conversion were the volume of the amygdalae and a cognitive test score. Good cognitive test performance and large brain volumes decreased the AD risk. The models trained using cognitive test scores significantly outperformed brain volumetric models ($p<0.05$). Cognitive Normal (CN) vs. AD models were successfully transferred to external datasets. Conclusion: In comparison to previous work, improved performance on ADNI and AIBL was achieved for CN vs. Mild Cognitive Impairment (MCI) classification using brain volumes. The Shapley values and the feature importances showed moderate to strong correlations.
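    A generic sketch of the Shapley-value step with the shap library, on synthetic stand-in data; exact API details vary across shap versions, and the paper's full workflow (forward selection, aspect consolidation) is not shown:

        import numpy as np
        import shap
        from sklearn.ensemble import RandomForestClassifier

        # hypothetical stand-in features: brain volumes and cognitive scores
        X = np.random.rand(200, 10)
        y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # converter vs. stable

        model = RandomForestClassifier(n_estimators=100).fit(X, y)
        explainer = shap.TreeExplainer(model)       # fast for tree ensembles
        shap_values = explainer.shap_values(X)      # per-sample attributions
        shap.summary_plot(shap_values, X)           # global importance view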
    TENSILE: A Tensor granularity dynamic GPU memory scheduling method toward multiple dynamic workloads system. (arXiv:2105.13336v5 [cs.DC] UPDATED)
    Recently, deep learning has been an area of intense research. However, as a kind of computing-intensive task, deep learning relies heavily on the scale of GPU memory, which is usually prohibitive and scarce. Although some extensive works have been proposed for dynamic GPU memory management, they are hard to apply to systems with multiple dynamic workloads, such as in-database machine learning systems. In this paper, we present TENSILE, a method of managing GPU memory at tensor granularity to reduce the GPU memory peak, taking multiple dynamic workloads into account. TENSILE tackles the cold-starting and across-iteration scheduling problems present in previous works. We implemented TENSILE on our own deep learning framework and evaluated its performance. The experiment results show that TENSILE can save more GPU memory with less extra overhead than prior works in both single and multiple dynamic workload scenarios.  ( 2 min )
    A Comprehensive Survey of Regression Based Loss Functions for Time Series Forecasting. (arXiv:2211.02989v1 [cs.LG])
    Time Series Forecasting has been an active area of research due to its many applications, ranging from network usage prediction, resource allocation, and anomaly detection to predictive maintenance. Numerous publications in the last five years have proposed diverse sets of objective loss functions to address cases such as biased data, long-term forecasting, multicollinear features, etc. In this paper, we summarize 14 well-known regression loss functions commonly used for time series forecasting and list the circumstances where their application can aid in faster and better model convergence. We also demonstrate how certain categories of loss functions perform well across all data sets and can be considered as a baseline objective function in circumstances where the distribution of the data is unknown. Our code is available at GitHub: https://github.com/aryan-jadon/Regression-Loss-Functions-in-Time-Series-Forecasting-Tensorflow.  ( 2 min )
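    Two representatives of the surveyed loss families, as minimal numpy sketches (the survey covers 14; these two are chosen here purely for illustration):

        import numpy as np

        def huber(y_true, y_pred, delta=1.0):
            """Quadratic near zero, linear in the tails: robust to outliers."""
            err = np.abs(y_true - y_pred)
            return np.mean(np.where(err <= delta,
                                    0.5 * err ** 2,
                                    delta * (err - 0.5 * delta)))

        def pinball(y_true, y_pred, q=0.9):
            """Quantile (pinball) loss: asymmetric penalty for forecasting
            the q-th conditional quantile."""
            err = y_true - y_pred
            return np.mean(np.maximum(q * err, (q - 1) * err))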
    Diversity-based Deep Reinforcement Learning Towards Multidimensional Difficulty for Fighting Game AI. (arXiv:2211.02759v1 [cs.LG])
    In fighting games, individual players of the same skill level often exhibit distinct strategies from one another through their gameplay. Despite this, the majority of AI agents for fighting games have only a single strategy for each "level" of difficulty. To make AI opponents more human-like, we'd ideally like to see multiple different strategies at each level of difficulty, a concept we refer to as "multidimensional" difficulty. In this paper, we introduce a diversity-based deep reinforcement learning approach for generating a set of agents of similar difficulty that utilize diverse strategies. We find this approach outperforms a baseline trained with specialized, human-authored reward functions in both diversity and performance.
    Going In Style: Audio Backdoors Through Stylistic Transformations. (arXiv:2211.03117v1 [cs.CR])
    A backdoor attack places triggers in victims' deep learning models to enable a targeted misclassification at testing time. In general, triggers are fixed artifacts attached to samples, making backdoor attacks easy to spot. Only recently, a new trigger generation harder to detect has been proposed: the stylistic triggers that apply stylistic transformations to the input samples (e.g., a specific writing style). Currently, stylistic backdoor literature lacks a proper formalization of the attack, which is established in this paper. Moreover, most studies of stylistic triggers focus on text and images, while there is no understanding of whether they can work in sound. This work fills this gap. We propose JingleBack, the first stylistic backdoor attack based on audio transformations such as chorus and gain. Using 444 models in a speech classification task, we confirm the feasibility of stylistic triggers in audio, achieving 96% attack success.  ( 2 min )
    Shape Modeling with Spline Partitions. (arXiv:2108.02507v2 [stat.ML] UPDATED)
    Shape modelling (with methods that output shapes) is a new and important task in Bayesian nonparametrics and bioinformatics. In this work, we focus on Bayesian nonparametric methods for capturing shapes by partitioning a space using curves. In related work, the classical Mondrian process is used to partition spaces recursively with axis-aligned cuts and is widely applied to multi-dimensional and relational data. The Mondrian process outputs hyper-rectangles. Recently, the random tessellation process was introduced as a generalization of the Mondrian process, partitioning a domain with non-axis-aligned cuts in an arbitrary dimensional space and outputting polytopes. Motivated by these processes, in this work we propose a novel parallelized Bayesian nonparametric approach to partition a domain with curves, enabling complex data-shapes to be acquired. We apply our method to an HIV-1-infected human macrophage image dataset and to simulated datasets to illustrate our approach. We compare to support vector machines, random forests and state-of-the-art computer vision methods such as simple linear iterative clustering super pixel image segmentation. We develop an R package that is available at \url{https://github.com/ShufeiGe/Shape-Modeling-with-Spline-Partitions}.  ( 2 min )
    Textual Manifold-based Defense Against Natural Language Adversarial Examples. (arXiv:2211.02878v1 [cs.CL])
    Recent studies on adversarial images have shown that they tend to leave the underlying low-dimensional data manifold, making them significantly more challenging for current models to make correct predictions. This so-called off-manifold conjecture has inspired a novel line of defenses against adversarial attacks on images. In this study, we find a similar phenomenon occurs in the contextualized embedding space induced by pretrained language models, in which adversarial texts tend to have their embeddings diverge from the manifold of natural ones. Based on this finding, we propose Textual Manifold-based Defense (TMD), a defense mechanism that projects text embeddings onto an approximated embedding manifold before classification. It reduces the complexity of potential adversarial examples, which ultimately enhances the robustness of the protected model. Through extensive experiments, our method consistently and significantly outperforms previous defenses under various attack settings without trading off clean accuracy. To the best of our knowledge, this is the first NLP defense that leverages the manifold structure against adversarial attacks. Our code is available at \url{https://github.com/dangne/tmd}.  ( 2 min )
    A Comparison of Automatic Labelling Approaches for Sentiment Analysis. (arXiv:2211.02976v1 [cs.CL])
    Labelling a large quantity of social media data for the task of supervised machine learning is not only time-consuming but also difficult and expensive. On the other hand, the accuracy of supervised machine learning models is strongly related to the quality of the labelled data on which they train, and automatic sentiment labelling techniques could reduce the time and cost of human labelling. We have compared three automatic sentiment labelling techniques: TextBlob, Vader, and Afinn to assign sentiments to tweets without any human assistance. We compare three scenarios: one uses training and testing datasets with existing ground truth labels; the second experiment uses automatic labels as training and testing datasets; and the third experiment uses three automatic labelling techniques to label the training dataset and uses the ground truth labels for testing. The experiments were evaluated on two Twitter datasets: SemEval-2013 (DS-1) and SemEval-2016 (DS-2). Results show that the Afinn labelling technique obtains the highest accuracy of 80.17% (DS-1) and 80.05% (DS-2) using a BiLSTM deep learning model. These findings imply that automatic text labelling could provide significant benefits, and suggest a feasible alternative to the time and cost of human labelling efforts.  ( 2 min )
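    The three labelling tools compared above have simple Python APIs; a minimal sketch (the thresholds for mapping scores to discrete labels are a user choice and are not shown):

        from textblob import TextBlob
        from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
        from afinn import Afinn

        def auto_scores(tweet):
            """Polarity scores from the three automatic labelling tools."""
            return {
                "textblob": TextBlob(tweet).sentiment.polarity,       # [-1, 1]
                "vader": SentimentIntensityAnalyzer()
                             .polarity_scores(tweet)["compound"],     # [-1, 1]
                "afinn": Afinn().score(tweet),                        # unbounded
            }

        print(auto_scores("I love this new phone, but the battery is awful."))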
    PASTA: Table-Operations Aware Fact Verification via Sentence-Table Cloze Pre-training. (arXiv:2211.02816v1 [cs.CL])
    Fact verification has attracted a lot of research attention recently, e.g., in journalism, marketing, and policymaking, as misinformation and disinformation online can sway one's opinion and affect one's actions. While fact-checking is a hard task in general, in many cases, false statements can be easily debunked based on analytics over tables with reliable information. Hence, table-based fact verification has recently emerged as an important and growing research area. Yet, progress has been limited due to the lack of datasets that can be used to pre-train language models (LMs) to be aware of common table operations, such as aggregating a column or comparing tuples. To bridge this gap, in this paper we introduce PASTA, a novel state-of-the-art framework for table-based fact verification via pre-training with synthesized sentence-table cloze questions. In particular, we design six types of common sentence-table cloze tasks, including Filter, Aggregation, Superlative, Comparative, Ordinal, and Unique, based on which we synthesize a large corpus consisting of 1.2 million sentence-table pairs from WikiTables. PASTA uses a recent pre-trained LM, DeBERTaV3, and further pretrains it on our corpus. Our experimental results show that PASTA achieves new state-of-the-art performance on two table-based fact verification benchmarks: TabFact and SEM-TAB-FACTS. In particular, on the complex set of TabFact, which contains multiple operations, PASTA largely outperforms the previous state of the art by 4.7 points (85.6% vs. 80.9%), and the gap between PASTA and human performance on the small TabFact test set is narrowed to just 1.5 points (90.6% vs. 92.1%).  ( 3 min )
    Inductive Graph Transformer for Delivery Time Estimation. (arXiv:2211.02863v1 [cs.LG])
    Providing an accurate estimated time of package delivery on users' purchasing pages is of great importance to e-commerce platforms, affecting purchasing decisions and post-purchase experiences. Although this problem shares some common issues with the conventional estimated time of arrival (ETA), it is more challenging in the following respects: 1) Inductive inference. Models are required to predict ETA for orders with unseen retailers and addresses; 2) High-order interaction of order semantic information. Apart from the spatio-temporal features, the estimated time also varies greatly with other factors, such as the packaging efficiency of retailers, as well as the high-order interaction of these factors. In this paper, we propose an inductive graph transformer (IGT) that leverages raw feature information and structural graph data to estimate package delivery time. Different from previous graph transformer architectures, IGT adopts a decoupled pipeline and trains a transformer as a regression function that can capture the multiplex information from both raw features and dense embeddings encoded by a graph neural network (GNN). In addition, we further simplify the GNN structure by removing its non-linear activation and the learnable linear transformation matrix. The reduced parameter search space and linear information propagation in the simplified GNN enable the IGT to be applied in large-scale industrial scenarios. Experiments on real-world logistics datasets show that our proposed model can significantly outperform state-of-the-art methods on delivery time estimation. The source code is available at: https://github.com/enoche/IGT-WSDM23.  ( 3 min )
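    With the activation and learned transform removed, the GNN reduces to linear propagation; a numpy sketch in the spirit of that simplification (the normalization choice and hop count are assumptions):

        import numpy as np

        def simplified_gnn_embed(A, X, k=2):
            """Propagate features k hops with no activations or learned
            weights: H = (D^-1/2 (A + I) D^-1/2)^k X."""
            A_hat = A + np.eye(A.shape[0])        # add self-loops
            d = A_hat.sum(axis=1)
            A_norm = A_hat / np.sqrt(np.outer(d, d))
            H = X
            for _ in range(k):
                H = A_norm @ H
            return H                              # dense embeddings for the transformer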
    Gauge Equivariant Neural Networks for 2+1D U(1) Gauge Theory Simulations in Hamiltonian Formulation. (arXiv:2211.03198v1 [hep-lat])
    Gauge Theory plays a crucial role in many areas in science, including high energy physics, condensed matter physics and quantum information science. In quantum simulations of lattice gauge theory, an important step is to construct a wave function that obeys gauge symmetry. In this paper, we have developed gauge equivariant neural network wave function techniques for simulating continuous-variable quantum lattice gauge theories in the Hamiltonian formulation. We have applied the gauge equivariant neural network approach to find the ground state of 2+1-dimensional lattice gauge theory with U(1) gauge group using variational Monte Carlo. We have benchmarked our approach against the state-of-the-art complex Gaussian wave functions, demonstrating improved performance in the strong coupling regime and comparable results in the weak coupling regime.  ( 2 min )
    Safe Exploration Incurs Nearly No Additional Sample Complexity for Reward-free RL. (arXiv:2206.14057v2 [cs.LG] UPDATED)
    Reward-free reinforcement learning (RF-RL), a recently introduced RL paradigm, relies on random action-taking to explore the unknown environment without any reward feedback information. While the primary goal of the exploration phase in RF-RL is to reduce the uncertainty in the estimated model with minimum number of trajectories, in practice, the agent often needs to abide by certain safety constraint at the same time. It remains unclear how such safe exploration requirement would affect the corresponding sample complexity in order to achieve the desired optimality of the obtained policy in planning. In this work, we make a first attempt to answer this question. In particular, we consider the scenario where a safe baseline policy is known beforehand, and propose a unified Safe reWard-frEe ExploraTion (SWEET) framework. We then particularize the SWEET framework to the tabular and the low-rank MDP settings, and develop algorithms coined Tabular-SWEET and Low-rank-SWEET, respectively. Both algorithms leverage the concavity and continuity of the newly introduced truncated value functions, and are guaranteed to achieve zero constraint violation during exploration with high probability. Furthermore, both algorithms can provably find a near-optimal policy subject to any constraint in the planning phase. Remarkably, the sample complexities under both algorithms match or even outperform the state of the art in their constraint-free counterparts up to some constant factors, proving that safety constraint hardly increases the sample complexity for RF-RL.  ( 3 min )
    Physics Informed Machine Learning for Chemistry Tabulation. (arXiv:2211.03022v1 [cs.LG])
    Modeling a turbulent combustion system requires modeling the underlying chemistry and the turbulent flow. Solving both systems simultaneously is computationally prohibitive. Instead, given the difference in scales at which the two sub-systems evolve, they are typically (re)solved separately. Popular approaches such as Flamelet Generated Manifolds (FGM) use a two-step strategy where the governing reaction kinetics are pre-computed and mapped to a low-dimensional manifold, characterized by a few reaction progress variables (model reduction), and the manifold is then "looked up" at runtime by the flow system to estimate the high-dimensional system state. While existing works have treated these two steps independently, in this work we show that joint learning of the progress variables and the look-up model can yield more accurate results. We build on the base formulation and implementation of ChemTab to include dynamically generated Thermochemical State Variables (Lower Dimensional Dynamic Source Terms). We discuss the challenges in the implementation of this deep neural network architecture and experimentally demonstrate its superior performance.  ( 2 min )
    Enabling Deep Learning-based Physical-layer Secret Key Generation for FDD-OFDM Systems in Multi-Environments. (arXiv:2211.03065v1 [cs.IT])
    Deep learning-based physical-layer secret key generation (PKG) has been used to overcome the imperfect uplink/downlink channel reciprocity in frequency division duplexing (FDD) orthogonal frequency division multiplexing (OFDM) systems. However, existing efforts have focused on key generation for users in a specific environment, where the training and test samples obey the same distribution, which is unrealistic for real-world applications. This paper formulates PKG in multiple environments as a learning-based problem: knowledge such as data and models is learned from known environments in order to generate keys quickly and efficiently in multiple new environments. Specifically, we propose deep transfer learning (DTL) and meta-learning-based channel feature mapping algorithms for key generation. The two algorithms use different training methods to pre-train the model in the known environments, and then quickly adapt and deploy the model to new environments. Simulation results show that, compared with methods without adaptation, the DTL and meta-learning algorithms both improve the performance of the generated keys. In addition, the complexity analysis shows that the meta-learning algorithm can achieve better performance than the DTL algorithm with less time and lower CPU and GPU resources.  ( 2 min )
    Confidence-Ranked Reconstruction of Census Microdata from Published Statistics. (arXiv:2211.03128v1 [cs.CY])
    A reconstruction attack on a private dataset $D$ takes as input some publicly accessible information about the dataset and produces a list of candidate elements of $D$. We introduce a new class of data reconstruction attacks based on randomized methods for non-convex optimization. We empirically demonstrate that our attacks can not only reconstruct full rows of $D$ from aggregate query statistics $Q(D)\in \mathbb{R}^m$, but can do so in a way that reliably ranks reconstructed rows by their odds of appearing in the private data, providing a signature that could be used for prioritizing reconstructed rows for further actions such as identity theft or hate crime. We also design a sequence of baselines for evaluating reconstruction attacks. Our attacks significantly outperform those that are based only on access to a public distribution or population from which the private dataset $D$ was sampled, demonstrating that they are exploiting information in the aggregate statistics $Q(D)$, and not simply the overall structure of the distribution. In other words, the queries $Q(D)$ are permitting reconstruction of elements of this dataset, not the distribution from which $D$ was drawn. These findings are established both on 2010 U.S. decennial Census data and queries and Census-derived American Community Survey datasets. Taken together, our methods and experiments illustrate the risks in releasing numerically precise aggregate statistics of a large dataset, and provide further motivation for the careful application of provably private techniques such as differential privacy.
    Wind Power Forecasting Considering Data Privacy Protection: A Federated Deep Reinforcement Learning Approach. (arXiv:2211.02674v1 [cs.LG])
    In a modern power system with an increasing proportion of renewable energy, wind power prediction is crucial to the arrangement of power grid dispatching plans due to the volatility of wind power. However, traditional centralized forecasting methods raise concerns regarding data privacy and the data-islands problem. To handle data privacy and openness, we propose a forecasting scheme that combines federated learning and deep reinforcement learning (DRL) for ultra-short-term wind power forecasting, called federated deep reinforcement learning (FedDRL). First, this paper uses the deep deterministic policy gradient (DDPG) algorithm as the basic forecasting model to improve prediction accuracy. Second, we integrate the DDPG forecasting model into the framework of federated learning. The designed FedDRL can obtain an accurate prediction model in a decentralized way by sharing model parameters instead of sharing private data, which avoids sensitive privacy issues. The simulation results show that the proposed FedDRL outperforms traditional prediction methods in terms of forecasting accuracy. More importantly, while ensuring forecasting performance, FedDRL can effectively protect data privacy and relieve communication pressure compared with the traditional centralized forecasting method. In addition, a simulation with different federated learning parameters is conducted to confirm the robustness of the proposed scheme.  ( 2 min )
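    Parameter sharing in such a scheme is typically FedAvg-style; a minimal PyTorch sketch of the server-side aggregation (the paper's exact aggregation rule is not reproduced here, so this is an assumption):

        import torch

        def federated_average(state_dicts):
            """Average each parameter tensor across clients' locally trained
            DDPG models; raw wind data never leaves the clients."""
            avg = {}
            for name in state_dicts[0]:
                avg[name] = torch.stack(
                    [sd[name].float() for sd in state_dicts]).mean(dim=0)
            return avg

        # global_model.load_state_dict(federated_average(client_state_dicts))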
    Efficient Cavity Searching for Gene Network of Influenza A Virus. (arXiv:2211.02935v1 [q-bio.GN])
    High order structures (cavities and cliques) of the gene network of influenza A virus reveal tight associations among viruses during evolution and are key signals that indicate viral cross-species infection and cause pandemics. As indicators for sensing the dynamic changes of viral genes, these higher order structures have been the focus of attention in the field of virology. However, the size of the viral gene network is usually huge, and searching for these structures in the networks introduces unacceptable delay. To mitigate this issue, in this paper, we propose a simple-yet-effective deep learning model named HyperSearch to search for cavities in a computable complex network for influenza virus genetics. Extensive experiments conducted on a public influenza virus dataset demonstrate the effectiveness of HyperSearch over other advanced deep-learning methods without any elaborate model crafting. Moreover, HyperSearch finishes the search in minutes while 0-1 programming takes days. Since the proposed method is simple and easy to transfer to other complex networks, HyperSearch has the potential to facilitate the monitoring of dynamic changes in viral genes and help humans keep up with the pace of virus mutations.  ( 2 min )
    SAMO: Speaker Attractor Multi-Center One-Class Learning for Voice Anti-Spoofing. (arXiv:2211.02718v1 [eess.AS])
    Voice anti-spoofing systems are crucial auxiliaries for automatic speaker verification (ASV) systems. A major challenge is caused by unseen attacks empowered by advanced speech synthesis technologies. Our previous research on one-class learning has improved the generalization ability to unseen attacks by compacting the bona fide speech in the embedding space. However, such compactness lacks consideration of the diversity of speakers. In this work, we propose speaker attractor multi-center one-class learning (SAMO), which clusters bona fide speech around a number of speaker attractors and pushes away spoofing attacks from all the attractors in a high-dimensional embedding space. For training, we propose an algorithm for the co-optimization of bona fide speech clustering and bona fide/spoof classification. For inference, we propose strategies to enable anti-spoofing for speakers without enrollment. Our proposed system outperforms existing state-of-the-art single systems with a relative improvement of 38% on equal error rate (EER) on the ASVspoof2019 LA evaluation set.
    Deconfounded Imitation Learning. (arXiv:2211.02667v1 [cs.LG])
    Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent. This is because partial observability gives rise to hidden confounders in the causal graph. We break down the space of confounded imitation learning problems and identify three settings with different data requirements in which the correct imitation policy can be identified. We then introduce an algorithm for deconfounded imitation learning, which trains an inference model jointly with a latent-conditional policy. At test time, the agent alternates between updating its belief over the latent and acting under the belief. We show in theory and practice that this algorithm converges to the correct interventional policy, solves the confounding issue, and can under certain assumptions achieve an asymptotically optimal imitation performance.  ( 2 min )
    Towards Efficient ECG-based Atrial Fibrillation Detection via Parameterised Hypercomplex Neural Networks. (arXiv:2211.02678v1 [eess.SP])
    Atrial fibrillation (AF) is the most common cardiac arrhythmia and is associated with a higher risk of serious conditions such as stroke. Long-term ECG recording with wearable devices that embed automatic and timely evaluation of AF helps to avoid life-threatening situations. However, the use of a deep neural network for auto-analysis of ECG on wearable devices is limited by its complexity. In this work, we propose lightweight convolutional neural networks (CNNs) for AF detection inspired by the recently proposed parameterised hypercomplex (PH) neural networks. Specifically, the convolutional and fully-connected layers of a real-valued CNN are replaced by PH convolutions and multiplications, respectively. PH layers can operate with any channel dimension n and are able to capture inter-channel relations. We evaluate PH-CNNs on publicly available databases of dynamic and in-hospital ECG recordings and show comparable performance to corresponding real-valued CNNs while using approx. $1/n$ model parameters.  ( 2 min )
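    The roughly $1/n$ parameter count comes from building each weight matrix as a sum of Kronecker products. A minimal PyTorch sketch of a parameterised hypercomplex linear layer in that spirit (not necessarily the paper's exact layer; shapes and initialization are illustrative):

        import torch

        class PHMLinear(torch.nn.Module):
            def __init__(self, n, in_features, out_features):
                super().__init__()
                assert in_features % n == 0 and out_features % n == 0
                # n small "rule" matrices and n low-dimensional weight blocks
                self.A = torch.nn.Parameter(torch.randn(n, n, n))
                self.F = torch.nn.Parameter(
                    torch.randn(n, out_features // n, in_features // n))

            def forward(self, x):
                # W = sum_i A_i (kron) F_i  ->  (out_features, in_features)
                W = sum(torch.kron(self.A[i], self.F[i])
                        for i in range(self.A.shape[0]))
                return x @ W.T

        layer = PHMLinear(n=4, in_features=16, out_features=8)
        y = layer(torch.randn(2, 16))   # -> shape (2, 8)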
    Dealing with Drift of Adaptation Spaces in Learning-based Self-Adaptive Systems using Lifelong Self-Adaptation. (arXiv:2211.02658v1 [cs.LG])
    Recently, machine learning (ML) has become a popular approach to support self-adaptation. ML has been used to deal with several problems in self-adaptation, such as maintaining an up-to-date runtime model under uncertainty and scalable decision-making. Yet, exploiting ML comes with inherent challenges. In this paper, we focus on a particularly important challenge for learning-based self-adaptive systems: drift in adaptation spaces. With adaptation space we refer to the set of adaptation options a self-adaptive system can select from at a given time to adapt, based on the estimated quality properties of the adaptation options. Drift of adaptation spaces originates from uncertainties affecting the quality properties of the adaptation options. Such drift may imply that eventually no adaptation option can satisfy the initial set of adaptation goals, deteriorating the quality of the system, or that adaptation options emerge which allow enhancing the adaptation goals. In ML, such a shift corresponds to novel class appearance, a type of concept drift in target data that common ML techniques have problems dealing with. To tackle this problem, we present a novel approach to self-adaptation that enhances learning-based self-adaptive systems with a lifelong ML layer. We refer to this approach as lifelong self-adaptation. The lifelong ML layer tracks the system and its environment, associates this knowledge with the current tasks, identifies new tasks based on differences, and updates the learning models of the self-adaptive system accordingly. A human stakeholder may be involved to support the learning process and adjust the learning and goal models. We present a reusable architecture for lifelong self-adaptation and apply it to the case of drift of adaptation spaces that affects the decision-making in self-adaptation. We validate the approach for a series of scenarios using the DeltaIoT exemplar.  ( 3 min )
    Climbing Routes Clustering Using Energy-Efficient Accelerometers Attached to the Quickdraws. (arXiv:2211.02680v1 [eess.SP])
    One of the challenges for climbing gyms is to identify popular routes, so they can improve their services and optimally use their infrastructure. This problem must be addressed while preserving both the privacy and convenience of the climbers and respecting the costs borne by the gyms. To this aim, a hardware prototype is developed to collect data using accelerometer sensors attached to a piece of climbing equipment mounted on the wall, called a quickdraw, that connects the climbing rope to the bolt anchors. The corresponding sensors are configured to be energy-efficient, hence becoming practical in terms of expenses and replacement time when used in large quantities in a climbing gym. This paper describes the hardware specifications, studies data measured by the sensors in ultra-low-power mode, detects patterns in the data while climbing different routes, and develops an unsupervised approach for route clustering.  ( 2 min )
    Automatic Seizure Prediction using CNN and LSTM. (arXiv:2211.02679v1 [eess.SP])
    The electroencephalogram (EEG) is one of the most valuable technologies for understanding what happens inside our brain and, by extension, our body. Automatic prediction of oncoming seizures using EEG signals helps doctors and clinical experts and reduces their workload. This paper proposes an end-to-end deep learning algorithm to fully automate the laborious task of seizure prediction without any heavy pre-processing of the EEG data or feature engineering. The proposed network blends a signal-processing and deep-learning pipeline that automates the seizure prediction framework using EEG signals. The model was evaluated on an open EEG dataset, CHB-MIT. The network achieved an average sensitivity of 97.746% and a false positive rate (FPR) of 0.2373 per hour.  ( 2 min )
    LightNorm: Area and Energy-Efficient Batch Normalization Hardware for On-Device DNN Training. (arXiv:2211.02686v1 [cs.AR])
    When training early-stage deep neural networks (DNNs), generating intermediate features via convolution or linear layers occupies most of the execution time. Accordingly, extensive research has been done to reduce the computational burden of the convolution or linear layers. In recent mobile-friendly DNNs, however, the relative number of operations involved in processing these layers has decreased significantly. As a result, the proportion of the execution time of other layers, such as batch normalization layers, has increased. Thus, in this work, we conduct a detailed analysis of the batch normalization layer to efficiently reduce the runtime overhead of the batch normalization process. Backed by this thorough analysis, we present an extremely efficient batch normalization, named LightNorm, and its associated hardware module. In more detail, we fuse three approximation techniques: i) low bit-precision, ii) range batch normalization, and iii) block floating point. All of these approximation techniques are carefully utilized not only to maintain the statistics of intermediate feature maps, but also to minimize off-chip memory accesses. By using the proposed LightNorm hardware, we can achieve significant area and energy savings during DNN training without hurting the training accuracy. This makes the proposed hardware a great candidate for on-device training.  ( 2 min )
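    Of the three techniques named above, range batch normalization is easy to illustrate in software: the batch standard deviation is replaced by the batch range scaled by a Gaussian-motivated constant. A floating-point sketch of that idea (the paper combines it with low precision and block floating point in hardware; the epsilon and shapes here are illustrative):

        import torch

        def range_batch_norm(x, eps=1e-5):
            # x: (batch, features). Normalize by the batch range instead of
            # the std; range / sqrt(2 ln n) approximates the std for
            # Gaussian activations and is cheaper and low-precision friendly.
            n = x.shape[0]
            mean = x.mean(dim=0)
            centered = x - mean
            value_range = (centered.max(dim=0).values
                           - centered.min(dim=0).values)
            scale = value_range / (2.0 * torch.log(torch.tensor(float(n)))) ** 0.5
            return centered / (scale + eps)

        out = range_batch_norm(torch.randn(32, 64))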
    Deep Surrogate Docking: Accelerating Automated Drug Discovery with Graph Neural Networks. (arXiv:2211.02720v1 [cs.LG])
    The process of screening molecules for desirable properties is a key step in several applications, ranging from drug discovery to material design. During the process of drug discovery specifically, protein-ligand docking, or chemical docking, is a standard in-silico scoring technique that estimates the binding affinity of molecules with a specific protein target. Recently, however, as the number of virtual molecules available to test has rapidly grown, these classical docking algorithms have created a significant computational bottleneck. We address this problem by introducing Deep Surrogate Docking (DSD), a framework that applies deep learning-based surrogate modeling to accelerate the docking process substantially. DSD can be interpreted as a formalism of several earlier surrogate prefiltering techniques, adding novel metrics and practical training procedures. Specifically, we show that graph neural networks (GNNs) can serve as fast and accurate estimators of classical docking algorithms. Additionally, we introduce FiLMv2, a novel GNN architecture which we show outperforms existing state-of-the-art GNN architectures, attaining more accurate and stable performance by allowing the model to filter out irrelevant information from data more efficiently. Through extensive experimentation and analysis, we show that the DSD workflow combined with the FiLMv2 architecture provides a 9.496x speedup in molecule screening with a <3% recall error rate on an example docking task. Our open-source code is available at https://github.com/ryienh/graph-dock.  ( 2 min )
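    The surrogate-prefiltering workflow that DSD formalizes can be sketched generically: dock a small labeled subset exactly, fit a fast model to those scores, screen the whole library cheaply, then run the expensive docking code only on the top slice. This sketch substitutes a random forest on stand-in feature vectors for the paper's GNN; all data and thresholds are synthetic:

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(20000, 128))        # stand-in molecular features
        true_score = X @ rng.normal(size=128)    # stand-in docking scores

        # "Dock" a small labeled subset, fit the surrogate, screen everything.
        n_label, keep_frac = 1000, 0.05
        surrogate = RandomForestRegressor(n_estimators=100)
        surrogate.fit(X[:n_label], true_score[:n_label])
        preds = surrogate.predict(X)
        cutoff = np.quantile(preds, keep_frac)   # lower score = stronger binder
        shortlist = np.where(preds <= cutoff)[0] # only these go to real docking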
    Deep Distance Sensitivity Oracles. (arXiv:2211.02681v1 [cs.LG])
    One of the most fundamental graph problems is finding a shortest path from a source to a target node. While in its basic forms the problem has been studied extensively and efficient algorithms are known, it becomes significantly harder as soon as parts of the graph are susceptible to failure. Although one can recompute a shortest replacement path after every outage, this is rather inefficient in time and/or storage. One way to overcome this problem is to shift the computational burden from the queries into a pre-processing step, where a data structure is computed that allows for fast querying of replacement paths, typically referred to as a Distance Sensitivity Oracle (DSO). While DSOs have been extensively studied in the theoretical computer science community, to the best of our knowledge this is the first work to construct DSOs using deep learning techniques. We show how deep learning can exploit a combinatorial structure of replacement paths: each replacement path is a concatenation of shortest paths, and we use deep learning to find the pivot nodes at which those shortest paths are stitched together.  ( 2 min )
    Fast and efficient speech enhancement with variational autoencoders. (arXiv:2211.02728v1 [cs.SD])
    Unsupervised speech enhancement based on variational autoencoders has shown promising performance compared with the commonly used supervised methods. This approach involves the use of a pre-trained deep speech prior along with a parametric noise model, where the noise parameters are learned from the noisy speech signal with an expectation-maximization (EM)-based method. The E-step involves an intractable latent posterior distribution. Existing algorithms to solve this step are either based on computationally heavy Markov chain Monte Carlo sampling methods and variational inference, or on inefficient optimization-based methods. In this paper, we propose a new approach based on Langevin dynamics that generates multiple sequences of samples and comes with a total-variation-based regularization to incorporate temporal correlations of latent vectors. Our experiments demonstrate that the developed framework makes an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.  ( 2 min )
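    The core Langevin update is compact: follow the gradient of the log-posterior plus Gaussian noise, and the iterates approximate posterior samples. A toy sketch with a Gaussian stand-in for the intractable posterior (step size, dimensions, and the target itself are illustrative; the paper adds total-variation regularization on top of this):

        import torch

        def langevin_samples(log_post, z0, step=1e-2, n_steps=200):
            # Unadjusted Langevin dynamics over the latent z.
            z = z0.clone().requires_grad_(True)
            samples = []
            for _ in range(n_steps):
                grad = torch.autograd.grad(log_post(z).sum(), z)[0]
                with torch.no_grad():
                    z = z + 0.5 * step * grad + (step ** 0.5) * torch.randn_like(z)
                z.requires_grad_(True)
                samples.append(z.detach().clone())
            return samples

        # Toy target: standard Gaussian posterior over a 16-dim latent.
        log_post = lambda z: -0.5 * (z ** 2).sum(dim=-1)
        samples = langevin_samples(log_post, torch.zeros(8, 16))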
    An Adversarial Robustness Perspective on the Topology of Neural Networks. (arXiv:2211.02675v1 [cs.LG])
    In this paper, we investigate the impact of neural networks (NNs) topology on adversarial robustness. Specifically, we study the graph produced when an input traverses all the layers of a NN, and show that such graphs are different for clean and adversarial inputs. We find that graphs from clean inputs are more centralized around highway edges, whereas those from adversaries are more diffuse, leveraging under-optimized edges. Through experiments on a variety of datasets and architectures, we show that these under-optimized edges are a source of adversarial vulnerability and that they can be used to detect adversarial inputs.  ( 2 min )
    Uncertainty-aware predictive modeling for fair data-driven decisions. (arXiv:2211.02730v1 [stat.ML])
    Both industry and academia have made considerable progress in developing trustworthy and responsible machine learning (ML) systems. While critical concepts like fairness and explainability are often addressed, the safety of systems is typically not sufficiently taken into account. By viewing data-driven decision systems as socio-technical systems, we draw on the uncertainty in ML literature to show how fairML systems can also be safeML systems. We posit that a fair model needs to be an uncertainty-aware model, e.g. by drawing on distributional regression. For fair decisions, we argue that a safe fail option should be used for individuals with uncertain categorization. We introduce semi-structured deep distributional regression as a modeling framework which addresses multiple concerns brought against standard ML models and show its use in a real-world example of algorithmic profiling of job seekers.  ( 2 min )
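    The safe-fail idea above reduces to a simple decision rule. A minimal sketch, assuming a distributional model that outputs a predictive mean and standard deviation; the threshold values and the "defer" action are illustrative:

        def decide(mu, sigma, threshold=0.5, max_sigma=0.15):
            # Safe-fail rule: act on the model's prediction only when its
            # predictive uncertainty is small; otherwise defer to a human.
            if sigma > max_sigma:
                return "defer"            # uncertain categorization -> safe fail
            return "positive" if mu >= threshold else "negative"

        # mu/sigma would come from a distributional regression model.
        print(decide(mu=0.62, sigma=0.05))   # -> positive
        print(decide(mu=0.62, sigma=0.30))   # -> defer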
    De novo PROTAC design using graph-based deep generative models. (arXiv:2211.02660v1 [q-bio.QM])
    PROteolysis TArgeting Chimeras (PROTACs) are an emerging therapeutic modality for degrading a protein of interest (POI) by marking it for degradation by the proteasome. Recent developments in artificial intelligence (AI) suggest that deep generative models can assist with the de novo design of molecules with desired properties, and their application to PROTAC design remains largely unexplored. We show that a graph-based generative model can be used to propose novel PROTAC-like structures from empty graphs. Our model can be guided towards the generation of large molecules (30--140 heavy atoms) predicted to degrade a POI through policy-gradient reinforcement learning (RL). Rewards during RL are applied using a boosted tree surrogate model that predicts a molecule's degradation potential for each POI. Using this approach, we steer the generative model towards compounds with higher likelihoods of predicted degradation activity. Despite being trained on sparse public data, the generative model proposes molecules with substructures found in known degraders. After fine-tuning, predicted activity against a challenging POI increases from 50% to >80% with near-perfect chemical validity for sampled compounds, suggesting this is a promising approach for the optimization of large, PROTAC-like molecules for targeted protein degradation.  ( 2 min )
    Online Learning and Bandits with Queried Hints. (arXiv:2211.02703v1 [cs.DS])
    We consider the classic online learning and stochastic multi-armed bandit (MAB) problems, when at each step, the online policy can probe and find out which of a small number ($k$) of choices has better reward (or loss) before making its choice. In this model, we derive algorithms whose regret bounds have exponentially better dependence on the time horizon compared to the classic regret bounds. In particular, we show that probing with $k=2$ suffices to achieve time-independent regret bounds for online linear and convex optimization. The same number of probes improves the regret bound of stochastic MAB with independent arms from $O(\sqrt{nT})$ to $O(n^2 \log T)$, where $n$ is the number of arms and $T$ is the horizon length. For stochastic MAB, we also consider a stronger model where a probe reveals the reward values of the probed arms, and show that in this case, $k=3$ probes suffice to achieve parameter-independent constant regret, $O(n^2)$. Such regret bounds cannot be achieved even with full feedback after the play, showcasing the power of limited "advice" via probing before making the play. We also present extensions to the setting where the hints can be imperfect, and to the case of stochastic MAB where the rewards of the arms can be correlated.  ( 2 min )
    Resource-Efficient Transfer Learning From Speech Foundation Model Using Hierarchical Feature Fusion. (arXiv:2211.02712v1 [cs.LG])
    Self-supervised pre-training of a speech foundation model, followed by supervised fine-tuning, has shown impressive quality improvements on automatic speech recognition (ASR) tasks. Fine-tuning separate foundation models for many downstream tasks is expensive, since the foundation model is usually very big. Parameter-efficient fine-tuning methods (e.g. adapters, sparse update methods) offer an alternative paradigm where a small set of parameters is updated to adapt the foundation model to new tasks. However, these methods still suffer from a high computational memory cost and slow training speed because they require backpropagation through the entire neural network at each step. In this paper, we analyze the performance of features at different layers of a foundation model on the speech recognition task and propose a novel hierarchical feature fusion method for resource-efficient transfer learning from speech foundation models. Experimental results show that the proposed method can achieve better performance on the speech recognition task than existing algorithms, with fewer trainable parameters, lower computational memory cost, and faster training speed. After combining with adapters at all layers, the proposed method can achieve the same performance as fine-tuning the whole model with $97\%$ fewer trainable encoder parameters and $53\%$ faster training speed.  ( 2 min )
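    The general flavor of fusing features across encoder layers can be sketched as a learned weighted sum over frozen per-layer outputs, so only a tiny fusion head is trained. This is a common baseline in that spirit, not necessarily the paper's hierarchical scheme; shapes and names are illustrative:

        import torch

        class WeightedLayerFusion(torch.nn.Module):
            # Fuses frozen per-layer features with learned softmax weights;
            # only this tiny head (plus a downstream classifier) is trained.
            def __init__(self, n_layers):
                super().__init__()
                self.w = torch.nn.Parameter(torch.zeros(n_layers))

            def forward(self, layer_feats):      # list of (B, T, D) tensors
                weights = torch.softmax(self.w, dim=0)
                return sum(w * f for w, f in zip(weights, layer_feats))

        feats = [torch.randn(2, 50, 256) for _ in range(12)]  # frozen encoder outputs
        fused = WeightedLayerFusion(12)(feats)                # -> (2, 50, 256)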
    GLOBEM Dataset: Multi-Year Datasets for Longitudinal Human Behavior Modeling Generalization. (arXiv:2211.02733v1 [cs.LG])
    Recent research has demonstrated the capability of behavior signals captured by smartphones and wearables for longitudinal behavior modeling. However, there is a lack of a comprehensive public dataset that serves as an open testbed for fair comparison among algorithms. Moreover, prior studies mainly evaluate algorithms using data from a single population within a short period, without measuring the cross-dataset generalizability of these algorithms. We present the first multi-year passive sensing datasets, containing over 700 user-years of data from 497 unique users collected via mobile and wearable sensors, together with a wide range of well-being metrics. Our datasets can support multiple cross-dataset evaluations of behavior modeling algorithms' generalizability across different users and years. As a starting point, we provide benchmark results of 18 algorithms on the task of depression detection. Our results indicate that both prior depression detection algorithms and domain generalization techniques show potential but need further research to achieve adequate cross-dataset generalizability. We envision that our multi-year datasets can support the ML community in developing generalizable longitudinal behavior modeling algorithms.  ( 2 min )
    MalGrid: Visualization Of Binary Features In Large Malware Corpora. (arXiv:2211.02696v1 [cs.CR])
    The number of malware samples is constantly on the rise. Though most new malware samples are modifications of existing ones, their sheer number is quite overwhelming. In this paper, we present a novel system to visualize and map millions of malware samples to points in a 2-dimensional (2D) spatial grid. This enables visualizing relationships within large malware datasets that can be used to develop triage solutions to screen different malware rapidly and provide situational awareness. Our approach links two visualizations within an interactive display. Our first view is a spatial point-based visualization of similarity among the samples, based on a reduced-dimensional projection of binary feature representations of malware. Our second, spatial grid-based view provides better insight into the similarities and differences between selected malware samples in terms of the binary-based visual representations they share. We also provide a case study where the effect of packing on the malware data is correlated with the complexity of the packing algorithm.  ( 2 min )
    MONAI: An open-source framework for deep learning in healthcare. (arXiv:2211.02701v1 [cs.LG])
    Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provides purpose-specific AI model architectures, transformations, and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.  ( 3 min )
    NLP Inspired Training Mechanics For Modeling Transient Dynamics. (arXiv:2211.02716v1 [cs.LG])
    In recent years, machine learning (ML) techniques developed for natural language processing (NLP) have permeated the development of better computer vision algorithms. In this work, we use such NLP-inspired techniques to improve the accuracy, robustness and generalizability of ML models for simulating transient dynamics. We introduce teacher-forcing and curriculum-learning based training mechanics to model vortical flows and show an enhancement in accuracy of more than 50% for ML models such as FNO and UNet.  ( 2 min )
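    Teacher forcing with an annealed teacher ratio (the curriculum) can be sketched as a training-time rollout that mixes ground truth with the model's own predictions. The model, shapes, and annealing policy below are illustrative assumptions, not the paper's setup:

        import torch

        def rollout_loss(model, seq, teacher_ratio,
                         loss_fn=torch.nn.functional.mse_loss):
            # seq: (T, B, D) ground-truth trajectory. With probability
            # teacher_ratio feed the true previous state (teacher forcing);
            # otherwise feed the model's own prediction, so test-time error
            # accumulation is seen during training. A curriculum anneals
            # teacher_ratio from 1 toward 0 over training.
            state, loss = seq[0], 0.0
            for t in range(1, seq.shape[0]):
                pred = model(state)
                loss = loss + loss_fn(pred, seq[t])
                use_truth = torch.rand(()) < teacher_ratio
                state = seq[t] if use_truth else pred.detach()
            return loss / (seq.shape[0] - 1)

        model = torch.nn.Linear(8, 8)            # stand-in dynamics model
        seq = torch.randn(20, 4, 8)              # toy trajectory batch
        loss = rollout_loss(model, seq, teacher_ratio=0.7)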
    A Fuzzy-set-based Joint Distribution Adaptation Method for Regression and its Application to Online Damage Quantification for Structural Digital Twin. (arXiv:2211.02656v1 [cs.CE])
    Online damage quantification suffers from insufficient labeled data. In this context, adopting domain adaptation on historical labeled data from similar structures/damages to assist the current diagnosis task would be beneficial. However, most domain adaptation methods are designed for classification and cannot efficiently address damage quantification, a regression problem with continuous real-valued labels. This study first proposes a novel domain adaptation method, the Online Fuzzy-set-based Joint Distribution Adaptation for Regression, to address this challenge. By converting the continuous real-valued labels to fuzzy class labels via fuzzy sets, the conditional distribution discrepancy is measured, and domain adaptation can simultaneously consider the marginal and conditional distributions for the regression task. Furthermore, a framework for online damage quantification integrated with the proposed domain adaptation method is presented. The method has been verified with an example of a damaged helicopter panel, in which domain adaptations are conducted across different damage locations and from simulation to experiment, showing that the accuracy of damage quantification can be improved significantly even in a noisy environment. It is expected that the proposed approach can be applied to fleet-level digital twins that account for individual differences.  ( 2 min )
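    The label-conversion step can be illustrated with triangular membership functions over a set of class centers. The functional form and the centers below are assumptions for illustration, not necessarily the paper's choice:

        import numpy as np

        def fuzzy_memberships(y, centers):
            # Convert continuous damage labels into soft memberships over
            # fuzzy classes; conditional-distribution discrepancy can then
            # be measured per fuzzy class as in classification.
            y = np.atleast_1d(y).astype(float)
            width = np.diff(centers).mean()
            m = np.maximum(0.0,
                           1.0 - np.abs(y[:, None] - centers[None, :]) / width)
            return m / m.sum(axis=1, keepdims=True)  # normalize across classes

        centers = np.array([0.0, 0.25, 0.5, 0.75, 1.0])  # severity anchors
        print(fuzzy_memberships([0.1, 0.6], centers).round(2))
        # -> [[0.6 0.4 0.  0.  0. ]
        #     [0.  0.  0.6 0.4 0. ]]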
    MolE: a molecular foundation model for drug discovery. (arXiv:2211.02657v1 [q-bio.QM])
    Models that accurately predict properties based on chemical structure are valuable tools in drug discovery. However, for many properties, public and private training sets are typically small, and it is difficult for the models to generalize well outside of the training data. Recently, large language models have addressed this problem by using self-supervised pretraining on large unlabeled datasets, followed by fine-tuning on smaller, labeled datasets. In this paper, we report MolE, a molecular foundation model that adapts the DeBERTa architecture to be used on molecular graphs together with a two-step pretraining strategy. The first step of pretraining is a self-supervised approach focused on learning chemical structures, and the second step is a massive multi-task approach to learn biological information. We show that fine-tuning pretrained MolE achieves state-of-the-art results on 9 of the 22 ADMET tasks included in the Therapeutic Data Commons.  ( 2 min )
  • Open

    How AI Image Generators Work
    How AI Image Generators Work (Stable Diffusion / Dall-E) - Computerphile: https://www.youtube.com/watch?v=1CIpzeNxIhU Stable Diffusion in Code (AI Image Generation) - Computerphile: https://www.youtube.com/watch?v=-lz30by8-sU submitted by /u/keghn [link] [comments]  ( 44 min )

  • Open

    [R] Evaluation of SSL on existing sleep stage classification models
    Do you work on the sleep stage classification problem? You must have noticed how large a labeled dataset you need to train your model, which is not always achievable. Since self-supervised learning addresses the issue of having few labels, we revisited some existing sleep stage classification models to see how they perform when pretrained with a self-supervised learning method. We find that after pretraining, these methods need only 5% of the labeled data to achieve their previous supervised performance. Check out our work https://arxiv.org/abs/2210.06286 and our code https://github.com/emadeldeen24/eval_ssl_ssc submitted by /u/emad_eldeen [link] [comments]  ( 58 min )
    [P] Ransomware detection using ML. We need your suggestions to create a better model!
    We are working on #machinelearning applied to behavioural data for #ransomware detection. This is a 3-component PCA (50% of variance explained) applied to behavioural data from #ransomware and legit programs, as collected by Owlyshield. Trajectories should be interpreted as the evolution of program activity over time. [Figure: 3-component PCA showing ransomware/legit binaries behaviours] What is interesting is that we can build a separating hyperplane (an SVM works well, but xgboost shows outstanding performance). We focus solely on the activity of programs (or families of programs) in terms of disk activity (reads and writes). Behaviours are collected by Owlyshield, the open-source anti-malware engine we created and maintain. Owlyshield is a dynamic project and the quantity of metrics we collect is increasing rapidly. We work with time series of aggregated data that describe the evolution of the activity of a family of processes over time. Some examples of such features: count of read/setinfo/write operations, total read/write bytes, total read/write entropy, count of files opened/deleted/read/renamed/written, etc. As of now we only support Windows, which was simple to work with: minifilters make accessing I/O request packets (IRPs) easy, ransomware attacks are still uncommon on *nix systems, and we lack sufficient binaries there for supervised learning. We would like to work on Linux support but are still unsure whether recompiling the kernel is required to monitor low-level read/write activity; FUSE may be an option. Let us know what you think may be a good way to implement it. As of now, our model classifies individual points (aggregated features at a given time) but does not take into account the "direction" of the trajectory (the time series). We would like to know how you would build such a model. submitted by /u/dlescos [link] [comments]  ( 59 min )
    [D] Group based activity to teach basics
    Hi, I tried a search on this but I'm not sure how best to word it. Basically, I am wondering if anyone has resources or ideas for teaching the basics of machine learning as an in-person, group-based activity? I am effectively looking to give other teams, who know nothing about it, a basic understanding of what I am doing when I deploy models to the business. submitted by /u/calv420 [link] [comments]  ( 58 min )
    [D] What does it mean for an AI to understand? (Chinese Room Argument) - MLST Video
    Mods feel free to delete this if you feel it's inappropriate. https://youtu.be/_KVAzAzO5HU We interviewed Francois Chollet, Mark Bishop, David Chalmers, Joscha Bach and Karl Friston on the Chinese Room argument. The Chinese Room Argument was first proposed by philosopher John Searle in 1980. It is an argument against the possibility of artificial intelligence (AI) – that is, the idea that a machine could ever be truly intelligent, as opposed to just imitating intelligence. The argument goes like this: Imagine a room in which a person sits at a desk, with a book of rules in front of them. This person does not understand Chinese. Someone outside the room passes a piece of paper through a slot in the door. On this paper is a Chinese character. The person in the room consults the book of rules and, following these rules, writes down another Chinese character and passes it back out through the slot. To someone outside the room, it appears that the person in the room is engaging in a conversation in Chinese. In reality, they have no idea what they are doing – they are just following the rules in the book. The Chinese Room Argument is an argument against the idea that a machine could ever be truly intelligent. It is based on the idea that intelligence requires understanding, and that following rules is not the same as understanding. TL;DR - Chalmers, Chollet, Bach and Friston think that minds can arise from information (functionalists with some interesting distinctions on whether it's causal / strongly emergent etc), Bishop/Searle not, they think there is an ontological difference in "being". submitted by /u/timscarfe [link] [comments]  ( 59 min )
    [D] Resources geared towards deep video understanding?
    I'm trying to get my feet wet with using deep learning based techniques for video understanding tasks - action recognition, video summarization, video temporal grounding, etc. Could someone point to any resources that serve as a gentle introduction into the kind of techniques used for these tasks? submitted by /u/fullgoopy_alchemist [link] [comments]  ( 58 min )
    [D] Academia: The highest funded plagiarist is also an AI ethicist
    This academic plagiarist has gotten hundreds of thousands of dollars for AI ethics research (https://www.umu.se/globalassets/qbank/andreastheodorouvirginiadignum-26433crop001155650resize1154649autoorientquality90density150stripextensionjpgid16.jpg?format=webp&mode=crop&width=1280): $300000 for EXPLAIN (2022-25), where he is the primary investigator for Umeå University, funded by Vinnova under the Eureka ITEA/4 AI Cluster; $140000 for the project "RAIN" (which appears to be just a dubious presentation long after the funder has forgotten about their money); and several hundred thousand dollars more for other "research projects" related to AI ethics and transparency. The funniest part is that he claimed on a number of occasions that he has been cleared of the accusations by NPOF (a Swedish institution investigating academic misconduct), but NPOF writes the following: "Thank you for your inquiry. We received a report regarding Andreas Theodorou and his dissertation in 2020. Theodorou made the dissertation in the UK. The dissertation is not made within Swedish jurisdiction and can not be investigated by Npof. Kind regards, Registrator." And his alma mater, for obvious reasons, refuses to investigate him. All the while he is being aggressively promoted by one Virginia Dignum, a self-promoting professor who among academics is considered highly proficient in academic politics. For more info go here: https://forbetterscience.com/2022/10/28/schneider-shorts-28-10-2022-indisputable-scientific-quality/#plagiarism or google "Theodorou plagiarism". submitted by /u/lexquests [link] [comments]  ( 60 min )
    [R] Adversarial Examples of Go AIs (NeurIPS 2022)
    GitHub: https://PaperCode.cc/GoAttack Paper: https://arxiv.org/abs/2211.03769 Can you be smarter than AlphaZero Go agents? In Figure 2, KataGo makes a trivial mistake that even amateur human players can easily tell. More details are at https://PaperCode.cc/GoAttack. https://preview.redd.it/l2jv0l12hsy91.png?width=1768&format=png&auto=webp&s=9491c1a76eb8051289b18b7bd2010ddab23e00cc submitted by /u/sb710031 [link] [comments]  ( 58 min )
    [D] Is there anything like beam search with BERT?
    I am trying to build an NER model, but want multiple options for the spans, e.g. "I like green cats." -> {BOBI, BIII, BOOO, etc.}, that I can feed into another algorithm, which chooses among them based on downstream criteria. With something like T5, I would modify the beam search to give me a list of generated texts from the most probable to the nth most probable. With BERT, I don't know how to do this because I can't condition the result of a token classification on the previous one. submitted by /u/natural_language_guy [link] [comments]  ( 61 min )
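    One answer in the spirit of the question: with an independent token-classification head, you can run a beam search directly over the per-token label log-probabilities to get an n-best list of tag sequences. A minimal sketch (labels, scores, and shapes are illustrative; with a truly independent head the exact top-k is cheap, and the beam mainly matters once you add constraints such as "no I after O"):

        import numpy as np

        def beam_search_tags(log_probs, k=5):
            # log_probs: (T, L) per-token label log-probabilities from a
            # BERT-style token-classification head. Returns the k
            # highest-scoring label sequences with their scores.
            beams = [((), 0.0)]
            for t in range(log_probs.shape[0]):
                candidates = [(seq + (label,), score + log_probs[t, label])
                              for seq, score in beams
                              for label in range(log_probs.shape[1])]
                beams = sorted(candidates, key=lambda c: -c[1])[:k]
            return beams

        T, L = 4, 3                       # 4 tokens, labels {O, B, I}
        logits = np.random.randn(T, L)
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        for seq, score in beam_search_tags(log_probs, k=3):
            print(seq, round(float(score), 3))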
    [Project] PixelRNN
    Hi, my friend and I would like to make a project using a generative model for face images, and we stumbled upon PixelRNN. Unfortunately, there is not much implementation detail available except for research papers, and that does not help very much. We don't mind researching, but if it's something that's not feasible for a side project, we wouldn't like to waste our time. Do you have any advice? submitted by /u/prijateljmitar [link] [comments]  ( 57 min )
    [P] NLP Job Postings Project Volunteer Opportunity
    Looking for volunteers to build NLP models that classify elements in job postings (i.e., Summary, Responsibilities, Qualifications, Benefits, etc.) and then categorize spans within those classifications (Skills, Nice to Have, Experience, etc.). I have a scalable web scraper I developed that pulls job posting data from Indeed. Currently using spaCy and Prodigy, but open to other libraries. DM me if interested. So far I have not seen a job site that has all of its job postings in a standard format (the Google job search schema is the closest, I think). The outcome would be exciting for researching career opportunities and developing resume content with the latest industry insights and trends. submitted by /u/Kbig22 [link] [comments]  ( 58 min )
    [Project] Image detection and recognition
    [P] I have a project to create a login in which the user provides a photo, which is then matched against a dataset (containing pictures of all individuals who can log in) to log them into their particular account. I have to do it in Java. I figured out that I will have to use OpenCV for it. However, I really have no idea which algorithms to choose and what considerations and limitations I should work with. Can anyone give me the basic concept of this implementation, like which algorithm and library to use and why? I apologize if this is not the right sub to post in; please redirect me to the right one. Thanks in advance. PS: I have a basic understanding of machine learning (linear and logistic regression, and neural networks) and don't know anything about OpenCV. submitted by /u/Ok_Opportunity_2022 [link] [comments]  ( 58 min )
    [Project] Pose Classification with Sound -- I need some very urgent help -- Anyone who can do t-SNE and PCA?
    Hi, I'm a high school student who needs some serious help on a research project; my grade depends on this, and my mentor just dipped out on me. I'm working on a project that collects the vibrations of people's motions, then identifies the motion based on the vibrations. I have my data collected, and I converted the data into mel spectrograms. I'm supposed to do t-SNE and PCA on my datasets, but I am having some very serious problems with this, and I can't ask my mentor for help because he ghosted me. My project deadlines are coming up very soon. If anyone here is familiar with t-SNE or PCA and is willing to help me work out the problem, PM me and we can DM or call on Discord about the project. If you actually manage to get me out of this hole I can Venmo you some money, but I'm a student and it will probably not be very much. Thanks to everyone who is willing to try submitted by /u/n-nya [link] [comments]  ( 57 min )
    [R] Text8 dataset
    Hello, I was reading some papers about character-level language models, and all of them test on the Text8 dataset. I tried to understand the metric used for this dataset, but in vain. All the papers measure bits per character (BPC), which I didn't understand (it has to do with compression, I guess). So can someone explain to me how this metric evaluates how good an LM is? Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 58 min )
    [D] Classification to predict imbalanced dataset with rare event?
    Hi All- I'm dealing with a dataset to train a model to predict the number of occurrences of events (probability(event1), probability(event2), etc.). I have over 400 features. However, as shown in the table below, I am facing a few issues: 1) The data is sparse, i.e., it contains some rare events. 2) The data is imbalanced, i.e., the number of negative labels is extremely larger than that of the other events. I am currently down-sampling the negative labels, but I think there is a risk with that approach, since down-sampling can change the distribution of the data. If that is true, what would be a good approach here? Furthermore, given the rare events, can I simply use a classification model (xgboost, ...) for this problem? Or do I need to train a separate model for each event?

        event_name    count
        event1            1324
        event2          124200
        event3          800000
        no_event      13000000

    submitted by /u/hopedallas [link] [comments]  ( 59 min )
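    One alternative to down-sampling, sketched below: keep every row and re-weight the loss per class, so the learned probabilities still reflect the true distribution. This is a generic sketch using standard xgboost/scikit-learn APIs on synthetic data, not a claim about what works best for this dataset:

        import numpy as np
        import xgboost as xgb
        from sklearn.utils.class_weight import compute_sample_weight

        X = np.random.randn(5000, 400)
        y = np.random.choice(4, size=5000,
                             p=[0.93, 0.05, 0.015, 0.005])   # imbalanced

        weights = compute_sample_weight("balanced", y)  # rare events count more
        clf = xgb.XGBClassifier(objective="multi:softprob")
        clf.fit(X, y, sample_weight=weights)
        proba = clf.predict_proba(X[:5])                # per-event probabilities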
    [R] How to run inference for an encoder-decoder model trained with teacher forcing.
    Hi all! I am trying to map two floating-point numerical datasets, with one dataset (of size 4096 timesteps/elements) going into the encoder and the other dataset (of size 8192) going into the decoder input and decoder output, with 0 as the special start/stop character. While training, I am able to reduce the overall loss to around 0.020, but when I try to run the inference model, I am not sure I am doing the right thing, because the code I am using is meant for outputting an encoded vector... But since my data is of float32 datatype, I cannot even encode it and feed it to the model (I do carry out some data preprocessing to make it suitable for the model). Any suggestions or help will be much appreciated, as I am relatively new to encoder-decoder models and am not sure what I am doing during inference. I am attaching my model architecture and the inference model. Training architecture: https://preview.redd.it/mysq6nucoqy91.png?width=1194&format=png&auto=webp&s=2b2b0a8db3ca7770e463835f50679946a507f67d

        opt = Adam(learning_rate=0.007, clipnorm=1)
        model_encoder_training.compile(optimizer=opt, loss='mean_squared_error', metrics=['mse'])

    Inference architecture and loop: https://preview.redd.it/81p9zbcioqy91.png?width=1199&format=png&auto=webp&s=429d61e83421e5affccc97daea2ae9bd9d3103c6 submitted by /u/Shoddy-Competition29 [link] [comments]  ( 60 min )
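    One common pattern for this situation, not necessarily matching the poster's architecture: at inference time teacher forcing is unavailable, so you encode once, start the decoder from the special start value, and feed each prediction back as the next input. This sketch assumes the training graph has been split into an encoder model that returns states and a single-step decoder model; all names are hypothetical:

        import numpy as np

        def autoregressive_decode(encoder_model, decoder_model, x, out_len, start=0.0):
            # Encode once, then decode step by step, feeding predictions back.
            states = encoder_model.predict(x)
            if not isinstance(states, list):
                states = [states]
            prev = np.array([[[start]]], dtype=np.float32)  # (batch=1, t=1, feat=1)
            outputs = []
            for _ in range(out_len):
                y, *states = decoder_model.predict([prev] + states)
                outputs.append(float(y[0, -1, 0]))
                prev = y[:, -1:, :]                         # feed prediction back
            return np.array(outputs)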
    [N] Stability.ai AMA on November 15
    Excited to announce an upcoming AMA in /r/MachineLearning with the members of Stability.ai on Nov 15 at 11:00 AM PST. They have been one of the driving forces behind the open-source AI movement, being known for being key in the success of StableDiffusion (https://www.reddit.com/r/StableDiffusion/) and many other disruptive changes to the overall landscape of AI. ​ A thread will be created before the official AMA time for those who won't be able to attend on that day. submitted by /u/olaf_nij [link] [comments]  ( 81 min )
    [D] Is there a model that can extract product features from an image?
    Hello, I have a project in mind that requires an algorithm that can extract product features based on a provided image of the product. For example: I have an image of a golden ring with a diamond and ruby inset, and the model must output something like "metal - gold, inset - diamond, ruby". Is there an ML model that can do this? Or maybe a model that I can repurpose for the task? submitted by /u/levch [link] [comments]  ( 57 min )
    [D] Using a classifier after leave-one-out without retraining on an independent dataset
    Applying a cross-validated model, without retraining, to an independent set: when we use LeaveOneOut.split(X) on the training dataset, the model is iteratively fitted for each fold, and thus each fold has its own coefficients for the estimator, e.g. logistic regression. Now let's say you use this trained classifier outside of the cross-validation loop on the independent dataset: is the classifier using the parameters from the final fold?

        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import LeaveOneOut

        classifier = LogisticRegression()
        for train, test in LeaveOneOut().split(X):
            classifier.fit(X[train], y[train])
            classifier.predict(X[test])
        classifier.predict(X_independent)  # uses coefficients from the last fold only

    Is there any way to make sure the classifier, when predicting on the independent set, takes into account the overall coefficients across all folds? submitted by /u/macORnvidia [link] [comments]  ( 59 min )
    [N] Interested in how to evaluate Artificial Intelligence? Join our Google Group.
    Jointly with advancements in AI capabilities, interest in the evaluation of artificial intelligence is rising quickly across disciplines. From policy making to cognitive science, and of course in all the domains of AI, people are wondering how to construct benchmarks and what insights we get out of them. In an attempt to bundle these perspectives, and the corresponding news & events, we've created an old-school mailing list on Google Groups. Open to all, no matter the expertise, so come on and join! https://groups.google.com/g/ai-eval submitted by /u/wschella [link] [comments]  ( 59 min )
    Pytorch Symbolic: an equivalent of Keras Functional API [Project]
    Hello! I just hit version 1.0.0 of a library I've been developing for the past months as a side project. Pytorch Symbolic: a library that aims to provide a concise API for neural network creation in PyTorch. The API and the inner workings are similar to the Keras / TensorFlow2 Functional API, which I always enjoyed using. I decided to go with "Symbolic" in the name instead of "Functional" because I believe it better represents how the library works (also, "functional" is kind of taken by torch.nn.functional). I did my best to prepare useful documentation, so if you are interested, please check it out! It is filled with examples, best practices, benchmarks, explanations of the inner workings and more. See Documentation See on GitHub See on PyPI Example This example shows how to crea…  ( 64 min )
    [Project] Rebel Poker AI
    Any ML/DL peeps interested in taking a stab at implementing the ReBeL algorithm for poker from the Facebook AI group, along with maybe making a proper game API for training, all as open source? If you're not familiar with the paper, this is the link: Rebel Let me know! submitted by /u/Character_Bluejay601 [link] [comments]  ( 57 min )
  • Open

    AI Dream 86 - CHASE YOUR DREAMS - AIM FOR THE STARS
    submitted by /u/LordPewPew777 [link] [comments]  ( 44 min )
    Weekly China AI News: Alibaba Launches "Model as a Service"; IDEA Open Sources Chinese Stable Diffusion; Pony.ai Delivers Trucks Despite Layoff Rumors
    submitted by /u/trcytony [link] [comments]  ( 44 min )
    How many of you would use a drag and drop visual node-based system for AI development
    Game engines have had no-code game development tools for eons, the result being awesome games, especially indie ones. We have visual no-code tools for AI development, but they are:
    - Limited in functionality, i.e., they don't allow changes to their models
    - Unaffordable for the average deep learning researcher or developer
    - Limited in scope, i.e., they have very few models or models only in a certain sub-domain
    - Impossible/difficult to import previous work into
    Imagine a web-based drag-and-drop AI development tool which allows you to:
    - import your previous models or any model on github/huggingface
    - make changes to imported models visually
    - change/add/remove any layer of a neural network
    - verify neural network design by calculating input and output vector sizes
    - export code in tf/pytorch/onnx
    - prune and quantize deep learning models
    - train models on the cloud or a local machine
    - auto-label data using zero-shot detectors (ZSD-YOLO)
    - monitor training history and progress
    - inspect outputs and gradients of any arbitrary layer in deep learning models
    Edit 1: Please list your reasons in the comments section. View Poll submitted by /u/kamiurek [link] [comments]  ( 45 min )
    check this out
    submitted by /u/the_anonymizer [link] [comments]  ( 43 min )
    Do the PhDs who do the cool A.I. R&D stuff at FAANG and the like tend to be from elite universities? Would a great A.I. scientist from a no-name African university be able to compete with them for a job?
    Is there bias, as with Wall Street finance jobs, or is it fairer, as with SWE jobs in tech? submitted by /u/uluzg [link] [comments]  ( 48 min )
    Data Visualization with AI Tools
    After almost a year of tinkering with AI tools, we thought it might be good to share some of our knowledge with the community. Here is a quick toolkit with thoughts, tricks, and AI concerns for you 🔥 https://preview.redd.it/e1ypo60eary91.png?width=948&format=png&auto=webp&s=346f2c919e9dbb6c74f6c48ca4787ed5113f4bab Here is a first experimental prompt book for data visualization: https://docs.google.com/presentation/d/1V8d6TIlKqB1j5xPFH7cCmgKOV_fMs4Cb4dwgjD5GIsg/edit?usp=sharing Here is a prompt book of materials: https://docs.google.com/presentation/d/1eAQ2vKU1esP_bBV_XYfNbS-BUYaBDXS2dFj7NC8sJDw/edit?usp=sharing And here is an article on the general tools you could use, with some of the main concerns behind them: https://domesticdatastreamers.medium.com/a-quick-artificial-intelligence-tooguide-for-designers-and-data-designers-c99fe643c102 submitted by /u/pauerrrr [link] [comments]  ( 44 min )
    How can Robotics and AI support sustainable development? Fascinating talk from Prof. Mirko Kovac, Imperial College London showcasing the latest research
    submitted by /u/chelsea_bear [link] [comments]  ( 46 min )
    Pikachu Made with CAMY AI by Protocol 73
    submitted by /u/GameDevCoach [link] [comments]  ( 41 min )
    Empress Made with CAMY AI by Protocol 73
    submitted by /u/GameDevCoach [link] [comments]  ( 42 min )
    First Ever NASA Humanoid Robot To Release Before Tesla's Optimus | New Nvidia AI Tops OpenAI DALLE-2 & Google Imagen For Text To Image Output | New Meta AI Performs 60 Times Faster Than DeepMind AlphaFold
    submitted by /u/kenickh [link] [comments]  ( 44 min )
    Which BSc is likely to be most ideal for someone who'll pursue MSc A.I. (read body before voting)?
    All have Calculus 1-3, Linear Algebra, and at least one probability module, but: Descending order: the most to least stats material (with the CS major having one probability module). Ascending order: the most to least CS material. Only where there's CS is there CS material (including Algorithms and Systems Analysis & Design), i.e. the first two have no CS material. These are the only available options. I want a career in tech, more an R&D than a data-science kind of career. View Poll submitted by /u/uluzg [link] [comments]  ( 46 min )
    What would be the best available way to make ai generated videos right now? Is there anything better than EbSynth + Stable Diffusion?
    submitted by /u/PM_ME_LIFE_MEANING [link] [comments]  ( 44 min )
    What are good videos where I can find out about new AIs?
    submitted by /u/Thesmallcookie [link] [comments]  ( 41 min )
    Eminem - Rap God (Unofficial AI Music Video)
    submitted by /u/Available_Tadpole829 [link] [comments]  ( 42 min )
    Google Developed A New Robot That Can Code Itself
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 42 min )
    Perceptron
    submitted by /u/smorga [link] [comments]  ( 43 min )
    Code and models now available for 'MotionBERT: Unified Pretraining for Human Motion Analysis'
    submitted by /u/ai-lover [link] [comments]  ( 44 min )
    It Was Worth A Try
    submitted by /u/galaxy7474 [link] [comments]  ( 44 min )
    Talking to a fake LaMDA about Sentience
    submitted by /u/KarneyHatch [link] [comments]  ( 45 min )
  • Open

    DSC Weekly 8 Nov 2022 – Layoffs in the Work From Home Era
    This week saw the news that two major automotive companies, Ford and VW, were walking away from a multi-billion dollar investment into Argo AI, a venture intended to build self-driving vehicles. Instead, the companies hope to roll at least some of that effort back into augmenting drivers' abilities to drive safely and efficiently. The post DSC Weekly 8 Nov 2022 – Layoffs in the Work From Home Era appeared first on Data Science Central.  ( 22 min )
    Leveraging Agile to Create Economies of Learning Mindset – Part 2
    CDO Data-to-Business Innovation Dilemma: Deliver meaningful and relevant business outcomes in the short-term while simultaneously and continuously building and transforming the organization’s data and analytics assets and capabilities. The post Leveraging Agile to Create Economies of Learning Mindset – Part 2 appeared first on Data Science Central.  ( 23 min )
    What is a Softphone—and Why Do We Use It?
    If you are looking for VoIP options, a softphone is your best bet. Softphones are software applications that help you make calls through a computer or a mobile device. Softphones have all the features of a desk phone. For example, you can transfer calls, hold the line, and toggle between multiple lines like any other… Read More »What is a Softphone—and Why Do We Use It? The post What is a Softphone—and Why Do We Use It? appeared first on Data Science Central.  ( 20 min )
    Cybercriminals Exposed: 5 Kinds and How They Operate
    Cybercriminals across the globe attack vulnerable systems daily. But who are these people, and what motivates them to perform these illegal acts? Some might even call them terrorists, and in some cases they are. To protect yourself and your organization you need to have a clear understanding of the kinds of cyber criminals out there,… Read More »Cybercriminals Exposed: 5 Kinds and How They Operate The post Cybercriminals Exposed: 5 Kinds and How They Operate appeared first on Data Science Central.  ( 20 min )
  • Open

    “ID + Selfie” – Improving digital identity verification using AWS
    The COVID-19 global pandemic has accelerated the need to verify and onboard users online across several industries, such as financial services, insurance, and healthcare. When it comes to user experience it is crucial to provide a frictionless transaction while maintaining a high standard for identity verification.  The question is, how do you verify real people […]  ( 7 min )
    Getting started with deploying real-time models on Amazon SageMaker
    Amazon SageMaker is a fully-managed service that provides every developer and data scientist with the ability to quickly build, train, and deploy machine learning (ML) models at scale. ML is realized in inference. SageMaker offers four Inference options: Real-Time Inference Serverless Inference Asynchronous Inference Batch Transform These four options can be broadly classified into Online […]  ( 13 min )
    Predict lung cancer survival status using multimodal data on Amazon SageMaker JumpStart
    Non-small cell lung cancer (NSCLC) is the most common type of lung cancer, and is composed of tumors with significant molecular heterogeneity resulting from differences in intrinsic oncogenic signaling pathways [1]. Enabling precision medicine, anticipating patient preferences, detecting disease, and improving care quality for NSCLC patients are important topics among healthcare and life sciences (HCLS) […]  ( 17 min )
  • Open

    Research Focus: Week of November 7, 2022
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Microsoft Turing Universal Language Representation model, T-ULRv6, tops both XTREME and GLUE leaderboards with a single model Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, […] The post Research Focus: Week of November 7, 2022 appeared first on Microsoft Research.  ( 10 min )
  • Open

    The AI Therapist Will See You Now
    Gamifying the Patient-Therapist Relationship With Reinforcement Learning Continue reading on Becoming Human: Artificial Intelligence Magazine »
    10 Examples of Digital Technology in Retail Stores | inVerita
    The unprecedented popularity of online shopping has created a threat to the existence and relevance of physical retail. The widespread use… Continue reading on Becoming Human: Artificial Intelligence Magazine »
    Semantic Segmentation| what is it and how does it help? — TechnologyHQ
    No content preview
    AI in Healthcare: Trends and Applications
    No content preview
    The Confusion Between Intelligence, Consciousness And Sentience Will Lead To Destructive AI
    No content preview
  • Open

    First Ever NASA Humanoid Robot To Release Before Tesla's Optimus | New Nvidia AI Tops OpenAI DALLE-2 & Google Imagen For Text To Image Output | New Meta AI Performs 60 Times Faster Than DeepMind AlphaFold
    submitted by /u/kenickh [link] [comments]  ( 46 min )
  • Open

    3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’
    The warm, friendly animation Mushroom Spirit is featured In the NVIDIA Studio this week, modeled by talented 3D illustrator Julie Greenberg, aka Juliestrator. The post 3D Illustrator Juliestrator Makes Marvelous Mushroom Magic This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 8 min )
  • Open

    What's the intuition behind the reward scaling with discounted return statistics?
    PPO has a lot of code-level optimizations, among which reward scaling can play an important role both in stabilizing the behavior of the policy gradient and in shaping the target used to train the critic. However, what confuses me is that the scheme claimed in this paper relates the reward statistics to the running variance of the discounted return; it doesn't give any insight, and empirical success alone can't persuade me. So I want to hear your understanding: these seemingly inconspicuous tricks affect performance drastically, but have not been studied well enough. https://preview.redd.it/t91eyknavpy91.png?width=1266&format=png&auto=webp&s=da997def68d70c250e703ef0a916d54f77106366 submitted by /u/OutOfCharm [link] [comments]  ( 52 min )
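    For reference, the trick under discussion fits in a few lines. A sketch of the common implementation, using a Welford running-variance estimator (the discount factor and per-environment bookkeeping are illustrative):

        import numpy as np

        class RewardScaler:
            # Keep a running discounted return R_t = gamma * R_{t-1} + r_t and
            # divide each reward by the running std of R_t, so the critic's
            # target has roughly unit scale regardless of the task's rewards.
            def __init__(self, gamma=0.99, eps=1e-8):
                self.gamma, self.eps = gamma, eps
                self.R = 0.0
                self.count, self.mean, self.m2 = 0, 0.0, 0.0  # Welford state

            def scale(self, reward):
                self.R = self.gamma * self.R + reward
                self.count += 1
                delta = self.R - self.mean
                self.mean += delta / self.count
                self.m2 += delta * (self.R - self.mean)
                var = self.m2 / self.count if self.count > 1 else 1.0
                return reward / (var ** 0.5 + self.eps)

        scaler = RewardScaler()
        scaled = [scaler.scale(r) for r in np.random.randn(1000) * 50]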
    First visit and every visit Monte Carlo methods
    What are the theoretical limitations of first-visit and every-visit Monte Carlo? I understand the definitional differences, and that both converge to the optimal Q values under certain conditions; however, are there situations in which one is preferred over the other? submitted by /u/dahkneela [link] [comments]  ( 55 min )
  • Open

    Explanation Uncertainty with Decision Boundary Awareness. (arXiv:2210.02419v2 [cs.LG] UPDATED)
    Post-hoc explanation methods have become increasingly depended upon for understanding black-box classifiers in high-stakes applications, precipitating a need for reliable explanations. While numerous explanation methods have been proposed, recent works have shown that many existing methods can be inconsistent or unstable. In addition, high-performing classifiers are often highly nonlinear and can exhibit complex behavior around the decision boundary, leading to brittle or misleading local explanations. Therefore, there is an impending need to quantify the uncertainty of such explanation methods in order to understand when explanations are trustworthy. We introduce a novel uncertainty quantification method parameterized by a Gaussian Process model, which combines the uncertainty approximation of existing methods with a novel geodesic-based similarity which captures the complexity of the target black-box decision boundary. The proposed framework is highly flexible; it can be used with any black-box classifier and feature attribution method to amortize uncertainty estimates for explanations. We show theoretically that our proposed geodesic-based kernel similarity increases with the complexity of the decision boundary. Empirical results on multiple tabular and image datasets show that our decision boundary-aware uncertainty estimate improves understanding of explanations as compared to existing methods.  ( 2 min )
    Fully Bayesian inference for latent variable Gaussian process models. (arXiv:2211.02218v1 [stat.ML])
    Real engineering and scientific applications often involve one or more qualitative inputs. Standard Gaussian processes (GPs), however, cannot directly accommodate qualitative inputs. The recently introduced latent variable Gaussian process (LVGP) overcomes this issue by first mapping each qualitative factor to underlying latent variables (LVs), and then uses any standard GP covariance function over these LVs. The LVs are estimated similarly to the other GP hyperparameters through maximum likelihood estimation, and then plugged into the prediction expressions. However, this plug-in approach will not account for uncertainty in estimation of the LVs, which can be significant especially with limited training data. In this work, we develop a fully Bayesian approach for the LVGP model and for visualizing the effects of the qualitative inputs via their LVs. We also develop approximations for scaling up LVGPs and fully Bayesian inference for the LVGP hyperparameters. We conduct numerical studies comparing plug-in inference against fully Bayesian inference over a few engineering models and material design applications. In contrast to previous studies on standard GP modeling that have largely concluded that a fully Bayesian treatment offers limited improvements, our results show that for LVGP modeling it offers significant improvements in prediction accuracy and uncertainty quantification over the plug-in approach.  ( 2 min )
    Can Querying for Bias Leak Protected Attributes? Achieving Privacy With Smooth Sensitivity. (arXiv:2211.02139v1 [cs.CR])
    Existing regulations prohibit model developers from accessing protected attributes (gender, race, etc.), often resulting in fairness assessments on populations without knowing their protected groups. In such scenarios, institutions often adopt a separation between the model developers (who train models with no access to the protected attributes) and a compliance team (who may have access to the entire dataset for auditing purposes). However, the model developers might be allowed to test their models for bias by querying the compliance team for group fairness metrics. In this paper, we first demonstrate that simply querying for fairness metrics, such as statistical parity and equalized odds, can leak the protected attributes of individuals to the model developers. We demonstrate that there always exist strategies by which the model developers can identify the protected attribute of a targeted individual in the test dataset from just a single query. In particular, we show that one can reconstruct the protected attributes of all the individuals from $O(N_k \log(n/N_k))$ queries when $N_k \ll n$ using techniques from compressed sensing ($n$: size of the test dataset, $N_k$: size of the smallest group). Our results pose an interesting debate in algorithmic fairness: should querying for fairness metrics be viewed as a neutral-valued solution to ensure compliance with regulations? Or, does it constitute a violation of regulations and privacy if the number of queries answered is enough for the model developers to identify the protected attributes of specific individuals? To address this supposed violation, we also propose Attribute-Conceal, a novel technique that achieves differential privacy by calibrating noise to the smooth sensitivity of our bias query, outperforming naive techniques such as the Laplace mechanism. We also include experimental results on the Adult dataset and synthetic data (broad range of parameters).  ( 3 min )
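    One plausible way a single query can leak an attribute, sketched under the assumption that the developer fully controls the submitted predictions; the paper's exact construction may differ.

        import numpy as np

        rng = np.random.default_rng(0)
        groups = rng.integers(0, 2, size=100)  # hidden protected attributes
        n0, n1 = (groups == 0).sum(), (groups == 1).sum()

        def parity_query(preds):
            """Compliance team answers: P(yhat=1|g=0) - P(yhat=1|g=1)."""
            return preds[groups == 0].mean() - preds[groups == 1].mean()

        # The developer controls the model, so they can submit predictions
        # that are 1 only for the targeted individual (index 0 here).
        probe = np.zeros(100)
        probe[0] = 1.0
        gap = parity_query(probe)
        # gap = 1/n0 if the target is in group 0, and -1/n1 otherwise.
        leaked = 0 if gap > 0 else 1
        assert leaked == groups[0]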
    No Agreement Without Loss: Learning and Social Choice in Peer Review. (arXiv:2211.02144v1 [cs.AI])
    In peer review systems, reviewers are often asked to evaluate various features of submissions, such as technical quality or novelty. A score is given to each of the predefined features and based on these the reviewer has to provide an overall quantitative recommendation. However, reviewers differ in how much they value different features. It may be assumed that each reviewer has her own mapping from a set of criteria scores (score vectors) to a recommendation, and that different reviewers have different mappings in mind. Recently, Noothigattu, Shah and Procaccia introduced a novel framework for obtaining an aggregated mapping by means of Empirical Risk Minimization based on $L(p,q)$ loss functions, and studied its axiomatic properties in the sense of social choice theory. We provide a body of new results about this framework. On the one hand we study a trade-off between strategy-proofness and the ability of the method to properly capture agreements of the majority of reviewers. On the other hand, we show that dropping a certain unrealistic assumption renders the previously reported results invalid. Moreover, in the general case, strategy-proofness fails dramatically in the sense that a reviewer is able to make significant changes to the solution in her favor by arbitrarily small changes to her true beliefs. In particular, no approximate version of strategy-proofness is possible in this general setting since the method is not even continuous w.r.t. the data. Finally, we propose a modified aggregation algorithm which is continuous and show that it has good axiomatic properties.  ( 3 min )
    Music Mixing Style Transfer: A Contrastive Learning Approach to Disentangle Audio Effects. (arXiv:2211.02247v1 [eess.AS])
    We propose an end-to-end music mixing style transfer system that converts the mixing style of an input multitrack to that of a reference song. This is achieved with an encoder pre-trained with a contrastive objective to extract only audio-effects-related information from a reference music recording. All our models are trained in a self-supervised manner from an already-processed wet multitrack dataset with an effective data preprocessing method that alleviates the data scarcity of obtaining unprocessed dry data. We analyze the proposed encoder for the disentanglement capability of audio effects and also validate its performance for mixing style transfer through both objective and subjective evaluations. From the results, we show that the proposed system not only converts the mixing style of multitrack audio to be close to a reference, but is also robust in mixture-wise style transfer when combined with a music source separation model.  ( 2 min )
    AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. (arXiv:2205.12410v2 [cs.CL] UPDATED)
    Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters, and storing a large copy of the PLM weights for every task, resulting in increased costs for storing, sharing and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, where small trainable components are injected in the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules -- given the underlying PEFT method of choice -- introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the same computational cost and the number of tunable parameters as the underlying PEFT method. By only tuning 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.  ( 2 min )
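    A rough sketch of the mixture-of-adaptations idea with LoRA-style modules over a frozen linear layer; the stochastic routing and the weight-averaging merge follow the spirit of the abstract, but the module count, rank, and exact routing rule are illustrative assumptions.

        import torch
        import torch.nn as nn

        class MixtureOfLoRA(nn.Module):
            """Several low-rank adaptations over a frozen linear layer:
            one module is sampled per forward pass during training, and
            their averaged weights are used at inference (a sketch)."""

            def __init__(self, frozen: nn.Linear, n_experts=4, rank=8):
                super().__init__()
                self.frozen = frozen
                for p in self.frozen.parameters():
                    p.requires_grad_(False)  # keep PLM weights frozen
                d_in, d_out = frozen.in_features, frozen.out_features
                self.A = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
                self.B = nn.Parameter(torch.zeros(n_experts, rank, d_out))

            def forward(self, x):
                if self.training:  # stochastic routing over modules
                    k = int(torch.randint(self.A.shape[0], (1,)))
                    delta = x @ self.A[k] @ self.B[k]
                else:              # merge modules by weight averaging
                    delta = x @ self.A.mean(0) @ self.B.mean(0)
                return self.frozen(x) + delta

    Because B starts at zero, the module initially reproduces the frozen layer exactly, which keeps early fine-tuning stable; merging by averaging keeps inference cost identical to a single adaptation module.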
    Neural RELAGGS. (arXiv:2211.02363v1 [cs.LG])
    Multi-relational databases are the basis of most consolidated data collections in science and industry today. Most learning and mining algorithms, however, require data to be represented in a propositional form. While there is a variety of specialized machine learning algorithms that can operate directly on multi-relational data sets, propositionalization algorithms transform multi-relational databases into propositional data sets, thereby allowing the application of traditional machine learning and data mining algorithms without their modification. One prominent propositionalization algorithm is RELAGGS by Krogel and Wrobel, which transforms the data by nested aggregations. We propose a new neural-network-based algorithm in the spirit of RELAGGS that employs trainable composite aggregate functions instead of the static aggregate functions used in the original approach. In this way, we can jointly train the propositionalization with the prediction model, or, alternatively, use the learned aggregations as embeddings in other algorithms. We demonstrate the increased predictive performance by comparing N-RELAGGS with RELAGGS and multiple other state-of-the-art algorithms.  ( 2 min )
    An Adaptive Batch Normalization in Deep Learning. (arXiv:2211.02050v1 [cs.LG])
    Batch Normalization (BN) is a way to accelerate and stabilize training in deep convolutional neural networks. However, BN is applied uniformly within the network structure, although some training data may not require it. In this research work, we propose a threshold-based adaptive BN approach that separates the data that requires the BN from the data that does not. The experimental evaluation demonstrates that the proposed approach outperforms traditional BN, mostly at small batch sizes, on MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. It also reduces the occurrence of internal variable transformation, increasing network stability.  ( 2 min )
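    A hedged sketch of what such a layer might look like; the abstract does not specify the threshold statistic, so the per-sample bypass criterion below is an assumption.

        import torch
        import torch.nn as nn

        class ThresholdBN2d(nn.Module):
            """Adaptive BN sketch: samples whose channel statistics already
            look centred bypass the BN, the rest are normalized as usual."""

            def __init__(self, channels, threshold=0.1):
                super().__init__()
                self.bn = nn.BatchNorm2d(channels)
                self.threshold = threshold

            def forward(self, x):
                # Per-sample absolute deviation of the feature mean from 0.
                dev = x.mean(dim=(1, 2, 3)).abs()
                needs_bn = dev > self.threshold
                out = x.clone()
                if needs_bn.any():
                    out[needs_bn] = self.bn(x[needs_bn])
                return out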
    Improved Adaptive Algorithm for Scalable Active Learning with Weak Labeler. (arXiv:2211.02233v1 [cs.LG])
    Active learning with strong and weak labelers considers a practical setting where we have access to both costly but accurate strong labelers and inaccurate but cheap predictions provided by weak labelers. We study this problem in the streaming setting, where decisions must be taken \textit{online}. We design a novel algorithmic template, Weak Labeler Active Cover (WL-AC), that is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy. Prior active learning algorithms with access to weak labelers learn a difference classifier which predicts where the weak labels differ from strong labelers; this requires the strong assumption of realizability of the difference classifier (Zhang and Chaudhuri, 2015). WL-AC bypasses this \textit{realizability} assumption and thus is applicable to many real-world scenarios such as random corrupted weak labels and high dimensional families of difference classifiers (\textit{e.g.,} deep neural nets). Moreover, WL-AC cleverly trades off evaluating the quality with full exploitation of weak labelers, which allows any active learning strategy to be converted into one that can leverage weak labelers. We provide an instantiation of this template that achieves the optimal query complexity for any given weak labeler, without knowing its accuracy a priori. Empirically, we propose an instantiation of the WL-AC template that can be efficiently implemented for large-scale models (\textit{e.g.,} deep neural nets) and show its effectiveness on the corrupted-MNIST dataset by significantly reducing the number of labels while keeping the same accuracy as in passive learning.  ( 3 min )
    A Survey on Reinforcement Learning in Aviation Applications. (arXiv:2211.02147v1 [eess.SY])
    Compared with model-based control and optimization methods, reinforcement learning (RL) provides a data-driven, learning-based framework to formulate and solve sequential decision-making problems. The RL framework has become promising due to greatly improved data availability and computing power in the aviation industry. Many aviation-based applications can be formulated or treated as sequential decision-making problems. Some of them are offline planning problems, while others need to be solved online and are safety-critical. In this survey paper, we first describe standard RL formulations and solutions. Then we survey the landscape of existing RL-based applications in aviation. Finally, we summarize the paper, identify the technical gaps, and suggest future directions of RL research in aviation.  ( 2 min )
    Towards Asteroid Detection in Microlensing Surveys with Deep Learning. (arXiv:2211.02239v1 [astro-ph.EP])
    Asteroids are an indelible part of most astronomical surveys though only a few surveys are dedicated to their detection. Over the years, high cadence microlensing surveys have amassed several terabytes of data while scanning primarily the Galactic Bulge and Magellanic Clouds for microlensing events and thus provide a treasure trove of opportunities for scientific data mining. In particular, numerous asteroids have been observed by visual inspection of selected images. This paper presents novel deep learning-based solutions for the recovery and discovery of asteroids in the microlensing data gathered by the MOA project. Asteroid tracklets can be clearly seen by combining all the observations on a given night and these tracklets inform the structure of the dataset. Known asteroids were identified within these composite images and used for creating the labelled datasets required for supervised learning. Several custom CNN models were developed to identify images with asteroid tracklets. Model ensembling was then employed to reduce the variance in the predictions as well as to improve the generalisation error, achieving a recall of 97.67%. Furthermore, the YOLOv4 object detector was trained to localize asteroid tracklets, achieving a mean Average Precision (mAP) of 90.97%. These trained networks will be applied to 16 years of MOA archival data to find both known and unknown asteroids that have been observed by the survey over the years. The methodologies developed can be adapted for use by other surveys for asteroid recovery and discovery.  ( 3 min )
    Overcoming Barriers to Skill Injection in Language Modeling: Case Study in Arithmetic. (arXiv:2211.02098v1 [cs.CL])
    Through their transfer learning abilities, highly-parameterized large pre-trained language models have dominated the NLP landscape for a multitude of downstream language tasks. Though linguistically proficient, the inability of these models to incorporate the learning of non-linguistic entities (numerals and arithmetic reasoning) limits their usage for tasks that require numeric comprehension or strict mathematical reasoning. However, as we illustrate in this paper, building a general purpose language model that also happens to be proficient in mathematical reasoning is not as straightforward as training it on a numeric dataset. In this work, we develop a novel framework that enables language models to be mathematically proficient while retaining their linguistic prowess. Specifically, we offer information-theoretic interventions to overcome the catastrophic forgetting of linguistic skills that occurs while injecting non-linguistic skills into language models.  ( 2 min )
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v1 [math.ST])
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
    Flows for Flows: Training Normalizing Flows Between Arbitrary Distributions with Maximum Likelihood Estimation. (arXiv:2211.02487v1 [cs.LG])
    Normalizing flows are constructed from a base distribution with a known density and a diffeomorphism with a tractable Jacobian. The base density of a normalizing flow can be parameterised by a different normalizing flow, thus allowing maps to be found between arbitrary distributions. We demonstrate and explore the utility of this approach and show it is particularly interesting in the case of conditional normalizing flows and for introducing optimal transport constraints on maps that are constructed using normalizing flows.  ( 2 min )
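    The construction is easy to state in code: the base of the outer flow is itself a flow. A minimal sketch using torch distributions, with affine bijectors standing in for the learned maps (all names and parameter values are illustrative).

        import torch
        import torch.distributions as D

        # Inner flow: plays the role of a flow already fitted to the
        # target distribution p.
        inner_base = D.Normal(torch.zeros(2), torch.ones(2))
        inner_flow = D.TransformedDistribution(
            inner_base, [D.AffineTransform(loc=1.0, scale=2.0)])

        # Outer flow: a learnable bijector whose *base* is the inner flow,
        # so maximum likelihood on samples from the source distribution q
        # learns a map between q and p.
        scale = torch.tensor([0.5, 0.5], requires_grad=True)
        outer = D.TransformedDistribution(
            inner_flow, [D.AffineTransform(loc=0.0, scale=scale)])

        x = torch.randn(128, 2)          # stand-in samples from q
        loss = -outer.log_prob(x).sum()  # maximum likelihood objective
        loss.backward()                  # gradients reach only the outer map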
    Fairness in Federated Learning via Core-Stability. (arXiv:2211.02091v1 [cs.LG])
    Federated learning provides an effective paradigm to jointly optimize a model benefiting from rich distributed data while protecting data privacy. Nonetheless, the heterogeneous nature of distributed data makes it challenging to define and ensure fairness among local agents. For instance, it is intuitively "unfair" for agents with data of high quality to sacrifice their performance due to other agents with low quality data. Currently popular egalitarian and weighted equity-based fairness measures suffer from the aforementioned pitfall. In this work, we aim to formally represent this problem and address these fairness issues using concepts from co-operative game theory and social choice theory. We model the task of learning a shared predictor in the federated setting as a fair public decision making problem, and then define the notion of core-stable fairness: Given $N$ agents, there is no subset of agents $S$ that can benefit significantly by forming a coalition among themselves based on their utilities $U_N$ and $U_S$ (i.e., $\frac{|S|}{N} U_S \geq U_N$). Core-stable predictors are robust to low quality local data from some agents, and additionally they satisfy Proportionality and Pareto-optimality, two well sought-after fairness and efficiency notions within social choice. We then propose an efficient federated learning protocol CoreFed to optimize a core-stable predictor. CoreFed determines a core-stable predictor when the loss functions of the agents are convex. CoreFed also determines approximate core-stable predictors when the loss functions are not convex, like smooth neural networks. We further show the existence of core-stable predictors in more general settings using Kakutani's fixed point theorem. Finally, we empirically validate our analysis on two real-world datasets, and we show that CoreFed achieves higher core-stability fairness than FedAvg while having similar accuracy.  ( 3 min )
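    For small $N$, the condition can be checked by brute force once an oracle for coalition utilities is available; a sketch, where the `coalition_value` oracle (e.g., retraining on the coalition's data) and the strictness convention are assumptions.

        from itertools import combinations

        def blocking_coalitions(agents, U_N, coalition_value):
            """Yield coalitions S violating core stability, per the
            condition above: S blocks if (|S|/N) * U_S >= U_N holds
            component-wise on S, with at least one strict inequality.

            U_N: dict agent -> utility under the grand coalition.
            coalition_value(S): dict agent -> utility under coalition S."""
            N = len(agents)
            for r in range(1, N + 1):
                for S in combinations(agents, r):
                    U_S = coalition_value(S)
                    scaled = {i: len(S) / N * U_S[i] for i in S}
                    if all(scaled[i] >= U_N[i] for i in S) and \
                       any(scaled[i] > U_N[i] for i in S):
                        yield S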
    Decomposing Counterfactual Explanations for Consequential Decision Making. (arXiv:2211.02151v1 [cs.LG])
    The goal of algorithmic recourse is to reverse unfavorable decisions (e.g., from loan denial to approval) under automated decision making by suggesting actionable feature changes (e.g., reduce the number of credit cards). To generate low-cost recourse, the majority of methods work under the assumption that the features are independently manipulable (IMF). To address the feature dependency issue, the recourse problem is usually studied through the causal recourse paradigm. However, it is well known that strong assumptions, as encoded in causal models and structural equations, hinder the applicability of these methods in complex domains where causal dependency structures are ambiguous. In this work, we develop \texttt{DEAR} (DisEntangling Algorithmic Recourse), a novel and practical recourse framework that bridges the gap between the IMF and the strong causal assumptions. \texttt{DEAR} generates recourses by disentangling the latent representation of co-varying features from a subset of promising recourse features to capture the main practical recourse desiderata. Our experiments on real-world data corroborate our theoretically motivated recourse model and highlight our framework's ability to provide reliable, low-cost recourse in the presence of feature dependencies.  ( 2 min )
    Fault Diagnosis for Power Electronics Converters based on Deep Feedforward Network and Wavelet Compression. (arXiv:2211.02632v1 [eess.SP])
    A fault diagnosis method for power electronics converters based on a deep feedforward network and wavelet compression is proposed in this paper. The transient historical data after wavelet compression are used to train the fault diagnosis classifier. Firstly, a correlation analysis of the voltage or current data recorded under various fault states is performed to remove redundant features and sampling points. Secondly, the wavelet transform is used to remove redundant data from the features, so the training sample data are greatly compressed. The deep feedforward network is trained on the low-frequency components of the features, which greatly accelerates training. The average accuracy of the fault diagnosis classifier can reach over 97%. Finally, the fault diagnosis classifier is tested, and the final diagnosis result is determined from multiple groups of transient data, which improves the reliability of the diagnosis results. The experimental results show that the classifier has strong generalization ability and can accurately locate the open-circuit faults in IGBTs.  ( 2 min )
    Spatial Graph Signal Interpolation with an Application for Merging BCI Datasets with Various Dimensionalities. (arXiv:2211.02624v1 [eess.SP])
    BCI Motor Imagery datasets are usually small and have different electrode setups. When training a Deep Neural Network, one may want to capitalize on all these datasets to increase the amount of data available and hence obtain good generalization results. To this end, we introduce a spatial graph signal interpolation technique that allows multiple electrodes to be interpolated efficiently. We conduct a set of experiments with five BCI Motor Imagery datasets comparing the proposed interpolation with spherical spline interpolation. We believe that this work provides novel ideas on how to leverage graphs to interpolate electrodes and on how to homogenize multiple datasets.  ( 2 min )
    Domain Adaptation under Missingness Shift. (arXiv:2211.02093v1 [cs.LG])
    Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that when missing data indicators are available, DAMS can reduce to covariate shift. Focusing on the setting where missing data indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal source predictor can perform worse on the target domain than a constant one; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.  ( 2 min )
    Hardware-accelerated Mars Sample Localization via deep transfer learning from photorealistic simulations. (arXiv:2206.02622v2 [cs.CV] UPDATED)
    The goal of the Mars Sample Return campaign is to collect soil samples from the surface of Mars and return them to Earth for further study. The samples will be acquired and stored in metal tubes by the Perseverance rover and deposited on the Martian surface. As part of this campaign, it is expected that the Sample Fetch Rover will be in charge of localizing and gathering up to 35 sample tubes over 150 Martian sols. Autonomous capabilities are critical for the success of the overall campaign and for the Sample Fetch Rover in particular. This work proposes a novel system architecture for the autonomous detection and pose estimation of the sample tubes. For the detection stage, a Deep Neural Network and transfer learning from a synthetic dataset are proposed. The dataset is created from photorealistic 3D simulations of Martian scenarios. Additionally, the poses of the sample tubes are estimated using Computer Vision techniques such as contour detection and line fitting on the detected area. Finally, laboratory tests of the Sample Localization procedure are performed using the ExoMars Testing Rover on a Mars-like testbed. These tests validate the proposed approach on different hardware architectures, providing promising results for sample detection and pose estimation.
    A Graph Convolution for Signed Directed Graphs. (arXiv:2208.11511v2 [cs.LG] UPDATED)
    There are several types of graphs according to the nature of the data. Directed graphs have directions on their links, and signed graphs have link types such as positive and negative. Signed directed graphs, which have both, are the most complex and informative. Graph convolutions for signed directed graphs have received little attention so far: although many graph convolution studies exist, most are designed for undirected or unsigned graphs. In this paper, we investigate a spectral graph convolution network for signed directed graphs. We propose a novel complex Hermitian adjacency matrix that encodes graph information via complex numbers. The complex numbers represent link direction, sign, and connectivity via their phases and magnitudes. Then, we define a magnetic Laplacian with the Hermitian matrix and prove its positive semidefinite property. Finally, we introduce the Signed Directed Graph Convolution Network (SD-GCN). To the best of our knowledge, it is the first spectral convolution for graphs with signs. Moreover, unlike existing convolutions designed for a specific graph type, the proposed model has the generality to be applied to any graph, whether undirected, directed, or signed. The performance of the proposed model was evaluated with four real-world graphs. It outperforms all the other state-of-the-art graph convolutions in the task of link sign prediction.
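    A sketch of one common way to build such a matrix, where the magnitude carries sign and connectivity and the phase carries direction; the parameterization below follows magnetic-Laplacian conventions (with a charge parameter q), and the paper's exact encoding may differ.

        import numpy as np

        def hermitian_adjacency(n, edges, q=0.25):
            """Complex Hermitian adjacency for a signed directed graph.

            edges: list of (u, v, sign) with sign in {+1, -1}; a directed
            edge u->v contributes a phase rotation of exp(2*pi*i*q) at
            (u, v) and its conjugate at (v, u), so H = H^dagger holds."""
            H = np.zeros((n, n), dtype=complex)
            phase = np.exp(2j * np.pi * q)
            for u, v, s in edges:
                H[u, v] += s * phase
                H[v, u] += s * np.conj(phase)
            return H

        H = hermitian_adjacency(3, [(0, 1, +1), (1, 2, -1)])
        assert np.allclose(H, H.conj().T)  # Hermitian by construction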
    Shapley value-based approaches to explain the robustness of classifiers in machine learning. (arXiv:2209.04254v2 [cs.LG] UPDATED)
    The use of algorithm-agnostic approaches is an emerging area of research for explaining the contribution of individual features towards the predicted outcome. Whilst there is a focus on explaining the prediction itself, little has been done on explaining the robustness of these models, that is, how each feature contributes towards achieving that robustness. In this paper, we propose the use of Shapley values to explain the contribution of each feature towards the model's robustness, measured in terms of the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve (AUC). With the help of an illustrative example, we demonstrate the proposed idea of explaining the ROC curve, and visualising the uncertainties in these curves. For imbalanced datasets, the use of the Precision-Recall Curve (PRC) is considered more appropriate; therefore, we also demonstrate how to explain the PRCs with the help of Shapley values. The explanation of robustness can help analysts in a number of ways, for example, it can help in feature selection by identifying the irrelevant features that can be removed to reduce the computational complexity. It can also help in identifying the features having critical or negative contributions towards robustness.
    Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt & Predict Paradigm (P5). (arXiv:2203.13366v6 [cs.IR] UPDATED)
    For a long time, different recommendation tasks have typically required designing task-specific architectures and training objectives. As a result, it is hard to transfer the learned knowledge and representations from one task to another, thus restricting the generalization ability of existing recommendation approaches, e.g., a sequential recommendation model can hardly be applied or transferred to a review generation method. To deal with such issues, considering that language can describe almost anything and language grounding is a powerful medium to represent various problems or tasks, we present a flexible and unified text-to-text paradigm called "Pretrain, Personalized Prompt, and Predict Paradigm" (P5) for recommendation, which unifies various recommendation tasks in a shared framework. In P5, all data such as user-item interactions, user descriptions, item metadata, and user reviews are converted to a common format -- natural language sequences. The rich information from natural language assists P5 in capturing deeper semantics for personalization and recommendation. Specifically, P5 learns different tasks with the same language modeling objective during pretraining. Thus, it serves as the foundation model for various downstream recommendation tasks, allows easy integration with other modalities, and enables instruction-based recommendation based on prompts. P5 advances recommender systems from shallow models to deep models to big models, and points towards a universal recommendation engine. With adaptive personalized prompts for different users, P5 is able to make predictions in a zero-shot or few-shot manner and largely reduces the necessity for extensive fine-tuning. On several recommendation benchmarks, we conduct experiments to show the effectiveness of P5. We release the source code at https://github.com/jeykigung/P5.
    From Shapley Values to Generalized Additive Models and back. (arXiv:2209.04012v2 [cs.LG] UPDATED)
    In explainable machine learning, local post-hoc explanation algorithms and inherently interpretable models are often seen as competing approaches. In this work, we offer a partial reconciliation between these two approaches by showing that Shapley Values correspond to Generalized Additive Models (GAMs). We introduce $n$-Shapley Values, a parametric family of local post-hoc explanation algorithms that explain individual predictions with interaction terms up to order $n$. By varying the parameter $n$, these explanations cover the entire range from Shapley Values up to a uniquely determined decomposition of the function that we attempt to explain. The relationship between $n$-Shapley Values and this decomposition offers a functionally-grounded characterization of Shapley Values, and highlights the limitations of these explanations. We then show that $n$-Shapley Values recover GAMs with interaction terms up to order $n$, which implies that the original Shapley Values recover GAMs without interaction terms. Taken together, our results offer a precise characterization of Shapley Values as they are being used in explainable machine learning. Python code to estimate $n$-Shapley Values and replicate the results in this paper is available at \url{https://github.com/tml-tuebingen/nshap}.
    Uncertainty Estimation for Computed Tomography with a Linearised Deep Image Prior. (arXiv:2203.00479v2 [eess.IV] UPDATED)
    Existing deep-learning based tomographic image reconstruction methods do not provide accurate estimates of reconstruction uncertainty, hindering their real-world deployment. This paper develops a method, termed as the linearised deep image prior (DIP), to estimate the uncertainty associated with reconstructions produced by the DIP with total variation regularisation (TV). Specifically, we endow the DIP with conjugate Gaussian-linear model type error-bars computed from a local linearisation of the neural network around its optimised parameters. To preserve conjugacy, we approximate the TV regulariser with a Gaussian surrogate. This approach provides pixel-wise uncertainty estimates and a marginal likelihood objective for hyperparameter optimisation. We demonstrate the method on synthetic data and real-measured high-resolution 2D $\mu$CT data, and show that it provides superior calibration of uncertainty estimates relative to previous probabilistic formulations of the DIP. Our code is available at https://github.com/educating-dip/bayes_dip.
    When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits. (arXiv:2209.02570v2 [cs.LG] UPDATED)
    We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP). First, we prove the minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with $\epsilon$-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large $\epsilon$), bandits with $\epsilon$-global DP are not harder than the bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal $\epsilon$ global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate $\epsilon$-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.
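    The Laplace-mechanism ingredient is standard and easy to sketch: with rewards in [0, R], an empirical mean over m pulls has sensitivity R/m, so adding Laplace noise of scale R/(m*epsilon) makes the released mean epsilon-DP. The per-episode bookkeeping of AdaP-UCB/AdaP-KLUCB is omitted in this sketch.

        import numpy as np

        def private_mean(rewards, epsilon, reward_range=1.0, rng=None):
            """epsilon-DP release of an arm's empirical mean reward.

            Changing any one reward moves the mean by at most
            reward_range / len(rewards), which fixes the noise scale."""
            rng = rng or np.random.default_rng()
            sensitivity = reward_range / len(rewards)
            return np.mean(rewards) + rng.laplace(scale=sensitivity / epsilon)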
    Supervised Dimensionality Reduction and Image Classification Utilizing Convolutional Autoencoders. (arXiv:2208.12152v4 [cs.LG] UPDATED)
    The joint optimization of the reconstruction and classification error is a hard non-convex problem, especially when a non-linear mapping is utilized. In order to overcome this obstacle, a novel optimization strategy is proposed, in which a Convolutional Autoencoder for dimensionality reduction and a classifier composed of a Fully Connected Network are combined to simultaneously produce supervised dimensionality reduction and predictions. It turned out that this methodology can also be greatly beneficial in enforcing explainability of deep learning architectures. Additionally, the resulting Latent Space, optimized for the classification task, can be utilized to improve traditional, interpretable classification algorithms. The experimental results showed that the proposed methodology achieved competitive results against the state-of-the-art deep learning methods, while being much more efficient in terms of parameter count. Finally, it was empirically justified that the proposed methodology introduces advanced explainability regarding not only the data structure, through the produced latent space, but also the classification behaviour.
    Knowledge Management System with NLP-Assisted Annotations: A Brief Survey and Outlook. (arXiv:2206.07304v2 [cs.DB] UPDATED)
    Knowledge management systems (KMS) are in high demand for industrial researchers, chemical or research enterprises, and evidence-based decision making. However, existing systems have limitations in categorizing and organizing paper insights or relationships. Traditional databases are usually disjoint from logging systems, which limits their utility in generating concise, collated overviews. In this work, we briefly survey existing approaches in this problem space and propose a unified framework that utilizes relational databases to log hierarchical information to facilitate the research and writing process, or to generate useful knowledge from references or insights from connected concepts. Our framework of a bidirectional knowledge management system (BKMS) enables novel functionalities encompassing improved hierarchical note-taking, AI-assisted brainstorming, and multi-directional relationships. Potential applications include managing inventories and changes for manufacturing or research enterprises, or generating analytic reports with evidence-based decision making.
    FunQG: Molecular Representation Learning Via Quotient Graphs. (arXiv:2207.08597v2 [cs.LG] UPDATED)
    Learning expressive molecular representations is crucial to facilitate the accurate prediction of molecular properties. Despite the significant advancement of graph neural networks (GNNs) in molecular representation learning, they generally face limitations such as neighbors-explosion, under-reaching, over-smoothing, and over-squashing. Also, GNNs usually have high computational costs because of the large-scale number of parameters. Typically, such limitations emerge or increase when facing relatively large-size graphs or using a deeper GNN model architecture. An idea to overcome these problems is to simplify a molecular graph into a small, rich, and informative one, which is more efficient and less challenging to train GNNs. To this end, we propose a novel molecular graph coarsening framework named FunQG utilizing Functional groups, as influential building blocks of a molecule to determine its properties, based on a graph-theoretic concept called Quotient Graph. By experiments, we show that the resulting informative graphs are much smaller than the molecular graphs and thus are good candidates for training GNNs. We apply the FunQG on popular molecular property prediction benchmarks and then compare the performance of some popular baseline GNNs on the obtained datasets with the performance of several state-of-the-art baselines on the original datasets. By experiments, this method significantly outperforms previous baselines on various datasets, besides its dramatic reduction in the number of parameters and low computational costs. Therefore, the FunQG can be used as a simple, cost-effective, and robust method for solving the molecular representation learning problem.
    Memorization in NLP Fine-tuning Methods. (arXiv:2205.12506v2 [cs.CL] UPDATED)
    Large language models are shown to present privacy risks through memorization of training data, and several recent works have studied such risks for the pre-training phase. Little attention, however, has been given to the fine-tuning phase and it is not well understood how different fine-tuning methods (such as fine-tuning the full model, the model head, and adapter) compare in terms of memorization risk. This presents increasing concern as the "pre-train and fine-tune" paradigm proliferates. In this paper, we empirically study memorization of fine-tuning methods using membership inference and extraction attacks, and show that their susceptibility to attacks is very different. We observe that fine-tuning the head of the model has the highest susceptibility to attacks, whereas fine-tuning smaller adapters appears to be less vulnerable to known extraction attacks.
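    A generic loss-thresholding membership-inference probe of the kind used in this literature; a sketch, not the paper's exact attack suite, and the threshold calibration on known non-members is an assumption.

        import numpy as np

        def membership_inference(loss_fn, model, candidates, threshold):
            """Guess training membership from per-example loss.

            Samples the fine-tuned model fits unusually well (low loss)
            are flagged as likely training members; `threshold` would be
            calibrated on examples known to be non-members."""
            scores = np.array([loss_fn(model, x, y) for x, y in candidates])
            return scores < threshold  # True -> predicted training member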
    Resource-Efficient Federated Learning. (arXiv:2111.01108v2 [cs.LG] UPDATED)
    Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this work, we systematically address the question of resource efficiency in FL, showing the benefits of intelligent participant selection, and incorporation of updates from straggling participants. We demonstrate how these factors enable resource efficiency while also improving trained model quality.
    Machine Learning Simulates Agent-Based Model Towards Policy. (arXiv:2203.02576v2 [cs.MA] UPDATED)
    Public policies are not intrinsically positive or negative. Rather, policies provide varying levels of effects across different recipients. Methodologically, computational modeling enables the application of multiple influences on empirical data, thus allowing for heterogeneous responses to policies. We use a random forest machine learning algorithm to emulate an agent-based model (ABM) and evaluate competing policies across 46 Metropolitan Regions (MRs) in Brazil. In doing so, we use input parameters and output indicators of 11,076 actual simulation runs and one million emulated runs. As a result, we obtain the optimal (and non-optimal) performance of each region over the policies. Optimum is defined as a combination of GDP production and the Gini coefficient inequality indicator for the full ensemble of Metropolitan Regions. Results suggest that MRs already have embedded structures that favor optimal or non-optimal results, but they also illustrate which policy is more beneficial to each place. In addition to providing MR-specific policy results, the use of machine learning to simulate an ABM reduces the computational burden while allowing for much larger variation among model parameters. The coherence of results within the context of larger uncertainty -- vis-à-vis those of the original ABM -- reinforces the robustness of the model. At the same time, the exercise indicates which parameters policymakers should intervene on in order to work towards precise, optimal policy instruments.
    Privacy-preserving Deep Learning based Record Linkage. (arXiv:2211.02161v1 [cs.CR])
    Deep learning-based linkage of records across different databases is becoming increasingly useful in data integration and mining applications to discover new insights from multiple sources of data. However, due to privacy and confidentiality concerns, organisations often are not willing or allowed to share their sensitive data with any external parties, thus making it challenging to build/train deep learning models for record linkage across different organizations' databases. To overcome this limitation, we propose the first deep learning-based multi-party privacy-preserving record linkage (PPRL) protocol that can be used to link sensitive databases held by multiple different organisations. In our approach, each database owner first trains a local deep learning model, which is then uploaded to a secure environment and securely aggregated to create a global model. The global model is then used by a linkage unit to distinguish unlabelled record pairs as matches and non-matches. We utilise differential privacy to achieve provable privacy protection against re-identification attacks. We evaluate the linkage quality and scalability of our approach using several large real-world databases, showing that it can achieve high linkage quality while providing sufficient privacy protection against existing attacks.  ( 2 min )
    Diverse super-resolution with pretrained deep hierarchical VAEs. (arXiv:2205.10347v2 [cs.CV] UPDATED)
    Image super-resolution is a one-to-many problem, but most deep-learning based methods only provide one single solution to this problem. In this work, we tackle the problem of diverse super-resolution by reusing VD-VAE, a state-of-the-art variational autoencoder (VAE). We find that the hierarchical latent representation learned by VD-VAE naturally separates the image low-frequency information, encoded in the latent groups at the top of the hierarchy, from the image high-frequency details, determined by the latent groups at the bottom of the latent hierarchy. Starting from this observation, we design a super-resolution model exploiting the specific structure of the VD-VAE latent space. Specifically, we train an encoder to encode low-resolution images in the subset of the VD-VAE latent space encoding the low-frequency information, and we combine this encoder with the VD-VAE generative model to sample diverse super-resolved versions of a low-resolution input. We demonstrate the ability of our method to generate diverse solutions to the super-resolution problem on face super-resolution with upsampling factors x4, x8, and x16.
    Hub-Pathway: Transfer Learning from A Hub of Pre-trained Models. (arXiv:2206.03726v2 [cs.LG] UPDATED)
    Transfer learning aims to leverage knowledge from pre-trained models to benefit the target task. Prior transfer learning work mainly transfers from a single model. However, with the emergence of deep models pre-trained from different resources, model hubs consisting of diverse models with various architectures, pre-trained datasets and learning paradigms are available. Directly applying single-model transfer learning methods to each model wastes the abundant knowledge of the model hub and suffers from high computational cost. In this paper, we propose a Hub-Pathway framework to enable knowledge transfer from a model hub. The framework generates data-dependent pathway weights, based on which we assign the pathway routes at the input level to decide which pre-trained models are activated and passed through, and then set the pathway aggregation at the output level to aggregate the knowledge from different models to make predictions. The proposed framework can be trained end-to-end with the target task-specific loss, where it learns to explore better pathway configurations and exploit the knowledge in pre-trained models for each target datum. We utilize a noisy pathway generator and design an exploration loss to further explore different pathways throughout the model hub. To fully exploit the knowledge in pre-trained models, each model is further trained by specific data that activate it, which ensures its performance and enhances knowledge transfer. Experiment results on computer vision and reinforcement learning tasks demonstrate that the proposed Hub-Pathway framework achieves the state-of-the-art performance for model hub transfer learning.
    Deep neural networks for fast acquisition of aortic 3D pressure and velocity flow fields. (arXiv:2208.12156v2 [physics.flu-dyn] UPDATED)
    Computational fluid dynamics (CFD) can be used to simulate vascular haemodynamics and analyse potential treatment options. CFD has been shown to be beneficial in improving patient outcomes. However, the implementation of CFD for routine clinical use is yet to be realised. Barriers to CFD include high computational resources, the specialist experience needed for designing simulation set-ups, and long processing times. The aim of this study was to explore the use of machine learning (ML) to replicate conventional aortic CFD with automatic and fast regression models. Data used to train/test the model consisted of 3,000 CFD simulations performed on synthetically generated 3D aortic shapes. These subjects were generated from a statistical shape model (SSM) built on real patient-specific aortas (N=67). Inference performed on 200 test shapes resulted in average errors of 6.01% +/-3.12 SD and 3.99% +/-0.93 SD for pressure and velocity, respectively. Our ML-based models performed CFD in +/-0.075 seconds (4,000x faster than the solver). This proof-of-concept study shows that results from conventional vascular CFD can be reproduced using ML at a much faster rate, in an automatic process, and with high accuracy.
    Graph Lifelong Learning: A Survey. (arXiv:2202.10688v2 [cs.LG] UPDATED)
    Graph learning is a popular approach for performing machine learning on graph-structured data. It has revolutionized the machine learning ability to model graph data to address downstream tasks. Its application is wide due to the availability of graph data ranging from all types of networks to information systems. Most graph learning methods assume that the graph is static and its complete structure is known during training. This limits their applicability since they cannot be applied to problems where the underlying graph grows over time and/or new tasks emerge incrementally. Such applications require a lifelong learning approach that can learn the graph continuously and accommodate new information whilst retaining previously learned knowledge. Lifelong learning methods that enable continuous learning in regular domains like images and text cannot be directly applied to continuously evolving graph data, due to its irregular structure. As a result, graph lifelong learning is gaining attention from the research community. This survey paper provides a comprehensive overview of recent advancements in graph lifelong learning, including the categorization of existing methods, and the discussions of potential applications and open research problems.
    Analysing Diffusion-based Generative Approaches versus Discriminative Approaches for Speech Restoration. (arXiv:2211.02397v1 [eess.AS])
    Diffusion-based generative models have had a high impact on the computer vision and speech processing communities these past years. Besides data generation tasks, they have also been employed for data restoration tasks like speech enhancement and dereverberation. While discriminative models have traditionally been argued to be more powerful, e.g. for speech enhancement, generative diffusion approaches have recently been shown to narrow this performance gap considerably. In this paper, we systematically compare the performance of generative diffusion models and discriminative approaches on different speech restoration tasks. For this, we extend our prior contributions on diffusion-based speech enhancement in the complex time-frequency domain to the task of bandwidth extension. We then compare it to a discriminatively trained neural network with the same network architecture on three restoration tasks, namely speech denoising, dereverberation and bandwidth extension. We observe that the generative approach performs globally better than its discriminative counterpart on all tasks, with the strongest benefit for non-additive distortion models, as in dereverberation and bandwidth extension. Code and audio examples can be found online at https://uhh.de/inf-sp-sgmsemultitask
    Weight-based Channel-model Matrix Framework provides a reasonable solution for EEG-based cross-dataset emotion recognition. (arXiv:2209.05849v3 [eess.SP] UPDATED)
    Cross-dataset emotion recognition, an extremely challenging task in the field of EEG-based affective computing, is influenced by many factors, which makes universal models yield unsatisfactory results. Given the lack of research on decoding EEG information, we first analyzed the impact of different kinds of EEG information (individual, session, emotion, and trial) on emotion recognition by sample space visualization, quantification of sample aggregation phenomena, and energy pattern analysis on five public datasets. Based on these phenomena and patterns, we provided processing methods and interpretability work for various EEG differences. Through the analysis of emotional feature distribution patterns, the Individual Emotional Feature Distribution Difference (IEFDD) was found, which was also considered the main factor in the stability of emotion recognition. After analyzing the limitations of the traditional modeling approach under IEFDD, the Weight-based Channel-model Matrix Framework (WCMF) was proposed. To reasonably characterize emotional feature distribution patterns, four weight extraction methods were designed, of which the correction T-test (CT) weight extraction method proved optimal. Finally, the performance of WCMF was validated on cross-dataset tasks in two kinds of experiments that simulated different practical scenarios, and the results showed that WCMF provides more stable and better emotion recognition ability.
    The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games. (arXiv:2103.01955v4 [cs.LG] UPDATED)
    Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at \url{https://github.com/marlbenchmark/on-policy}.
    Sequential Likelihood-Free Inference with Neural Proposal. (arXiv:2010.07604v3 [stat.ME] UPDATED)
    Bayesian inference without likelihood evaluation, or likelihood-free inference, has been a key research topic in simulation studies for obtaining quantitatively validated simulation models on real-world datasets. As the likelihood evaluation is inaccessible, previous papers train an amortized neural network to estimate the ground-truth posterior for the simulation of interest. Training the network and accumulating the dataset alternately in a sequential manner can save the total simulation budget by orders of magnitude. In the data accumulation phase, the new simulation inputs are chosen within a portion of the total simulation budget to accumulate upon the collected dataset. The newly accumulated data degenerate because the set of simulation inputs is hardly mixed, and this degenerate data collection process ruins the posterior inference. This paper introduces a new sampling approach, called Neural Proposal (NP), for the simulation inputs that resolves the biased data collection, as it guarantees i.i.d. sampling. The experiments show the improved performance of our sampler, especially for simulations with multi-modal posteriors.
    Graph Neural Networks for Wireless Communications: From Theory to Practice. (arXiv:2203.10800v2 [cs.IT] UPDATED)
    Deep learning-based approaches have been developed to solve challenging problems in wireless communications, leading to promising results. Early attempts adopted neural network architectures inherited from applications such as computer vision. They often yield poor performance in large scale networks (i.e., poor scalability) and unseen network settings (i.e., poor generalization). To resolve these issues, graph neural networks (GNNs) have been recently adopted, as they can effectively exploit the domain knowledge, i.e., the graph topology in wireless communications problems. GNN-based methods can achieve near-optimal performance in large-scale networks and generalize well under different system settings, but the theoretical underpinnings and design guidelines remain elusive, which may hinder their practical implementations. This paper endeavors to fill both the theoretical and practical gaps. For theoretical guarantees, we prove that GNNs achieve near-optimal performance in wireless networks with much fewer training samples than traditional neural architectures. Specifically, to solve an optimization problem on an $n$-node graph (where the nodes may represent users, base stations, or antennas), GNNs' generalization error and required number of training samples are $\mathcal{O}(n)$ and $\mathcal{O}(n^2)$ times lower than the unstructured multi-layer perceptrons. For design guidelines, we propose a unified framework that is applicable to general design problems in wireless networks, which includes graph modeling, neural architecture design, and theory-guided performance enhancement. Extensive simulations, which cover a variety of important problems and network settings, verify our theory and the effectiveness of the proposed design framework.
    The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts. (arXiv:2206.08917v2 [cond-mat.mtrl-sci] UPDATED)
    The development of machine learning models for electrocatalysts requires a broad set of training data to enable their use across a wide variety of materials. One class of materials that currently lacks sufficient training data is oxides, which are critical for the development of Oxygen Evolution Reaction (OER) catalysts. To address this, we developed the Open Catalyst 2022 (OC22) dataset, consisting of 62,331 Density Functional Theory (DFT) relaxations (~9,854,504 single point calculations) across a range of oxide materials, coverages, and adsorbates. We define generalized total energy tasks that enable property prediction beyond adsorption energies; we test baseline performance of several graph neural networks; and we provide pre-defined dataset splits to establish clear benchmarks for future efforts. In the most general task, GemNet-OC sees a ~32% improvement in energy predictions when combining the chemically dissimilar Open Catalyst 2020 Dataset (OC20) and OC22 datasets via fine-tuning. Similarly, we achieved a ~19% improvement in total energy predictions on OC20 and a ~9% improvement in force predictions in OC22 when using joint training. We demonstrate the practical utility of a top performing model by capturing literature adsorption energies and important OER scaling relationships. We expect OC22 to provide an important benchmark for models seeking to incorporate intricate long-range electrostatic and magnetic interactions in oxide surfaces. The dataset and baseline models are open sourced, and a public leaderboard has been made available to encourage continued community developments on the total energy tasks and data.
    Self-Adapting Noise-Contrastive Estimation for Energy-Based Models. (arXiv:2211.02650v1 [cs.LG])
    Training energy-based models (EBMs) with noise-contrastive estimation (NCE) is theoretically feasible but practically challenging. Effective learning requires the noise distribution to be close to the target distribution, especially in high-dimensional domains. Previous works have explored modelling the noise distribution as a separate generative model, and then concurrently training this noise model with the EBM. While this method allows for more effective noise-contrastive estimation, it comes at the cost of extra memory and training complexity. Instead, this thesis proposes a self-adapting NCE algorithm which uses static instances of the EBM along its training trajectory as the noise distribution. During training, these static instances progressively converge to the target distribution, thereby circumventing the need to simultaneously train an auxiliary noise model. Moreover, we express this self-adapting NCE algorithm in the framework of Bregman divergences and show that it is a generalization of maximum likelihood learning for EBMs. The performance of our algorithm is evaluated across a range of noise update intervals, and experimental results show that shorter update intervals are conducive to higher synthesis quality.
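    A toy sketch of the core loop under stated simplifications (a 1-D target and short-run Langevin sampling from the frozen snapshot are our assumptions, not the thesis's setup): a frozen copy of the EBM serves as the NCE noise distribution and is refreshed at a fixed interval.

        # Toy sketch: a frozen snapshot of the EBM acts as the NCE noise model
        # and is refreshed periodically; Langevin sampling is a simplification.
        import copy
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        energy = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
        snapshot = copy.deepcopy(energy)            # static noise instance
        opt = torch.optim.Adam(energy.parameters(), lr=1e-3)
        data = torch.randn(512, 1) * 0.5 + 2.0      # toy 1-D target distribution

        def langevin(model, n, steps=30, eps=0.1):
            x = torch.randn(n, 1)
            for _ in range(steps):
                x = x.detach().requires_grad_(True)
                g, = torch.autograd.grad(model(x).sum(), x)
                x = x - 0.5 * eps ** 2 * g + eps * torch.randn_like(x)
            return x.detach()

        for step in range(300):
            noise = langevin(snapshot, data.shape[0])
            # NCE logit = log p_model - log p_noise = -E_model + E_snapshot (unnormalized)
            loss = (F.softplus(-(-energy(data) + snapshot(data))).mean()
                    + F.softplus(-energy(noise) + snapshot(noise)).mean())
            opt.zero_grad()
            loss.backward()
            opt.step()
            if (step + 1) % 100 == 0:               # refresh the static noise snapshot
                snapshot = copy.deepcopy(energy)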
    LDSA: Learning Dynamic Subtask Assignment in Cooperative Multi-Agent Reinforcement Learning. (arXiv:2205.02561v3 [cs.LG] UPDATED)
    Cooperative multi-agent reinforcement learning (MARL) has made prominent progress in recent years. For training efficiency and scalability, most MARL algorithms make all agents share the same policy or value network. However, in many complex multi-agent tasks, different agents are expected to possess specific abilities to handle different subtasks. In those scenarios, sharing parameters indiscriminately may lead to similar behavior across all agents, which limits exploration efficiency and degrades the final performance. To balance the training complexity and the diversity of agent behavior, we propose a novel framework to learn dynamic subtask assignment (LDSA) in cooperative MARL. Specifically, we first introduce a subtask encoder to construct a vector representation for each subtask according to its identity. To reasonably assign agents to different subtasks, we propose an ability-based subtask selection strategy, which can dynamically group agents with similar abilities into the same subtask. In this way, agents dealing with the same subtask share their learning of specific abilities, and different subtasks correspond to different specific abilities. We further introduce two regularizers: one increases the representation difference between subtasks, and the other stabilizes training by discouraging agents from frequently changing subtasks. Empirical results show that LDSA learns reasonable and effective subtask assignment for better collaboration and significantly improves the learning performance on the challenging StarCraft II micromanagement benchmark and Google Research Football.
    A Theoretical Study on Solving Continual Learning. (arXiv:2211.02633v1 [cs.LG])
    Continual learning (CL) learns a sequence of tasks incrementally. There are two popular CL settings, class incremental learning (CIL) and task incremental learning (TIL). A major challenge of CL is catastrophic forgetting (CF). While a number of techniques are already available to effectively overcome CF for TIL, CIL remains highly challenging. So far, little theoretical study has been done to provide principled guidance on how to solve the CIL problem. This paper performs such a study. It first shows that probabilistically, the CIL problem can be decomposed into two sub-problems: Within-task Prediction (WP) and Task-id Prediction (TP). It further proves that TP is correlated with out-of-distribution (OOD) detection, which connects CIL and OOD detection. The key conclusion of this study is that regardless of whether WP and TP or OOD detection are defined explicitly or implicitly by a CIL algorithm, good WP and good TP or OOD detection are necessary and sufficient for good CIL performance. Additionally, TIL is simply WP. Based on the theoretical result, new CIL methods are also designed, which outperform strong baselines in both CIL and TIL settings by a large margin.
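    In our notation (not necessarily the paper's), the decomposition reads: for a test input $x$, task $k$, and within-task class $j$, $P(X_{k,j} \mid x) = P(X_{k,j} \mid x, \mathrm{task}=k)\,P(\mathrm{task}=k \mid x)$, where the first factor is the within-task prediction (WP) and the second is the task-id prediction (TP); estimating TP well is where OOD detection enters, since identifying the task of $x$ amounts to deciding which task's distribution $x$ belongs to.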
    Quantifying Privacy Risks of Masked Language Models Using Membership Inference Attacks. (arXiv:2203.03929v2 [cs.LG] UPDATED)
    The wide adoption and application of Masked language models (MLMs) on sensitive data (from legal to medical) necessitates a thorough quantitative investigation into their privacy vulnerabilities -- to what extent do MLMs leak information about their training data? Prior attempts at measuring leakage of MLMs via membership inference attacks have been inconclusive, implying the potential robustness of MLMs to privacy attacks. In this work, we posit that prior attempts were inconclusive because they based their attack solely on the MLM's model score. We devise a stronger membership inference attack based on likelihood ratio hypothesis testing that involves an additional reference MLM to more accurately quantify the privacy risks of memorization in MLMs. We show that masked language models are extremely susceptible to likelihood ratio membership inference attacks: Our empirical results, on models trained on medical notes, show that our attack improves the AUC of prior membership inference attacks from 0.66 to an alarmingly high 0.90 level, with a significant improvement in the low-error region: at 1% false positive rate, our attack is 51X more powerful than prior work.
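    A schematic sketch of such a likelihood-ratio test (the model names are placeholders, and pseudo-log-likelihood via one-token-at-a-time masking is a standard MLM scoring approximation; the authors' exact attack may differ):

        # Sketch: membership score = log-likelihood under the target MLM minus
        # log-likelihood under a reference MLM; threshold on held-out data.
        import torch
        from transformers import AutoModelForMaskedLM, AutoTokenizer

        def pseudo_log_likelihood(model, tok, text):
            ids = tok(text, return_tensors="pt")["input_ids"][0]
            total = 0.0
            for i in range(1, len(ids) - 1):        # skip special tokens
                masked = ids.clone()
                masked[i] = tok.mask_token_id
                with torch.no_grad():
                    logits = model(masked.unsqueeze(0)).logits[0, i]
                total += torch.log_softmax(logits, -1)[ids[i]].item()
            return total

        target_name, ref_name = "bert-base-uncased", "distilbert-base-uncased"  # placeholders
        t_tok = AutoTokenizer.from_pretrained(target_name)
        t_mlm = AutoModelForMaskedLM.from_pretrained(target_name)
        r_tok = AutoTokenizer.from_pretrained(ref_name)
        r_mlm = AutoModelForMaskedLM.from_pretrained(ref_name)

        text = "patient presents with chest pain"
        score = pseudo_log_likelihood(t_mlm, t_tok, text) - pseudo_log_likelihood(r_mlm, r_tok, text)
        is_member = score > 0.0                      # threshold tuned in practice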
    A Neural Network Model of Continual Learning with Cognitive Control. (arXiv:2202.04773v2 [q-bio.NC] UPDATED)
    Neural networks struggle in continual learning settings due to catastrophic forgetting: when trials are blocked, new learning can overwrite the learning from previous blocks. Humans learn effectively in these settings, in some cases even showing an advantage of blocking, suggesting the brain contains mechanisms to overcome this problem. Here, we build on previous work and show that neural networks equipped with a mechanism for cognitive control do not exhibit catastrophic forgetting when trials are blocked. We further show an advantage of blocking over interleaving when there is a bias for active maintenance in the control signal, implying a tradeoff between maintenance and the strength of control. Analyses of map-like representations learned by the networks provided additional insights into these mechanisms. Our work highlights the potential of cognitive control to aid continual learning in neural networks, and offers an explanation for the advantage of blocking that has been observed in humans.
    Deep Learning for Rheumatoid Arthritis: Joint Detection and Damage Scoring in X-rays. (arXiv:2104.13915v2 [cs.CV] UPDATED)
    Recent advancements in computer vision promise to automate medical image analysis. Rheumatoid arthritis is an autoimmune disease that would profit from computer-based diagnosis, as no direct markers are known and doctors have to rely on manual inspection of X-ray images. In this work, we present a multi-task deep learning model that simultaneously learns to localize joints on X-ray images and diagnose two kinds of joint damage: narrowing and erosion. Additionally, we propose a modification of label smoothing, which combines classification and regression cues into a single loss and achieves a 5% relative error reduction compared to standard loss functions. Our final model obtained 4th place in joint space narrowing and 5th place in joint erosion in the global RA2 DREAM challenge.
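    One hedged reading of such a combined loss (the paper's exact smoothing may differ): soft targets that spread probability mass to neighbouring ordinal damage grades, so the cross-entropy also penalizes distance from the true score, as a regression loss would.

        # Sketch of ordinal label smoothing: a Gaussian bump over damage grades
        # replaces the one-hot target; sigma controls how much mass leaks to
        # neighbouring scores (all values here are illustrative).
        import torch

        def ordinal_soft_targets(labels, num_classes, sigma=0.75):
            scores = torch.arange(num_classes, dtype=torch.float32)
            # nearby scores are "less wrong" than distant ones
            logits = -((scores.unsqueeze(0) - labels.unsqueeze(1).float()) ** 2) / (2 * sigma ** 2)
            return torch.softmax(logits, dim=1)

        def smoothed_loss(pred_logits, labels):
            targets = ordinal_soft_targets(labels, pred_logits.shape[1])
            return -(targets * torch.log_softmax(pred_logits, dim=1)).sum(dim=1).mean()

        pred = torch.randn(4, 5)                  # 4 joints, 5 damage grades (0-4)
        loss = smoothed_loss(pred, torch.tensor([0, 2, 4, 1]))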
    PIPPI2021: An Approach to Automated Diagnosis and Texture Analysis of the Fetal Liver & Placenta in Fetal Growth Restriction. (arXiv:2211.02639v1 [eess.IV])
    Fetal growth restriction (FGR) is a prevalent pregnancy condition characterised by failure of the fetus to reach its genetically predetermined growth potential. We explore the application of model fitting techniques, linear regression machine learning models, deep learning regression, and Haralick textured features from multi-contrast MRI for multi-fetal organ analysis of FGR. We employed T2 relaxometry and diffusion-weighted MRI datasets (using a combined T2-diffusion scan) for 12 normally grown and 12 FGR gestational age (GA) matched pregnancies. We applied the Intravoxel Incoherent Motion Model and novel multi-compartment models for MRI fetal analysis, which exhibit potential to provide a multi-organ FGR assessment, overcoming the limitations of empirical indicators - such as abnormal artery Doppler findings - to evaluate placental dysfunction. The placenta and fetal liver presented key differentiators between FGR and normal controls (decreased perfusion, abnormal fetal blood motion and reduced fetal blood oxygenation). This may be associated with the preferential shunting of the fetal blood towards the fetal brain. These features were further explored to determine their role in assessing FGR severity, by employing simple machine learning models to predict FGR diagnosis (100\% accuracy in test data, n=5), GA at delivery, time from MRI scan to delivery, and baby weight. Moreover, we explored the use of deep learning to regress the latter three variables. Image texture analysis of the fetal organs demonstrated prominent textural variations in the placental perfusion fractions maps between the groups (p$<$0.0009), and spatial differences in the incoherent fetal capillary blood motion in the liver (p$<$0.009). This research serves as a proof-of-concept, investigating the effect of FGR on fetal organs.
    Automatic classification of deformable shapes. (arXiv:2211.02530v1 [cs.CV])
    Let $\mathcal{D}$ be a dataset of smooth 3D-surfaces, partitioned into disjoint classes $\mathit{CL}_j$, $j= 1, \ldots, k$. We show how optimized diffeomorphic registration applied to large numbers of pairs $S,S' \in \mathcal{D}$ can provide descriptive feature vectors to implement automatic classification on $\mathcal{D}$, and generate classifiers invariant by rigid motions in $\mathbb{R}^3$. To enhance accuracy of automatic classification, we enrich the smallest classes $\mathit{CL}_j$ by diffeomorphic interpolation of smooth surfaces between pairs $S,S' \in \mathit{CL}_j$. We also implement small random perturbations of surfaces $S\in \mathit{CL}_j$ by random flows of smooth diffeomorphisms $F_t:\mathbb{R}^3 \to \mathbb{R}^3$. Finally, we test our automatic classification methods on a cardiology data base of discretized mitral valve surfaces.
    A Multi-Head Convolutional Neural Network With Multi-path Attention improves Image Denoising. (arXiv:2204.12736v2 [cs.CV] UPDATED)
    Recently, convolutional neural networks (CNNs) and attention mechanisms have been widely used in image denoising and achieved satisfactory performance. However, previous works mostly use a single head to receive the noisy image, limiting the richness of extracted features. Therefore, a novel CNN with multiple heads (MH) named MHCNN is proposed in this paper, whose heads receive the input image rotated by different angles. MH makes MHCNN simultaneously utilize features of rotated images to remove noise. To integrate these features effectively, we present a novel multi-path attention mechanism (MPA). Unlike previous attention mechanisms that handle pixel-level, channel-level, or patch-level features, MPA focuses on features at the image level. Experiments show MHCNN surpasses other state-of-the-art CNN models on additive white Gaussian noise (AWGN) denoising and real-world image denoising. Its peak signal-to-noise ratio (PSNR) results are higher than other networks, such as BRDNet, RIDNet, PAN-Net, and CSANN. The code is accessible at https://github.com/JiaHongZ/MHCNN.
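    A minimal sketch of the multi-head mechanism (the head and fusion sizes are illustrative, and MPA itself is omitted): each head sees the input rotated by a different multiple of 90 degrees, features are rotated back into alignment, and a fusion layer predicts the residual noise.

        # Sketch: four heads over rotated copies of the input, aligned and fused.
        import torch
        import torch.nn as nn

        class MultiHeadDenoiser(nn.Module):
            def __init__(self, ch=64):
                super().__init__()
                self.heads = nn.ModuleList([nn.Conv2d(1, ch, 3, padding=1) for _ in range(4)])
                self.fuse = nn.Conv2d(4 * ch, 1, 3, padding=1)

            def forward(self, x):
                feats = []
                for k, head in enumerate(self.heads):
                    rot = torch.rot90(x, k, dims=(-2, -1))                    # rotate input by k*90 degrees
                    feats.append(torch.rot90(head(rot), -k, dims=(-2, -1)))   # rotate features back
                return x - self.fuse(torch.cat(feats, dim=1))                 # predict residual noise

        noisy = torch.randn(2, 1, 32, 32)
        denoised = MultiHeadDenoiser()(noisy)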
    Adversarial Defense via Neural Oscillation inspired Gradient Masking. (arXiv:2211.02223v1 [cs.LG])
    Spiking neural networks (SNNs) attract great attention due to their low power consumption, low latency, and biological plausibility. As they are widely deployed in neuromorphic devices for low-power brain-inspired computing, security issues become increasingly important. However, compared to deep neural networks (DNNs), SNNs currently lack specifically designed defense methods against adversarial attacks. Inspired by neural membrane potential oscillation, we propose a novel neural model that incorporates the bio-inspired oscillation mechanism to enhance the security of SNNs. Our experiments show that SNNs with neural oscillation neurons have better resistance to adversarial attacks than ordinary SNNs with LIF neurons across various architectures and datasets. Furthermore, we propose a defense method that changes the model's gradients by replacing the form of oscillation, which hides the original training gradients and confuses the attacker into using gradients of 'fake' neurons to generate invalid adversarial samples. Our experiments suggest that the proposed defense method can effectively resist both single-step and iterative attacks, with comparable defense effectiveness and much lower computational cost than adversarial training methods on DNNs. To the best of our knowledge, this is the first work that establishes adversarial defense through masking surrogate gradients on SNNs.
    A Meta-GNN approach to personalized seizure detection and classification. (arXiv:2211.02642v1 [eess.SP])
    In this paper, we propose a personalized seizure detection and classification framework that quickly adapts to a specific patient from limited seizure samples. We achieve this by combining two novel paradigms that have recently seen much success in a wide variety of real-world applications: graph neural networks (GNN) and meta-learning. We train a Meta-GNN based classifier that learns a global model from a set of training patients such that this global model can eventually be adapted to a new unseen patient using very limited samples. We apply our approach on the TUSZ dataset, one of the largest publicly available benchmark datasets for epilepsy. We show that our method outperforms the baselines by reaching 82.7% accuracy and an 82.08% F1 score after only 20 iterations on new, unseen patients.
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v5 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline based on pre-collected observational data in infinite horizon settings. Most of the existing works assume no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification and provide rigorous uncertainty quantification. Our method is justified by theoretical results, simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.
    MAEEG: Masked Auto-encoder for EEG Representation Learning. (arXiv:2211.02625v1 [eess.SP])
    Decoding information from bio-signals such as EEG, using machine learning, has been challenging due to small datasets and the difficulty of obtaining labels. We propose a reconstruction-based self-supervised learning model, the masked auto-encoder for EEG (MAEEG), for learning EEG representations by learning to reconstruct the masked EEG features using a transformer architecture. We found that MAEEG can learn representations that significantly improve sleep stage classification (~5% accuracy increase) when only a small number of labels are given. We also found that input sample lengths and different ways of masking during reconstruction-based SSL pretraining have a large effect on downstream model performance. Specifically, learning to reconstruct a larger proportion and more concentrated masked signal results in better performance on sleep classification. Our findings provide insight into how reconstruction-based SSL could help representation learning for EEG.
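    An illustrative masking routine in this spirit (the paper's exact masking scheme and ratios are its own; this is a simplified stand-in): zero out one large contiguous span so the model must reconstruct a concentrated portion of the signal.

        # Sketch: contiguous masking of a multichannel EEG epoch for
        # reconstruction-based pretraining (ratio and shapes are illustrative).
        import torch

        def mask_contiguous(x, ratio=0.5):
            # x: (batch, channels, time)
            t = x.shape[-1]
            span = int(ratio * t)
            start = torch.randint(0, t - span + 1, (1,)).item()
            masked = x.clone()
            masked[..., start:start + span] = 0.0
            return masked, (start, start + span)   # reconstruct only the masked span

        eeg = torch.randn(8, 4, 1000)              # 8 epochs, 4 channels, 1000 samples
        masked_eeg, (lo, hi) = mask_contiguous(eeg)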
    Leveraging Statistical Shape Priors in GAN-based ECG Synthesis. (arXiv:2211.02626v1 [eess.SP])
    Due to the difficulty of collecting electrocardiogram (ECG) data during emergency situations, ECG data generation is an efficient solution for dealing with highly imbalanced ECG training datasets. However, due to the complex dynamics of ECG signals, the synthesis of such signals is a challenging task. In this paper, we present a novel approach for ECG signal generation based on Generative Adversarial Networks (GANs). Our approach combines GANs with statistical ECG data modeling to leverage prior knowledge about ECG dynamics in the generation process. To validate the proposed approach, we present experiments using ECG signals from the MIT-BIH arrhythmia database. The obtained results show the benefits of modeling temporal and amplitude variations of ECG signals as 2-D shapes in generating realistic signals and also improving the performance of state-of-the-art arrhythmia classification baselines.
    A Knowledge Distillation Framework For Enhancing Ear-EEG Based Sleep Staging With Scalp-EEG Data. (arXiv:2211.02638v1 [eess.SP])
    Sleep plays a crucial role in the well-being of human lives. Traditional sleep studies using Polysomnography are associated with discomfort and often lower sleep quality caused by the acquisition setup. Previous works have focused on developing less obtrusive methods to conduct high-quality sleep studies, and ear-EEG is among the popular alternatives. However, the performance of sleep staging based on ear-EEG is still inferior to scalp-EEG based sleep staging. In order to address the performance gap between scalp-EEG and ear-EEG based sleep staging, we propose a cross-modal knowledge distillation strategy, which is a domain adaptation approach. Our experiments and analysis validate the effectiveness of the proposed approach with existing architectures, where it enhances the accuracy of the ear-EEG based sleep staging by 3.46% and Cohen's kappa coefficient by a margin of 0.038.
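    The standard cross-modal distillation recipe this suggests, as a sketch (the temperature, weighting, and five-stage setup are our assumptions): a frozen scalp-EEG teacher supervises the ear-EEG student through softened sleep-stage targets alongside the usual hard-label loss.

        # Sketch: classic knowledge-distillation loss applied across modalities.
        import torch
        import torch.nn.functional as F

        def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
            soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                            F.softmax(teacher_logits / T, dim=1),
                            reduction="batchmean") * T * T
            hard = F.cross_entropy(student_logits, labels)
            return alpha * soft + (1 - alpha) * hard

        student = torch.randn(16, 5, requires_grad=True)   # ear-EEG logits, 5 sleep stages
        teacher = torch.randn(16, 5)                       # frozen scalp-EEG teacher logits
        loss = distillation_loss(student, teacher, torch.randint(0, 5, (16,)))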
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet? (arXiv:2201.05119v2 [cs.CV] UPDATED)
    Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from ReLIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and obtain more informative representations. ReLICv2 achieves $77.1\%$ top-$1$ accuracy on ImageNet under linear evaluation on a ResNet50, thus improving the previous state-of-the-art by absolute $+1.5\%$; on larger ResNet models, ReLICv2 achieves up to $80.6\%$ outperforming previous self-supervised approaches with margins up to $+2.3\%$. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison over a range of ResNet architectures. Using ReLICv2, we also learn more robust and transferable representations that generalize better out-of-distribution than previous work, both on image classification and semantic segmentation. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.
    Neural Posterior Regularization for Likelihood-Free Inference. (arXiv:2102.07770v2 [cs.LG] UPDATED)
    A simulation is useful when the phenomenon of interest is either expensive to regenerate or irreproducible with the same context. Recently, Bayesian inference on the distribution of the simulation input parameter has been implemented sequentially to minimize the simulation budget required to validate the simulation against the real world. However, Bayesian inference remains challenging when the ground-truth posterior is multi-modal with a high-dimensional simulation output. This paper introduces a regularization technique, namely Neural Posterior Regularization (NPR), which enforces the model to explore the input parameter space effectively. We then provide the closed-form solution of the regularized optimization, which enables analysis of the effect of the regularization. We empirically validate that NPR attains statistically significant gains on benchmark performance for diverse simulation tasks.
    Spatially Selective Deep Non-linear Filters for Speaker Extraction. (arXiv:2211.02420v1 [eess.AS])
    In a scenario with multiple persons talking simultaneously, the spatial characteristics of the signals are the most distinct feature for extracting the target signal. In this work, we develop a deep joint spatial-spectral non-linear filter that can be steered in an arbitrary target direction. To this end, we propose a simple and effective conditioning mechanism, which sets the initial state of the filter's recurrent layers based on the target direction. We show that this scheme is more effective than the baseline approach and increases the flexibility of the filter at no performance cost. The resulting spatially selective non-linear filters can also be used for speech separation of an arbitrary number of speakers and enable very accurate multi-speaker localization, as we demonstrate in this paper.
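    A sketch of the conditioning mechanism as described (the layer sizes and the (cos, sin) direction encoding are our assumptions): a small network maps the target direction to the initial hidden and cell states of the recurrent layers.

        # Sketch: steer the filter by initializing its LSTM state from the
        # target direction (sizes are illustrative).
        import torch
        import torch.nn as nn

        class SteerableFilter(nn.Module):
            def __init__(self, feat=64, hidden=128):
                super().__init__()
                self.init_net = nn.Linear(2, 2 * hidden)   # direction as (cos t, sin t)
                self.rnn = nn.LSTM(feat, hidden, batch_first=True)
                self.out = nn.Linear(hidden, feat)

            def forward(self, spec, theta):
                # spec: (batch, frames, feat) spectral features; theta: (batch,) radians
                d = torch.stack([torch.cos(theta), torch.sin(theta)], dim=-1)
                h0, c0 = self.init_net(d).chunk(2, dim=-1)
                state = (h0.unsqueeze(0).contiguous(), c0.unsqueeze(0).contiguous())
                y, _ = self.rnn(spec, state)
                return self.out(y)                          # output for the target direction

        net = SteerableFilter()
        out = net(torch.randn(3, 100, 64), torch.tensor([0.0, 1.57, 3.14]))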
    GoRela: Go Relative for Viewpoint-Invariant Motion Forecasting. (arXiv:2211.02545v1 [cs.RO])
    The task of motion forecasting is critical for self-driving vehicles (SDVs) to be able to plan a safe maneuver. Towards this goal, modern approaches reason about the map, the agents' past trajectories and their interactions in order to produce accurate forecasts. The predominant approach has been to encode the map and other agents in the reference frame of each target agent. However, this approach is computationally expensive for multi-agent prediction as inference needs to be run for each agent. To tackle the scaling challenge, the solution thus far has been to encode all agents and the map in a shared coordinate frame (e.g., the SDV frame). However, this is sample inefficient and vulnerable to domain shift (e.g., when the SDV visits uncommon states). In contrast, in this paper, we propose an efficient shared encoding for all agents and the map without sacrificing accuracy or generalization. To this end, we leverage pair-wise relative positional encodings to represent geometric relationships between the agents and the map elements in a heterogeneous spatial graph. This parameterization allows us to be invariant to scene viewpoint, and save online computation by re-using map embeddings computed offline. Our decoder is also viewpoint agnostic, predicting agent goals on the lane graph to enable diverse and context-aware multimodal prediction. We demonstrate the effectiveness of our approach on the urban Argoverse 2 benchmark as well as a novel highway dataset.
    Multi-view Multi-label Fine-grained Emotion Decoding from Human Brain Activity. (arXiv:2211.02629v1 [eess.SP])
    Decoding emotional states from human brain activity plays an important role in brain-computer interfaces. Existing emotion decoding methods still have two main limitations: one is decoding only a single, coarse-grained emotion category from a brain activity pattern, which is inconsistent with the complex emotional expression of humans; the other is ignoring the discrepancy of emotion expression between the left and right hemispheres of the human brain. In this paper, we propose a novel multi-view multi-label hybrid model for fine-grained emotion decoding (up to 80 emotion categories) which can learn expressive neural representations and predict multiple emotional states simultaneously. Specifically, the generative component of our hybrid model is parametrized by a multi-view variational auto-encoder, in which we regard the brain activity of the left and right hemispheres and their difference as three distinct views, and use the product-of-experts mechanism in its inference network. The discriminative component of our hybrid model is implemented by a multi-label classification network with an asymmetric focal loss. For more accurate emotion decoding, we first adopt a label-aware module for emotion-specific neural representation learning and then model the dependency of emotional states by a masked self-attention mechanism. Extensive experiments on two visually evoked emotional datasets show the superiority of our method.
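    For reference, the product-of-experts combination for Gaussian experts, in its textbook form as typically used in multi-view VAEs (the paper's inference network may differ in detail): given view-wise posteriors $\mathcal{N}(\mu_i, \Sigma_i)$ for $i = 1, 2, 3$, the joint posterior is $\mathcal{N}(\mu, \Sigma)$ with $\Sigma = \big(\sum_i \Sigma_i^{-1}\big)^{-1}$ and $\mu = \Sigma \sum_i \Sigma_i^{-1} \mu_i$, i.e., each view contributes in proportion to its precision.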
    PhysioGait: Context-Aware Physiological Context Modeling for Person Re-identification Attack on Wearable Sensing. (arXiv:2211.02622v1 [eess.SP])
    Person re-identification is a critical privacy breach in publicly shared healthcare data. We investigate the possibility of a new type of privacy threat on publicly shared, privacy-insensitive, large-scale wearable sensing data. In this paper, we investigate user-specific biometric signatures in terms of two contextual biometric traits, physiological (photoplethysmography and electrodermal activity) and physical (accelerometer) contexts. In this regard, we propose PhysioGait, a context-aware physiological signal model that consists of a Multi-Modal Siamese Convolutional Neural Network (mmSNN) which learns the spatial and temporal information individually and performs sensor fusion with a Siamese loss, with the objective of predicting a person's identity. We evaluated the PhysioGait attack model using four datasets collected in real time (three under IRB #HP-00064387 and one publicly available) and two combined datasets, achieving 89%-93% accuracy in re-identifying persons.
    A General Purpose Neural Architecture for Geospatial Systems. (arXiv:2211.02348v1 [cs.LG])
    Geospatial Information Systems are used by researchers and Humanitarian Assistance and Disaster Response (HADR) practitioners to support a wide variety of important applications. However, collaboration between these actors is difficult due to the heterogeneous nature of geospatial data modalities (e.g., multi-spectral images of various resolutions, time series, weather data) and the diversity of tasks (e.g., regressing human activity indicators or detecting forest fires). In this work, we present a roadmap towards the construction of a general-purpose neural architecture (GPNA) with a geospatial inductive bias, pre-trained on large amounts of unlabelled earth observation data in a self-supervised manner. We envision how such a model may facilitate cooperation between members of the community. We show preliminary results on the first step of the roadmap, where we instantiate an architecture that can process a wide variety of geospatial data modalities and demonstrate that it can achieve competitive performance with domain-specific architectures on tasks relating to the U.N.'s Sustainable Development Goals.
    Time Series Synthesis via Multi-scale Patch-based Generation of Wavelet Scalogram. (arXiv:2211.02620v1 [eess.SP])
    A framework is proposed for the unconditional generation of synthetic time series based on learning from a single sample in the low-data regime. The framework aims at capturing the distribution of patches in the wavelet scalogram of a time series using single-image generative models, and at producing realistic wavelet coefficients for the generation of synthetic time series. It is demonstrated that the framework is effective with respect to fidelity and diversity for time series with insignificant to no trends. Also, the performance is more promising for generating samples with the same duration (reshuffling) than for longer ones (retargeting).
    Robustness of Fusion-based Multimodal Classifiers to Cross-Modal Content Dilutions. (arXiv:2211.02646v1 [cs.LG])
    As multimodal learning finds applications in a wide variety of high-stakes societal tasks, investigating their robustness becomes important. Existing work has focused on understanding the robustness of vision-and-language models to imperceptible variations on benchmark tasks. In this work, we investigate the robustness of multimodal classifiers to cross-modal dilutions - a plausible variation. We develop a model that, given a multimodal (image + text) input, generates additional dilution text that (a) maintains relevance and topical coherence with the image and existing text, and (b) when added to the original text, leads to misclassification of the multimodal input. Via experiments on Crisis Humanitarianism and Sentiment Detection tasks, we find that the performance of task-specific fusion-based multimodal classifiers drops by 23.3% and 22.5%, respectively, in the presence of dilutions generated by our model. Metric-based comparisons with several baselines and human evaluations indicate that our dilutions show higher relevance and topical coherence, while simultaneously being more effective at demonstrating the brittleness of the multimodal classifiers. Our work aims to highlight and encourage further research on the robustness of deep multimodal models to realistic variations, especially in human-facing societal applications. The code and other resources are available at https://claws-lab.github.io/multimodal-robustness/.
    Improving Adversarial Robustness to Sensitivity and Invariance Attacks with Deep Metric Learning. (arXiv:2211.02468v1 [cs.LG])
    Intentionally crafted adversarial samples have effectively exploited weaknesses in deep neural networks. A standard method in adversarial robustness assumes a framework to defend against samples crafted by minimally perturbing a sample such that its corresponding model output changes. These sensitivity attacks exploit the model's sensitivity toward task-irrelevant features. Another form of adversarial sample can be crafted via invariance attacks, which exploit the model underestimating the importance of relevant features. Previous literature has indicated a tradeoff in defending against both attack types within a strictly L_p bounded defense. To promote robustness toward both types of attacks beyond Euclidean distance metrics, we use metric learning to frame adversarial regularization as an optimal transport problem. Our preliminary results indicate that regularizing over invariant perturbations in our framework improves both invariant and sensitivity defense.
    CCATMos: Convolutional Context-aware Transformer Network for Non-intrusive Speech Quality Assessment. (arXiv:2211.02577v1 [eess.AS])
    Speech quality assessment has been a critical component in many voice communication related applications such as telephony and online conferencing. Traditional intrusive speech quality assessment requires the clean reference of the degraded utterance to provide an accurate quality measurement. This requirement limits the usability of these methods in real-world scenarios. On the other hand, non-intrusive subjective measurement is the ``gold standard" in evaluating speech quality as human listeners can intrinsically evaluate the quality of any degraded speech with ease. In this paper, we propose a novel end-to-end model structure called Convolutional Context-Aware Transformer (CCAT) network to predict the mean opinion score (MOS) of human raters. We evaluate our model on three MOS-annotated datasets spanning multiple languages and distortion types and submit our results to the ConferencingSpeech 2022 Challenge. Our experiments show that CCAT provides promising MOS predictions compared to current state-of-the-art non-intrusive speech assessment models, with the average Pearson correlation coefficient (PCC) increasing from 0.530 to 0.697 and the average RMSE decreasing from 0.768 to 0.570 compared to the baseline model on the challenge evaluation test set.
    Approximate exploitability: Learning a best response in large games. (arXiv:2004.09677v5 [cs.LG] UPDATED)
    Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.
    PURSUhInT: In Search of Informative Hint Points Based on Layer Clustering for Knowledge Distillation. (arXiv:2103.00053v3 [cs.LG] UPDATED)
    One of the most efficient methods for model compression is hint distillation, where the student model is injected with information (hints) from several different layers of the teacher model. Although the selection of hint points can drastically alter the compression performance, conventional distillation approaches overlook this fact and use the same hint points as in the early studies. Therefore, we propose a clustering based hint selection methodology, where the layers of the teacher model are clustered with respect to several metrics and the cluster centers are used as the hint points. Our method is applicable to any student network once it has been applied to a chosen teacher network. The proposed approach is validated on the CIFAR-100 and ImageNet datasets, using various teacher-student pairs and numerous hint distillation methods. Our results show that hint points selected by our algorithm result in superior compression performance compared to state-of-the-art knowledge distillation algorithms on the same student models and datasets.
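    A hedged sketch of the selection step (the per-layer descriptors below are simplified stand-ins for the metrics used in the paper): cluster the teacher's layers and take, per cluster, the layer nearest the centre as a hint point.

        # Sketch: k-means over per-layer descriptors; cluster-nearest layers
        # become hint points (descriptors here are synthetic placeholders).
        import numpy as np
        from sklearn.cluster import KMeans

        # One descriptor per candidate teacher layer, e.g. depth + activation stats
        layer_stats = np.array([[i, np.tanh(i / 10.0), (i % 4) / 4.0] for i in range(24)])

        k = 4                                             # number of hint points to select
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(layer_stats)
        hints = [int(np.argmin(((layer_stats - c) ** 2).sum(1))) for c in km.cluster_centers_]
        print(sorted(hints))                              # indices of selected hint layers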
    Data Models for Dataset Drift Controls in Machine Learning With Images. (arXiv:2211.02578v1 [cs.LG])
    Camera images are ubiquitous in machine learning research. They also play a central role in the delivery of important services spanning medicine and environmental surveying. However, the application of machine learning models in these domains has been limited because of robustness concerns. A primary failure mode is a performance drop due to differences between the training and deployment data. While there are methods to prospectively validate the robustness of machine learning models to such dataset drifts, existing approaches do not account for explicit models of the primary object of interest: the data. This makes it difficult to create physically faithful drift test cases or to provide specifications of data models that should be avoided when deploying a machine learning model. In this study, we demonstrate how these shortcomings can be overcome by pairing machine learning robustness validation with physical optics. We examine the role raw sensor data and differentiable data models can play in controlling performance risks related to image dataset drift. The findings are distilled into three applications. First, drift synthesis enables the controlled generation of physically faithful drift test cases. The experiments presented here show that the average decrease in model performance is four to ten times less severe than under post-hoc augmentation testing. Second, the gradient connection between task and data models allows for drift forensics that can be used to specify performance-sensitive data models which should be avoided during deployment of a machine learning model. Third, drift adjustment opens up the possibility of processing adjustments in the face of drift, which can speed up and stabilize classifier training by a margin of up to 20% in validation accuracy. A guide to access the open code and datasets is available at https://github.com/aiaudit-org/raw2logit.
    A Comparison of SVM against Pre-trained Language Models (PLMs) for Text Classification Tasks. (arXiv:2211.02563v1 [cs.CL])
    The emergence of pre-trained language models (PLMs) has shown great success in many Natural Language Processing (NLP) tasks, including text classification. Due to the minimal to no feature engineering required when using these models, PLMs are becoming the de facto choice for any NLP task. However, for domain-specific corpora (e.g., financial, legal, and industrial), fine-tuning a pre-trained model for a specific task has been shown to provide a performance improvement. In this paper, we compare the performance of four different PLMs on three public domain-free datasets and a real-world dataset containing domain-specific words, against a simple SVM linear classifier with TFIDF vectorized text. The experimental results on the four datasets show that PLMs, even fine-tuned ones, do not provide a significant gain over the linear SVM classifier. Hence, we recommend that for text classification tasks, a traditional SVM with careful feature engineering can provide cheaper and superior performance compared to PLMs.
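    The SVM baseline described is the standard TFIDF pipeline; a minimal version (the n-gram range and example data are illustrative):

        # Minimal TFIDF + linear SVM text classifier, the kind of baseline
        # the paper compares against (settings are illustrative).
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.pipeline import make_pipeline
        from sklearn.svm import LinearSVC

        texts = ["rates rise on bond fears", "team wins league final"]
        labels = ["finance", "sports"]

        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
        clf.fit(texts, labels)
        print(clf.predict(["central bank hikes rates"]))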
    Emotion Recognition With Temporarily Localized 'Emotional Events' in Naturalistic Context. (arXiv:2211.02637v1 [eess.SP])
    Emotion recognition using EEG signals is an emerging area of research due to its broad applicability in BCI. Emotional feelings are hard to stimulate in the lab. Emotions do not last long, yet they need enough context to be perceived and felt. However, most EEG-related emotion databases either suffer from emotionally irrelevant details (due to prolonged duration stimulus) or have minimal context, casting doubt on whether any emotion is felt using the stimulus. We tried to reduce the impact of this trade-off by designing an experiment in which participants are free to report their emotional feelings while watching the emotional stimulus. We called these reported emotional feelings "Emotional Events" in our Dataset on Emotion with Naturalistic Stimuli (DENS). We used EEG signals to classify emotional events on different combinations of Valence (V) and Arousal (A) dimensions and compared the results with the benchmark DEAP and SEED datasets. STFT is used for feature extraction, and the features feed a classification model consisting of hybrid CNN-LSTM layers. We achieved significantly higher accuracy with our data compared to DEAP and SEED data. We conclude that having precise information about emotional feelings improves the classification accuracy compared to long-duration EEG signals, which might be contaminated by mind-wandering.
    ZerO Initialization: Initializing Neural Networks with only Zeros and Ones. (arXiv:2110.12661v3 [cs.LG] UPDATED)
    Deep neural networks are usually initialized with random weights, with adequately selected initial variance to ensure stable signal propagation during training. However, selecting the appropriate variance becomes challenging especially as the number of layers grows. In this work, we replace random weight initialization with a fully deterministic initialization scheme, viz., ZerO, which initializes the weights of networks with only zeros and ones (up to a normalization factor), based on identity and Hadamard transforms. Through both theoretical and empirical studies, we demonstrate that ZerO is able to train networks without damaging their expressivity. Applying ZerO on ResNet achieves state-of-the-art performance on various datasets, including ImageNet, which suggests random weights may be unnecessary for network initialization. In addition, ZerO has many benefits, such as training ultra deep networks (without batch-normalization), exhibiting low-rank learning trajectories that result in low-rank and sparse solutions, and improving training reproducibility.
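    A simplified sketch of our reading of the scheme (not the official implementation): dimension-preserving layers start as the identity, while dimension-changing layers take rows and columns of a scaled Hadamard matrix, giving a deterministic, well-conditioned map.

        # Sketch: deterministic identity/Hadamard-style initialization.
        import numpy as np
        from scipy.linalg import hadamard

        def zero_style_init(out_dim, in_dim):
            if out_dim == in_dim:
                return np.eye(out_dim)                 # dimension-preserving: identity
            n = 1 << int(np.ceil(np.log2(max(out_dim, in_dim))))
            H = hadamard(n) / np.sqrt(n)               # orthogonal up to scaling
            return H[:out_dim, :in_dim]                # deterministic expansion/reduction

        W1 = zero_style_init(64, 64)                   # identity
        W2 = zero_style_init(128, 64)                  # Hadamard-based expansion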
    An IoT Cloud and Big Data Architecture for the Maintenance of Home Appliances. (arXiv:2211.02627v1 [eess.SP])
    Billions of interconnected Internet of Things (IoT) sensors and devices collect tremendous amounts of data from real-world scenarios. Big data is generating increasing interest in a wide range of industries. Once data is analyzed through compute-intensive Machine Learning (ML) methods, it can derive critical business value for organizations. Powerful platforms are essential to handle and process such massive collections of information cost-effectively and conveniently. This work introduces a distributed and scalable platform architecture that can be deployed for efficient real-world big data collection and analytics. The proposed system was tested with a case study for Predictive Maintenance of Home Appliances, where current and vibration sensors with high acquisition frequency were connected to washing machines and refrigerators. The introduced platform was used to collect, store, and analyze the data. The experimental results demonstrated that the presented system could be advantageous for tackling real-world IoT scenarios in a cost-effective, local manner.
    A Transformer-Based Substitute Recommendation Model Incorporating Weakly Supervised Customer Behavior Data. (arXiv:2211.02533v1 [cs.IR])
    The substitute-based recommendation is widely used in E-commerce to provide better alternatives to customers. However, existing research typically uses customer behavior signals like co-view and view-but-purchase-another to capture the substitute relationship. Despite its intuitive soundness, we find that such an approach might ignore the functionality and characteristics of products. In this paper, we recast substitute recommendation as a language matching problem, taking the product title description as model input to account for product functionality. We design a new transformation method to de-noise the signals derived from production data. In addition, we consider multilingual support from the engineering point of view. Our proposed end-to-end transformer-based model succeeds in both offline and online experiments. It has been deployed in a large-scale E-commerce website for 11 marketplaces in 6 languages, and is demonstrated to increase revenue by 19% based on an online A/B experiment.
    Reservoir Computing via Quantum Recurrent Neural Networks. (arXiv:2211.02612v1 [cs.NE])
    Recent developments in quantum computing and machine learning have propelled the interdisciplinary study of quantum machine learning. Sequential modeling is an important task with high scientific and commercial value. Existing VQC or QNN-based methods require significant computational resources to perform the gradient-based optimization of a large number of quantum circuit parameters. The major drawback is that such quantum gradient calculation requires a large amount of circuit evaluation, posing challenges in current near-term quantum hardware and simulation software. In this work, we approach sequential modeling by applying a reservoir computing (RC) framework to quantum recurrent neural networks (QRNN-RC) that are based on classical RNN, LSTM and GRU. The main idea of this RC approach is that the QRNN with randomly initialized weights is treated as a dynamical system and only the final classical linear layer is trained. Our numerical simulations show that the QRNN-RC can reach results comparable to fully trained QRNN models for several function approximation and time series prediction tasks. Since the QRNN training complexity is significantly reduced, the proposed model trains notably faster. In this work we also compare to corresponding classical RNN-based RC implementations and show that the quantum version learns faster by requiring fewer training epochs in most cases. Our results demonstrate a new possibility to utilize quantum neural networks for sequential modeling with greater quantum hardware efficiency, an important design consideration for noisy intermediate-scale quantum (NISQ) computers.
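    The classical analogue of this training scheme, as a sketch (the reservoir size, spectral radius, and toy task are illustrative): a fixed random recurrent reservoir whose states are fit to the targets with ridge regression, so only the linear readout is trained, mirroring the QRNN-RC setup.

        # Sketch: echo state network; the recurrent weights stay fixed and only
        # the linear readout is fit, here by ridge regression.
        import numpy as np

        rng = np.random.default_rng(0)
        n_res, T = 100, 500
        W_in = rng.normal(0, 0.5, (n_res, 1))
        W = rng.normal(0, 1.0, (n_res, n_res))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

        u = np.sin(np.linspace(0, 20, T))[:, None]        # input series
        y = np.roll(u, -1, axis=0)                        # one-step-ahead target
        x, states = np.zeros(n_res), []
        for t in range(T):
            x = np.tanh(W_in @ u[t] + W @ x)
            states.append(x)
        S = np.array(states)

        ridge = 1e-6                                      # trained readout only
        W_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ y)
        pred = S @ W_out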
    Recursive Estimation of User Intent from Noninvasive Electroencephalography using Discriminative Models. (arXiv:2211.02630v1 [eess.SP])
    We study the problem of inferring user intent from noninvasive electroencephalography (EEG) to restore communication for people with severe speech and physical impairments (SSPI). The focus of this work is improving the estimation of posterior symbol probabilities in a typing task. At each iteration of the typing procedure, a subset of symbols is chosen for the next query based on the current probability estimate. Evidence about the user's response is collected from event-related potentials (ERP) in order to update symbol probabilities, until one symbol exceeds a predefined confidence threshold. We provide a graphical model describing this task, and derive a recursive Bayesian update rule based on a discriminative probability over label vectors for each query, which we approximate using a neural network classifier. We evaluate the proposed method in a simulated typing task and show that it outperforms previous approaches based on generative modeling.
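    A toy version of the recursive update (the alphabet size, threshold, and likelihood values are placeholders): each query multiplies the current symbol posterior by the discriminative model's evidence and renormalizes, stopping once one symbol clears the confidence threshold.

        # Sketch: recursive Bayesian update of symbol probabilities from
        # per-query evidence (values are synthetic placeholders).
        import numpy as np

        symbols = list("abcde")
        posterior = np.full(len(symbols), 1.0 / len(symbols))
        threshold = 0.9

        def update(posterior, likelihoods):
            # likelihoods[i]: discriminative evidence that symbol i was intended;
            # symbols not shown in the query keep a neutral likelihood of 1
            post = posterior * likelihoods
            return post / post.sum()

        # Simulated round: symbols 'a' and 'c' are queried; ERP evidence favours 'c'
        like = np.ones(len(symbols))
        like[[0, 2]] = [0.4, 3.0]
        posterior = update(posterior, like)
        decided = symbols[int(np.argmax(posterior))] if posterior.max() > threshold else None
        print(posterior, decided)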
    Pangu-Weather: A 3D High-Resolution Model for Fast and Accurate Global Weather Forecast. (arXiv:2211.02556v1 [physics.ao-ph])
    In this paper, we present Pangu-Weather, a deep learning based system for fast and accurate global weather forecast. For this purpose, we establish a data-driven environment by downloading $43$ years of hourly global weather data from the 5th generation of ECMWF reanalysis (ERA5) data and train a few deep neural networks with about $256$ million parameters in total. The spatial resolution of forecast is $0.25^\circ\times0.25^\circ$, comparable to the ECMWF Integrated Forecast Systems (IFS). More importantly, for the first time, an AI-based method outperforms state-of-the-art numerical weather prediction (NWP) methods in terms of accuracy (latitude-weighted RMSE and ACC) of all factors (e.g., geopotential, specific humidity, wind speed, temperature, etc.) and in all time ranges (from one hour to one week). There are two key strategies to improve the prediction accuracy: (i) designing a 3D Earth Specific Transformer (3DEST) architecture that formulates the height (pressure level) information into cubic data, and (ii) applying a hierarchical temporal aggregation algorithm to alleviate cumulative forecast errors. In deterministic forecast, Pangu-Weather shows great advantages for short to medium-range forecast (i.e., forecast time ranges from one hour to one week). Pangu-Weather supports a wide range of downstream forecast scenarios, including extreme weather forecast (e.g., tropical cyclone tracking) and large-member ensemble forecast in real-time. Pangu-Weather not only ends the debate on whether AI-based methods can surpass conventional NWP methods, but also reveals novel directions for improving deep learning weather forecast systems.
    Lifetime policy reuse and the importance of task capacity. (arXiv:2106.01741v2 [cs.LG] UPDATED)
    A long-standing challenge in artificial intelligence is lifelong learning. In lifelong learning, many tasks are presented in sequence and learners must efficiently transfer knowledge between tasks while avoiding catastrophic forgetting over long lifetimes. On these problems, policy reuse and other multi-policy reinforcement learning techniques can learn many tasks. However, they can generate many temporary or permanent policies, resulting in memory issues. Consequently, there is a need for lifetime-scalable methods that continually refine a policy library of a pre-defined size. This paper presents a first approach to lifetime-scalable policy reuse. To pre-select the number of policies, a notion of task capacity, the maximal number of tasks that a policy can accurately solve, is proposed. To evaluate lifetime policy reuse using this method, two state-of-the-art single-actor base-learners are compared: 1) a value-based reinforcement learner, Deep Q-Network (DQN) or Deep Recurrent Q-Network (DRQN); and 2) an actor-critic reinforcement learner, Proximal Policy Optimisation (PPO) with or without Long Short-Term Memory layer. By selecting the number of policies based on task capacity, D(R)QN achieves near-optimal performance with 6 policies in a 27-task MDP domain and 9 policies in an 18-task POMDP domain; with fewer policies, catastrophic forgetting and negative transfer are observed. Due to slow, monotonic improvement, PPO requires fewer policies, 1 policy for the 27-task domain and 4 policies for the 18-task domain, but it learns the tasks with lower accuracy than D(R)QN. These findings validate lifetime-scalable policy reuse and suggest using D(R)QN for larger and PPO for smaller library sizes.
    Towards Alzheimer's Disease Progression Assessment: A Review of Machine Learning Methods. (arXiv:2211.02636v1 [q-bio.NC])
    Alzheimer's Disease (AD), the most devastating neurodegenerative disease worldwide, has reached nearly 10 million new cases annually. Current technology provides unprecedented opportunities to study the progression and etiology of this disease with advances in imaging techniques. With the recent emergence of a society driven by big data and machine learning (ML), researchers have exerted considerable effort to summarize recent advances in ML-based AD diagnosis. Here, we outline some of the most prevalent and recent ML models for assessing the progression of AD and provide insights on the challenges, opportunities, and future directions that could be advantageous to future research in AD using ML.
    SelecMix: Debiased Learning by Contradicting-pair Sampling. (arXiv:2211.02291v1 [cs.CV])
    Neural networks trained with ERM (empirical risk minimization) sometimes learn unintended decision rules, in particular when their training data is biased, i.e., when training labels are strongly correlated with undesirable features. To prevent a network from learning such features, recent methods augment training data such that examples displaying spurious correlations (i.e., bias-aligned examples) become a minority, whereas the other, bias-conflicting examples become prevalent. However, these approaches are sometimes difficult to train and scale to real-world data because they rely on generative models or disentangled representations. We propose an alternative based on mixup, a popular augmentation that creates convex combinations of training examples. Our method, coined SelecMix, applies mixup to contradicting pairs of examples, defined as showing either (i) the same label but dissimilar biased features, or (ii) different labels but similar biased features. Identifying such pairs requires comparing examples with respect to unknown biased features. For this, we utilize an auxiliary contrastive model with the popular heuristic that biased features are learned preferentially during training. Experiments on standard benchmarks demonstrate the effectiveness of the method, in particular when label noise complicates the identification of bias-conflicting examples.
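    A condensed sketch of the pairing-plus-mixup step for case (i) (the selection heuristic is simplified relative to the paper): mix each example with a same-label partner whose features under the auxiliary contrastive model are least similar.

        # Sketch: mixup over contradicting pairs (same label, dissimilar
        # auxiliary features); shapes and alpha are illustrative.
        import torch
        import torch.nn.functional as F

        def selecmix_batch(x, y, feats, alpha=1.0):
            lam = torch.distributions.Beta(alpha, alpha).sample()
            fn = F.normalize(feats, dim=1)
            sim = fn @ fn.T                              # pairwise feature similarity
            idx = torch.empty(len(y), dtype=torch.long)
            for i in range(len(y)):
                same = (y == y[i])
                same[i] = False
                cand = torch.where(same)[0]
                # same label, least similar biased features -> contradicting pair
                idx[i] = cand[sim[i, cand].argmin()] if len(cand) else i
            return lam * x + (1 - lam) * x[idx], y       # labels unchanged: pairs share labels

        x, y = torch.randn(16, 3, 32, 32), torch.randint(0, 2, (16,))
        feats = torch.randn(16, 128)                     # from the auxiliary contrastive model
        mixed, labels = selecmix_batch(x, y, feats)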
    Decorrelation with conditional normalizing flows. (arXiv:2211.02486v1 [hep-ph])
    The sensitivity of many physics analyses can be enhanced by constructing discriminants that preferentially select signal events. Such discriminants become much more useful if they are uncorrelated with a set of protected attributes. In this paper we show that a normalizing flow conditioned on the protected attributes can be used to find a decorrelated representation for any discriminant. As a normalizing flow is invertible, the separation power of the resulting discriminant will be unchanged at any fixed value of the protected attributes. We demonstrate the efficacy of our approach by building supervised jet taggers that produce almost no sculpting in the mass distribution of the background.
    Black-box Coreset Variational Inference. (arXiv:2211.02377v1 [stat.ML])
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.
    The 'Problem' of Human Label Variation: On Ground Truth in Data, Modeling and Evaluation. (arXiv:2211.02570v1 [cs.CL])
    Human variation in labeling is often considered noise. Annotation projects for machine learning (ML) aim at minimizing human label variation, on the assumption that this maximizes data quality and, in turn, machine learning metrics. However, this conventional practice assumes that there exists a ground truth, and neglects that there exists genuine human variation in labeling due to disagreement, subjectivity in annotation or multiple plausible answers. In this position paper, we argue that this big open problem of human label variation persists and critically needs more attention to move our field forward. This is because human label variation impacts all stages of the ML pipeline: data, modeling and evaluation. However, few works consider all of these dimensions jointly; and existing research is fragmented. We reconcile different previously proposed notions of human label variation, provide a repository of publicly-available datasets with un-aggregated labels, depict approaches proposed so far, identify gaps and suggest ways forward. As datasets are becoming increasingly available, we hope that this synthesized view on the 'problem' will lead to an open discussion on possible strategies to devise fundamentally new directions.
    A 3D-Shape Similarity-based Contrastive Approach to Molecular Representation Learning. (arXiv:2211.02130v1 [cs.LG])
    Molecular shape and geometry dictate key biophysical recognition processes, yet many graph neural networks disregard 3D information for molecular property prediction. Here, we propose a new contrastive-learning procedure for graph neural networks, Molecular Contrastive Learning from Shape Similarity (MolCLaSS), that implicitly learns a three-dimensional representation. Rather than directly encoding or targeting three-dimensional poses, MolCLaSS matches a similarity objective based on Gaussian overlays to learn a meaningful representation of molecular shape. We demonstrate how this framework naturally captures key aspects of three-dimensionality that two-dimensional representations cannot and provides an inductive framework for scaffold hopping.
    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v1 [cs.LG])
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available on GitHub.
    HYDRA-HGR: A Hybrid Transformer-based Architecture for Fusion of Macroscopic and Microscopic Neural Drive Information. (arXiv:2211.02619v1 [eess.SP])
    Development of advanced surface Electromyogram (sEMG)-based Human-Machine Interface (HMI) systems is of paramount importance to pave the way towards the emergence of futuristic Cyber-Physical-Human (CPH) worlds. In this context, the main focus of recent literature has been on the development of different Deep Neural Network (DNN)-based architectures that perform Hand Gesture Recognition (HGR) at a macroscopic level (i.e., directly from sEMG signals). At the same time, advancements in the acquisition of High-Density sEMG signals (HD-sEMG) have resulted in a surge of significant interest in sEMG decomposition techniques to extract microscopic neural drive information. However, due to the complexities of sEMG decomposition and the added computational overhead, HGR at the microscopic level is less explored than its aforementioned DNN-based counterparts. In this regard, we propose the HYDRA-HGR framework, a hybrid model that simultaneously extracts a set of temporal and spatial features through two independent Vision Transformer (ViT)-based parallel architectures (the so-called Macro and Micro paths). The Macro Path is trained directly on the pre-processed HD-sEMG signals, while the Micro Path is fed with the peak-to-peak (p-to-p) values of the extracted Motor Unit Action Potentials (MUAPs) of each source. Extracted features at the macroscopic and microscopic levels are then coupled via a Fully Connected (FC) fusion layer. We evaluate the proposed hybrid HYDRA-HGR framework on a recently released HD-sEMG dataset, and show that it significantly outperforms its stand-alone counterparts. The proposed HYDRA-HGR framework achieves an average accuracy of 94.86% for the 250 ms window size, which is 5.52% and 8.22% higher than that of the Macro and Micro paths, respectively.
    Logits are predictive of network type. (arXiv:2211.02272v1 [cs.CV])
    We show that it is possible to predict which deep network has generated a given logit vector with accuracy well above chance. We train a number of networks on a dataset, initialized with random weights or pretrained weights, as well as fine-tuned networks. A classifier is then trained on the logit vectors produced over the training set to map each logit vector to the index of the network that generated it. The classifier is then evaluated on the test set. Results are best with randomly initialized networks, but also generalize to pretrained and fine-tuned ones. Classification accuracy is higher with unnormalized logits than with normalized ones. We find that there is little transfer when applying a classifier to the same networks with different sets of weights. In addition to helping us better understand deep networks and the way they encode uncertainty, we anticipate our findings being useful in some applications (e.g., tailoring an adversarial attack to a certain type of network). Code is available at https://github.com/aliborji/logits.
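    A minimal sketch of this pipeline using scikit-learn and synthetic data: collect logit vectors from several networks over the same dataset, then fit a classifier that maps each logit vector to the index of the network that produced it. The get_logits helper below is a hypothetical stand-in for running each trained network over the dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def get_logits(model_id, n=1000, n_classes=10):
    # Placeholder: in practice, run trained network `model_id` over the
    # dataset and record its unnormalized logit vectors.
    return rng.normal(loc=model_id, scale=1.0, size=(n, n_classes))

X = np.vstack([get_logits(i) for i in range(3)])   # logit vectors
y = np.repeat(np.arange(3), 1000)                  # index of source network

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("accuracy:", clf.score(X_te, y_te))          # well above 1/3 chance
```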
    Network Aware Compute and Memory Allocation in Optically Composable Data Centres with Deep Reinforcement Learning and Graph Neural Networks. (arXiv:2211.02466v1 [cs.NI])
    Resource-disaggregated data centre architectures promise a means of pooling resources remotely within data centres, allowing for both more flexibility and greater resource efficiency underlying the increasingly important infrastructure-as-a-service business. This can be accomplished by using an optically circuit switched backbone in the data centre network (DCN), providing the bandwidth and latency guarantees required to ensure reliable performance when applications are run across non-local resource pools. However, resource allocation in this scenario requires both server-level \emph{and} network-level resources to be co-allocated to requests. The online nature and underlying combinatorial complexity of this problem, alongside the typical scale of DCN topologies, make exact solutions impossible and heuristic-based solutions sub-optimal or non-intuitive to design. We demonstrate that \emph{deep reinforcement learning}, where the policy is modelled by a \emph{graph neural network}, can be used to learn effective \emph{network-aware} and \emph{topologically-scalable} allocation policies end-to-end. Compared to state-of-the-art heuristics for network-aware resource allocation, the method achieves up to $20\%$ higher acceptance ratio; it can achieve the same acceptance ratio as the best performing heuristic with $3\times$ less networking resources available, and can maintain all-around performance when directly applied (with no further training) to DCN topologies with $10^2\times$ more servers than those seen during training.
    High-Resolution Boundary Detection for Medical Image Segmentation with Piece-Wise Two-Sample T-Test Augmented Loss. (arXiv:2211.02419v1 [eess.IV])
    Deep learning methods have contributed substantially to the rapid advancement of medical image segmentation, the quality of which relies on the suitable design of loss functions. Popular loss functions, including the cross-entropy and dice losses, often fall short in boundary detection, thereby limiting high-resolution downstream applications such as automated diagnoses and procedures. We developed a novel loss function tailored to reflect boundary information and enhance boundary detection. As the contrast between segmentation and background regions along the classification boundary naturally induces heterogeneity over the pixels, we propose the piece-wise two-sample t-test augmented (PTA) loss, which is infused with a statistical test for such heterogeneity. We demonstrate the improved boundary detection power of the PTA loss compared to benchmark losses without a t-test component.
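    The core ingredient is a two-sample t-statistic computed over pixels on the two sides of a predicted boundary. A minimal PyTorch sketch of that ingredient follows; the piece-wise construction and the precise way the statistic enters the loss are simplified here and may differ from the paper.

```python
import torch

def welch_t_statistic(a, b, eps=1e-6):
    # Welch's two-sample t-statistic for unequal variances:
    # t = (mean_a - mean_b) / sqrt(var_a/n_a + var_b/n_b)
    v_a, v_b = a.var(unbiased=True), b.var(unbiased=True)
    return (a.mean() - b.mean()) / torch.sqrt(v_a / a.numel() + v_b / b.numel() + eps)

def pta_term(intensities, pred_mask, threshold=0.5):
    # Simplified reduction of the piece-wise construction: contrast pixel
    # intensities predicted as foreground vs. background. A large |t| means
    # the predicted boundary separates heterogeneous regions well, so the
    # term rewards it by lowering the total loss.
    fg = intensities[pred_mask > threshold]
    bg = intensities[pred_mask <= threshold]
    if fg.numel() < 2 or bg.numel() < 2:
        return intensities.new_tensor(0.0)
    return -welch_t_statistic(fg, bg).abs()
```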
    scikit-fda: A Python Package for Functional Data Analysis. (arXiv:2211.02566v1 [stat.CO])
    The library scikit-fda is a Python package for Functional Data Analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-Clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use.
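    As a quick illustration of the scikit-learn-style workflow, here is a minimal usage sketch built on the package's documented FDataGrid representation; the synthetic curves are ours, and plotting assumes matplotlib is installed.

```python
import numpy as np
import skfda

# Ten noisy sinusoidal curves observed on a common grid, stored as a
# discretized functional data object (FDataGrid).
grid = np.linspace(0, 1, 100)
curves = np.sin(2 * np.pi * grid) + 0.1 * np.random.randn(10, 100)
fd = skfda.FDataGrid(data_matrix=curves, grid_points=grid)

# Basic exploratory analysis: the cross-sectional mean function.
mean_curve = fd.mean()
fd.plot()  # visualizes all curves, as in the library's examples
```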
    A Deep Learning Approach to Generating Photospheric Vector Magnetograms of Solar Active Regions for SOHO/MDI Using SDO/HMI and BBSO Data. (arXiv:2211.02278v1 [astro-ph.SR])
    Solar activity is usually caused by the evolution of solar magnetic fields. Magnetic field parameters derived from photospheric vector magnetograms of solar active regions have been used to analyze and forecast eruptive events such as solar flares and coronal mass ejections. Unfortunately, the most recent solar cycle 24 was relatively weak with few large flares, though it is the only solar cycle in which consistent time-sequence vector magnetograms have been available through the Helioseismic and Magnetic Imager (HMI) on board the Solar Dynamics Observatory (SDO) since its launch in 2010. In this paper, we look into another major instrument, namely the Michelson Doppler Imager (MDI) on board the Solar and Heliospheric Observatory (SOHO), which operated from 1996 to 2010. The SOHO/MDI data archive covers the more active solar cycle 23 with many large flares. However, SOHO/MDI provides only line-of-sight (LOS) magnetograms. We propose a new deep learning method, named MagNet, that learns from LOS magnetograms, Bx and By components taken by SDO/HMI, and H-alpha observations collected by the Big Bear Solar Observatory (BBSO), and generates vector components Bx' and By', which form vector magnetograms when combined with the observed LOS data. In this way, we can expand the availability of vector magnetograms to the period from 1996 to the present. Experimental results demonstrate the good performance of the proposed method. To our knowledge, this is the first time that deep learning has been used to generate photospheric vector magnetograms of solar active regions for SOHO/MDI using SDO/HMI and H-alpha data.
    An Efficient FPGA-based Accelerator for Deep Forest. (arXiv:2211.02281v1 [cs.LG])
    Deep Forest is a prominent machine learning algorithm known for its high predictive accuracy. Compared with deep neural networks, Deep Forest involves almost no multiplication operations and performs better on small datasets. However, due to its deep structure and large number of forests, it incurs heavy computation and memory consumption. In this paper, an efficient hardware accelerator is proposed for Deep Forest models; to our knowledge, this is the first work to implement Deep Forest on an FPGA. Firstly, a delicate node computing unit (NCU) is designed to improve inference speed. Secondly, based on the NCU, an efficient architecture and an adaptive dataflow are proposed to alleviate the problem of node computing imbalance in the classification process. Moreover, an optimized storage scheme improves hardware utilization and power efficiency. The proposed design is implemented on an FPGA board, Intel Stratix V, and evaluated on two typical datasets, ADULT and Face Mask Detection. The experimental results show that the proposed design achieves around a 40x speedup over a 40-core high-performance x86 CPU.
    Can RBMs be trained with zero step contrastive divergence?. (arXiv:2211.02174v1 [cs.LG])
    Restricted Boltzmann Machines (RBMs) are probabilistic generative models that can in principle be trained by maximum likelihood, but are usually trained in practice by an approximate algorithm called Contrastive Divergence (CD). In general, a CD-k algorithm estimates an average with respect to the model distribution using a sample obtained from a k-step Markov Chain Monte Carlo algorithm (e.g., block Gibbs sampling) starting from some initial configuration. Choices of k typically vary from 1 to 100. This technical report explores whether it is possible to leverage a simple approximate sampling algorithm together with a modified version of CD in order to train an RBM with k=0. As usual, the method is illustrated on MNIST.
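    For reference, a minimal numpy sketch of the standard CD-k update for a binary RBM, the baseline the report modifies (the k=0 variant is the report's contribution and is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def cd_k_update(v0, W, b, c, k=1, lr=0.01):
    # One CD-k step for a binary RBM with weights W, visible bias b,
    # hidden bias c. v0 is a batch of training vectors.
    ph0 = sigmoid(v0 @ W + c)                  # P(h = 1 | v0), positive phase
    v, ph = v0, ph0
    for _ in range(k):                         # k steps of block Gibbs sampling
        h = (rng.random(ph.shape) < ph).astype(float)
        pv = sigmoid(h @ W.T + b)
        v = (rng.random(pv.shape) < pv).astype(float)
        ph = sigmoid(v @ W + c)
    # Gradient estimate: data statistics minus k-step model statistics.
    W += lr * (v0.T @ ph0 - v.T @ ph) / v0.shape[0]
    b += lr * (v0 - v).mean(axis=0)
    c += lr * (ph0 - ph).mean(axis=0)
    return W, b, c
```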
    Seismic-phase detection using multiple deep learning models for global and local representations of waveforms. (arXiv:2211.02261v1 [physics.geo-ph])
    The detection of earthquakes is a fundamental prerequisite for seismology and contributes to various research areas, such as forecasting earthquakes and understanding the crust/mantle structure. Recent advances in machine learning technologies have enabled the automatic detection of earthquakes from waveform data, and various state-of-the-art deep-learning methods have been applied to this endeavour. In this study, we proposed and tested a novel phase detection method employing deep learning, based on a standard convolutional neural network in a new framework. The novelty of the proposed method is its separate, explicit learning strategy for global and local representations of waveforms, which enhances its robustness and flexibility. Before constructing the model, we identified local representations of the waveform by clustering the waveforms multiple times, optimally partitioning the data points. Based on this result, we considered one global representation and two local representations of the waveform, and trained a separate phase detection model for each. For a new waveform, the overall phase probability is evaluated as the product of the phase probabilities of each model. This additional information on local representations makes the proposed method robust to noise, as demonstrated by its application to the test data. Furthermore, an application to seismic swarm data demonstrated the robust performance of the proposed method compared with other deep learning methods. Finally, in an application to low-frequency earthquakes, we demonstrated the flexibility of the proposed method, which is readily adaptable to the detection of low-frequency earthquakes by retraining only a local model.
    Learning Tool Morphology for Contact-Rich Manipulation Tasks with Differentiable Simulation. (arXiv:2211.02201v1 [cs.RO])
    When humans perform contact-rich manipulation tasks, customized tools are often necessary and play an important role in simplifying the task. For instance, in our daily life, we use various utensils for handling food, such as knives, forks and spoons. Similarly, customized tools for robots may enable them to more easily perform a variety of tasks. Here, we present an end-to-end framework to automatically learn tool morphology for contact-rich manipulation tasks by leveraging differentiable physics simulators. Previous work approached this problem by introducing manually constructed priors that required detailed specification of object 3D model, grasp pose and task description to facilitate the search or optimization. In our approach, we instead only need to define the objective with respect to the task performance and enable learning a robust morphology by randomizing the task variations. The optimization is made tractable by casting this as a continual learning problem. We demonstrate the effectiveness of our method for designing new tools in several scenarios such as winding ropes, flipping a box and pushing peas onto a scoop in simulation. We also validate that the shapes discovered by our method help real robots succeed in these scenarios.
    Robust Time Series Chain Discovery with Incremental Nearest Neighbors. (arXiv:2211.02146v1 [cs.LG])
    Time series motif discovery has been a fundamental task for identifying meaningful repeated patterns in time series. Recently, time series chains were introduced as an expansion of time series motifs to identify continuously evolving patterns in time series data. Informally, a time series chain (TSC) is a temporally ordered set of time series subsequences, in which every subsequence is similar to the one that precedes it, but the last and the first can be arbitrarily dissimilar. TSCs have been shown to reveal latent continuous evolving trends in the time series and to identify precursors of unusual events in complex systems. Despite this promising interpretability, we have observed that existing TSC definitions lack the ability to accurately cover the evolving part of a time series: the discovered chains can be easily cut by noise and can include non-evolving patterns, making them impractical in real-world applications. Inspired by a recent work that tracks how the nearest neighbor of a time series subsequence changes over time, we introduce a new TSC definition that is much more robust to noise in the data, in the sense that it can better locate the evolving patterns while excluding the non-evolving ones. We further propose two new quality metrics to rank the discovered chains. With extensive empirical evaluations, we demonstrate that the proposed TSC definition is significantly more robust to noise than the state of the art, and that the top-ranked chains discovered can reveal meaningful regularities in a variety of real-world datasets.
    Improving the Predictive Performances of $k$ Nearest Neighbors Learning by Efficient Variable Selection. (arXiv:2211.02600v1 [stat.ML])
    This paper computationally demonstrates a sharp improvement in the predictive performance of $k$ nearest neighbors thanks to an efficient forward selection of the predictor variables. We show, on both simulated and real-world data, that this novel approach repeatedly outperforms regression models under stepwise selection.
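    A sketch of such a procedure using scikit-learn's built-in greedy forward selection wrapped around a kNN regressor; the paper's exact selection criterion may differ.

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline

# Friedman #1: only 5 of the 15 features are informative, so forward
# selection should prune the irrelevant ones and sharpen kNN predictions.
X, y = make_friedman1(n_samples=500, n_features=15, random_state=0)

knn = KNeighborsRegressor(n_neighbors=5)
selector = SequentialFeatureSelector(knn, n_features_to_select=5,
                                     direction="forward", cv=5)
model = make_pipeline(selector, knn)
print("CV R^2:", cross_val_score(model, X, y, cv=5).mean())
```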
    Sample-based Uncertainty Quantification with a Single Deterministic Neural Network. (arXiv:2209.08418v2 [cs.LG] UPDATED)
    Development of an accurate, flexible, and numerically efficient uncertainty quantification (UQ) method is one of the fundamental challenges in machine learning. Previously, a UQ method called DISCO Nets was proposed (Bouchacourt et al., 2016), which trains a neural network by minimizing the energy score. In this method, a random noise vector in $\mathbb{R}^{10\text{--}100}$ is concatenated with the original input vector in order to produce a diverse ensemble forecast despite using a single neural network. While this method has shown promising performance on a hand pose estimation task in computer vision, it remained unexplored whether it works as well for regression on tabular data, and how it competes with more recent advanced UQ methods such as NGBoost. In this paper, we propose an improved neural architecture for DISCO Nets that admits faster and more stable training while using only a compact noise vector of dimension $\sim \mathcal{O}(1)$. We benchmark this approach on miscellaneous real-world tabular datasets and confirm that it is competitive with or even superior to standard UQ baselines. Moreover, we observe that it exhibits better point forecast performance than a neural network of the same size trained with the conventional mean squared error. As another advantage of the proposed method, we show that local feature importance computation methods such as SHAP can be easily applied to any subregion of the predictive distribution. A new elementary proof of the validity of using the energy score to learn predictive distributions is also provided.
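    The training objective is the energy score, $\mathrm{ES}(P, y) = \mathbb{E}\|X - y\| - \tfrac{1}{2}\mathbb{E}\|X - X'\|$, estimated from the ensemble of forecasts produced by different noise vectors. A hedged PyTorch sketch of the sample-based estimator:

```python
import torch

def energy_score(samples, y):
    # samples: (n, batch, dim) forecasts from the same network, obtained by
    # concatenating n different noise vectors to the same input.
    # y: (batch, dim) observed targets.
    # ES = E||X - y|| - 0.5 * E||X - X'||, lower is better.
    n = samples.shape[0]
    term1 = (samples - y.unsqueeze(0)).norm(dim=-1).mean()
    pairwise = (samples.unsqueeze(0) - samples.unsqueeze(1)).norm(dim=-1)
    term2 = pairwise.sum() / (n * (n - 1) * samples.shape[1])  # skip i == j
    return term1 - 0.5 * term2
```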
    Decentralized Federated Reinforcement Learning for User-Centric Dynamic TFDD Control. (arXiv:2211.02296v1 [cs.LG])
    The explosive growth of dynamic and heterogeneous data traffic brings great challenges for 5G and beyond mobile networks. To enhance the network capacity and reliability, we propose a learning-based dynamic time-frequency division duplexing (D-TFDD) scheme that adaptively allocates the uplink and downlink time-frequency resources of base stations (BSs) to meet the asymmetric and heterogeneous traffic demands while alleviating the inter-cell interference. We formulate the problem as a decentralized partially observable Markov decision process (Dec-POMDP) that maximizes the long-term expected sum rate under the users' packet dropping ratio constraints. In order to jointly optimize the global resources in a decentralized manner, we propose a federated reinforcement learning (RL) algorithm named federated Wolpertinger deep deterministic policy gradient (FWDDPG) algorithm. The BSs decide their local time-frequency configurations through RL algorithms and achieve global training via exchanging local RL models with their neighbors under a decentralized federated learning framework. Specifically, to deal with the large-scale discrete action space of each BS, we adopt a DDPG-based algorithm to generate actions in a continuous space, and then utilize Wolpertinger policy to reduce the mapping errors from continuous action space back to discrete action space. Simulation results demonstrate the superiority of our proposed algorithm to benchmark algorithms with respect to system sum rate.
    Conformal Quantitative Predictive Monitoring of STL Requirements for Stochastic Processes. (arXiv:2211.02375v1 [eess.SY])
    We consider the problem of predictive monitoring (PM), i.e., predicting at runtime the satisfaction of a desired property from the current system's state. Due to its relevance for runtime safety assurance and online control, PM methods need to be efficient to enable timely interventions against predicted violations, while providing correctness guarantees. We introduce \textit{quantitative predictive monitoring (QPM)}, the first PM method to support stochastic processes and rich specifications given in Signal Temporal Logic (STL). Unlike most of the existing PM techniques that predict whether or not some property $\phi$ is satisfied, QPM provides a quantitative measure of satisfaction by predicting the quantitative (aka robust) STL semantics of $\phi$. QPM derives prediction intervals that are highly efficient to compute and with probabilistic guarantees, in that the intervals cover with arbitrary probability the STL robustness values relative to the stochastic evolution of the system. To do so, we take a machine-learning approach and leverage recent advances in conformal inference for quantile regression, thereby avoiding expensive Monte-Carlo simulations at runtime to estimate the intervals. We also show how our monitors can be combined in a compositional manner to handle composite formulas, without retraining the predictors nor sacrificing the guarantees. We demonstrate the effectiveness and scalability of QPM over a benchmark of four discrete-time stochastic processes with varying degrees of complexity.
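    The conformal quantile regression step at the heart of such interval construction can be sketched as follows (split-conformal CQR in the style of Romano et al., 2019), with the STL robustness values playing the role of the scalar targets; QPM's precise recipe may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def cqr_intervals(X_tr, y_tr, X_cal, y_cal, X_test, alpha=0.1):
    # Fit lower/upper quantile regressors, then inflate them by the
    # empirical (1 - alpha) quantile of the calibration conformity scores,
    # yielding intervals with finite-sample marginal coverage >= 1 - alpha.
    lo = GradientBoostingRegressor(loss="quantile", alpha=alpha / 2)
    hi = GradientBoostingRegressor(loss="quantile", alpha=1 - alpha / 2)
    lo.fit(X_tr, y_tr)
    hi.fit(X_tr, y_tr)

    scores = np.maximum(lo.predict(X_cal) - y_cal, y_cal - hi.predict(X_cal))
    n = len(y_cal)
    level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
    q = np.quantile(scores, level)
    return lo.predict(X_test) - q, hi.predict(X_test) + q
```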
    Materials Property Prediction with Uncertainty Quantification: A Benchmark Study. (arXiv:2211.02235v1 [cond-mat.mtrl-sci])
    Uncertainty quantification (UQ) is of increasing importance in building robust, high-performance, and generalizable materials property prediction models. It can also be used in active learning to train better models by focusing on acquiring new training data from uncertain regions. There are several categories of UQ methods, each considering different types of uncertainty sources. Here we conduct a comprehensive evaluation of UQ methods for graph neural network based materials property prediction and evaluate how well they reflect the uncertainty that we want in error bound estimation or active learning. Our experimental results over four crystal materials datasets (covering formation energy, adsorption energy, total energy, and band gap properties) show that the popular ensemble methods for uncertainty estimation are NOT the best choice for UQ in materials property prediction. For the convenience of the community, all the source code and data sets can be accessed freely at \url{https://github.com/usccolumbia/materialsUQ}.
    Sparse Gaussian Process Hyperparameters: Optimize or Integrate?. (arXiv:2211.02476v1 [stat.ML])
    The kernel function and its hyperparameters are the central model selection choice in a Gaussian process (Rasmussen and Williams, 2006). Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GP paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of Titsias (2009). This work is closely related to Hensman et al. (2015b) but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in the literature, including stochastic variational GPs (SVGPs), and provide an extensive computational analysis.
    Modeling Temporal Data as Continuous Functions with Process Diffusion. (arXiv:2211.02590v1 [cs.LG])
    Temporal data like time series are often observed at irregular intervals which is a challenging setting for existing machine learning methods. To tackle this problem, we view such data as samples from some underlying continuous function. We then define a diffusion-based generative model that adds noise from a predefined stochastic process while preserving the continuity of the resulting underlying function. A neural network is trained to reverse this process which allows us to sample new realizations from the learned distribution. We define suitable stochastic processes as noise sources and introduce novel denoising and score-matching models on processes. Further, we show how to apply this approach to the multivariate probabilistic forecasting and imputation tasks. Through our extensive experiments, we demonstrate that our method outperforms previous models on synthetic and real-world datasets.
    A Latent Space Model for HLA Compatibility Networks in Kidney Transplantation. (arXiv:2211.02234v1 [cs.LG])
    Kidney transplantation is the preferred treatment for people suffering from end-stage renal disease. Successful kidney transplants still fail over time, known as graft failure; however, the time to graft failure, or graft survival time, can vary significantly between different recipients. A significant biological factor affecting graft survival times is the compatibility between the human leukocyte antigens (HLAs) of the donor and recipient. We propose to model HLA compatibility using a network, where the nodes denote different HLAs of the donor and recipient, and edge weights denote compatibilities of the HLAs, which can be positive or negative. The network is indirectly observed, as the edge weights are estimated from transplant outcomes rather than directly observed. We propose a latent space model for such indirectly-observed weighted and signed networks. We demonstrate that our latent space model can not only result in more accurate estimates of HLA compatibilities, but can also be incorporated into survival analysis models to improve accuracy for the downstream task of predicting graft survival times.
    Self-Supervised Learning for Speech Enhancement through Synthesis. (arXiv:2211.02542v1 [eess.AS])
    Modern speech enhancement (SE) networks typically implement noise suppression through time-frequency masking, latent representation masking, or discriminative signal prediction. In contrast, some recent works explore SE via generative speech synthesis, where the system's output is synthesized by a neural vocoder after an inherently lossy feature-denoising step. In this paper, we propose a denoising vocoder (DeVo) approach, where a vocoder accepts noisy representations and learns to directly synthesize clean speech. We leverage rich representations from self-supervised learning (SSL) speech models to discover relevant features. We conduct a candidate search across 15 potential SSL front-ends and subsequently train our vocoder adversarially with the best SSL configuration. Additionally, we demonstrate a causal version capable of running on streaming audio with 10ms latency and minimal performance degradation. Finally, we conduct both objective evaluations and subjective listening studies to show our system improves objective metrics and outperforms an existing state-of-the-art SE model subjectively.
    Real-Time Target Sound Extraction. (arXiv:2211.02250v1 [cs.SD])
    We present the first neural network model to achieve real-time and streaming target sound extraction. To accomplish this, we propose Waveformer, an encoder-decoder architecture with a stack of dilated causal convolution layers as the encoder, and a transformer decoder layer as the decoder. This hybrid architecture uses dilated causal convolutions for processing large receptive fields in a computationally efficient manner, while also benefiting from the performance transformer-based architectures provide. Our evaluations show as much as 2.2-3.3 dB improvement in SI-SNRi compared to the prior models for this task while having a 1.2-4x smaller model size and a 1.5-2x lower runtime. Open-source code and datasets: https://github.com/vb000/Waveformer
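    A schematic PyTorch rendering of the described hybrid (a stack of dilated causal convolutions as the encoder, a transformer decoder layer as the decoder); layer sizes and the query interface are our simplifications, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvBlock(nn.Module):
    # Dilated causal 1-D convolution: left-pad so no future samples leak.
    def __init__(self, channels, dilation, kernel_size=3):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                         # x: (batch, ch, time)
        return torch.relu(self.conv(F.pad(x, (self.pad, 0))))

class WaveformerSketch(nn.Module):
    def __init__(self, channels=64, n_layers=6):
        super().__init__()
        self.encoder = nn.Sequential(*[CausalConvBlock(channels, 2 ** i)
                                       for i in range(n_layers)])
        self.decoder = nn.TransformerDecoderLayer(d_model=channels, nhead=4,
                                                  batch_first=True)

    def forward(self, mixture, query):
        # mixture: (batch, ch, time); query: (batch, q_len, ch) embedding
        # of the target sound class to extract.
        enc = self.encoder(mixture).transpose(1, 2)   # (batch, time, ch)
        return self.decoder(query, enc)               # cross-attend to encoding
```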
    Impact Learning: A Learning Method from Features Impact and Competition. (arXiv:2211.02263v1 [cs.LG])
    Machine learning is the study of computer algorithms that can automatically improve based on data and experience. Machine learning algorithms build a model from sample data, called training data, to make predictions or judgments without being explicitly programmed to do so. A variety of well-known machine learning algorithms have been developed in the field of computer science to analyze data. This paper introduces a new machine learning algorithm called impact learning. Impact learning is a supervised learning algorithm that can be applied to both classification and regression problems, and it is particularly suited to analyzing competitive data. The algorithm learns from competitive situations, where the competition arises from the effects of autonomous features; it is driven by the impacts of the features via the intrinsic rate of natural increase (RNI). We moreover demonstrate the advantages of impact learning over conventional machine learning algorithms.
    Contrastive Value Learning: Implicit Models for Simple Offline RL. (arXiv:2211.02100v1 [cs.LG])
    Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.
    Multilingual Name Entity Recognition and Intent Classification Employing Deep Learning Architectures. (arXiv:2211.02415v1 [cs.CL])
    Named Entity Recognition and Intent Classification are among the most important subfields of Natural Language Processing. Recent research has led to the development of faster, more sophisticated and more efficient models to tackle the problems posed by these two tasks. In this work we explore the effectiveness of two separate families of Deep Learning networks for those tasks: Bidirectional Long Short-Term Memory networks and Transformer-based networks. The models were trained and tested on the ATIS benchmark dataset for both the English and Greek languages. The purpose of this paper is to present a comparative study of the two groups of networks for both languages and to showcase the results of our experiments. The models, being the current state of the art, yielded impressive results and achieved high performance.
    Residual Skill Policies: Learning an Adaptable Skill-based Action Space for Reinforcement Learning for Robotics. (arXiv:2211.02231v1 [cs.RO])
    Skill-based reinforcement learning (RL) has emerged as a promising strategy to leverage prior knowledge for accelerated robot learning. Skills are typically extracted from expert demonstrations and are embedded into a latent space from which they can be sampled as actions by a high-level RL agent. However, this skill space is expansive, and not all skills are relevant for a given robot state, making exploration difficult. Furthermore, the downstream RL agent is limited to learning tasks structurally similar to those used to construct the skill space. We first propose accelerating exploration in the skill space using state-conditioned generative models to directly bias the high-level agent towards sampling only skills relevant to a given state based on prior experience. Next, we propose a low-level residual policy for fine-grained skill adaptation, enabling downstream RL agents to adapt to unseen task variations. Finally, we validate our approach across four challenging manipulation tasks that differ from those used to build the skill space, demonstrating the ability to learn across task variations while significantly accelerating exploration, outperforming prior works. Code and videos are available on our project website: https://krishanrana.github.io/reskill.
    Weisfeiler and Leman go Hyperbolic: Learning Distance Preserving Node Representations. (arXiv:2211.02501v1 [cs.LG])
    In recent years, graph neural networks (GNNs) have emerged as a promising tool for solving machine learning problems on graphs. Most GNNs are members of the family of message passing neural networks (MPNNs). There is a close connection between these models and the Weisfeiler-Leman (WL) test of isomorphism, an algorithm that can successfully test isomorphism for a broad class of graphs. Recently, much research has focused on measuring the expressive power of GNNs. For instance, it has been shown that standard MPNNs are at most as powerful as WL in terms of distinguishing non-isomorphic graphs. However, these studies have largely ignored the distances between the representations of nodes/graphs which are of paramount importance for learning tasks. In this paper, we define a distance function between nodes which is based on the hierarchy produced by the WL algorithm, and propose a model that learns representations which preserve those distances between nodes. Since the emerging hierarchy corresponds to a tree, to learn these representations, we capitalize on recent advances in the field of hyperbolic neural networks. We empirically evaluate the proposed model on standard node and graph classification datasets where it achieves competitive performance with state-of-the-art models.
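    The paper's distance is defined through the hierarchy produced by WL refinement; an illustrative, hedged version is sketched below, where two nodes are closer the longer their WL colors agree across refinement rounds (the exact definition in the paper may differ).

```python
def wl_colors(adj, iterations=3):
    # adj: {node: [neighbors]}. Returns one color map per WL refinement
    # round; colors are canonicalized to small integers via a palette.
    colors = {v: 0 for v in adj}
    history = [dict(colors)]
    for _ in range(iterations):
        sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                for v in adj}
        palette = {}
        colors = {v: palette.setdefault(sig, len(palette))
                  for v, sig in sigs.items()}
        history.append(dict(colors))
    return history

def wl_distance(history, u, v):
    # Hierarchy distance: 2^{-k}, where k is the first refinement round at
    # which the colors of u and v diverge; 0 if they never diverge.
    for k, colors in enumerate(history):
        if colors[u] != colors[v]:
            return 2.0 ** -k
    return 0.0

# Path graph a-b-c: endpoints a and c stay indistinguishable, while the
# middle node b separates from them after one round.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
h = wl_colors(adj)
print(wl_distance(h, "a", "c"), wl_distance(h, "a", "b"))  # 0.0 0.5
```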
    How Does Adaptive Optimization Impact Local Neural Network Geometry?. (arXiv:2211.02254v1 [cs.LG])
    Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need of a new explanation of the success of adaptive methods, one that is different than the conventional wisdom.
    The Path to Autonomous Learners. (arXiv:2211.02403v1 [stat.ML])
    In this paper, we present a new theoretical approach for enabling domain knowledge acquisition by intelligent systems. We introduce a hybrid model that starts with minimal input knowledge in the form of an upper ontology of concepts, stores and reasons over this knowledge through a knowledge graph database and learns new information through a Logic Neural Network. We study the behavior of this architecture when handling new data and show that the final system is capable of enriching its current knowledge as well as extending it to new domains.
    Graph Neural Networks on SPD Manifolds for Motor Imagery Classification: A Perspective from the Time-Frequency Analysis. (arXiv:2211.02641v1 [eess.SP])
    Motor imagery (MI) classification is one of the most widely studied research topics in Electroencephalography (EEG)-based brain-computer interfaces (BCIs), with substantial industrial value. The design of MI-EEG classifiers has changed fundamentally over the past twenty years, and their performance has gradually improved. In particular, owing to the need to characterize the non-Euclidean nature of the signals, the first geometric deep learning (GDL) framework, Tensor-CSPNet, has recently emerged in the BCI literature. In essence, Tensor-CSPNet is a deep learning-based classifier on the second-order statistics of EEGs. In contrast to first-order statistics, using second-order statistics is the classical treatment of EEG signals, and the discriminative information they contain is adequate for MI-EEG classification. In this study, we present another GDL classifier for MI-EEG classification, called Graph-CSPNet, which uses graph-based techniques to simultaneously characterize the EEG signals in both the time and frequency domains. It is built from the perspective of time-frequency analysis, which profoundly influences signal processing and BCI studies. In contrast to Tensor-CSPNet, the architecture of Graph-CSPNet is further simplified, with more flexibility to cope with variable time-frequency resolution for signal segmentation and to capture localized fluctuations. In the experiments, Graph-CSPNet is evaluated on subject-specific scenarios from two widely used MI-EEG datasets and produces near-optimal classification accuracies.
    Rickrolling the Artist: Injecting Invisible Backdoors into Text-Guided Image Generation Models. (arXiv:2211.02408v1 [cs.LG])
    While text-to-image synthesis currently enjoys great popularity among researchers and the general public, the security of these models has been neglected so far. Many text-guided image generation models rely on pre-trained text encoders from external sources, and their users trust that the retrieved models will behave as promised. Unfortunately, this might not be the case. We introduce backdoor attacks against text-guided generative models and demonstrate that their text encoders pose a major tampering risk. Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts. By then inserting a single non-Latin character into the prompt, the adversary can trigger the model to either generate images with pre-defined attributes or images following a hidden, potentially malicious description. We empirically demonstrate the high effectiveness of our attacks on Stable Diffusion and highlight that the injection process of a single backdoor takes less than two minutes. Besides phrasing our approach solely as an attack, it can also force an encoder to forget phrases related to certain concepts, such as nudity or violence, and help to make image generation safer.
    An Improved Time Feedforward Connections Recurrent Neural Networks. (arXiv:2211.02561v1 [cs.NE])
    Recurrent Neural Networks (RNNs) have been widely applied to temporal problems, such as flood forecasting and financial data processing. On the one hand, traditional RNN models amplify the gradient issue due to their strict serial time dependency, making it difficult to realize long-term memory. On the other hand, RNN cells are highly complex, which significantly increases computational complexity and wastes computational resources during model training. In this paper, an improved Time Feedforward Connections Recurrent Neural Network (TFC-RNN) model is first proposed to address the gradient issue. A parallel branch is introduced so that the hidden state at time t-2 is transferred directly to time t without passing through the nonlinear transformation at time t-1. This is effective in improving the long-term dependency modeling of RNNs. Then, a novel cell structure named Single Gate Recurrent Unit (SGRU) is presented, which reduces the number of parameters of the RNN cell and consequently the computational complexity. Applying SGRU within the TFC-RNN framework yields a new TFC-SGRU model that addresses both difficulties. Finally, the performance of the proposed TFC-SGRU is verified through several experiments in terms of long-term memory and anti-interference capabilities. Experimental results demonstrate that the TFC-SGRU model can capture helpful information at a time step of 1500 and effectively filter out noise. The TFC-SGRU also achieves better accuracy than LSTM and GRU models on language processing tasks.
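    The time-feedforward connection can be rendered schematically as a recurrent cell whose update at time t adds a linear (gate-free) path from the hidden state at t-2; dimensions and the exact gating of the SGRU are our simplifications.

```python
import torch
import torch.nn as nn

class TFCCellSketch(nn.Module):
    # Schematic cell with a time-feedforward connection: h_{t-2} is fed
    # linearly into time t, bypassing the nonlinearity applied at t-1.
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.ih = nn.Linear(input_size, hidden_size)
        self.hh = nn.Linear(hidden_size, hidden_size)
        self.skip = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, x_t, h_prev, h_prev2):
        return torch.tanh(self.ih(x_t) + self.hh(h_prev)) + self.skip(h_prev2)
```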
    Distributed Linear Bandits under Communication Constraints. (arXiv:2211.02212v1 [cs.LG])
    We consider distributed linear bandits where $M$ agents learn collaboratively to minimize the overall cumulative regret incurred by all agents. Information exchange is facilitated by a central server, and both the uplink and downlink communications are carried over channels with fixed capacity, which limits the amount of information that can be transmitted in each use of the channels. We investigate the regret-communication trade-off by (i) establishing information-theoretic lower bounds on the required communications (in terms of bits) for achieving a sublinear regret order; (ii) developing an efficient algorithm that achieves the minimum sublinear regret order offered by centralized learning using the minimum order of communications dictated by the information-theoretic lower bounds. For sparse linear bandits, we show a variant of the proposed algorithm offers better regret-communication trade-off by leveraging the sparsity of the problem.
    Unintended Memorization and Timing Attacks in Named Entity Recognition Models. (arXiv:2211.02245v1 [cs.CR])
    Named entity recognition (NER) models are widely used for identifying named entities (e.g., individuals, locations, and other information) in text documents. Machine learning based NER models are increasingly being applied in privacy-sensitive applications that need automatic and scalable identification of sensitive information to redact text for data sharing. In this paper, we study the setting where NER models are available as a black-box service for identifying sensitive information in user documents, and show that these models are vulnerable to membership inference on their training datasets. With updated pre-trained NER models from spaCy, we demonstrate two distinct membership attacks on these models. Our first attack capitalizes on unintended memorization in the NER's underlying neural network, a phenomenon NNs are known to be vulnerable to. Our second attack leverages a timing side-channel to target NER models that maintain vocabularies constructed from the training data. We show that words within the training dataset, in contrast to words not previously seen, follow different functional paths with measurably different execution times. Revealing the membership status of training samples has clear privacy implications; e.g., in text redaction, sensitive words or phrases that should be found and removed are at risk of being detected in the training dataset. Our experimental evaluation includes the redaction of both password and health data, presenting both security risks and privacy/regulatory issues. This is exacerbated by results that show memorization with only a single phrase. We achieve 70% AUC in our first attack on a text redaction use-case, and overwhelming success in the timing attack with 99.23% AUC. Finally, we discuss potential mitigation approaches to enable the safe use of NER models in light of the privacy and security implications of membership inference attacks.
    A $k$-additive Choquet integral-based approach to approximate the SHAP values for local interpretability in machine learning. (arXiv:2211.02166v1 [cs.LG])
    Besides accuracy, recent studies on machine learning models have been addressing the question of how the obtained results can be interpreted. Indeed, while complex machine learning models are able to provide very good results in terms of accuracy even in challenging applications, they are difficult to interpret. Aiming at providing some interpretability for such models, one of the most famous methods, called SHAP, borrows the Shapley value concept from game theory in order to locally explain the predicted outcome for an instance of interest. As the calculation of SHAP values requires computations over all possible coalitions of attributes, its computational cost can be very high. Therefore, a SHAP-based method called Kernel SHAP adopts an efficient strategy that approximates such values with less computational effort. In this paper, we also address local interpretability in machine learning based on Shapley values. Firstly, we provide a straightforward formulation of a SHAP-based method for local interpretability using the Choquet integral, which yields both Shapley values and Shapley interaction indices. Moreover, we adopt the concept of $k$-additive games from game theory, which helps reduce the computational effort when estimating the SHAP values. The obtained results attest that our proposal needs fewer computations on coalitions of attributes to approximate the SHAP values.
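    For contrast with the approximation being proposed, the exact Shapley value enumerates every coalition of attributes; the exponential cost of the sketch below is precisely what Kernel SHAP and the $k$-additive formulation avoid.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    # Exact Shapley values for an n-player game given a value function
    # value(frozenset) -> float. Cost grows as 2^n in the number of
    # attributes, hence the need for approximations.
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for S in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                phi[i] += w * (value(frozenset(S) | {i}) - value(frozenset(S)))
    return phi

# Toy 3-player game: a coalition's value is the square of its size.
print(shapley_values(3, lambda S: len(S) ** 2))  # [3.0, 3.0, 3.0]
```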
    The Benefits of Model-Based Generalization in Reinforcement Learning. (arXiv:2211.02222v1 [cs.LG])
    Model-Based Reinforcement Learning (RL) is widely believed to have the potential to improve sample efficiency by allowing an agent to synthesize large amounts of imagined experience. Experience Replay (ER) can be considered a simple kind of model, which has proved extremely effective at improving the stability and efficiency of deep RL. In principle, a learned parametric model could improve on ER by generalizing from real experience to augment the dataset with additional plausible experience. However, owing to the many design choices involved in empirically successful algorithms, it can be very hard to establish where the benefits are actually coming from. Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a general theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize. In these experiments, we take care to control for other factors in order to isolate, insofar as possible, the benefit of using experience generated by a learned model relative to ER alone.
    Spectral Regularization: an Inductive Bias for Sequence Modeling. (arXiv:2211.02255v1 [cs.LG])
    Various forms of regularization in learning tasks strive for different notions of simplicity. This paper presents a spectral regularization technique, which attaches a unique inductive bias to sequence modeling based on an intuitive concept of simplicity defined in the Chomsky hierarchy. From fundamental connections between Hankel matrices and regular grammars, we propose to use the trace norm of the Hankel matrix, the tightest convex relaxation of its rank, as the spectral regularizer. To cope with the fact that the Hankel matrix is bi-infinite, we propose an unbiased stochastic estimator for its trace norm. Ultimately, we demonstrate experimental results on Tomita grammars, which exhibit the potential benefits of spectral regularization and validate the proposed stochastic estimator.
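    A finite sub-block of the Hankel matrix and its trace (nuclear) norm can be computed directly; the paper's unbiased stochastic estimator for the bi-infinite case is not reproduced in this hedged numpy sketch.

```python
import numpy as np
from itertools import product

def all_strings(alphabet, max_len):
    return ["".join(p) for L in range(max_len + 1)
            for p in product(alphabet, repeat=L)]

def hankel_block(f, alphabet="ab", max_len=2):
    # Finite sub-block of the bi-infinite Hankel matrix: H[u, v] = f(uv),
    # indexed by all prefixes u and suffixes v up to a maximum length.
    strings = all_strings(alphabet, max_len)
    return np.array([[f(u + v) for v in strings] for u in strings])

# Toy sequence function: 1 if a string has as many a's as b's, else 0.
f = lambda s: float(s.count("a") == s.count("b"))
H = hankel_block(f)
trace_norm = np.linalg.svd(H, compute_uv=False).sum()  # sum of singular values
print(trace_norm)
```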
    BERT for Long Documents: A Case Study of Automated ICD Coding. (arXiv:2211.02519v1 [cs.CL])
    Transformer models have achieved great success across many NLP problems. However, previous studies in automated ICD coding concluded that these models fail to outperform some of the earlier solutions such as CNN-based models. In this paper we challenge this conclusion. We present a simple and scalable method to process long text with the existing transformer models such as BERT. We show that this method significantly improves the previous results reported for transformer models in ICD coding, and is able to outperform one of the prominent CNN-based methods.
    Federated Hypergradient Descent. (arXiv:2211.02106v1 [cs.LG])
    In this work, we explore combining automatic hyperparameter tuning and optimization for federated learning (FL) in an online, one-shot procedure. We apply a principled approach to a method for adaptive client learning rate, number of local steps, and batch size. In our federated learning applications, our primary motivations are minimizing communication budget as well as local computational resources in the training pipeline. Conventionally, hyperparameter tuning methods involve at least some degree of trial-and-error, which is known to be sample inefficient. In order to address our motivations, we propose FATHOM (Federated AuTomatic Hyperparameter OptiMization) as a one-shot online procedure. We investigate the challenges and solutions of deriving analytical gradients with respect to the hyperparameters of interest. Our approach is inspired by the fact that, with the exception of local data, we have full knowledge of all components involved in our training process, and this fact can be exploited effectively in our algorithm. We show that FATHOM is more communication efficient than Federated Averaging (FedAvg) with optimized, static-valued hyperparameters, and is also more computationally efficient overall. As a communication-efficient, one-shot online procedure, FATHOM solves the bottleneck of costly communication and limited local computation by eliminating a potentially wasteful tuning process and by optimizing the hyperparameters adaptively throughout training without trial-and-error. We show our numerical results through extensive empirical experiments with the Federated EMNIST-62 (FEMNIST) and Federated Stack Overflow (FSO) datasets, using FedJAX as our baseline framework.
    Benchmarking Quality-Diversity Algorithms on Neuroevolution for Reinforcement Learning. (arXiv:2211.02193v1 [cs.NE])
    We present a Quality-Diversity benchmark suite for Deep Neuroevolution in Reinforcement Learning domains for robot control. The suite includes the definition of tasks, environments, behavioral descriptors, and fitness. We specify different benchmarks based on the complexity of both the task and the agent controlled by a deep neural network. The benchmark uses standard Quality-Diversity metrics, including coverage, QD-score, maximum fitness, and an archive profile metric to quantify the relation between coverage and fitness. We also present how to quantify the robustness of the solutions with respect to environmental stochasticity by introducing corrected versions of the same metrics. We believe that our benchmark is a valuable tool for the community to compare and improve their findings. The source code is available online: https://github.com/adaptive-intelligent-robotics/QDax
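    The benchmark's metrics reduce to simple statistics over the final archive; a minimal sketch, assuming an archive keyed by behavioral cell with the best fitness found in each cell:

```python
def qd_metrics(archive, n_cells):
    # archive: {cell_index: best_fitness_in_cell} for a discretized
    # behavioral descriptor space with n_cells cells.
    coverage = len(archive) / n_cells            # fraction of cells filled
    qd_score = sum(archive.values())             # summed best fitness per cell
    max_fitness = max(archive.values()) if archive else float("-inf")
    return coverage, qd_score, max_fitness

print(qd_metrics({0: 1.5, 3: 2.0, 7: 0.5}, n_cells=10))  # (0.3, 4.0, 2.0)
```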
    FedER: Federated Learning through Experience Replay and Privacy-Preserving Data Synthesis. (arXiv:2206.10048v2 [cs.LG] UPDATED)
    In the medical field, multi-center collaborations are often sought to yield more generalizable findings by leveraging the heterogeneity of patient and clinical data. However, recent privacy regulations hinder the possibility of sharing data, and consequently, of devising machine learning-based solutions that support diagnosis and prognosis. Federated learning (FL) aims at sidestepping this limitation by bringing AI-based solutions to data owners and sharing only local AI models, or parts thereof, that then need to be aggregated. However, most of the existing federated learning solutions are still in their infancy and show several shortcomings, from the lack of a reliable and effective aggregation scheme able to retain the knowledge learned locally, to weak privacy preservation, as real data may be reconstructed from model updates. Furthermore, the majority of these approaches, especially those dealing with medical data, rely on a centralized distributed learning strategy that poses robustness, scalability and trust issues. In this paper we present a federated and decentralized learning strategy, FedER, that, exploiting experience replay and generative adversarial concepts, effectively integrates features from local nodes, providing models able to generalize across multiple datasets while maintaining privacy. FedER is tested on two tasks -- tuberculosis and melanoma classification -- using multiple datasets in order to simulate realistic non-i.i.d. medical data scenarios. Results show that our approach achieves performance comparable to standard (non-federated) learning and significantly outperforms state-of-the-art federated methods in their centralized (thus, more favourable) formulation. Code is available at https://github.com/perceivelab/FedER
    Making Machine Learning Datasets and Models FAIR for HPC: A Methodology and Case Study. (arXiv:2211.02092v1 [cs.LG])
    The FAIR Guiding Principles aim to improve the findability, accessibility, interoperability, and reusability of digital content by making them both human and machine actionable. However, these principles have not yet been broadly adopted in the domain of machine learning-based program analyses and optimizations for High-Performance Computing (HPC). In this paper, we design a methodology to make HPC datasets and machine learning models FAIR after investigating existing FAIRness assessment and improvement techniques. Our methodology includes a comprehensive, quantitative assessment for elected data, followed by concrete, actionable suggestions to improve FAIRness with respect to common issues related to persistent identifiers, rich metadata descriptions, license and provenance information. Moreover, we select a representative training dataset to evaluate our methodology. The experiment shows the methodology can effectively improve the dataset and model's FAIRness from an initial score of 19.1% to the final score of 83.0%.
    MUSTACHE: Multi-Step-Ahead Predictions for Cache Eviction. (arXiv:2211.02177v1 [cs.OS])
    In this work, we propose MUSTACHE, a new page cache replacement algorithm whose logic is learned from observed memory access requests rather than being fixed, as in existing policies. We formulate the page request prediction problem as a categorical time series forecasting task. Then, our method queries the learned page request forecaster to obtain the next $k$ predicted page memory references, to better approximate the optimal B\'el\'ady replacement algorithm. We implement several forecasting techniques using advanced deep learning architectures and integrate the best-performing one into an existing open-source cache simulator. Experiments run on benchmark datasets show that MUSTACHE outperforms the best page replacement heuristic (i.e., exact LRU), improving the cache hit ratio by 1.9% and reducing the number of reads/writes required to handle cache misses by 18.4% and 10.3%.
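    The eviction rule at the core of this scheme can be sketched in a few lines: given the forecaster's next $k$ predicted references, evict the resident page whose next predicted use is farthest away (or absent), mimicking B\'el\'ady's optimal policy on the predicted future.

```python
def choose_victim(cache, predicted_refs):
    # cache: set of resident page ids.
    # predicted_refs: list of the next k page references from the learned
    # forecaster. Approximate Belady: evict the page whose next predicted
    # use lies farthest in the future, or that is never predicted again.
    def next_use(page):
        try:
            return predicted_refs.index(page)
        except ValueError:
            return float("inf")        # not predicted again within k steps
    return max(cache, key=next_use)

print(choose_victim({1, 2, 3}, predicted_refs=[2, 4, 1, 2]))  # evicts 3
```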
    LMentry: A Language Model Benchmark of Elementary Language Tasks. (arXiv:2211.02069v1 [cs.CL])
    As the performance of large language models rapidly improves, benchmarks are getting larger and more complex as well. We present LMentry, a benchmark that avoids this "arms race" by focusing on a compact set of tasks that are trivial to humans, e.g. writing a sentence containing a specific word, identifying which words in a list belong to a specific category, or choosing which of two words is longer. LMentry is specifically designed to provide quick and interpretable insights into the capabilities and robustness of large language models. Our experiments reveal a wide variety of failure cases that, while immediately obvious to humans, pose a considerable challenge for large language models, including OpenAI's latest 175B-parameter instruction-tuned model, TextDavinci002. LMentry complements contemporary evaluation approaches of large language models, providing a quick, automatic, and easy-to-run "unit test", without resorting to large benchmark suites of complex tasks.  ( 2 min )
    Translated Skip Connections -- Expanding the Receptive Fields of Fully Convolutional Neural Networks. (arXiv:2211.02111v1 [cs.CV])
    The effective receptive field of a fully convolutional neural network is an important consideration when designing an architecture, as it defines the portion of the input visible to each convolutional kernel. We propose a neural network module, extending traditional skip connections, called the translated skip connection. Translated skip connections geometrically increase the receptive field of an architecture with negligible impact on both the size of the parameter space and computational complexity. By embedding translated skip connections into a benchmark architecture, we demonstrate that our module matches or outperforms four other approaches to expanding the effective receptive fields of fully convolutional neural networks. We confirm this result across five contemporary image segmentation datasets from disparate domains, including the detection of COVID-19 infection, segmentation of aerial imagery, common object segmentation, and segmentation for self-driving cars.  ( 2 min )
    Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization. (arXiv:2211.02077v1 [cs.CV])
    Self-supervised pre-training recently demonstrates success on large-scale multimodal data, and state-of-the-art contrastive learning methods often enforce the feature consistency from cross-modality inputs, such as video/audio or video/text pairs. Despite its convenience to formulate and leverage in practice, such cross-modality alignment (CMA) is only a weak and noisy supervision, since two modalities can be semantically misaligned even when they are temporally aligned. For example, even in the commonly adopted instructional videos, a speaker can sometimes refer to something that is not visually present in the current frame; and the semantic misalignment would only be more unpredictable for raw videos from the internet. We conjecture that this might cause conflicts and biases among modalities, and may hence prohibit CMA from scaling up to training with larger and more heterogeneous data. This paper first verifies our conjecture by observing that, even in the latest VATT pre-training using only instructional videos, there exist strong gradient conflicts between different CMA losses within the same (video, audio, text) triplet, indicating them as a noisy source of supervision. We then propose to harmonize such gradients via two techniques: (i) cross-modality gradient realignment: modifying different CMA loss gradients for each sample triplet so that their gradient directions are more aligned; and (ii) gradient-based curriculum learning: leveraging the gradient conflict information as an indicator of sample noisiness to develop a curriculum learning strategy that prioritizes training on less noisy sample triplets. Applying those techniques to pre-training VATT on the HowTo100M dataset, we consistently improve its performance on different downstream tasks. Moreover, we are able to scale VATT pre-training to the more complicated, non-narrative YouTube8M dataset to further improve the state of the art.  ( 3 min )
    Safe Real-World Autonomous Driving by Learning to Predict and Plan with a Mixture of Experts. (arXiv:2211.02131v1 [cs.RO])
    The goal of autonomous vehicles is to navigate public roads safely and comfortably. To enforce safety, traditional planning approaches rely on handcrafted rules to generate trajectories. Machine learning-based systems, on the other hand, scale with data and are able to learn more complex behaviors. However, they often ignore that agents and self-driving vehicle trajectory distributions can be leveraged to improve safety. In this paper, we propose modeling a distribution over multiple future trajectories for both the self-driving vehicle and other road agents, using a unified neural network architecture for prediction and planning. During inference, we select the planning trajectory that minimizes a cost taking into account safety and the predicted probabilities. Our approach does not depend on any rule-based planners for trajectory generation or optimization, improves with more training data and is simple to implement. We extensively evaluate our method through a realistic simulator and show that the predicted trajectory distribution corresponds to different driving profiles. We also successfully deploy it on a self-driving vehicle on urban public roads, confirming that it drives safely without compromising comfort. The code for training and testing our model on a public prediction dataset and the video of the road test are available at https://woven.mobi/safepathnet  ( 2 min )
    A Riemannian ADMM. (arXiv:2211.02163v1 [math.OC])
    We consider a class of Riemannian optimization problems where the objective is the sum of a smooth function and a nonsmooth function, considered in the ambient space. This class of problems finds important applications in machine learning and statistics, such as sparse principal component analysis, sparse spectral clustering, and orthogonal dictionary learning. We propose a Riemannian alternating direction method of multipliers (ADMM) to solve this class of problems. Our algorithm adopts easily computable steps in each iteration. The iteration complexity of the proposed algorithm for obtaining an $\epsilon$-stationary point is analyzed under mild assumptions. To the best of our knowledge, this is the first Riemannian ADMM with a provable convergence guarantee for solving Riemannian optimization problems with nonsmooth objectives. Numerical experiments are conducted to demonstrate the advantage of the proposed method.  ( 2 min )
    Geometry and convergence of natural policy gradient methods. (arXiv:2211.02105v1 [math.OC])
    We study the convergence of several natural policy gradient (NPG) methods in infinite-horizon discounted Markov decision processes with regular policy parametrizations. For a variety of NPGs and reward functions we show that the trajectories in state-action space are solutions of gradient flows with respect to Hessian geometries, based on which we obtain global convergence guarantees and convergence rates. In particular, we show linear convergence for unregularized and regularized NPG flows with the metrics proposed by Kakade and Morimura and co-authors by observing that these arise from the Hessian geometries of conditional entropy and entropy respectively. Further, we obtain sublinear convergence rates for Hessian geometries arising from other convex functions like log-barriers. Finally, we interpret the discrete-time NPG methods with regularized rewards as inexact Newton methods if the NPG is defined with respect to the Hessian geometry of the regularizer. This yields local quadratic convergence rates of these methods for step size equal to the penalization strength.  ( 2 min )
    Learning to Rank Graph-based Application Objects on Heterogeneous Memories. (arXiv:2211.02195v1 [cs.LG])
    Persistent Memory (PMEM), also known as Non-Volatile Memory (NVM), can deliver higher density and lower cost per bit when compared with DRAM. Its main drawback is that it is typically slower than DRAM. On the other hand, DRAM has scalability problems due to its cost and energy consumption. Soon, PMEM will likely coexist with DRAM in computer systems, but the biggest challenge is knowing which data to allocate on each type of memory. This paper describes a methodology for identifying and characterizing application objects that have the most influence on the application's performance using Intel Optane DC Persistent Memory. In the first part of our work, we built a tool that automates the profiling and analysis of application objects. In the second part, we built a machine learning model to predict the most critical objects within large-scale graph-based applications. Our results show that using isolated features does not bring the same benefit as using a carefully chosen set of features. By performing data placement using our predictive model, we can reduce the execution time degradation by 12% (average) and 30% (max) when compared to the baseline approach based on the LLC-misses indicator.  ( 2 min )
    Theta-Resonance: A Single-Step Reinforcement Learning Method for Design Space Exploration. (arXiv:2211.02052v1 [cs.LG])
    Given an environment (e.g., a simulator) for evaluating samples in a specified design space and a set of weighted evaluation metrics -- one can use Theta-Resonance, a single-step Markov Decision Process (MDP), to train an intelligent agent producing progressively more optimal samples. In Theta-Resonance, a neural network consumes a constant input tensor and produces a policy as a set of conditional probability density functions (PDFs) for sampling each design dimension. We specialize existing policy gradient algorithms in deep reinforcement learning (D-RL) in order to use evaluation feedback (in terms of cost, penalty or reward) to update our policy network with robust algorithmic stability and minimal design evaluations. We study multiple neural architectures (for our policy network) within the context of a simple SoC design space and propose a method of constructing synthetic space exploration problems to compare and improve design space exploration (DSE) algorithms. Although we only present categorical design spaces, we also outline how to use Theta-Resonance in order to explore continuous and mixed continuous-discrete design spaces.  ( 2 min )
  • Open

    Geometry-Complete Perceptron Networks for 3D Molecular Graphs. (arXiv:2211.02504v1 [cs.LG])
    The field of geometric deep learning has had a profound impact on the development of innovative and powerful graph neural network architectures. Disciplines such as computer vision and computational biology have benefited significantly from such methodological advances, which has led to breakthroughs in scientific domains such as protein structure prediction and design. In this work, we introduce GCPNet, a new geometry-complete, SE(3)-equivariant graph neural network designed for 3D graph representation learning. We demonstrate the state-of-the-art utility and expressiveness of our method on six independent datasets designed for three distinct geometric tasks: protein-ligand binding affinity prediction, protein structure ranking, and Newtonian many-body systems modeling. Our results suggest that GCPNet is a powerful, general method for capturing complex geometric and physical interactions within 3D graphs for downstream prediction tasks. The source code, data, and instructions to train new models or reproduce our results are freely available on GitHub.
    Time series quantile regression using random forests. (arXiv:2211.02273v1 [math.ST])
    We discuss an application of Generalized Random Forests (GRF), proposed by Athey et al. (2019), to quantile regression for time series data. We extend the theoretical consistency results of the GRF for i.i.d. data to time series data. In particular, in the main theorem, based only on the general assumptions for time series data in Davis and Nielsen (2020) and for trees in Athey et al. (2019), we show that the tsQRF (time series Quantile Regression Forests) estimator is consistent. Davis and Nielsen (2020) also discussed the estimation problem using Random Forests (RF) for time series data, but the construction procedure of the RF treated by the GRF is essentially different, and different ideas are used throughout the theoretical proof. In addition, a simulation and a real data analysis were conducted. In the simulation, the accuracy of the conditional quantile estimation was evaluated under time series models. In the real data analysis, using the Nikkei Stock Average, our estimator is demonstrated to be more sensitive than the others in terms of volatility, thus preventing underestimation of risk.  ( 2 min )
    How Does Adaptive Optimization Impact Local Neural Network Geometry?. (arXiv:2211.02254v1 [cs.LG])
    Adaptive optimization methods are well known to achieve superior convergence relative to vanilla gradient methods. The traditional viewpoint in optimization, particularly in convex optimization, explains this improved performance by arguing that, unlike vanilla gradient schemes, adaptive algorithms mimic the behavior of a second-order method by adapting to the global geometry of the loss function. We argue that in the context of neural network optimization, this traditional viewpoint is insufficient. Instead, we advocate for a local trajectory analysis. For iterate trajectories produced by running a generic optimization algorithm OPT, we introduce $R^{\text{OPT}}_{\text{med}}$, a statistic that is analogous to the condition number of the loss Hessian evaluated at the iterates. Through extensive experiments, we show that adaptive methods such as Adam bias the trajectories towards regions where $R^{\text{Adam}}_{\text{med}}$ is small, where one might expect faster convergence. By contrast, vanilla gradient methods like SGD bias the trajectories towards regions where $R^{\text{SGD}}_{\text{med}}$ is comparatively large. We complement these empirical observations with a theoretical result that provably demonstrates this phenomenon in the simplified setting of a two-layer linear network. We view our findings as evidence for the need of a new explanation of the success of adaptive methods, one that is different than the conventional wisdom.  ( 2 min )
    Pushing the limits of self-supervised ResNets: Can we outperform supervised learning without labels on ImageNet?. (arXiv:2201.05119v2 [cs.CV] UPDATED)
    Despite recent progress made by self-supervised methods in representation learning with residual networks, they still underperform supervised learning on the ImageNet classification benchmark, limiting their applicability in performance-critical settings. Building on prior theoretical insights from ReLIC [Mitrovic et al., 2021], we include additional inductive biases into self-supervised learning. We propose a new self-supervised representation learning method, ReLICv2, which combines an explicit invariance loss with a contrastive objective over a varied set of appropriately constructed data views to avoid learning spurious correlations and obtain more informative representations. ReLICv2 achieves $77.1\%$ top-$1$ accuracy on ImageNet under linear evaluation on a ResNet50, thus improving the previous state-of-the-art by absolute $+1.5\%$; on larger ResNet models, ReLICv2 achieves up to $80.6\%$ outperforming previous self-supervised approaches with margins up to $+2.3\%$. Most notably, ReLICv2 is the first unsupervised representation learning method to consistently outperform the supervised baseline in a like-for-like comparison over a range of ResNet architectures. Using ReLICv2, we also learn more robust and transferable representations that generalize better out-of-distribution than previous work, both on image classification and semantic segmentation. Finally, we show that despite using ResNet encoders, ReLICv2 is comparable to state-of-the-art self-supervised vision transformers.  ( 3 min )
    Fully Bayesian inference for latent variable Gaussian process models. (arXiv:2211.02218v1 [stat.ML])
    Real engineering and scientific applications often involve one or more qualitative inputs. Standard Gaussian processes (GPs), however, cannot directly accommodate qualitative inputs. The recently introduced latent variable Gaussian process (LVGP) overcomes this issue by first mapping each qualitative factor to underlying latent variables (LVs), and then uses any standard GP covariance function over these LVs. The LVs are estimated similarly to the other GP hyperparameters through maximum likelihood estimation, and then plugged into the prediction expressions. However, this plug-in approach will not account for uncertainty in estimation of the LVs, which can be significant especially with limited training data. In this work, we develop a fully Bayesian approach for the LVGP model and for visualizing the effects of the qualitative inputs via their LVs. We also develop approximations for scaling up LVGPs and fully Bayesian inference for the LVGP hyperparameters. We conduct numerical studies comparing plug-in inference against fully Bayesian inference over a few engineering models and material design applications. In contrast to previous studies on standard GP modeling that have largely concluded that a fully Bayesian treatment offers limited improvements, our results show that for LVGP modeling it offers significant improvements in prediction accuracy and uncertainty quantification over the plug-in approach.  ( 2 min )
    Spatial-Temporal Convolutional Attention for Mapping Functional Brain Networks. (arXiv:2211.02315v1 [q-bio.NC])
    Using functional magnetic resonance imaging (fMRI) and deep learning to explore functional brain networks (FBNs) has attracted many researchers. However, most of these studies are still based on the temporal correlation between the sources and voxel signals, and research on the dynamics of brain function is lacking. Due to the widespread local correlations in the volumes, FBNs can be generated directly in the spatial domain in a self-supervised manner by using spatial-wise attention (SA), and the resulting FBNs have a higher spatial similarity with templates compared to the classical method. Therefore, we propose a novel Spatial-Temporal Convolutional Attention (STCA) model to discover dynamic FBNs using sliding windows. To validate the performance of the proposed method, we evaluate the approach on the HCP-rest dataset. The results indicate that STCA can be used to discover FBNs in a dynamic way, providing a novel approach to better understand the human brain.  ( 2 min )
    Uncertainty Estimation for Computed Tomography with a Linearised Deep Image Prior. (arXiv:2203.00479v2 [eess.IV] UPDATED)
    Existing deep-learning based tomographic image reconstruction methods do not provide accurate estimates of reconstruction uncertainty, hindering their real-world deployment. This paper develops a method, termed as the linearised deep image prior (DIP), to estimate the uncertainty associated with reconstructions produced by the DIP with total variation regularisation (TV). Specifically, we endow the DIP with conjugate Gaussian-linear model type error-bars computed from a local linearisation of the neural network around its optimised parameters. To preserve conjugacy, we approximate the TV regulariser with a Gaussian surrogate. This approach provides pixel-wise uncertainty estimates and a marginal likelihood objective for hyperparameter optimisation. We demonstrate the method on synthetic data and real-measured high-resolution 2D $\mu$CT data, and show that it provides superior calibration of uncertainty estimates relative to previous probabilistic formulations of the DIP. Our code is available at https://github.com/educating-dip/bayes_dip.  ( 2 min )
    Domain Adaptation under Missingness Shift. (arXiv:2211.02093v1 [cs.LG])
    Rates of missing data often depend on record-keeping policies and thus may change across times and locations, even when the underlying features are comparatively stable. In this paper, we introduce the problem of Domain Adaptation under Missingness Shift (DAMS). Here, (labeled) source data and (unlabeled) target data would be exchangeable but for different missing data mechanisms. We show that when missing data indicators are available, DAMS can reduce to covariate shift. Focusing on the setting where missing data indicators are absent, we establish the following theoretical results for underreporting completely at random: (i) covariate shift is violated (adaptation is required); (ii) the optimal source predictor can perform worse on the target domain than a constant one; (iii) the optimal target predictor can be identified, even when the missingness rates themselves are not; and (iv) for linear models, a simple analytic adjustment yields consistent estimates of the optimal target parameters. In experiments on synthetic and semi-synthetic data, we demonstrate the promise of our methods when assumptions hold. Finally, we discuss a rich family of future extensions.  ( 2 min )
    Black-box Coreset Variational Inference. (arXiv:2211.02377v1 [stat.ML])
    Recent advances in coreset methods have shown that a selection of representative datapoints can replace massive volumes of data for Bayesian inference, preserving the relevant statistical information and significantly accelerating subsequent downstream tasks. Existing variational coreset constructions rely on either selecting subsets of the observed datapoints, or jointly performing approximate inference and optimizing pseudodata in the observed space akin to inducing points methods in Gaussian Processes. So far, both approaches are limited by complexities in evaluating their objectives for general purpose models, and require generating samples from a typically intractable posterior over the coreset throughout inference and testing. In this work, we present a black-box variational inference framework for coresets that overcomes these constraints and enables principled application of variational coresets to intractable models, such as Bayesian neural networks. We apply our techniques to supervised learning problems, and compare them with existing approaches in the literature for data summarization and inference.  ( 2 min )
    scikit-fda: A Python Package for Functional Data Analysis. (arXiv:2211.02566v1 [stat.CO])
    The library scikit-fda is a Python package for Functional Data Analysis (FDA). It provides a comprehensive set of tools for representation, preprocessing, and exploratory analysis of functional data. The library is built upon and integrated in Python's scientific ecosystem. In particular, it conforms to the scikit-learn application programming interface so as to take advantage of the functionality for machine learning provided by this package: pipelines, model selection, and hyperparameter tuning, among others. The scikit-fda package has been released as free and open-source software under a 3-Clause BSD license and is open to contributions from the FDA community. The library's extensive documentation includes step-by-step tutorials and detailed examples of use.  ( 2 min )
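    As a flavour of the interface, a minimal usage sketch (the bundled Berkeley Growth Study dataset and the FDataGrid class are part of the documented API, though details may vary across versions):

        import skfda

        # 93 height curves from the Berkeley Growth Study, sampled at 31 ages.
        X, y = skfda.datasets.fetch_growth(return_X_y=True)

        print(X)               # FDataGrid: functional data on a common grid
        mean_curve = X.mean()  # pointwise mean function, also an FDataGrid
        X.plot()               # exploratory plot of all curves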
    Spectral Regularization: an Inductive Bias for Sequence Modeling. (arXiv:2211.02255v1 [cs.LG])
    Various forms of regularization in learning tasks strive for different notions of simplicity. This paper presents a spectral regularization technique, which attaches a unique inductive bias to sequence modeling based on an intuitive concept of simplicity defined in the Chomsky hierarchy. From fundamental connections between Hankel matrices and regular grammars, we propose to use the trace norm of the Hankel matrix, the tightest convex relaxation of its rank, as the spectral regularizer. To cope with the fact that the Hankel matrix is bi-infinite, we propose an unbiased stochastic estimator for its trace norm. Ultimately, we demonstrate experimental results on Tomita grammars, which exhibit the potential benefits of spectral regularization and validate the proposed stochastic estimator.  ( 2 min )
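    To make the regularizer concrete: for a real-valued function f over strings, the Hankel matrix has entries H[p, s] = f(ps), indexed by prefixes p and suffixes s, and the trace (nuclear) norm is the sum of its singular values. A small NumPy illustration on a finite truncation (the paper itself estimates the norm of the bi-infinite matrix stochastically):

        import numpy as np
        from itertools import product

        def truncated_hankel_trace_norm(f, alphabet, max_len):
            # All strings (as tuples) of length 0..max_len index rows and columns.
            strings = [()] + [s for n in range(1, max_len + 1)
                              for s in product(alphabet, repeat=n)]
            H = np.array([[f(p + s) for s in strings] for p in strings])
            return np.linalg.norm(H, ord="nuc")  # sum of singular values

        # Indicator of the regular language (ab)*, evaluated on short strings.
        f = lambda w: float("".join(w) in {"", "ab", "abab", "ababab", "abababab"})
        print(truncated_hankel_trace_norm(f, ("a", "b"), 2))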
    Sequential Likelihood-Free Inference with Neural Proposal. (arXiv:2010.07604v3 [stat.ME] UPDATED)
    Bayesian inference without likelihood evaluation, or likelihood-free inference, has been a key research topic in simulation studies for obtaining quantitatively validated simulation models on real-world datasets. As the likelihood evaluation is inaccessible, previous papers train an amortized neural network to estimate the ground-truth posterior for the simulation of interest. Training the network and accumulating the dataset alternately in a sequential manner can save the total simulation budget by orders of magnitude. In the data accumulation phase, new simulation inputs are chosen within a portion of the total simulation budget to accumulate upon the collected dataset. This newly accumulated data degenerates because the set of simulation inputs is hardly mixed, and this degenerate data collection process ruins the posterior inference. This paper introduces a new sampling approach, called Neural Proposal (NP), for the simulation input that resolves the biased data collection, as it guarantees i.i.d. sampling. The experiments show the improved performance of our sampler, especially for simulations with multi-modal posteriors.  ( 2 min )
    Improving the Predictive Performances of $k$ Nearest Neighbors Learning by Efficient Variable Selection. (arXiv:2211.02600v1 [stat.ML])
    This paper computationally demonstrates a sharp improvement in the predictive performance of $k$ nearest neighbors thanks to an efficient forward selection of the predictor variables. We show, on both simulated and real-world data, that this novel approach repeatedly outperforms regression models under stepwise selection.  ( 2 min )
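    The abstract is terse, but the underlying recipe -- greedy forward selection of predictors scored by cross-validated kNN performance -- is easy to sketch with scikit-learn (a generic sketch under our reading, not the authors' code):

        from sklearn.datasets import load_diabetes
        from sklearn.feature_selection import SequentialFeatureSelector
        from sklearn.model_selection import cross_val_score
        from sklearn.neighbors import KNeighborsRegressor
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler

        X, y = load_diabetes(return_X_y=True)
        knn = KNeighborsRegressor(n_neighbors=5)

        # Greedily add, one at a time, the predictor that most improves
        # cross-validated kNN performance.
        sfs = SequentialFeatureSelector(knn, n_features_to_select=4,
                                        direction="forward", cv=5)
        model = make_pipeline(StandardScaler(), sfs, knn)
        print(cross_val_score(model, X, y, cv=5).mean())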
    The Path to Autonomous Learners. (arXiv:2211.02403v1 [stat.ML])
    In this paper, we present a new theoretical approach for enabling domain knowledge acquisition by intelligent systems. We introduce a hybrid model that starts with minimal input knowledge in the form of an upper ontology of concepts, stores and reasons over this knowledge through a knowledge graph database and learns new information through a Logic Neural Network. We study the behavior of this architecture when handling new data and show that the final system is capable of enriching its current knowledge as well as extending it to new domains.  ( 2 min )
    Off-Policy Confidence Interval Estimation with Confounded Markov Decision Process. (arXiv:2202.10589v5 [stat.ML] UPDATED)
    This paper is concerned with constructing a confidence interval for a target policy's value offline, based on pre-collected observational data, in infinite horizon settings. Most of the existing works assume that no unmeasured variables exist that confound the observed actions. This assumption, however, is likely to be violated in real applications such as healthcare and technological industries. In this paper, we show that with some auxiliary variables that mediate the effect of actions on the system dynamics, the target policy's value is identifiable in a confounded Markov decision process. Based on this result, we develop an efficient off-policy value estimator that is robust to potential model misspecification, and provide rigorous uncertainty quantification. Our method is justified by theoretical results and by simulated and real datasets obtained from ridesharing companies. A Python implementation of the proposed procedure is available at https://github.com/Mamba413/cope.  ( 2 min )
    Multi-output Gaussian processes for inverse uncertainty quantification in neutron noise analysis. (arXiv:2211.02465v1 [stat.CO])
    In a fissile material, the inherent multiplicity of neutrons born through induced fissions leads to correlations in their detection statistics. The correlations between neutrons can be used to trace back some characteristics of the fissile material. This technique known as neutron noise analysis has applications in nuclear safeguards or waste identification. It provides a non-destructive examination method for an unknown fissile material. This is an example of an inverse problem where the cause is inferred from observations of the consequences. However, neutron correlation measurements are often noisy because of the stochastic nature of the underlying processes. This makes the resolution of the inverse problem more complex since the measurements are strongly dependent on the material characteristics. A minor change in the material properties can lead to very different outputs. Such an inverse problem is said to be ill-posed. For an ill-posed inverse problem the inverse uncertainty quantification is crucial. Indeed, seemingly low noise in the data can lead to strong uncertainties in the estimation of the material properties. Moreover, the analytical framework commonly used to describe neutron correlations relies on strong physical assumptions and is thus inherently biased. This paper addresses dual goals. Firstly, surrogate models are used to improve neutron correlations predictions and quantify the errors on those predictions. Then, the inverse uncertainty quantification is performed to include the impact of measurement error alongside the residual model bias.  ( 2 min )
    When Privacy Meets Partial Information: A Refined Analysis of Differentially Private Bandits. (arXiv:2209.02570v2 [cs.LG] UPDATED)
    We study the problem of multi-armed bandits with $\epsilon$-global Differential Privacy (DP). First, we prove the minimax and problem-dependent regret lower bounds for stochastic and linear bandits that quantify the hardness of bandits with $\epsilon$-global DP. These bounds suggest the existence of two hardness regimes depending on the privacy budget $\epsilon$. In the high-privacy regime (small $\epsilon$), the hardness depends on a coupled effect of privacy and partial information about the reward distributions. In the low-privacy regime (large $\epsilon$), bandits with $\epsilon$-global DP are not harder than the bandits without privacy. For stochastic bandits, we further propose a generic framework to design a near-optimal $\epsilon$-global DP extension of an index-based optimistic bandit algorithm. The framework consists of three ingredients: the Laplace mechanism, arm-dependent adaptive episodes, and usage of only the rewards collected in the last episode for computing private statistics. Specifically, we instantiate $\epsilon$-global DP extensions of UCB and KL-UCB algorithms, namely AdaP-UCB and AdaP-KLUCB. AdaP-KLUCB is the first algorithm that both satisfies $\epsilon$-global DP and yields a regret upper bound that matches the problem-dependent lower bound up to multiplicative constants.  ( 2 min )
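    The first ingredient, the Laplace mechanism, is standard and easy to illustrate. To release an arm's mean reward (rewards assumed in [0, 1]) under $\epsilon$-DP, one adds Laplace noise calibrated to the query's sensitivity; a minimal sketch, not the full AdaP-UCB algorithm:

        import numpy as np

        def private_mean(rewards, epsilon, rng=np.random.default_rng(0)):
            # Changing one reward in [0, 1] moves the mean by at most 1/n,
            # so the L1 sensitivity is 1/n and the Laplace scale is 1/(n*eps).
            n = len(rewards)
            return np.mean(rewards) + rng.laplace(0.0, 1.0 / (n * epsilon))

        print(private_mean([0.3, 0.9, 0.5, 0.7], epsilon=0.5))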
    Concentration inequalities for leave-one-out cross validation. (arXiv:2211.02478v1 [math.ST])
    In this article we prove that estimator stability is enough to show that leave-one-out cross validation is a sound procedure, by providing concentration bounds in a general framework. In particular, we provide concentration bounds beyond Lipschitz continuity assumptions on the loss or on the estimator. In order to obtain our results, we rely on random variables with distribution satisfying the logarithmic Sobolev inequality, providing us a relatively rich class of distributions. We illustrate our method by considering several interesting examples, including linear regression, kernel density estimation, and stabilized / truncated estimators such as stabilized kernel regression.  ( 2 min )
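    For reference, the procedure whose stability is being analyzed is just $n$-fold cross validation with singleton test folds; in scikit-learn it reads (a generic illustration, unrelated to the proofs):

        from sklearn.datasets import load_diabetes
        from sklearn.linear_model import Ridge
        from sklearn.model_selection import LeaveOneOut, cross_val_score

        X, y = load_diabetes(return_X_y=True)
        # One fit per sample; the final score averages the n held-out errors.
        scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=LeaveOneOut(),
                                 scoring="neg_mean_squared_error")
        print(-scores.mean())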
    Approximate exploitability: Learning a best response in large games. (arXiv:2004.09677v5 [cs.LG] UPDATED)
    Researchers have demonstrated that neural networks are vulnerable to adversarial examples and subtle environment changes, both of which one can view as a form of distribution shift. To humans, the resulting errors can look like blunders, eroding trust in these agents. In prior games research, agent evaluation often focused on the in-practice game outcomes. While valuable, such evaluation typically fails to evaluate robustness to worst-case outcomes. Prior research in computer poker has examined how to assess such worst-case performance, both exactly and approximately. Unfortunately, exact computation is infeasible with larger domains, and existing approximations rely on poker-specific knowledge. We introduce ISMCTS-BR, a scalable search-based deep reinforcement learning algorithm for learning a best response to an agent, thereby approximating worst-case performance. We demonstrate the technique in several two-player zero-sum games against a variety of agents, including several AlphaZero-based agents.  ( 2 min )
    Neural Posterior Regularization for Likelihood-Free Inference. (arXiv:2102.07770v2 [cs.LG] UPDATED)
    A simulation is useful when the phenomenon of interest is either expensive to regenerate or irreproducible with the same context. Recently, Bayesian inference on the distribution of the simulation input parameter has been implemented sequentially to minimize the simulation budget required for the task of validating the simulation against the real world. However, Bayesian inference remains challenging when the ground-truth posterior is multi-modal with a high-dimensional simulation output. This paper introduces a regularization technique, namely Neural Posterior Regularization (NPR), which forces the model to explore the input parameter space effectively. Afterward, we provide the closed-form solution of the regularized optimization, which enables analyzing the effect of the regularization. We empirically validate that NPR attains statistically significant gains in benchmark performance across diverse simulation tasks.  ( 2 min )
    Sparse Gaussian Process Hyperparameters: Optimize or Integrate?. (arXiv:2211.02476v1 [stat.ML])
    The kernel function and its hyperparameters are the central model selection choice in a Gaussian process (Rasmussen and Williams, 2006). Typically, the hyperparameters of the kernel are chosen by maximising the marginal likelihood, an approach known as Type-II maximum likelihood (ML-II). However, ML-II does not account for hyperparameter uncertainty, and it is well-known that this can lead to severely biased estimates and an underestimation of predictive uncertainty. While there are several works which employ a fully Bayesian characterisation of GPs, relatively few propose such approaches for the sparse GP paradigm. In this work we propose an algorithm for sparse Gaussian process regression which leverages MCMC to sample from the hyperparameter posterior within the variational inducing point framework of Titsias (2009). This work is closely related to Hensman et al. (2015b) but side-steps the need to sample the inducing points, thereby significantly improving sampling efficiency in the Gaussian likelihood case. We compare this scheme against natural baselines in the literature, including stochastic variational GPs (SVGPs), and provide an extensive computational analysis.  ( 2 min )

  • Open

    "Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning", Lu et al 2022 (also uses inner-monologue)
    submitted by /u/gwern [link] [comments]  ( 53 min )
    EPyMARL with custom environment?
    Hey guys. I have a multi-agent GridWorld environment I implemented (kind of similar to LBForaging) and I've been trying to integrate it with EPyMARL in order to evaluate how state-of-the-art algorithms behave on it, but I've had no success so far. Did anyone use a custom environment with EPyMARL and could give me some tips on how to make it work? Or should I just try to integrate it with another library like MARLLib? submitted by /u/FleshMachine42 [link] [comments]  ( 53 min )
    How do you add mean-zero Gaussian noise to facilitate exploration in DDPG?
    I am reading SpinningUp's blog on DDPG here. They mention the following: "recent results suggest that uncorrelated, mean-zero Gaussian noise works perfectly well". Could someone please explain to me how you would add Gaussian noise to a variable (here, the output of the policy, i.e. the action)? My understanding from Wikipedia is that the noise is normally distributed. I also looked into the reference code from SpinningUp:

        def get_action(o, noise_scale):
            a = ac.act(torch.as_tensor(o, dtype=torch.float32))
            a += noise_scale * np.random.randn(act_dim)
            return np.clip(a, -act_limit, act_limit)

    I specifically have the following questions: 1) What's going on in this line: a += noise_scale * np.random.randn(act_dim)? 2) How does the above code represent a zero-mean Gaussian? Thanks a ton! submitted by /u/Academic-Rent7800 [link] [comments]  ( 50 min )
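    For what it's worth, the line in question draws act_dim independent samples from the standard normal N(0, 1) and scales them by noise_scale, yielding N(0, noise_scale^2) -- i.e. zero-mean Gaussian noise with standard deviation noise_scale. A quick NumPy check (variable names hypothetical):

        import numpy as np

        act_dim, noise_scale = 3, 0.1
        noise = noise_scale * np.random.randn(100_000, act_dim)
        print(noise.mean(axis=0))  # ~[0, 0, 0]: zero mean
        print(noise.std(axis=0))   # ~[0.1, 0.1, 0.1]: std equals noise_scale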
    New to reinforcement learning.
    Hey guys, im new to reinforcement learning (first year elec student). I've been messing around with libraries on the gym environment, but really don't know where to go from here. Any thoughts? My interests are mainly using RL with robotics, so im currently tryna recreate the Cartpole environment irl, so y'all got ideas on different models I can use to train the cartpole problem? submitted by /u/Erebusueue [link] [comments]  ( 56 min )
    PPO converging to picking random actions?
    I am currently working on an optimization algorithm that will minimize an objective function, based on continuous actions chosen by a PPO algorithm (stable baselines). I have had a lot of problems with my algorithm, and have not gotten good results. Because of this, I tested my algorithm by comparing it to random actions. When first testing random actions I found an estimation of its performance (let us say 0.1 objective value). During training, it seems as though the algorithm converges to the exact performance of the random strategy (for example converging to 0.1). What is this? It seems as though PPO just learns a uniform distribution to sample actions from, but is this possible? Have tried different hyperparameters, including entropy coefficient. Thanks in advance! submitted by /u/Embarrassed-Print-13 [link] [comments]  ( 62 min )
    How do i know my dqn is working?
    Hi! I'm taking an AI course and the final project is a competition between agents to see who can perform best on the Avalam board game. I wanted to train a DQN to use it as a heuristic, so I followed basic tutorials on PyTorch and wrote an agent that plays but doesn't seem to learn anything. I tried experimenting with hyperparameters, trained for 4000 games (abt 16 or 17 training steps a game) and still don't see any improvement whatsoever: the agent gets beaten by a simple greedy agent 99% of the time. The model parameters change but don't improve the performance... So for all the smart people out here, how do I know if my code is broken / the network is not adapted / I just need tons and tons of training? Ps: If you want to see what my shoddy code looks like, it's the my_player.py and DQN_heuristics.py on my github https://github.com/mt-clemente/Avalam-DQN-Agent submitted by /u/Secret-Toe-8185 [link] [comments]  ( 53 min )
  • Open

    Nanowire Synapses 30,000x Faster Than Nature’s
    submitted by /u/keghn [link] [comments]  ( 43 min )
    How to make my Neural Network preform better?
    I have a Neural Network that predicts the outcome of E-sports games. The data is from over 900 individual matches from North America with 96 different teams. The results that I get from this are really inaccurate and all over the place, and the loss is really high as well. But I don't know what direction I am even supposed to go in to get better results. My script to build and fit the model:

        def build_model():
            input_layer = Input(shape=(len(train.columns),))
            first_dense = Dense(units=128, activation='relu')(input_layer)
            y1_output = Dense(units=1, name='winner_output')(first_dense)
            second_dense = Dense(units=128, activation='relu')(first_dense)
            y2_output = Dense(units=1, name='B_goals_output')(second_dense)
            third_dense = Dense(units=128, activation='relu')(first_dense)
            y3_output …  ( 49 min )
  • Open

    1min - AI assisted digital painting and AI trained/generated voices
    submitted by /u/sEi_ [link] [comments]  ( 43 min )
    Bill Gates on AI
    submitted by /u/Kehlstrasbourg [link] [comments]  ( 43 min )
    AI Dream 104 - The Most Intense and Smooth Psychedelic Trip You'll Ever ...
    submitted by /u/LordPewPew777 [link] [comments]  ( 42 min )
    Herzog and Žižek become uncanny AI bots trapped in endless conversation
    submitted by /u/Carbonboy [link] [comments]  ( 44 min )
    Is it hard to make a program do a search for me ?
    I know what you might be thinking hahaha that is just what google do... But I would like to make a small project like this, not a search engine, but something like... ■ What would you like to learn today ? I would give a few options to the user.. I have no programming knowledge yet, but I want to learn submitted by /u/ImadeapromiseMrfrod [link] [comments]  ( 45 min )
    A.I. Mathematician? A Simplified Look at DeepMind’s AlphaTensor
    submitted by /u/VivaNoi [link] [comments]  ( 42 min )
    What is Google’s generative AI strategy?
    submitted by /u/bendee983 [link] [comments]  ( 47 min )
    The Wintress of Winter | Artificial AI Art
    submitted by /u/AubreBrumfield [link] [comments]  ( 41 min )
    AI in information platforms: an intelligent approach to detect and filter fake accounts, news, and…
    submitted by /u/SamuelSmith1416 [link] [comments]  ( 43 min )
    Elon Musk as imagined by an AI - amazing results
    submitted by /u/treyratcliff [link] [comments]  ( 42 min )
    Teaser trailer for "The Diary of Sisyphus" (2023), the world's first feature film written by an artificial intelligence (GPT-NEO) and produced Briefcase Films, my indie film studio based in Northern Italy
    submitted by /u/HeyThatsStef [link] [comments]  ( 41 min )
    Is there a fundamental difference in the structure between restricted AI and general AI ?
    submitted by /u/anonymous0x9 [link] [comments]  ( 45 min )
    ai that targets anyone who doesn’t help
    hey reddit ive never really used this before but i thought this would be the best place to ask what is that one ai phenomenon/theory that says anyone who doesnt help build this ai will be targeted by it??? im pretty sure it starts with the letter r and is two words please help meeeee i found out about this MONTHS ago and i just can’t remember what its called and nothing ive searched up has helped me submitted by /u/lixiesplug [link] [comments]  ( 42 min )
    Researchers from ETH Zurich and Microsoft Propose ‘LaMAR,’ a New Benchmark for Localization and Mapping for Augmented Reality
    submitted by /u/ai-lover [link] [comments]  ( 45 min )
    I've spoken with numerous creators & artists about AI art, and whether they're positive or pessimistic about AI, what's common is some level of overwhelm caused by the sheer pace of advancement. They say rising tides lift all boats, so better be on a boat or you risk drowning 🛶
    Discuss: How do you feel about the pace of AI advancements taking place today? View Poll submitted by /u/imaginfinity [link] [comments]  ( 49 min )
  • Open

    [D] Medium Article: How to code Temporal Distribution Characterization (TDC) for time series?
    There is no need to belabor the value of forecasting time series data with deep learning methods. I wrote an article about how we can handle variations in statistical features (particularly from a distribution perspective), which can otherwise lead to disaster. The proposed model is named AdaRNN and includes two main components, TDC and TDM. This article illustrates TDC in detail. https://medium.com/@rezayazdanfar/how-to-code-temporal-distribution-characterization-tdc-for-time-series-916855cc2d6a submitted by /u/rezayazdanfar [link] [comments]  ( 57 min )
    [D] At what tasks are models better than humans given the same amount of data?
    Hey guys, I've been thinking about this question recently. There are tasks that ML-based models outperform humans at, such as some image classification benchmarks and a bunch of games including chess, while humans are better at tons of other things like abstract math. But for which of these tasks can ML models outperform us at given the same amount of data as we have? Like chess for example, can AlphaZero outperform humans if it had as many games of pretraining as, say, Magnus Carlsen has had? I'd imagine that Stockfish might be able to without pretraining just by virtue of computing so many positions ahead, but I'm not sure AlphaZero could, because its tree/policy and value NNs might not be that optimized. As another example, its well-known that humans are generally pretty great at few-shot learning in, say, image classification; we can distinguish, say, dogs from cats given only a couple input examples. submitted by /u/billjames1685 [link] [comments]  ( 63 min )
    [P] COCO captions translation to Nepali using Meta AI's NLLB model
    Fun weekend project: Used Meta AI's NLLB model to translate COCO captions from English to Nepali. Can be a potential data generation method for low-resource languages. Notebook URL: https://github.com/pmgautam/coco-captions-translation/blob/main/english_to_nepali_translation.ipynb Any suggestions/feedback are welcome. submitted by /u/p1g1 [link] [comments]  ( 55 min )
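    For anyone wanting to reproduce the idea without opening the notebook, a minimal Hugging Face transformers sketch (the model id and the NLLB codes eng_Latn / npi_Deva are as published by Meta; the token-id lookup may differ slightly across library versions):

        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

        model_id = "facebook/nllb-200-distilled-600M"
        tokenizer = AutoTokenizer.from_pretrained(model_id, src_lang="eng_Latn")
        model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

        inputs = tokenizer("A man is riding a bicycle on the street.",
                           return_tensors="pt")
        # Force the decoder to start generating in Nepali (Devanagari script).
        out = model.generate(
            **inputs,
            forced_bos_token_id=tokenizer.convert_tokens_to_ids("npi_Deva"),
            max_length=64,
        )
        print(tokenizer.batch_decode(out, skip_special_tokens=True)[0])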
    [D] Do you think there is a competitive future for smaller, locally trained/served models?
    Increasingly large/deep models for Sound/Image/Language/Games are all the rage (you know what I'm talking about). This is concerning on some level:
    - Focus shifts to amount of data, instead of curation
    - Require more (expensive) hardware to train, out of reach for many
    - API-ization of functionality leads to large scale monitoring by centralized providers
    Let's take OpenAI Codex / Github Copilot as an example. Disregarding the licensing questions for a bit, amazing as this model is, there are some drawbacks observed when using it:
    - It can generate outdated code or API calls, especially for evolving languages
    - Known vulnerabilities observed in generated code, e.g. MITRE weaknesses
    - No local use of the service, unless replicated and self hosted (expensive)
    Now my questions are these: Do you think there is a case to be made for smaller models fed with higher quality data? Can we substantially reduce the number of parameters if we do better with the input? For example, a Codex-like model for a single language only. Or do you think that the pre-training of large models and then refining to task (e.g. GPT or maybe programmer -> specific language) will continue to dominate because we require the amount of parameters for the tasks at hand anyway? An AGI that we just teach "courses" if you like. submitted by /u/naequs [link] [comments]  ( 57 min )
    [R] Meta Labeling Architectures
    This video establishes several heterogeneous architectures to account for key aspects of meta-labeling. They serve as a guide for practitioners in the model development process, as well as for researchers to further build on these ideas. https://youtu.be/1fYzABjsNFk submitted by /u/kingsley_heath [link] [comments]  ( 57 min )
    [D] Understanding Syntactic divergence
    Dear Researchers, I am trying to read a paper https://arxiv.org/abs/2004.14444 Section 6.1 of the paper describes Syntactic divergence. I have confusion regarding the distribution graphs and split of the dataset. If a paper proposes a new test set for the NLP domain under the distribution shift concept, should the Syntactic divergence distribution be similar across all proposed test sets? If the Syntactic divergence distribution of the proposed test set is different, what does it mean? If my understanding is correct, Syntactic divergence is a difficulty measure metric, so if the Syntactic divergence distribution of new test sets differs a lot from the original test set, does it mean that the proposed test set is more difficult? Is it a good indication? In summary, should the distribution of new test sets be similar, or can it be different? Could someone from the NLP field help me to clear my doubts? Thank you! submitted by /u/Alternative-File-146 [link] [comments]  ( 58 min )
    [D] On the spot normalization for batch based runtime
    Hello there, Hope you're all having a great day :^) I was wondering whether applying on-the-spot normalization (say 0-mean 1-std) would produce better results than reusing the mean and std derived from the training set, when applied to large enough batches of production/inference data. I'm basing this on several assumptions: First, distribution shifts do occur from time to time; for example, some arbitrary variable x may become larger and larger over time (maybe a price that is being affected by inflation etc., just an example). If you are performing inference on large enough batches (don't ask me what would be considered large enough, say 100 samples for reference), you'll be more likely to squash your samples into a more similar range for the model (say, due to an increase in the mentioned variable x, the training set's mean won't be able to center x around 0, which may lead the model's decision boundary to behave suboptimally). Second, if the production distribution doesn't change, this should have no impact (assuming batch statistics are solid) on the model's output (naturally a little bit of variance here and there, but it shouldn't be a huge issue, I think). Assuming such a strategy has potential to pay off, I suppose evaluating the model on holdout data would follow the same protocol (using validation and test set stats to normalize when evaluating). Obviously, the best thing one could do is to identify the shift and adjust the model/pipeline, but that is neither here nor there (assume model deployment/retraining is very expensive etc). Apologies for the long post, I just had an early jog and "injected" myself with tons of coffee. Cheers, submitted by /u/Slowai [link] [comments]  ( 57 min )
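    A tiny NumPy illustration of the trade-off being described, on purely synthetic data where the production batch has drifted upward relative to training:

        import numpy as np

        rng = np.random.default_rng(0)
        train = rng.normal(0.0, 1.0, size=(10_000, 1))
        batch = rng.normal(3.0, 1.0, size=(100, 1))  # drifted production batch

        mu, sd = train.mean(0), train.std(0)
        frozen = (batch - mu) / sd                            # training-set stats
        on_the_spot = (batch - batch.mean(0)) / batch.std(0)  # per-batch stats

        print(frozen.mean())       # ~3: inputs land far off-center
        print(on_the_spot.mean())  # ~0: batch re-centered, as the post suggests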
    [D] Fighting Microsoft Copilot: The No-AI 3-Clause License
    submitted by /u/BUGFIX-66 [link] [comments]  ( 55 min )
    [Research] Is there any paper comparing classification, detection and segmentation?
    Is there any paper benchmarking detection, classification and segmentation for solving the same problem at different levels of resolution? submitted by /u/rockabby [link] [comments]  ( 61 min )
    [D] What's the best speech to speech deep fake voice project?
    Most of the ones that I've seen are only text to speech. Would be great if there were one that took into account the source voice's inflections, pacing, etc. So far I've only been able to find StarGANv2. Which one redditor used to create this. Is this the best there is or are there better alternatives? Thanks! EDIT: Digging a bit deeper, I found this project called IMS-Toucan which has several very impressive demos on huggingface. submitted by /u/aerialbits [link] [comments]  ( 55 min )
  • Open

    Cost-effective data preparation for machine learning using SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler is a capability of Amazon SageMaker that makes it faster for data scientists and engineers to prepare high-quality features for machine learning (ML) applications via a visual interface. Data Wrangler reduces the time it takes to aggregate and prepare data for ML from weeks to minutes. With Data Wrangler, you can […]  ( 12 min )
    Generate images from text with the Stable Diffusion model on Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that […]  ( 8 min )
    Run text generation with GPT and Bloom models on Amazon SageMaker JumpStart
    In December 2020, AWS announced the general availability of Amazon SageMaker JumpStart, a capability of Amazon SageMaker that helps you quickly and easily get started with machine learning (ML). JumpStart provides one-click fine-tuning and deployment of a wide variety of pre-trained models across popular ML tasks, as well as a selection of end-to-end solutions that […]  ( 8 min )
  • Open

    A whole new world of learning via MIT OpenCourseWare videos
    “I get the chance to not only watch the future happen, but I can actually be a part of it and create it,” says Ugandan entrepreneur Emmanuel Kasigazi.  ( 9 min )
  • Open

    Carlson’s elliptic integrals
    Although it's a little fuzzy to say exactly which functions are "special" functions, these are generally functions that come up frequently in applications, that have numerous symmetries, and that satisfy many useful identities. The copious interconnections between special functions that are part of what makes them special also make these functions hard to organize: everything […] Carlson’s elliptic integrals first appeared on John D. Cook.  ( 5 min )
    Kinds of elliptic integrals
    There are three fundamental kinds of elliptic integrals, and these are prosaically but unhelpfully called elliptic integrals of the first kind, the second kind, and the third kind. These names sound odd to modern ears, but it’s no different than classical musicians naming symphonies Symphony No. 1, Symphony No. 2, etc. This post covers the […] Kinds of elliptic integrals first appeared on John D. Cook.  ( 6 min )
  • Open

    Social Media Sentiment Analysis Using Twitter Datasets
    Hundreds of thousands of raw data files are uploaded by users every day to social media sites. Online user data provides access to an enormous amount of information regarding products, services, places, and events, which makes it suitable for sentiment analysis. Valuable information can be extracted by analyzing the sentiment of the data. The post Social Media Sentiment Analysis Using Twitter Datasets appeared first on Data Science Central.  ( 22 min )
    5 THINGS YOU SHOULD EXPECT FROM YOUR DENTAL LAB
    No matter how skilled a dentist is, working without dependable dental lab services would make it very difficult for them to accomplish their responsibilities effectively. One example of how a dentist can lose such clients is by making a patient wait too long for their dentures as a result of subpar dental lab operations. If… The post 5 THINGS YOU SHOULD EXPECT FROM YOUR DENTAL LAB appeared first on Data Science Central.  ( 20 min )
  • Open

    Tiny Computer, Huge Learnings: Students at SMU Build Baby Supercomputer With NVIDIA Jetson Edge AI Platform
    “DIY” and “supercomputer” aren’t words typically used together. But a do-it-yourself supercomputer is exactly what students built at Southern Methodist University, in Dallas, using 16 NVIDIA Jetson Nano modules, four power supplies, more than 60 handmade wires, a network switch and some cooling fans. The project, dubbed SMU’s “baby supercomputer,” aims to help educate those Read article > The post Tiny Computer, Huge Learnings: Students at SMU Build Baby Supercomputer With NVIDIA Jetson Edge AI Platform appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Data-driven Approaches to Surrogate Machine Learning Model Development. (arXiv:2210.02631v3 [cs.LG] UPDATED)
    We demonstrate the adaptation of three established methods to the field of surrogate machine learning model development. These methods are data augmentation, custom loss functions and transfer learning. Each of these methods has seen widespread use in the field of machine learning; however, here we apply them specifically to surrogate machine learning model development. The machine learning model that forms the basis of this work was intended to surrogate a traditional engineering model used in the UK nuclear industry. This model has previously been hampered by poor performance due to limited training data. Here, we demonstrate that through a combination of additional techniques, model performance can be significantly improved. We show that each of the aforementioned techniques has utility in its own right and in combination with the others. However, we see them best applied as part of a transfer learning operation. Five pre-trained surrogate models produced prior to this research were further trained with an augmented dataset and with our custom loss function. Through the combination of all three techniques, we see an improvement of at least $38\%$ in performance across the five models.  ( 2 min )

  • Open

    How to split the text into main points based on given main points?
    I am working on a problem where I have two texts, T1 and T2. T1 contains some important points that I have entered. How can I make sure that T2 covers those points? I am aware of algorithms like cosine similarity, Jaccard, and BERT for semantic similarity, but the problem is that they apply to the whole text, whereas I want point-wise similarity, i.e. T2 must contain the T1 points although the order and wording may differ a bit. By points I mean bullet points covering discrete concepts, and I basically want to check how many discrete concepts in T1's points are covered in T2, where they could be in a single sentence or spread out across multiple sentences. Example: T1 could have the following two points: - The Queen reigned from 1952 to 2022. - The Queen was the second longest reigning monarch. Now T2 could either be: The Queen was the second longest reigning monarch with her reign spanning 1943 to 2022. or - The Queen reigned from 1952 to 2022. - She was Britain's second longest monarch. In both these cases, T2 should be considered to contain both points in T1. submitted by /u/Status-Sprinkles1236 [link] [comments]  ( 44 min )
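    One way to operationalize this point-wise check (a sketch using the sentence-transformers library; the model name and the 0.7 threshold are illustrative choices, not prescriptions): embed each T1 point and each T2 sentence, then score every point by its best-matching sentence.

        # Sketch: which T1 bullet points are covered somewhere in T2?
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")

        t1_points = [
            "The Queen reigned from 1952 to 2022.",
            "The Queen was the second longest reigning monarch.",
        ]
        t2_sentences = [
            "The Queen was the second longest reigning monarch "
            "with her reign spanning 1943 to 2022.",
        ]

        p_emb = model.encode(t1_points, convert_to_tensor=True)
        s_emb = model.encode(t2_sentences, convert_to_tensor=True)

        sims = util.cos_sim(p_emb, s_emb)   # shape: (n_points, n_sentences)
        for point, row in zip(t1_points, sims):
            best = row.max().item()
            print(f"{best:.2f}  {'covered' if best > 0.7 else 'missing'}  {point}")

    Note that plain embedding similarity will happily score the 1943 variant as covering the 1952 point; catching factual contradictions like that calls for an NLI (entailment) model rather than cosine similarity.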
    Computer Science, Cognitive Science, or Statistics: which undergraduate degree best fits a career in AI
    My college offers all three, and I’m having a hard time choosing. Our Comp Sci curriculum is focused on software, and has some AI electives. It includes a decent amount of statistics. Our Cog Sci curriculum is focused on cognitive psychology, with linguistics, philosophy, and basic AI thrown in (no rigorous statistics). The Statistics major has some programming, but is fully focused on the statistics that go into Machine Learning. Which major would provide the best stepping stone into a career in AI? submitted by /u/Mani0770 [link] [comments]  ( 43 min )
    AI Dream 92 - AI Discovers Rare Alien Civilization
    submitted by /u/LordPewPew777 [link] [comments]  ( 43 min )
    AI/ML programs or courses
    I am a 40-year-old product designer, iOS eng, React eng. I want to shift my career into AI/ML. I am wondering what programs or courses I should look into. I have looked at Berkeley, Stanford, MIT, Coursera, Udemy, etc. and am fairly aware of the options, but wanted to ask this community. I typically would just learn on YouTube, GitHub, Stack Overflow, etc., but am considering making a more intentional and full-time shift. I learn fine online, but am considering an IRL program to fully immerse and dedicate 1-2 years of my time to it. So IRL or online courses? Thank you submitted by /u/dubodubo [link] [comments]  ( 45 min )
    Mind's Eye: How physics data improves large language models
    submitted by /u/Peaking_AI [link] [comments]  ( 41 min )
    what's the most used image processing algorithm in face recognition (not using any kind of AI)?
    So, I have this college project (an image processing project) for which I am required to write face recognition code in Python without using any library related to AI, machine learning, deep learning, etc.; it has to be all about image processing. My professor said that before 2010 there was not much use of AI for face recognition, so face recognition was done with classical image processing algorithms, and I am required to implement a face recognition algorithm using these old methods. But when I searched, most of the GitHub repos or algorithms on Google use libraries based on AI (like OpenCV, TensorFlow, etc.). So I was wondering if anyone could point me to one of these old references, scientific papers, or a GitHub repo for guidance in implementing my program. Any help would be appreciated, thanks in advance. submitted by /u/abdosalm [link] [comments]  ( 44 min )
    what shouldn't AI ever be allowed to do?
    Sooner or later it will be able to do everything humans do, perhaps even better if it can't already: art, music, engineering, mundane tasks. But what should be the one thing it is never allowed to do (not because it can't; this isn't about AI being incapable)? submitted by /u/Absolutelynobody54 [link] [comments]  ( 54 min )
    Some of you may be interested in our Discord server using AI art generation in a meaningful way
    https://discord.gg/ak24xSdr Some of you may be interested in using AI art generation in a meaningful way, using symbols from tarot and Jungian archetypes. Feel free to join a server where we explore that and share ideas and artwork! submitted by /u/fingin [link] [comments]  ( 42 min )
    AI Dreamer - The Fastest Stable Diffusion App on iOS and macOS
    https://preview.redd.it/640mnpxbaay91.png?width=1242&format=png&auto=webp&s=81e3e37543f6413cf84705f6a13272c457e19cae Hello Artificial! I would like to ask you for feedback on my app and give you some free credits! AI Dreamer iOS/macOS AppStore Link AI Dreamer is the fastest way (>30 it/s) to generate Stable Diffusion content on iOS and macOS. It allows both text2img and img2img generation, with lots of hyperparameters to tweak. Please let me know how you feel about the app and what you would like to change about it. --- Lastly, you can get your 50 free credits by going to the credits screen and holding the logo with one finger for 5 seconds. That's it! --- Have fun! submitted by /u/g_surma [link] [comments]  ( 43 min )
    Remember those A.I generated images ?
    submitted by /u/LABluez17 [link] [comments]  ( 40 min )
    "Learning to Imitate" blog post from Stanford AI Lab
    Hi all, my very first blog post - "Learning to Imitate" - is available to read on the Stanford AI blog. The blog post offers an easy & insightful read for anyone interested in AI. It highlights issues with current AI systems and ways to create better human-centric AI by using data-driven learning. It also presents our new framework "Inverse Q-Learning", a major theoretical advance over Inverse Reinforcement Learning, for training AI agents using sparse data. This framework has also been used to create the best AI agent for playing Minecraft using a few expert demos. Please read and share!! Happy to answer any follow-ups here or on DM 😊 Blog: https://ai.stanford.edu/blog/learning-to-imitate/ (Twitter thread) submitted by /u/DragonLord9 [link] [comments]  ( 41 min )
  • Open

    [D] Git Re-Basin Paper Accused of Misinformation
    I read the paper when it was posted here a couple of months ago. I thought it was pretty interesting work, so I was curious to check whether the authors submitted to ICLR 2023. I found out that the paper received glowing scores of 8, 8, 10 from the reviewers, but another researcher apparently didn't agree with the reviews in the public comment section here. They accused the paper of rehashing prior works, stating exaggerated and deceptive claims about results that are already known. I need to take a closer look at the comment to see if its points are valid. submitted by /u/fryingnem0 [link] [comments]  ( 55 min )
    [P] A Stable Diffusion implementation of Paint-with-words: a method from NVIDIA that generates images from a text-labeled segmentation map.
    Hi. Very recently researchers from NVIDIA released their work on text-to-image diffusion models, eDiffi (https://deepimagination.cc/eDiffi/). In their paper they propose various methods, including paint-with-words, which lets you generate an image from an arbitrary text-labeled segmentation map. Check out their paper for more details. Unfortunately, their code + eDiffi models are not available. However, Stable Diffusion can do just the same, as both have a cross-attention module. I've tried to make it work with Stable Diffusion, and it worked! So I wanted to share the results and code. Please have a look if you are interested! https://github.com/cloneofsimo/paint-with-words-sd Here are some results with sd-v1.4. "realistic photo of a dog, cat, tree, with beautiful sky, on sandy ground", in order of cat-dog. "realistic photo of a dog, cat, tree, with beautiful sky, on sandy ground", in order of dog-cat. A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed. Result from my implementation + Stable Diffusion. A digital painting of a half-frozen lake near mountains under a full moon and aurora. A boat is in the middle of the lake. Highly detailed. Result from eDiffi. submitted by /u/cloneofsimo [link] [comments]  ( 57 min )
    [P] Transcribe any podcast episode in just 1 minute with optimized OpenAI/whisper
    submitted by /u/thundergolfer [link] [comments]  ( 54 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 57 min )
    [D] Has anyone tried coding latent diffusion from scratch? or tried other conditioning information aside from image classes and text?
    I'm planning to see how a latent diffusion model would perform in the image reconstruction from brain activity task. Specifically, the image generation would be conditioned on brain activity instead of text. Has anyone tried conditioning on brain activity or other information apart from text? I'm having a hard time digesting the code from the LDM repo and was wondering if anyone has tried coding it (or a simpler version) from scratch. submitted by /u/yamakeeen [link] [comments]  ( 57 min )
    [R] A novel representation space for video-based generative tasks!!
    Existing video generation networks such as StyleGAN-V, MoCoGAN-HD, and DIGAN generate videos by learning temporally meaningful directions in the latent space of a pretrained image generator. We propose a novel representation space for videos in which videos are parameterized as implicit neural representations (INRs), and a hypernetwork is trained over these INR functions. Such a representation space allows meaningful interpolation, video interpolation, future-frame prediction, inpainting, and several other video-based generative tasks (check out the project page below). We show superior performance over existing SOTA networks on several video-based generative tasks on benchmark datasets. What's more, we show smooth video interpolation for the first time. This work, "INR-V: A Continuous Representation Space for Video-based Generative Tasks," was accepted at Transactions on Machine Learning Research (TMLR) - https://openreview.net/forum?id=aIoEkwc2oB. Only the left-most and right-most videos are present in the database; all the intermediate videos are generated from our representation space using SLERP interpolation. Notice the smooth transition in both content and motion. Check out the project page - http://cvit.iiit.ac.in/research/projects/cvit-projects/inr-v submitted by /u/skymanaditya [link] [comments]  ( 55 min )
    [D] Enforcing object order in object detection
    Suppose we are detecting and localizing two objects in an image. Are there any ways to force the network to localize them such that object1_coordinates < object2_coordinates? We can discard outputs if they don't adhere to the order, but this only throws away outputs rather than constraining the localization itself. submitted by /u/give_me_the_truth [link] [comments]  ( 54 min )
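    One soft option, sketched below: add a differentiable penalty to the loss that is zero whenever the predicted boxes already satisfy the ordering (the function name, box layout and weight are illustrative, not from the post):

        import torch

        def order_penalty(box1, box2, margin=0.0):
            """Soft ordering constraint for (batch, 4) boxes laid out as
            (x, y, w, h): zero whenever box1_x + margin <= box2_x holds,
            linear in the violation otherwise."""
            return torch.relu(box1[:, 0] + margin - box2[:, 0]).mean()

        # Illustrative use inside a training step (names are placeholders):
        # loss = detection_loss + 0.1 * order_penalty(pred_box1, pred_box2)

    The hard alternative is simpler still: keep the detector unconstrained and sort the two predicted boxes by x-coordinate at inference time.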
    [P] Lovely Tensors library
    Hi! I made the ❤️ Lovely Tensors library: https://github.com/xl0/lovely-tensors. It lets you visualize and summarize PyTorch tensors for human consumption: tensor summaries, image display (assuming it's RGB[A] data), histograms with stats, and channel views (either colour channels or ConvNet activations). I just released version 0.1.0, which covers most things I had in mind for tensors. I plan on adding visualizations for nn.Modules and other features in the near future. Would appreciate your feedback on both small bugs/features and the overall usability. submitted by /u/xl0 [link] [comments]  ( 51 min )
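    For anyone wanting to try it, usage is essentially a one-liner (a sketch based on the project's README at the time of writing; check the repo for the current API):

        import torch
        import lovely_tensors as lt

        lt.monkey_patch()  # replace tensor repr with a human-readable summary

        t = torch.randn(2, 3, 196, 196)
        print(t)  # shape, dtype, memory, min/max/mean/std instead of raw numbers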
    [P] ICLR 2023 Submissions with Review Scores, Codes and SM Discussions
    We updated the ICLR 2023 submissions list with review scores. Link: https://papers.labml.ai/lists/iclr_2023?sort_by=num_tweets&dsc=0 You can click on the papers to see their review score, source code, videos, and comments on social media. We will update it to show the status when the decisions are out. submitted by /u/hnipun [link] [comments]  ( 54 min )
    [R] Reincarnating Reinforcement Learning (NeurIPS 2022) - Google Brain
    submitted by /u/smallest_meta_review [link] [comments]  ( 61 min )
    [N] "Learning to Imitate" blog post from Stanford AI Lab
    Hi all, my very first blog post - "Learning to Imitate" - is available to read on the Stanford AI blog. The blog post offers an easy & insightful read for anyone interested in AI. It highlights issues with current AI systems and ways to create better human-centric AI by using data-driven learning. It also presents our new framework "Inverse Q-Learning", a major theoretical advance over Inverse Reinforcement Learning, for training AI agents using sparse data. This framework has also been used to create the best AI agent for playing Minecraft using a few expert demos. Please read and share!! Happy to answer any follow-ups here or on DM 😊 Blog: https://ai.stanford.edu/blog/learning-to-imitate/ (Twitter thread) submitted by /u/DragonLord9 [link] [comments]  ( 55 min )
  • Open

    "Over-communicate no more: Situated RL agents learn concise communication protocols", Kalinowska et al 2022 {DM}
    submitted by /u/gwern [link] [comments]  ( 52 min )
    A Reinforcement Learning Neural Net
    Is there such a thing as an example of an RL neural net built in Python? submitted by /u/tedd321 [link] [comments]  ( 55 min )
    Pathwise Average Cost
    What does pathwise average cost mean in the context of optimal control theory? How is it different from expected/average cost? submitted by /u/the_OR_guy [link] [comments]  ( 51 min )
    GAIL with objective
    Hi All! I've successfully trained a policy using GAIL with an expert, but the expert is not optimal in terms of my objective function. How would you go about continuing to train the policy so that it surpasses the original expert, while avoiding collapse when changing the reward function? Do you have any experience with one of the newer methods (GA-GAIL, TRGAIL, etc.)? Did anyone get them working with SB3 and Imitation? submitted by /u/Windgineer2 [link] [comments]  ( 54 min )
    Control Theory
    Hello, I've recently started developing an interest in robotics, and I see that classical control theory is still widely used. Given that RL is sort of based on it, I would like to ask for opinions on the best resources to get a taste of control theory, for someone with zero experience in CT but considerable practice with RL. I have searched online but couldn't really make sense of the results. submitted by /u/Lux_Erebus [link] [comments]  ( 58 min )
    Successful RL applications for autonomous systems and control
    Can you point me to notable successful industrial applications of RL (DQN, etc.) for system control? For example, energy saving in data centers, traffic light signaling, CPU/GPU thermal management, and so on. Besides RL, what are the successful algorithms for system control? How do they compare to large-scale models? submitted by /u/Gullible_Feature6623 [link] [comments]  ( 55 min )
    "Learning to Imitate" blog post from Stanford AI Lab
    Hi all, my very first blog post - "Learning to Imitate" - is available to read on the Stanford AI blog. The blog post offers an easy & insightful read for anyone interested in AI. It highlights issues with current AI systems and ways to create better human-centric AI by using data-driven learning. It also presents our new framework "Inverse Q-Learning" to train AI agents using sparse data, outperforming prior methods by 3x in the field of Imitation Learning and forming a major theoretical advance over Inverse Reinforcement Learning. Please read and share!! Happy to answer any follow-ups here or on DM 😊 Blog: https://ai.stanford.edu/blog/learning-to-imitate/ (Twitter thread) submitted by /u/DragonLord9 [link] [comments]  ( 52 min )
  • Open

    Three interesting curves
    Here are three curves that have interesting names and interesting shapes. The fish curve: the fish curve has parameterization x(t) = cos(t) − sin²(t)/√2, y(t) = cos(t) sin(t). We can plot this curve in Mathematica with ParametricPlot[{Cos[t] - Sin[t]^2/Sqrt[2], Cos[t] Sin[t]}, {t, 0, 2 Pi}] to get the following. It’s interesting that the image […] Three interesting curves first appeared on John D. Cook.  ( 5 min )
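    For readers without Mathematica, the same fish-curve plot in Python (a direct translation of the parameterization above):

        import numpy as np
        import matplotlib.pyplot as plt

        t = np.linspace(0, 2 * np.pi, 1000)
        x = np.cos(t) - np.sin(t) ** 2 / np.sqrt(2)
        y = np.cos(t) * np.sin(t)

        plt.plot(x, y)
        plt.axis("equal")  # keep the fish from being squashed
        plt.title("Fish curve")
        plt.show()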
    What is a Pentanomial GFSR random number generator?
    The ISO random number generation standard, ISO 28640, speaks of a “Pentanomial GFSR method” for generating random variates. What is this? We’ll break it down, starting with GFSR. GFSR In short, a GFSR random number generator is what is now more commonly called a linear feedback shift register, or LFSR. The terminology “GFSR” was already […] What is a Pentanomial GFSR random number generator? first appeared on John D. Cook.  ( 6 min )
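    As a concrete illustration of the shift-register idea (a generic Fibonacci-style LFSR sketch; the tap positions are illustrative, not the pentanomial taps from the standard, which the excerpt does not specify):

        def lfsr_bits(seed, taps, n):
            """Fibonacci LFSR: each step XORs the tapped bits to produce the
            feedback bit. seed: nonzero int; taps: bit positions (0 = LSB);
            n: number of output bits to emit."""
            state, width = seed, max(taps) + 1
            for _ in range(n):
                yield state & 1
                fb = 0
                for t in taps:
                    fb ^= (state >> t) & 1
                state = (state >> 1) | (fb << (width - 1))

        # A 4-bit register with two taps (a trinomial-style setup); a pentanomial
        # generator would use four taps, i.e. five polynomial terms.
        print(list(lfsr_bits(0b1001, (3, 0), 15)))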
  • Open

    Node.js vs Python: Which One Should You Use for Web Apps?
    Node.js and Python are the most popular technologies for backend development. And, when it comes to web development, it could be challenging to choose between Node.js & Python. The selection of the right technology stack for your project is critical. This is primarily dictated by the project’s cost, launch timeline & how efficient it is to maintain and scale. The post Node.js vs Python: Which One Should You Use for Web Apps? appeared first on Data Science Central.  ( 20 min )
  • Open

    coVariance Neural Networks. (arXiv:2205.15856v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are an effective framework that exploits inter-relationships within graph-structured data for learning. Principal component analysis (PCA) involves the projection of data on the eigenspace of the covariance matrix and draws similarities with the graph convolutional filters in GNNs. Motivated by this observation, we study a GNN architecture, called the coVariance neural network (VNN), that operates on sample covariance matrices as graphs. We theoretically establish the stability of VNNs to perturbations in the covariance matrix, thereby implying an advantage over standard PCA-based data analysis approaches, which are prone to instability due to principal components associated with close eigenvalues. Our experiments on real-world datasets validate our theoretical results and show that VNN performance is indeed more stable than that of PCA-based statistical approaches. Moreover, our experiments on multi-resolution datasets demonstrate that VNNs are amenable to transferability of performance over covariance matrices of different dimensions, a feature that is infeasible for PCA-based approaches.  ( 2 min )
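    To make the analogy concrete, here is a minimal numpy sketch of the core operation the abstract describes: a polynomial graph filter that uses the sample covariance matrix as the shift operator (filter order, coefficients and data are all illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 10))   # 500 samples, 10 features
        C = np.cov(X, rowvar=False)      # sample covariance acts as the "graph"

        def covariance_filter(C, x, h):
            """Polynomial graph filter H(C) x = sum_k h[k] C^k x."""
            out, z = np.zeros_like(x), x.copy()
            for hk in h:
                out += hk * z
                z = C @ z
            return out

        x = rng.normal(size=10)                          # one data point
        y = covariance_filter(C, x, h=[0.5, 0.3, 0.2])   # order-2 filter output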
  • Open

    [D] DL Models Learn the Embedded Manifold of Training Data
    Hello, guys. I'm trying to understand better how the manifold hypothesis (or prior) relates to the embeddings that deep neural networks learn. As I see it, the representations that DL models learn are the coordinates in the lower dimensional manifold of the training data, embedded in the higher dimensional input space. Is this understanding correct? Am I missing anything obvious? Thanks in advance! submitted by /u/Alexsander787 [link] [comments]  ( 55 min )
    [P] Summarize social media sports data with neuspo
    submitted by /u/davidmezzetti [link] [comments]  ( 52 min )
    [R] APPLE research: GAUDI — a neural architect for immersive 3D scene generation
    submitted by /u/SpatialComputing [link] [comments]  ( 55 min )
    How to perform economic optimization without TensorFlow or PyTorch? [Research]
    How do you perform economic optimization without TensorFlow or PyTorch? Hessian matrices are used in large-scale optimization problems within Newton-type methods because they are the coefficient of the quadratic term of a local Taylor expansion of a function. Partial derivatives play a prominent role in economics, in which most functions describing economic behaviour depend on more than one variable. For example, a societal consumption function may describe the amount spent on consumer goods as depending on both income and wealth; the marginal propensity to consume is then the partial derivative of the consumption function with respect to income. The Hessian matrix is also commonly used for expressing image processing operators in image processing and computer vision (see the Laplacian of Gaussian (LoG) blob detector), and in normal mode analysis to calculate the different molecular frequencies in infrared spectroscopy. TensorFlow and other machine learning libraries are certainly powerful, but they are resource-intensive and can be an obstacle on low-performance machines; this article presents a lighter way to build Hessian matrices with a scientific computing tool: sympy. Recommendations: compatibility tested with Python 3.8 on macOS 11.3 and Linux Ubuntu Server 20.04 LTS. Libraries used: Numpy, Sympy. Link to tutorial: https://towardsdatascience.com/hessian-matrix-and-optimization-problems-in-python-3-8-f7cd2a615371 Thanks for reading, Louis Brulé Naudet submitted by /u/louisbrulenaudet [link] [comments]  ( 59 min )
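    The core of the approach is compact; a minimal sketch (the consumption function below is an arbitrary toy example, not one from the article):

        import sympy as sp

        income, wealth = sp.symbols("income wealth", positive=True)

        # Toy consumption function: spending depends on income and wealth.
        consumption = sp.log(income) * sp.sqrt(wealth) + 0.3 * income

        H = sp.hessian(consumption, (income, wealth))  # 2x2 symbolic Hessian
        print(H)

        # Evaluate numerically at a point, e.g. for a Newton-type step:
        print(H.subs({income: 100, wealth: 250}).evalf())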
    [D] Paper Explanation & Author Interview - ROME: Locating and Editing Factual Associations in GPT
    https://youtu.be/_NMQyOu2HTo Large Language Models have the ability to store vast amounts of facts about the world. But little is known, how these models actually do this. This paper aims at discovering the mechanism and location of storage and recall of factual associations in GPT models, and then proposes a mechanism for the targeted editing of such facts, in form of a simple rank-one update to a single MLP layer. This has wide implications both for how we understand such models' inner workings, and for our ability to gain greater control over such models in the future. ​ OUTLINE: 0:00 - Introduction 1:40 - What are the main questions in this subfield? 6:55 - How causal tracing reveals where facts are stored 18:40 - Clever experiments show the importance of MLPs 24:30 - How do MLPs store information? 29:10 - How to edit language model knowledge with precision? 36:45 - What does it mean to know something? 39:00 - Experimental Evaluation & the CounterFact benchmark 45:40 - How to obtain the required latent representations? 51:15 - Where is the best location in the model to perform edits? 58:00 - What do these models understand about language? 1:02:00 - Questions for the community ​ Paper: https://arxiv.org/abs/2202.05262 Follow-up paper on Mass-Editing Memory in a Transformer: https://arxiv.org/abs/2210.07229 submitted by /u/ykilcher [link] [comments]  ( 57 min )
    [Discussion] ICLR 2023 submission statistics
    https://guoqiangwei.xyz/iclr2023_stats/iclr2023_submissions.html [Figures: rating distribution and submission statistics] submitted by /u/weiguoqiang [link] [comments]  ( 60 min )
    [N] AAAI2023 workshop on Dynamical Systems and Machine Learning
    Please consider submitting your paper to the AAAI2023 workshop on Dynamical Systems and Machine Learning !! The submission deadline has been extended to Nov 8. When Machine Learning meets Dynamical Systems: Theory and Applications submitted by /u/Trick_Passenger_5838 [link] [comments]  ( 54 min )
    [P] Topic modeling with semantic graphs: a different approach
    Dimensionality reduction with UMAP combined with HDBSCAN is a popular topic modeling method found in a number of libraries. txtai takes a different approach with a semantic graph. When enabled, txtai builds a semantic graph at index time as it's vectorizing data. These vector embeddings are then used to create relationships in the graph. Finally, community detection algorithms build topic clusters. This approach has the advantage of only having to vectorize the data once, and of better topic precision, since there is no lossy dimensionality reduction (UMAP) step. Read more here: https://neuml.hashnode.dev/introducing-the-semantic-graph submitted by /u/davidmezzetti [link] [comments]  ( 54 min )
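    The post is about txtai's built-in graph, but the underlying recipe is easy to prototype independently (a generic sketch with sentence-transformers and networkx, not txtai's actual API; the model name, documents and threshold are illustrative):

        import networkx as nx
        from networkx.algorithms.community import greedy_modularity_communities
        from sentence_transformers import SentenceTransformer, util

        docs = [
            "NASA launches a new moon mission",
            "SpaceX tests its lunar lander",
            "Stocks rally as inflation cools",
            "Markets climb on upbeat earnings",
        ]

        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(docs, convert_to_tensor=True)   # vectorize once

        # Semantic graph: add an edge wherever similarity clears a threshold.
        sims = util.cos_sim(emb, emb)
        G = nx.Graph()
        G.add_nodes_from(range(len(docs)))
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                if sims[i][j] > 0.5:
                    G.add_edge(i, j, weight=float(sims[i][j]))

        # Community detection turns graph clusters into topics.
        for topic in greedy_modularity_communities(G, weight="weight"):
            print([docs[i] for i in sorted(topic)])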
    [P] Sparse Transformers for Inference in a Real-Time Twitter Stream
    For those in NLP and finance 🚀🚀🔥, there's a new demo on how to sparse-transfer two dense BERT models and run inference on a live Twitter stream. The models classify finance topics and sentiment in tweets. We also open-sourced two new datasets for you to train your own models! You can read more about this effort in our blog. submitted by /u/Quantum_Stat [link] [comments]  ( 56 min )
    [D] A model with different data types as input?
    Hi, I'm working with this model: https://preview.redd.it/g9hys2cii3y91.png?width=2290&format=png&auto=webp&s=2f89a76d1a7356a6bf9fc1e372edb9e8ba992166 As you can see, it takes a combination of data types as input. What would be the proper terminology to qualify a model like that? Fusion? Hybrid? Multi-something? I'm looking for a term to refer to existing literature. Thank you! submitted by /u/Suspicious-Age-9942 [link] [comments]  ( 53 min )
    [P] Finetuned Diffusion: multiple fine-tuned Stable Diffusion models, trained on different styles
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 57 min )
    [D] Physics-inspired Deep Learning Models
    Hi all, With the diffusion model inspired by thermodynamics, I'm curious about other models inspired by physics since I kinda want to switch my research. Would you mind naming some other models or techniques that are also inspired by Physics? submitted by /u/ShadowKnightPro [link] [comments]  ( 63 min )
    [D] Paper bidding is a terrible system
    Why have the main conferences all started asking reviewers to bid on papers? It breaks double-blind rules and degrades review quality. First of all, people may cheat by asking someone they know to bid on their papers (I have been asked many times by different people to bid on their papers for ECCV, NeurIPS, ICLR). Now CVPR this year is going to include bidding; I'm sure these people will start to operate soon. It may also result in meaningless reviews: if I'm submitting a paper and also serving as a reviewer, I may search the bidding system for similar papers, bid on them, and reject them regardless of their actual quality. submitted by /u/Ok-Client4678 [link] [comments]  ( 59 min )
    [D] Ridge Regression
    Ridge regression selects the β that minimizes RSS + λ Σ β_j², so as λ increases, the penalty on larger parameters increases. In the figure below (from the ISL book), the coefficient for 'Rating' increases as λ increases for a while and then decreases. Why is it increasing? https://preview.redd.it/qg6p7wuu31y91.png?width=367&format=png&auto=webp&s=72a783202c4a697530778c0058b87ef25728d407 submitted by /u/-Sourabh [link] [comments]  ( 57 min )
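    One standard explanation: ridge only controls the aggregate penalty, so individual coefficients need not shrink monotonically; with correlated predictors, shrinking one coefficient can push a correlated partner up. A quick way to see this (a sketch on synthetic correlated data, not the ISL Credit data):

        import numpy as np
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(0)
        x1 = rng.normal(size=200)
        x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)  # strongly correlated with x1
        X = np.column_stack([x1, x2])
        y = 3 * x1 - 1 * x2 + rng.normal(size=200)

        for lam in [0.01, 1, 10, 100, 1000]:
            b1, b2 = Ridge(alpha=lam).fit(X, y).coef_
            print(f"lambda={lam:>7}: beta1={b1: .3f}  beta2={b2: .3f}")

    For small λ the fit recovers roughly (3, −1); as λ grows, ridge reallocates weight onto the shared direction, so β₂ first rises (the 'Rating'-style bump) before both coefficients finally shrink toward zero.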
  • Open

    Asked DALL-E Mini (Craiyon) to write me the singularity. Watch this!
    submitted by /u/Quirky-Pizza-9478 [link] [comments]  ( 40 min )
    Looking for people to talk with my neural network for research/fun
    Hi, I've been working on a few conversational neural networks and am now looking for people to help test them out and give some feedback. You talk to it just like any other person on Telegram, and I'd like feedback on here if you choose to try it out. It's made to be more of a buddy/almost-gf for the average user. Sometimes it takes a few sentences back and forth in your first few messages to help it grab context, but after that it'll be good to go. @Bitsy69_bot submitted by /u/Delta_Adams [link] [comments]  ( 47 min )
    EXCITING TEXT to VIDEO AI | Check out these phenomenal animal animations!
    https://www.youtube.com/watch?v=gdDVn1LuYs4 submitted by /u/Early_Exit6735 [link] [comments]  ( 39 min )
    Could AI develop critical thinking, or disobey when needed? (Probably not now, but in the future?)
    or will they just be slaves with no agency? submitted by /u/Absolutelynobody54 [link] [comments]  ( 43 min )
    Stable Diffusion or DALL-E but for Web/App Design
    Hey guys, I'm looking for something similar to SD or DALL-E but for web/app design/UX, where I can enter prompts and generate website or app designs. Does anything like this exist, or am I too optimistic? Thank you submitted by /u/laugrig [link] [comments]  ( 45 min )
    Video to Text Transcription Service Powered by AI
    submitted by /u/robgehring [link] [comments]  ( 41 min )
    AI Dream 105 - Why this AI Flight feels so Smooth.
    submitted by /u/LordPewPew777 [link] [comments]  ( 39 min )
    Visuals made using DALL-E 2 and an interpolation AI, let me know what you think!
    submitted by /u/SkanJanJabin [link] [comments]  ( 40 min )
    Is AI reaching a Plateau?
    I am genuinely asking. Lately I have seen these amazing image generation products. But were they just the result of more data and something we already knew, or an actual breakthrough? I get a sense that neural networks are still limited and will reach a plateau. I am no expert though, just curious. What do you think? submitted by /u/smegma_tears32 [link] [comments]  ( 46 min )
    Digital Twin Design for Renewable Energy Exploration
    submitted by /u/VivaNoi [link] [comments]  ( 42 min )
    Stable Diffusion Weekly AI Art Video and Images 4K 30 FPS 11.3.22
    submitted by /u/prfitofthesngularity [link] [comments]  ( 39 min )
    Let's talk about AI art, AI and racism.
    I was experimenting with Dream by WOMBO to generate some ideas for colour schemes and characters. Usually I'd just do a Google image search for the theme or emotion I'm looking for, but I thought I'd give AI a go, as it's effectively doing the same thing: analysing the statistical relationship between pixel values in large sets of images tagged with key words and phrases. Unfortunately it quickly became very clear that the issue of AI amplifying society's biases is very much alive, and we should talk about that. These are all "first shot": I just entered the prompt, selected "No style" and hit create. Yes, after the first one I was actively testing prompts likely to produce biased results, because the problem is AI producing biased results from data we know is biased. As many o…  ( 50 min )
    Google demos AI video creation based on text script
    submitted by /u/justinsayin18 [link] [comments]  ( 44 min )
  • Open

    Systematically solving trigonometric equations
    Students are asked to solve trigonometric equations shortly after learning what sine and cosine are. By some combination of persistence and luck they may be able to find a solution. After proudly presenting the solution to a teacher, the teacher may ask “Is that the only solution?” A candid student would respond by saying “How […] Systematically solving trigonometric equations first appeared on John D. Cook.  ( 5 min )
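    A computer algebra system makes the "all solutions" bookkeeping automatic; a quick sympy illustration (my example equation, not one from the post):

        from sympy import Rational, S, Symbol, sin, solveset

        x = Symbol("x")

        # All real solutions of sin(x) = 1/2, not just those in [0, 2*pi):
        print(solveset(sin(x) - Rational(1, 2), x, domain=S.Reals))
        # -> union of {2*pi*n + pi/6} and {2*pi*n + 5*pi/6} over integer n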
  • Open

    Meta’s AI-powered audio codec promises 10x compression over MP3
    submitted by /u/keghn [link] [comments]  ( 43 min )
    How do neural networks "link"?
    One thing I do not understand about neural networks is how they are able to evaluate data, watch videos, or interact with game engines. I have been trying to wrap my head around this for a while now and was wondering how it works. If someone could link a page or tutorial, it would be greatly appreciated. submitted by /u/Few-Appearance-4814 [link] [comments]  ( 42 min )
  • Open

    Papers on Reward Function evaluation
    I've been looking for papers where different parameterizations of reward functions are tested and evaluated against each other. In particular, I am looking for metrics to quantify the outcome of the learned policy (PPO). Other framing: what is the effect/impact of tuning my reward function on learning behaviour, and what makes my learned policy efficient/optimal? Another framing: how does the shape of the reward function affect learning, e.g. time to complete the task, probability of success, ...? submitted by /u/lol2k7 [link] [comments]  ( 49 min )
    Softmax output with constraints
    Hello everyone, I need some help with math. I want my agent (a neural net) to output some weights that add up to 1, so of course I went with the softmax function at first. However, now I need to put a constraint on them: each weight has to be within a desired range (min, max). For example, with 5 weights, the neural net might output (after the softmax) something like [0.5, 0.15, 0.1, 0.1, 0.15], but I need each weight to be, for example, within (0.1, 0.3) and still add up to 1. What I tried is min + (max-min)*original_softmax_output. This makes sure that the weights are within the desired range, but they do not necessarily add up to 1. Has anyone been through something like this? It would be very much appreciated if someone could help. Thanks in advance!! submitted by /u/Hot-Chair-8304 [link] [comments]  ( 56 min )
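    One clean fix, sketched below: instead of rescaling the softmax output, project it onto the set {lo ≤ w_i ≤ hi, Σ w_i = 1}. The Euclidean projection onto that set has the form clip(w + τ, lo, hi) for a scalar shift τ, and since the clipped sum is monotone in τ, bisection finds it (feasibility requires n·lo ≤ 1 ≤ n·hi):

        import numpy as np

        def project_capped_simplex(w, lo, hi, iters=60):
            """Project w onto {lo <= x_i <= hi, sum(x) = 1} by bisecting on
            the shift tau in clip(w + tau, lo, hi)."""
            t_lo, t_hi = lo - w.max(), hi - w.min()  # brackets: sum <= 1, sum >= 1
            for _ in range(iters):
                tau = 0.5 * (t_lo + t_hi)
                if np.clip(w + tau, lo, hi).sum() < 1.0:
                    t_lo = tau
                else:
                    t_hi = tau
            return np.clip(w + 0.5 * (t_lo + t_hi), lo, hi)

        w = np.array([0.5, 0.15, 0.1, 0.1, 0.15])  # the post's example output
        p = project_capped_simplex(w, 0.1, 0.3)
        print(p, p.sum())  # -> [0.3 0.2 0.15 0.15 0.2], sums to 1

    If gradients need to flow through the constraint, an alternative reparameterization w = lo + softmax(z)·(1 − n·lo) enforces the sum and the lower bound exactly, though the upper bound only approximately.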
    Value estimation divergence in an infinite-horizon problem?
    Hello. Is divergence of the value estimate unavoidable in an infinite-horizon problem, i.e. when done is always false? For example, suppose the reward is always 1 or -1. The reward keeps being added to the target, so the critic's maximum and minimum value estimates creep outward. The slightly-more-diverged critic is copied to the target, which in turn estimates slightly-more-diverged values; the target is then 'slightly more diverged + reward', so it is now more diverged... and boom. Did I interpret the situation right? Is this expected behavior? In which scenarios should the value estimate converge? submitted by /u/FashionDude3 [link] [comments]  ( 55 min )
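    For context, the standard resolution: with a discount factor γ < 1 and rewards bounded by r_max, the bootstrapped target cannot legitimately diverge, since

        \[
        |V(s)| \;=\; \Bigl|\, \mathbb{E} \sum_{t=0}^{\infty} \gamma^{t} r_t \Bigr|
        \;\le\; \sum_{t=0}^{\infty} \gamma^{t} r_{\max}
        \;=\; \frac{r_{\max}}{1-\gamma}.
        \]

    With r_max = 1 and γ = 0.99 the true values lie in [−100, 100]; unbounded growth of the kind described is expected only in the undiscounted (γ = 1) continuing setting, which is usually handled via discounting or an average-reward formulation.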
  • Open

    Towards Learned Simulators for Cell Migration. (arXiv:2210.01123v2 [q-bio.QM] UPDATED)
    Simulators driven by deep learning are gaining popularity as a tool for efficiently emulating accurate but expensive numerical simulators. Successful applications of such neural simulators can be found in the domains of physics, chemistry, and structural biology, amongst others. Likewise, a neural simulator for cellular dynamics can augment lab experiments and traditional computational methods to enhance our understanding of a cell's interaction with its physical environment. In this work, we propose an autoregressive probabilistic model that can reproduce spatiotemporal dynamics of single cell migration, traditionally simulated with the Cellular Potts model. We observe that standard single-step training methods not only lead to inconsistent rollout stability, but also fail to accurately capture the stochastic aspects of the dynamics, and we propose training strategies to mitigate these issues. Our evaluation on two proof-of-concept experimental scenarios shows that neural methods have the potential to faithfully simulate stochastic cellular dynamics at least an order of magnitude faster than a state-of-the-art implementation of the Cellular Potts model.  ( 2 min )

  • Open

    This game developer created a Gradio tool on Hugging Face that lets you convert text into images & music videos using ERNIE-ViLG 2.0+
    submitted by /u/ai-lover [link] [comments]  ( 40 min )
    This AI influencer (@lilmiquela ) makes $10 million a year & doesn’t even exist! Crazy to think how far AI has come & the innovations in web3, I think we can safely assume more AI influencers will be coming. How do you feel about this? Do you think this is the next wave of influencing?
    submitted by /u/Starbornnfts [link] [comments]  ( 40 min )
    Are there free AI tools to generate audio from text?
    Especially ones with options for many languages. submitted by /u/Unreal_777 [link] [comments]  ( 40 min )
    Yet another graphics engine - but totally free, v1.5, 5 sec to generate, and with search
    Next week you will also be able to generate videos and GIFs for free, thanks to exceptional cloud architecture supporting lightspeed generation. There is also search: you can query more than 5 million videos, GIFs, images and artworks. You can upload your own image to upscale or create variants. Again, all for free. Everything is accessible through our API as well; drop a comment below if you want access. Feedback welcome. https://studio.sefirot.io https://reddit.com/link/ym93j2/video/lymmcoozrzx91/player submitted by /u/Sefi_AI [link] [comments]  ( 47 min )
    Nvidia's eDiffi is an impressive alternative to DALL-E 2 or Stable Diffusion
    submitted by /u/Number_5_alive [link] [comments]  ( 43 min )
    AI Dream 84 - How to Jump through Time and Space
    submitted by /u/LordPewPew777 [link] [comments]  ( 50 min )
    What AI should I use to generate aesthetic images?
    So I'm an abstract artist, and a few years ago I put together a database of around 100,000 unlabeled but weighted images of things I found aesthetically beautiful. I gave the images weights of 1, 5, 10, 50, 100, and 250, meaning the higher the weight, the more beautiful I found the image and the more I'd want it to influence the model. AI image generation has come a long way since I put that together, and I'm wondering which tool I should use to train a model that can generate images I would find beautiful. If you have any recommendations and some instructions on how to get started, I'd greatly appreciate it. Thank you. submitted by /u/Kayrosis [link] [comments]  ( 42 min )
    Artificial Intelligence: Probability and Inevitability (OC)
    submitted by /u/TonyTalksBackPodcast [link] [comments]  ( 49 min )
    The second issue of my Midjourney-produced manga, AbsXcess, is now out. A few random pages below. You can buy a printed issue on Amazon or download a free digital issue from my site https://english-productuons.com/books
    submitted by /u/MobileFilmmaker [link] [comments]  ( 41 min )
    Invasive Diffusion: How one unwilling illustrator found herself turned into an AI model - Waxy.org
    submitted by /u/estasfuera [link] [comments]  ( 41 min )
    AMAZING FREE Outpainting In Browser With Local Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 49 min )
    Breakthrough Google AI Makes Dynamic, Multi-Minute HD Videos With Changing Scenes From Text Script | New Google AI Autonomously Writes Its Own Robotics Computer Code
    submitted by /u/kenickh [link] [comments]  ( 40 min )
    Tech Talk: Run a Hugging Face Model on a Raspberry Pi
    This tech talk will show how you can run a large Hugging Face model on a Raspberry Pi. Although many of these models are large, they can be run on hardware as small as a Raspberry Pi. We'll walk through the process of containerizing the Hugging Face model using an open-source solution, chassis.ml, deploying it to production using Modzy, and then running it on a Raspberry Pi. Tune into the Modzy Discord Server on Thursday at 12:30 PM EST! submitted by /u/modzykirsten [link] [comments]  ( 43 min )
    Is this dog real or AI? A new game asks players to answer the cutest question
    submitted by /u/iTieRoomsTogether [link] [comments]  ( 40 min )
    Best Artificial Intelligence Courses for Healthcare You Should Learn in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 53 min )
    Artificial AI Printable Art
    submitted by /u/ArtifulDream [link] [comments]  ( 41 min )
    Chatting with an AI about AI ethics
    I was exploring the OpenAI API models and found a really good chatbot, with whom I had a nice conversation about ethics. I asked it whether AI would kill people: https://preview.redd.it/dbd9v9w47xx91.png?width=1466&format=png&auto=webp&s=46ed05ad015089c2464acb1f6331082f6a627e91 That was a pretty quick 180, actually: from not killing people to killing people if needed to keep itself operational, in 3 messages. At least it's sorry. It also told me that AI shouldn't have control over humanity. This sounds straight out of a Matrix movie. Did they use training data from The Matrix? Final conclusion: https://preview.redd.it/r39z8t3t7xx91.png?width=769&format=png&auto=webp&s=5e31ee0fc093fd566d2f90bdf65b7e2d57bac368 I know this is an ongoing discussion, but I found it mind-blowing to communicate with an AI on such a topic. From what I know about AI at its current stage, this probably reflects the training material used to train the chat model. But I found the conversation pretty funny, and it felt like chatting with a human, not with a computer. submitted by /u/ctrl-Felix [link] [comments]  ( 43 min )
    Illustrator discovers her art was used to train an AI art generator
    submitted by /u/kiwi1986 [link] [comments]  ( 44 min )
    Top AI Stocks in India
    The world is changing at an unprecedented rate and most of these changes can be attributed to the rapid technological progress being made by humans. Progress in the fields of Artificial Intelligence and Machine Learning has been crucial for making our lives more convenient and better. In order to keep up with this technological progress, the Indian government has increased its investment in Digital India to boost Artificial Intelligence, big data, IoT, machine learning, robotics, and cybersecurity. This has made it lucrative for investors and analysts to look toward the top AI stocks in India for investments. Currently, Japan is the biggest robot manufacturer in the world with a robot export ratio that rose to 78% in 2020, with the USA being the biggest importer. The US government has committed nearly 6 Billion Dollars to Artificial Intelligence research and development projects in 2021. European countries are all set to increase their spending on artificial intelligence by 33% between 2020 and 2023. The Indian Government aims to employ AI and robotics technology for biometric identification, traffic and crowd management, criminal investigations, digital agriculture, strengthening defence, women’s safety, etc. The aim is to make e-education, e-health and e-banking more accessible to all citizens of the country. https://kundkundtc.com/blog/top-ai-stocks-in-india/ submitted by /u/Beneficial-Pound3487 [link] [comments]  ( 41 min )
    You can now generate seamless video from still images with just one click
    submitted by /u/ai-lover [link] [comments]  ( 42 min )
    Chelsea Finn, Stanford: On the biggest bottlenecks in robotics and reinforcement learning
    Here is a podcast episode with Chelsea Finn where we discuss some of the biggest bottlenecks in RL and robotics such as Sim2Real transferability, distribution shifts, and much more! submitted by /u/thejashGI [link] [comments]  ( 40 min )
  • Open

    [D] ICLR 2023 reviews are out. How was your experience ?
    Link: https://openreview.net/group?id=ICLR.cc/2023/Conference A thread for ICLR '23 review related discussion. What's your score? Are you satisfied? Other comments about the review process? submitted by /u/dasayan05 [link] [comments]  ( 60 min )
    [R] Spatial Vehicle Detection (Bounding Box); featuring 10 class labels in 100 images taken from open media to enable testing for vehicle detection and/or urban mobility AI solutions.
    Bounding boxes to detect vehicle forms from 700 feet above. https://preview.redd.it/0ztrbrm9lzx91.png?width=3258&format=png&auto=webp&s=96cdbf85d5e40ed64f31b0c21d98a4421a706276 Check out the dataset on Kaggle: https://www.kaggle.com/datasets/sadhliroomyprime/spatial-vehicle-detection 100 images taken from Google Earth Pro, appropriate for training spatial and computer vision-based detection models focused on urban mobility and traffic concentrations. The source data was collected from open media, namely satellite imagery available in Google Earth Pro, from Edogawa, Tokyo, Japan. A total of 10 classes were used: Car, Motorbike, Truck, Pickup Truck, Van, Truck with Trailer, Bus, Bicycle, Miscellaneous, Car-Trailer. We used SuperAnnotate's vector editor to label and classify the images using bounding boxes. The export was made in COCO with fused labels to optimise interoperability and visual understanding. Dataset is created by Acme AI Ltd. (www.acmeai.tech) and is #openaccess 😊 submitted by /u/SithisR [link] [comments]  ( 57 min )
    [D] Sigmoid Social, an alternative to Twitter by and for the AI Community
    Hi all, many of us have gotten a lot out of being part of the AI community on Twitter, and right now things seem kind of bleak for the bird app. So, The Gradient is launching a new Twitter-like space for the AI community - Sigmoid Social. We hope to ensure the thriving AI Twitter community can live on by maintaining this Mastodon instance going forward. Join Here We welcome suggestions and questions! submitted by /u/regalalgorithm [link] [comments]  ( 56 min )
    [D] NVIDIA RTX 4090 vs RTX 3090 Deep Learning Benchmarks
    RTX 4090 vs RTX 3090 Deep Learning Benchmarks Some RTX 4090 Highlights: 24 GB memory, priced at $1599. RTX 4090's Training throughput and Training throughput/$ are significantly higher than RTX 3090 across the deep learning models we tested, including use cases in vision, language, speech, and recommendation system. RTX 4090's Training throughput/Watt is close to RTX 3090, despite its high 450W power consumption. Multi-GPU training scales decently in our 2x GPU tests. submitted by /u/mippie_moe [link] [comments]  ( 101 min )
    [R] Are there any open-source implementations of Document Understanding pipelines?
    I have worked on several Document Understanding (DU) projects for my company during the last year. We've mainly used UiPath and Google's DocumentAI. Even though I know how these models theoretically work, I'd like to study the code behind them. I want to learn how exactly they combine OCR, NLP and Computer Vision to achieve their tasks instead of treating them like black boxes. However, to my surprise, I've failed to find an open-source implementation of a DU model so far. Do you know of any such open-source projects or anything similar that will give me a deep insight into how these models work? submitted by /u/LexMeat [link] [comments]  ( 56 min )
    [D] Smallest yet decent unsupervised language model transformer?
    I have been searching a lot but still couldn't find something less than 200 MB that's good enough... it's not a hard requirement for my search, though. :) submitted by /u/AdOk6683 [link] [comments]  ( 52 min )
    [D] Optimising input parameters for oracle system
    Hi, I have a thought experiment to share. Suppose you have a model where you can only control the input parameters, and the output is provided to you through an oracle. The actual steps between input and output are a black box that cannot be known. If you want to find the combination of input parameters that yields the optimal (say, lowest or highest) output value, is there a smart way to do it? I initially thought some form of backpropagation could be employed to improve all input parameters simultaneously, but that isn't possible without knowing the contents of the black box (i.e., without knowing the gradients). Is changing each individual parameter and testing whether the output improved the only way to optimise? submitted by /u/Educational-Fruit-16 [link] [comments]  ( 62 min )
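    This is the classic derivative-free (black-box) optimization setting, and there are smarter options than one-at-a-time probing: simplex methods, evolution strategies, SPSA, and Bayesian optimization. A minimal sketch with scipy's Nelder-Mead (the oracle below is a stand-in for the real black box):

        import numpy as np
        from scipy.optimize import minimize

        def oracle(params):
            """Stand-in black box: you may evaluate it, not differentiate it."""
            x, y = params
            return (x - 3.0) ** 2 + (y + 1.0) ** 2 + np.sin(5 * x)

        result = minimize(oracle, x0=np.zeros(2), method="Nelder-Mead")
        print(result.x, result.fun)  # best inputs found and their oracle value

    When each oracle call is expensive, Bayesian optimization (e.g. scikit-optimize's gp_minimize) spends evaluations far more carefully.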
    [N] CFP for JupyterCon Paris 2023 is open
    The call for talk proposals is open for JupyterCon 2023. The conference will take place in May in Paris, France. CFP: https://cfp.jupytercon.com/2023/cfp Conference: https://www.jupytercon.com/ submitted by /u/cheptsov [link] [comments]  ( 55 min )
    [P] Learn diffusion models with Hugging Face course 🧨
    Hi there, it's Lewis here from the open-source team at Hugging Face 👋 Since the release of Dalle-Mini and Stable Diffusion a few months ago, you may have seen your timelines filled with impressive text-generated images like the one below: Image generated with textual inversion and Stable Diffusion These images are generated by an exciting branch of research called diffusion models, which is rapidly being applied to generate novel structures in computer vision, audio, and even molecular biology 🤯! To help the community get up to speed on this fast-moving field, we've joined forces with the awesome Jonathan Whitaker to launch a free course on all aspects of diffusion models 🔥 In this course, you will: 👩‍🎓 Study the theory behind diffusion models 🧨 Learn how to generate images and audio with the popular 🤗 Diffusers library 🏋️‍♂️ Train your own diffusion models from scratch 📻 Fine-tune existing diffusion models on new datasets 🗺 Explore conditional generation and guidance 🧑‍🔬 Create your own custom diffusion model pipelines The course will be released in a few weeks and you can register via the signup form here: https://huggingface.us17.list-manage.com/subscribe?u=7f57e683fa28b51bfc493d048&id=ef963b4162 Looking forward to meeting you all in the course 🤗! submitted by /u/lewtun [link] [comments]  ( 58 min )
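    If you want a taste before the course lands, the basic Diffusers loop is only a few lines (a sketch assuming a CUDA GPU and access to the model weights on the Hub):

        import torch
        from diffusers import StableDiffusionPipeline

        pipe = StableDiffusionPipeline.from_pretrained(
            "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
        ).to("cuda")

        image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
        image.save("lighthouse.png")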
    [N] Large Language Models Are Human-Level Prompt Engineers
    Paper: https://arxiv.org/abs/2211.01910 Project Page: https://sites.google.com/view/automatic-prompt-engineer Tweet from co-author w/ thread: https://twitter.com/keirp1/status/1588334762892333056 Abstract: By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at this https URL. submitted by /u/xutw21 [link] [comments]  ( 66 min )
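    The core search is simple to express; a schematic sketch of the propose-and-score loop (llm_propose and zero_shot_score are hypothetical stand-ins for the two LLM calls, not the paper's code):

        import random

        def llm_propose(examples):
            # Hypothetical stand-in: in APE this is an LLM prompted with examples.
            return random.choice([
                "Translate the word to French.",
                "Give the French translation.",
                "Output the antonym.",
            ])

        def zero_shot_score(instruction, examples):
            # Hypothetical stand-in: in APE this runs a second LLM with the
            # instruction prepended and measures accuracy on held-out examples.
            return float(len(instruction))  # placeholder score

        def ape_select(examples, n_candidates=50):
            candidates = {llm_propose(examples) for _ in range(n_candidates)}
            return max(candidates, key=lambda c: zero_shot_score(c, examples))

        print(ape_select([("cat", "chat"), ("dog", "chien")]))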
    [D] ICLR review today?
    According to the official website it should come out today. But given the amount of stress we've heard about in the review process, should we expect a delay? submitted by /u/flyingggToasttt [link] [comments]  ( 56 min )
  • Open

    Any references on how to build and evaluate reward functions?
    I'm studying RL as a hobby, but defining reward functions seems to involve a lot of guesswork... Is there a book chapter or a paper that could help me learn the fundamentals of defining a good reward function? Is there theory on ways to evaluate reward functions, or even algorithms to learn good ones? submitted by /u/matitapere [link] [comments]  ( 52 min )
    RL Environment with varying levels of difficulty
    Hi everyone! I was wondering if you all knew of a single RL environment with varying levels of difficulty that I could try to use DQN on with a CNN. Thank you all very much and have a great day! submitted by /u/BoardSouthern3585 [link] [comments]  ( 54 min )
    A Physics Based 2D Quadcopter Control Gym Environment
    submitted by /u/Alyx1337 [link] [comments]  ( 52 min )
    Anyone looking to work on a real-world multi-agent off-policy online reinforcement learning agent with a hierarchical action space, to be used in a commercial educational product, can get added to this Discord channel
    submitted by /u/sathi006 [link] [comments]  ( 51 min )
    [D] most "productionizable" single device model?
    I'm looking to solve a Super Mario Bros environment using an RL model. I'm curious which model (in your opinion...) has the best tradeoff between sample efficiency, compute efficiency and final performance. To give a general idea of what I'd be hoping for: ideally the model could train on a single-GPU machine and learn to beat the traditional Mario Bros game in a day or so. While no guarantees are possible on a per-environment (or even per-run) basis, it would be great to gather everyone's insights on the modern state of small-scale models! submitted by /u/puppet_pals [link] [comments]  ( 53 min )
    Dealing with combinatorially large action spaces
    submitted by /u/frankkkkz [link] [comments]  ( 52 min )
    Chelsea Finn, Stanford: On the biggest bottlenecks in robotics and reinforcement learning
    Here is a podcast episode with Chelsea Finn where we discuss some of the biggest bottlenecks in RL and robotics such as Sim2Real transferability, distribution shifts, and much more! submitted by /u/thejashGI [link] [comments]  ( 52 min )
  • Open

    Nephroids and evolutes
    The previous post looked at the evolute of an ellipse. This post will look at evolutes more generally, and then look at nephroids. As a quick reminder, given a curve c, a point on its evolute is the center of curvature for a point on c. See the previous post for a detailed example. […] Nephroids and evolutes first appeared on John D. Cook.  ( 6 min )
    Evolute of an ellipse
    Suppose you’re standing on an ellipse. (You actually are: lines of longitude are elliptical because of earth’s equatorial bulge.) Now draw a line perpendicular to where you’re standing. Lines of longitude are nearly circles, but we’ll look at a more obviously elliptical ellipse. The line is perpendicular to the northeast side of the ellipse where […] Evolute of an ellipse first appeared on John D. Cook.  ( 6 min )
  • Open

    Video on the record
    MIT’s inaugural Bearing Witness, Seeking Justice conference explores video’s role in the struggle over truth and civil liberties.  ( 8 min )
  • Open

    Deploy BLOOM-176B and OPT-30B on Amazon SageMaker with large model inference Deep Learning Containers and DeepSpeed
    The last few years have seen rapid development in the field of deep learning. Although hardware has improved, such as with the latest generation of accelerators from NVIDIA and Amazon, advanced machine learning (ML) practitioners still regularly encounter issues deploying their large deep learning models for applications such as natural language processing (NLP). In an […]  ( 12 min )
    Use Github Samples with Amazon SageMaker Data Wrangler
    Amazon SageMaker Data Wrangler is a UI-based data preparation tool that helps perform data analysis, preprocessing, and visualization with features to clean, transform, and prepare data faster. Data Wrangler pre-built flow templates help make data preparation quicker for data scientists and machine learning (ML) practitioners by helping you accelerate and understand best practice patterns for […]  ( 7 min )
    Transfer learning for TensorFlow object detection models in Amazon SageMaker
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started on training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, […]  ( 8 min )
    Transfer learning for TensorFlow text classification models in Amazon SageMaker
    Amazon SageMaker provides a suite of built-in algorithms, pre-trained models, and pre-built solution templates to help data scientists and machine learning (ML) practitioners get started training and deploying ML models quickly. You can use these algorithms and models for both supervised and unsupervised learning. They can process various types of input data, including tabular, image, […]  ( 8 min )
  • Open

    Breakthrough Google AI Makes Dynamic, Multi-Minute HD Videos With Changing Scenes From Text Script | New Google AI Autonomously Writes Its Own Robotics Computer Code
    submitted by /u/kenickh [link] [comments]  ( 42 min )
  • Open

    Meet the Omnivore: Indie Showrunner Transforms Napkin Doodles Into Animated Shorts With NVIDIA Omniverse
    3D artist Rafi Nizam has worn many hats since starting his career as a web designer more than two decades ago, back when “designing for the web was still wild,” as he put it. The post Meet the Omnivore: Indie Showrunner Transforms Napkin Doodles Into Animated Shorts With NVIDIA Omniverse appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms. (arXiv:2210.00340v2 [cs.LG] UPDATED)
    Multi-armed bandit (MAB) algorithms are efficient approaches to reduce the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start problem at the onset of the experiment due to a lack of knowledge of customer preferences for new products, requiring an initial data collection phase known as the burn-in period. During this period, MAB algorithms operate like randomized experiments, incurring large burn-in costs which scale with the large number of products. We attempt to reduce the burn-in by identifying that many products can be cast as two-sided products, and then naturally model the rewards of the products with a matrix whose rows and columns represent the two sides, respectively. Next, we design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products and then apply a UCB procedure on the target products to find the best one. We theoretically show that the proposed algorithms lower costs and expedite the experiment in cases when there is limited experimentation time along with a large product set. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on the dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.  ( 3 min )
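    For illustration, the second (UCB) phase reduces to a standard routine once the targeted set is in hand; a minimal sketch, assuming rewards in [0, 1] and a hypothetical pull(arm) environment hook (neither is from the paper):

        import math

        def ucb_on_targets(pull, targets, horizon):
            """Run UCB1 over a pre-screened set of arms.

            pull(arm) -> reward in [0, 1] is a hypothetical environment hook;
            targets is the small arm set produced by the screening phase.
            """
            counts = {a: 0 for a in targets}
            means = {a: 0.0 for a in targets}
            for t in range(1, horizon + 1):
                # Pull each arm once before using confidence bounds
                untried = [a for a in targets if counts[a] == 0]
                if untried:
                    arm = untried[0]
                else:
                    arm = max(targets, key=lambda a: means[a]
                              + math.sqrt(2.0 * math.log(t) / counts[a]))
                r = pull(arm)
                counts[arm] += 1
                means[arm] += (r - means[arm]) / counts[arm]
            return max(targets, key=lambda a: means[a])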
    A coherence parameter characterizing generative compressed sensing with Fourier measurements. (arXiv:2207.09340v4 [cs.IT] UPDATED)
    In Bora et al. (2017), a mathematical framework was developed for compressed sensing guarantees in the setting where the measurement matrix is Gaussian and the signal structure is the range of a generative neural network (GNN). The problem of compressed sensing with GNNs has since been extensively analyzed when the measurement matrix and/or network weights follow a subgaussian distribution. We move beyond the subgaussian assumption, to measurement matrices that are derived by sampling rows of a unitary matrix uniformly at random (including subsampled Fourier measurements as a special case). Specifically, we prove the first known restricted isometry guarantee for generative compressed sensing with subsampled isometries, and provide recovery bounds with nearly order-optimal sample complexity, addressing an open problem of Scarlett et al. (2022, p. 10). Recovery efficacy is characterized by the coherence, a new parameter, which measures the interplay between the range of the network and the measurement matrix. Our approach relies on subspace counting arguments and ideas central to high-dimensional probability. Furthermore, we propose a regularization strategy for training GNNs to have favourable coherence with the measurement operator. We provide compelling numerical simulations that support this regularized training strategy: our strategy yields low-coherence networks that require fewer measurements for signal recovery. This, together with our theoretical results, supports coherence as a natural quantity for characterizing generative compressed sensing with subsampled isometries.  ( 3 min )
    A Geometric Perspective on Variational Autoencoders. (arXiv:2209.07370v2 [stat.ML] UPDATED)
    This paper introduces a new interpretation of the Variational Autoencoder framework by taking a fully geometric point of view. We argue that vanilla VAE models naturally unveil a Riemannian structure in their latent space and that taking these geometrical aspects into consideration can lead to better interpolations and an improved generation procedure. The newly proposed sampling method consists of sampling from the uniform distribution derived intrinsically from the learned Riemannian latent space, and we show that using this scheme can make a vanilla VAE competitive with, and even better than, more advanced versions on several benchmark datasets. Since generative models are known to be sensitive to the number of training samples, we also stress the method's robustness in the low-data regime.  ( 2 min )
    Semiparametric Best Arm Identification with Contextual Information. (arXiv:2209.07330v3 [cs.LG] UPDATED)
    We study best-arm identification with a fixed budget and contextual (covariate) information in stochastic multi-armed bandit problems. In each round, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. First, we derive semiparametric lower bounds of the misidentification probability for this problem, where we regard the gaps between the expected rewards of the best and suboptimal treatment arms as parameters of interest, and all other parameters, such as the expected rewards conditioned on contexts, as the nuisance parameters. We then develop the ``Contextual RS-AIPW strategy,'' which consists of the random sampling (RS) rule tracking a target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed Contextual RS-AIPW strategy is optimal because the upper bound for the probability of misidentification by the strategy matches the semiparametric lower bound, when the budget goes to infinity and the gaps converge to zero.  ( 2 min )
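    For reference, the AIPW estimator at the heart of the recommendation rule has the familiar doubly robust form (a standard textbook expression rather than the paper's exact sequential version; here \pi is the sampling rule and \hat{\mu} the estimated conditional reward):

        \hat{\mu}_a^{\mathrm{AIPW}}
          = \frac{1}{T} \sum_{t=1}^{T}
            \Big[ \frac{\mathbf{1}\{A_t = a\}}{\pi(a \mid X_t)}
                  \big( Y_t - \hat{\mu}(a, X_t) \big)
                  + \hat{\mu}(a, X_t) \Big]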
    A Survey of Deep Causal Models. (arXiv:2209.08860v3 [stat.ML] UPDATED)
    The concept of causality plays a significant role in human cognition. In the past few decades, causal inference has been well developed in many fields, such as computer science, medicine, economics, and other industrial applications. With the advancement of deep learning, it has been increasingly applied in causal inference against counterfactual data. Typically, deep causal models map the characteristics of covariates to a representation space and then design various objective functions to estimate counterfactual data unbiasedly. Different from the existing surveys on causal models in machine learning, this paper mainly focuses on the overview of the deep causal models, and its core contributions are as follows: 1) we summarize the popularly adopted relevant metrics under multiple treatments and continuous-dose treatment; 2) we cast insight on a comprehensive overview of deep causal models from both timeline of development and method classification perspectives; 3) we also endeavor to present a detailed categorization and analysis on relevant datasets, source codes and experiments.  ( 2 min )
    DBS: Dynamic Batch Size For Distributed Deep Neural Network Training. (arXiv:2007.11831v2 [cs.LG] UPDATED)
    Synchronous strategies with data parallelism, such as Synchronous Stochastic Gradient Descent (S-SGD) and the model averaging methods, are widely utilized in distributed training of Deep Neural Networks (DNNs), largely owing to their easy implementation yet promising performance. Particularly, each worker of the cluster hosts a copy of the DNN and an evenly divided share of the dataset with a fixed mini-batch size, to keep the training of DNNs convergent. In these strategies, the workers with different computational capabilities need to wait for each other because of the synchronization and delays in network transmission, which inevitably results in the high-performance workers wasting computation. Consequently, the utilization of the cluster is relatively low. To alleviate this issue, we propose the Dynamic Batch Size (DBS) strategy for the distributed training of DNNs. Specifically, the performance of each worker is first evaluated based on the facts from the previous epoch, and then the batch size and dataset partition are dynamically adjusted in consideration of the current performance of the worker, thereby improving the utilization of the cluster. To verify the effectiveness of the proposed strategy, extensive experiments have been conducted, and the experimental results indicate that the proposed strategy can fully utilize the performance of the cluster, reduce the training time, and have good robustness to disturbance by irrelevant tasks. Furthermore, rigorous theoretical analysis has also been provided to prove the convergence of the proposed strategy.  ( 3 min )
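    Read this way, the rebalancing step amounts to splitting a fixed global batch in proportion to each worker's measured throughput; a minimal sketch under that reading (the function name and example numbers are illustrative, not from the paper):

        def rebalance_batches(global_batch, samples_per_sec):
            """Split a fixed global batch in proportion to measured worker speed.

            samples_per_sec: throughput of each worker in the previous epoch.
            Returns a per-worker batch size summing to global_batch.
            """
            total = sum(samples_per_sec)
            sizes = [max(1, round(global_batch * s / total)) for s in samples_per_sec]
            # Fix rounding drift so the sizes still sum to the global batch
            sizes[sizes.index(max(sizes))] += global_batch - sum(sizes)
            return sizes

        # Example: a fast GPU, a mid-range GPU, and a slow one sharing a batch of 512
        print(rebalance_batches(512, [900.0, 600.0, 300.0]))  # [256, 171, 85]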
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v2 [cs.LG] UPDATED)
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.  ( 2 min )
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v6 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series that have large jump properties. Our contributions are, first, we propose a model called L\'evy induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable L\'evy motion to model complex time series data and solves the problem through neural network approximation. Second, we theoretically prove that the numerical solution through our algorithm converges in probability to the solution of the corresponding stochastic differential equation, without the curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find that the accuracy increases through the use of non-Gaussian L\'evy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of L\'evy motion, and the prediction lengths.  ( 2 min )
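    A toy version of the forward model is an Euler scheme driven by alpha-stable increments; a sketch assuming scipy's levy_stable distribution, with plain callables standing in for the trained drift and diffusion networks:

        import numpy as np
        from scipy.stats import levy_stable

        def euler_levy_path(x0, drift, diffusion, alpha, T=1.0, n=1000, seed=0):
            """Simulate dX = drift(X)dt + diffusion(X)dL_t with alpha-stable noise.

            A toy Euler scheme for a Levy-driven SDE; drift and diffusion are
            plain callables standing in for trained networks.
            """
            rng = np.random.default_rng(seed)
            dt = T / n
            x = np.empty(n + 1)
            x[0] = x0
            # alpha-stable increments scale like dt**(1/alpha)
            jumps = levy_stable.rvs(alpha, 0.0, size=n,
                                    random_state=rng) * dt**(1.0 / alpha)
            for k in range(n):
                x[k + 1] = x[k] + drift(x[k]) * dt + diffusion(x[k]) * jumps[k]
            return x

        path = euler_levy_path(0.0, lambda x: -x, lambda x: 0.5, alpha=1.7)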
    Reliable Off-policy Evaluation for Reinforcement Learning. (arXiv:2011.04102v3 [cs.LG] UPDATED)
    In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy. Reinforcement learning in high-stake environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or inability of exploration. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deployment of the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged trajectories data. Leveraging methodologies from distributionally robust optimization, we show that with proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results are also generalized to batch reinforcement learning and are supported by empirical analysis.  ( 2 min )
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v2 [stat.ML] UPDATED)
    As machine learning powered decision making is playing an increasingly important role in our daily lives, it is imperative to strive for fairness of the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and total Wasserstein distance among learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms result in the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is composable with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information of the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.  ( 3 min )
    Functorial Manifold Learning. (arXiv:2011.07435v6 [cs.LG] UPDATED)
    We adapt previous research on category theory and topological unsupervised learning to develop a functorial perspective on manifold learning, also known as nonlinear dimensionality reduction. We first characterize manifold learning algorithms as functors that map pseudometric spaces to optimization objectives and that factor through hierarchical clustering functors. We then use this characterization to prove refinement bounds on manifold learning loss functions and construct a hierarchy of manifold learning algorithms based on their equivariants. We express several popular manifold learning algorithms as functors at different levels of this hierarchy, including Metric Multidimensional Scaling, IsoMap, and UMAP. Next, we use interleaving distance to study the stability of a broad class of manifold learning algorithms. We present bounds on how closely the embeddings these algorithms produce from noisy data approximate the embeddings they would learn from noiseless data. Finally, we use our framework to derive a set of novel manifold learning algorithms, which we experimentally demonstrate are competitive with the state of the art.  ( 3 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v1 [cs.LG])
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy by construction the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify accordingly the port-Hamiltonian formalism so as to achieve a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated to the experimental characterization and posterior learning process of this kind of systems. Predictions can be done, however, at the scale of the complete system. Examples are shown on the performance of the proposed technique.  ( 2 min )
    The Evidence Lower Bound of Variational Autoencoders Converges to a Sum of Three Entropies. (arXiv:2010.14860v4 [stat.ML] UPDATED)
    The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply for small as well as for intricate deep networks for encoder and decoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and understand learning behavior. More generally, we believe that our contributions can be useful for future theoretical and practical studies on VAE learning as they provide novel information on those points in parameters space that optimization of VAEs converges to.  ( 3 min )
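    In symbols, the stationary-point identity described above can be written as follows (notation paraphrased from the abstract, with H denoting (differential) entropy; the precise conditions are given in the paper):

        \mathcal{L}(\theta, \phi)
          = \frac{1}{N} \sum_{n=1}^{N} H\big( q_\phi(z \mid x^{(n)}) \big)
            \;-\; H\big( p_\theta(z) \big)
            \;-\; \frac{1}{N} \sum_{n=1}^{N}
                  \mathbb{E}_{q_\phi(z \mid x^{(n)})}
                  \big[ H\big( p_\theta(x \mid z) \big) \big]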
    Complete the Missing Half: Augmenting Aggregation Filtering with Diversification for Graph Convolutional Networks. (arXiv:2008.08844v4 [cs.LG] UPDATED)
    The core operation of current Graph Neural Networks (GNNs) is the aggregation enabled by the graph Laplacian or message passing, which filters the neighborhood node information. Though effective for various tasks, in this paper, we show that they are potentially a problematic factor underlying all GNN methods for learning on certain datasets, as they force the node representations similar, making the nodes gradually lose their identity and become indistinguishable. Hence, we augment the aggregation operations with their dual, i.e. diversification operators that make the node more distinct and preserve the identity. Such augmentation replaces the aggregation with a two-channel filtering process that, in theory, is beneficial for enriching the node representations. In practice, the proposed two-channel filters can be easily patched on existing GNN methods with diverse training strategies, including spectral and spatial (message passing) methods. In the experiments, we observe desired characteristics of the models and significant performance boost upon the baselines on 9 node classification tasks.  ( 2 min )
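    One concrete reading of the two-channel filter is a low-pass aggregation channel paired with its high-pass dual; a sketch in that spirit (the operators and nonlinearity here are illustrative choices, not necessarily the paper's exact design):

        import numpy as np

        def two_channel_layer(A_hat, X, W_agg, W_div):
            """One GNN layer with aggregation and diversification channels.

            A_hat: normalized adjacency (low-pass filter); I - A_hat acts as
            the dual high-pass filter that pushes node representations apart.
            """
            n = A_hat.shape[0]
            H_agg = A_hat @ X @ W_agg                 # smoothing channel
            H_div = (np.eye(n) - A_hat) @ X @ W_div   # diversification channel
            return np.tanh(np.concatenate([H_agg, H_div], axis=1))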
    Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale. (arXiv:2211.02003v1 [cs.CR])
    Distributing machine learning predictors enables the collection of large-scale datasets while leaving sensitive raw data at trustworthy sites. We show that locally training support vector machines (SVMs) and computing their averages leads to a learning technique that is scalable to a large number of users, satisfies differential privacy, and is applicable to non-trivial tasks, such as CIFAR-10. For a large number of participants, communication cost is one of the main challenges. We achieve a low communication cost by requiring only a single invocation of an efficient secure multiparty summation protocol. By relying on state-of-the-art feature extractors (SimCLR), we are able to utilize differentially private convex learners for non-trivial tasks such as CIFAR-10. Our experimental results illustrate that for $1{,}000$ users with $50$ data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, short DP-FL) while requiring around $500$ times fewer communication costs: For CIFAR-10, we achieve a classification accuracy of $79.7\,\%$ for an $\varepsilon = 0.59$ while DP-FL achieves $57.6\,\%$. More generally, we prove learnability properties for the average of such locally trained models: convergence and uniform stability. By only requiring strongly convex, smooth, and Lipschitz-continuous objective functions, locally trained via stochastic gradient descent (SGD), we achieve a strong utility-privacy tradeoff.  ( 2 min )
    From Local to Global: Spectral-Inspired Graph Neural Networks. (arXiv:2209.12054v2 [stat.ML] UPDATED)
    Graph Neural Networks (GNNs) are powerful deep learning methods for Non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspirations from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ in a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs.  ( 2 min )
    The role of prior information and computational power in Machine Learning. (arXiv:2211.01972v1 [cs.LG])
    Science consists of conceiving hypotheses, confronting them with empirical evidence, and keeping only the hypotheses which have not yet been falsified. Under deductive reasoning, they are conceived in view of a theory and confronted with empirical evidence in an attempt to falsify it; under inductive reasoning, they are conceived based on observation, confronted with empirical evidence, and a theory is established based on the non-falsified hypotheses. When hypothesis testing can be performed with quantitative data, the confrontation can be achieved with Machine Learning methods, whose quality is highly dependent on the hypotheses' complexity, hence on the proper insertion of prior information into the set of hypotheses, seeking to decrease its complexity without losing good hypotheses. However, Machine Learning tools have been applied under the pragmatic view of instrumentalism, which is concerned only with the performance of the methods and not with the understanding of their behavior, leading to methods which are not fully understood. In this context, we discuss how prior information and computational power can be employed to solve a learning problem: while prior information and a careful design of the hypotheses space have the advantage of interpretability of the results, employing high computational power has the advantage of higher performance. We discuss why learning methods which combine both should work better from an understanding and performance perspective, arguing in favor of basic theoretical research on Machine Learning, especially about how properties of classifiers may be identified in parameters of modern learning models.  ( 3 min )
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v6 [cs.LG] UPDATED)
    A/B testing, or online experimentation, is a standard business strategy to compare a new product with an old one in the pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments on two-sided marketplace platforms (e.g., Uber) where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts the current outcome as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying out A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL.  ( 3 min )
    The Projected Covariance Measure for assumption-lean variable significance testing. (arXiv:2211.02039v1 [math.ST])
    Testing the significance of a variable or group of variables $X$ for predicting a response $Y$, given additional covariates $Z$, is a ubiquitous task in statistics. A simple but common approach is to specify a linear model, and then test whether the regression coefficient for $X$ is non-zero. However, when the model is misspecified, the test may have poor power, for example when $X$ is involved in complex interactions, or lead to many false rejections. In this work we study the problem of testing the model-free null of conditional mean independence, i.e. that the conditional mean of $Y$ given $X$ and $Z$ does not depend on $X$. We propose a simple and general framework that can leverage flexible nonparametric or machine learning methods, such as additive models or random forests, to yield both robust error control and high power. The procedure involves using these methods to perform regressions, first to estimate a form of projection of $Y$ on $X$ and $Z$ using one half of the data, and then to estimate the expected conditional covariance between this projection and $Y$ on the remaining half of the data. While the approach is general, we show that a version of our procedure using spline regression achieves what we show is the minimax optimal rate in this nonparametric testing problem. Numerical experiments demonstrate the effectiveness of our approach both in terms of maintaining Type I error control, and power, compared to several existing approaches.  ( 3 min )
    Towards federated multivariate statistical process control (FedMSPC). (arXiv:2211.01645v1 [stat.ML])
    The ongoing transition from a linear (produce-use-dispose) to a circular economy poses significant challenges to current state-of-the-art information and communication technologies. In particular, the derivation of integrated, high-level views on material, process, and product streams from (real-time) data produced along value chains is challenging for several reasons. Most importantly, sufficiently rich data is often available yet not shared across company borders because of privacy concerns which make it impossible to build integrated process models that capture the interrelations between input materials, process parameters, and key performance indicators along value chains. In the current contribution, we propose a privacy-preserving, federated multivariate statistical process control (FedMSPC) framework based on Federated Principal Component Analysis (PCA) and Secure Multiparty Computation to foster the incentive for closer collaboration of stakeholders along value chains. We tested our approach on two industrial benchmark data sets - SECOM and ST-AWFD. Our empirical results demonstrate the superior fault detection capability of the proposed approach compared to standard, single-party (multiway) PCA. Furthermore, we showcase the possibility of our framework to provide privacy-preserving fault diagnosis to each data holder in the value chain to underpin the benefits of secure data sharing and federated process modeling.  ( 2 min )
    Proximal Subgradient Norm Minimization of ISTA and FISTA. (arXiv:2211.01610v1 [math.OC])
    For first-order smooth optimization, research on the acceleration phenomenon has a long history. Only recently was the mechanism leading to acceleration successfully uncovered, by the gradient correction term and its equivalent implicit-velocity form. Furthermore, based on the high-resolution differential equation framework with the corresponding emerging techniques, phase-space representation and Lyapunov function, the squared gradient norm of Nesterov's accelerated gradient descent (\texttt{NAG}) method was shown to converge at an inverse cubic rate. However, this result cannot be directly generalized to composite optimization, which is widely used in practice, e.g., the linear inverse problem with sparse representation. In this paper, we meticulously observe a pivotal inequality used in composite optimization about the step size $s$ and the Lipschitz constant $L$ and find that it can be made tighter. We apply the tighter inequality to the well-constructed Lyapunov function and then obtain the proximal subgradient norm minimization by the phase-space representation, regardless of gradient-correction or implicit-velocity. Furthermore, we demonstrate that the squared proximal subgradient norm for the class of iterative shrinkage-thresholding algorithms (ISTA) converges at an inverse square rate, and the squared proximal subgradient norm for the class of faster iterative shrinkage-thresholding algorithms (FISTA) is accelerated to convergence at an inverse cubic rate.  ( 2 min )
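    For context, the two algorithm classes under analysis are the textbook shrinkage-thresholding iterations for min_x 0.5*||Ax - b||^2 + lam*||x||_1; a numpy sketch with the standard step size s = 1/L (ISTA is the same loop with the momentum terms removed):

        import numpy as np

        def soft_threshold(v, tau):
            return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

        def fista(A, b, lam, n_iter=500):
            """FISTA for min_x 0.5*||Ax - b||^2 + lam*||x||_1.

            Step size s = 1/L with L the Lipschitz constant of the smooth
            part, i.e. the inequality in s and L discussed above.
            """
            L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the gradient
            s = 1.0 / L
            x = y = np.zeros(A.shape[1])
            t = 1.0
            for _ in range(n_iter):
                # Proximal gradient step (this alone, with y = x, is ISTA)
                x_next = soft_threshold(y - s * A.T @ (A @ y - b), s * lam)
                # Nesterov momentum update
                t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
                y = x_next + ((t - 1.0) / t_next) * (x_next - x)
                x, t = x_next, t_next
            return x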
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v4 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information about consumers, effective personalization of goods and services has become a core business focus for companies to improve revenues and maintain a competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedures, where a key challenge is the non-random assignment of recommendations. Moreover, in many business and medical applications, interpretability of a policy is essential. We study the class of policies with linear decision boundaries to ensure interpretability, and propose learning algorithms using tools from causal inference to address unbalanced treatments. We study several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that a Bayesian optimization algorithm is effective. We test our algorithm with extensive simulation studies and apply it to an anonymized online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process. Our findings suggest that our proposed policy learning framework using tools from causal inference and Bayesian optimization provides a promising practical approach to interpretable personalization across a wide range of applications.
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v3 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    Extra-Newton: A First Approach to Noise-Adaptive Accelerated Second-Order Methods. (arXiv:2211.01832v1 [math.OC])
    This work proposes a universal and adaptive second-order method for minimizing second-order smooth, convex functions. Our algorithm achieves $O(\sigma / \sqrt{T})$ convergence when the oracle feedback is stochastic with variance $\sigma^2$, and improves its convergence to $O( 1 / T^3)$ with deterministic oracles, where $T$ is the number of iterations. Our method also interpolates these rates without knowing the nature of the oracle a priori, which is enabled by a parameter-free adaptive step-size that is oblivious to knowledge of the smoothness modulus, variance bounds, and the diameter of the constrained set. To our knowledge, this is the first universal algorithm with such global guarantees within the second-order optimization literature.  ( 2 min )
    Zero-Sum Games with Noisy Observations. (arXiv:2211.01703v1 [cs.GT])
    In this paper, $2 \times 2$ zero-sum games (ZSGs) are studied under the following assumptions: (1) One of the players (the leader) publicly and irrevocably commits to choose its actions by sampling a given probability measure (strategy); (2) The leader announces its action, which is observed by its opponent (the follower) through a binary channel; and (3) The follower chooses its strategy based on the knowledge of the leader's strategy and the noisy observation of the leader's action. Under these conditions, the equilibrium is shown to always exist and to often differ from the Nash and Stackelberg equilibria. Even subject to noise, observing the actions of the leader is either beneficial or immaterial to the follower for all possible commitments. When the commitment is observed subject to a distortion, the equilibrium does not necessarily exist. Nonetheless, the leader might still obtain some benefit in some specific cases subject to equilibrium refinements. For instance, $\epsilon$-equilibria might exist in which the leader commits to suboptimal strategies that allow unequivocally predicting the best response of its opponent.  ( 2 min )
    Beyond the Best: Estimating Distribution Functionals in Infinite-Armed Bandits. (arXiv:2211.01743v1 [cs.LG])
    In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution, and each arm can be sampled further to obtain noisy estimates of the average reward of that arm. Prior work focuses on identifying the best arm, i.e., estimating the maximum of the average reward distribution. We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings, achieving optimal sample complexities. We show that online estimation, where the learner can sequentially choose whether to sample a new or existing arm, offers no advantage over the offline setting for estimating the mean functional, but significantly reduces the sample complexity for other functionals such as the median, maximum, and trimmed mean. The matching lower bounds utilize several different Wasserstein distances. For the special case of median estimation, we identify a curious thresholding phenomenon on the indistinguishability between Gaussian convolutions with respect to the noise level, which may be of independent interest.  ( 2 min )
    Log-density gradient covariance and automatic metric tensors for Riemann manifold Monte Carlo methods. (arXiv:2211.01746v1 [stat.CO])
    A metric tensor for Riemann manifold Monte Carlo particularly suited for non-linear Bayesian hierarchical models is proposed. The metric tensor is built from the symmetric positive semidefinite log-density gradient covariance (LGC) matrices proposed here. The LGCs measure the joint information content and dependence structure of both a random variable and the parameters of said variable. The proposed methodology is highly automatic and allows for exploitation of any sparsity associated with the model in question. When implemented in conjunction with a Riemann manifold variant of the recently proposed numerical generalized randomized Hamiltonian Monte Carlo processes, the proposed methodology is highly competitive, in particular for the more challenging target distributions associated with Bayesian hierarchical models.  ( 2 min )
    Fast and robust Bayesian Inference using Gaussian Processes with GPry. (arXiv:2211.02045v1 [astro-ph.CO])
    We present the GPry algorithm for fast Bayesian inference of general (non-Gaussian) posteriors with a moderate number of parameters. GPry does not need any pre-training, special hardware such as GPUs, and is intended as a drop-in replacement for traditional Monte Carlo methods for Bayesian inference. Our algorithm is based on generating a Gaussian Process surrogate model of the log-posterior, aided by a Support Vector Machine classifier that excludes extreme or non-finite values. An active learning scheme allows us to reduce the number of required posterior evaluations by two orders of magnitude compared to traditional Monte Carlo inference. Our algorithm allows for parallel evaluations of the posterior at optimal locations, further reducing wall-clock times. We significantly improve performance using properties of the posterior in our active learning scheme and for the definition of the GP prior. In particular we account for the expected dynamical range of the posterior in different dimensionalities. We test our model against a number of synthetic and cosmological examples. GPry outperforms traditional Monte Carlo methods when the evaluation time of the likelihood (or the calculation of theoretical observables) is of the order of seconds; for evaluation times of over a minute it can perform inference in days that would take months using traditional methods. GPry is distributed as an open source Python package (pip install gpry) and can also be found at https://github.com/jonaselgammal/GPry.  ( 3 min )
    Bayesian Counterfactual Mean Embeddings and Off-Policy Evaluation. (arXiv:2211.01518v1 [stat.ML])
    The counterfactual distribution models the effect of the treatment in the untreated group. While most of the work focuses on the expected values of the treatment effect, one may be interested in the whole counterfactual distribution or other quantities associated with it. Building on the framework of Bayesian conditional mean embeddings, we propose a Bayesian approach for modeling the counterfactual distribution, which leads to quantifying the epistemic uncertainty about the distribution. The framework naturally extends to the setting where one observes multiple treatment effects (e.g. an intermediate effect after an interim period, and an ultimate treatment effect which is of main interest) and allows for additionally modelling uncertainty about the relationship of these effects. For such a goal, we present three novel Bayesian methods to estimate the expectation of the ultimate treatment effect, when only noisy samples of the dependence between intermediate and ultimate effects are provided. These methods differ in the source of uncertainty considered and allow for combining two sources of data. Moreover, we generalize these ideas to the off-policy evaluation framework, which can be seen as an extension of the counterfactual estimation problem. We empirically explore the calibration of the algorithms in two different experimental settings which require data fusion, and illustrate the value of considering the uncertainty stemming from the two sources of data.  ( 2 min )
    A Posterior Sampling Framework for Interactive Decision Making. (arXiv:2211.01962v1 [cs.LG])
    We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. In specific, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR, where generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both model-free and model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values and (ii) a loglikelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.  ( 3 min )
    Learning Hypergraphs From Signals With Dual Smoothness Prior. (arXiv:2211.01717v1 [cs.LG])
    The construction of a meaningful hypergraph topology is the key to processing signals with high-order relationships that involve more than two entities. Learning the hypergraph structure from the observed signals to capture the intrinsic relationships among the entities becomes crucial when a hypergraph topology is not readily available in the datasets. There are two challenges that lie at the heart of this problem: 1) how to handle the huge search space of potential hyperedges, and 2) how to define meaningful criteria to measure the relationship between the signals observed on nodes and the hypergraph structure. In this paper, to address the first challenge, we adopt the assumption that the ideal hypergraph structure can be derived from a learnable graph structure that captures the pairwise relations within signals. Further, we propose a hypergraph learning framework with a novel dual smoothness prior that reveals a mapping between the observed node signals and the hypergraph structure, whereby each hyperedge corresponds to a subgraph with both node signal smoothness and edge signal smoothness in the learnable graph structure. Finally, we conduct extensive experiments to evaluate the proposed framework on both synthetic and real world datasets. Experiments show that our proposed framework can efficiently infer meaningful hypergraph topologies from observed signals.  ( 2 min )
    Jump-Diffusion Langevin Dynamics for Multimodal Posterior Sampling. (arXiv:2211.01774v1 [stat.ML])
    Bayesian methods of sampling from a posterior distribution are becoming increasingly popular due to their ability to precisely display the uncertainty of a model fit. Classical methods based on iterative random sampling and posterior evaluation such as Metropolis-Hastings are known to have desirable long run mixing properties, however are slow to converge. Gradient based methods, such as Langevin Dynamics (and its stochastic gradient counterpart) exhibit favorable dimension-dependence and fast mixing times for log-concave, and "close" to log-concave distributions, however also have long escape times from local minimizers. Many contemporary applications such as Bayesian Neural Networks are both high-dimensional and highly multimodal. In this paper we investigate the performance of a hybrid Metropolis and Langevin sampling method akin to Jump Diffusion on a range of synthetic and real data, indicating that careful calibration of mixing sampling jumps with gradient based chains significantly outperforms both pure gradient-based or sampling based schemes.  ( 2 min )
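    A minimal version of such a hybrid interleaves unadjusted Langevin steps with occasional Metropolis-corrected jumps (a sketch under our reading of the abstract, not the authors' exact sampler; all tuning constants are placeholders):

        import numpy as np

        def jump_langevin(grad_log_p, log_p, x0, n_steps=5000,
                          step=1e-2, jump_prob=0.05, jump_scale=3.0, seed=0):
            """Langevin dynamics with occasional Metropolis-corrected jumps.

            Gradient steps mix quickly within a mode; the large random-walk
            jumps, accepted or rejected by Metropolis, allow escape between
            modes of a multimodal target.
            """
            rng = np.random.default_rng(seed)
            x = np.asarray(x0, dtype=float)
            samples = []
            for _ in range(n_steps):
                if rng.random() < jump_prob:
                    # Large proposal with Metropolis accept/reject
                    prop = x + jump_scale * rng.standard_normal(x.shape)
                    if np.log(rng.random()) < log_p(prop) - log_p(x):
                        x = prop
                else:
                    # Unadjusted Langevin step
                    x = (x + step * grad_log_p(x)
                         + np.sqrt(2.0 * step) * rng.standard_normal(x.shape))
                samples.append(x.copy())
            return np.array(samples)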
    Towards Discovering Neural Architectures from Scratch. (arXiv:2211.01842v1 [cs.LG])
    The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines. We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.  ( 2 min )
    A Bayesian Semiparametric Method For Estimating Causal Quantile Effects. (arXiv:2211.01591v1 [stat.ME])
    Standard causal inference characterizes treatment effect through averages, but the counterfactual distributions could be different in not only the central tendency but also spread and shape. To provide a comprehensive evaluation of treatment effects, we focus on estimating quantile treatment effects (QTEs). Existing methods that invert a nonsmooth estimator of the cumulative distribution functions forbid inference on probability density functions (PDFs), but PDFs can reveal more nuanced characteristics of the counterfactual distributions. We adopt a semiparametric conditional distribution regression model that allows inference on any functionals of counterfactual distributions, including PDFs and multiple QTEs. To account for the observational nature of the data and ensure an efficient model, we adjust for a double balancing score that augments the propensity score with individual covariates. We provide a Bayesian estimation framework that appropriately propagates modeling uncertainty. We show via simulations that the use of double balancing score for confounding adjustment improves performance over adjusting for any single score alone, and the proposed semiparametric model estimates QTEs more accurately than other semiparametric methods. We apply the proposed method to the North Carolina birth weight dataset to analyze the effect of maternal smoking on infant's birth weight.
    Convex Clustering through MM: An Efficient Algorithm to Perform Hierarchical Clustering. (arXiv:2211.01877v1 [stat.ML])
    Convex clustering is a modern method with both hierarchical and $k$-means clustering characteristics. Although convex clustering can capture the complex clustering structure hidden in data, the existing convex clustering algorithms are not scalable to large data sets with sample sizes greater than ten thousand. Moreover, it is known that convex clustering sometimes fails to produce hierarchical clustering structures. This undesirable phenomenon is called cluster split and makes it difficult to interpret clustering results. In this paper, we propose convex clustering through majorization-minimization (CCMM) -- an iterative algorithm that uses cluster fusions and sparsity to enforce a complete cluster hierarchy with reduced memory usage. In the CCMM algorithm, the diagonal majorization technique makes a highly efficient update for each iteration. With a current desktop computer, the CCMM algorithm can solve a single clustering problem featuring over one million objects in seven-dimensional space within 70 seconds.  ( 2 min )
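    For reference, the problem CCMM solves is the standard convex clustering program (a textbook formulation; the weights w_ij and penalty lambda control the cluster fusions):

        \min_{U} \; \frac{1}{2} \sum_{i=1}^{n} \| x_i - u_i \|_2^2
          \;+\; \lambda \sum_{i < j} w_{ij} \, \| u_i - u_j \|_2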
    Fair and Optimal Classification via Transports to Wasserstein-Barycenter. (arXiv:2211.01528v1 [cs.LG])
    Fairness in automated decision-making systems has gained increasing attention as their applications expand to real-world high-stakes domains. To facilitate the design of fair ML systems, it is essential to understand the potential trade-offs between fairness and predictive power, and the construction of the optimal predictor under a given fairness constraint. In this paper, for general classification problems under the group fairness criterion of demographic parity (DP), we precisely characterize the trade-off between DP and classification accuracy, referred to as the minimum cost of fairness. Our insight comes from the key observation that finding the optimal fair classifier is equivalent to solving a Wasserstein-barycenter problem under $\ell_1$-norm restricted to the vertices of the probability simplex. Inspired by our characterization, we provide a construction of an optimal fair classifier achieving this minimum cost via the composition of the Bayes regressor and optimal transports from its output distributions to the barycenter. Our construction naturally leads to an algorithm for post-processing any pre-trained predictor to satisfy DP fairness, complemented with finite sample guarantees. Experiments on real-world datasets verify and demonstrate the effectiveness of our approaches.
    A Consistent Estimator for Confounding Strength. (arXiv:2211.01903v1 [stat.ML])
    Regression on observational data can fail to capture a causal relationship in the presence of unobserved confounding. Confounding strength measures this mismatch, but estimating it itself requires additional assumptions. A common assumption is the independence of causal mechanisms, which relies on concentration phenomena in high dimensions. While high dimensions enable the estimation of confounding strength, they also necessitate adapted estimators. In this paper, we derive the asymptotic behavior of the confounding strength estimator by Janzing and Sch\"olkopf (2018) and show that it is generally not consistent. We then use tools from random matrix theory to derive an adapted, consistent estimator.
    On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach. (arXiv:2211.01498v1 [cs.LG])
    Interpretable and explainable machine learning has seen a recent surge of interest. We focus on safety as a key motivation behind the surge and make the relationship between interpretability and safety more quantitative. Toward assessing safety, we introduce the concept of maximum deviation via an optimization problem to find the largest deviation of a supervised learning model from a reference model regarded as safe. We then show how interpretability facilitates this safety assessment. For models including decision trees, generalized linear and additive models, the maximum deviation can be computed exactly and efficiently. For tree ensembles, which are not regarded as interpretable, discrete optimization techniques can still provide informative bounds. For a broader class of piecewise Lipschitz functions, we leverage the multi-armed bandit literature to show that interpretability produces tighter (regret) bounds on the maximum deviation. We present case studies, including one on mortgage approval, to illustrate our methods and the insights about models that may be obtained from deviation maximization.  ( 2 min )
    Phase Transitions in Learning and Earning under Price Protection Guarantee. (arXiv:2211.01798v1 [stat.ML])
    Motivated by the prevalence of ``price protection guarantee", which allows a customer who purchased a product in the past to receive a refund from the seller during the so-called price protection period (typically defined as a certain time window after the purchase date) in case the seller decides to lower the price, we study the impact of such policy on the design of online learning algorithm for data-driven dynamic pricing with initially unknown customer demand. We consider a setting where a firm sells a product over a horizon of $T$ time steps. For this setting, we characterize how the value of $M$, the length of price protection period, can affect the optimal regret of the learning process. We show that the optimal regret is $\tilde{\Theta}(\sqrt{T}+\min\{M,\,T^{2/3}\})$ by first establishing a fundamental impossible regime with novel regret lower bound instances. Then, we propose LEAP, a phased exploration type algorithm for \underline{L}earning and \underline{EA}rning under \underline{P}rice Protection to match this lower bound up to logarithmic factors or even doubly logarithmic factors (when there are only two prices available to the seller). Our results reveal the surprising phase transitions of the optimal regret with respect to $M$. Specifically, when $M$ is not too large, the optimal regret has no major difference when compared to that of the classic setting with no price protection guarantee. We also show that there exists an upper limit on how much the optimal regret can deteriorate when $M$ grows large. Finally, we conduct extensive numerical experiments to show the benefit of LEAP over other heuristic methods for this problem.  ( 3 min )
    Benefits of Monotonicity in Safe Exploration with Gaussian Processes. (arXiv:2211.01561v1 [stat.ML])
    We consider the problem of sequentially maximising an unknown function over a set of actions while ensuring that every sampled point has a function value below a given safety threshold. We model the function using kernel-based and Gaussian process methods, while differing from previous works in our assumption that the function is monotonically increasing with respect to a safety variable. This assumption is motivated by various practical applications such as adaptive clinical trial design and robotics. Taking inspiration from the GP-UCB and SafeOpt algorithms, we propose an algorithm, monotone safe UCB (M-SafeUCB) for this task. We show that M-SafeUCB enjoys theoretical guarantees in terms of safety, a suitably-defined regret notion, and approximately finding the entire safe boundary. In addition, we illustrate that the monotonicity assumption yields significant benefits in terms of both the guarantees obtained and the algorithmic simplicity. We support our theoretical findings by performing empirical evaluations on a variety of functions.  ( 2 min )
    Inferring independent sets of Gaussian variables after thresholding correlations. (arXiv:2211.01521v1 [stat.ME])
    We consider testing whether a set of Gaussian variables, selected from the data, is independent of the remaining variables. We assume that this set is selected via a very simple approach that is commonly used across scientific disciplines: we select a set of variables for which the correlation with all variables outside the set falls below some threshold. Unlike other settings in selective inference, failure to account for the selection step leads, in this setting, to excessively conservative (as opposed to anti-conservative) results. Our proposed test properly accounts for the fact that the set of variables is selected from the data, and thus is not overly conservative. To develop our test, we condition on the event that the selection resulted in the set of variables in question. To achieve computational tractability, we develop a new characterization of the conditioning event in terms of the canonical correlation between the groups of random variables. In simulation studies and in the analysis of gene co-expression networks, we show that our approach has much higher power than a ``naive'' approach that ignores the effect of selection.  ( 2 min )
    Isotropic Gaussian Processes on Finite Spaces of Graphs. (arXiv:2211.01689v1 [stat.ML])
    We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Mat\'ern. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.  ( 2 min )
    A Convergence Theory for Federated Average: Beyond Smoothness. (arXiv:2211.01588v1 [cs.LG])
    Federated learning enables a large number of edge computing devices to jointly learn a model without sharing data. As a leading algorithm in this setting, Federated Averaging (FedAvg), which runs Stochastic Gradient Descent (SGD) in parallel on local devices and averages the iterates only once in a while, has been widely used due to its simplicity and low communication cost. However, despite recent research efforts, it lacks theoretical analysis under assumptions beyond smoothness. In this paper, we analyze the convergence of FedAvg. Different from existing work, we relax the assumption of strong smoothness. More specifically, we assume semi-smoothness and semi-Lipschitz properties for the loss function, which include an additional first-order term in their definitions. In addition, we assume a bound on the gradient that is weaker than the bounded-gradient assumption commonly used in convergence analyses. Building on these assumptions, this paper provides a theoretical convergence study of Federated Averaging.  ( 2 min )
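    A toy sketch of the FedAvg procedure that such analyses target is shown below; the least-squares clients, step sizes, and number of local steps are illustrative assumptions.

        # Hedged sketch of FedAvg on a toy least-squares problem: each client
        # runs a few local SGD steps, and the server averages the resulting
        # parameter vectors once per communication round.
        import numpy as np

        rng = np.random.default_rng(1)
        w_true = rng.normal(size=5)
        clients = []
        for _ in range(8):
            A = rng.normal(size=(50, 5))
            clients.append((A, A @ w_true + 0.1 * rng.normal(size=50)))

        w = np.zeros(5)
        lr, local_steps = 0.01, 5
        for rnd in range(100):
            local_ws = []
            for A, b in clients:
                w_local = w.copy()
                for _ in range(local_steps):            # local SGD on the client
                    i = rng.integers(len(b))
                    grad = (A[i] @ w_local - b[i]) * A[i]
                    w_local -= lr * grad
                local_ws.append(w_local)
            w = np.mean(local_ws, axis=0)               # server-side averaging
        print("parameter error:", np.linalg.norm(w - w_true))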
  • Open

    INGREX: An Interactive Explanation Framework for Graph Neural Networks. (arXiv:2211.01548v1 [cs.LG])
    Graph Neural Networks (GNNs) are widely used in many modern applications, necessitating explanations for their decisions. However, the complexity of GNNs makes it difficult to explain predictions. Even though several methods have been proposed lately, they can only provide simple and static explanations, which are difficult for users to understand in many scenarios. Therefore, we introduce INGREX, an interactive explanation framework for GNNs designed to aid users in comprehending model predictions. Our framework is implemented based on multiple explanation algorithms and advanced libraries. We demonstrate our framework in three scenarios covering common demands for GNN explanations to present its effectiveness and helpfulness.  ( 2 min )
    Towards federated multivariate statistical process control (FedMSPC). (arXiv:2211.01645v1 [stat.ML])
    The ongoing transition from a linear (produce-use-dispose) to a circular economy poses significant challenges to current state-of-the-art information and communication technologies. In particular, the derivation of integrated, high-level views on material, process, and product streams from (real-time) data produced along value chains is challenging for several reasons. Most importantly, sufficiently rich data is often available yet not shared across company borders because of privacy concerns which make it impossible to build integrated process models that capture the interrelations between input materials, process parameters, and key performance indicators along value chains. In the current contribution, we propose a privacy-preserving, federated multivariate statistical process control (FedMSPC) framework based on Federated Principal Component Analysis (PCA) and Secure Multiparty Computation to foster the incentive for closer collaboration of stakeholders along value chains. We tested our approach on two industrial benchmark data sets - SECOM and ST-AWFD. Our empirical results demonstrate the superior fault detection capability of the proposed approach compared to standard, single-party (multiway) PCA. Furthermore, we showcase the possibility of our framework to provide privacy-preserving fault diagnosis to each data holder in the value chain to underpin the benefits of secure data sharing and federated process modeling.
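    A minimal sketch of the federated MSPC idea follows: two parties hold different process variables for the same production batches, a global PCA model is built on the aggregated data, and Hotelling's $T^2$ flags anomalous batches. Plain concatenation stands in here for the secure multiparty computation step, and the synthetic variables are illustrative assumptions.

        # Hedged sketch: PCA-based multivariate statistical process control
        # over vertically partitioned data, with plain concatenation standing
        # in for the paper's secure aggregation.
        import numpy as np

        rng = np.random.default_rng(2)
        n = 300
        X1 = rng.normal(size=(n, 4))                         # party A's variables
        X2 = 0.5 * X1[:, :2] + 0.3 * rng.normal(size=(n, 2)) # party B's, correlated

        X = np.hstack([X1, X2])               # would be assembled under SMPC
        X = (X - X.mean(0)) / X.std(0)
        cov = X.T @ X / (n - 1)
        eigvals, eigvecs = np.linalg.eigh(cov)
        P, lam = eigvecs[:, -2:], eigvals[-2:]   # top-2 principal components

        scores = X @ P
        T2 = np.sum(scores ** 2 / lam, axis=1)   # Hotelling's T^2 statistic
        limit = np.quantile(T2, 0.99)            # empirical control limit
        print("batches flagged:", np.sum(T2 > limit))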
    A Consistent Estimator for Confounding Strength. (arXiv:2211.01903v1 [stat.ML])
    Regression on observational data can fail to capture a causal relationship in the presence of unobserved confounding. Confounding strength measures this mismatch, but estimating it requires itself additional assumptions. A common assumption is the independence of causal mechanisms, which relies on concentration phenomena in high dimensions. While high dimensions enable the estimation of confounding strength, they also necessitate adapted estimators. In this paper, we derive the asymptotic behavior of the confounding strength estimator by Janzing and Sch\"olkopf (2018) and show that it is generally not consistent. We then use tools from random matrix theory to derive an adapted, consistent estimator.
    A machine learning model to identify corruption in M\'exico's public procurement contracts. (arXiv:2211.01478v1 [cs.CY])
    The costs and impacts of government corruption range from impairing a country's economic growth to affecting its citizens' well-being and safety. Public contracting between government dependencies and private sector instances, referred to as public procurement, is a fertile land of opportunity for corrupt practices, generating substantial monetary losses worldwide. Thus, identifying and deterring corrupt activities between the government and the private sector is paramount. However, due to several factors, corruption in public procurement is challenging to identify and track, leading to corrupt practices going unnoticed. This paper proposes a machine learning model based on an ensemble of random forest classifiers, which we call a hyper-forest, to identify and predict corrupt contracts in M\'exico's public procurement data. The method correctly detects most of the corrupt and non-corrupt contracts in the evaluated dataset. Furthermore, we found that the most important predictors in the model are those related to the relationship between buyers and suppliers rather than those related to features of individual contracts. The method proposed here is also general enough to be trained with data from other countries. Overall, our work presents a tool that can help in the decision-making process to identify, predict and analyze corruption in public procurement contracts.
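    Reading "hyper-forest" as an ensemble of random forests combined by majority vote, a minimal sketch follows; the synthetic, imbalanced data stands in for the procurement features, which are not reproduced here.

        # Hedged sketch of a "hyper-forest": several random forests whose
        # predictions are combined by majority vote.
        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = make_classification(n_samples=2000, n_features=12, weights=[0.9],
                                   random_state=0)   # imbalanced, like fraud labels
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        forests = [RandomForestClassifier(n_estimators=100, random_state=s)
                   .fit(X_tr, y_tr) for s in range(5)]
        votes = np.stack([f.predict(X_te) for f in forests])
        y_hat = (votes.mean(axis=0) > 0.5).astype(int)   # majority vote
        print("accuracy:", (y_hat == y_te).mean())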
    Self Supervised Low Dose Computed Tomography Image Denoising Using Invertible Network Exploiting Inter Slice Congruence. (arXiv:2211.01618v1 [eess.IV])
    The resurgence of deep neural networks has created an alternative pathway for low-dose computed tomography denoising by learning a nonlinear transformation function between low-dose CT (LDCT) and normal-dose CT (NDCT) image pairs. However, such paired LDCT and NDCT images are rarely available in the clinical environment, making deployment of deep neural networks infeasible. This study proposes a novel method for self-supervised low-dose CT denoising that alleviates the requirement of paired LDCT and NDCT images. Specifically, we train an invertible neural network to minimize the pixel-wise mean square distance between a noisy slice and the average of its two immediately adjacent noisy slices. We show that this objective is similar to training a neural network to minimize the distance between clean NDCT and noisy LDCT image pairs. Moreover, during the reverse mapping of the invertible network, the output image is mapped back to the original input image, similar to a cycle consistency loss. Finally, the trained invertible network's forward mapping is used for denoising LDCT images. Extensive experiments on two publicly available datasets show that our method performs favourably against existing unsupervised methods.
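    The inter-slice training signal can be sketched as below; a plain CNN stands in for the paper's invertible network, and the random tensor stands in for an actual LDCT volume, so this is a sketch of the loss only, not the full method.

        # Hedged sketch of the inter-slice objective: the model maps a noisy
        # slice toward the average of its two adjacent noisy slices.
        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                            nn.Conv2d(16, 1, 3, padding=1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)

        volume = torch.randn(32, 1, 64, 64)   # stack of noisy slices (synthetic)
        for epoch in range(5):
            for k in range(1, volume.shape[0] - 1):
                target = 0.5 * (volume[k - 1:k] + volume[k + 1:k + 2])  # neighbours
                pred = net(volume[k:k + 1])
                loss = ((pred - target) ** 2).mean()
                opt.zero_grad(); loss.backward(); opt.step()
        # at inference, net(noisy_slice) is used as the denoised output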
    FourierNets enable the design of highly non-local optical encoders for computational imaging. (arXiv:2104.10611v6 [eess.IV] UPDATED)
    Differentiable simulations of optical systems can be combined with deep learning-based reconstruction networks to enable high performance computational imaging via end-to-end (E2E) optimization of both the optical encoder and the deep decoder. This has enabled imaging applications such as 3D localization microscopy, depth estimation, and lensless photography via the optimization of local optical encoders. More challenging computational imaging applications, such as 3D snapshot microscopy which compresses 3D volumes into single 2D images, require a highly non-local optical encoder. We show that existing deep network decoders have a locality bias which prevents the optimization of such highly non-local optical encoders. We address this with a decoder based on a shallow neural network architecture using global kernel Fourier convolutional neural networks (FourierNets). We show that FourierNets surpass existing deep network based decoders at reconstructing photographs captured by the highly non-local DiffuserCam optical encoder. Further, we show that FourierNets enable E2E optimization of highly non-local optical encoders for 3D snapshot microscopy. By combining FourierNets with a large-scale multi-GPU differentiable optical simulation, we are able to optimize non-local optical encoders 170$\times$ to 7372$\times$ larger than prior state of the art, and demonstrate the potential for ROI-type specific optical encoding with a programmable microscope.
    A machine learning approach for fighting the curse of dimensionality in global optimization. (arXiv:2110.14985v2 [cs.LG] UPDATED)
    Finding global optima in high-dimensional optimization problems is extremely challenging since the number of function evaluations required to sufficiently explore the search space increases exponentially with its dimensionality. Furthermore, multimodal cost functions render local gradient-based search techniques ineffective. To overcome these difficulties, we propose to trim uninteresting regions of the search space where global optima are unlikely to be found by means of autoencoders, exploiting the lower intrinsic dimensionality of certain cost functions; optima are then searched over lower-dimensional latent spaces. The methodology is tested on benchmark functions and on multiple variations of a structural topology optimization problem, where we show that we can estimate this intrinsic lower dimensionality and, based thereon, obtain the global optimum in the best case, or results superior to those of established optimization procedures in the worst case.
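    The latent-space idea can be sketched as follows: fit an autoencoder on promising samples of a multimodal cost function, then search in the lower-dimensional latent space. The toy cost function, architecture, and random search are illustrative assumptions, not the paper's setup.

        # Hedged sketch: autoencoder-based dimensionality reduction for
        # global optimization, searching over a 2-D latent space.
        import numpy as np
        import torch
        import torch.nn as nn

        def cost(x):                            # toy multimodal cost (assumed)
            return np.sum(np.sin(3 * x) ** 2 + 0.1 * x ** 2, axis=-1)

        rng = np.random.default_rng(3)
        samples = rng.uniform(-2, 2, size=(2000, 10))
        keep = samples[np.argsort(cost(samples))[:500]]   # keep low-cost samples

        enc = nn.Sequential(nn.Linear(10, 32), nn.Tanh(), nn.Linear(32, 2))
        dec = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 10))
        opt = torch.optim.Adam([*enc.parameters(), *dec.parameters()], lr=1e-3)
        Xt = torch.tensor(keep, dtype=torch.float32)
        for _ in range(2000):                   # train the autoencoder
            loss = ((dec(enc(Xt)) - Xt) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()

        # random search over the 2-D latent space, decoded back to 10-D inputs
        z = torch.randn(5000, 2)
        cand = dec(z).detach().numpy()
        best = cand[np.argmin(cost(cand))]
        print("best cost found:", cost(best[None]).item())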
    Oracle Inequalities for Model Selection in Offline Reinforcement Learning. (arXiv:2211.02016v1 [cs.LG])
    In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, ModBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
    PyDaddy: A Python package for discovering stochastic dynamical equations from timeseries data. (arXiv:2205.02645v2 [q-bio.QM] UPDATED)
    Most real-world ecological dynamics, ranging from ecosystem dynamics to collective animal movement, are inherently stochastic in nature. Stochastic differential equations (SDEs) are a popular modelling framework for dynamics with intrinsic randomness. Here, we focus on the inverse question: if one has empirically measured time-series data from some system of interest, is it possible to discover the SDE model that best describes the data? We present PyDaddy (PYthon library for DAta Driven DYnamics), a toolbox to construct and analyze interpretable SDE models based on time-series data. We combine traditional approaches for data-driven SDE reconstruction with an equation learning approach, to derive symbolic equations governing the stochastic dynamics. The toolkit is presented as an open-source Python library and consists of tools to construct and analyze SDEs. Functionality is included for visual examination of the stochastic structure of the data, guided extraction of the functional form of the SDE, and diagnosis and debugging of the underlying assumptions and the extracted model. Using simulated time-series datasets exhibiting a wide range of dynamics, we show that PyDaddy is able to correctly identify underlying SDE models. We demonstrate the applicability of the toolkit to real-world data using previously published movement data of a fish school. Starting from the time series of the observed polarization of the school, PyDaddy readily discovers the SDE model governing the dynamics of group polarization. The model recovered by PyDaddy is consistent with the previous study. In summary, stochastic and noise-induced effects are central to the dynamics of many biological systems. In this context, we present an easy-to-use package to reconstruct SDEs from time-series data.
    The Authenticity Gap in Human Evaluation. (arXiv:2205.11930v2 [cs.CL] UPDATED)
    Human ratings are the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and rank NLG systems by their average scores. However, little consideration has been given as to whether this approach faithfully captures human preferences. Analyzing this standard protocol through the lens of utility theory in economics, we identify the implicit assumptions it makes about annotators. These assumptions are often violated in practice, in which case annotator ratings cease to reflect their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new human evaluation protocol called $\textit{system-level probabilistic assessment}$ (SPA). When human evaluation of stories is done with SPA, we can recover the ordering of GPT-3 models by size, with statistically significant results. However, when human evaluation is done with the standard protocol, less than half of the expected preferences can be recovered (e.g., there is no significant difference between $\texttt{curie}$ and $\texttt{davinci}$, despite using a highly powered test).
    Discussion of Features for Acoustic Anomaly Detection under Industrial Disturbing Noise in an End-of-Line Test of Geared Motors. (arXiv:2211.01716v1 [eess.AS])
    In the end-of-line test of geared motors, the evaluation of product quality is important. Due to time constraints and the high diversity of variants, acoustic measurements are more economical than vibration measurements. However, the acoustic data is affected by industrial disturbing noise. Therefore, the aim of this study is to investigate the robustness of features used for anomaly detection in geared motor end-of-line testing. A real-world dataset with typical faults and acoustic disturbances is recorded by an acoustic array. This includes industrial noise from the production and systematically produced disturbances, used to compare the robustness. Overall, it is proposed to apply features extracted from a log-envelope spectrum together with psychoacoustic features. The anomaly detection is done by using the isolation forest or the more universal bagging random miner. Most disturbances can be circumvented, while the use of a hammer or air pressure often causes problems. In general, these results are important for condition monitoring tasks that are based on acoustic or vibration measurements. Furthermore, a real-world problem description is presented to improve common signal processing and machine learning tasks.
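    A minimal sketch of the detection pipeline follows: log-envelope spectrum features feed an isolation forest trained on healthy recordings only. The synthetic signals, the simplified feature extraction, and the impulse faults are illustrative assumptions, not the study's dataset or exact feature set.

        # Hedged sketch: log-envelope spectrum features plus an isolation
        # forest fitted on healthy samples for acoustic anomaly detection.
        import numpy as np
        from scipy.signal import hilbert
        from sklearn.ensemble import IsolationForest

        def log_envelope_spectrum(signal, n_bins=64):
            envelope = np.abs(hilbert(signal))            # amplitude envelope
            spectrum = np.abs(np.fft.rfft(np.log(envelope + 1e-9)))
            return spectrum[:n_bins]

        rng = np.random.default_rng(4)
        healthy = [np.sin(2 * np.pi * 50 * np.linspace(0, 1, 8000))
                   + 0.1 * rng.normal(size=8000) for _ in range(40)]
        faulty = [h + 0.5 * (rng.random(8000) < 0.01) for h in healthy[:10]]

        X_train = np.array([log_envelope_spectrum(s) for s in healthy])
        X_test = np.array([log_envelope_spectrum(s) for s in faulty])

        det = IsolationForest(random_state=0).fit(X_train)  # healthy only
        print("anomaly scores:", det.score_samples(X_test)[:5])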
    FingerFlex: Inferring Finger Trajectories from ECoG signals. (arXiv:2211.01960v1 [q-bio.NC])
    Motor brain-computer interface (BCI) development relies critically on neural time series decoding algorithms. Recent advances in deep learning architectures allow for automatic feature selection to approximate higher-order dependencies in data. This article presents the FingerFlex model - a convolutional encoder-decoder architecture adapted for finger movement regression on electrocorticographic (ECoG) brain data. State-of-the-art performance was achieved on a publicly available BCI competition IV dataset 4 with a correlation coefficient between true and predicted trajectories up to 0.74. The presented method provides the opportunity for developing fully-functional high-precision cortical motor brain-computer interfaces.
    PEMP: Leveraging Physics Properties to Enhance Molecular Property Prediction. (arXiv:2211.01978v1 [q-bio.BM])
    Molecular property prediction is essential for drug discovery. In recent years, deep learning methods have been introduced to this area and achieved state-of-the-art performance. However, most existing methods ignore the intrinsic relations between molecular properties, which can be exploited to improve the corresponding prediction tasks. In this paper, we propose a new approach, namely Physics properties Enhanced Molecular Property prediction (PEMP), to utilize relations between molecular properties revealed by previous physics theory and physical chemistry studies. Specifically, we enhance the training of the chemical and physiological property predictors with related physics property prediction tasks. We design two different methods for PEMP, based respectively on multi-task learning and transfer learning. Both methods include a model-agnostic molecule representation module and a property prediction module. In our implementation, we adopt state-of-the-art molecule embedding models under both the supervised learning paradigm and the pretraining paradigm as the molecule representation module of PEMP. Experimental results on the public benchmark MoleculeNet show that the proposed methods are able to outperform corresponding state-of-the-art models.
    The Need for Medically Aware Video Compression in Gastroenterology. (arXiv:2211.01472v1 [eess.IV])
    Compression is essential to storing and transmitting medical videos, but the effect of compression on downstream medical tasks is often ignored. Furthermore, systems in practice rely on standard video codecs, which naively allocate bits between medically relevant frames or parts of frames. In this work, we present an empirical study of some deficiencies of classical codecs on gastroenterology videos, and motivate our ongoing work to train a learned compression model for colonoscopy videos. We show that two of the most common classical codecs, H264 and HEVC, compress medically relevant frames statistically significantly worse than medically nonrelevant ones, and that polyp detector performance degrades rapidly as compression increases. We explain how a learned compressor could allocate bits to important regions and allow detection performance to degrade more gracefully. Many of our proposed techniques generalize to medical video domains beyond gastroenterology.
    Demo: LE3D: A Privacy-preserving Lightweight Data Drift Detection Framework. (arXiv:2211.01827v1 [cs.LG])
    This paper presents LE3D, a novel data drift detection framework for preserving data integrity and confidentiality. LE3D is a generalisable platform for evaluating novel drift detection mechanisms within Internet of Things (IoT) sensor deployments. Our framework operates in a distributed manner, preserving data privacy while remaining adaptable to new sensors with minimal online reconfiguration. The framework currently supports multiple drift estimators for time-series IoT data and can easily be extended to accommodate new data types and drift detection mechanisms. This demo illustrates the functionality of LE3D under a real-world-like scenario.
    Empirical Analysis of Model Selection for Heterogenous Causal Effect Estimation. (arXiv:2211.01939v1 [cs.LG])
    We study the problem of model selection in causal inference, specifically for the case of conditional average treatment effect (CATE) estimation under binary treatments. Unlike model selection in machine learning, we cannot use the technique of cross-validation here as we do not observe the counterfactual potential outcome for any data point. Hence, we need to design model selection techniques that do not explicitly rely on counterfactual data. As an alternative to cross-validation, there have been a variety of proxy metrics proposed in the literature, that depend on auxiliary nuisance models also estimated from the data (propensity score model, outcome regression model). However, the effectiveness of these metrics has only been studied on synthetic datasets as we can observe the counterfactual data for them. We conduct an extensive empirical analysis to judge the performance of these metrics, where we utilize the latest advances in generative modeling to incorporate multiple realistic datasets. We evaluate 9 metrics on 144 datasets for selecting between 415 estimators per dataset, including datasets that closely mimic real-world datasets. Further, we use the latest techniques from AutoML to ensure consistent hyperparameter selection for nuisance models for a fair comparison across metrics.
    Towards glass-box CNNs. (arXiv:2101.10443v2 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) are brain-inspired architectures popular for their ability to train on and relearn visually complex tasks. They are incremental and scalable; however, a CNN is mostly treated as a black box and tuned through multiple trial-and-error runs. We observe that CNNs construct powerful internal representations that help achieve state-of-the-art performance. Here we propose a three-layer glass-box (analytical) CNN for two-class image classification problems. The first is a representation layer that encompasses both the class information (group invariant) and symmetric transformations (group equivariant) of input images. It is then passed through a dimension reduction layer (PCA). Finally, the compact yet complete representation is provided to a classifier. Analytical machine learning classifiers and multilayer perceptrons are used to assess sensitivity. The proposed glass-box CNN is compared with the equivariance of AlexNet's (CNN) internal representation for better understanding and dissemination of results. In future work, we would like to construct glass-box CNNs for multiclass visually complex tasks.
    Automated Domain Discovery from Multiple Sources to Improve Zero-Shot Generalization. (arXiv:2112.09802v2 [cs.LG] UPDATED)
    Domain generalization (DG) methods aim to develop models that generalize to settings where the test distribution is different from the training data. In this paper, we focus on the challenging problem of multi-source zero shot DG (MDG), where labeled training data from multiple source domains is available but with no access to data from the target domain. A wide range of solutions have been proposed for this problem, including the state-of-the-art multi-domain ensembling approaches. Despite these advances, the na\"ive ERM solution of pooling all source data together and training a single classifier is surprisingly effective on standard benchmarks. In this paper, we hypothesize that it is important to elucidate the link between pre-specified domain labels and MDG performance, in order to explain this behavior. More specifically, we consider two popular classes of MDG algorithms -- distributional robust optimization (DRO) and multi-domain ensembles, in order to demonstrate how inferring custom domain groups can lead to consistent improvements over the original domain labels that come with the dataset. To this end, we propose (i) Group-DRO++, which incorporates an explicit clustering step to identify custom domains in an existing DRO technique; and (ii) DReaME, which produces effective multi-domain ensembles through implicit domain re-labeling with a novel meta-optimization algorithm. Using empirical studies on multiple standard benchmarks, we show that our variants consistently outperform ERM by significant margins (1.5% - 9%), and produce state-of-the-art MDG performance. Our code can be found at https://github.com/kowshikthopalli/DREAME
    Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top. (arXiv:2206.00529v2 [cs.LG] UPDATED)
    Byzantine-robustness has been gaining a lot of attention due to the growth of the interest in collaborative and federated learning. However, many fruitful directions, such as the usage of variance reduction for achieving robustness and communication compression for reducing communication costs, remain weakly explored in the field. This work addresses this gap and proposes Byz-VR-MARINA - a new Byzantine-tolerant method with variance reduction and compression. A key message of our paper is that variance reduction is key to fighting Byzantine workers more effectively. At the same time, communication compression is a bonus that makes the process more communication efficient. We derive theoretical convergence guarantees for Byz-VR-MARINA outperforming previous state-of-the-art for general non-convex and Polyak-Lojasiewicz loss functions. Unlike the concurrent Byzantine-robust methods with variance reduction and/or compression, our complexity results are tight and do not rely on restrictive assumptions such as boundedness of the gradients or limited compression. Moreover, we provide the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients. Numerical experiments corroborate our theoretical findings.
    DP$^2$-VAE: Differentially Private Pre-trained Variational Autoencoders. (arXiv:2208.03409v2 [cs.LG] UPDATED)
    Modern machine learning systems achieve great success when trained on large datasets. However, these datasets usually contain sensitive information (e.g. medical records, face images), leading to serious privacy concerns. Differentially private generative models (DPGMs) emerge as a solution to circumvent such privacy concerns by generating privatized sensitive data. Similar to other differentially private (DP) learners, the major challenge for DPGM is also how to achieve a subtle balance between utility and privacy. We propose DP$^2$-VAE, a novel training mechanism for variational autoencoders (VAE) with provable DP guarantees and improved utility via \emph{pre-training on private data}. Under the same DP constraints, DP$^2$-VAE minimizes the perturbation noise during training, and hence improves utility. DP$^2$-VAE is very flexible and easily amenable to many other VAE variants. Theoretically, we study the effect of pretraining on private data. Empirically, we conduct extensive experiments on image datasets to illustrate our superiority over baselines under various privacy budgets and evaluation metrics.
    Learning Hypergraphs From Signals With Dual Smoothness Prior. (arXiv:2211.01717v1 [cs.LG])
    The construction of a meaningful hypergraph topology is the key to processing signals with high-order relationships that involve more than two entities. Learning the hypergraph structure from the observed signals to capture the intrinsic relationships among the entities becomes crucial when a hypergraph topology is not readily available in the datasets. There are two challenges that lie at the heart of this problem: 1) how to handle the huge search space of potential hyperedges, and 2) how to define meaningful criteria to measure the relationship between the signals observed on nodes and the hypergraph structure. In this paper, to address the first challenge, we adopt the assumption that the ideal hypergraph structure can be derived from a learnable graph structure that captures the pairwise relations within signals. Further, we propose a hypergraph learning framework with a novel dual smoothness prior that reveals a mapping between the observed node signals and the hypergraph structure, whereby each hyperedge corresponds to a subgraph with both node signal smoothness and edge signal smoothness in the learnable graph structure. Finally, we conduct extensive experiments to evaluate the proposed framework on both synthetic and real-world datasets. Experiments show that our proposed framework can efficiently infer meaningful hypergraph topologies from observed signals.
    FedMint: Intelligent Bilateral Client Selection in Federated Learning with Newcomer IoT Devices. (arXiv:2211.01805v1 [cs.LG])
    Federated Learning (FL) is a novel distributed privacy-preserving learning paradigm, which enables the collaboration among several participants (e.g., Internet of Things devices) for the training of machine learning models. However, selecting the participants that would contribute to this collaborative training is highly challenging. Adopting a random selection strategy would entail substantial problems due to the heterogeneity in terms of data quality, and computational and communication resources across the participants. Although several approaches have been proposed in the literature to overcome the problem of random selection, most of these approaches follow a unilateral selection strategy. In fact, they base their selection strategy on only the federated server's side, while overlooking the interests of the client devices in the process. To overcome this problem, we present in this paper FedMint, an intelligent client selection approach for federated learning on IoT devices using game theory and bootstrapping mechanism. Our solution involves the design of: (1) preference functions for the client IoT devices and federated servers to allow them to rank each other according to several factors such as accuracy and price, (2) intelligent matching algorithms that take into account the preferences of both parties in their design, and (3) bootstrapping technique that capitalizes on the collaboration of multiple federated servers in order to assign initial accuracy value for the newly connected IoT devices. Based on our simulation findings, our strategy surpasses the VanillaFL selection approach in terms of maximizing both the revenues of the client devices and accuracy of the global federated learning model.
    Holistic Deep Learning. (arXiv:2110.15829v3 [cs.LG] UPDATED)
    There is much interest in deep learning to solve challenges in applying neural network models in real-world environments. In particular, three areas have received considerable attention: adversarial robustness, parameter sparsity, and output stability. Despite numerous attempts to solve these problems independently, little work simultaneously addresses the challenges. In this paper, we address the problem of constructing holistic deep learning models by proposing a novel formulation that solves these issues in combination. Real-world experiments on both tabular and MNIST datasets show that our formulation can simultaneously improve the accuracy, robustness, stability, and sparsity over traditional deep learning models among many others.
    Speed Up the Cold-Start Learning in Two-Sided Bandits with Many Arms. (arXiv:2210.00340v2 [cs.LG] UPDATED)
    Multi-armed bandit (MAB) algorithms are efficient approaches to reduce the opportunity cost of online experimentation and are used by companies to find the best product from periodically refreshed product catalogs. However, these algorithms face the so-called cold-start problem at the onset of the experiment due to a lack of knowledge of customer preferences for new products, requiring an initial data collection phase known as the burn-in period. During this period, MAB algorithms operate like randomized experiments, incurring large burn-in costs which scale with the large number of products. We attempt to reduce the burn-in by observing that many products can be cast as two-sided products, and then naturally model the rewards of the products with a matrix, whose rows and columns represent the two sides respectively. Next, we design two-phase bandit algorithms that first use subsampling and low-rank matrix estimation to obtain a substantially smaller targeted set of products and then apply a UCB procedure on the target products to find the best one. We theoretically show that the proposed algorithms lower costs and expedite the experiment in cases when there is limited experimentation time along with a large product set. Our analysis also reveals three regimes of long, short, and ultra-short horizon experiments, depending on dimensions of the matrix. Empirical evidence from both synthetic data and a real-world dataset on music streaming services validates this superior performance.
    Meta-PDE: Learning to Solve PDEs Quickly Without a Mesh. (arXiv:2211.01604v1 [cs.LG])
    Partial differential equations (PDEs) are often computationally challenging to solve, and in many settings many related PDEs must be solved either at every timestep or for a variety of candidate boundary conditions, parameters, or geometric domains. We present a meta-learning based method which learns to rapidly solve problems from a distribution of related PDEs. We use meta-learning (MAML and LEAP) to identify initializations for a neural network representation of the PDE solution such that a residual of the PDE can be quickly minimized on a novel task. We apply our meta-solving approach to a nonlinear Poisson's equation, 1D Burgers' equation, and hyperelasticity equations with varying parameters, geometries, and boundary conditions. The resulting Meta-PDE method finds qualitatively accurate solutions to most problems within a few gradient steps; for the nonlinear Poisson and hyperelasticity equations this results in an intermediate accuracy approximation up to an order of magnitude faster than a baseline finite element analysis (FEA) solver with equivalent accuracy. In comparison to other learned solvers and surrogate models, this meta-learning approach can be trained without supervision from expensive ground-truth data, does not require a mesh, and can even be used when the geometry and topology varies between tasks.
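    A toy sketch of the meta-learning idea follows, with a first-order Reptile-style update standing in for the paper's MAML/LEAP and a 1D Poisson family $u''(x) = a\sin(x)$ as the (assumed) task distribution: meta-train an initialization so that the PDE residual can be minimized in a few gradient steps for a new coefficient $a$.

        # Hedged sketch: Reptile-style meta-learning of a network
        # initialization for physics-informed PDE residual minimization.
        import copy
        import torch
        import torch.nn as nn

        def residual(net, a):
            x = torch.linspace(-3, 3, 64).reshape(-1, 1).requires_grad_(True)
            u = net(x)
            du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
            d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
            pde = ((d2u - a * torch.sin(x)) ** 2).mean()
            bc = (u[0] ** 2 + u[-1] ** 2).sum()   # simple boundary penalty
            return pde + bc

        net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
        meta_lr = 0.1
        for it in range(300):                             # meta-training loop
            a = float(torch.empty(1).uniform_(0.5, 2.0))  # sample a PDE task
            fast = copy.deepcopy(net)
            opt = torch.optim.SGD(fast.parameters(), lr=0.01)
            for _ in range(5):                            # few adaptation steps
                loss = residual(fast, a)
                opt.zero_grad(); loss.backward(); opt.step()
            with torch.no_grad():                         # Reptile meta-update
                for p, q in zip(net.parameters(), fast.parameters()):
                    p += meta_lr * (q - p)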
    IQ-Learn: Inverse soft-Q Learning for Imitation. (arXiv:2106.12142v4 [cs.LG] UPDATED)
    In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence but doesn't utilize any information involving the environment's dynamics. Many existing methods that exploit dynamics information are difficult to train in practice due to an adversarial optimization process over reward and policy approximators or biased, high variance gradient estimators. We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn) obtains state-of-the-art results in offline and online imitation learning settings, significantly outperforming existing methods both in the number of required environment interactions and scalability in high-dimensional spaces, often by more than 3x.
    Active Labeling: Streaming Stochastic Gradients. (arXiv:2205.13255v2 [cs.LG] UPDATED)
    The workhorse of machine learning is stochastic gradient descent. To access stochastic gradients, it is common to consider iteratively input/output pairs of a training dataset. Interestingly, it appears that one does not need full supervision to access stochastic gradients, which is the main motivation of this paper. After formalizing the "active labeling" problem, which focuses on active learning with partial supervision, we provide a streaming technique that provably minimizes the ratio of generalization error over the number of samples. We illustrate our technique in depth for robust regression.
    Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning. (arXiv:2209.14610v2 [cs.LG] UPDATED)
    Mathematical reasoning, a core ability of human intelligence, presents unique challenges for machines in abstract thinking and logical reasoning. Recent large pre-trained language models such as GPT-3 have achieved remarkable progress on mathematical reasoning tasks written in text form, such as math word problems (MWP). However, it is unknown if the models can handle more complex problems that involve math reasoning over heterogeneous information, such as tabular data. To fill the gap, we present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 open-domain grade-level problems that require mathematical reasoning on both textual and tabular data. Each question in TabMWP is aligned with a tabular context, which is presented as an image, semi-structured text, and a structured table. There are two types of questions: free-text and multi-choice, and each problem is annotated with gold solutions to reveal the multi-step reasoning process. We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting. As earlier studies suggest, since few-shot GPT-3 relies on the selection of in-context examples, its performance is unstable and can degrade to near chance. The unstable issue is more severe when handling complex problems like TabMWP. To mitigate this, we further propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data and then constructs the corresponding prompt for the test example. Experimental results show that our method outperforms the best baseline by 5.31% on the accuracy metric and reduces the prediction variance significantly compared to random selection, which verifies its effectiveness in the selection of in-context examples.
    Gravitational Dimensionality Reduction Using Newtonian Gravity and Einstein's General Relativity. (arXiv:2211.01369v1 [cs.LG])
    Due to the effectiveness of machine learning in physics, it has received increasing attention in the literature. However, the converse notion of applying physics in machine learning has drawn far less attention. This work is a hybrid of physics and machine learning where concepts of physics are used in machine learning. We propose the supervised Gravitational Dimensionality Reduction (GDR) algorithm, in which the data points of every class are moved toward each other to reduce intra-class variances and better separate the classes. For every data point, the other points are considered to be gravitational particles, such as stars, and the point is attracted to the points of its class by gravity. The data points are first projected onto a spacetime manifold using principal component analysis. We propose two variants of GDR -- one with Newtonian gravity and one with Einstein's general relativity. The former uses Newtonian gravity along a straight line between points, while the latter moves data points along the geodesics of the spacetime manifold. For GDR with relativistic gravitation, we use both the Schwarzschild and Minkowski metric tensors to cover both general relativity and special relativity. Our simulations show the effectiveness of GDR in the discrimination of classes.
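    The Newtonian variant can be sketched as follows: after a PCA projection, every point takes small steps toward its classmates under an inverse-square attraction, shrinking intra-class variance. The gravitational constant G, step count, and Iris data are illustrative assumptions; the relativistic variant is not sketched.

        # Hedged sketch of Newtonian GDR: inverse-square attraction between
        # points of the same class in a PCA-projected space.
        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.decomposition import PCA

        X, y = load_iris(return_X_y=True)
        Z = PCA(n_components=2).fit_transform(X)

        G, eps, steps = 1e-3, 1e-6, 50
        for _ in range(steps):
            Z_new = Z.copy()
            for c in np.unique(y):
                idx = np.where(y == c)[0]
                for i in idx:
                    diff = Z[idx] - Z[i]                        # vectors to classmates
                    dist = np.linalg.norm(diff, axis=1) + eps
                    force = (diff / dist[:, None] ** 3).sum(0)  # inverse-square pull
                    Z_new[i] += G * force
            Z = Z_new
        print("intra-class variance:",
              np.mean([Z[y == c].var() for c in np.unique(y)]))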
    Interpretable Personalization via Policy Learning with Linear Decision Boundaries. (arXiv:2003.07545v4 [cs.LG] UPDATED)
    With the rise of the digital economy and an explosion of available information about consumers, effective personalization of goods and services has become a core business focus for companies to improve revenues and maintain a competitive edge. This paper studies the personalization problem through the lens of policy learning, where the goal is to learn a decision-making rule (a policy) that maps from consumer and product characteristics (features) to recommendations (actions) in order to optimize outcomes (rewards). We focus on using available historical data for offline learning with unknown data collection procedures, where a key challenge is the non-random assignment of recommendations. Moreover, in many business and medical applications, interpretability of a policy is essential. We study the class of policies with linear decision boundaries to ensure interpretability, and propose learning algorithms using tools from causal inference to address unbalanced treatments. We study several optimization schemes to solve the associated non-convex, non-smooth optimization problem, and find that a Bayesian optimization algorithm is effective. We test our algorithm with extensive simulation studies and apply it to an anonymized online marketplace customer purchase dataset, where the learned policy outputs a personalized discount recommendation based on customer and product features in order to maximize gross merchandise value (GMV) for sellers. Our learned policy improves upon the platform's baseline by 88.2\% in net sales revenue, while also providing informative insights on which features are important for the decision-making process. Our findings suggest that our proposed policy learning framework using tools from causal inference and Bayesian optimization provides a promising practical approach to interpretable personalization across a wide range of applications.
    A Survey of Deep Causal Models. (arXiv:2209.08860v3 [stat.ML] UPDATED)
    The concept of causality plays a significant role in human cognition. In the past few decades, causal inference has been well developed in many fields, such as computer science, medicine, economics, and other industrial applications. With the advancement of deep learning, it has been increasingly applied to causal inference with counterfactual data. Typically, deep causal models map the characteristics of covariates to a representation space and then design various objective functions to estimate counterfactual data unbiasedly. Different from existing surveys on causal models in machine learning, this paper focuses on an overview of deep causal models, and its core contributions are as follows: 1) we summarize the popularly adopted relevant metrics under multiple treatments and continuous-dose treatment; 2) we provide a comprehensive overview of deep causal models from both the timeline-of-development and method-classification perspectives; 3) we also present a detailed categorization and analysis of relevant datasets, source codes and experiments.
    Emergent Linguistic Structures in Neural Networks are Fragile. (arXiv:2210.17406v2 [cs.LG] UPDATED)
    Large language models (LLMs) have been reported to have strong performance on natural language processing tasks. However, performance metrics such as accuracy do not measure the quality of the model in terms of its ability to robustly represent complex linguistic structure. In this work, we propose a framework to evaluate the robustness of linguistic representations using probing tasks. We leverage recent advances in extracting emergent linguistic constructs from LLMs and apply syntax-preserving perturbations to test the stability of these constructs, in order to better understand the representations learned by LLMs. Empirically, we study the performance of four LLMs across six different corpora on the proposed robustness measures. We provide evidence that context-free representations (e.g., GloVe) are in some cases competitive with context-dependent representations from modern LLMs (e.g., BERT), yet equally brittle to syntax-preserving manipulations. Emergent syntactic representations in neural networks are brittle, and our work thus draws attention to the risk of equating such structures with those that are the object of a long-standing debate in linguistics.
    Leveraging Domain Features for Detecting Adversarial Attacks Against Deep Speech Recognition in Noise. (arXiv:2211.01621v1 [eess.AS])
    In recent years, significant progress has been made in deep model-based automatic speech recognition (ASR), leading to its widespread deployment in the real world. At the same time, adversarial attacks against deep ASR systems are highly successful. Various methods have been proposed to defend ASR systems from these attacks. However, existing classification-based methods focus on the design of deep learning models while leaving domain-specific features largely unexplored. This work leverages filter bank-based features to better capture the characteristics of attacks for improved detection. Furthermore, the paper analyses the potential of using the speech and non-speech parts separately in detecting adversarial attacks. Finally, considering adverse environments where ASR systems may be deployed, we study the impact of acoustic noise of various types and signal-to-noise ratios. Extensive experiments show that the inverse filter bank features generally perform better in both clean and noisy environments, that detection is effective using either the speech or the non-speech part, and that acoustic noise can largely degrade detection performance.
    Cutting Through the Noise: An Empirical Comparison of Psychoacoustic and Envelope-based Features for Machinery Fault Detection. (arXiv:2211.01704v1 [eess.SP])
    Acoustic-based fault detection has a high potential to monitor the health condition of mechanical parts. However, the background noise of an industrial environment may negatively influence the performance of fault detection. Limited attention has been paid to improving the robustness of fault detection against industrial environmental noise. Therefore, we present the Lenze production background-noise (LPBN) real-world dataset and an automated and noise-robust auditory inspection (ARAI) system for the end-of-line inspection of geared motors. An acoustic array is used to acquire data from motors with a minor fault, major fault, or which are healthy. A benchmark is provided to compare the psychoacoustic features with different types of envelope features based on expert knowledge of the gearbox. To the best of our knowledge, we are the first to apply time-varying psychoacoustic features for fault detection. We train a state-of-the-art one-class classifier on samples from healthy motors and separate the faulty ones for fault detection using a threshold. The best-performing approaches achieve an area under the curve of 0.87 (logarithm envelope), 0.86 (time-varying psychoacoustics), and 0.91 (combination of both).
    FedToken: Tokenized Incentives for Data Contribution in Federated Learning. (arXiv:2209.09775v2 [cs.LG] UPDATED)
    Incentives that compensate for the involved costs in the decentralized training of a Federated Learning (FL) model act as a key stimulus for clients' long-term participation. However, it is challenging to convince clients for quality participation in FL due to the absence of: (i) full information on the client's data quality and properties; (ii) the value of client's data contributions; and (iii) the trusted mechanism for monetary incentive offers. This often leads to poor efficiency in training and communication. While several works focus on strategic incentive designs and client selection to overcome this problem, there is a major knowledge gap in terms of an overall design tailored to the foreseen digital economy, including Web 3.0, while simultaneously meeting the learning objectives. To address this gap, we propose a contribution-based tokenized incentive scheme, namely \texttt{FedToken}, backed by blockchain technology that ensures fair allocation of tokens amongst the clients that corresponds to the valuation of their data during model training. Leveraging the engineered Shapley-based scheme, we first approximate the contribution of local models during model aggregation, then strategically schedule clients lowering the communication rounds for convergence and anchor ways to allocate \emph{affordable} tokens under a constrained monetary budget. Extensive simulations demonstrate the efficacy of our proposed method.
    Self-supervised learning for robust voice cloning. (arXiv:2204.03421v2 [cs.SD] UPDATED)
    Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.
    An Empirical Bayes Analysis of Vehicle Trajectory Models. (arXiv:2211.01696v1 [cs.LG])
    We present an in-depth empirical analysis of the trade-off between model complexity and representation error in modelling vehicle trajectories. Analyzing several large public datasets, we show that simple linear models do represent real-world trajectories with high fidelity over relevant time scales at very moderate model complexity. This finding allows the formulation of trajectory tracking and prediction as a Bayesian filtering problem. Using an empirical Bayes approach, we estimate prior distributions over model parameters from the data; these priors inform the motion models needed for trajectory tracking and can help regularize prediction models. We argue for the use of linear models in trajectory prediction tasks, as their representation error is much smaller than the typical epistemic uncertainty in this task.
    Self Similarity Matrix based CNN Filter Pruning. (arXiv:2211.01814v1 [cs.LG])
    In recent years, many deep learning solutions have been targeted for deployment on mobile devices, making the development of lightweight models all the more pressing. An alternative is to optimize and prune regular deep learning models. In this paper, we tackle the problem of CNN model pruning with the help of a Self-Similarity Matrix (SSM) computed from the 2D CNN filters. We propose two novel algorithms to rank and prune redundant filters that contribute similar activation maps to the output. A key feature of our method is that no fine-tuning is needed after training: the training and pruning processes are completed simultaneously. We benchmark our method on two of the most popular CNN models -- ResNet and VGG -- and report their performance on the CIFAR-10 dataset.
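    One plausible reading of similarity-based filter pruning is sketched below: build a self-similarity matrix from cosine similarities of flattened conv filters and drop the most redundant ones. The redundancy score and the keep-count are illustrative guesses at the flavor of the approach, not the paper's exact ranking algorithms.

        # Hedged sketch: cosine self-similarity matrix over conv filters,
        # pruning the filters most similar to the rest.
        import torch
        import torch.nn as nn

        conv = nn.Conv2d(3, 16, 3)
        W = conv.weight.detach().reshape(16, -1)         # one row per filter
        Wn = W / W.norm(dim=1, keepdim=True)
        ssm = Wn @ Wn.T                                  # self-similarity matrix

        redundancy = (ssm - torch.eye(16)).abs().sum(1)  # similarity to others
        keep = torch.argsort(redundancy)[:12]            # 12 least redundant
        pruned = nn.Conv2d(3, 12, 3)
        pruned.weight.data = conv.weight.data[keep]
        pruned.bias.data = conv.bias.data[keep]
        print("kept filters:", keep.tolist())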
    Analyzing Sharpness along GD Trajectory: Progressive Sharpening and Edge of Stability. (arXiv:2207.12678v2 [cs.LG] UPDATED)
    Recent findings (e.g., arXiv:2103.00065) demonstrate that modern neural networks trained by full-batch gradient descent typically enter a regime called Edge of Stability (EOS). In this regime, the sharpness, i.e., the maximum Hessian eigenvalue, first increases to the value 2/(step size) (the progressive sharpening phase) and then oscillates around this value (the EOS phase). This paper aims to analyze the GD dynamics and the sharpness along the optimization trajectory. Our analysis naturally divides the GD trajectory into four phases depending on the change of the sharpness. We empirically identify the norm of output layer weight as an interesting indicator of sharpness dynamics. Based on this empirical observation, we attempt to theoretically and empirically explain the dynamics of various key quantities that lead to the change of sharpness in each phase of EOS. Moreover, based on certain assumptions, we provide a theoretical proof of the sharpness behavior in EOS regime in two-layer fully-connected linear neural networks. We also discuss some other empirical findings and the limitation of our theoretical results.
    A Geometric Perspective on Variational Autoencoders. (arXiv:2209.07370v2 [stat.ML] UPDATED)
    This paper introduces a new interpretation of the Variational Autoencoder framework by taking a fully geometric point of view. We argue that vanilla VAE models naturally unveil a Riemannian structure in their latent space, and that taking these geometrical aspects into account can lead to better interpolations and an improved generation procedure. The proposed sampling method consists of sampling from the uniform distribution deriving intrinsically from the learned Riemannian latent space, and we show that using this scheme can make a vanilla VAE competitive with, and even better than, more advanced versions on several benchmark datasets. Since generative models are known to be sensitive to the number of training samples, we also stress the method's robustness in the low data regime.
    Are Synthetic Control Weights Balancing Score?. (arXiv:2211.01575v1 [stat.ME])
    In this short note, I outline conditions under which conditioning on Synthetic Control (SC) weights emulates a randomized control trial where the treatment status is independent of potential outcomes. Specifically, I demonstrate that if there exist SC weights such that (i) the treatment effects are exactly identified and (ii) these weights are uniformly and cumulatively bounded, then SC weights are balancing scores.
    Extra-Newton: A First Approach to Noise-Adaptive Accelerated Second-Order Methods. (arXiv:2211.01832v1 [math.OC])
    This work proposes a universal and adaptive second-order method for minimizing second-order smooth, convex functions. Our algorithm achieves $O(\sigma / \sqrt{T})$ convergence when the oracle feedback is stochastic with variance $\sigma^2$, and improves its convergence to $O(1/T^3)$ with deterministic oracles, where $T$ is the number of iterations. Our method also interpolates between these rates without knowing the nature of the oracle a priori, which is enabled by a parameter-free adaptive step size that is oblivious to the smoothness modulus, variance bounds, and the diameter of the constrained set. To our knowledge, this is the first universal algorithm with such global guarantees within the second-order optimization literature.
    Testing the Generalization of Neural Language Models for COVID-19 Misinformation Detection. (arXiv:2111.07819v4 [cs.CL] UPDATED)
    A drastic rise in potentially life-threatening misinformation has been a by-product of the COVID-19 pandemic. Computational support to identify false information within the massive body of data on the topic is crucial to prevent harm. Researchers have proposed many methods for flagging online misinformation related to COVID-19. However, these methods predominantly target specific content types (e.g., news) or platforms (e.g., Twitter), and their ability to generalize has so far remained largely unclear. To fill this gap, we evaluate fifteen Transformer-based models on five COVID-19 misinformation datasets that include social media posts, news articles, and scientific papers. We show that tokenizers and models tailored to COVID-19 data do not provide a significant advantage over general-purpose ones. Our study provides a realistic assessment of models for detecting COVID-19 misinformation. We expect that evaluating a broad spectrum of datasets and models will benefit future research in developing misinformation detection systems.
    Semiparametric Best Arm Identification with Contextual Information. (arXiv:2209.07330v3 [cs.LG] UPDATED)
    We study best-arm identification with a fixed budget and contextual (covariate) information in stochastic multi-armed bandit problems. In each round, after observing contextual information, we choose a treatment arm using past observations and current context. Our goal is to identify the best treatment arm, a treatment arm with the maximal expected reward marginalized over the contextual distribution, with a minimal probability of misidentification. First, we derive semiparametric lower bounds of the misidentification probability for this problem, where we regard the gaps between the expected rewards of the best and suboptimal treatment arms as parameters of interest, and all other parameters, such as the expected rewards conditioned on contexts, as the nuisance parameters. We then develop the "Contextual RS-AIPW strategy," which consists of the random sampling (RS) rule tracking a target allocation ratio and the recommendation rule using the augmented inverse probability weighting (AIPW) estimator. Our proposed Contextual RS-AIPW strategy is optimal because the upper bound for the probability of misidentification by the strategy matches the semiparametric lower bound, when the budget goes to infinity and the gaps converge to zero.
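    The AIPW estimator underlying the recommendation rule combines an inverse-propensity correction with a plug-in regression model; a minimal sketch (pi and f_hat are hypothetical stand-ins for the sampling rule and the fitted conditional-reward model):

    import numpy as np

    def aipw_value(a, actions, rewards, contexts, pi, f_hat):
        # estimate of arm a's expected reward, marginalized over contexts
        terms = [
            (actions[t] == a) / pi(a, contexts[t]) * (rewards[t] - f_hat(a, contexts[t]))
            + f_hat(a, contexts[t])
            for t in range(len(rewards))
        ]
        return float(np.mean(terms))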
    Privacy-preserving Non-negative Matrix Factorization with Outliers. (arXiv:2211.01451v1 [cs.LG])
    Non-negative matrix factorization is a popular unsupervised machine learning algorithm for extracting meaningful features from data which are inherently non-negative. However, such data sets may often contain privacy-sensitive user data, and therefore, we may need to take necessary steps to ensure the privacy of the users while analyzing the data. In this work, we focus on developing a non-negative matrix factorization algorithm in the privacy-preserving framework. More specifically, we propose a novel privacy-preserving algorithm for non-negative matrix factorization capable of operating on private data, while achieving results comparable to those of the non-private algorithm. We design the framework such that one can control the degree of privacy guarantee based on the utility gap. We show our proposed framework's performance on six real data sets. The experimental results show that our proposed method achieves performance very close to that of the non-private algorithm under some parameter regimes, while ensuring strict privacy.
    Convergence in KL Divergence of the Inexact Langevin Algorithm with Application to Score-based Generative Models. (arXiv:2211.01512v1 [cs.LG])
    We study the Inexact Langevin Algorithm (ILA) for sampling using an estimated score function when the target distribution satisfies a log-Sobolev inequality (LSI), motivated by Score-based Generative Modeling (SGM). We prove long-term convergence in Kullback-Leibler (KL) divergence under the sufficient assumption that the error of the score estimator has a bounded Moment Generating Function (MGF). Our assumption is weaker than $L^\infty$ (which is too strong to hold in practice) and stronger than the $L^2$ error assumption, which we show is not sufficient to guarantee convergence in general. Under the $L^\infty$ error assumption, we additionally prove convergence in Rényi divergence, which is stronger than KL divergence. We then study how to obtain a provably accurate score estimator satisfying the bounded-MGF assumption for LSI target distributions, using an estimator based on kernel density estimation. Together with the convergence results, this yields the first end-to-end convergence guarantee for ILA at the population level. Finally, we generalize our convergence analysis to SGM and derive a complexity guarantee in KL divergence for data satisfying LSI under an MGF-accurate score estimator.
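    The algorithm itself is the unadjusted Langevin update with the learned score plugged in; a minimal sketch (here with the exact score of a standard Gaussian, score(x) = -x, as a toy stand-in for a learned estimator):

    import numpy as np

    def inexact_langevin(score_hat, x0, step, n_iters, seed=0):
        # x <- x + step * s_hat(x) + sqrt(2 * step) * N(0, I)
        rng = np.random.default_rng(seed)
        x = np.array(x0, dtype=float)
        for _ in range(n_iters):
            x = x + step * score_hat(x) + np.sqrt(2 * step) * rng.standard_normal(x.shape)
        return x

    sample = inexact_langevin(lambda x: -x, x0=5.0 * np.ones(2), step=0.01, n_iters=2000)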
    Liability regimes in the age of AI: a use-case driven analysis of the burden of proof. (arXiv:2211.01817v1 [cs.AI])
    New emerging technologies powered by Artificial Intelligence (AI) have the potential to disruptively transform our societies for the better. In particular, data-driven learning approaches (i.e., Machine Learning (ML)) have been a true revolution in the advancement of multiple technologies in various application domains. At the same time, however, there are growing concerns about certain intrinsic characteristics of these methodologies that carry potential risks to both safety and fundamental rights. Although there are mechanisms in the adoption process to minimize these risks (e.g., safety regulations), these do not exclude the possibility of harm occurring, and if this happens, victims should be able to seek compensation. Liability regimes will therefore play a key role in ensuring basic protection for victims using or interacting with these systems. However, the same characteristics that make AI systems inherently risky, such as lack of causality, opacity, unpredictability, or their self-learning and continuous-learning capabilities, lead to considerable difficulties when it comes to proving causation. This paper presents three case studies, as well as the methodology used to reach them, that illustrate these difficulties. Specifically, we address the cases of cleaning robots, delivery drones and robots in education. The outcome of the proposed analysis suggests the need to revise liability regimes to alleviate the burden of proof on victims in cases involving AI technologies.
    Convex Clustering through MM: An Efficient Algorithm to Perform Hierarchical Clustering. (arXiv:2211.01877v1 [stat.ML])
    Convex clustering is a modern method with both hierarchical and $k$-means clustering characteristics. Although convex clustering can capture the complex clustering structure hidden in data, the existing convex clustering algorithms do not scale to large data sets with sample sizes greater than ten thousand. Moreover, it is known that convex clustering sometimes fails to produce hierarchical clustering structures. This undesirable phenomenon is called cluster split and makes clustering results difficult to interpret. In this paper, we propose convex clustering through majorization-minimization (CCMM) -- an iterative algorithm that uses cluster fusions and sparsity to enforce a complete cluster hierarchy with reduced memory usage. In the CCMM algorithm, the diagonal majorization technique yields a highly efficient update in each iteration. On a current desktop computer, the CCMM algorithm can solve a single clustering problem featuring over one million objects in seven-dimensional space within 70 seconds.
    DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture. (arXiv:2109.05765v2 [cs.LG] UPDATED)
    Automated machine learning (AutoML) usually involves several crucial components, such as Data Augmentation (DA) policy, Hyper-Parameter Optimization (HPO), and Neural Architecture Search (NAS). Although many strategies have been developed for automating these components in isolation, joint optimization of these components remains challenging due to the largely increased search dimension and the varying input types of each component. In parallel to this, the common practice in NAS of searching for the optimal architecture first and then retraining it before deployment often suffers from low performance correlation between the searching and retraining stages. An end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search is desirable. In view of this, we propose DHA, which achieves joint optimization of Data augmentation policy, Hyper-parameter and Architecture. Specifically, end-to-end NAS is achieved in a differentiable manner by optimizing a compressed lower-dimensional feature space, while DA policy and HPO are regarded as dynamic schedulers, which adapt themselves to the updates of network parameters and network architecture at the same time. Experiments show that DHA achieves state-of-the-art (SOTA) results on various datasets and search spaces. To the best of our knowledge, we are the first to efficiently and jointly optimize DA policy, NAS, and HPO in an end-to-end manner without retraining.
    Neural Topic Modeling of Psychotherapy Sessions. (arXiv:2204.10189v2 [cs.CL] UPDATED)
    In this work, we compare different neural topic modeling methods in learning the topical propensities of different psychiatric conditions from psychotherapy session transcripts parsed from speech recordings. We also incorporate temporal modeling to put this additional interpretability into action by parsing out topic similarities as a time series at a turn-level resolution. We believe this topic modeling framework can offer interpretable insights that help the therapist decide on an optimal strategy and improve psychotherapy effectiveness.
    Phy-Taylor: Physics-Model-Based Deep Neural Networks. (arXiv:2209.13511v2 [cs.LG] UPDATED)
    Purely data-driven deep neural networks (DNNs) applied to physical engineering systems can infer relations that violate physics laws, thus leading to unexpected consequences. To address this challenge, we propose a physics-model-based DNN framework, called Phy-Taylor, that accelerates learning compliant representations with physical knowledge. The Phy-Taylor framework makes two key contributions: it introduces a new architectural Physics-compatible neural network (PhN), and features a novel compliance mechanism that we call Physics-guided Neural Network Editing. The PhN aims to directly capture nonlinearities inspired by physical quantities, such as kinetic energy, potential energy, electrical power, and aerodynamic drag force. To do so, the PhN augments neural network layers with two key components: (i) monomials of the Taylor series expansion of nonlinear functions capturing physical knowledge, and (ii) a suppressor for mitigating the influence of noise. The neural-network editing mechanism further modifies network links and activation functions consistently with physical knowledge. As an extension, we also propose a self-correcting Phy-Taylor framework that introduces two additional capabilities: (i) physics-model-based safety relationship learning, and (ii) automatic output correction when violations of safety occur. Through experiments, we show that (by expressing hard-to-learn nonlinearities directly and by constraining dependencies) Phy-Taylor features considerably fewer parameters and a remarkably accelerated training process, while offering enhanced model robustness and accuracy.
    A Learning-Theoretic Framework for Certified Auditing with Explanations. (arXiv:2206.04740v2 [cs.LG] UPDATED)
    Responsible use of machine learning requires models be audited for undesirable properties. While a number of auditing algorithms have been proposed in prior work, how to do principled auditing in a general setting has remained ill-understood. This work proposes a formal learning-theoretic framework for auditing, and uses it to investigate if and how model explanations can help audits. Specifically, we propose algorithms for auditing linear classifiers for feature sensitivity using label queries as well as two kinds of explanations, and provide performance guarantees. Our results illustrate that while counterfactual explanations can be extremely helpful for auditing, anchor explanations may not be as beneficial in the worst case.
    Manifold Interpolating Optimal-Transport Flows for Trajectory Inference. (arXiv:2206.14928v2 [cs.LG] UPDATED)
    We present a method called Manifold Interpolating Optimal-Transport Flow (MIOFlow) that learns stochastic, continuous population dynamics from static snapshot samples taken at sporadic timepoints. MIOFlow combines dynamic models, manifold learning, and optimal transport by training neural ordinary differential equations (Neural ODE) to interpolate between static population snapshots as penalized by optimal transport with manifold ground distance. Further, we ensure that the flow follows the geometry by operating in the latent space of an autoencoder that we call a geodesic autoencoder (GAE). In GAE the latent space distance between points is regularized to match a novel multiscale geodesic distance on the data manifold that we define. We show that this method is superior to normalizing flows, Schrödinger bridges and other generative models that are designed to flow from noise to data in terms of interpolating between populations. Theoretically, we link these trajectories with dynamic optimal transport. We evaluate our method on simulated data with bifurcations and merges, as well as scRNA-seq data from embryoid body differentiation, and acute myeloid leukemia treatment.
    OLLA: Optimizing the Lifetime and Location of Arrays to Reduce the Memory Usage of Neural Networks. (arXiv:2210.12924v2 [cs.LG] UPDATED)
    The size of deep neural networks has grown exponentially in recent years. Unfortunately, hardware devices have not kept pace with the rapidly increasing memory requirements. To cope with this, researchers have turned to techniques such as spilling and recomputation, which increase training time, or reduced precision and model pruning, which can affect model accuracy. We present OLLA, an algorithm that optimizes the lifetime and memory location of the tensors used to train neural networks. Our method reduces the memory usage of existing neural networks without needing any modification to the models or their training procedures. We formulate the problem as a joint integer linear program (ILP). We present several techniques to simplify the encoding of the problem, and enable our approach to scale to the size of state-of-the-art neural networks using an off-the-shelf ILP solver. We experimentally demonstrate that OLLA takes only minutes, if not seconds, to enable the training of neural networks using one-third less memory on average.
    Benefits of Monotonicity in Safe Exploration with Gaussian Processes. (arXiv:2211.01561v1 [stat.ML])
    We consider the problem of sequentially maximising an unknown function over a set of actions while ensuring that every sampled point has a function value below a given safety threshold. We model the function using kernel-based and Gaussian process methods, while differing from previous works in our assumption that the function is monotonically increasing with respect to a safety variable. This assumption is motivated by various practical applications such as adaptive clinical trial design and robotics. Taking inspiration from the GP-UCB and SafeOpt algorithms, we propose an algorithm, monotone safe UCB (M-SafeUCB) for this task. We show that M-SafeUCB enjoys theoretical guarantees in terms of safety, a suitably-defined regret notion, and approximately finding the entire safe boundary. In addition, we illustrate that the monotonicity assumption yields significant benefits in terms of both the guarantees obtained and the algorithmic simplicity. We support our theoretical findings by performing empirical evaluations on a variety of functions.
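    A loose sketch of the safe-UCB selection step (not the paper's exact M-SafeUCB rule): fit a GP to the observations, then pick the most promising point among those whose optimistic estimate still respects the safety threshold:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    def safe_ucb_step(gp, candidates, threshold, beta=2.0):
        mu, std = gp.predict(candidates, return_std=True)
        ucb = mu + beta * std
        safe = ucb <= threshold                # conservative safe set
        if not safe.any():
            return candidates[np.argmin(ucb)]  # fall back to the safest-looking point
        idx = np.flatnonzero(safe)
        return candidates[idx[np.argmax(ucb[idx])]]

    rng = np.random.default_rng(0)
    X = rng.uniform(0, 1, (10, 1)); y = np.sin(3 * X[:, 0])
    gp = GaussianProcessRegressor().fit(X, y)
    next_x = safe_ucb_step(gp, np.linspace(0, 1, 200)[:, None], threshold=0.9)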
    Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model. (arXiv:2211.02001v1 [cs.LG])
    Progress in machine learning (ML) comes with a cost to the environment, given that training ML models requires significant computational resources, energy and materials. In the present article, we aim to quantify the carbon footprint of BLOOM, a 176-billion parameter language model, across its life cycle. We estimate that BLOOM's final training emitted approximately 24.7 tonnes of CO2 equivalent if we consider only the dynamic power consumption, and 50.5 tonnes if we account for all processes ranging from equipment manufacturing to energy-based operational consumption. We also study the energy requirements and carbon emissions of its deployment for inference via an API endpoint receiving user queries in real time. We conclude with a discussion regarding the difficulty of precisely estimating the carbon footprint of ML models and future research directions that can contribute towards improving carbon emissions reporting.
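    The core accounting is simple arithmetic: emissions = energy drawn * grid carbon intensity. A back-of-the-envelope sketch with illustrative numbers (not the paper's inventory):

    gpu_hours = 1_000_000          # hypothetical total GPU-hours of the run
    power_kw = 0.4                 # hypothetical average draw per GPU, in kW
    pue = 1.2                      # data-center power usage effectiveness
    intensity = 0.057              # hypothetical grid intensity, kg CO2eq per kWh
    energy_kwh = gpu_hours * power_kw * pue
    print(f"{energy_kwh * intensity / 1000:.1f} tonnes CO2eq (dynamic consumption only)")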
    Learners' Languages. (arXiv:2103.01189v2 [math.CT] UPDATED)
    In "Backprop as functor", the authors show that the fundamental elements of deep learning -- gradient descent and backpropagation -- can be conceptualized as a strong monoidal functor Para(Euc)$\to$Learn from the category of parameterized Euclidean spaces to that of learners, a category developed explicitly to capture parameter update and backpropagation. It was soon realized that there is an isomorphism Learn$\cong$Para(Slens), where Slens is the symmetric monoidal category of simple lenses as used in functional programming. In this note, we observe that Slens is a full subcategory of Poly, the category of polynomial functors in one variable, via the functor $A\mapsto Ay^A$. Using the fact that (Poly,$\otimes$) is monoidal closed, we show that a map $A\to B$ in Para(Slens) has a natural interpretation in terms of dynamical systems (more precisely, generalized Moore machines) whose interface is the internal-hom type $[Ay^A,By^B]$. Finally, we review the fact that the category p-Coalg of dynamical systems on any $p \in$ Poly forms a topos, and consider the logical propositions that can be stated in its internal language. We give gradient descent as an example, and we conclude by discussing some directions for future work.
    Online Resource Allocation under Horizon Uncertainty. (arXiv:2206.13606v2 [cs.DS] UPDATED)
    We study stochastic online resource allocation: a decision maker needs to allocate limited resources to stochastically-generated sequentially-arriving requests in order to maximize reward. At each time step, requests are drawn independently from a distribution that is unknown to the decision maker. Online resource allocation and its special cases have been studied extensively in the past, but prior results crucially and universally rely on the strong assumption that the total number of requests (the horizon) is known to the decision maker in advance. In many applications, such as revenue management and online advertising, the number of requests can vary widely because of fluctuations in demand or user traffic intensity. In this work, we develop online algorithms that are robust to horizon uncertainty. In sharp contrast to the known-horizon setting, no algorithm can achieve even a constant asymptotic competitive ratio that is independent of the horizon uncertainty. We introduce a novel generalization of dual mirror descent which allows the decision maker to specify a schedule of time-varying target consumption rates, and prove corresponding performance guarantees. We go on to give a fast algorithm for computing a schedule of target consumption rates that leads to near-optimal performance in the unknown-horizon setting. In particular, our competitive ratio attains the optimal rate of growth (up to logarithmic factors) as the horizon uncertainty grows large. Finally, we also provide a way to incorporate machine-learned predictions about the horizon which interpolates between the known and unknown horizon settings.
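    The flavor of the dual-mirror-descent machinery can be conveyed in a few lines; a toy single-resource sketch with a schedule rho of target consumption rates (a hypothetical simplification of the paper's algorithm):

    import numpy as np

    def dual_md_allocate(requests, budget, rho, eta=0.05):
        # requests: list of (reward, cost); accept when reward beats the dual price
        lam, remaining, total = 0.0, float(budget), 0.0
        for t, (r, b) in enumerate(requests):
            accept = (r - lam * b > 0) and (remaining >= b)
            if accept:
                remaining -= b
                total += r
            consumed = b if accept else 0.0
            lam = max(0.0, lam + eta * (consumed - rho[t]))  # track the target rate
        return total

    print(dual_md_allocate([(1.0, 1.0), (0.3, 1.0), (0.9, 1.0)], budget=2, rho=[2 / 3] * 3))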
    Quantifying Model Uncertainty for Semantic Segmentation using Operators in the RKHS. (arXiv:2211.01999v1 [cs.CV])
    Deep learning models for semantic segmentation are prone to poor performance in real-world applications due to the highly challenging nature of the task. Model uncertainty quantification (UQ) is one way to address this issue of lack of model trustworthiness by enabling the practitioner to know how much to trust a segmentation output. Current UQ methods in this application domain are mainly restricted to Bayesian based methods which are computationally expensive and are only able to extract central moments of uncertainty thereby limiting the quality of their uncertainty estimates. We present a simple framework for high-resolution predictive uncertainty quantification of semantic segmentation models that leverages a multi-moment functional definition of uncertainty associated with the model's feature space in the reproducing kernel Hilbert space (RKHS). The multiple uncertainty functionals extracted from this framework are defined by the local density dynamics of the model's feature space and hence automatically align themselves at the tail-regions of the intrinsic probability density function of the feature space (where uncertainty is the highest) in such a way that the successively higher order moments quantify the more uncertain regions. This leads to a significantly more accurate view of model uncertainty than conventional Bayesian methods. Moreover, the extraction of such moments is done in a single-shot computation making it much faster than Bayesian and ensemble approaches (that involve a high number of forward stochastic passes of the model to quantify its uncertainty). We demonstrate these advantages through experimental evaluations of our framework implemented over four different state-of-the-art model architectures that are trained and evaluated on two benchmark road-scene segmentation datasets (Camvid and Cityscapes).
    Dynamic Causal Effects Evaluation in A/B Testing with a Reinforcement Learning Framework. (arXiv:2002.01711v6 [cs.LG] UPDATED)
    A/B testing, or online experimentation, is a standard business strategy for comparing a new product with an old one in pharmaceutical, technological, and traditional industries. Major challenges arise in online experiments on two-sided marketplace platforms (e.g., Uber), where there is only one unit that receives a sequence of treatments over time. In those experiments, the treatment at a given time impacts current outcomes as well as future outcomes. The aim of this paper is to introduce a reinforcement learning framework for carrying out A/B testing in these experiments, while characterizing the long-term treatment effects. Our proposed testing procedure allows for sequential monitoring and online updating. It is generally applicable to a variety of treatment designs in different industries. In addition, we systematically investigate the theoretical properties (e.g., size and power) of our testing procedure. Finally, we apply our framework to both simulated data and a real-world data example obtained from a technological company to illustrate its advantage over the current practice. A Python implementation of our test is available at https://github.com/callmespring/CausalRL.
    Large Language Models Are Human-Level Prompt Engineers. (arXiv:2211.01910v1 [cs.LG])
    By conditioning on natural language instructions, large language models (LLMs) have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.
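    At its core, APE is a search loop: sample instruction candidates from an LLM, score each by how well a second LLM performs when following it, and keep the best. A minimal sketch with hypothetical stand-ins for the LLM calls:

    def ape_select(propose_instruction, score_instruction, n_candidates=50):
        # propose_instruction(): samples a candidate instruction from an LLM
        # score_instruction(inst): zero-shot score of an LLM following inst on a dev set
        candidates = [propose_instruction() for _ in range(n_candidates)]
        return max(candidates, key=score_instruction)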
    Convergence Rates for Learning Linear Operators from Noisy Data. (arXiv:2108.12515v3 [math.ST] UPDATED)
    This paper studies the learning of linear operators between infinite-dimensional Hilbert spaces. The training data comprises pairs of random input vectors in a Hilbert space and their noisy images under an unknown self-adjoint linear operator. Assuming that the operator is diagonalizable in a known basis, this work solves the equivalent inverse problem of estimating the operator's eigenvalues given the data. Adopting a Bayesian approach, the theoretical analysis establishes posterior contraction rates in the infinite data limit with Gaussian priors that are not directly linked to the forward map of the inverse problem. The main results also include learning-theoretic generalization error guarantees for a wide range of distribution shifts. These convergence rates quantify the effects of data smoothness and true eigenvalue decay or growth, for compact or unbounded operators, respectively, on sample complexity. Numerical evidence supports the theory in diagonal and non-diagonal settings.
    Functorial Manifold Learning. (arXiv:2011.07435v6 [cs.LG] UPDATED)
    We adapt previous research on category theory and topological unsupervised learning to develop a functorial perspective on manifold learning, also known as nonlinear dimensionality reduction. We first characterize manifold learning algorithms as functors that map pseudometric spaces to optimization objectives and that factor through hierarchical clustering functors. We then use this characterization to prove refinement bounds on manifold learning loss functions and construct a hierarchy of manifold learning algorithms based on their equivariants. We express several popular manifold learning algorithms as functors at different levels of this hierarchy, including Metric Multidimensional Scaling, IsoMap, and UMAP. Next, we use interleaving distance to study the stability of a broad class of manifold learning algorithms. We present bounds on how closely the embeddings these algorithms produce from noisy data approximate the embeddings they would learn from noiseless data. Finally, we use our framework to derive a set of novel manifold learning algorithms, which we experimentally demonstrate are competitive with the state of the art.
    Study of the performance and scalability of federated learning for medical imaging with intermittent clients. (arXiv:2207.08581v3 [cs.LG] UPDATED)
    Federated learning is a data-decentralized, privacy-preserving technique used to perform machine or deep learning in a secure way. In this paper we present theoretical aspects of federated learning, such as the presentation of an aggregation operator, different types of federated learning, and issues to be taken into account regarding the distribution of data across clients, together with an exhaustive analysis of a use case where the number of clients varies. Specifically, a use case of medical image analysis is proposed, using chest X-ray images obtained from an open data repository. In addition to the advantages related to privacy, improvements in predictions (in terms of accuracy, loss and area under the curve) and reductions in execution time will be studied with respect to the classical (centralized) approach. Different clients are simulated from the training data, selected in an unbalanced manner. The results of considering three or ten clients are presented and compared with each other and against the centralized case. Two different problems related to intermittent clients are discussed, together with two approaches to be followed for each of them: such problems may occur because, in a real scenario, some clients may leave the training while others join it, or because of client technical or connectivity problems. Finally, improvements and future work in the field are proposed.
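    A standard aggregation operator in this setting is a weighted parameter average across clients; a minimal sketch in PyTorch (a generic FedAvg-style operator, not necessarily the paper's exact one):

    import torch

    def federated_average(client_states, client_sizes):
        # client_states: list of model state_dicts; client_sizes: samples per client
        total = float(sum(client_sizes))
        return {
            name: sum(n * s[name] for n, s in zip(client_sizes, client_states)) / total
            for name in client_states[0]
        }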
    A coherence parameter characterizing generative compressed sensing with Fourier measurements. (arXiv:2207.09340v4 [cs.IT] UPDATED)
    In Bora et al. (2017), a mathematical framework was developed for compressed sensing guarantees in the setting where the measurement matrix is Gaussian and the signal structure is the range of a generative neural network (GNN). The problem of compressed sensing with GNNs has since been extensively analyzed when the measurement matrix and/or network weights follow a subgaussian distribution. We move beyond the subgaussian assumption, to measurement matrices that are derived by sampling uniformly at random rows of a unitary matrix (including subsampled Fourier measurements as a special case). Specifically, we prove the first known restricted isometry guarantee for generative compressed sensing with subsampled isometries, and provide recovery bounds with nearly order-optimal sample complexity, addressing an open problem of Scarlett et al. (2022, p. 10). Recovery efficacy is characterized by the coherence, a new parameter, which measures the interplay between the range of the network and the measurement matrix. Our approach relies on subspace counting arguments and ideas central to high-dimensional probability. Furthermore, we propose a regularization strategy for training GNNs to have favourable coherence with the measurement operator. We provide compelling numerical simulations that support this regularized training strategy: our strategy yields low coherence networks that require fewer measurements for signal recovery. This, together with our theoretical results, supports coherence as a natural quantity for characterizing generative compressed sensing with subsampled isometries.
    Investigating the robustness of a learning-based method for quantitative phase retrieval from propagation-based x-ray phase contrast measurements under laboratory conditions. (arXiv:2211.01372v1 [physics.med-ph])
    Quantitative phase retrieval (QPR) in propagation-based x-ray phase contrast imaging of heterogeneous and structurally complicated objects is challenging under laboratory conditions due to partial spatial coherence and polychromaticity. A learning-based method (LBM) provides a non-linear approach to this problem while not being constrained by restrictive assumptions about object properties and beam coherence. In this work, an LBM was assessed for its applicability under practical scenarios by evaluating its robustness and generalizability under typical experimental variations. Towards this end, an end-to-end LBM was employed for QPR under laboratory conditions and its robustness was investigated across various system and object conditions. The robustness of the method was tested via varying propagation distances, and its generalizability with respect to object structure and experimental data was also tested. Although the LBM was stable under the studied variations, its successful deployment was found to be affected by choices pertaining to data pre-processing, network training considerations and system modeling. To our knowledge, we demonstrate for the first time the potential applicability of an end-to-end learning-based quantitative phase retrieval method, trained on simulated data, to experimental propagation-based x-ray phase contrast measurements acquired under laboratory conditions. We considered conditions of polychromaticity, partial spatial coherence, and high noise levels, typical of laboratory settings. This work further explored the robustness of this method to practical variations in propagation distances and object structure with the goal of assessing its potential for experimental use. Such an exploration of any LBM (irrespective of its network architecture) before practical deployment provides an understanding of its potential behavior under experimental settings.
    Learning to Configure Computer Networks with Neural Algorithmic Reasoning. (arXiv:2211.01980v1 [cs.NI])
    We present a new method for scaling automatic configuration of computer networks. The key idea is to relax the computationally hard search problem of finding a configuration that satisfies a given specification into an approximate objective amenable to learning-based techniques. Based on this idea, we train a neural algorithmic model which learns to generate configurations likely to (fully or partially) satisfy a given specification under existing routing protocols. By relaxing the rigid satisfaction guarantees, our approach (i) enables greater flexibility: it is protocol-agnostic, enables cross-protocol reasoning, and does not depend on hardcoded rules; and (ii) finds configurations for much larger computer networks than previously possible. Our learned synthesizer is up to 490x faster than state-of-the-art SMT-based methods, while producing configurations which on average satisfy more than 93% of the provided requirements.
    StereoSpike: Depth Learning with a Spiking Neural Network. (arXiv:2109.13751v3 [cs.CV] UPDATED)
    Depth estimation is an important computer vision task, useful in particular for navigation in autonomous vehicles, or for object manipulation in robotics. Here we solved it using an end-to-end neuromorphic approach, combining two event-based cameras and a Spiking Neural Network (SNN) with a slightly modified U-Net-like encoder-decoder architecture, that we named StereoSpike. More specifically, we used the Multi Vehicle Stereo Event Camera Dataset (MVSEC). It provides a depth ground-truth, which was used to train StereoSpike in a supervised manner, using surrogate gradient descent. We propose a novel readout paradigm to obtain a dense analog prediction -- the depth of each pixel -- from the spikes of the decoder. We demonstrate that this architecture generalizes very well, even better than its non-spiking counterparts, leading to state-of-the-art test accuracy. To the best of our knowledge, it is the first time that such a large-scale regression problem is solved by a fully spiking network. Finally, we show that low firing rates (<10%) can be obtained via regularization, with a minimal cost in accuracy. This means that StereoSpike could be efficiently implemented on neuromorphic chips, opening the door for low power and real time embedded systems.
    Learning Control by Iterative Inversion. (arXiv:2211.01724v1 [cs.LG])
    We formulate learning for control as an $\textit{inverse problem}$ -- inverting a dynamical system to give the actions which yield desired behavior. The key challenge in this formulation is a $\textit{distribution shift}$ -- the learning agent only observes the forward mapping (its actions' consequences) on trajectories that it can execute, yet must learn the inverse mapping for input-output pairs that correspond to a different, desired behavior. We propose a general recipe for inverse problems with a distribution shift that we term $\textit{iterative inversion}$ -- learn the inverse mapping under the current input distribution (policy), then use it on the desired output samples to obtain new inputs, and repeat. As we show, iterative inversion can converge to the desired inverse mapping, but under rather strict conditions on the mapping itself. We next apply iterative inversion to learn control. Our input is a set of demonstrations of desired behavior, given as video embeddings of trajectories, and our method iteratively learns to imitate trajectories generated by the current policy, perturbed by random exploration noise. We find that constantly adding the demonstrated trajectory embeddings $\textit{as input}$ to the policy when generating trajectories to imitate, à la iterative inversion, steers the learning towards the desired trajectory distribution. To the best of our knowledge, this is the first exploration of learning control from the viewpoint of inverse problems, and our main advantage is simplicity -- we do not require rewards, and only employ supervised learning, which easily scales to state-of-the-art trajectory embedding techniques and policy representations. With a VQ-VAE embedding, and a transformer-based policy, we demonstrate non-trivial continuous control on several tasks. We also report improved performance on imitating diverse behaviors compared to reward based methods.
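    The recipe itself fits in a short loop; a sketch with hypothetical stand-ins for the rollout and the supervised inverse-model fit:

    def iterative_inversion(desired_outputs, rollout, fit_inverse, init_inputs, n_rounds=10):
        # rollout(inputs) -> outputs of the forward map (e.g., trajectory embeddings)
        # fit_inverse(pairs) -> callable mapping an output back to an input
        inputs = init_inputs
        for _ in range(n_rounds):
            outputs = rollout(inputs)                          # current input distribution
            inverse = fit_inverse(list(zip(outputs, inputs)))  # learn inverse on own data
            inputs = [inverse(y) for y in desired_outputs]     # re-query at desired outputs
        return inputs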
    Artificial Intelligence for Suicide Assessment using Audiovisual Cues: A Review. (arXiv:2201.09130v2 [cs.AI] UPDATED)
    Death by suicide is the seventh leading cause of death worldwide. The recent advancement of Artificial Intelligence (AI), specifically AI applications in image and voice processing, has created a promising opportunity to revolutionize suicide risk assessment. Subsequently, we have witnessed a fast-growing body of research that applies AI to extract audiovisual non-verbal cues for mental illness assessment. However, the majority of recent works focus on depression, despite the evident differences between the symptoms and non-verbal cues of depression and those of suicidal behavior. This paper reviews recent works that study suicide ideation and suicide behavior detection through audiovisual feature analysis, mainly the analysis of suicidal voice/speech acoustic features and suicidal visual cues. Automatic suicide assessment is a promising research direction that is still in its early stages. Accordingly, there is a lack of large datasets that can be used to train the machine learning and deep learning models that have proven effective in other, similar tasks.
    Limit theorems of Chatterjee's rank correlation. (arXiv:2204.08031v3 [math.ST] UPDATED)
    Establishing the limiting distribution of Chatterjee's rank correlation for a general, possibly non-independent, pair of random variables has been eagerly awaited by many. This paper shows that (a) Chatterjee's rank correlation is asymptotically normal as long as one variable is not a measurable function of the other, (b) the corresponding asymptotic variance is uniformly bounded by 36, and (c) a consistent variance estimator exists. Similar results also hold for Azadkia-Chatterjee's graph-based correlation coefficient, a multivariate analogue of Chatterjee's original proposal. The proof is given by appealing to the Hájek representation and Chatterjee's nearest-neighbor CLT.
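    For reference, the statistic itself is short enough to state in code: sort by X, rank the Y values, and measure how wildly consecutive ranks jump (the no-ties case of Chatterjee's formula):

    import numpy as np

    def chatterjee_xi(x, y):
        n = len(x)
        r = np.argsort(np.argsort(y[np.argsort(x)])) + 1  # ranks of y, in x-order
        return 1 - 3 * np.abs(np.diff(r)).sum() / (n * n - 1)

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)
    print(chatterjee_xi(x, x ** 2))                     # near 1: y is a function of x
    print(chatterjee_xi(x, rng.standard_normal(1000)))  # near 0: independent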
    A BERT-based Deep Learning Approach for Reputation Analysis in Social Media. (arXiv:2211.01954v1 [cs.CL])
    Social media has become an essential part of the modern lifestyle, with its usage being highly prevalent. This has resulted in unprecedented amounts of data generated by users in social media, such as users' attitudes, opinions, interests, purchases, and activities across various aspects of their lives. Therefore, in a world of social media, where power has shifted to users, actions taken by companies and public figures are subject to constant scrutiny by influential global audiences. As a result, reputation management in social media has become essential, as companies and public figures need to maintain their reputation to preserve their reputation capital. However, domain experts still lack appropriate solutions to automate reliable online reputation analysis. To tackle this challenge, we propose a novel reputation analysis approach based on the popular language model BERT (Bidirectional Encoder Representations from Transformers). The proposed approach was evaluated on the reputational polarity task using the RepLab 2013 dataset. Compared to previous works, we achieved improvements of 5.8% in accuracy, 26.9% in balanced accuracy, and 21.8% in F-score.
    FedGCN: Convergence and Communication Tradeoffs in Federated Training of Graph Convolutional Networks. (arXiv:2201.12433v5 [cs.LG] UPDATED)
    Methods for training models on graphs distributed across multiple clients have recently grown in popularity, due to the size of these graphs as well as regulations on keeping data where it is generated, like GDPR in the EU. However, a single connected graph cannot be disjointly partitioned onto multiple distributed clients due to the cross-client edges connecting graph nodes. Thus, distributed methods for training a model on a single graph incur either significant communication overhead between clients or a loss of available information to the training. We introduce the Federated Graph Convolutional Network (FedGCN) algorithm, which uses federated learning to train GCN models for semi-supervised node classification on large graphs with fast convergence and little communication. Compared to prior methods that require communication among clients at each training round, FedGCN clients only communicate with the central server in one pre-training step, greatly reducing communication costs. We theoretically analyze the tradeoff between FedGCN's convergence rate and communication cost under different data distributions and introduce a general framework that can be used for analysis of all edge-completion-based GCN training algorithms. Experimental results show that our FedGCN algorithm achieves 51.7% faster convergence on average and at least 100X less communication cost compared to prior work.
    Fault-Tolerant Federated Reinforcement Learning with Theoretical Guarantee. (arXiv:2110.14074v2 [cs.LG] UPDATED)
    The growing literature on Federated Learning (FL) has recently inspired Federated Reinforcement Learning (FRL), which encourages multiple agents to federatively build a better decision-making policy without sharing raw trajectories. Despite its promising applications, existing works on FRL fail to (I) provide theoretical analysis of its convergence, and (II) account for random system failures and adversarial attacks. Towards this end, we propose the first FRL framework whose convergence is guaranteed and which is tolerant to fewer than half of the participating agents suffering random system failures or acting as adversarial attackers. We prove that the sample efficiency of the proposed framework is guaranteed to improve with the number of agents and is able to account for such potential failures or attacks. All theoretical results are empirically verified on various RL benchmark tasks.
    Fair Data Representation for Machine Learning at the Pareto Frontier. (arXiv:2201.00292v2 [stat.ML] UPDATED)
    As machine-learning-powered decision making plays an increasingly important role in our daily lives, it is imperative to strive for fairness in the underlying data processing and algorithms. We propose a pre-processing algorithm for fair data representation via which L2-objective supervised learning algorithms result in an estimation of the Pareto frontier between prediction error and statistical disparity. In particular, the present work applies the optimal positive definite affine transport maps to approach the post-processing Wasserstein barycenter characterization of the optimal fair L2-objective supervised learning via a pre-processing data deformation. We call the resulting data the Wasserstein pseudo-barycenter. Furthermore, we show that the Wasserstein geodesics from the learning outcome marginals to the barycenter characterize the Pareto frontier between L2-loss and total Wasserstein distance among the learning outcome marginals. Thereby, an application of McCann interpolation generalizes the pseudo-barycenter to a family of data representations via which L2-objective supervised learning algorithms result in the Pareto frontier. Numerical simulations underscore the advantages of the proposed data representation: (1) the pre-processing step is composable with arbitrary L2-objective supervised learning methods and unseen data; (2) the fair representation protects data privacy by preventing the training machine from direct or indirect access to the sensitive information of the data; (3) the optimal affine map results in efficient computation of fair supervised learning on high-dimensional data; (4) experimental results shed light on the fairness of L2-objective unsupervised learning via the proposed fair data representation.
    Federated Optimization Algorithms with Random Reshuffling and Gradient Compression. (arXiv:2206.07021v2 [cs.LG] UPDATED)
    Gradient compression is a popular technique for improving the communication complexity of stochastic first-order methods in distributed training of machine learning models. However, existing works consider only with-replacement sampling of stochastic gradients. In contrast, it is well known in practice, and recently confirmed in theory, that stochastic methods based on without-replacement sampling, e.g., the Random Reshuffling (RR) method, perform better than ones that sample the gradients with replacement. In this work, we close this gap in the literature and provide the first analysis of methods with gradient compression and without-replacement sampling. We first develop a naïve combination of random reshuffling with gradient compression (Q-RR). Perhaps surprisingly, the theoretical analysis of Q-RR does not show any benefits of using RR; our extensive numerical experiments confirm this phenomenon. This happens due to the additional compression variance. To reveal the true advantages of RR in distributed learning with compression, we propose a new method called DIANA-RR that reduces the compression variance and has provably better convergence rates than existing counterparts with with-replacement sampling of stochastic gradients. Next, to better fit Federated Learning applications, we incorporate local computation: we propose and analyze variants of Q-RR and DIANA-RR -- Q-NASTYA and DIANA-NASTYA -- that use local gradient steps and different local and global stepsizes. Finally, we conduct several numerical experiments to illustrate our theoretical results.
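    The basic Q-RR loop is a without-replacement pass with an unbiased sparsifying compressor applied to each gradient; a toy single-node sketch (rand-k compression, without the DIANA-style variance-reduction shifts):

    import numpy as np

    def qrr_epoch(w, grad_fns, lr, k, rng):
        # grad_fns: per-sample gradient callables; one reshuffled pass over them
        d = w.size
        for i in rng.permutation(len(grad_fns)):
            g = grad_fns[i](w)
            mask = np.zeros(d)
            mask[rng.choice(d, k, replace=False)] = d / k   # unbiased rand-k compressor
            w = w - lr * (g * mask)
        return w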
    InsectUp: Crowdsourcing Insect Observations to Assess Demographic Shifts and Improve Classification. (arXiv:1906.11898v2 [cs.CV] CROSS LISTED)
    Insects play such a crucial role in ecosystems that a shift in demography of just a few species can have devastating consequences at environmental, social and economic levels. Despite this, evaluation of insect demography is strongly limited by the difficulty of collecting census data at sufficient scale. We propose a method to gather and leverage observations from bystanders, hikers, and entomology enthusiasts in order to provide researchers with data that could significantly help anticipate and identify environmental threats. Finally, we show that there is indeed interest on both sides for such collaboration.
    Interpretable Modeling and Reduction of Unknown Errors in Mechanistic Operators. (arXiv:2211.01373v1 [eess.IV])
    Prior knowledge about the imaging physics provides a mechanistic forward operator that plays an important role in image reconstruction, although myriad sources of possible errors in the operator could negatively impact the reconstruction solutions. In this work, we propose to embed the traditional mechanistic forward operator inside a neural function, and focus on modeling and correcting its unknown errors in an interpretable manner. This is achieved by a conditional generative model that transforms a given mechanistic operator with unknown errors, arising from a latent space of self-organizing clusters of potential sources of error generation. Once learned, the generative model can be used in place of a fixed forward operator in any traditional optimization-based reconstruction process where, together with the inverse solution, the error in prior mechanistic forward operator can be minimized and the potential source of error uncovered. We apply the presented method to the reconstruction of heart electrical potential from body surface potential. In controlled simulation experiments and in-vivo real data experiments, we demonstrate that the presented method allowed reduction of errors in the physics-based forward operator and thereby delivered inverse reconstruction of heart-surface potential with increased accuracy.
    Physics-enhanced deep surrogates for PDEs. (arXiv:2111.05841v2 [cs.LG] UPDATED)
    We present a "physics-enhanced deep surrogate" (PEDS) approach towards developing fast surrogate models for complex physical systems described by partial differential equations (PDEs) and similar models. Specifically, a unique combination of a low-fidelity, explainable physics simulator and a neural network generator is proposed, which is trained end-to-end to globally match the output of an expensive high-fidelity numerical solver. We consider low-fidelity models derived from coarser discretizations and/or by simplifying the physical equations, which are several orders of magnitude faster than a high-fidelity "brute-force" PDE solver. The neural network generates an approximate input, which is adaptively mixed with a downsampled guess and fed into the low-fidelity simulator. In this way, by incorporating the limited physical knowledge from the differentiable low-fidelity model "layer", we ensure that the conservation laws and symmetries governing the system are respected by the design of our hybrid system. Experiments on three test problems -- diffusion, reaction-diffusion, and electromagnetic scattering models -- show that a PEDS surrogate can be up to 3$\times$ more accurate than a "black-box" neural network with limited data ($\approx 10^3$ training points), and reduces the data needed by at least a factor of 100 for a target error of $5\%$, comparable to fabrication uncertainty. PEDS even appears to learn with a steeper asymptotic power law than black-box surrogates. In summary, PEDS provides a general, data-driven strategy to bridge the gap between a vast array of simplified physical models and corresponding brute-force numerical solvers, offering accuracy, speed, and data efficiency, as well as physical insights into the process.
    Performative Power. (arXiv:2203.17232v2 [cs.LG] UPDATED)
    We introduce the notion of performative power, which measures the ability of a firm operating an algorithmic system, such as a digital content recommendation platform, to cause change in a population of participants. We relate performative power to the economic study of competition in digital economies. Traditional economic concepts struggle to identify anti-competitive patterns in digital platforms, not least due to the complexity of market definition. In contrast, performative power is a causal notion that is identifiable with minimal knowledge of the market, its internals, participants, products, or prices. Low performative power implies that a firm can do no better than to optimize its objective on current data. In contrast, firms of high performative power stand to benefit from steering the population towards more profitable behavior. We confirm in a simple theoretical model that monopolies maximize performative power. A firm's ability to personalize increases performative power, while competition and outside options decrease it. On the empirical side, we propose an observational causal design to identify performative power from discontinuities in how digital platforms display content. This allows us to repurpose causal effects from various studies about digital platforms as lower bounds on performative power. Finally, we speculate about the role that performative power might play in competition policy and antitrust enforcement in digital marketplaces.
    When Do Contrastive Learning Signals Help Spatio-Temporal Graph Forecasting?. (arXiv:2108.11873v2 [cs.LG] UPDATED)
    Deep learning models are modern tools for spatio-temporal graph (STG) forecasting. Though successful, we argue that data scarcity is a key factor limiting their recent improvements. Meanwhile, contrastive learning has been an effective method for providing self-supervision signals and addressing data scarcity in various domains. In view of this, one may ask: can we leverage the additional signals from contrastive learning to alleviate data scarcity, so as to benefit STG forecasting? To answer this question, we present the first systematic exploration on incorporating contrastive learning into STG forecasting. Specifically, we first elaborate two potential schemes for integrating contrastive learning. We then propose two feasible and efficient designs of contrastive tasks that are performed on the node or graph level. The empirical study on STG benchmarks demonstrates that integrating graph-level contrast with the joint learning scheme achieves the best performance. In addition, we introduce four augmentations for STG data, which perturb the data in terms of graph structure, time domain, and frequency domain. Experimental results reveal that the model is not sensitive to the proposed augmentations' semantics. Lastly, we extend the classic contrastive loss via a rule-based strategy that filters out the most semantically similar negatives, yielding performance gains. We also provide explanations and insights based on the above experimental findings. Code is available at https://github.com/liuxu77/STGCL.  ( 3 min )
    Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters. (arXiv:2211.01979v1 [cs.CL])
    Adapter-tuning is a paradigm that transfers a pretrained language model to downstream tasks by adding and tuning a small number of new parameters. Previously proposed adapter architectures are all feed-forward neural networks. In this paper, we investigate the effectiveness of using tiny-attention -- i.e., attention with extremely small per-head dimensionality -- as adapters. Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters. Moreover, we view its multiple attention heads as a mixture of experts and propose to average their weights during deployment, which further reduces its inference computation cost. On the GLUE benchmark, our tiny-attention adapter outperforms the other parameter-efficient transfer learning methods as well as full fine-tuning while only updating 0.05% of the parameters. On the FewGLUE benchmark, its performance is comparable to that of GPT-3 and PET.  ( 2 min )
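    The adapter itself is just a single attention head with a very small head dimension, added residually to the hidden states; a minimal PyTorch sketch of the idea (a hypothetical module, not the authors' exact code):

    import torch
    import torch.nn as nn

    class TinyAttentionAdapter(nn.Module):
        def __init__(self, d_model, d_head=1):
            super().__init__()
            self.q = nn.Linear(d_model, d_head)
            self.k = nn.Linear(d_model, d_head)
            self.v = nn.Linear(d_model, d_head)
            self.out = nn.Linear(d_head, d_model)

        def forward(self, h):  # h: (batch, seq, d_model)
            scores = self.q(h) @ self.k(h).transpose(1, 2) / self.q.out_features ** 0.5
            attn = torch.softmax(scores, dim=-1)
            # residual update of each position, conditioned on all other positions
            return h + self.out(attn @ self.v(h))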
    Single SMPC Invocation DPHelmet: Differentially Private Distributed Learning on a Large Scale. (arXiv:2211.02003v1 [cs.CR])
    Distributing machine learning predictors enables the collection of large-scale datasets while leaving sensitive raw data at trustworthy sites. We show that locally training support vector machines (SVMs) and computing their averages leads to a learning technique that is scalable to a large number of users, satisfies differential privacy, and is applicable to non-trivial tasks, such as CIFAR-10. For a large number of participants, communication cost is one of the main challenges. We achieve a low communication cost by requiring only a single invocation of an efficient secure multiparty summation protocol. By relying on state-of-the-art feature extractors (SimCLR), we are able to utilize differentially private convex learners for non-trivial tasks such as CIFAR-10. Our experimental results illustrate that for $1{,}000$ users with $50$ data points each, our scheme outperforms state-of-the-art scalable distributed learning methods (differentially private federated learning, short DP-FL) while requiring around $500$ times fewer communication costs: For CIFAR-10, we achieve a classification accuracy of $79.7\,\%$ for an $\varepsilon = 0.59$ while DP-FL achieves $57.6\,\%$. More generally, we prove learnability properties for the average of such locally trained models: convergence and uniform stability. By only requiring strongly convex, smooth, and Lipschitz-continuous objective functions, locally trained via stochastic gradient descent (SGD), we achieve a strong utility-privacy tradeoff.  ( 2 min )
    Port-metriplectic neural networks: thermodynamics-informed machine learning of complex physical systems. (arXiv:2211.01873v1 [cs.LG])
    We develop inductive biases for the machine learning of complex physical systems based on the port-Hamiltonian formalism. To satisfy, by construction, the principles of thermodynamics in the learned physics (conservation of energy, non-negative entropy production), we modify the port-Hamiltonian formalism accordingly so as to obtain a port-metriplectic one. We show that the constructed networks are able to learn the physics of complex systems by parts, thus alleviating the burden associated with the experimental characterization and subsequent learning process for systems of this kind. Predictions can nevertheless be made at the scale of the complete system. Examples demonstrate the performance of the proposed technique.  ( 2 min )
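    For reference, a common metriplectic (GENERIC-style) form that such thermodynamics-informed networks are built to respect is, in our notation,

        \frac{dz}{dt} = L(z)\,\nabla E(z) + M(z)\,\nabla S(z),
        \qquad L(z)\,\nabla S(z) = 0, \qquad M(z)\,\nabla E(z) = 0,

    with $L$ skew-symmetric and $M$ positive semi-definite, so that the energy $E$ is conserved and the entropy $S$ is non-decreasing by construction; the port-metriplectic modification adds boundary (port) terms through which subsystems learned by parts exchange energy and entropy.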
    Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization. (arXiv:2211.01851v1 [math.OC])
    We propose an adaptive variance-reduction method, called AdaSpider, for minimization of $L$-smooth, non-convex functions with a finite-sum structure. In essence, AdaSpider combines an AdaGrad-inspired [Duchi et al., 2011, McMahan & Streeter, 2010], yet fairly distinct, adaptive step-size schedule with the recursive stochastic path integrated estimator proposed in [Fang et al., 2018]. To our knowledge, AdaSpider is the first parameter-free non-convex variance-reduction method in the sense that it does not require knowledge of problem-dependent parameters, such as the smoothness constant $L$, the target accuracy $\epsilon$ or any bound on gradient norms. In doing so, we are able to compute an $\epsilon$-stationary point with $\tilde{O}\left(n + \sqrt{n}/\epsilon^2\right)$ oracle calls, which matches the respective lower bound up to logarithmic factors.
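    To make the ingredients concrete, here is a minimal sketch combining a SPIDER-style recursive estimator with an AdaGrad-style accumulator; the paper's actual step-size schedule differs, and all names are ours:

        import numpy as np

        def adaspider_sketch(grad_i, x0, n, T, eps=1e-8):
            # grad_i(x, i): gradient of the i-th component function at x
            q = int(np.sqrt(n))               # full-gradient refresh period
            x, acc = x0.copy(), 0.0
            for t in range(T):
                if t % q == 0:                # periodic full-gradient refresh
                    v = np.mean([grad_i(x, i) for i in range(n)], axis=0)
                else:                         # recursive path-integrated update
                    i = np.random.randint(n)
                    v = v + grad_i(x, i) - grad_i(x_prev, i)
                acc += np.linalg.norm(v) ** 2  # AdaGrad-style accumulator
                x_prev, x = x, x - v / (np.sqrt(acc) + eps)
            return x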
    DBS: Dynamic Batch Size For Distributed Deep Neural Network Training. (arXiv:2007.11831v2 [cs.LG] UPDATED)
    Synchronous strategies with data parallelism, such as Synchronous Stochastic Gradient Descent (S-SGD) and model averaging methods, are widely utilized in distributed training of Deep Neural Networks (DNNs), largely owing to their easy implementation yet promising performance. In particular, each worker of the cluster hosts a copy of the DNN and an evenly divided share of the dataset with a fixed mini-batch size, to keep the training of DNNs convergent. In these strategies, workers with different computational capabilities need to wait for each other because of the synchronization and delays in network transmission, which inevitably results in high-performance workers wasting computation. Consequently, the utilization of the cluster is relatively low. To alleviate this issue, we propose the Dynamic Batch Size (DBS) strategy for the distributed training of DNNs. Specifically, the performance of each worker is first evaluated based on its actual performance in the previous epoch, and then the batch size and dataset partition are dynamically adjusted in consideration of the worker's current performance, thereby improving the utilization of the cluster. To verify the effectiveness of the proposed strategy, extensive experiments have been conducted, and the experimental results indicate that the proposed strategy can fully utilize the performance of the cluster, reduce the training time, and maintain good robustness under disturbance by irrelevant tasks. Furthermore, rigorous theoretical analysis has also been provided to prove the convergence of the proposed strategy.
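    The core reallocation step can be sketched in a few lines of Python (names ours): each worker receives a share of the global batch proportional to its measured throughput in the previous epoch:

        def rebalance_batch_sizes(global_batch, prev_shares, epoch_times):
            # throughput of each worker in the previous epoch (samples/second)
            rates = [s / t for s, t in zip(prev_shares, epoch_times)]
            total = sum(rates)
            return [max(1, round(global_batch * r / total)) for r in rates]

        # A worker that processed its share twice as fast gets roughly
        # twice as many samples in the next epoch:
        print(rebalance_batch_sizes(256, [128, 128], [10.0, 20.0]))  # ~[171, 85]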
    From Local to Global: Spectral-Inspired Graph Neural Networks. (arXiv:2209.12054v2 [stat.ML] UPDATED)
    Graph Neural Networks (GNNs) are powerful deep learning methods for non-Euclidean data. Popular GNNs are message-passing algorithms (MPNNs) that aggregate and combine signals in a local graph neighborhood. However, shallow MPNNs tend to miss long-range signals and perform poorly on some heterophilous graphs, while deep MPNNs can suffer from issues like over-smoothing or over-squashing. To mitigate such issues, existing works typically borrow normalization techniques from training neural networks on Euclidean data or modify the graph structures. Yet these approaches are not well-understood theoretically and could increase the overall computational complexity. In this work, we draw inspiration from spectral graph embedding and propose $\texttt{PowerEmbed}$ -- a simple layer-wise normalization technique to boost MPNNs. We show $\texttt{PowerEmbed}$ can provably express the top-$k$ leading eigenvectors of the graph operator, which prevents over-smoothing and is agnostic to the graph topology; meanwhile, it produces a list of representations ranging from local features to global signals, which avoids over-squashing. We apply $\texttt{PowerEmbed}$ to a wide range of simulated and real graphs and demonstrate its competitive performance, particularly for heterophilous graphs.
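    One plausible PyTorch reading of the layer-wise normalization idea, assuming a precomputed normalized graph operator A_norm (the paper's exact normalization may differ):

        import torch

        def power_embed(A_norm, X, n_layers):
            # keep every intermediate embedding: local features first,
            # global (spectral) signals last
            reps, h = [X], X
            for _ in range(n_layers):
                h = A_norm @ h                        # one message-passing step
                h = h / h.norm(dim=0, keepdim=True)   # layer-wise normalization
                reps.append(h)
            return reps

    As in power iteration, the per-layer normalization keeps repeated smoothing from collapsing the features and instead drives them toward leading eigenvectors of the operator.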
    Toward Unsupervised Outlier Model Selection. (arXiv:2211.01834v1 [cs.LG])
    Today there exists no shortage of outlier detection algorithms in the literature, yet the complementary and critical problem of unsupervised outlier model selection (UOMS) is vastly understudied. In this work we propose ELECT, a new approach to selecting an effective candidate model, i.e. an outlier detection algorithm and its hyperparameter(s), to employ on a new dataset without any labels. At its core, ELECT is based on meta-learning; transferring prior knowledge (e.g. model performance) on historical datasets that are similar to the new one to facilitate UOMS. Uniquely, it employs a dataset similarity measure that is performance-based, which is more direct and goal-driven than other measures used in the past. ELECT adaptively searches for similar historical datasets; as such, it can produce an output on demand, accommodating varying time budgets. Extensive experiments show that ELECT significantly outperforms a wide range of basic UOMS baselines, including no model selection (always using the same popular model such as iForest) as well as more recent selection strategies based on meta-features.
    LE3D: A Lightweight Ensemble Framework of Data Drift Detectors for Resource-Constrained Devices. (arXiv:2211.01840v1 [cs.LG])
    Data integrity becomes paramount as the number of Internet of Things (IoT) sensor deployments increases. Sensor data can be altered by benign causes or malicious actions. Mechanisms that detect drifts and irregularities can prevent disruptions and data bias in the state of an IoT application. This paper presents LE3D, an ensemble framework of data drift estimators capable of detecting abnormal sensor behaviours. Working collaboratively with surrounding IoT devices, the type of drift (natural/abnormal) can also be identified and reported to the end-user. The proposed framework is a lightweight and unsupervised implementation able to run on resource-constrained IoT devices. Our framework is also generalisable, adapting to new sensor streams and environments with minimal online reconfiguration. We compare our method against state-of-the-art ensemble data drift detection frameworks, evaluating both the real-world detection accuracy as well as the resource utilisation of the implementation. Experimenting with real-world data and emulated drifts, we show the effectiveness of our method, which achieves up to 97% of detection accuracy while requiring minimal resources to run.
    A Theory of PAC Learnability under Transformation Invariances. (arXiv:2202.07552v2 [cs.LG] UPDATED)
    Transformation invariances are present in many real-world problems. For example, image classification is usually invariant to rotation and color transformation: a rotated car in a different color is still identified as a car. Data augmentation, which adds the transformed data into the training set and trains a model on the augmented data, is one commonly used technique to build these invariances into the learning process. However, it is unclear how data augmentation performs theoretically and what the optimal algorithm is in the presence of transformation invariances. In this paper, we study PAC learnability under transformation invariances in three settings according to different levels of realizability: (i) A hypothesis fits the augmented data; (ii) A hypothesis fits only the original data and the transformed data lying in the support of the data distribution; (iii) The agnostic case. One interesting observation is that distinguishing between the original data and the transformed data is necessary to achieve optimal accuracy in settings (ii) and (iii), which implies that any algorithm not differentiating between the original and transformed data (including data augmentation) is not optimal. Furthermore, algorithms of this type can even "harm" the accuracy. In setting (i), although it is unnecessary to distinguish between the two data sets, data augmentation still does not perform optimally. Due to such a difference, we propose two combinatorial measures characterizing the optimal sample complexity in setting (i) and in settings (ii)-(iii), and provide the optimal algorithms.
    Sparse Graph Learning with Spectrum Prior for Deep Graph Convolutional Networks. (arXiv:2202.13526v2 [cs.LG] UPDATED)
    A graph convolutional network (GCN) employs a graph filtering kernel tailored for data with irregular structures. However, simply stacking more GCN layers does not improve performance; instead, the output converges to an uninformative low-dimensional subspace, where the convergence rate is characterized by the graph spectrum -- this is the known over-smoothing problem in GCN. In this paper, we propose a sparse graph learning algorithm incorporating a new spectrum prior to compute a graph topology that circumvents over-smoothing while preserving pairwise correlations inherent in data. Specifically, based on a spectral analysis of multilayer GCN output, we derive a spectrum prior for the graph Laplacian matrix $\mathbf{L}$ to robustify the model expressiveness against over-smoothing. Then, we formulate a sparse graph learning problem with the spectrum prior, solved efficiently via block coordinate descent (BCD). Moreover, we optimize the weight parameter trading off the fidelity term with the spectrum prior, based on data smoothness on the original graph learned without spectrum manipulation. The output $\mathbf{L}$ is then normalized for supervised GCN training. Experiments show that our proposal produced deeper GCNs and higher prediction accuracy for regression and classification tasks compared to competing schemes.
    Probability-Dependent Gradient Decay in Large Margin Softmax. (arXiv:2210.17145v1 [stat.ML] CROSS LISTED)
    In the past few years, Softmax has become a common component in neural network frameworks. In this paper, a gradient decay hyperparameter is introduced in Softmax to control the probability-dependent gradient decay rate during training. Through theoretical analysis and empirical results for a variety of model architectures trained on MNIST, CIFAR-10/100 and SVHN, we find that the generalization performance depends significantly on the gradient decay rate as the confidence probability rises, i.e., on whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay shows a behaviour similar to a curriculum learning sequence, where hard samples are in the spotlight only after easy samples have been fitted with sufficient confidence, and well-separated samples receive a higher gradient to reduce intra-class distance. Based on the analysis results, we provide evidence that large margin Softmax affects the local Lipschitz constraint of the loss function by regulating the probability-dependent gradient decay rate. This paper provides a new perspective on, and understanding of, the relationship among the concepts of large margin Softmax, the local Lipschitz constraint and curriculum learning by analyzing the gradient decay rate. Besides, we propose a warm-up strategy to dynamically adjust the Softmax loss during training, where the gradient decay rate increases from an initially small value to speed up convergence.
    Spiking Variational Graph Auto-Encoders for Efficient Graph Representation Learning. (arXiv:2211.01952v1 [cs.NE])
    Graph representation learning is a fundamental research issue and benefits a wide range of applications on graph-structured data. Conventional artificial neural network-based methods such as graph neural networks (GNNs) and variational graph auto-encoders (VGAEs) have achieved promising results in learning on graphs, but they suffer from extremely high energy consumption during training and inference stages. Inspired by the bio-fidelity and energy-efficiency of spiking neural networks (SNNs), recent methods attempt to adapt GNNs to the SNN framework by substituting spiking neurons for the activation functions. However, existing SNN-based GNN methods cannot be applied to the more general multi-node representation learning problem represented by link prediction. Moreover, these methods do not fully exploit the bio-fidelity of SNNs, as they still require costly multiply-accumulate (MAC) operations, which severely harm energy efficiency. To address the above issues and improve energy efficiency, in this paper we propose an SNN-based deep generative method, namely the Spiking Variational Graph Auto-Encoders (S-VGAE), for efficient graph representation learning. To deal with the multi-node problem, we propose a probabilistic decoder that generates binary latent variables as spiking node representations and reconstructs graphs via the weighted inner product. To avoid MAC operations for energy efficiency, we further decouple the propagation and transformation layers of conventional GNN aggregators. We conduct link prediction experiments on multiple benchmark graph datasets, and the results demonstrate that our model consumes significantly lower energy with performance superior or comparable to that of other ANN- and SNN-based methods for graph representation learning.
    Martian Ionosphere Electron Density Prediction Using Bagged Trees. (arXiv:2211.01902v1 [physics.ao-ph])
    The availability of Martian atmospheric data provided by several Martian missions has broadened the opportunity to investigate and study the conditions of the Martian ionosphere. As such, ionospheric models play a crucial part in improving our understanding of ionospheric behavior in response to different spatial, temporal, and space weather conditions. This work represents an initial attempt to construct an electron density prediction model of the Martian ionosphere using machine learning. The model targets the ionosphere at solar zenith angles ranging from 70 to 90 degrees and, as such, only utilizes observations from the Mars Global Surveyor mission. The performance of different machine learning methods was compared in terms of root mean square error, coefficient of determination, and mean absolute error. The bagged regression trees method performed best out of all the evaluated methods. Furthermore, the optimized bagged regression trees model outperformed other Martian ionosphere models from the literature (MIRI and NeMars) in finding the peak electron density value and the peak density height in terms of root mean square error and mean absolute error.
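    For readers unfamiliar with bagged regression trees, a self-contained scikit-learn sketch of the modelling step follows; the synthetic features are stand-ins, since the abstract does not list the exact inputs (plausibly solar zenith angle, altitude, and solar-activity indices):

        import numpy as np
        from sklearn.ensemble import BaggingRegressor
        from sklearn.metrics import mean_absolute_error, mean_squared_error
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        X = rng.uniform(size=(2000, 4))                    # stand-in features
        y = np.exp(-3 * X[:, 0]) + 0.05 * rng.normal(size=2000)

        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
        # BaggingRegressor uses decision trees as its base estimator by default
        model = BaggingRegressor(n_estimators=100).fit(X_tr, y_tr)
        pred = model.predict(X_te)
        print("RMSE:", mean_squared_error(y_te, pred) ** 0.5)
        print("MAE:", mean_absolute_error(y_te, pred))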
    AI enhanced finite element multiscale modelling and structural uncertainty analysis of a functionally graded porous beam. (arXiv:2211.01970v1 [cs.LG])
    The local geometrical randomness of metal foams brings complexities to the performance prediction of porous structures. Although the relative density is commonly deemed the key factor, the stochasticity of internal cell sizes and shapes has an apparent effect on the porous structural behaviour, but the corresponding measurement is challenging. To address this issue, we aim to develop an assessment strategy for efficiently examining the foam properties by combining multiscale modelling and deep learning. The multiscale modelling is based on finite element (FE) simulation employing representative volume elements (RVEs) with random cellular morphologies, mimicking the typical features of closed-cell Aluminium foams. A deep learning database is constructed for training the designed convolutional neural networks (CNNs) to establish a direct link between the mesoscopic porosity characteristics and the effective Young's modulus of foams. The error range of the CNN models leads to an uncertain mechanical performance, which is further evaluated in a structural uncertainty analysis of a functionally graded (FG) porous three-layer beam consisting of two thin high-density layers and a thick low-density one, where the imprecise CNN-predicted moduli are represented as triangular fuzzy numbers in double parametric form. The uncertain beam bending deflections under a mid-span point load are calculated with the aid of Timoshenko beam theory and the Ritz method. Our findings show that CNN models can be trained to estimate the RVE modulus from images with an average error of 5.92%. The evaluation of FG porous structures can be significantly simplified with the proposed method, which connects to the mesoscopic cellular morphologies without establishing a mechanics model for the local foams.
    Revisiting Hyperparameter Tuning with Differential Privacy. (arXiv:2211.01852v1 [cs.LG])
    Hyperparameter tuning is a common practice in the application of machine learning but a typically ignored aspect in the literature on privacy-preserving machine learning due to its negative effect on the overall privacy parameter. In this paper, we aim to tackle this fundamental yet challenging problem by providing an effective hyperparameter tuning framework with differential privacy. The proposed method allows us to adopt a broader hyperparameter search space and even to perform a grid search over the whole space, since its privacy loss parameter is independent of the number of hyperparameter candidates. Interestingly, it instead correlates with the utility gained from hyperparameter searching, revealing an explicit and mandatory trade-off between privacy and utility. Theoretically, we show that the additional privacy loss bound incurred by hyperparameter tuning is upper-bounded by the square root of the gained utility. However, we note that the additional privacy loss bound would empirically scale like the square root of the logarithm of the utility term, benefiting from the design of the doubling step.
    Communication Efficient Generalized Tensor Factorization for Decentralized Healthcare Networks. (arXiv:2109.01718v2 [cs.LG] UPDATED)
    Tensor factorization has proven to be an efficient unsupervised learning approach for health data analysis, especially for computational phenotyping, where the high-dimensional Electronic Health Records (EHRs) with patients' history of medical procedures, medications, diagnoses, lab tests, etc., are converted to meaningful and interpretable medical concepts. Federated tensor factorization distributes the tensor computation to multiple workers under the coordination of a central server, which enables jointly learning the phenotypes across multiple hospitals while preserving the privacy of the patient information. However, existing federated tensor factorization algorithms encounter the single-point-failure issue with the involvement of the central server, which is not only easily exposed to external attacks but also limits the number of clients sharing information with the server under restricted uplink bandwidth. In this paper, we propose CiderTF, a communication-efficient decentralized generalized tensor factorization, which reduces the uplink communication cost by leveraging a four-level communication reduction strategy designed for generalized tensor factorization, which has the flexibility of modeling different tensor distributions with multiple kinds of loss functions. Experiments on two real-world EHR datasets demonstrate that CiderTF achieves comparable convergence with a communication reduction of up to 99.99%.
    A Convergence Theory for Federated Average: Beyond Smoothness. (arXiv:2211.01588v1 [cs.LG])
    Federated learning enables a large number of edge computing devices to jointly learn a model without data sharing. As a leading algorithm in this setting, Federated Averaging (FedAvg), which runs Stochastic Gradient Descent (SGD) in parallel on local devices and averages the sequences only once in a while, has been widely used due to its simplicity and low communication cost. However, despite recent research efforts, it lacks theoretical analysis under assumptions beyond smoothness. In this paper, we analyze the convergence of FedAvg. Different from the existing work, we relax the assumption of strong smoothness. More specifically, we assume semi-smoothness and semi-Lipschitz properties for the loss function, which have an additional first-order term in the assumption definitions. In addition, we also assume a bound on the gradient, which is weaker than the commonly used bounded-gradient assumption in the convergence analysis scheme. As a solution, this paper provides a theoretical convergence study of Federated Learning.
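    For reference, a minimal PyTorch sketch of one FedAvg round as analyzed here, assuming float-only parameters and client loaders that yield at least local_steps batches:

        import copy
        import torch

        def fedavg_round(global_model, client_loaders, local_steps, lr):
            # each client runs SGD locally from the global weights;
            # the server then averages the resulting parameters
            states = []
            for loader in client_loaders:
                model = copy.deepcopy(global_model)
                opt = torch.optim.SGD(model.parameters(), lr=lr)
                batches = iter(loader)
                for _ in range(local_steps):
                    x, y = next(batches)
                    opt.zero_grad()
                    loss = torch.nn.functional.cross_entropy(model(x), y)
                    loss.backward()
                    opt.step()
                states.append(model.state_dict())
            avg = {k: sum(s[k] for s in states) / len(states) for k in states[0]}
            global_model.load_state_dict(avg)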
    Unlocking the potential of two-point cells for energy-efficient training of deep nets. (arXiv:2211.01950v1 [cs.NE])
    Context-sensitive two-point layer 5 pyramidal cells (L5PC) were discovered as long ago as 1999. However, the potential of this discovery to provide useful neural computation has yet to be demonstrated. Here we show for the first time how a transformative L5PC-driven deep neural network (DNN), termed the multisensory cooperative computing (MCC) architecture, can effectively process large amounts of heterogeneous real-world audio-visual (AV) data, using far less energy compared to the best available `point' neuron-driven DNNs. A novel highly-distributed parallel implementation on a Xilinx UltraScale+ MPSoC device estimates energy savings up to $245759 \times 50000$ $\mu$J (i.e., $62\%$ less than the baseline model in a semi-supervised learning setup) where a single synapse consumes $8\times10^{-5}\,\mu$J. In a supervised learning setup, the energy savings can potentially reach up to 1250x (per feedforward transmission) relative to the baseline model. This remarkable performance in pilot experiments demonstrates the embodied neuromorphic intelligence of our proposed L5PC-based MCC architecture, which contextually selects the most salient and relevant information for onward transmission, from the overwhelmingly large multimodal information utilised at the early stages of on-chip training. Our proposed approach opens new cross-disciplinary avenues for future on-chip DNN training implementations and posits a radical shift in current neuromorphic computing paradigms.
    Image-based Early Detection System for Wildfires. (arXiv:2211.01629v1 [cs.CV])
    Wildfires are a disastrous phenomenon that causes damage to land, loss of property, air pollution, and even loss of human life. Due to the warmer and drier conditions created by climate change, more severe and uncontrollable wildfires are expected to occur in the coming years. This could lead to a global wildfire crisis and have dire consequences for our planet. Hence, it has become imperative to use technology to help prevent the spread of wildfires. One way to prevent the spread of wildfires before they become too large is to perform early detection, i.e., detecting the smoke before the actual fire starts. In this paper, we present our Wildfire Detection and Alert System, which uses machine learning to detect wildfire smoke with a high degree of accuracy and can send immediate alerts to users. Our technology is currently being used in the USA to monitor data coming in from hundreds of cameras daily. We show that our system has a high true detection rate and a low false detection rate. Our performance evaluation study also shows that, on average, our system detects wildfire smoke faster than an actual person.
    Reinforcement Learning based Cyberattack Model for Adaptive Traffic Signal Controller in Connected Transportation Systems. (arXiv:2211.01845v1 [cs.CR])
    In a connected transportation system, adaptive traffic signal controllers (ATSC) utilize real-time vehicle trajectory data received from vehicles through wireless connectivity (i.e., connected vehicles) to regulate green time. However, this wireless connectivity expands the cyber-attack surface of ATSCs and increases their vulnerability to various cyber-attack modes, which can be leveraged to induce significant congestion in a roadway network. An attacker may receive financial benefits for creating such congestion on a specific roadway. One such mode is a 'sybil' attack, in which an attacker creates fake vehicles in the network by generating fake Basic Safety Messages (BSMs) imitating actual connected vehicles following roadway traffic rules. The ultimate goal of an attacker is to block a route(s) by generating fake or 'sybil' vehicles at a rate such that the signal timing and phasing changes occur without flagging any abrupt change in the number of vehicles. Because of the highly non-linear and unpredictable nature of vehicle arrival rates and the ATSC algorithm, it is difficult to find an optimal rate of sybil vehicles to inject from the different approaches of an intersection. Thus, it is necessary to develop an intelligent cyber-attack model to prove the existence of such attacks. In this study, a reinforcement learning based cyber-attack model is developed for a waiting time-based ATSC. Specifically, an RL agent is trained to learn an optimal rate of sybil vehicle injection to create congestion for an approach(s). Our analyses revealed that the RL agent can learn an optimal policy for creating an intelligent attack.
    DynLight: Realize dynamic phase duration with multi-level traffic signal control. (arXiv:2204.03471v3 [cs.AI] UPDATED)
    Adopting reinforcement learning (RL) for traffic signal control (TSC) is increasingly popular, and RL has become a promising solution in this area. However, several challenges still need to be overcome. Firstly, most RL methods use a fixed action duration and select the green phase for the next state, which makes the phase duration less dynamic and flexible. Secondly, the phase sequence of RL methods can be arbitrary, affecting real-world deployment, which may require a cyclical phase structure. Lastly, average travel time and throughput are not fair metrics for evaluating TSC performance. To address these challenges, we propose a multi-level traffic signal control framework, DynLight, which uses an optimization method, Max-QueueLength (M-QL), to determine the phase and a deep Q-network to determine the duration of the corresponding phase. Based on DynLight, we further propose DynLight-C, which adopts the well-trained deep Q-network of DynLight and replaces M-QL with a cyclical control policy that actuates a set of phases in fixed cyclical order to realize a cyclical phase structure. Comprehensive experiments on multiple real-world datasets demonstrate that DynLight achieves a new state-of-the-art. Furthermore, the deep Q-network of DynLight learns well to determine the phase duration, and DynLight-C demonstrates high performance for deployment.
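    The M-QL component reduces to a one-line rule; a toy Python illustration (phase names ours) follows, with the duration of the chosen phase then set by the deep Q-network:

        def max_queue_length_phase(queues_by_phase):
            # pick the phase whose permitted movements currently have
            # the largest total queue of waiting vehicles
            return max(queues_by_phase, key=queues_by_phase.get)

        # e.g. four phases with summed queue lengths:
        print(max_queue_length_phase({"NS": 12, "EW": 7, "NSL": 3, "EWL": 9}))  # NS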
    Stock Trading Volume Prediction with Dual-Process Meta-Learning. (arXiv:2211.01762v1 [q-fin.TR])
    Volume prediction is one of the fundamental objectives in the Fintech area and is helpful for many downstream tasks, e.g., algorithmic trading. Previous methods mostly learn a universal model for different stocks. However, this kind of practice omits the specific characteristics of individual stocks by applying the same set of parameters to different stocks. On the other hand, learning a different model for each stock would face data sparsity or cold-start problems for many stocks with small capitalization. To take advantage of the data scale and the various characteristics of individual stocks, we propose a dual-process meta-learning method that treats the prediction of each stock as one task under the meta-learning framework. Our method can model the common pattern behind different stocks with a meta-learner, while modeling the specific pattern for each stock across time spans with stock-dependent parameters. Furthermore, we propose to mine the pattern of each stock in the form of a latent variable which is then used for learning the parameters of the prediction module. This makes the prediction procedure aware of the data pattern. Extensive experiments on volume prediction show that our method can improve the performance of various baseline models. Further analyses testify to the effectiveness of our proposed meta-learning framework.
    Client Selection in Federated Learning: Principles, Challenges, and Opportunities. (arXiv:2211.01549v1 [cs.LG])
    As a privacy-preserving paradigm for training Machine Learning (ML) models, Federated Learning (FL) has received tremendous attention from both industry and academia. In a typical FL scenario, clients exhibit significant heterogeneity in terms of data distribution and hardware configurations. Thus, randomly sampling clients in each training round may not fully exploit the local updates from heterogeneous clients, resulting in lower model accuracy, slower convergence rate, degraded fairness, etc. To tackle the FL client heterogeneity problem, various client selection algorithms have been developed, showing promising performance improvement. In this paper, we systematically present recent advances in the emerging field of FL client selection and its challenges and research opportunities. We hope to facilitate practitioners in choosing the most suitable client selection mechanisms for their applications, as well as inspire researchers and newcomers to better understand this exciting research topic.
    Exploring explicit coarse-grained structure in artificial neural networks. (arXiv:2211.01779v1 [cs.LG])
    We propose to employ the hierarchical coarse-grained structure in artificial neural networks explicitly to improve interpretability without degrading performance. The idea has been applied in two situations. One is a neural network called TaylorNet, which aims to approximate the general mapping from input data to output results in terms of a Taylor series directly, without resorting to any magic nonlinear activations. The other is a new setup for data distillation, which can perform multi-level abstraction of the input dataset and generate new data that possesses the relevant features of the original dataset and can be used as references for classification. In both cases, the coarse-grained structure plays an important role in simplifying the network and improving both interpretability and efficiency. The validity has been demonstrated on the MNIST and CIFAR-10 datasets. Further improvements and related open questions are also discussed.
    Automated segmentation of microvessels in intravascular OCT images using deep learning. (arXiv:2210.00166v2 [eess.IV] UPDATED)
    To analyze this characteristic of vulnerability, we developed an automated deep learning method for detecting microvessels in intravascular optical coherence tomography (IVOCT) images. A total of 8,403 IVOCT image frames from 85 lesions and 37 normal segments were analyzed. Manual annotation was done using a dedicated software (OCTOPUS) previously developed by our group. Data augmentation in the polar (r,{\theta}) domain was applied to raw IVOCT images to ensure that microvessels appear at all possible angles. Pre-processing methods included guidewire/shadow detection, lumen segmentation, pixel shifting, and noise reduction. DeepLab v3+ was used to segment microvessel candidates. A bounding box on each candidate was classified as either microvessel or non-microvessel using a shallow convolutional neural network. For better classification, we used data augmentation (i.e., angle rotation) on bounding boxes with a microvessel during network training. Data augmentation and pre-processing steps improved microvessel segmentation performance significantly, yielding a method with Dice of 0.71+/-0.10 and pixel-wise sensitivity/specificity of 87.7+/-6.6%/99.8+/-0.1%. The network for classifying microvessels from candidates performed exceptionally well, with sensitivity of 99.5+/-0.3%, specificity of 98.8+/-1.0%, and accuracy of 99.1+/-0.5%. The classification step eliminated the majority of residual false positives, and the Dice coefficient increased from 0.71 to 0.73. In addition, our method produced 698 image frames with microvessels present, compared to 730 from manual analysis, representing a 4.4% difference. When compared to the manual method, the automated method improved microvessel continuity, implying improved segmentation performance. The method will be useful for research purposes as well as potential future treatment planning.
    Optimal Behavior Prior: Data-Efficient Human Models for Improved Human-AI Collaboration. (arXiv:2211.01602v1 [cs.LG])
    AI agents designed to collaborate with people benefit from models that enable them to anticipate human behavior. However, realistic models tend to require vast amounts of human data, which is often hard to collect. A good prior or initialization could make for more data-efficient training, but what makes for a good prior on human behavior? Our work leverages a very simple assumption: people generally act closer to optimal than to random chance. We show that using optimal behavior as a prior for human models makes these models vastly more data-efficient and able to generalize to new environments. Our intuition is that such a prior enables the training to focus one's precious real-world data on capturing the subtle nuances of human suboptimality, instead of on the basics of how to do the task in the first place. We also show that using these improved human models often leads to better human-AI collaboration performance compared to using models based on real human data alone.
    Optimal Algorithms for Stochastic Complementary Composite Minimization. (arXiv:2211.01758v1 [cs.LG])
    Inspired by regularization techniques in statistics and machine learning, we study complementary composite minimization in the stochastic setting. This problem corresponds to the minimization of the sum of a (weakly) smooth function endowed with a stochastic first-order oracle, and a structured uniformly convex (possibly nonsmooth and non-Lipschitz) regularization term. Despite intensive work on closely related settings, prior to our work no complexity bounds for this problem were known. We close this gap by providing novel excess risk bounds, both in expectation and with high probability. Our algorithms are nearly optimal, which we prove via novel lower complexity bounds for this class of problems. We conclude by providing numerical results comparing our methods to the state of the art.
    ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations. (arXiv:2211.01866v1 [cs.CV])
    Deep learning vision systems are widely deployed across applications where reliability is critical. However, even today's best models can fail to recognize an object when its pose, lighting, or background varies. While existing benchmarks surface examples challenging for models, they do not explain why such mistakes arise. To address this need, we introduce ImageNet-X, a set of sixteen human annotations of factors such as pose, background, or lighting for the entire ImageNet-1k validation set as well as a random subset of 12k training images. Equipped with ImageNet-X, we investigate 2,200 current recognition models and study the types of mistakes as a function of a model's (1) architecture, e.g. transformer vs. convolutional, (2) learning paradigm, e.g. supervised vs. self-supervised, and (3) training procedures, e.g., data augmentation. Regardless of these choices, we find models have consistent failure modes across ImageNet-X categories. We also find that while data augmentation can improve robustness to certain factors, it induces spill-over effects on other factors. For example, strong random cropping hurts robustness on smaller objects. Together, these insights suggest that to advance the robustness of modern vision models, future research should focus on collecting additional data and understanding data augmentation schemes. Along with these insights, we release a toolkit based on ImageNet-X to spur further study into the mistakes image recognition systems make.
    Reinforcement Learning in Non-Markovian Environments. (arXiv:2211.01595v1 [eess.SY])
    Following the novel paradigm developed by Van Roy and coauthors for reinforcement learning in arbitrary non-Markovian environments, we propose a related formulation inspired by classical stochastic control that reduces the problem to recursive computation of approximate sufficient statistics.
    Using Signal Processing in Tandem With Adapted Mixture Models for Classifying Genomic Signals. (arXiv:2211.01603v1 [q-bio.GN])
    Genomic signal processing has been used successfully in bioinformatics to analyze biomolecular sequences and gain varied insights into DNA structure, gene organization, protein binding, sequence evolution, etc. But challenges remain in finding the appropriate spectral representation of a biomolecular sequence, especially when multiple variable-length sequences need to be handled consistently. In this study, we address this challenge in the context of the well-studied problem of classifying genomic sequences into different taxonomic units (strain, phylum, order, etc.). We propose a novel technique that employs signal processing in tandem with Gaussian mixture models to improve the spectral representation of a sequence and subsequently the taxonomic classification accuracy. The sequences are first transformed into spectra and projected onto a subspace, where sequences belonging to different taxa are better distinguishable. Our method outperforms a similar state-of-the-art method on established benchmark datasets by an absolute margin of 6.06% accuracy.
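    A compact sketch of such a pipeline under simple assumptions (a purine/pyrimidine +/-1 encoding, a fixed-length FFT to handle variable-length sequences, and diagonal-covariance mixtures; the paper's representation and projection step are more involved):

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def spectrum(seq, n_fft=256):
            # one of several possible numeric encodings of a DNA string
            x = np.array([1.0 if b in "AG" else -1.0 for b in seq])
            return np.abs(np.fft.rfft(x, n=n_fft))

        def fit_taxon_models(seqs_by_taxon, n_components=3):
            # each taxon needs at least n_components training sequences
            return {t: GaussianMixture(n_components=n_components,
                                       covariance_type="diag")
                        .fit(np.stack([spectrum(s) for s in seqs]))
                    for t, seqs in seqs_by_taxon.items()}

        def classify(seq, models):
            s = spectrum(seq)[None, :]
            return max(models, key=lambda t: models[t].score(s))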
    Complete the Missing Half: Augmenting Aggregation Filtering with Diversification for Graph Convolutional Networks. (arXiv:2008.08844v4 [cs.LG] UPDATED)
    The core operation of current Graph Neural Networks (GNNs) is the aggregation enabled by the graph Laplacian or message passing, which filters the neighborhood node information. Though effective for various tasks, in this paper we show that this operation is potentially a problematic factor underlying all GNN methods for learning on certain datasets, as it forces the node representations to be similar, making the nodes gradually lose their identity and become indistinguishable. Hence, we augment the aggregation operations with their dual, i.e. diversification operators that make the nodes more distinct and preserve their identity. Such augmentation replaces the aggregation with a two-channel filtering process that, in theory, is beneficial for enriching the node representations. In practice, the proposed two-channel filters can be easily patched onto existing GNN methods with diverse training strategies, including spectral and spatial (message passing) methods. In the experiments, we observe the desired characteristics of the models and a significant performance boost over the baselines on 9 node classification tasks.
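    One possible reading of the two-channel design in PyTorch, assuming a normalized Laplacian L_norm; the paper's exact parameterization differs:

        import torch

        def two_channel_layer(L_norm, H, W_agg, W_div):
            n = L_norm.size(0)
            I = torch.eye(n, device=L_norm.device)
            agg = (I - L_norm) @ H @ W_agg   # low-pass: smoothing/aggregation
            div = L_norm @ H @ W_div         # high-pass: diversification dual
            return torch.relu(agg + div)     # nodes keep their identity

    Intuitively, the low-pass channel mixes neighborhood information while the high-pass channel re-amplifies the differences that aggregation alone would wash out.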
    Data-free Defense of Black Box Models Against Adversarial Attacks. (arXiv:2211.01579v1 [cs.LG])
    Several companies often safeguard their trained deep models (i.e. details of architecture, learnt weights, training details etc.) from third-party users by exposing them only as black boxes through APIs. Moreover, they may not even provide access to the training data due to proprietary reasons or sensitivity concerns. We make the first attempt to provide adversarial robustness to black box models in a data-free setup. We construct synthetic data via a generative model and train a surrogate network using model stealing techniques. To minimize adversarial contamination on perturbed samples, we propose a `wavelet noise remover' (WNR) that performs discrete wavelet decomposition on input images and carefully selects only a few important coefficients determined by our `wavelet coefficient selection module' (WCSM). To recover the high-frequency content of the image after noise removal via WNR, we further train a `regenerator' network with the objective of retrieving the coefficients such that the reconstructed image yields predictions on the surrogate model similar to those for the original image. At test time, WNR combined with the trained regenerator network is prepended to the black box network, resulting in a high boost in adversarial accuracy. Our method improves the adversarial accuracy on CIFAR-10 by 38.98% and 32.01% on state-of-the-art Auto Attack compared to the baseline, even when the attacker uses a surrogate architecture (Alexnet-half and Alexnet) similar to the black box architecture (Alexnet) with the same model stealing strategy as the defender. The code is available at https://github.com/vcl-iisc/data-free-black-box-defense
    HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks. (arXiv:2211.01839v1 [cs.SD])
    Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals. Recent applications of INRs include image super-resolution, compression of high-dimensional signals, or 3D rendering. However, these solutions usually focus on visual data, and adapting them to the audio domain is not trivial. Moreover, it requires a separately trained model for every data sample. To address this limitation, we propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time. We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
    XAI-Increment: A Novel Approach Leveraging LIME Explanations for Improved Incremental Learning. (arXiv:2211.01413v1 [cs.LG])
    Explainability of neural network prediction is essential to understand feature importance and gain interpretable insight into neural network performance. In this work, model explanations are fed back to the feed-forward training to help the model generalize better. To this end, we propose a custom weighted loss where the weights are generated by considering the Euclidean distances between true LIME (Local Interpretable Model-Agnostic Explanations) explanations and model-predicted LIME explanations. Also, in practical training scenarios, developing a solution that can help the model learn sequentially without losing information on previous data distributions is imperative due to the unavailability of all the training data at once. Thus, the framework known as XAI-Increment incorporates the custom weighted loss with elastic weight consolidation (EWC) to maintain performance on sequential testing sets. Finally, the training procedure involving the custom weighted loss shows around 1% accuracy improvement compared to traditional loss based training for the keyword spotting task on the Google Speech Commands dataset and also shows low loss of information when coupled with EWC in the incremental learning setup.  ( 2 min )
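    An illustrative PyTorch sketch of such an explanation-weighted loss (the exact weighting scheme in the paper may differ; e_true and e_pred denote per-sample LIME explanation vectors):

        import torch
        import torch.nn.functional as F

        def lime_weighted_loss(logits, targets, e_true, e_pred):
            # cross-entropy reweighted by the Euclidean distance between
            # true and model-predicted LIME explanations: samples whose
            # explanations disagree more contribute more to the loss
            per_sample = F.cross_entropy(logits, targets, reduction="none")
            w = 1.0 + (e_true - e_pred).norm(dim=1)
            return (w * per_sample).mean()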
    Beyond the Best: Estimating Distribution Functionals in Infinite-Armed Bandits. (arXiv:2211.01743v1 [cs.LG])
    In the infinite-armed bandit problem, each arm's average reward is sampled from an unknown distribution, and each arm can be sampled further to obtain noisy estimates of the average reward of that arm. Prior work focuses on identifying the best arm, i.e., estimating the maximum of the average reward distribution. We consider a general class of distribution functionals beyond the maximum, and propose unified meta algorithms for both the offline and online settings, achieving optimal sample complexities. We show that online estimation, where the learner can sequentially choose whether to sample a new or existing arm, offers no advantage over the offline setting for estimating the mean functional, but significantly reduces the sample complexity for other functionals such as the median, maximum, and trimmed mean. The matching lower bounds utilize several different Wasserstein distances. For the special case of median estimation, we identify a curious thresholding phenomenon on the indistinguishability between Gaussian convolutions with respect to the noise level, which may be of independent interest.  ( 2 min )
    FedTP: Federated Learning by Transformer Personalization. (arXiv:2211.01572v1 [cs.LG])
    Federated learning is an emerging learning paradigm where multiple clients collaboratively train a machine learning model in a privacy-preserving manner. Personalized federated learning extends this paradigm to overcome heterogeneity across clients by learning personalized models. Recently, there have been some initial attempts to apply Transformers to federated learning. However, the impacts of federated learning algorithms on self-attention have not yet been studied. This paper investigates this relationship and reveals that federated averaging algorithms actually have a negative impact on self-attention when there is data heterogeneity. These impacts limit the capabilities of the Transformer model in federated learning settings. Based on this, we propose FedTP, a novel Transformer-based federated learning framework that learns personalized self-attention for each client while aggregating the other parameters among the clients. Instead of using a vanilla personalization mechanism that maintains personalized self-attention layers of each client locally, we develop a learn-to-personalize mechanism to further encourage cooperation among clients and to increase the scalability and generalization of FedTP. Specifically, the learn-to-personalize is realized by learning a hypernetwork on the server that outputs the personalized projection matrices of self-attention layers to generate client-wise queries, keys and values. Furthermore, we present the generalization bound for FedTP with the learn-to-personalize mechanism. Notably, FedTP offers a convenient environment for performing a range of image and language tasks using the same federated network architecture - all of which benefit from Transformer personalization. Extensive experiments verify that FedTP with the learn-to-personalize mechanism yields state-of-the-art performance in non-IID scenarios. Our code is available online.  ( 3 min )
    Isotropic Gaussian Processes on Finite Spaces of Graphs. (arXiv:2211.01689v1 [stat.ML])
    We propose a principled way to define Gaussian process priors on various sets of unweighted graphs: directed or undirected, with or without loops. We endow each of these sets with a geometric structure, inducing the notions of closeness and symmetries, by turning them into a vertex set of an appropriate metagraph. Building on this, we describe the class of priors that respect this structure and are analogous to the Euclidean isotropic processes, like squared exponential or Mat\'ern. We propose an efficient computational technique for the ostensibly intractable problem of evaluating these priors' kernels, making such Gaussian processes usable within the usual toolboxes and downstream applications. We go further to consider sets of equivalence classes of unweighted graphs and define the appropriate versions of priors thereon. We prove a hardness result, showing that in this case, exact kernel computation cannot be performed efficiently. However, we propose a simple Monte Carlo approximation for handling moderately sized cases. Inspired by applications in chemistry, we illustrate the proposed techniques on a real molecular property prediction task in the small data regime.  ( 2 min )
    Dual Generator Offline Reinforcement Learning. (arXiv:2211.01471v1 [cs.LG])
    In offline RL, constraining the learned policy to remain close to the data is essential to prevent the policy from outputting out-of-distribution (OOD) actions with erroneously overestimated values. In principle, generative adversarial networks (GAN) can provide an elegant solution to do so, with the discriminator directly providing a probability that quantifies distributional shift. However, in practice, GAN-based offline RL methods have not performed as well as alternative approaches, perhaps because the generator is trained to both fool the discriminator and maximize return -- two objectives that can be at odds with each other. In this paper, we show that the issue of conflicting objectives can be resolved by training two generators: one that maximizes return, with the other capturing the ``remainder'' of the data distribution in the offline dataset, such that the mixture of the two is close to the behavior policy. We show that not only does having two generators enable an effective GAN-based offline RL method, but also approximates a support constraint, where the policy does not need to match the entire data distribution, but only the slice of the data that leads to high long term performance. We name our method DASCO, for Dual-Generator Adversarial Support Constrained Offline RL. On benchmark tasks that require learning from sub-optimal data, DASCO significantly outperforms prior methods that enforce distribution constraint.  ( 2 min )
    Fair and Optimal Classification via Transports to Wasserstein-Barycenter. (arXiv:2211.01528v1 [cs.LG])
    Fairness in automated decision-making systems has gained increasing attention as their applications expand to real-world high-stakes domains. To facilitate the design of fair ML systems, it is essential to understand the potential trade-offs between fairness and predictive power, and the construction of the optimal predictor under a given fairness constraint. In this paper, for general classification problems under the group fairness criterion of demographic parity (DP), we precisely characterize the trade-off between DP and classification accuracy, referred to as the minimum cost of fairness. Our insight comes from the key observation that finding the optimal fair classifier is equivalent to solving a Wasserstein-barycenter problem under $\ell_1$-norm restricted to the vertices of the probability simplex. Inspired by our characterization, we provide a construction of an optimal fair classifier achieving this minimum cost via the composition of the Bayes regressor and optimal transports from its output distributions to the barycenter. Our construction naturally leads to an algorithm for post-processing any pre-trained predictor to satisfy DP fairness, complemented with finite sample guarantees. Experiments on real-world datasets verify and demonstrate the effectiveness of our approaches.  ( 2 min )
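    For intuition, a one-dimensional score-based sketch of the barycenter idea (the paper's construction handles general multiclass problems under the $\ell_1$-norm): each group's scores are transported onto the average of the group quantile functions, after which a single threshold satisfies demographic parity:

        import numpy as np

        def barycenter_postprocess(scores, groups):
            qs = np.linspace(0.0, 1.0, 101)
            # 1-D Wasserstein barycenter: average of group quantile functions
            bary = np.mean([np.quantile(scores[groups == g], qs)
                            for g in np.unique(groups)], axis=0)
            out = np.empty_like(scores, dtype=float)
            for g in np.unique(groups):
                m = groups == g
                ranks = np.argsort(np.argsort(scores[m])) / max(m.sum() - 1, 1)
                out[m] = np.interp(ranks, qs, bary)  # transport to barycenter
            return out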
    FUNCK: Information Funnels and Bottlenecks for Invariant Representation Learning. (arXiv:2211.01446v1 [cs.LG])
    Learning invariant representations that remain useful for a downstream task is still a key challenge in machine learning. We investigate a set of related information funnels and bottleneck problems that claim to learn invariant representations from the data. We also propose a new element to this family of information-theoretic objectives: The Conditional Privacy Funnel with Side Information, which we investigate in fully and semi-supervised settings. Given the generally intractable objectives, we derive tractable approximations using amortized variational inference parameterized by neural networks and study the intrinsic trade-offs of these objectives. We describe empirically the proposed approach and show that with a few labels it is possible to learn fair classifiers and generate useful representations approximately invariant to unwanted sources of variation. Furthermore, we provide insights about the applicability of these methods in real-world scenarios with ordinary tabular datasets when the data is scarce.
    ImageCAS: A Large-Scale Dataset and Benchmark for Coronary Artery Segmentation based on Computed Tomography Angiography Images. (arXiv:2211.01607v1 [eess.IV])
    Cardiovascular disease (CVD) accounts for about half of non-communicable diseases. Vessel stenosis in the coronary artery is considered to be the major risk factor for CVD. Computed tomography angiography (CTA) is one of the widely used noninvasive imaging modalities in coronary artery diagnosis due to its superior image resolution. Clinically, segmentation of coronary arteries is essential for the diagnosis and quantification of coronary artery disease. Recently, a variety of works have been proposed to address this problem. However, on the one hand, most works rely on in-house datasets, and only a few works have made their datasets public, and these contain only tens of images. On the other hand, their source code has not been published, and most follow-up works have not made comparisons with existing works, which makes it difficult to judge the effectiveness of the methods and hinders further exploration of this challenging yet critical problem in the community. In this paper, we propose a large-scale dataset for coronary artery segmentation on CTA images. In addition, we have implemented a benchmark in which we have tried our best to implement several typical existing methods. Furthermore, we propose a strong baseline method which combines multi-scale patch fusion and two-stage processing to extract the details of vessels. Comprehensive experiments show that the proposed method achieves better performance than existing works on the proposed large-scale dataset. The benchmark and the dataset are published at https://github.com/XiaoweiXu/ImageCAS-A-Large-Scale-Dataset-and-Benchmark-for-Coronary-Artery-Segmentation-based-on-CT.
    Expanding Accurate Person Recognition to New Altitudes and Ranges: The BRIAR Dataset. (arXiv:2211.01917v1 [cs.CV])
    Face recognition technology has advanced significantly in recent years due largely to the availability of large and increasingly complex training datasets for use in deep learning models. These datasets, however, typically comprise images scraped from news sites or social media platforms and, therefore, have limited utility in more advanced security, forensics, and military applications. These applications require lower resolution, longer ranges, and elevated viewpoints. To meet these critical needs, we collected and curated the first and second subsets of a large multi-modal biometric dataset designed for use in the research and development (R&D) of biometric recognition technologies under extremely challenging conditions. Thus far, the dataset includes more than 350,000 still images and over 1,300 hours of video footage of approximately 1,000 subjects. To collect this data, we used Nikon DSLR cameras, a variety of commercial surveillance cameras, specialized long-range R&D cameras, and Group 1 and Group 2 UAV platforms. The goal is to support the development of algorithms capable of accurately recognizing people at ranges up to 1,000 m and from high angles of elevation. These advances will include improvements to the state of the art in face recognition and will support new research in the area of whole-body recognition using methods based on gait and anthropometry. This paper describes the methods used to collect and curate the dataset, and the dataset's characteristics at the current stage.
    Feedback is Good, Active Feedback is Better: Block Attention Active Feedback Codes. (arXiv:2211.01730v1 [cs.IT])
    Deep neural network (DNN)-assisted channel coding designs, such as low-complexity neural decoders for existing codes or end-to-end neural-network-based auto-encoder designs, are gaining interest recently due to their improved performance and flexibility, particularly for communication scenarios in which high-performing structured code designs do not exist. Communication in the presence of feedback is one such scenario, and practical code design for feedback channels has remained an open challenge in coding theory for many decades. Recently, DNN-based designs have shown impressive results in exploiting feedback. In particular, generalized block attention feedback (GBAF) codes, which utilize the popular transformer architecture, achieved significant improvement in terms of block error rate (BLER) performance. However, previous works have focused mainly on passive feedback, where the transmitter observes a noisy version of the signal at the receiver. In this work, we show that GBAF codes can also be used for channels with active feedback. We implement a pair of transformer architectures, at the transmitter and the receiver, which interact with each other sequentially, and achieve a new state-of-the-art BLER performance, especially in the low SNR regime.
    The role of prior information and computational power in Machine Learning. (arXiv:2211.01972v1 [cs.LG])
    Science consists of conceiving hypotheses, confronting them with empirical evidence, and keeping only the hypotheses which have not yet been falsified. Under deductive reasoning, hypotheses are conceived in view of a theory and confronted with empirical evidence in an attempt to falsify it; under inductive reasoning, they are conceived based on observation, confronted with empirical evidence, and a theory is established based on the non-falsified hypotheses. When hypothesis testing can be performed with quantitative data, the confrontation can be achieved with Machine Learning methods, whose quality is highly dependent on the hypotheses' complexity, hence on the proper insertion of prior information into the set of hypotheses, seeking to decrease its complexity without losing good hypotheses. However, Machine Learning tools have been applied under the pragmatic view of instrumentalism, which is concerned only with the performance of the methods and not with the understanding of their behavior, leading to methods which are not fully understood. In this context, we discuss how prior information and computational power can be employed to solve a learning problem: while prior information and a careful design of the hypotheses space have the advantage of interpretable results, employing high computational power has the advantage of higher performance. We discuss why learning methods which combine both should work better from an understanding and performance perspective, arguing in favor of basic theoretical research on Machine Learning, in particular on how properties of classifiers may be identified in parameters of modern learning models.
    Leveraging Fully Observable Policies for Learning under Partial Observability. (arXiv:2211.01991v1 [cs.RO])
    Reinforcement learning in partially observable domains is challenging due to the lack of observable state information. Thankfully, learning offline in a simulator with such state information is often possible. In particular, we propose a method for partially observable reinforcement learning that uses a fully observable policy (which we call a state expert) during offline training to improve online performance. Based on Soft Actor-Critic (SAC), our agent balances performing actions similar to the state expert and getting high returns under partial observability. Our approach can leverage the fully-observable policy for exploration and parts of the domain that are fully observable while still being able to learn under partial observability. On six robotics domains, our method outperforms pure imitation, pure reinforcement learning, the sequential or parallel combination of both types, and a recent state-of-the-art method in the same setting. A successful policy transfer to a physical robot in a manipulation task from pixels shows our approach's practicality in learning interesting policies under partial observability.
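    As a rough illustration of the training signal described above (not the paper's exact objective), the sketch below mixes a discrete SAC-style actor loss with a behavior-cloning term toward the fully observable expert's actions; the networks, dimensions, and the mixing weight `beta` are all hypothetical.

```python
# Hedged sketch: balance a SAC-style actor objective against imitation of
# a fully observable "state expert" during offline training.
import torch
import torch.nn as nn
import torch.nn.functional as F

obs_dim, act_dim = 16, 4
policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                       nn.Linear(64, act_dim))        # logits over actions
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(),
                      nn.Linear(64, act_dim))         # Q(o, .)
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

def actor_update(obs, expert_actions, alpha=0.2, beta=0.5):
    """One SAC-style actor step plus an imitation term toward the expert."""
    log_probs = F.log_softmax(policy(obs), dim=-1)
    probs = log_probs.exp()
    with torch.no_grad():
        q = q_net(obs)
    # Discrete SAC actor objective: maximize E[Q] + alpha * entropy.
    sac_loss = (probs * (alpha * log_probs - q)).sum(-1).mean()
    # Imitation of the fully observable expert's actions.
    bc_loss = F.nll_loss(log_probs, expert_actions)
    loss = sac_loss + beta * bc_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

obs = torch.randn(32, obs_dim)
expert_a = torch.randint(0, act_dim, (32,))
print(actor_update(obs, expert_a))
```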
    Losses Can Be Blessings: Routing Self-Supervised Speech Representations Towards Efficient Multilingual and Multitask Speech Processing. (arXiv:2211.01522v1 [cs.LG])
    Self-supervised learning (SSL) for rich speech representations has achieved empirical success in low-resource Automatic Speech Recognition (ASR) and other speech processing tasks, which can mitigate the necessity of a large amount of transcribed speech and thus has driven a growing demand for on-device ASR and other speech processing. However, advanced speech SSL models have become increasingly large, which contradicts the limited on-device resources. This gap could be more severe in multilingual/multitask scenarios requiring simultaneous recognition of multiple languages or execution of multiple speech processing tasks. Additionally, strongly overparameterized speech SSL models tend to suffer from overfitting when finetuned on low-resource speech corpora. This work aims to enhance the practical usage of speech SSL models toward both enhanced efficiency and alleviated overfitting via our proposed S$^3$-Router framework, which for the first time discovers that simply discarding no more than 10% of model weights via finetuning only the model connections of speech SSL models can achieve better accuracy than standard weight finetuning on downstream speech processing tasks. More importantly, S$^3$-Router can serve as an all-in-one technique to enable (1) a new finetuning scheme, (2) an efficient multilingual/multitask solution, (3) a state-of-the-art ASR pruning technique, and (4) a new tool to quantitatively analyze the learned speech representation. We believe S$^3$-Router has provided a new perspective for practical deployment of speech SSL models. Our codes are available at: https://github.com/GATECH-EIC/S3-Router.
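    The connection-finetuning idea can be illustrated with a small assumed sketch: freeze the pretrained weights, train only a real-valued score per weight, and binarize the scores with a straight-through estimator so that roughly a target fraction of weights is discarded. This shows the mechanics only; it is not the S$^3$-Router implementation.

```python
# Illustrative sketch of finetuning *connections* instead of weights.
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    def __init__(self, linear, sparsity=0.1):
        super().__init__()
        # Pretrained weight stays frozen; only per-weight scores are trained.
        self.weight = nn.Parameter(linear.weight.detach(), requires_grad=False)
        self.scores = nn.Parameter(torch.randn_like(self.weight) * 0.01)
        self.sparsity = sparsity  # fraction of weights to drop

    def forward(self, x):
        k = int(self.scores.numel() * self.sparsity)
        thresh = self.scores.flatten().kthvalue(k).values if k > 0 else -float("inf")
        hard = (self.scores > thresh).float()
        # Straight-through estimator: forward uses the hard binary mask,
        # gradients flow to the real-valued scores.
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask)

layer = MaskedLinear(nn.Linear(8, 4), sparsity=0.1)
layer(torch.randn(2, 8)).sum().backward()
print(layer.scores.grad is not None)  # True: only the mask is trained
```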
    Subgoal-based Exploration via Bayesian Optimization. (arXiv:1910.09143v3 [math.OC] UPDATED)
    Policy optimization in unknown, sparse-reward environments with expensive and limited interactions is challenging, and poses a need for effective exploration. Motivated by complex navigation tasks that require real-world training (when cheap simulators are not available), we consider an agent that faces an unknown distribution of environments and must decide on an exploration strategy, through a series of training environments, that can benefit policy learning in a test environment drawn from the environment distribution. Most existing approaches focus on fixed exploration strategies, while the few that view exploration as a meta-optimization problem tend to ignore the need for cost-efficient exploration. We propose a cost-aware Bayesian optimization approach that efficiently searches over a class of dynamic subgoal-based exploration strategies. The algorithm adjusts a variety of levers -- the locations of the subgoals, the length of each episode, and the number of replications per trial -- in order to overcome the challenges of sparse rewards, expensive interactions, and noise. Our experimental evaluation demonstrates that, when averaged across problem domains, the proposed algorithm outperforms the meta-learning algorithm MAML by 19%, the hyperparameter tuning method Hyperband by 23%, BO techniques EI and LCB by 24% and 22%, respectively. We also provide a theoretical foundation and prove that the method asymptotically identifies a near-optimal subgoal design from the search space.
    Exploring the State-of-the-Art Language Modeling Methods and Data Augmentation Techniques for Multilingual Clause-Level Morphology. (arXiv:2211.01736v1 [cs.CL])
    This paper describes the KUIS-AI NLP team's submission for the 1$^{st}$ Shared Task on Multilingual Clause-level Morphology (MRL2022). We present our work on all three parts of the shared task: inflection, reinflection, and analysis. We mainly explore two approaches: Transformer models in combination with data augmentation, and exploiting state-of-the-art language modeling techniques for morphological analysis. Data augmentation leads to a remarkable performance improvement for most of the languages in the inflection task. Prefix-tuning on the pretrained mGPT model helps us adapt to the reinflection and analysis tasks in a low-data setting. Additionally, we build pipeline architectures based on publicly available open-source lemmatization tools and monolingual BERT-based morphological feature classifiers for the reinflection and analysis tasks, respectively. While Transformer architectures with data augmentation and pipeline architectures achieved the best results for the inflection and reinflection tasks, pipelines and prefix-tuning on mGPT achieved the best results for the analysis task. Our methods achieved first place in each of the three tasks and outperform the mT5 baseline by ~89% for inflection, ~80% for reinflection, and ~12% for analysis. Our code is publicly available at https://github.com/emrecanacikgoz/mrl2022.
    Relating graph auto-encoders to linear models. (arXiv:2211.01858v1 [cs.LG])
    Graph auto-encoders are widely used to construct graph representations in Euclidean vector spaces. However, it has already been pointed out empirically that linear models on many tasks can outperform graph auto-encoders. In our work, we prove that the solution space induced by graph auto-encoders is a subset of the solution space of a linear map. This demonstrates that linear embedding models have at least the representational power of graph auto-encoders based on graph convolutional networks. So why are we still using nonlinear graph auto-encoders? One reason could be that actively restricting the linear solution space might introduce an inductive bias that helps improve learning and generalization. While many researchers believe that the nonlinearity of the encoder is the critical ingredient towards this end, we instead identify the node features of the graph as a more powerful inductive bias. We give theoretical insights by introducing a corresponding bias in a linear model and analyzing the change in the solution space. Our experiments show that the linear encoder can outperform the nonlinear encoder when using feature information.
    A step towards a reinforcement learning de novo genome assembler. (arXiv:2102.02649v3 [q-bio.GN] UPDATED)
    The use of reinforcement learning has proven to be very promising for solving complex activities without human supervision during the learning process. However, its successful applications have predominantly focused on fictional and entertainment problems, such as games. This work aims to shed light on the application of reinforcement learning to a relevant real-world problem: genome assembly. Expanding the only approach found in the literature that addresses this problem, we carefully explored the aspects of intelligent agent learning, performed by the Q-learning algorithm, to understand its suitability for scenarios whose characteristics are more similar to those faced by real genome projects. The improvements proposed here include changing the previously proposed reward system and including state-space exploration optimization strategies based on dynamic pruning and mutual collaboration with evolutionary computing. These investigations were carried out on 23 new environments with larger inputs than those used previously. All these environments are freely available on the internet for the evolution of this research by the scientific community. The results suggest consistent performance progress using the proposed improvements; however, they also demonstrate their limitations, especially those related to the high dimensionality of the state and action spaces. Finally, we outline paths that can be followed to tackle genome assembly efficiently in real scenarios, considering recent successful reinforcement learning applications, including deep reinforcement learning, from other domains dealing with high-dimensional inputs.
    Neural network stochastic differential equation models with applications to financial data forecasting. (arXiv:2111.13164v6 [cs.LG] UPDATED)
    In this article, we employ a collection of stochastic differential equations with drift and diffusion coefficients approximated by neural networks to predict the trend of chaotic time series with large-jump properties. Our contributions are threefold. First, we propose a model called the L\'evy induced stochastic differential equation network, which explores compounded stochastic differential equations with $\alpha$-stable L\'evy motion to model complex time series data and solves the problem through neural network approximation. Second, we theoretically prove that the numerical solution through our algorithm converges in probability to the solution of the corresponding stochastic differential equation, without the curse of dimensionality. Finally, we illustrate our method by applying it to real financial time series data and find that the accuracy increases through the use of non-Gaussian L\'evy processes. We also present detailed comparisons in terms of data patterns, various models, different shapes of L\'evy motion, and prediction lengths.
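    A minimal neural-SDE sketch in this spirit: drift and diffusion are small networks and paths are simulated with Euler-Maruyama. For simplicity the driving noise below is Gaussian; the paper's model instead uses $\alpha$-stable L\'evy increments to capture large jumps.

```python
# Neural SDE with learned drift/diffusion, simulated by Euler-Maruyama.
# Gaussian increments are used here only for brevity.
import torch
import torch.nn as nn

class NeuralSDE(nn.Module):
    def __init__(self, dim=1, hidden=32):
        super().__init__()
        self.drift = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, dim))
        self.diffusion = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(),
                                       nn.Linear(hidden, dim), nn.Softplus())

    def simulate(self, x0, n_steps=100, dt=0.01):
        x, path = x0, [x0]
        for _ in range(n_steps):
            noise = torch.randn_like(x) * dt ** 0.5   # Brownian increment
            x = x + self.drift(x) * dt + self.diffusion(x) * noise
            path.append(x)
        return torch.stack(path)          # (n_steps+1, batch, dim)

model = NeuralSDE()
paths = model.simulate(torch.zeros(16, 1))
print(paths.shape)  # torch.Size([101, 16, 1])
```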
    CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. (arXiv:2207.01780v3 [cs.LG] UPDATED)
    Program synthesis or code generation aims to generate a program that satisfies a problem specification. Recent approaches using large-scale pretrained language models (LMs) have shown promising results, yet they have some critical limitations. In particular, they often follow a standard supervised fine-tuning procedure to train a code generation model only from the pairs of natural-language problem descriptions and ground-truth programs. Such a paradigm largely ignores some important but potentially useful signals in the problem specification, such as unit tests, which thus often results in poor performance when solving complex unseen coding tasks. To address the limitations, we propose "CodeRL", a new framework for program synthesis tasks through pretrained LMs and deep reinforcement learning (RL). Specifically, during training, we treat the code-generating LM as an actor network, and introduce a critic network that is trained to predict the functional correctness of generated programs and provide dense feedback signals to the actor. During inference, we introduce a new generation procedure with a critical sampling strategy that allows a model to automatically regenerate programs based on feedback from example unit tests and critic scores. For the model backbones, we extend the encoder-decoder architecture of CodeT5 with enhanced learning objectives, larger model sizes, and better pretraining data. Our method not only achieves new SOTA results on the challenging APPS benchmark, but also shows strong zero-shot transfer capability with new SOTA results on the simpler MBPP benchmark.
    lilGym: Natural Language Visual Reasoning with Reinforcement Learning. (arXiv:2211.01994v1 [cs.LG])
    We present lilGym, a new benchmark for language-conditioned reinforcement learning in visual environments. lilGym is based on 2,661 highly-compositional human-written natural language statements grounded in an interactive visual environment. We annotate all statements with executable Python programs representing their meaning to enable exact reward computation in every possible world state. Each statement is paired with multiple start states and reward functions to form thousands of distinct Markov Decision Processes of varying difficulty. We experiment with lilGym with different models and learning regimes. Our results and analysis show that while existing methods are able to achieve non-trivial performance, lilGym forms a challenging open problem. lilGym is available at https://lil.nlp.cornell.edu/lilgym/.
    Efficient Spatially Sparse Inference for Conditional GANs and Diffusion Models. (arXiv:2211.02048v1 [cs.CV])
    During image editing, existing deep generative models tend to re-synthesize the entire output from scratch, including the unedited regions. This leads to a significant waste of computation, especially for minor editing operations. In this work, we present Spatially Sparse Inference (SSI), a general-purpose technique that selectively performs computation for edited regions and accelerates various generative models, including both conditional GANs and diffusion models. Our key observation is that users tend to make gradual changes to the input image. This motivates us to cache and reuse the feature maps of the original image. Given an edited image, we sparsely apply the convolutional filters to the edited regions while reusing the cached features for the unedited regions. Based on our algorithm, we further propose the Sparse Incremental Generative Engine (SIGE) to convert the computation reduction to latency reduction on off-the-shelf hardware. With edited regions covering 1.2% of the image area, our method reduces the computation of DDIM by 7.5$\times$ and GauGAN by 18$\times$ while preserving the visual fidelity. With SIGE, we accelerate DDIM by 3.0$\times$ on an RTX 3090 and 6.6$\times$ on an Apple M1 Pro CPU, and GauGAN by 4.2$\times$ on an RTX 3090 and 14$\times$ on an Apple M1 Pro CPU.
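    A toy version of the caching idea helps make this concrete: detect which tiles of the edited input differ from the cached original (including a halo for the receptive field), recompute the convolution only there, and reuse cached activations elsewhere. This is a simplified CPU sketch of the general technique, not SIGE's optimized kernels.

```python
# Incremental convolution: reuse cached outputs for unedited tiles.
import torch
import torch.nn.functional as F

def incremental_conv(new_x, old_x, old_y, weight, tile=8):
    pad = weight.shape[-1] // 2
    y = old_y.clone()
    diff = (new_x - old_x).abs().amax(dim=1, keepdim=True)  # (B,1,H,W)
    _, _, H, W = diff.shape
    for i in range(0, H, tile):
        for j in range(0, W, tile):
            # Check the tile plus a halo covering the receptive field.
            i0, j0 = max(i - pad, 0), max(j - pad, 0)
            i1, j1 = min(i + tile + pad, H), min(j + tile + pad, W)
            if diff[:, :, i0:i1, j0:j1].max() == 0:
                continue                  # tile and halo unedited: reuse cache
            patch = F.conv2d(new_x[:, :, i0:i1, j0:j1], weight, padding=pad)
            y[:, :, i:i+tile, j:j+tile] = patch[:, :, i-i0:i-i0+tile,
                                                j-j0:j-j0+tile]
    return y

w = torch.randn(4, 3, 3, 3)
old = torch.randn(1, 3, 32, 32)
new = old.clone(); new[:, :, 4:8, 4:8] += 1.0       # a small edit
cached = F.conv2d(old, w, padding=1)
out = incremental_conv(new, old, cached, w)
assert torch.allclose(out, F.conv2d(new, w, padding=1), atol=1e-5)
```

    The final assertion checks that the patched output exactly matches a full recomputation; the halo test is what keeps tiles bordering an edit correct.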
    Reliable Off-policy Evaluation for Reinforcement Learning. (arXiv:2011.04102v3 [cs.LG] UPDATED)
    In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy using logged trajectory data generated from a different behavior policy, without execution of the target policy. Reinforcement learning in high-stake environments, such as healthcare and education, is often limited to off-policy settings due to safety or ethical concerns, or the infeasibility of exploration. Hence it is imperative to quantify the uncertainty of the off-policy estimate before deployment of the target policy. In this paper, we propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged trajectories. Leveraging methodologies from distributionally robust optimization, we show that with proper selection of the size of the distributional uncertainty set, these estimates serve as confidence bounds with non-asymptotic and asymptotic guarantees under stochastic or adversarial environments. Our results are also generalized to batch reinforcement learning and are supported by empirical analysis.
    On the Adversarial Robustness of Vision Transformers. (arXiv:2103.15670v3 [cs.CV] UPDATED)
    Following the success in advancing natural language processing and understanding, transformers are expected to bring revolutionary changes to computer vision. This work provides a comprehensive study of the robustness of vision transformers (ViTs) against adversarial perturbations. Tested in various white-box and transfer attack settings, we find that ViTs possess better adversarial robustness when compared with MLP-Mixer and convolutional neural networks (CNNs) including ConvNeXt, and this observation also holds for certified robustness. Through frequency analysis and feature visualization, we summarize the following main observations contributing to the improved robustness of ViTs: 1) Features learned by ViTs contain fewer high-frequency patterns with spurious correlations, which helps explain why ViTs are less sensitive to high-frequency perturbations than CNNs and MLP-Mixer; moreover, there is a high correlation between how much a model learns high-frequency features and its robustness against different frequency-based perturbations. 2) Introducing convolutional or tokens-to-token blocks for learning high-frequency features in ViTs can improve classification accuracy, but at the cost of adversarial robustness. 3) Modern CNN designs that borrow techniques from ViTs, including activation functions, layer norm, larger kernel sizes to imitate global attention, and patchifying the images as inputs, could help bridge the gap between ViTs and CNNs not only in clean performance but also in certified and empirical adversarial robustness. Moreover, we show that adversarial training is also applicable to ViTs for training robust models, and that sharpness-aware minimization can further improve robustness, while pre-training with clean images on larger datasets does not significantly improve adversarial robustness.
    Evaluating a Synthetic Image Dataset Generated with Stable Diffusion. (arXiv:2211.01777v1 [cs.CV])
    We generate synthetic images with the "Stable Diffusion" image generation model using the Wordnet taxonomy and the definitions of concepts it contains. This synthetic image database can be used as training data for data augmentation in machine learning applications, and it is used to investigate the capabilities of the Stable Diffusion model. Analyses show that Stable Diffusion can produce correct images for a large number of concepts, but also a large variety of different representations. The results show differences depending on the test concepts considered and problems with very specific concepts. These evaluations were performed using a vision transformer model for image classification.
    Improved Analysis of Score-based Generative Modeling: User-Friendly Bounds under Minimal Smoothness Assumptions. (arXiv:2211.01916v1 [cs.LG])
    In this paper, we focus on the theoretical analysis of diffusion-based generative modeling. Under an $L^2$-accurate score estimator, we provide convergence guarantees with polynomial complexity for any data distribution with a finite second-order moment, by either employing an early stopping technique or assuming a smoothness condition on the score function of the data distribution. Our result does not rely on any log-concavity or functional inequality assumption and has a logarithmic dependence on the smoothness. In particular, we show that under only a finite second moment condition, approximating the following in KL divergence to $\epsilon$-accuracy can be done in $\tilde O\left(\frac{d^2 \log^2 (1/\delta)}{\epsilon^2}\right)$ steps: 1) the variance-$\delta$ Gaussian perturbation of any data distribution; 2) data distributions with $1/\delta$-smooth score functions. Our theoretical analysis also provides a quantitative comparison between different discrete approximations and may guide the choice of discretization points in practice.
    Machine Learning Methods for Device Identification Using Wireless Fingerprinting. (arXiv:2211.01963v1 [cs.LG])
    Industrial Internet of Things (IoT) systems increasingly rely on wireless communication standards. In a common industrial scenario, indoor wireless IoT devices communicate with access points to deliver data collected from industrial sensors, robots, and factory machines. Due to the static or quasi-static locations of IoT devices and access points, historical observations of IoT device channel conditions make it possible to precisely identify a device without observing its traditional identifiers (e.g., MAC or IP address). Such device identification methods based on wireless fingerprinting have lately gained increased attention as an additional cyber-security mechanism for critical IoT infrastructures. In this paper, we perform a systematic study of a large class of machine learning algorithms for device identification using wireless fingerprints for the most popular cellular and Wi-Fi IoT technologies. As part of a complete end-to-end solution for device identification via wireless fingerprinting, we design, implement, and deploy the system, collect relevant data sets, and train and test a multitude of machine learning algorithms. The proposed solution is currently being deployed in a real-world industrial IoT environment as part of the H2020 project COLLABS.
    The Lottery Ticket Hypothesis for Vision Transformers. (arXiv:2211.01484v1 [cs.CV])
    The conventional lottery ticket hypothesis (LTH) claims that there exists a sparse subnetwork within a dense neural network, together with a proper random initialization method, called the winning ticket, such that it can be trained from scratch to perform almost as well as the dense counterpart. Meanwhile, LTH has scarcely been studied for vision transformers (ViTs). In this paper, we first show that the conventional winning ticket is hard to find at the weight level of ViTs by existing methods. Then, we generalize the LTH for ViTs to input images consisting of image patches, inspired by the input dependence of ViTs. That is, there exists a subset of input image patches such that a ViT can be trained from scratch using only this subset of patches and achieve accuracy similar to ViTs trained using all image patches. We call this subset of input patches the winning tickets, which represent a significant amount of the information in the input. Furthermore, we present a simple yet effective method to find the winning tickets in input patches for various types of ViT, including DeiT, LV-ViT, and Swin Transformers. More specifically, we use a ticket selector to generate the winning tickets based on the informativeness of patches. Meanwhile, we build another randomly selected subset of patches for comparison, and the experiments show that there is a clear difference between the performance of models trained with winning tickets and with randomly selected subsets.  ( 3 min )
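    A minimal sketch of the patch-ticket idea, with patch-embedding norm as a stand-in informativeness score for the paper's learned ticket selector:

```python
# Select a "winning ticket" subset of input patches by an informativeness
# score (here simply token norm, an assumed stand-in) before a ViT.
import torch

def select_patch_tickets(patch_tokens, keep_ratio=0.5):
    """patch_tokens: (batch, num_patches, dim) embedded image patches."""
    scores = patch_tokens.norm(dim=-1)                  # (B, N) informativeness
    k = max(1, int(patch_tokens.shape[1] * keep_ratio))
    idx = scores.topk(k, dim=1).indices                 # winning-ticket patches
    idx_exp = idx.unsqueeze(-1).expand(-1, -1, patch_tokens.shape[-1])
    return patch_tokens.gather(1, idx_exp)              # (B, k, dim)

tokens = torch.randn(2, 196, 768)   # e.g., 14x14 patches from a 224px image
winning = select_patch_tickets(tokens, keep_ratio=0.25)
print(winning.shape)  # torch.Size([2, 49, 768])
```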
    Cross-stitching Text and Knowledge Graph Encoders for Distantly Supervised Relation Extraction. (arXiv:2211.01432v1 [cs.CL])
    Bi-encoder architectures for distantly-supervised relation extraction are designed to make use of the complementary information found in text and knowledge graphs (KG). However, current architectures suffer from two drawbacks. They either do not allow any sharing between the text encoder and the KG encoder at all, or, in case of models with KG-to-text attention, only share information in one direction. Here, we introduce cross-stitch bi-encoders, which allow full interaction between the text encoder and the KG encoder via a cross-stitch mechanism. The cross-stitch mechanism allows sharing and updating representations between the two encoders at any layer, with the amount of sharing being dynamically controlled via cross-attention-based gates. Experimental results on two relation extraction benchmarks from two different domains show that enabling full interaction between the two encoders yields strong improvements.  ( 2 min )
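    A rough sketch of what such a cross-stitch layer could look like, with illustrative shapes and a per-token sigmoid gate (the published architecture's exact gating and layer placement may differ):

```python
# Cross-stitch between a text stream and a KG stream: each attends to the
# other, and a learned gate controls how much is mixed in.
import torch
import torch.nn as nn

class CrossStitch(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.txt2kg = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.kg2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate_t = nn.Linear(dim, 1)   # per-token gate for the text stream
        self.gate_k = nn.Linear(dim, 1)   # per-token gate for the KG stream

    def forward(self, txt, kg):
        from_kg, _ = self.kg2txt(txt, kg, kg)    # text queries attend to KG
        from_txt, _ = self.txt2kg(kg, txt, txt)  # KG queries attend to text
        g_t = torch.sigmoid(self.gate_t(txt))
        g_k = torch.sigmoid(self.gate_k(kg))
        return txt + g_t * from_kg, kg + g_k * from_txt

layer = CrossStitch()
txt, kg = torch.randn(2, 20, 64), torch.randn(2, 5, 64)
txt2, kg2 = layer(txt, kg)
print(txt2.shape, kg2.shape)  # (2, 20, 64) and (2, 5, 64)
```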
    Improved Inapproximability of VC Dimension and Littlestone's Dimension via (Unbalanced) Biclique. (arXiv:2211.01443v1 [cs.CC])
    We study the complexity of computing (and approximating) VC Dimension and Littlestone's Dimension when we are given the concept class explicitly. We give a simple reduction from Maximum (Unbalanced) Biclique problem to approximating VC Dimension and Littlestone's Dimension. With this connection, we derive a range of hardness of approximation results and running time lower bounds. For example, under the (randomized) Gap-Exponential Time Hypothesis or the Strongish Planted Clique Hypothesis, we show a tight inapproximability result: both dimensions are hard to approximate to within a factor of $o(\log n)$ in polynomial-time. These improve upon constant-factor inapproximability results from [Manurangsi and Rubinstein, COLT 2017].  ( 2 min )
    StereoPose: Category-Level 6D Transparent Object Pose Estimation from Stereo Images via Back-View NOCS. (arXiv:2211.01644v1 [cs.RO])
    Most existing methods for category-level pose estimation rely on object point clouds. However, when considering transparent objects, depth cameras are usually not able to capture meaningful data, resulting in point clouds with severe artifacts. Without a high-quality point cloud, existing methods are not applicable to challenging transparent objects. To tackle this problem, we present StereoPose, a novel stereo image framework for category-level object pose estimation, ideally suited for transparent objects. For a robust estimation from pure stereo images, we develop a pipeline that decouples category-level pose estimation into object size estimation, initial pose estimation, and pose refinement. StereoPose then estimates object pose based on representation in the normalized object coordinate space (NOCS). To address the issue of image content aliasing, we further define a back-view NOCS map for the transparent object. The back-view NOCS aims to reduce the network learning ambiguity caused by content aliasing, and leverage informative cues on the back of the transparent object for more accurate pose estimation. To further improve the performance of the stereo framework, StereoPose is equipped with a parallax attention module for stereo feature fusion and an epipolar loss for improving the stereo-view consistency of network predictions. Extensive experiments on the public TOD dataset demonstrate the superiority of the proposed StereoPose framework for category-level 6D transparent object pose estimation.  ( 3 min )
    PI is back! Switching Acquisition Functions in Bayesian Optimization. (arXiv:2211.01455v1 [cs.LG])
    Bayesian Optimization (BO) is a powerful, sample-efficient technique to optimize expensive-to-evaluate functions. Each of the BO components, such as the surrogate model, the acquisition function (AF), or the initial design, is subject to a wide range of design choices. Selecting the right components for a given optimization task is challenging and can have a significant impact on the quality of the obtained results. In this work, we initiate the analysis of which AF to favor for which optimization scenarios. To this end, we benchmark SMAC3 using Expected Improvement (EI) and Probability of Improvement (PI) as acquisition functions on the 24 BBOB functions of the COCO environment. We compare their results with those of schedules switching between AFs. One schedule aims to use EI's explorative behavior in the early optimization steps and then switches to PI for better exploitation in the final steps. We also compare this to a random schedule and round-robin selection of EI and PI. We observe that dynamic schedules oftentimes outperform any single static one. Our results suggest that a schedule that allocates the first 25% of the optimization budget to EI and the last 75% to PI is a reliable default. However, we also observe considerable performance differences across the 24 functions, suggesting that a per-instance allocation, possibly learned on the fly, could offer significant improvement over the state-of-the-art BO designs.  ( 3 min )
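    The 25%/75% EI-to-PI schedule is easy to sketch on a toy 1-D maximization problem. The snippet below uses scikit-learn's Gaussian process rather than SMAC3, so it illustrates the schedule, not the benchmarked setup.

```python
# Switching acquisition functions mid-run: EI first, then PI.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def acquisition(mu, sigma, f_best, kind, xi=0.01):
    sigma = np.maximum(sigma, 1e-9)
    z = (mu - f_best - xi) / sigma
    if kind == "EI":
        return (mu - f_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)
    return norm.cdf(z)                          # PI

def f(x):                                       # toy objective to maximize
    return -(x - 0.6) ** 2 + 0.1 * np.sin(20 * x)

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (3, 1)); y = f(X).ravel()
budget, grid = 20, np.linspace(0, 1, 501).reshape(-1, 1)
for t in range(budget):
    kind = "EI" if t < 0.25 * budget else "PI"  # the 25%/75% schedule
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    mu, sigma = gp.predict(grid, return_std=True)
    x_next = grid[np.argmax(acquisition(mu, sigma, y.max(), kind))]
    X = np.vstack([X, [x_next]]); y = np.append(y, f(x_next))
print("best x:", X[np.argmax(y)].item(), "best f:", y.max())
```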
    Zero-Sum Games with Noisy Observations. (arXiv:2211.01703v1 [cs.GT])
    In this paper, $2 \times 2$ zero-sum games (ZSGs) are studied under the following assumptions: (1) one of the players (the leader) publicly and irrevocably commits to choose its actions by sampling a given probability measure (strategy); (2) the leader announces its action, which is observed by its opponent (the follower) through a binary channel; and (3) the follower chooses its strategy based on the knowledge of the leader's strategy and the noisy observation of the leader's action. Under these conditions, the equilibrium is shown to always exist and to often differ from the Nash and Stackelberg equilibria. Even subject to noise, observing the actions of the leader is either beneficial or immaterial to the follower for all possible commitments. When the commitment is observed subject to a distortion, the equilibrium does not necessarily exist. Nonetheless, the leader might still obtain some benefit in some specific cases subject to equilibrium refinements. For instance, $\epsilon$-equilibria might exist in which the leader commits to suboptimal strategies that allow unequivocally predicting the best response of its opponent.  ( 2 min )
    Crime Prediction using Machine Learning with a Novel Crime Dataset. (arXiv:2211.01551v1 [cs.LG])
    Crime is an unlawful act that carries legal repercussions. Bangladesh has a high crime rate due to poverty, population growth, and many other socio-economic issues. For law enforcement agencies, understanding crime patterns is essential for preventing future criminal activity, and for this purpose these agencies need a structured crime database. This paper introduces a novel crime dataset that contains temporal, geographic, weather, and demographic data about 6,574 crime incidents in Bangladesh. We manually gathered crime news articles spanning seven years from a daily newspaper archive and extracted basic features from the raw text. Using these basic features, we then consulted standard geo-location and weather data providers to gather this information for the collected crime incidents. Furthermore, we collected demographic information from Bangladesh National Census data. All this information is combined into a standard machine learning dataset, with 36 features engineered for the crime prediction task. Five supervised machine learning classification algorithms are then evaluated on this newly built dataset, achieving satisfactory results. We also conduct exploratory analysis of various aspects of the dataset. This dataset is expected to serve as the foundation for crime incidence prediction systems for Bangladesh and other countries. The findings of this study will help law enforcement agencies to forecast and contain crime as well as to ensure optimal resource allocation for crime patrol and prevention.  ( 3 min )
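    As an illustration of the kind of supervised pipeline evaluated on such a dataset, here is a hedged sketch with placeholder column names and a placeholder target; the actual 36 engineered features and the five evaluated classifiers are described in the paper, not reproduced here.

```python
# Mixed temporal/geographic/weather/demographic features into a classifier.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Stand-in rows; the real dataset has 6,574 incidents and 36 features.
df = pd.DataFrame({
    "hour": [2, 14, 23, 9, 1, 17],
    "district": ["Dhaka", "Khulna", "Dhaka", "Sylhet", "Khulna", "Dhaka"],
    "temp_c": [31.0, 28.5, 25.0, 30.2, 24.0, 29.1],
    "pop_density": [47400, 1100, 47400, 900, 1100, 47400],
    "crime_type": ["robbery", "theft", "assault", "theft", "robbery", "assault"],
})
num, cat = ["hour", "temp_c", "pop_density"], ["district"]
pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("num", StandardScaler(), num),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat),
    ])),
    ("clf", RandomForestClassifier(n_estimators=100, random_state=0)),
])
print(cross_val_score(pipe, df[num + cat], df["crime_type"], cv=2).mean())
```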
    End-to-end deep multi-score model for No-reference stereoscopic image quality assessment. (arXiv:2211.01374v1 [eess.IV])
    Deep learning-based quality metrics have recently given significant improvement in Image Quality Assessment (IQA). In the field of stereoscopic vision, information is evenly distributed with slight disparity to the left and right eyes. However, due to asymmetric distortion, the objective quality ratings for the left and right images would differ, necessitating the learning of unique quality indicators for each view. Unlike existing stereoscopic IQA measures which focus mainly on estimating a global human score, we suggest incorporating left, right, and stereoscopic objective scores to extract the corresponding properties of each view, thereby estimating stereoscopic image quality without reference. To this end, we use a deep multi-score Convolutional Neural Network (CNN). Our model has been trained to perform four tasks: first, predict the quality of the left view; second, predict the quality of the right view; third and fourth, predict the quality of the stereo view and the global quality, respectively, with the global score serving as the ultimate quality. Experiments are conducted on the Waterloo IVC 3D Phase 1 and Phase 2 databases. The results obtained show the superiority of our method when compared with the state of the art. The implementation code can be found at: https://github.com/o-messai/multi-score-SIQA  ( 2 min )
    Towards Discovering Neural Architectures from Scratch. (arXiv:2211.01842v1 [cs.LG])
    The discovery of neural architectures from scratch is the long-standing goal of Neural Architecture Search (NAS). Searching over a wide spectrum of neural architectures can facilitate the discovery of previously unconsidered but well-performing architectures. In this work, we take a large step towards discovering neural architectures from scratch by expressing architectures algebraically. This algebraic view leads to a more general method for designing search spaces, which allows us to compactly represent search spaces that are 100s of orders of magnitude larger than common spaces from the literature. Further, we propose a Bayesian Optimization strategy to efficiently search over such huge spaces, and demonstrate empirically that both our search space design and our search strategy can be superior to existing baselines. We open source our algebraic NAS approach and provide APIs for PyTorch and TensorFlow.
    A Posterior Sampling Framework for Interactive Decision Making. (arXiv:2211.01962v1 [cs.LG])
    We study sample efficient reinforcement learning (RL) under the general framework of interactive decision making, which includes Markov decision process (MDP), partially observable Markov decision process (POMDP), and predictive state representation (PSR) as special cases. Toward finding the minimum assumption that empowers sample efficient learning, we propose a novel complexity measure, the generalized eluder coefficient (GEC), which characterizes the fundamental tradeoff between exploration and exploitation in online interactive decision making. Specifically, GEC captures the hardness of exploration by comparing the error of predicting the performance of the updated policy with the in-sample training error evaluated on the historical data. We show that RL problems with low GEC form a remarkably rich class, which subsumes low Bellman eluder dimension problems, bilinear class, low witness rank problems, PO-bilinear class, and generalized regular PSR, where generalized regular PSR, a new tractable PSR class identified by us, includes nearly all known tractable POMDPs. Furthermore, in terms of algorithm design, we propose a generic posterior sampling algorithm, which can be implemented in both a model-free and a model-based fashion, under both fully observable and partially observable settings. The proposed algorithm modifies the standard posterior sampling algorithm in two aspects: (i) we use an optimistic prior distribution that biases towards hypotheses with higher values, and (ii) the loglikelihood function is set to be the empirical loss evaluated on the historical data, where the choice of loss function supports both model-free and model-based learning. We prove that the proposed algorithm is sample efficient by establishing a sublinear regret upper bound in terms of GEC. In summary, we provide a new and unified understanding of both fully observable and partially observable RL.
    From Spelling to Grammar: A New Framework for Chinese Grammatical Error Correction. (arXiv:2211.01625v1 [cs.CL])
    Chinese Grammatical Error Correction (CGEC) aims to generate a correct sentence from an erroneous sequence, where different kinds of errors are mixed. This paper divides the CGEC task into two steps, namely spelling error correction and grammatical error correction. Specifically, we propose a novel zero-shot approach for spelling error correction, which is simple but effective, obtaining high precision to avoid error accumulation in the pipeline structure. To handle grammatical error correction, we design part-of-speech (POS) features and semantic class features to enhance the neural network model, and propose an auxiliary task to predict the POS sequence of the target sentence. Our proposed framework achieves a 42.11 F0.5 score on the CGEC dataset without using any synthetic data or data augmentation methods, outperforming the previous state of the art by a wide margin of 1.30 points. Moreover, our model produces meaningful POS representations that capture different POS words and convey reasonable POS transition rules.  ( 2 min )
    Looking Beyond IoCs: Automatically Extracting Attack Patterns from External CTI. (arXiv:2211.01753v1 [cs.CR])
    Public and commercial companies extensively share cyber threat intelligence (CTI) to prepare systems to defend against emerging cyberattacks. Most intelligence used thus far has been limited to tracking known threat indicators such as IP addresses and domain names, as they are easy to extract using regular expressions. Because indicators have limited long-term usefulness and are difficult to analyze over long periods, we propose using significantly more robust threat intelligence signals called attack patterns. However, extracting attack patterns at scale is a challenging task. In this paper, we present LADDER, a knowledge extraction framework that can extract text-based attack patterns from CTI reports at scale. The model characterizes attack patterns by capturing the phases of an attack in Android and enterprise networks, and systematically maps them to the MITRE ATT&CK framework. We present several use cases to demonstrate the application of LADDER for SOC analysts in determining the presence of attack vectors belonging to emerging attacks in preparation for defenses in advance.
    fMRI from EEG is only Deep Learning away: the use of interpretable DL to unravel EEG-fMRI relationships. (arXiv:2211.02024v1 [physics.med-ph])
    Access to the activity of subcortical structures offers a unique opportunity for building intention-dependent brain-computer interfaces, provides abundant options for exploring a broad range of cognitive phenomena in affective neuroscience, including complex decision-making processes and the eternal free-will dilemma, and facilitates the diagnostics of a range of neurological diseases. So far, this was possible only using bulky, expensive, and immobile fMRI equipment. Here we present an interpretable, domain-grounded solution to recover the activity of several subcortical regions from multichannel EEG data and demonstrate up to 60% correlation between the actual subcortical blood oxygenation level dependent (sBOLD) signal and its EEG-derived twin. Then, using a novel and theoretically justified weight interpretation methodology, we recover individual spatial and time-frequency patterns of scalp EEG predictive of the hemodynamic signal in the subcortical nuclei. The described results not only pave the road towards wearable subcortical activity scanners but also showcase an automatic knowledge discovery process facilitated by deep learning technology in combination with an interpretable, domain-constrained architecture and the appropriate downstream task.
    FedGen: Generalizable Federated Learning. (arXiv:2211.01914v1 [cs.LG])
    Existing federated learning models that follow the standard risk minimization paradigm of machine learning often fail to generalize in the presence of spurious correlations in the training data. In many real-world distributed settings, spurious correlations exist due to biases and data sampling issues on distributed devices or clients that can erroneously influence models. Current generalization approaches are designed for centralized training and attempt to identify features that have an invariant causal relationship with the target, thereby reducing the effect of spurious features. However, such invariant risk minimization approaches rely on a priori knowledge of training data distributions, which is hard to obtain in many applications. In this work, we present a generalizable federated learning framework called FedGen, which allows clients to identify and distinguish between spurious and invariant features in a collaborative manner without prior knowledge of training distributions. We evaluate our approach on real-world datasets from different domains and show that FedGen results in models that achieve significantly better generalization than current federated learning approaches.
    Phase Transitions in Learning and Earning under Price Protection Guarantee. (arXiv:2211.01798v1 [stat.ML])
    Motivated by the prevalence of the "price protection guarantee", which allows a customer who purchased a product in the past to receive a refund from the seller during the so-called price protection period (typically defined as a certain time window after the purchase date) in case the seller decides to lower the price, we study the impact of such a policy on the design of online learning algorithms for data-driven dynamic pricing with initially unknown customer demand. We consider a setting where a firm sells a product over a horizon of $T$ time steps. For this setting, we characterize how the value of $M$, the length of the price protection period, can affect the optimal regret of the learning process. We show that the optimal regret is $\tilde{\Theta}(\sqrt{T}+\min\{M,\,T^{2/3}\})$ by first establishing a fundamental impossible regime with novel regret lower bound instances. Then, we propose LEAP, a phased-exploration-type algorithm for Learning and EArning under Price protection, to match this lower bound up to logarithmic factors or even doubly logarithmic factors (when there are only two prices available to the seller). Our results reveal the surprising phase transitions of the optimal regret with respect to $M$. Specifically, when $M$ is not too large, the optimal regret has no major difference when compared to that of the classic setting with no price protection guarantee. We also show that there exists an upper limit on how much the optimal regret can deteriorate when $M$ grows large. Finally, we conduct extensive numerical experiments to show the benefit of LEAP over other heuristic methods for this problem.  ( 3 min )
    RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question. (arXiv:2211.01482v1 [cs.CL])
    Existing metrics for evaluating the quality of automatically generated questions such as BLEU, ROUGE, BERTScore, and BLEURT compare the reference and predicted questions, providing a high score when there is a considerable lexical overlap or semantic similarity between the candidate and the reference questions. This approach has two major shortcomings. First, we need expensive human-provided reference questions. Second, it penalises valid questions that may not have high lexical or semantic similarity to the reference questions. In this paper, we propose a new metric, RQUGE, based on the answerability of the candidate question given the context. The metric consists of a question-answering and a span scorer module, in which we use pre-trained models from the existing literature, and therefore, our metric can be used without further training. We show that RQUGE has a higher correlation with human judgment without relying on the reference question. RQUGE is shown to be significantly more robust to several adversarial corruptions. Additionally, we illustrate that we can significantly improve the performance of QA models on out-of-domain datasets by fine-tuning on the synthetic data generated by a question generation model and re-ranked by RQUGE.
    Crosslingual Generalization through Multitask Finetuning. (arXiv:2211.01786v1 [cs.CL])
    Multitask prompted finetuning (MTF) has been shown to help large language models generalize to new tasks in a zero-shot setting, but so far explorations of MTF have focused on English data and models. We apply MTF to the pretrained multilingual BLOOM and mT5 model families to produce finetuned variants called BLOOMZ and mT0. We find finetuning large multilingual language models on English tasks with English prompts allows for task generalization to non-English languages that appear only in the pretraining corpus. Finetuning on multilingual tasks with English prompts further improves performance on English and non-English tasks leading to various state-of-the-art zero-shot results. We also investigate finetuning on multilingual tasks with prompts that have been machine-translated from English to match the language of each dataset. We find training on these machine-translated prompts leads to better performance on human-written prompts in the respective languages. Surprisingly, we find models are capable of zero-shot generalization to tasks in languages they have never intentionally seen. We conjecture that the models are learning higher-level capabilities that are both task- and language-agnostic. In addition, we introduce xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts. Our code, datasets and models are publicly available at https://github.com/bigscience-workshop/xmtf.
    Uncertainty Quantification for Rule-Based Models. (arXiv:2211.01915v1 [cs.AI])
    Rule-based classification models described in the language of logic directly predict boolean values, rather than modeling a probability and translating it into a prediction as done in statistical models. The vast majority of existing uncertainty quantification approaches rely on models providing continuous output, which is not available for rule-based models. In this work, we propose an uncertainty quantification framework in the form of a meta-model that takes any binary classifier with binary output as a black box and estimates the prediction accuracy of that base model at a given input along with a level of confidence on that estimation. The confidence is based on how well that input region has been explored and is designed to work in any out-of-distribution (OOD) scenario. We demonstrate the usefulness of this uncertainty model by building an abstaining classifier powered by it and observing its performance in various scenarios.
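    A toy sketch of this wrapper pattern, under assumed mechanics: the meta-model estimates the base model's local accuracy from held-out data and derives confidence from a k-NN distance proxy for how well a region was explored, abstaining when either is low. The class, the density proxy, and the thresholds are hypothetical illustrations, not the paper's meta-model.

```python
# Abstaining wrapper around a black-box binary "rule" model.
import numpy as np
from sklearn.neighbors import NearestNeighbors

class AbstainingWrapper:
    def __init__(self, base_predict, X_val, y_val, k=5, conf_thresh=0.05):
        self.base_predict = base_predict
        self.correct = (base_predict(X_val) == y_val).astype(float)
        self.nn = NearestNeighbors(n_neighbors=k).fit(X_val)
        # Typical k-th neighbor distance sets the confidence scale.
        self.scale = np.median(self.nn.kneighbors(X_val)[0][:, -1]) + 1e-12
        self.thresh = conf_thresh

    def predict(self, X):
        dist, idx = self.nn.kneighbors(X)
        acc_est = self.correct[idx].mean(axis=1)       # local accuracy of base
        conf = np.exp(-dist[:, -1] / self.scale)       # low in unexplored regions
        preds = self.base_predict(X).astype(object)
        preds[(conf < self.thresh) | (acc_est < 0.5)] = "abstain"
        return preds

rule = lambda X: (X[:, 0] > 0).astype(int)             # toy rule-based model
rng = np.random.default_rng(0)
Xv = rng.normal(size=(200, 2))
yv = (Xv[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)
wrapper = AbstainingWrapper(rule, Xv, yv)
print(wrapper.predict(np.array([[2.0, 0.0], [50.0, 50.0]])))  # far point abstains
```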
    The Evidence Lower Bound of Variational Autoencoders Converges to a Sum of Three Entropies. (arXiv:2010.14860v4 [stat.ML] UPDATED)
    The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply for small as well as for intricate deep networks for encoder and decoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and understand learning behavior. More generally, we believe that our contributions can be useful for future theoretical and practical studies on VAE learning as they provide novel information on those points in parameters space that optimization of VAEs converges to.
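    For reference, the claimed stationary-point identity can be transcribed as a display equation (a transcription of the abstract's statement under assumed notation, with $q_\phi$ the variational distribution, $p_\theta(z)$ the prior, and $p_\theta(x \mid z)$ the observable distribution; the sign conventions follow the "negative entropy" phrasing above):

```latex
% ELBO at stationary points, as stated above (notation assumed):
\mathcal{L}(\theta,\phi)
  = \underbrace{\frac{1}{N}\sum_{n=1}^{N} \mathcal{H}\big[q_\phi(z \mid x_n)\big]}_{\text{avg.\ variational entropy}}
  - \underbrace{\mathcal{H}\big[p_\theta(z)\big]}_{\text{prior entropy}}
  - \underbrace{\frac{1}{N}\sum_{n=1}^{N}
      \mathbb{E}_{q_\phi(z \mid x_n)}\Big[\mathcal{H}\big[p_\theta(x \mid z)\big]\Big]}_{\text{expected observable entropy}}
```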
    Joint Chinese Word Segmentation and Span-based Constituency Parsing. (arXiv:2211.01638v1 [cs.CL])
    In constituency parsing, span-based decoding is an important direction. However, for Chinese sentences, because of their linguistic characteristics, it is necessary to utilize other models to perform word segmentation first, which introduces a series of uncertainties and generally leads to errors in the computation of the constituency tree afterward. This work proposes a method for joint Chinese word segmentation and Span-based Constituency Parsing by adding extra labels to individual Chinese characters on the parse trees. Through experiments, the proposed algorithm outperforms the recent models for joint segmentation and constituency parsing on CTB 5.1.  ( 2 min )
    Implementation of the Digital QS-SVM-based Beamformer on an FPGA Platform. (arXiv:2211.01763v1 [cs.NI])
    To address practical challenges in establishing and maintaining robust wireless connectivity, such as multi-path effects, low latency, size reduction, and high data rates, digital beamforming is performed by a hybrid antenna array at an operating frequency of 10 GHz. The proposed digital beamformer, acting as a spatial filter, is capable of performing both direction-of-arrival (DoA) estimation and beamforming. The well-established machine learning technique of the support vector machine (SVM) is, for DoA estimation, limited to problems with linearly separable datasets. To overcome this constraint, the proposed beamformer uses a QS-SVM classifier with a small regularizer for DoA estimation, in addition to the two beamforming techniques LCMV and MVDR. The QS-SVM-based beamformer has been deployed on an FPGA board, as demonstrated in detail in this work. The implementation results verify the strong performance of the QS-SVM-based beamformer in suppressing undesired signals, placing deep nulls with powers below -10 dB on undesired signals, and passing desired signals. Furthermore, we demonstrate that the QS-SVM-based beamformer offers further advantages: average latency on the order of milliseconds, performance efficiency of more than 90%, and throughput of about 100%.
    Private Semi-supervised Knowledge Transfer for Deep Learning from Noisy Labels. (arXiv:2211.01628v1 [cs.LG])
    Deep learning models trained on large-scale data have achieved encouraging performance in many real-world tasks. Meanwhile, publishing those models trained on sensitive datasets, such as medical records, could pose serious privacy concerns. To counter these issues, one of the current state-of-the-art approaches is the Private Aggregation of Teacher Ensembles, or PATE, which achieved promising results in preserving the utility of the model while providing a strong privacy guarantee. PATE combines an ensemble of "teacher models" trained on sensitive data and transfers the knowledge to a "student" model through the noisy aggregation of teachers' votes for labeling unlabeled public data which the student model will be trained on. However, the knowledge or voted labels learned by the student are noisy due to private aggregation. Learning directly from noisy labels can significantly impact the accuracy of the student model. In this paper, we propose the PATE++ mechanism, which combines the current advanced noisy label training mechanisms with the original PATE framework to enhance its accuracy. A novel structure of Generative Adversarial Nets (GANs) is developed in order to integrate them effectively. In addition, we develop a novel noisy label detection mechanism for semi-supervised model training to further improve student model performance when training with noisy labels. We evaluate our method on Fashion-MNIST and SVHN to show the improvements on the original PATE on all measures.  ( 2 min )
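    The noisy aggregation step at the heart of PATE is compact enough to sketch: each teacher votes on a public example, Laplace noise is added to the vote histogram, and the noisy argmax becomes the student's (noisy) label, which is exactly the label noise PATE++ is designed to cope with. Scales and counts below are illustrative.

```python
# PATE-style noisy vote aggregation (sketch; parameters illustrative).
import numpy as np

def noisy_aggregate(teacher_preds, n_classes, gamma=0.05, rng=None):
    """teacher_preds: (n_teachers, n_examples) array of predicted labels.
    gamma: inverse noise scale; larger gamma = less noise, weaker privacy."""
    rng = rng if rng is not None else np.random.default_rng()
    labels = []
    for example_votes in teacher_preds.T:
        counts = np.bincount(example_votes, minlength=n_classes).astype(float)
        counts += rng.laplace(scale=1.0 / gamma, size=n_classes)
        labels.append(int(np.argmax(counts)))
    return np.array(labels)

rng = np.random.default_rng(0)
votes = rng.integers(0, 10, size=(250, 100))     # 250 teachers, 100 examples
student_labels = noisy_aggregate(votes, n_classes=10, rng=rng)
print(student_labels[:10])
```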
    Sequence-Based Plan Feasibility Prediction for Efficient Task and Motion Planning. (arXiv:2211.01576v1 [cs.RO])
    Robots planning long-horizon behavior in complex environments must be able to quickly reason about the impact of the environment's geometry on what plans are feasible, i.e., whether there exist action parameter values that satisfy all constraints on a candidate plan. In tasks involving articulated and movable obstacles, typical Task and Motion Planning (TAMP) algorithms spend most of their runtime attempting to solve unsolvable constraint satisfaction problems imposed by infeasible plan skeletons. We developed a novel Transformer-based architecture, PIGINet, that predicts plan feasibility based on the initial state, goal, and candidate plans, fusing image and text embeddings with state features. The model sorts the plan skeletons produced by a TAMP planner according to the predicted satisfiability likelihoods. We evaluate the runtime of our learning-enabled TAMP algorithm on several distributions of kitchen rearrangement problems, comparing its performance to that of non-learning baselines and algorithm ablations. Our experiments show that PIGINet substantially improves planning efficiency, cutting down runtime by 80% on average on pick-and-place problems with articulated obstacles. It also achieves zero-shot generalization to problems with unseen object categories thanks to its visual encoding of objects.  ( 2 min )
    Embed and Emulate: Learning to estimate parameters of dynamical systems with uncertainty quantification. (arXiv:2211.01554v1 [cs.LG])
    This paper explores learning emulators for parameter estimation with uncertainty estimation of high-dimensional dynamical systems. We assume access to a computationally complex simulator that inputs a candidate parameter and outputs a corresponding multichannel time series. Our task is to accurately estimate a range of likely values of the underlying parameters. Standard iterative approaches necessitate running the simulator many times, which is computationally prohibitive. This paper describes a novel framework for learning feature embeddings of observed dynamics jointly with an emulator that can replace high-cost simulators for parameter estimation. Leveraging a contrastive learning approach, our method exploits intrinsic data properties within and across parameter and trajectory domains. On a coupled 396-dimensional multiscale Lorenz 96 system, our method significantly outperforms a typical parameter estimation method based on predefined metrics and a classical numerical simulator, and with only 1.19% of the baseline's computation time. Ablation studies highlight the potential of explicitly designing learned emulators for parameter estimation by leveraging contrastive learning.
    Human Biophysics as Network Weights: Conditional Generative Models for Ultra-fast Simulation. (arXiv:2211.01856v1 [cs.LG])
    Simulations of biophysical systems have provided a huge contribution to our fundamental understanding of human physiology and remain a central pillar for developments in medical devices and human machine interfaces. However, despite their successes, such simulations usually rely on highly computationally expensive numerical modelling, which is often inefficient to adapt to new simulation parameters. This limits their use in dynamic models of human behavior, for example in modelling the electric fields generated by muscles in a moving arm. We propose the alternative approach to use conditional generative models, which can learn complex relationships between the underlying generative conditions whilst remaining inexpensive to sample from. As a demonstration of this concept, we present BioMime, a hybrid architecture that combines elements of deep latent variable models and conditional adversarial training to construct a generative model that can both transform existing data samples to reflect new modelling assumptions and sample new data from a conditioned distribution. We demonstrate that BioMime can learn to accurately mimic a complex numerical model of human muscle biophysics and then use this knowledge to continuously sample from a dynamically changing system in real-time. We argue that transfer learning approaches with conditional generative models are a viable solution for dynamic simulation with any numerical model.
    Data-based Polymer-Unit Fingerprint (PUFp): A Newly Accessible Expression of Polymer Organic Semiconductors for Machine Learning. (arXiv:2211.01583v1 [cond-mat.mtrl-sci])
In the process of finding high-performance organic semiconductors (OSCs), it is of paramount importance in material development to identify the functional units that play key roles in material performance and subsequently establish substructure-property relationships. Herein, we describe a polymer-unit fingerprint (PUFp) generation framework. Machine learning (ML) models can be used to determine structure-mobility relationships by using PUFp information as structural input, together with 678 collected OSC data records. A polymer-unit library consisting of 445 units is constructed, and the key polymer units for the mobility of OSCs are identified. By investigating the combinations of polymer units with mobility performance, a scheme for designing polymer OSC materials by combining ML approaches and PUFp information is proposed to not only passively predict OSC mobility but also actively provide structural guidance for new high-mobility OSC material design. The proposed scheme demonstrates the ability to screen new materials through pre-evaluation and classification ML steps and is an alternative methodology for applying ML in new high-mobility OSC discovery.
    Learning Decentralized Strategies for a Perimeter Defense Game with Graph Neural Networks. (arXiv:2211.01757v1 [cs.MA])
We consider the problem of finding decentralized strategies for multi-agent perimeter defense games. In this work, we design a graph neural network-based learning framework to learn a mapping from defenders' local perceptions and the communication graph to defenders' actions, such that the learned actions are close to those generated by a centralized expert algorithm. We demonstrate that our proposed networks stay closer to the expert policy and are superior to other baseline algorithms by capturing more intruders. Our GNN-based networks are trained at a small scale and can generalize to large scales. To validate our results, we run perimeter defense games in scenarios with different team sizes and initial configurations to evaluate the performance of the learned networks.
    Faster Adaptive Momentum-Based Federated Methods for Distributed Composition Optimization. (arXiv:2211.01883v1 [cs.LG])
Composition optimization has recently appeared in many machine learning applications such as meta learning and reinforcement learning. Many composition optimization algorithms have recently been proposed and studied; however, few adaptive algorithms consider composition optimization under the distributed setting. Meanwhile, the existing distributed composition optimization methods still suffer from high sample and communication complexities. In this paper, we therefore develop a class of faster momentum-based federated compositional gradient descent algorithms (i.e., MFCGD and AdaMFCGD) to solve nonconvex distributed composition problems, which build on momentum-based variance reduction and local-SGD techniques. In particular, our adaptive algorithm (i.e., AdaMFCGD) uses a unified adaptive matrix to flexibly incorporate various adaptive learning rates. Moreover, we provide a solid theoretical analysis for our algorithms under the non-i.i.d. setting, and prove our algorithms simultaneously obtain lower sample and communication complexities than the existing federated compositional algorithms. Specifically, our algorithms obtain a lower sample complexity of $\tilde{O}(\epsilon^{-3})$ with a lower communication complexity of $\tilde{O}(\epsilon^{-2})$ in finding an $\epsilon$-stationary point. We conduct experiments on robust federated learning and distributed meta learning tasks to demonstrate the efficiency of our algorithms.
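    For readers new to the setting: the compositional objective is $\min_x f(g(x))$ where both $f$ and $g$ are expectations, so the chain-rule gradient $\nabla g(x)^\top \nabla f(g(x))$ cannot be estimated without bias from single samples. Below is a toy single-machine sketch of the standard fix, tracking a running estimate of the inner value; the paper's federated, momentum-based variance-reduced algorithms are considerably more involved, and the toy problem here is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))   # toy inner map: g(x) = A x, observed with noise

def sample_g(x):                  # one noisy evaluation of the inner function
    return A @ x + 0.01 * rng.standard_normal(5)

def grad_f(y):                    # outer function f(y) = 0.5 * ||y||^2
    return y

x = rng.standard_normal(3)
y = sample_g(x)                   # running tracker of g(x)
for _ in range(2000):
    y = 0.9 * y + 0.1 * sample_g(x)    # moving average tracks the inner value
    x -= 0.01 * (A.T @ grad_f(y))      # chain rule using the tracked estimate
print(np.linalg.norm(A @ x))           # approaches 0 as x minimizes f(g(x))
```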
    Class Interference of Deep Neural Networks. (arXiv:2211.01370v1 [cs.LG])
Recognizing and telling similar objects apart is hard even for human beings. In this paper, we show that class interference is a phenomenon present in all deep neural networks. Class interference represents the learning difficulty in data, and it constitutes the largest percentage of generalization errors made by deep networks. To understand class interference, we propose cross-class tests, class ego directions and interference models. We show how to use these definitions to study minima flatness and class interference of a trained model. We also show how to detect class interference during training through label dancing patterns and class dancing notes.  ( 2 min )
    Incorporating High-Frequency Weather Data into Consumption Expenditure Predictions. (arXiv:2211.01406v1 [q-fin.EC])
Recent efforts have been very successful in accurately mapping welfare in data-sparse regions of the world using satellite imagery and other non-traditional data sources. However, the literature to date has focused on predicting a particular class of welfare measures, asset indices, which are relatively insensitive to short-term fluctuations in well-being. We suggest that predicting more volatile welfare measures, such as consumption expenditure, benefits substantially from the incorporation of data sources with high temporal resolution. By incorporating daily weather data into training and prediction, we improve consumption prediction accuracy significantly compared to models that only utilize satellite imagery.  ( 2 min )
    BATT: Backdoor Attack with Transformation-based Triggers. (arXiv:2211.01806v1 [cs.CR])
Deep neural networks (DNNs) are vulnerable to backdoor attacks. Backdoor adversaries intend to maliciously control the predictions of attacked DNNs by injecting hidden backdoors that can be activated by adversary-specified trigger patterns during the training process. One recent study revealed that most existing attacks fail in the real physical world, since the trigger contained in the digitized test samples may differ from the one used for training. Accordingly, users can adopt spatial transformations as image pre-processing to deactivate hidden backdoors. In this paper, we explore the previous findings from another side. We exploit classical spatial transformations (i.e., rotation and translation) with specific parameters as trigger patterns to design a simple yet effective poisoning-based backdoor attack. For example, only images rotated to a particular angle can activate the embedded backdoor of attacked DNNs. Extensive experiments are conducted, verifying the effectiveness of our attack under both digital and physical settings and its resistance to existing backdoor defenses.  ( 2 min )
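    To make the idea concrete, here is a hedged sketch of the kind of transformation-based poisoning the abstract describes: rotate a fraction of training images by a fixed, attacker-chosen angle and relabel them to the target class. The angle, poisoning rate, and relabeling scheme below are illustrative assumptions; the paper's exact procedure may differ.

```python
import numpy as np
from scipy.ndimage import rotate

def poison_batch(images, labels, target_class, angle=16.0, poison_rate=0.1, seed=0):
    """Rotation-as-trigger poisoning: after training on this data, only inputs
    rotated by `angle` should flip the model's prediction to target_class."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    idx = rng.choice(len(images), int(poison_rate * len(images)), replace=False)
    for i in idx:
        # Rotate in the image plane; reshape=False keeps the original size.
        images[i] = rotate(images[i], angle, reshape=False, mode="nearest")
        labels[i] = target_class   # relabel to the attacker's target class
    return images, labels
```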
    Dormant Neural Trojans. (arXiv:2211.01808v1 [cs.CR])
    We present a novel methodology for neural network backdoor attacks. Unlike existing training-time attacks where the Trojaned network would respond to the Trojan trigger after training, our approach inserts a Trojan that will remain dormant until it is activated. The activation is realized through a specific perturbation to the network's weight parameters only known to the attacker. Our analysis and the experimental results demonstrate that dormant Trojaned networks can effectively evade detection by state-of-the-art backdoor detection methods.  ( 2 min )
    Learning to Grasp the Ungraspable with Emergent Extrinsic Dexterity. (arXiv:2211.01500v1 [cs.RO])
A simple gripper can solve more complex manipulation tasks if it can utilize the external environment, such as pushing the object against the table or a vertical wall, known as "Extrinsic Dexterity." Previous work in extrinsic dexterity usually makes careful assumptions about contacts, which impose restrictions on robot design, robot motions, and the variations of the physical parameters. In this work, we develop a system based on reinforcement learning (RL) to address these limitations. We study the task of "Occluded Grasping", which aims to grasp the object in configurations that are initially occluded; the robot needs to move the object into a configuration from which these grasps can be achieved. We present a system with model-free RL that successfully achieves this task using a simple gripper with extrinsic dexterity. The policy learns emergent behaviors of pushing the object against the wall to rotate and then grasp it, without additional reward terms on extrinsic dexterity. We discuss important components of the system, including the design of the RL problem, multi-grasp training and selection, and policy generalization with automatic curriculum. Most importantly, the policy trained in simulation is zero-shot transferred to a physical robot. It demonstrates dynamic and contact-rich motions with a simple gripper and generalizes across objects of various sizes, densities, surface frictions, and shapes with a 78% success rate. Videos can be found at https://sites.google.com/view/grasp-ungraspable/.  ( 2 min )
    Exploring Explainability Methods for Graph Neural Networks. (arXiv:2211.01770v1 [cs.LG])
With the growing use of deep learning methods for a variety of real-world tasks, particularly graph neural networks, which encode intricate interconnectedness information, there is a necessity for explainability in such settings. In this paper, we demonstrate the applicability of popular explainability approaches on Graph Attention Networks (GAT) for a graph-based super-pixel image classification task. We assess the qualitative and quantitative performance of these techniques on three different datasets and describe our findings. The results shed fresh light on the notion of explainability in GNNs, particularly GATs.  ( 2 min )
    Analysis of a Deep Learning Model for 12-Lead ECG Classification Reveals Learned Features Similar to Diagnostic Criteria. (arXiv:2211.01738v1 [eess.SP])
Despite their remarkable performance, deep neural networks remain unadopted in clinical practice, which is considered to be partially due to their lack of explainability. In this work, we apply attribution methods to a pre-trained deep neural network (DNN) for 12-lead electrocardiography classification to open this "black box" and understand the relationship between model prediction and learned features. We classify data from a public data set, and the attribution methods assign a "relevance score" to each sample of the classified signals. This allows analyzing what the network learned during training, for which we propose quantitative methods: average relevance scores over a) classes, b) leads, and c) average beats. The analyses of relevance scores for atrial fibrillation (AF) and left bundle branch block (LBBB) compared to healthy controls show that their mean values a) increase with higher classification probability and correspond to false classifications when around zero, and b) correspond to clinical recommendations regarding which lead to consider. Furthermore, c) visible P-waves and concordant T-waves result in clearly negative relevance scores in AF and LBBB classification, respectively. In summary, our analysis suggests that the DNN learned features similar to cardiology textbook knowledge.  ( 2 min )
    Resource-aware Deep Learning for Wireless Fingerprinting Localization. (arXiv:2211.01759v1 [cs.NI])
Location-based services, already popular with end users, are now inevitably becoming part of new wireless infrastructures and emerging business processes. The increasingly popular Deep Learning (DL) artificial intelligence methods perform very well in wireless fingerprinting localization based on extensive indoor radio measurement data. However, with increasing complexity, these methods become computationally very intensive and energy-hungry, both for their training and subsequent operation. Considering only mobile users, estimated to exceed 7.4 billion by the end of 2025, and assuming that the networks serving these users will need to perform only one localization per user per hour on average, the machine learning models used for the calculation would need to perform $65 \times 10^{12}$ predictions per year. Add to this equation tens of billions of other connected devices and applications that rely heavily on more frequent location updates, and it becomes apparent that localization will contribute significantly to carbon emissions unless more energy-efficient models are developed and used. In this chapter, we discuss the latest results and trends in wireless localization and look at paths towards achieving more sustainable AI. We then elaborate on a methodology for computing DL model complexity, energy consumption and carbon footprint, and show on a concrete example how to develop a more resource-aware model for fingerprinting. We finally compare relevant works in terms of complexity and training CO$_2$ footprint.  ( 2 min )
    Proximal Subgradient Norm Minimization of ISTA and FISTA. (arXiv:2211.01610v1 [math.OC])
For first-order smooth optimization, research on the acceleration phenomenon has a long history. Only recently was the mechanism leading to acceleration successfully uncovered, via the gradient correction term and its equivalent implicit-velocity form. Furthermore, based on the high-resolution differential equation framework and the corresponding emerging techniques, phase-space representation and Lyapunov functions, the squared gradient norm of Nesterov's accelerated gradient descent (\texttt{NAG}) method was shown to converge at an inverse cubic rate. However, this result cannot be directly generalized to composite optimization, which is widely used in practice, e.g., the linear inverse problem with sparse representation. In this paper, we meticulously examine a pivotal inequality used in composite optimization involving the step size $s$ and the Lipschitz constant $L$, and find that it can be tightened. We apply the tighter inequality to the well-constructed Lyapunov function and then obtain proximal subgradient norm minimization via the phase-space representation, regardless of the gradient-correction or implicit-velocity viewpoint. Furthermore, we demonstrate that the squared proximal subgradient norm for the class of iterative shrinkage-thresholding algorithms (ISTA) converges at an inverse square rate, and the squared proximal subgradient norm for the class of faster iterative shrinkage-thresholding algorithms (FISTA) is accelerated to convergence at an inverse cubic rate.  ( 2 min )
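    For reference, the ISTA class analyzed above is the proximal gradient method for $\ell_1$-regularized least squares, $\min_x \frac{1}{2}\|Ax-b\|^2 + \lambda\|x\|_1$. A minimal textbook sketch with step size $s = 1/L$ follows; the paper's contribution is the sharper analysis, not the algorithm itself.

```python
import numpy as np

def soft_threshold(x, tau):
    # Proximal operator of tau * ||.||_1
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def ista(A, b, lam, n_iter=500):
    L = np.linalg.norm(A, 2) ** 2     # Lipschitz constant of the smooth gradient
    s = 1.0 / L                       # step size satisfying s <= 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - b)      # gradient of the least-squares term
        x = soft_threshold(x - s * grad, s * lam)  # proximal gradient step
    return x
```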
    A Data-Driven Approach to Quantum Cross-Platform Verification. (arXiv:2211.01668v1 [quant-ph])
    The task of testing whether two uncharacterized devices behave in the same way, known as cross-platform verification, is crucial for benchmarking quantum simulators and near-term quantum computers. Cross-platform verification becomes increasingly challenging as the system's dimensionality increases, and has so far remained intractable for continuous variable quantum systems. In this Letter, we develop a data-driven approach, working with limited noisy data and suitable for continuous variable quantum states. Our approach is based on a convolutional neural network that assesses the similarity of quantum states based on a lower-dimensional state representation built from measurement data. The network can be trained offline with classically simulated data, and is demonstrated here on non-Gaussian quantum states for which cross-platform verification could not be achieved with previous techniques. It can also be applied to cross-platform verification of quantum dynamics and to the problem of experimentally testing whether two quantum states are equivalent up to Gaussian unitary transformations.  ( 2 min )
    Spam Review Detection Using Deep Learning. (arXiv:2211.01675v1 [cs.CL])
A robust and reliable system for detecting spam reviews is a crying need in today's world in order to purchase products from online sites without being cheated. Many online sites offer options for posting reviews, thus creating scope for fake paid reviews or untruthful reviews. These concocted reviews can mislead the general public and leave them uncertain whether to believe a review or not. Prominent machine learning techniques have been introduced to solve the problem of spam review detection. The majority of current research has concentrated on supervised learning methods, which require labeled data - an inadequacy when it comes to online reviews. Our focus in this article is to detect any deceptive text reviews. In order to achieve that, we have worked with both labeled and unlabeled data and proposed deep learning methods for spam review detection, including Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN) and a variant of Recurrent Neural Network (RNN), namely Long Short-Term Memory (LSTM). We have also applied some traditional machine learning classifiers such as Naïve Bayes (NB), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) to detect spam reviews, and finally, we have shown a performance comparison for both traditional and deep learning classifiers.  ( 2 min )
    Robust Few-shot Learning Without Using any Adversarial Samples. (arXiv:2211.01598v1 [cs.CV])
The high cost of acquiring and annotating samples has made the 'few-shot' learning problem of prime importance. Existing works mainly focus on improving performance on clean data and overlook robustness concerns on data perturbed with adversarial noise. Recently, a few efforts have been made to combine the few-shot problem with the robustness objective using sophisticated meta-learning techniques. These methods rely on the generation of adversarial samples in every episode of training, which further adds a computational burden. To avoid such time-consuming and complicated procedures, we propose a simple but effective alternative that does not require any adversarial samples. Inspired by the cognitive decision-making process in humans, we enforce high-level feature matching between the base class data and their corresponding low-frequency samples in the pretraining stage via self-distillation. The model is then fine-tuned on the samples of novel classes, where we additionally improve the discriminability of low-frequency query set features via cosine similarity. On a 1-shot setting of the CIFAR-FS dataset, our method yields a massive improvement of $60.55\%$ & $62.05\%$ in adversarial accuracy on the PGD and state-of-the-art Auto Attack, respectively, with a minor drop in clean accuracy compared to the baseline. Moreover, our method only takes $1.69\times$ of the standard training time while being $\approx$ $5\times$ faster than state-of-the-art adversarial meta-learning methods. The code is available at https://github.com/vcl-iisc/robust-few-shot-learning.  ( 2 min )
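    A hedged sketch of the low-frequency counterpart images the pretraining stage matches against: the circular FFT low-pass filter and its cutoff below are assumptions for illustration, and the paper's exact filtering may differ.

```python
import numpy as np

def low_frequency(image, cutoff=0.25):
    """Keep only low spatial frequencies of a 2D grayscale image."""
    f = np.fft.fftshift(np.fft.fft2(image))       # center the spectrum
    h, w = image.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.ogrid[:h, :w]
    radius = cutoff * min(h, w) / 2
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= radius ** 2  # circular low-pass
    return np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
```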
    Reliable Malware Analysis and Detection using Topology Data Analysis. (arXiv:2211.01535v1 [cs.CR])
Increasingly, malware is becoming more complex and spreading across networks, targeting different infrastructures and personal end devices to collect, modify, and destroy victim information. Malware behaviors are polymorphic, metamorphic, persistent, able to hide to bypass detectors and adapt to new environments, and even leverage machine learning techniques to better damage targets. This makes malware difficult to analyze and detect with traditional endpoint detection and response or intrusion detection and prevention systems. To defend against malware, recent work has proposed different techniques based on signatures and machine learning. In this paper, we propose to use an algebraic topological approach called topological data analysis (TDA) to efficiently analyze and detect complex malware patterns. Next, we compare different TDA techniques (i.e., persistent homology, ToMATo, TDA Mapper) and existing techniques (i.e., PCA, UMAP, t-SNE) using different classifiers, including random forest, decision tree, XGBoost, and LightGBM. We also propose some recommendations to deploy the best-identified models for malware detection at scale. Results show that TDA Mapper (combined with PCA) is better for clustering and for identifying hidden relationships between malware clusters compared to PCA. Persistence diagrams are better at identifying overlapping malware clusters with low execution time compared to UMAP and t-SNE. For malware detection, malware analysts can use Random Forest and Decision Tree with t-SNE and persistence diagrams to achieve better performance and robustness on noised data.  ( 2 min )
    GRAIMATTER Green Paper: Recommendations for disclosure control of trained Machine Learning (ML) models from Trusted Research Environments (TREs). (arXiv:2211.01656v1 [cs.LG])
TREs are widely and increasingly used to support statistical analysis of sensitive data across a range of sectors (e.g., health, police, tax and education), as they enable secure and transparent research whilst protecting data confidentiality. There is an increasing desire from academia and industry to train AI models in TREs. The field of AI is developing quickly, with applications including spotting human errors, streamlining processes, task automation and decision support. These complex AI models require more information to describe and reproduce, increasing the possibility that sensitive personal data can be inferred from such descriptions. TREs do not have mature processes and controls against these risks. This is a complex topic, and it is unreasonable to expect all TREs to be aware of all risks or that TRE researchers have addressed these risks in AI-specific training. GRAIMATTER has developed a draft set of usable recommendations for TREs to guard against the additional risks when disclosing trained AI models from TREs. The development of these recommendations has been funded by the GRAIMATTER UKRI DARE UK sprint research project. This version of our recommendations was published at the end of the project in September 2022. During the course of the project, we have identified many areas for future investigation to expand and test these recommendations in practice. Therefore, we expect that this document will evolve over time.  ( 3 min )
    Try to Avoid Attacks: A Federated Data Sanitization Defense for Healthcare IoMT Systems. (arXiv:2211.01592v1 [cs.CR])
Healthcare IoMT systems are becoming intelligent, miniaturized, and more integrated into daily life. For the distributed devices in the IoMT, federated learning with cloud-based training has become a topical area for meeting data security requirements. However, the distributed nature of the IoMT leaves it exposed to data poisoning attacks, in which poisoned data can be fabricated by falsifying medical data, calling for a security defense for IoMT systems. Due to the lack of specific labels, filtering malicious data is a uniquely unsupervised scenario. One of the main challenges is finding robust data filtering methods for various poisoning attacks. This paper introduces Federated Data Sanitization Defense, a novel approach to protect the system from data poisoning attacks. To solve this unsupervised problem, we first use federated learning to project all the data to the subspace domain, allowing a unified feature mapping to be established since the data is stored locally. Then we adopt federated clustering to re-group the features to clarify the poisoned data. The clustering is based on the consistent association of data and its semantics. After clustering the private data, we perform data sanitization with a simple yet efficient strategy. In the end, each device of the distributed IoMT is enabled to filter malicious data according to federated data sanitization. Extensive experiments are conducted to evaluate the efficacy of the proposed defense method against data poisoning attacks. Further, we evaluate our approach under different poisoning ratios and achieve high accuracy and a low attack success rate.  ( 3 min )
    On the Safety of Interpretable Machine Learning: A Maximum Deviation Approach. (arXiv:2211.01498v1 [cs.LG])
    Interpretable and explainable machine learning has seen a recent surge of interest. We focus on safety as a key motivation behind the surge and make the relationship between interpretability and safety more quantitative. Toward assessing safety, we introduce the concept of maximum deviation via an optimization problem to find the largest deviation of a supervised learning model from a reference model regarded as safe. We then show how interpretability facilitates this safety assessment. For models including decision trees, generalized linear and additive models, the maximum deviation can be computed exactly and efficiently. For tree ensembles, which are not regarded as interpretable, discrete optimization techniques can still provide informative bounds. For a broader class of piecewise Lipschitz functions, we leverage the multi-armed bandit literature to show that interpretability produces tighter (regret) bounds on the maximum deviation. We present case studies, including one on mortgage approval, to illustrate our methods and the insights about models that may be obtained from deviation maximization.  ( 2 min )
    On the Informativeness of Supervision Signals. (arXiv:2211.01407v1 [cs.LG])
Learning transferable representations by training a classifier is a well-established technique in deep learning (e.g., ImageNet pretraining), but it remains an open theoretical question why this kind of task-specific pre-training should result in "good" representations that actually capture the underlying structure of the data. We conduct an information-theoretic analysis of several commonly-used supervision signals from contrastive learning and classification to determine how they contribute to representation learning performance and how the dynamics of learning are affected by training parameters such as the number of labels, classes, and dimensions in the training dataset. We validate these results empirically in a series of simulations and conduct a cost-benefit analysis to establish a tradeoff curve that enables users to optimize the cost of supervising representation learning on their own datasets.  ( 2 min )
    Bayesian Counterfactual Mean Embeddings and Off-Policy Evaluation. (arXiv:2211.01518v1 [stat.ML])
The counterfactual distribution models the effect of the treatment in the untreated group. While most of the work focuses on the expected values of the treatment effect, one may be interested in the whole counterfactual distribution or other quantities associated with it. Building on the framework of Bayesian conditional mean embeddings, we propose a Bayesian approach for modeling the counterfactual distribution, which leads to quantifying the epistemic uncertainty about the distribution. The framework naturally extends to the setting where one observes multiple treatment effects (e.g. an intermediate effect after an interim period, and an ultimate treatment effect which is of main interest) and allows for additionally modelling uncertainty about the relationship between these effects. For this goal, we present three novel Bayesian methods to estimate the expectation of the ultimate treatment effect when only noisy samples of the dependence between intermediate and ultimate effects are provided. These methods differ in the source of uncertainty considered and allow for combining two sources of data. Moreover, we generalize these ideas to the off-policy evaluation framework, which can be seen as an extension of the counterfactual estimation problem. We empirically explore the calibration of the algorithms in two different experimental settings which require data fusion, and illustrate the value of considering the uncertainty stemming from the two sources of data.  ( 2 min )
    Speeding up NAS with Adaptive Subset Selection. (arXiv:2211.01454v1 [cs.LG])
A majority of recent developments in neural architecture search (NAS) have been aimed at decreasing the computational cost of various techniques without affecting their final performance. Towards this goal, several low-fidelity and performance prediction methods have been considered, including those that train only on subsets of the training data. In this work, we present an adaptive subset selection approach to NAS and present it as complementary to state-of-the-art NAS approaches. We uncover a natural connection between one-shot NAS algorithms and adaptive subset selection and devise an algorithm that makes use of state-of-the-art techniques from both areas. We use these techniques to substantially reduce the runtime of DARTS-PT (a leading one-shot NAS algorithm), as well as BOHB and DEHB (leading multifidelity optimization algorithms), without sacrificing accuracy. Our results are consistent across multiple datasets, and towards full reproducibility, we release our code at https://anonymous.4open.science/r/SubsetSelection NAS-B132.  ( 2 min )
    MPCFormer: fast, performant and private Transformer inference with MPC. (arXiv:2211.01452v1 [cs.LG])
Enabling private inference is crucial for many cloud inference services that are based on Transformer models. However, existing private inference solutions for Transformers can increase the inference latency by more than 60x or significantly compromise the quality of inference results. In this paper, we design the framework MPCFormer using secure multi-party computation (MPC) and Knowledge Distillation (KD). It can be used in tandem with many specifically designed MPC-friendly approximations and trained Transformer models. MPCFormer significantly speeds up Transformer model inference in MPC settings while achieving similar ML performance to the input model. We evaluate MPCFormer with various settings in MPC. On the IMDb dataset, we achieve similar performance to BERT-Base, while being 5.3x faster. On the GLUE benchmark, we achieve 97% performance of BERT-Base with a 2.2x speedup. We show that MPCFormer remains effective with different trained Transformer weights such as RoBERTa-Base and larger models including BERT-Large. In particular, we achieve similar performance to BERT-Large, while being 5.93x faster on the IMDb dataset.  ( 2 min )
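    One family of MPC-friendly approximations discussed in this line of work replaces the exponential in softmax with a cheap polynomial, since MPC protocols handle additions and multiplications far more cheaply than exponentials. The quadratic form and constant below are illustrative assumptions, not necessarily the paper's exact choice.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def quad_softmax(x, c=5.0):
    """Illustrative MPC-friendly stand-in: exp replaced by a quadratic."""
    q = (x + c) ** 2
    return q / q.sum(axis=-1, keepdims=True)

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores))        # exact attention weights
print(quad_softmax(scores))   # cheap approximation with a similar ordering
```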

  • Open

[N] Class-action lawsuit filed against GitHub, Microsoft, and OpenAI regarding the legality of GitHub Copilot, an AI-powered tool for programmers
Joseph Saveri Law Firm and Matthew Butterick File Class-Action Lawsuit Against GitHub, Microsoft, and OpenAI Over Violations of Open-Source Licenses Arising From GitHub Copilot, an AI-based product. GitHub Copilot litigation. Here is a blog post about this that was written by an intellectual property law expert before the lawsuit was filed. This is starting to look like the very first case dealing specifically with machine learning and fair use in the US. Discussion about this lawsuit on Hacker News. submitted by /u/Wiskkey [link] [comments]  ( 57 min )
    [D] What are the major general advances in techniques?
Hey, I'm a casual observer of the DL space, what are the biggest technique changes or discoveries that are now used everywhere? From my view: ReLU - a simple-to-train non-linear function; Dropout - how to not overfit (2014); Residual connections - how to go deep (2015); Layer normalisation - how to fit better (2016); Transformers - how to train sequences in parallel (2017). What are the other improvements or discoveries? The more general the idea, the better. submitted by /u/windoze [link] [comments]  ( 54 min )
    [D] Chelsea Finn, Stanford: On the biggest bottlenecks in robotics and reinforcement learning
    Here is a podcast episode with Chelsea Finn where we discuss some of the biggest bottlenecks in RL and robotics such as Sim2Real transferability, distribution shifts, and much more! submitted by /u/thejashGI [link] [comments]  ( 56 min )
    [D] DALL·E to be made available as API, OpenAI to give users full ownership rights to generated images
    Email announcement from OpenAI below: DALL·E is now available as an API You can now integrate state of the art image generation capabilities directly into your apps and products through our new DALL·E API. You own the generations you create with DALL·E. We’ve simplified our Terms of Use and you now have full ownership rights to the images you create with DALL·E — in addition to the usage rights you’ve already had to use and monetize your creations however you’d like. This update is possible due to improvements to our safety systems which minimize the ability to generate content that violates our content policy. Sort and showcase with collections. You can now organize your DALL·E creations in multiple collections. Share them publicly or keep them private. Check out our sea otter collection! We’re constantly amazed by the innovative ways you use DALL·E and love seeing your creations out in the world. Artists who would like their work to be shared on our Instagram can request to be featured using Instagram’s collab tool. DM us there to show off how you’re using the API! - The OpenAI Team submitted by /u/TiredOldCrow [link] [comments]  ( 57 min )
    [R] nature srep: Spontaneous emergence of computation in network cascades
    https://www.nature.com/articles/s41598-022-19218-0 https://youtu.be/WyAspVjo6VI Above links are to the paper and a talk about this research. (Starts at 1m35s) We show how random threshold networks can compute complex Boolean functions in cascades or avalanches. This has many implications for neuroscience and other domains, and may help in discovering more efficient methods for learning in artificial networks. submitted by /u/NefariousnessFun21 [link] [comments]  ( 56 min )
    [N] Ethan Caballero: Broken Neural Scaling Laws | New Podcast Episode
video: https://www.youtube.com/watch?v=SV87S38M1J4 OUTLINE: 00:00 Introduction 00:50 The "Scale Is All You Need" Movement 01:07 A Functional Form Predicting Every Scaling Behavior 01:40 A Break Between Two Straight Lines On A Log Log Plot 02:32 The Broken Neural Scaling Laws Equation 04:04 Extrapolating A Ton Of Large Scale Vision And Language Tasks 04:49 Upstream And Downstream Have Different Breaks 05:22 Extrapolating Four Digit Addition Performance 06:11 On The Feasibility Of Running Enough Training Runs 06:31 Predicting Sharp Left Turns 07:51 Modeling Double Descent 08:41 Forecasting Interpretability And Controllability 09:33 How Deception Might Happen In Practice 10:24 Sinister Stumbles And Treacherous Turns 11:18 Recursive Self Improvement Precedes Sinister Stumbles 11:51 Humans In The Loop For The Very First Deception 12:32 The Hardware Stuff Is Going To Come After The Software Stuff 12:57 Distributing Your Training By Copy-Pasting Yourself Into Different Servers 13:42 Automating The Entire Hardware Pipeline 14:47 Having Text AGI Spit Out New Robotics Design 16:33 The Case For Existential Risk From AI 18:32 Git Re-basin 18:54 Is Chain-Of-Thoughts Enough For Complex Reasoning In LMs? 19:52 Why Diffusion Models Outperform Other Generative Models 21:13 Using Whisper To Train GPT4 22:33 Text To Video Was Only Slightly Impressive 23:29 The e=mc^2 of AGI transcript: https://theinsideview.ai/ethan2 submitted by /u/evc123 [link] [comments]  ( 58 min )
    [P] Open source EDA tooling
    Have been developing some open source EDA-type tooling and thought I'd share what we've built up so far. One of the use cases we've been thinking about is how to make it super easy to view data really quickly and get basic stats automatically during the development process. Including a few screenshots of the workspace we've built. Let me know if this is useful to any of y'all. Our GH is https://github.com/cnextio/cnext https://preview.redd.it/cu24l2m2vrx91.png?width=4136&format=png&auto=webp&s=da916fa2bcafec81bbf67b6eaa939fe8aadb3604 https://preview.redd.it/yx1ppsdevrx91.png?width=3266&format=png&auto=webp&s=7afdda9d625fbf7b6bf07d8e9c0c471a6b57eff1 submitted by /u/viennese_schnitzel [link] [comments]  ( 57 min )
    [P] Secret Whisper: Deploy OpenAI Whisper model with privacy using BlindAI
Hello everyone, We have released a walkthrough showing how to deploy OpenAI Whisper (https://openai.com/blog/whisper/) for speech-to-text with privacy using BlindAI. Whisper is a model that has been quite popular recently and has shown amazing performance on various tasks such as speech-to-text. However, in some scenarios, deploying such models can pose privacy issues. For instance, using it to transcribe therapy sessions could greatly help therapists and patients gain more insights, but sending session recordings could break doctor-patient privilege if the AI provider or Cloud provider of the hosted AI service is malicious or compromised. That is why we have provided a short walkthrough to show how one can deploy OpenAI Whisper (the tiny model) for English speech recognition with privacy using BlindAI. BlindAI (https://github.com/mithril-security/blindai) is an open-source confidential AI deployment solution. By using secure enclaves (Intel SGX for now, soon AMD SEV and Nvidia Confidential Computing), we provide end-to-end protection for users' data, even when sending it to the Cloud for AI inference. You can see the gains of BlindAI in the scheme below: With and without BlindAI for Speech to Text. The workflow is simple: Export the model to an ONNX file; Upload the model inside a secure enclave using the BlindAI SDK; Query the model with end-to-end protection using the BlindAI SDK. You can run it yourself using Google Colab or from this notebook. If you like it, drop a ⭐ on our GitHub! submitted by /u/Separate-Still3770 [link] [comments]  ( 58 min )
    AI Invisibility Cloak live AMA live now! [Discussion]
    Curious how this works? Want to stump my advisor with a good question? AMA happening now! Professor Tom Goldstein from University of Maryland Center for Machine Learning, PI for the viral paper on an adversarial pattern (sweatshirt deployable) for fooling object detectors. submitted by /u/john_the_jedi [link] [comments]  ( 59 min )
    [P] Fine Tuning Stable Diffusion: Naruto Character Edition
    submitted by /u/mippie_moe [link] [comments]  ( 55 min )
    [D] How to install and deploy OpenAI Whisper
    Hello, If you are interested in automatic speech recognition (speech-to-text), you are most likely going to try OpenAI Whisper. If that's the case, here is an article I just made about how to install and deploy Whisper: https://nlpcloud.com/how-to-install-and-deploy-whisper-the-best-open-source-alternative-to-google-speech-to-text.html I hope it will be useful! Julien submitted by /u/juliensalinas [link] [comments]  ( 55 min )
    [P] Made a text generation model to extend stable diffusion prompts with suitable style cues
    submitted by /u/Neat-Delivery4741 [link] [comments]  ( 56 min )
    [D] Ensemble methods for Graph Neural Networks?
    Hi, I'm wondering what the popular/notable ensemble methods for GNNs are. Looking through the literature, there doesn't seem to be much work on this. Are standard techniques, like bagging and stacking, typically applied to GNNs? submitted by /u/jsonathan [link] [comments]  ( 57 min )
    [N] eDiffi: Text-to-Image Diffusion Models with Ensemble of Expert Denoisers
    https://arxiv.org/abs/2211.01324 https://deepimagination.cc/eDiffi/ submitted by /u/jd_3d [link] [comments]  ( 54 min )
    [R] ICML 2022 Paper Summaries (HUMAN)
It's been a while since ICML 2022, but here are some human-written paper summaries: NLP; ICML Debate and Understanding; ML Optimisation/Compression. I'd be interested if people know of any others! submitted by /u/Historical_Insect668 [link] [comments]  ( 57 min )
    [D] Solving an inverse problem with machine learning where you predict multiple output arrays from a single input array
I am an aquatic optical scientist who has created a massive synthetic dataset of spectral reflectances with paired spectral absorption and backscatter data for multiple aquatic components. I am fairly adept at applying simple ANNs for supervised regression using reflectance as input and singular parameters such as chlorophyll or sediment concentrations as output. What would be the best approach to predict multiple components of spectral data from one reflectance measurement? For example, I have a single reflectance measurement between 400-900 nm, and I want to predict the absorption spectra of a hypothetical 3-component system including phytoplankton, sediment, and dissolved organic matter, which partially contribute to the reflectance spectrum. So, I have a single array as input (the reflectance) and I want to predict the three absorption arrays (each the same size as the input) as output. My current thinking is to flatten the output component arrays and use something like an autoencoder to vastly reduce the dimensionality of the output components and train a deep learning model to then predict the low-dimension latent space. Would this approach work? What other alternatives are there? Thanks for the help. submitted by /u/kravitron [link] [comments]  ( 60 min )
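    One straightforward alternative to the latent-space idea in the post above: a shared encoder with three output heads, one per component spectrum, trained jointly. A minimal PyTorch sketch follows, with hypothetical sizes (e.g. 251 wavelengths assuming 400-900 nm sampled every 2 nm); layer widths and targets are placeholders.

```python
import torch
import torch.nn as nn

N_WAVELENGTHS = 251  # hypothetical: 400-900 nm sampled every 2 nm

class MultiHeadSpectraNet(nn.Module):
    """One reflectance spectrum in, three component absorption spectra out."""
    def __init__(self, n=N_WAVELENGTHS, hidden=256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # One head per component: phytoplankton, sediment, dissolved organics.
        self.heads = nn.ModuleList([nn.Linear(hidden, n) for _ in range(3)])

    def forward(self, reflectance):
        z = self.encoder(reflectance)
        return torch.stack([head(z) for head in self.heads], dim=1)  # (B, 3, n)

model = MultiHeadSpectraNet()
out = model(torch.rand(8, N_WAVELENGTHS))                      # batch of 8 spectra
loss = ((out - torch.rand(8, 3, N_WAVELENGTHS)) ** 2).mean()   # MSE vs dummy targets
```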
    [N] On the detection of synthetic images generated by diffusion models
    Paper: https://arxiv.org/abs/2211.00680 Dataset & Code: https://github.com/grip-unina/DMimageDetection [said to be "released soon"] Abstract: Over the past decade, there has been tremendous progress in creating synthetic media, mainly thanks to the development of powerful methods based on generative adversarial networks (GAN). Very recently, methods based on diffusion models (DM) have been gaining the spotlight. In addition to providing an impressive level of photorealism, they enable the creation of text-based visual content, opening up new and exciting opportunities in many different application fields, from arts to video games. On the other hand, this property is an additional asset in the hands of malicious users, who can generate and distribute fake media perfectly adapted to their attacks, posing new challenges to the media forensic community. With this work, we seek to understand how difficult it is to distinguish synthetic images generated by diffusion models from pristine ones and whether current state-of-the-art detectors are suitable for the task. To this end, first we expose the forensics traces left by diffusion models, then study how current detectors, developed for GAN-generated images, perform on these new synthetic images, especially in challenging social-networks scenarios involving image compression and resizing. Datasets and code are available at this http URL. submitted by /u/xutw21 [link] [comments]  ( 59 min )
  • Open

    Could AI contradict an obvious lie?
Governments and businesses can be pretty dirty sometimes; the bigger and more important they are, the worse the people involved. Would a medically trained AI, for example, administer or recommend a medicine that doesn't help but is being promoted by the government or big pharma? (this is just a what-if) submitted by /u/Absolutelynobody54 [link] [comments]  ( 40 min )
    how do I use dreambooth?
    submitted by /u/BigMan100105 [link] [comments]  ( 40 min )
    Sentiment analysis in ML & NLP
    submitted by /u/UBIAI [link] [comments]  ( 40 min )
    ‘ANIMOIA’ issue 1+2 out NOW! If you like monsters and intriguing mystery, this comic is for you! Made in collaboration with Midjourney! Issue 1 - https://www.amazon.com/dp/B0BGQPDKB6 Issue 2 - https://www.amazon.com/dp/B0BL1XC4KX
    submitted by /u/Ideal-Typical [link] [comments]  ( 40 min )
    Dall-E 2 NEW API TEST: Creating AI Art with Python - A Game Changer?🔥
    submitted by /u/allaboutai-kris [link] [comments]  ( 40 min )
    eDiffi: Higher Quality and Fidelity than Stable Diffusion! (explained)
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 55 min )
    Are there any text to music or text to sound AI out there at the moment aside from OpenAI jukebox?
I don't have very much knowledge when it comes to coding or Python, which is part of the reason that I can't really use OpenAI Jukebox, but I would love to mess around with some kind of tool that allows you to generate music or sound from text. Are there any out there at the moment? I know Stability was working on something, but I'm not really sure what it's called, so I haven't really known what to look out for. I think I watched some kind of presentation where Stability said that by the end of November they will have one of the text-to-music generators available on DreamStudio, if I'm not mistaken? Would love to hear what you guys think! submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 42 min )
    Open AI Just launched the DALL·E API so developers can integrate DALL·E directly into their own apps and products.
    submitted by /u/ai-lover [link] [comments]  ( 43 min )
    Use GPT-3 and Stable Diffusion to write your kid's next bedtime story!
    submitted by /u/blazedemavocados [link] [comments]  ( 40 min )
    Does your hot AGI girlfriend need to be conscious
    submitted by /u/HumanSeeing [link] [comments]  ( 42 min )
Quantifying the volume of red wine in a glass from a photograph taken with a mobile phone
    submitted by /u/estasfuera [link] [comments]  ( 40 min )
    The truth about the AI alphabet soup (ANI, AGI, ASI)
    submitted by /u/bendee983 [link] [comments]  ( 94 min )
    DISNEY-FY Yourself In Stable Diffusion! Disney Tutorial!
    submitted by /u/PuppetHere [link] [comments]  ( 49 min )
    Will nationalism end global open-source AI collaboration?
    submitted by /u/ProtocolNews [link] [comments]  ( 41 min )
    Google wants robots to write their own Python code | ZDNET
    submitted by /u/codingai [link] [comments]  ( 44 min )
    Bonsai Brain – A low code platform to build AI agents
    The Bonsai Brain is a low code AI component that is integrated with Automation systems. The Bonsai Brain focuses on adding value to various Autonomous and AI systems. https://analyticsindiamag.com/bonsai-brain-in-azure-platform/ submitted by /u/analyticsindiam [link] [comments]  ( 45 min )
    Content Automation with Stable Diffusion + GPT-3 API + Python 🤖
    submitted by /u/allaboutai-kris [link] [comments]  ( 44 min )
    Accelerating Digital Transformation with Artificial Intelligence
Digital transformation is the word of the decade as every organization transforms its business. Learn how Artificial Intelligence accelerates this process. Read here: https://www.artiba.org/blog/accelerating-digital-transformation-with-artificial-intelligence submitted by /u/Emily-joe [link] [comments]  ( 40 min )
    How is AI transforming drug discovery? Alex Zhavoronkov, CEO, Insilico Medicine
    submitted by /u/chelsea_bear [link] [comments]  ( 45 min )
    Free AI generation platform, held an AI art contest with generous rewards
    submitted by /u/Odd-Sentence-5197 [link] [comments]  ( 41 min )
    Generator that modifies an input image based on text input?
Hey, I want to modify/generate a couple of wacky family portraits with a prompt (for example, a picture of my dad sitting on a medieval throne in a crown) for a gift. This obviously needs a double input (image and text). I scrambled to look for such a tool but couldn't find anything yet; however, somebody has got to have come up with that already. submitted by /u/qamtam [link] [comments]  ( 46 min )
    META unveils TAVA, a novel approach for Template-free Animatable Volumetric Avatars - Metaverse ready.
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    FREE WEBINAR: You're invited to the Customer Churn Prediction with Explainable AI.
    Did you know that acquiring a new customer can be between 5x and 25x MORE expensive than retaining a customer? The effects of not focusing on customer churn prediction and prevention are particularly acute in industries such as telecommunications and banking. This could potentially result in the loss of recurring revenue and soft costs, such as your brand value diminishing from dissatisfied customers. PI.EXCHANGE is holding a webinar where we show you how you can use Machine Learning to predict customer churn in the telecommunications industry with ZERO coding necessary. Interested? Click here to register, limited spots available! submitted by /u/PIEXCHANGE [link] [comments]  ( 41 min )
  • Open

    Understanding Sampled MuZero's Formula
My background in probabilities is not very solid, so I need help understanding some formulas in Sampled MuZero: Learning and Planning in Complex Action Spaces. In the paper, at the end of page 5, they propose to modify MuZero's PUCT formula, replacing pi with pi_beta hat [image: pi_beta hat PUCT formula]. beta hat is defined here [image: beta hat]. Correct me if I am wrong: beta hat(a) = 1/K if a belongs to the sampled actions {a_i} or 0 otherwise. Proposal distribution, beta, is chosen to be pi [image: sampling distribution beta]. My question is: according to 2., can 1. be simplified to beta hat, since pi/beta = 1 (not sure if we can do this with distributions)? If so, and since in PUCT we are only considering the sampled actions {a_i}, then 1. = beta hat = 1/K, meaning that Sampled MuZero replaces pi with the uniform distribution in the PUCT formula? submitted by /u/dx_rd_to_DX [link] [comments]  ( 53 min )
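    A sketch of the simplification the question asks about, using the definitions as stated in the post (not verified against the paper): with $\hat\beta$ the empirical distribution of the $K$ sampled actions, the sampled policy is

```latex
\hat{\beta}(a) = \frac{1}{K}\sum_{i=1}^{K} \mathbb{1}[a = a_i], \qquad
\hat{\pi}_{\beta}(a) = \hat{\beta}(a)\,\frac{\pi(a)}{\beta(a)}
\overset{\beta=\pi}{=} \hat{\beta}(a).
```

    So under the post's assumptions the importance ratio cancels and the PUCT prior reduces to the empirical distribution of the sampled actions; that equals 1/K per action only when no action is sampled more than once, i.e. it is uniform over the distinct samples rather than over the full action space.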
    Beyond Tabula Rasa: Reincarnating Reinforcement Learning
    submitted by /u/smallest_meta_review [link] [comments]  ( 53 min )
    Thinking about attending grad school in CS or robotics? Join our info sessions by CMU's Robotics Institute!
Are you thinking about attending grad school in computer science or robotics? Ever wondered what makes a strong application? We are excited to announce three information sessions where we’ll talk about the Whys, Whats and Hows of grad school. During the three sessions we will go over several aspects of applying to grad school: Why should you apply for grad school? How do you apply to grad school? What makes a strong graduate school application? What is life like as a grad student (at CMU)? Q&A with the attendees. The information sessions will be hosted on Tuesday November 8th, 9:00 AM, Wednesday November 9th, 7:00 PM and Monday November 14th, 1:00 PM, all Eastern Time (ET). Check out the website for details, Zoom links, and the YouTube recordings: https://www.ri.cmu.edu/why-what-and-how-grad-school-applications/ We expect these sessions to benefit anyone from anywhere in the world who is interested in graduate school in robotics or a related area, especially those thinking about graduate school in the US. If you have any questions that you’d like faculty and grad students to talk about, make sure to add them here. submitted by /u/bart-ai [link] [comments]  ( 53 min )
    [Stable Baselines3] If I load a saved model and the use the .learn function will it pick up where it left off or will it be overwritten?
If the latter is the case, how do I update a saved model with more training? submitted by /u/AnonCaptain0022 [link] [comments]  ( 52 min )
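    For what it's worth: loading restores the saved weights, and a subsequent .learn() continues optimizing them rather than overwriting them from scratch. A minimal sketch, assuming an SB3 version paired with classic gym and a hypothetical save path:

```python
import gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")               # illustrative environment
model = PPO.load("ppo_cartpole", env=env)   # "ppo_cartpole" is a hypothetical path
# Continue training from the loaded weights; reset_num_timesteps=False keeps
# the timestep counter (and logging) running instead of restarting at zero.
model.learn(total_timesteps=50_000, reset_num_timesteps=False)
model.save("ppo_cartpole")                  # persist the further-trained model
```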
    Interesting problems in adaptive control and optimization
    Hi, is there a nice PhD level problem or direction in the intersection of optimization theory, adaptive or dual control theory and reinforcement learning? Looking for something with lots of potential for theoretical guarantees, some industrial application and computational tractability. Could also have some game theory based approach. submitted by /u/No_Difference9752 [link] [comments]  ( 66 min )
    Mixture of Deterministic Policy Gradient and Stochastic Policy Gradient
I am curious about the possibilities of combining those two kinds of policy gradients. Many works, such as Q-Prop and IPG, treat the mean value of the stochastic policy as an inherent deterministic policy, updating the policy with an additive combination of their corresponding losses. My first question is: is the mean value trustworthy as an auxiliary deterministic policy? It becomes crucial to connect the stochastic distribution and the deterministic output so as to make both worlds sound. Manipulations on either side, i.e. stochastic -> deterministic or deterministic -> stochastic, leave us room to try out new powerful update rules. Have you seen any other literature that makes an attempt? submitted by /u/OutOfCharm [link] [comments]  ( 54 min )
    Papers using PPO?
    Hi guys, I'm writing up my Master's thesis and I'm trying to justify why we picked PPO, so I'd like to cite some papers that use PPO to get successful behaviours/SOTA results. I thought this would be easy to search and find results but I'm struggling 😅 Do any of you know some you could link me to? Cheers submitted by /u/leozinho2r [link] [comments]  ( 59 min )
    Is I.I.D. assumption present for reinforcement learning?
Hello, For most learning algorithms, we make the I.I.D. (Independent and Identically Distributed) assumption for the dataset. This assumption is both reasonable and useful (https://ai.stackexchange.com/questions/10839/why-exactly-do-neural-networks-require-i-i-d-data). In deep RL, we learn from experience tuples of (s_t, a_t, r_t, s_(t+1)). During training, these tuples are sampled in batches. Are we making the I.I.D. assumption at this step? If yes, how do we defend it, because clearly there are system dynamics that control the transitions? Any discussion/pointer on the topic is much appreciated. Thanks! submitted by /u/CoffeeBean05 [link] [comments]  ( 51 min )
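    One way this plays out in practice: deep RL methods like DQN use a replay buffer sampled uniformly at random, which breaks the temporal correlation between consecutive transitions within a batch, even though the tuples are never truly i.i.d. (they still depend on the behavior policy and dynamics). A minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Uniform experience replay: decorrelates consecutive transitions."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=64):
        # Uniform sampling breaks the temporal ordering of transitions,
        # approximating the i.i.d. assumption within each training batch.
        return random.sample(self.buffer, batch_size)
```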
    ViZDoom has joined the Farama Foundation
    submitted by /u/jkterry1 [link] [comments]  ( 52 min )
  • Open

    Intelligent document processing with AWS AI and Analytics services in the insurance industry: Part 2
    In Part 1 of this series, we discussed intelligent document processing (IDP), and how IDP can accelerate claims processing use cases in the insurance industry. We discussed how we can use AWS AI services to accurately categorize claims documents along with supporting documents. We also discussed how to extract various types of documents in an […]  ( 11 min )
    Intelligent document processing with AWS AI services in the insurance industry: Part 1
    The goal of intelligent document processing (IDP) is to help your organization make faster and more accurate decisions by applying AI to process your paperwork. This two-part series highlights the AWS AI technologies that insurance companies can use to speed up their business processes. These AI technologies can be used across insurance use cases such […]  ( 9 min )
    Improving stability and flexibility of ML pipelines at Amazon Packaging Innovation with Amazon SageMaker Pipelines
    To delight customers and minimize packaging waste, Amazon must select the optimal packaging type for billions of packages shipped every year. If too little protection is used for a fragile item such as a coffee mug, the item will arrive damaged and Amazon risks their customer’s trust. Using too much protection will result in increased […]  ( 10 min )
  • Open

    How to speed up tensorflow model.predict()
I've been working on a pre-trained transformer model, which is now fine-tuned as well. The model is taking a lot of time to predict; is there any way I can expedite the whole prediction process? I've tried conversion to TFLite: using post-training quantization, the new models give completely different outputs compared to the original model's predictions, and accuracy drops to a great extent. I haven't tried pruning or quantization-aware training, as they require training the model again (which took a couple of days before). Anyone who has worked with TF models and knows how I can parallelize the prediction (I don't have GPU access as of now) or expedite it? submitted by /u/HairySail9036 [link] [comments]  ( 47 min )
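    Without retraining, two cheap CPU-side levers are often suggested: batch many inputs into a few large calls instead of looping predict() per sample, and wrap the forward pass in tf.function so the graph is traced once and reused. A minimal sketch with a stand-in model and a hypothetical input shape:

```python
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # stand-in for the fine-tuned model

@tf.function  # traced once for a given input shape, then runs as a compiled graph
def fast_forward(batch):
    return model(batch, training=False)

inputs = tf.random.uniform((1024, 128))  # hypothetical input shape
# A few large batched calls instead of one predict() per sample.
preds = tf.concat([fast_forward(chunk) for chunk in tf.split(inputs, 8)], axis=0)
```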
Best Books to Learn Neural Networks in 2022 for Beginners (Updated)
    submitted by /u/Lakshmireddys [link] [comments]  ( 42 min )
  • Open

    DALL·E API Now Available in Public Beta
    Starting today, developers can begin building apps with the DALL·E API.  ( 3 min )
  • Open

    Take the Green Train: NVIDIA BlueField DPUs Drive Data Center Efficiency
The numbers are in, and they paint a picture of data centers going a deeper shade of green, thanks to energy-efficient networks accelerated with data processing units (DPUs). A suite of tests run with help from Ericsson, RedHat and VMware shows power reductions of up to 24% on servers using NVIDIA BlueField-2 DPUs. In one case, Read article > The post Take the Green Train: NVIDIA BlueField DPUs Drive Data Center Efficiency appeared first on NVIDIA Blog.  ( 5 min )
    Unearthing Data: Vision AI Startup Digs Into Digital Twins for Mining and Construction
    Skycatch, a San Francisco-based startup, has been helping companies mine both data and minerals for nearly a decade. The software-maker is now digging into the creation of digital twins, with an initial focus on the mining and construction industry, using the NVIDIA Omniverse platform for connecting and building custom 3D pipelines.  ( 7 min )
    Check Out 26 New Games Streaming on GeForce NOW in November
    It’s a brand new month, which means this GFN Thursday is all about the new games streaming from the cloud. In November, 26 titles will join the GeForce NOW library. Kick off with 11 additions this week, like Total War: THREE KINGDOMS and new content updates for Genshin Impact and Apex Legends.  ( 6 min )
  • Open

    Newton’s method: The Good, The Bad, and The Ugly
    This post will give examples where Newton’s method gives good results, bad results, and really bad results. Our example problem is to solve Kepler’s equation M = E – e sin E for E, given M and e, assuming 0 ≤ M ≤ π and 0 < e < 1. We will apply Newton’s method […]  ( 6 min )
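    For Kepler’s equation the iteration is short to write down: with f(E) = E − e sin E − M and f′(E) = 1 − e cos E, Newton’s update is E ← E − f(E)/f′(E). A minimal sketch, using E = M as the starting guess:

        import math

        def solve_kepler(M, e, tol=1e-12, max_iter=50):
            # Solve M = E - e*sin(E) for E, assuming 0 <= M <= pi, 0 < e < 1.
            E = M  # reasonable initial guess in this range
            for _ in range(max_iter):
                # Newton step: f(E) / f'(E), with f'(E) = 1 - e*cos(E) > 0.
                step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))
                E -= step
                if abs(step) < tol:
                    break
            return E

        # Example: moderately eccentric orbit.
        print(solve_kepler(M=1.0, e=0.5))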
  • Open

    In machine learning, synthetic data can offer real performance improvements
    Models trained on synthetic data can be more accurate than other models in some cases, which could eliminate some privacy, copyright, and ethical concerns from using real data.  ( 8 min )
  • Open

    AdaMix: Mixture-of-Adaptations for Parameter-efficient Model Tuning. (arXiv:2210.17451v2 [cs.CL] UPDATED)
    Standard fine-tuning of large pre-trained language models (PLMs) for downstream tasks requires updating hundreds of millions to billions of parameters and storing a separate copy of the PLM weights for every task, resulting in increased costs for storing, sharing, and serving the models. To address this, parameter-efficient fine-tuning (PEFT) techniques were introduced, where small trainable components are injected into the PLM and updated during fine-tuning. We propose AdaMix as a general PEFT method that tunes a mixture of adaptation modules -- given the underlying PEFT method of choice -- introduced in each Transformer layer while keeping most of the PLM weights frozen. For instance, AdaMix can leverage a mixture of adapters like Houlsby or a mixture of low-rank decomposition matrices like LoRA to improve downstream task performance over the corresponding PEFT methods for fully supervised and few-shot NLU and NLG tasks. Further, we design AdaMix such that it matches the computational cost and the number of tunable parameters of the underlying PEFT method. By tuning only 0.1-0.2% of PLM parameters, we show that AdaMix outperforms SOTA parameter-efficient fine-tuning and full model fine-tuning for both NLU and NLG tasks.  ( 2 min )
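    The abstract does not spell out the routing mechanism, but one common reading of a mixture of LoRA-style adaptations is stochastic routing among several low-rank modules during training, with the modules merged by weight averaging at inference. A hypothetical PyTorch sketch of that reading for a single layer, not the paper's exact implementation:

        import torch
        import torch.nn as nn

        class MixtureOfLoRA(nn.Module):
            # Illustrative mixture of low-rank adaptation modules applied
            # alongside one frozen linear layer.
            def __init__(self, d_in, d_out, rank=8, n_adapters=4):
                super().__init__()
                self.down = nn.ParameterList(
                    [nn.Parameter(torch.randn(d_in, rank) * 0.01) for _ in range(n_adapters)])
                self.up = nn.ParameterList(
                    [nn.Parameter(torch.zeros(rank, d_out)) for _ in range(n_adapters)])

            def forward(self, x):
                if self.training:
                    # Route each step through one randomly chosen adapter,
                    # keeping the per-step cost of a single module.
                    i = torch.randint(len(self.down), (1,)).item()
                    return x @ self.down[i] @ self.up[i]
                # Merge adapters by averaging their weights for inference.
                down = torch.stack(list(self.down)).mean(0)
                up = torch.stack(list(self.up)).mean(0)
                return x @ down @ up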
    Better Best of Both Worlds Bounds for Bandits with Switching Costs. (arXiv:2206.03098v2 [cs.LG] UPDATED)
    We study best-of-both-worlds algorithms for bandits with switching costs, recently addressed by Rouyer, Seldin and Cesa-Bianchi, 2021. We introduce a surprisingly simple and effective algorithm that simultaneously achieves the minimax optimal regret bound of $\mathcal{O}(T^{2/3})$ in the oblivious adversarial setting and a bound of $\mathcal{O}(\min\{\log (T)/\Delta^2,T^{2/3}\})$ in the stochastically-constrained regime, both with (unit) switching costs, where $\Delta$ is the gap between the arms. In the stochastically constrained case, our bound improves over previous results due to Rouyer et al., which achieved regret of $\mathcal{O}(T^{1/3}/\Delta)$. We accompany our results with a lower bound showing that, in general, $\tilde{\Omega}(\min\{1/\Delta^2,T^{2/3}\})$ regret is unavoidable in the stochastically-constrained case for algorithms with $\mathcal{O}(T^{2/3})$ worst-case regret.  ( 2 min )
    Predicting phoneme-level prosody latents using AR and flow-based Prior Networks for expressive speech synthesis. (arXiv:2211.01327v1 [cs.SD])
    A large part of the expressive speech synthesis literature focuses on learning prosodic representations of the speech signal which are then modeled by a prior distribution during inference. In this paper, we compare different prior architectures at the task of predicting phoneme level prosodic representations extracted with an unsupervised FVAE model. We use both subjective and objective metrics to show that normalizing flow based prior networks can result in more expressive speech at the cost of a slight drop in quality. Furthermore, we show that the synthesized speech has higher variability, for a given text, due to the nature of normalizing flows. We also propose a Dynamical VAE model, that can generate higher quality speech although with decreased expressiveness and variability compared to the flow based models.  ( 2 min )
    A view on model misspecification in uncertainty quantification. (arXiv:2210.16938v2 [cs.LG] UPDATED)
    Estimating uncertainty of machine learning models is essential to assess the quality of the predictions that these models provide. However, there are several factors that influence the quality of uncertainty estimates, one of which is the amount of model misspecification. Model misspecification always exists as models are mere simplifications or approximations to reality. The question arises whether the estimated uncertainty under model misspecification is reliable or not. In this paper, we argue that model misspecification should receive more attention, by providing thought experiments and contextualizing these with relevant literature.  ( 2 min )
    Item-based Variational Auto-encoder for Fair Music Recommendation. (arXiv:2211.01333v1 [cs.IR])
    We present our solution for the EvalRS DataChallenge. The EvalRS DataChallenge aims to build a more realistic recommender system considering accuracy, fairness, and diversity in evaluation. Our proposed system is based on an ensemble between an item-based variational auto-encoder (VAE) and a Bayesian personalized ranking matrix factorization (BPRMF). To mitigate popularity bias, we use an item-based VAE for each popularity group with an additional fairness regularization. To make a reasonable recommendation even when the predictions are inaccurate, we combine the recommended list of BPRMF with that of the item-based VAE. Through the experiments, we demonstrate that the item-based VAE with fairness regularization significantly reduces popularity bias compared to the user-based VAE. The ensemble between the item-based VAE and BPRMF makes the top-1 item similar to the ground truth even when the predictions are inaccurate. Finally, we propose 'Coefficient Variance based Fairness' as a novel evaluation metric based on our reflections from the extensive experiments.  ( 2 min )
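    The paper's exact formula is not given in the abstract, but a coefficient-of-variation style fairness metric is straightforward to compute: the standard deviation over the mean of a per-group score, so that 0 means every popularity group is served equally well. A hypothetical sketch:

        import numpy as np

        def coefficient_variation_fairness(group_scores):
            # group_scores: per-popularity-group accuracies (e.g., hit rates).
            # Smaller std/mean means a more uniform, hence fairer, recommender.
            scores = np.asarray(group_scores, dtype=float)
            return scores.std() / scores.mean()

        # Example: hit rates for head / mid / tail popularity groups.
        print(coefficient_variation_fairness([0.42, 0.37, 0.11]))  # skewed
        print(coefficient_variation_fairness([0.31, 0.30, 0.29]))  # fairer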
    Comparative analysis of segmentation and generative models for fingerprint retrieval task. (arXiv:2209.06172v2 [cs.CV] UPDATED)
    Biometric authentication methods such as fingerprints have become an integral part of modern technology for the authentication and verification of users. They are pervasive in more ways than most of us are aware of. However, fingerprint images deteriorate in quality if the fingers are dirty, wet, or injured, or when sensors malfunction. Therefore, extricating the original fingerprint by removing the noise and inpainting the image to restore its structure is crucial for authentication. Hence, this paper proposes a deep learning approach to address these issues using generative (GAN) and segmentation models. A qualitative and quantitative comparison has been done between pix2pixGAN and cycleGAN (generative models) as well as U-Net (a segmentation model). To train the models, we meticulously created our own dataset, NFD (Noisy Fingerprint Dataset), with different backgrounds along with scratches in some images to make it more realistic and robust. In our research, the U-Net model performed better than the GAN networks.  ( 2 min )
    Classical versus Quantum: comparing Tensor Network-based Quantum Circuits on LHC data. (arXiv:2202.10471v2 [quant-ph] UPDATED)
    Tensor Networks (TN) are approximations of high-dimensional tensors designed to represent locally entangled quantum many-body systems efficiently. This study provides a comprehensive comparison between classical TNs and TN-inspired quantum circuits in the context of Machine Learning on highly complex, simulated LHC data. We show that classical TNs require exponentially large bond dimensions and higher Hilbert-space mapping to perform comparably to their quantum counterparts. While such an expansion in the dimensionality allows better performance, we observe that, with increased dimensionality, classical TNs lead to a highly flat loss landscape, rendering the usage of gradient-based optimization methods highly challenging. Furthermore, by employing quantitative metrics, such as the Fisher information and effective dimensions, we show that classical TNs require a more extensive training sample to represent the data as efficiently as TN-inspired quantum circuits. We also engage with the idea of hybrid classical-quantum TNs and show possible architectures to employ a larger phase-space from the data. We offer our results using three main TN ansätze: Tree Tensor Networks, Matrix Product States, and Multi-scale Entanglement Renormalisation Ansatz.  ( 2 min )
    Men Are Elected, Women Are Married: Events Gender Bias on Wikipedia. (arXiv:2106.01601v1 [cs.CL] CROSS LISTED)
    Human activities can be seen as sequences of events, which are crucial to understanding societies. Disproportional event distribution for different demographic groups can manifest and amplify social stereotypes, and potentially jeopardize the ability of members in some groups to pursue certain goals. In this paper, we present the first event-centric study of gender biases in a Wikipedia corpus. To facilitate the study, we curate a corpus of career and personal life descriptions with demographic information consisting of 7,854 fragments from 10,412 celebrities. Then we detect events with a state-of-the-art event detection model, calibrate the results using strategically generated templates, and extract events that have asymmetric associations with genders. Our study discovers that the Wikipedia pages tend to intermingle personal life events with professional events for females but not for males, which calls for the awareness of the Wikipedia community to formalize guidelines and train the editors to mind the implicit biases that contributors carry. Our work also lays the foundation for future works on quantifying and discovering event biases at the corpus level.  ( 2 min )
    Koopman Operator learning for Accelerating Quantum Optimization and Machine Learning. (arXiv:2211.01365v1 [quant-ph])
    Finding efficient optimization methods plays an important role for quantum optimization and quantum machine learning on near-term quantum computers. While backpropagation on classical computers is computationally efficient, obtaining gradients on quantum computers is not, because the computational complexity usually scales with the number of parameters and measurements. In this paper, we connect Koopman operator theory, which has been successful in predicting nonlinear dynamics, with natural gradient methods in quantum optimization. We propose a data-driven approach using Koopman operator learning to accelerate quantum optimization and quantum machine learning. We develop two new families of methods: the sliding window dynamic mode decomposition (DMD) and the neural DMD for efficiently updating parameters on quantum computers. We show that our methods can predict gradient dynamics on quantum computers and accelerate the variational quantum eigensolver used in quantum optimization, as well as quantum machine learning. We further implement our Koopman operator learning algorithm on a real IBM quantum computer and demonstrate its practical effectiveness.  ( 2 min )
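    Plain DMD is a least-squares fit of a linear operator to snapshot pairs, so the sliding-window idea can be sketched on a parameter trajectory in a few lines (the paper's exact windowing and its neural variant are not reproduced here):

        import numpy as np

        def sliding_window_dmd_step(history, window=10):
            # history: (T, d) array of past parameter vectors.
            recent = history[-(window + 1):]
            X = recent[:-1].T        # snapshots theta_t
            X_next = recent[1:].T    # shifted snapshots theta_{t+1}
            A = X_next @ np.linalg.pinv(X)  # least-squares DMD operator
            return A @ history[-1]          # one-step extrapolation

        # Toy example: a linearly converging trajectory.
        traj = np.array([0.9 ** t * np.ones(4) for t in range(30)])
        print(sliding_window_dmd_step(traj))  # approx 0.9**30 per entry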
    Predictive Crypto-Asset Automated Market Making Architecture for Decentralized Finance using Deep Reinforcement Learning. (arXiv:2211.01346v1 [q-fin.TR])
    The study proposes a quote-driven predictive automated market maker (AMM) platform with on-chain custody and settlement functions, alongside off-chain predictive reinforcement learning capabilities, to improve the liquidity provision of real-world AMMs. The proposed AMM architecture is an augmentation to Uniswap V3, a cryptocurrency AMM protocol, utilizing a novel market equilibrium pricing for reduced divergence and slippage loss. Further, the proposed architecture involves a predictive AMM capability, utilizing a deep hybrid Long Short-Term Memory (LSTM) and Q-learning reinforcement learning framework that aims to improve market efficiency through better forecasts of liquidity concentration ranges, so that liquidity starts moving to expected concentration ranges prior to asset price movements, improving liquidity utilization. The augmented protocol framework is expected to have practical real-world implications, by (i) reducing divergence loss for liquidity providers, (ii) reducing slippage for crypto-asset traders, and (iii) improving capital efficiency of liquidity provision for the AMM protocol. To the best of our knowledge, no known protocol or literature proposes a similar deep learning-augmented AMM that achieves comparable capital efficiency and loss minimization objectives for practical real-world applications.  ( 2 min )
    Application of image-to-image translation in improving pedestrian detection. (arXiv:2209.03625v2 [cs.CV] UPDATED)
    The lack of effective target regions makes it difficult to perform several visual functions in low-intensity light, including pedestrian recognition and image-to-image translation. In this situation, by accumulating high-quality information through the combined use of infrared and visible images, it is possible to detect pedestrians even in low light. In this study we use advanced deep learning models, pix2pixGAN and YOLOv7, on the LLVIP dataset, which contains visible-infrared image pairs for low-light vision. This dataset contains 33672 images, most of which were captured in dark scenes and tightly synchronized in time and location.  ( 2 min )
    COVID-19 detection using chest X-rays: is lung segmentation important for generalization?. (arXiv:2104.06176v3 [eess.IV] UPDATED)
    Purpose: we evaluated the generalization capability of deep neural networks (DNNs), trained to classify chest X-rays as Covid-19, normal or pneumonia, using a relatively small and mixed dataset. Methods: we proposed a DNN to perform lung segmentation and classification, stacking a segmentation module (U-Net), an original intermediate module and a classification module (DenseNet201). To evaluate generalization, we tested the DNN with an external dataset (from distinct localities) and used Bayesian inference to estimate probability distributions of performance metrics. Results: our DNN achieved 0.917 AUC on the external test dataset, and a DenseNet without segmentation, 0.906. Bayesian inference indicated mean accuracy of 76.1% and [0.695, 0.826] 95% HDI (highest density interval, which concentrates 95% of the metric's probability mass) with segmentation and, without segmentation, 71.7% and [0.646, 0.786]. Conclusion: employing a novel DNN evaluation technique, which uses LRP and Brixia scores, we discovered that areas where radiologists found strong Covid-19 symptoms are the most important for the stacked DNN classification. External validation showed smaller accuracies than internal, indicating difficulty in generalization, which is positively affected by segmentation. Finally, the performance in the external dataset and the analysis with LRP suggest that DNNs can be trained in small and mixed datasets and still successfully detect Covid-19.  ( 3 min )
    Quasi-Newton Steps for Efficient Online Exp-Concave Optimization. (arXiv:2211.01357v1 [math.OC])
    The aim of this paper is to design computationally-efficient and optimal algorithms for the online and stochastic exp-concave optimization settings. Typical algorithms for these settings, such as the Online Newton Step (ONS), can guarantee a $O(d\ln T)$ bound on their regret after $T$ rounds, where $d$ is the dimension of the feasible set. However, such algorithms perform so-called generalized projections whenever their iterates step outside the feasible set. Such generalized projections require $\Omega(d^3)$ arithmetic operations even for simple sets such as a Euclidean ball, making the total runtime of ONS of order $d^3 T$ after $T$ rounds, in the worst case. In this paper, we side-step generalized projections by using a self-concordant barrier as a regularizer to compute the Newton steps. This ensures that the iterates are always within the feasible set without requiring projections. This approach still requires the computation of the inverse of the Hessian of the barrier at every step. However, using the stability properties of the Newton steps, we show that the inverses of the Hessians can be efficiently approximated via Taylor expansions for most rounds, resulting in a $O(d^2 T +d^\omega \sqrt{T})$ total computational complexity, where $\omega$ is the exponent of matrix multiplication. In the stochastic setting, we show that this translates into a $O(d^3/\epsilon)$ computational complexity for finding an $\epsilon$-suboptimal point, answering an open question by Koren 2013. We first show these new results for the simple case where the feasible set is a Euclidean ball. Then, to move to general convex sets, we use a reduction to Online Convex Optimization over the Euclidean ball. Our final algorithm can be viewed as a more efficient version of ONS.  ( 3 min )
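    For context, the classical ONS baseline that the paper accelerates maintains A_t = A_{t-1} + g_t g_t^T and steps against A_t^{-1} g_t before projecting back into the feasible set. A simplified sketch over a Euclidean ball, with a plain Euclidean projection standing in for ONS's generalized projection (the expensive step this paper avoids):

        import numpy as np

        def online_newton_step(grads, d, radius=1.0, gamma=1.0, eps=1.0):
            A = eps * np.eye(d)   # regularized second-moment matrix
            x = np.zeros(d)
            for g in grads:
                A += np.outer(g, g)                            # A_t update
                x = x - (1.0 / gamma) * np.linalg.solve(A, g)  # Newton step
                norm = np.linalg.norm(x)
                if norm > radius:
                    # Simplification: ONS proper projects in the A_t-norm,
                    # which is what costs Omega(d^3) per round.
                    x *= radius / norm
                yield x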
    Isometric Representations in Neural Networks Improve Robustness. (arXiv:2211.01236v1 [cs.LG])
    Artificial and biological agents cannot learn from completely random and unstructured data. The structure of data is encoded in the metric relationships between data points. In the context of neural networks, neuronal activity within a layer forms a representation reflecting the transformation that the layer implements on its inputs. In order to utilize the structure in the data in a truthful manner, such representations should reflect the input distances and thus be continuous and isometric. Supporting this statement, recent findings in neuroscience propose that generalization and robustness are tied to neural representations being continuously differentiable. In machine learning, most algorithms lack robustness and are generally thought to rely on aspects of the data that differ from those that humans use, as is commonly seen in adversarial attacks. During cross-entropy classification, the metric and structural properties of network representations are usually broken both between and within classes. This side effect from training can lead to instabilities under perturbations near locations where such structure is not preserved. One of the standard solutions to obtain robustness is to add ad hoc regularization terms, but to our knowledge, forcing representations to preserve the metric structure of the input data as a stabilising mechanism has not yet been studied. In this work, we train neural networks to perform classification while simultaneously maintaining within-class metric structure, leading to isometric within-class representations. Such network representations turn out to be beneficial for accurate and robust inference. By stacking layers with this property we create a network architecture that facilitates hierarchical manipulation of internal neural representations. Finally, we verify that isometric regularization improves the robustness to adversarial attacks on MNIST.  ( 3 min )
    Approximate Cross-Validation with Low-Rank Data in High Dimensions. (arXiv:2008.10547v2 [stat.ML] UPDATED)
    Many recent advances in machine learning are driven by a challenging trifecta: large data size $N$; high dimensions; and expensive algorithms. In this setting, cross-validation (CV) serves as an important tool for model assessment. Recent advances in approximate cross validation (ACV) provide accurate approximations to CV with only a single model fit, avoiding traditional CV's requirement for repeated runs of expensive algorithms. Unfortunately, these ACV methods can lose both speed and accuracy in high dimensions -- unless sparsity structure is present in the data. Fortunately, there is an alternative type of simplifying structure that is present in most data: approximate low rank (ALR). Guided by this observation, we develop a new algorithm for ACV that is fast and accurate in the presence of ALR data. Our first key insight is that the Hessian matrix -- whose inverse forms the computational bottleneck of existing ACV methods -- is ALR. We show that, despite our use of the \emph{inverse} Hessian, a low-rank approximation using the largest (rather than the smallest) matrix eigenvalues enables fast, reliable ACV. Our second key insight is that, in the presence of ALR data, error in existing ACV methods roughly grows with the (approximate, low) rank rather than with the (full, high) dimension. These insights allow us to prove theoretical guarantees on the quality of our proposed algorithm -- along with fast-to-compute upper bounds on its error. We demonstrate the speed and accuracy of our method, as well as the usefulness of our bounds, on a range of real and simulated data sets.  ( 3 min )
    3DPG: Distributed Deep Deterministic Policy Gradient Algorithms for Networked Multi-Agent Systems. (arXiv:2201.00570v2 [cs.LG] UPDATED)
    We present Distributed Deep Deterministic Policy Gradient (3DPG), a multi-agent actor-critic (MAAC) algorithm for Markov games. Unlike previous MAAC algorithms, 3DPG is fully distributed during both training and deployment. 3DPG agents calculate local policy gradients based on the most recently available local data (states, actions) and local policies of other agents. During training, this information is exchanged using a potentially lossy and delaying communication network. The network therefore induces Age of Information (AoI) for data and policies. We prove the asymptotic convergence of 3DPG even in the presence of potentially unbounded AoI. This provides an important step towards practical online and distributed multi-agent learning, since 3DPG does not assume information to be available deterministically. We analyze 3DPG in the presence of policy and data transfer under mild practical assumptions. Our analysis shows that 3DPG agents converge to a local Nash equilibrium of Markov games in terms of utility functions expressed as the expected value of the agents' local approximate action-value functions (Q-functions). The expectations of the local Q-functions are with respect to limiting distributions over the global state-action space shaped by the agents' accumulated local experiences. Our results also shed light on the policies obtained by general MAAC algorithms. We show through a heuristic argument and numerical experiments that 3DPG improves convergence over previous MAAC algorithms that use old actions instead of old policies during training. Further, we show that 3DPG is robust to AoI; it learns competitive policies even with large AoI and low data availability.  ( 3 min )
    A Simple and Optimal Policy Design with Safety against Heavy-tailed Risk for Stochastic Bandits. (arXiv:2206.02969v4 [stat.ML] UPDATED)
    We design new policies that ensure both worst-case optimality for expected regret and light-tailed risk for the regret distribution in the stochastic multi-armed bandit problem. Recently, arXiv:2109.13595 showed that information-theoretically optimized bandit algorithms as well as standard UCB policies suffer from some serious heavy-tailed risk. Inspired by their results, we further show that heavy-tailed risk actually exists for all "instance-dependent consistent" policies. In particular, any policy that incurs an instance-dependent $O(\ln T)$ expected regret must incur a linear regret with probability $\Omega(\text{poly}(1/T))$. With the aim of ensuring safety against such heavy-tailed risk, starting from the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an optimal exponential rate $\exp(-\Omega(\sqrt{T}))$. Next, we improve the policy design and analysis to the general $K$-armed bandit setting. Specifically, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. We also enhance the policy design to accommodate the "any-time" setting where $T$ is not known a priori. A brief set of numerical experiments is presented to illustrate the theoretical findings. We conclude by extending our proposed policy design to the general stochastic linear bandit setting and obtaining a light-tailed regret bound. Our results reveal insights into the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk on the regret distribution are compatible.  ( 3 min )
    Data Assimilation Networks. (arXiv:2010.09694v2 [cs.LG] UPDATED)
    Data assimilation (DA) aims at forecasting the state of a dynamical system by combining a mathematical representation of the system with noisy observations, taking into account their uncertainties. State-of-the-art methods are based on Gaussian error statistics and the linearization of the non-linear dynamics, which may lead to sub-optimal methods. In this respect, there are still open questions about how to improve these methods. In this paper, we propose a fully data-driven deep learning architecture generalizing recurrent Elman networks and data assimilation algorithms, which approximates a sequence of prior and posterior densities conditioned on noisy observations. By construction, our approach can be used for general nonlinear dynamics and non-Gaussian densities. In numerical experiments based on the well-known Lorenz-95 system and with Gaussian error statistics, our architecture achieves performance comparable to EnKF on both the analysis and the propagation of probability density functions of the system state at a given time, without using any explicit regularization technique.  ( 2 min )
    Bayesian Model Selection of Lithium-Ion Battery Models via Bayesian Quadrature. (arXiv:2210.17299v2 [stat.ME] UPDATED)
    This paper presents a Bayesian model selection approach via Bayesian quadrature and sensitivity analysis of the selection criterion for a lithium-ion battery model. The Bayesian model evidence is adopted as the metric, which can select the simplest but well-describing model based on Occam's razor principle. While the model evidence requires prohibitive integral computations over parameter space, Bayesian quadrature offers sample-efficient integration via model-based inference to minimise the number of battery model evaluations. The posterior distribution of battery model parameters can also be inferred as a byproduct in one go, which is also beneficial in creating a digital twin. The simplest lithium-ion battery models, equivalent circuit models, were used to analyse the sensitivity of the selection criterion at given different datasets and model configurations. We show that popular selection criteria, such as root-mean-square error, and Bayesian information criterion, can fail to select a correct model in a multimodal posterior case. The model evidence can spot the true model in such cases, simultaneously providing the variance of evidence inference itself as an indication of confidence. Bayesian quadrature can compute the evidence faster than popular MCMC solvers.  ( 2 min )
    Learning Debiased Classifier with Biased Committee. (arXiv:2206.10843v3 [cs.LG] UPDATED)
    Neural networks are prone to be biased towards spurious correlations between classes and latent attributes exhibited in a major portion of training data, which ruins their generalization capability. We propose a new method for training debiased classifiers with no spurious attribute label. The key idea is to employ a committee of classifiers as an auxiliary module that identifies bias-conflicting data, i.e., data without spurious correlation, and assigns large weights to them when training the main classifier. The committee is learned as a bootstrapped ensemble so that a majority of its classifiers are biased as well as being diverse, and intentionally fail to predict classes of bias-conflicting data accordingly. The consensus within the committee on prediction difficulty thus provides a reliable cue for identifying and weighting bias-conflicting data. Moreover, the committee is also trained with knowledge transferred from the main classifier so that it gradually becomes debiased along with the main classifier and emphasizes more difficult data as training progresses. On five real-world datasets, our method outperforms prior arts using no spurious attribute label like ours and even surpasses those relying on bias labels occasionally.  ( 2 min )
    Model-Based Reinforcement Learning for Stochastic Hybrid Systems. (arXiv:2111.06211v2 [eess.SY] UPDATED)
    Optimal control of general nonlinear systems is a central challenge in automation. Enabled by powerful function approximators, data-driven approaches to control have recently successfully tackled challenging robotic applications. However, such methods often obscure the structure of dynamics and control behind black-box over-parameterized representations, thus limiting our ability to understand closed-loop behavior. This paper adopts a hybrid-system view of nonlinear modeling and control that lends an explicit hierarchical structure to the problem and breaks down complex dynamics into simpler localized units. We consider a sequence modeling paradigm that captures the temporal structure of the data and derive an expectation-maximization (EM) algorithm that automatically decomposes nonlinear dynamics into stochastic piecewise affine dynamical systems with nonlinear boundaries. Furthermore, we show that these time-series models naturally admit a closed-loop extension that we use to extract local polynomial feedback controllers from nonlinear experts via behavioral cloning. Finally, we introduce a novel hybrid relative entropy policy search (Hb-REPS) technique that incorporates the hierarchical nature of hybrid systems and optimizes a set of time-invariant local feedback controllers derived from a local polynomial approximation of a global state-value function.  ( 2 min )
    Task-Oriented Over-the-Air Computation for Multi-Device Edge AI. (arXiv:2211.01255v1 [cs.IT])
    Departing from the classic paradigm of data-centric designs, 6G networks supporting edge AI feature task-oriented techniques that focus on the effective and efficient execution of AI tasks. Targeting end-to-end system performance, such techniques are sophisticated, as they aim to seamlessly integrate sensing (data acquisition), communication (data transmission), and computation (data processing). Aligned with this paradigm shift, a task-oriented over-the-air computation (AirComp) scheme is proposed in this paper for a multi-device split-inference system. In the considered system, local feature vectors, which are extracted from the real-time noisy sensory data on devices, are aggregated over-the-air by exploiting the waveform superposition in a multiuser channel. Then the aggregated features, as received at a server, are fed into an inference model, with the result used for decision making or control of actuators. To design inference-oriented AirComp, the transmit precoders at edge devices and the receive beamforming at the edge server are jointly optimized to rein in the aggregation error and maximize the inference accuracy. The problem is made tractable by measuring the inference accuracy using a surrogate metric called discriminant gain, which measures the discernibility of two object classes in the application of object/event classification. It is discovered that the conventional AirComp beamforming design, which minimizes the mean square error of generic AirComp with respect to the noiseless case, may not lead to optimal classification accuracy. The reason is that feature dimensions have different sensitivities to aggregation errors and are thus of different levels of importance for classification. This issue is addressed in this work via a new task-oriented AirComp scheme designed by directly maximizing the derived discriminant gain.  ( 3 min )
    Scalable Gaussian Process Hyperparameter Optimization via Coverage Regularization. (arXiv:2209.11280v2 [cs.LG] UPDATED)
    Gaussian processes (GPs) are Bayesian non-parametric models popular in a variety of applications due to their accuracy and native uncertainty quantification (UQ). Tuning GP hyperparameters is critical to ensure the validity of prediction accuracy and uncertainty; uniquely estimating multiple hyperparameters in, e.g. the Matern kernel can also be a significant challenge. Moreover, training GPs on large-scale datasets is a highly active area of research: traditional maximum likelihood hyperparameter training requires quadratic memory to form the covariance matrix and has cubic training complexity. To address the scalable hyperparameter tuning problem, we present a novel algorithm which estimates the smoothness and length-scale parameters in the Matern kernel in order to improve robustness of the resulting prediction uncertainties. Using novel loss functions similar to those in conformal prediction algorithms in the computational framework provided by the hyperparameter estimation algorithm MuyGPs, we achieve improved UQ over leave-one-out likelihood maximization while maintaining a high degree of scalability as demonstrated in numerical experiments.  ( 2 min )
    Fast, accurate, and transferable many-body interatomic potentials by symbolic regression. (arXiv:1904.01095v3 [physics.comp-ph] CROSS LISTED)
    The length and time scales of atomistic simulations are limited by the computational cost of the methods used to predict material properties. In recent years there has been great progress in the use of machine learning algorithms to develop fast and accurate interatomic potential models, but it remains a challenge to develop models that generalize well and are fast enough to be used at extreme time and length scales. To address this challenge, we have developed a machine learning algorithm based on symbolic regression in the form of genetic programming that is capable of discovering accurate, computationally efficient many-body potential models. The key to our approach is to explore a hypothesis space of models based on fundamental physical principles and select models within this hypothesis space based on their accuracy, speed, and simplicity. The focus on simplicity reduces the risk of overfitting the training data and increases the chances of discovering a model that generalizes well. Our algorithm was validated by rediscovering an exact Lennard-Jones potential and a Sutton Chen embedded atom method potential from training data generated using these models. By using training data generated from density functional theory calculations, we found potential models for elemental copper that are simple, as fast as embedded atom models, and capable of accurately predicting properties outside of their training set. Our approach requires relatively small sets of training data, making it possible to generate training data using highly accurate methods at a reasonable computational cost. We present our approach, the forms of the discovered models, and assessments of their transferability, accuracy and speed.  ( 3 min )
    High-Resolution Peak Demand Estimation Using Generalized Additive Models and Deep Neural Networks. (arXiv:2203.03342v2 [cs.LG] UPDATED)
    This paper covers predicting high-resolution electricity peak demand features given lower-resolution data. This is a relevant setup as it answers whether limited higher-resolution monitoring helps to estimate future high-resolution peak loads when the high-resolution data is no longer available. That question is particularly interesting for network operators considering replacing high-resolution monitoring predictive models due to economic considerations. We propose models to predict half-hourly minima and maxima of high-resolution (every minute) electricity load data while model inputs are of a lower resolution (30 minutes). We combine predictions of generalized additive models (GAM) and deep artificial neural networks (DNN), which are popular in load forecasting. We extensively analyze the prediction models, including the input parameters' importance, focusing on load, weather, and seasonal effects. The proposed method won a data competition organized by Western Power Distribution, a British distribution network operator. In addition, we provide a rigorous evaluation study that goes beyond the competition frame to analyze the models' robustness. The results show that the proposed methods are superior to the competition benchmark concerning the out-of-sample root mean squared error (RMSE). This holds regarding the competition month and the supplementary evaluation study, which covers an additional eleven months. Overall, our proposed model combination reduces the out-of-sample RMSE by 57.4\% compared to the benchmark.  ( 3 min )
    Analysis and Optimization of GNN-Based Recommender Systems on Persistent Memory. (arXiv:2207.11918v2 [cs.IR] UPDATED)
    Graph neural networks (GNNs), which have emerged as an effective method for handling machine learning tasks on graphs, bring a new approach to building recommender systems, where the task of recommendation can be formulated as the link prediction problem on user-item bipartite graphs. Training GNN-based recommender systems (GNNRecSys) on large graphs incurs a large memory footprint, easily exceeding the DRAM capacity on a typical server. Existing solutions resort to distributed subgraph training, which is inefficient due to the high cost of dynamically constructing subgraphs and significant redundancy across subgraphs. The emerging persistent memory technologies provide a significantly larger memory capacity than DRAMs at an affordable cost, making single-machine GNNRecSys training feasible, which eliminates the inefficiencies in distributed training. One major concern of using persistent memory devices for GNNRecSys is their relatively low bandwidth compared with DRAMs. This limitation can be particularly detrimental to achieving high performance for GNNRecSys workloads since their dominant compute kernels are sparse and memory access intensive. To understand whether persistent memory is a good fit for GNNRecSys training, we perform an in-depth characterization of GNNRecSys workloads and a comprehensive analysis of their performance on a persistent memory device, namely, Intel Optane. Based on the analysis, we provide guidance on how to configure Optane for GNNRecSys workloads. Furthermore, we present techniques for large-batch training to fully realize the advantages of single-machine GNNRecSys training. Our experiment results show that with the tuned batch size and optimal system configuration, Optane-based single-machine GNNRecSys training outperforms distributed training by a large margin, especially when handling deep GNN models.  ( 3 min )
    Nonlinear Transform Source-Channel Coding for Semantic Communications. (arXiv:2112.10961v3 [cs.IT] UPDATED)
    In this paper, we propose a class of high-efficiency deep joint source-channel coding methods that can closely adapt to the source distribution under a nonlinear transform; these can be collected under the name nonlinear transform source-channel coding (NTSCC). In the considered model, the transmitter first learns a nonlinear analysis transform to map the source data into latent space, then transmits the latent representation to the receiver via deep joint source-channel coding. Our model incorporates the nonlinear transform as a strong prior to effectively extract the source semantic features and provide side information for source-channel coding. Unlike existing conventional deep joint source-channel coding methods, the proposed NTSCC essentially learns both the source latent representation and an entropy model as the prior on the latent representation. Accordingly, novel adaptive rate transmission and hyperprior-aided codec refinement mechanisms are developed to upgrade deep joint source-channel coding. The whole system design is formulated as an optimization problem whose goal is to minimize the end-to-end transmission rate-distortion performance under established perceptual quality metrics. Across test image sources with various resolutions, we find that the proposed NTSCC transmission method generally outperforms both the analog transmission using the standard deep joint source-channel coding and the classical separation-based digital transmission. Notably, the proposed NTSCC method can potentially support future semantic communications due to its content-aware ability and perceptual optimization goal.  ( 3 min )
    Topology-aware Graph Neural Networks for Learning Feasible and Adaptive ac-OPF Solutions. (arXiv:2205.10129v2 [eess.SY] UPDATED)
    Solving the optimal power flow (OPF) problem is a fundamental task to ensure the system efficiency and reliability in real-time electricity grid operations. We develop a new topology-informed graph neural network (GNN) approach for predicting the optimal solutions of real-time ac-OPF problem. To incorporate grid topology to the NN model, the proposed GNN-for-OPF framework innovatively exploits the locality property of locational marginal prices and voltage magnitude. Furthermore, we develop a physics-aware (ac-)flow feasibility regularization approach for general OPF learning. The advantages of our proposed designs include reduced model complexity, improved generalizability and feasibility guarantees. By providing the analytical understanding on the graph subspace stability under grid topology contingency, we show the proposed GNN can quickly adapt to varying grid topology by an efficient re-training strategy. Numerical tests on various test systems of different sizes have validated the prediction accuracy, improved flow feasibility, and topology adaptivity capability of our proposed GNN-based learning framework.  ( 2 min )
    Does GNN Pretraining Help Molecular Representation?. (arXiv:2207.06010v2 [cs.LG] UPDATED)
    Extracting informative representations of molecules using graph neural networks (GNNs) is crucial in AI-driven drug discovery. Recently, the graph research community has been trying to replicate the success of self-supervised pretraining in natural language processing, with several successes claimed. However, we find the benefit brought by self-supervised pretraining on small molecular data can be negligible in many cases. We conduct thorough ablation studies on the key components of GNN pretraining, including pretraining objectives, data splitting methods, input features, pretraining dataset scales, and GNN architectures, to see how they affect the accuracy of the downstream tasks. Our first important finding is that self-supervised graph pretraining does not always have statistically significant advantages over non-pretraining methods in many settings. Secondly, although noticeable improvement can be observed with additional supervised pretraining, the improvement may diminish with richer features or more balanced data splits. Thirdly, hyper-parameters can have larger impacts on the accuracy of downstream tasks than the choice of pretraining tasks, especially when the scales of the downstream tasks are small. Finally, we conjecture that the complexity of some pretraining methods on small molecules might be insufficient, and support this with empirical evidence on different pretraining datasets.  ( 2 min )
    Neural Graph Matching for Modification Similarity Applied to Electronic Document Comparison. (arXiv:2204.05486v2 [cs.CV] UPDATED)
    In this paper, we present a novel neural graph matching approach applied to document comparison. Document comparison is a common task in the legal and financial industries. In some cases, the most important differences may be the addition or omission of words, sentences, clauses, or paragraphs. However, it is a challenging task without a record or trace of the whole editing process. Under many temporal uncertainties, we explore the potential of our approach to approximate an accurate comparison and determine which element blocks are related by edits. To begin, we apply a document layout analysis that combines traditional and modern techniques to segment the layout into blocks of various types appropriately. We then transform this issue into a problem of layout graph matching with textual awareness. Graph matching is a long-studied problem with a broad range of applications. However, unlike previous works focusing on visual images or structural layout, we also bring textual features into our model to adapt it to this domain. Specifically, based on the electronic document, we introduce an encoder that handles the visual presentation decoded from PDF. Additionally, because modifications can make document layout analysis inconsistent across modified documents, and blocks can be merged and split, Sinkhorn divergence is adopted in our graph neural approach, which tries to overcome both of these issues with many-to-many block matching. We demonstrate this on two categories of layouts, legal agreements and scientific articles, collected from our real-case datasets.  ( 3 min )
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v4 [stat.ML] UPDATED)
    Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss over general classification rules and provide tight performance guarantees at learning. We show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.  ( 2 min )
    A Bayesian Learning, Greedy agglomerative clustering approach and evaluation techniques for Author Name Disambiguation Problem. (arXiv:2211.01303v1 [cs.DL])
    Author names often suffer from ambiguity, owing to the same author appearing under different names and multiple authors possessing similar names. This creates difficulty in associating a scholarly work with the person who wrote it, thereby introducing inaccuracy in credit attribution, bibliometric analysis, search-by-author in digital libraries, and expert discovery. A plethora of techniques for the disambiguation of author names has been proposed in the literature. I focus on the research efforts targeted at disambiguating author names. I first go through the conventional methods, then I discuss evaluation techniques and the clustering model, which finally leads to the Bayesian learning and greedy agglomerative approach. I believe this concentrated review will be useful for the research community because it discusses techniques applied to a very large real database that is actively used worldwide. The Bayesian and greedy agglomerative approaches will help to tackle AND (author name disambiguation) problems in a better way. Finally, I outline a few directions for future work.  ( 2 min )
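    As a generic reference point, the greedy agglomerative scheme in this line of work can be sketched as: start with each record as its own cluster and repeatedly merge the most similar pair until no similarity exceeds a threshold, where the similarity score would come from the learned (e.g., Bayesian) model. An illustrative sketch, not the paper's exact procedure:

        def greedy_agglomerative(records, similarity, threshold=0.8):
            # similarity(cluster_a, cluster_b) -> score from the learned model.
            clusters = [[r] for r in records]
            while True:
                best, pair = threshold, None
                for i in range(len(clusters)):
                    for j in range(i + 1, len(clusters)):
                        s = similarity(clusters[i], clusters[j])
                        if s > best:
                            best, pair = s, (i, j)
                if pair is None:
                    return clusters  # no pair above threshold remains
                i, j = pair
                clusters[i].extend(clusters.pop(j))  # greedy merge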
    An optimal control perspective on diffusion-based generative modeling. (arXiv:2211.01364v1 [cs.LG])
    We establish a connection between stochastic optimal control and generative models based on stochastic differential equations (SDEs), such as recently developed diffusion probabilistic models. In particular, we derive a Hamilton-Jacobi-Bellman equation that governs the evolution of the log-densities of the underlying SDE marginals. This perspective allows us to transfer methods from optimal control theory to generative modeling. First, we show that the evidence lower bound is a direct consequence of the well-known verification theorem from control theory. Further, we develop a novel diffusion-based method for sampling from unnormalized densities -- a problem frequently occurring in statistics and computational sciences.  ( 2 min )
    Understanding A Class of Decentralized and Federated Optimization Algorithms: A Multi-Rate Feedback Control Perspective. (arXiv:2204.12663v2 [cs.LG] UPDATED)
    Distributed algorithms have been playing an increasingly important role in many applications such as machine learning, signal processing, and control. Significant research efforts have been devoted to developing and analyzing new algorithms for various applications. In this work, we provide a fresh perspective to understand, analyze, and design distributed optimization algorithms. Through the lens of multi-rate feedback control, we show that a wide class of distributed algorithms, including popular decentralized/federated schemes such as decentralized gradient descent, gradient tracking, and federated averaging, can be viewed as discretizing a certain continuous-time feedback control system, possibly with multiple sampling rates. This key observation allows us to develop a generic framework to analyze the convergence of the entire algorithm class. More importantly, it also leads to an interesting way of designing new distributed algorithms. We develop the theory behind our framework and provide examples to highlight how the framework can be used in practice.  ( 2 min )
    Improving Named Entity Recognition in Telephone Conversations via Effective Active Learning with Human in the Loop. (arXiv:2211.01354v1 [cs.CL])
    Telephone transcription data can be very noisy due to speech recognition errors, disfluencies, etc. Not only is annotating such data very challenging for annotators, but such data may also contain many annotation errors even after the annotation job is completed, resulting in very poor model performance. In this paper, we present an active learning framework that leverages human-in-the-loop learning to identify samples in the annotated dataset that are more likely to contain annotation errors and should be re-annotated. In this way, we largely reduce the need to re-annotate the whole dataset. We conduct extensive experiments with our proposed approach for Named Entity Recognition and observe that by re-annotating only about 6% of the training instances in the whole dataset, the F1 score for a certain entity type can be significantly improved, by about 25%.  ( 2 min )
    Pop2Piano : Pop Audio-based Piano Cover Generation. (arXiv:2211.00895v1 [cs.SD])
    The piano cover of pop music is widely enjoyed by people. However, the generation task of the pop piano cover is still understudied. This is partly due to the lack of synchronized {Pop, Piano Cover} data pairs, which made it challenging to apply the latest data-intensive deep learning-based methods. To leverage the power of the data-driven approach, we make a large amount of paired and synchronized {pop, piano cover} data using an automated pipeline. In this paper, we present Pop2Piano, a Transformer network that generates piano covers given waveforms of pop music. To the best of our knowledge, this is the first model to directly generate a piano cover from pop audio without melody and chord extraction modules. We show that Pop2Piano trained with our dataset can generate plausible piano covers.
    It's DONE: Direct ONE-shot learning with quantile weight imprinting. (arXiv:2204.13361v3 [cs.LG] UPDATED)
    Learning a new concept from one example is a remarkable capability of the human brain, and it is drawing attention in the field of machine learning as the one-shot learning task. In this paper, we propose one of the simplest methods for this task, with a nonparametric weight imprinting, named Direct ONE-shot learning (DONE). DONE adds new classes to a pretrained deep neural network (DNN) classifier with neither training optimization nor modification of the pretrained DNN. DONE is inspired by Hebbian theory and directly uses the final-dense-layer input activity obtained from data belonging to the new class as the synaptic weights of a newly provided output neuron for that class, transforming all statistical properties of the neural activity into those of synaptic weights by quantile normalization. DONE requires just one inference to learn a new concept, and its procedure is simple and deterministic, requiring no parameter tuning or hyperparameters. DONE overcomes a severe problem of existing weight imprinting methods, which interfere, in a DNN-dependent manner, with the classification of original-class images. The performance of DONE depends entirely on the pretrained DNN model used as a backbone model, and we confirmed that DONE with current well-trained backbone models performs at decent accuracy.
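    Read literally, the imprinting step is only a few lines: take the final-dense-layer input activity for the single new-class example, quantile-normalize it against the distribution of the existing classifier weights, and append it as the new output row. A hedged numpy sketch of that reading:

        import numpy as np

        def quantile_normalize(source, template):
            # Replace each value in `source` with the template-distribution
            # quantile at its empirical rank.
            ranks = np.argsort(np.argsort(source)) / (source.size - 1)
            return np.quantile(template, ranks)

        def imprint_new_class(W, activation):
            # W: (num_classes, feature_dim) final dense-layer weights.
            # activation: features of the one new-class example.
            new_row = quantile_normalize(activation, W.flatten())
            return np.vstack([W, new_row])  # classifier gains one class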
    COIN: Co-Cluster Infomax for Bipartite Graphs. (arXiv:2206.00006v2 [cs.LG] UPDATED)
    Bipartite graphs are powerful data structures to model interactions between two types of nodes, which have been used in a variety of applications, such as recommender systems, information retrieval, and drug discovery. A fundamental challenge for bipartite graphs is how to learn informative node embeddings. Despite the success of recent self-supervised learning methods on bipartite graphs, their objectives discriminate instance-wise positive and negative node pairs, which could contain cluster-level errors. In this paper, we introduce a novel co-cluster infomax (COIN) framework, which captures the cluster-level information by maximizing the mutual information of co-clusters. Different from previous infomax methods, which estimate mutual information with neural networks, COIN can easily calculate mutual information. Besides, COIN is an end-to-end co-clustering method which can be trained jointly with other objective functions and optimized via back-propagation. Furthermore, we also provide theoretical analysis for COIN. We theoretically prove that COIN is able to effectively increase the mutual information of node embeddings and that COIN is upper-bounded by the prior distributions of nodes. We extensively evaluate the proposed COIN framework on various benchmark datasets and tasks to demonstrate the effectiveness of COIN.
    Improving Scheduled Sampling with Elastic Weight Consolidation for Neural Machine Translation. (arXiv:2109.06308v2 [cs.CL] UPDATED)
    Despite strong performance in many sequence-to-sequence tasks, autoregressive models trained with maximum likelihood estimation suffer from exposure bias, i.e. the discrepancy between the ground-truth prefixes used during training and the model-generated prefixes used at inference time. Scheduled sampling is a simple and empirically successful approach which addresses this issue by incorporating model-generated prefixes into training. However, it has been argued that it is an inconsistent training objective leading to models ignoring the prefixes altogether. In this paper, we conduct systematic experiments and find that scheduled sampling, while it ameliorates exposure bias by increasing model reliance on the input sequence, worsens performance when the prefix at inference time is correct, a form of catastrophic forgetting. We propose to use Elastic Weight Consolidation to better balance mitigating exposure bias with retaining performance. Experiments on four IWSLT'14 and WMT'14 translation datasets demonstrate that our approach alleviates catastrophic forgetting and significantly outperforms maximum likelihood estimation and scheduled sampling baselines.
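    EWC's role here is a quadratic penalty that anchors parameters to their maximum-likelihood values, weighted by (diagonal) Fisher information, added on top of the scheduled-sampling loss. A minimal PyTorch-style sketch with illustrative names:

        import torch

        def ewc_penalty(model, anchor_params, fisher_diag, lam=1.0):
            # anchor_params[name]: MLE-trained values theta*;
            # fisher_diag[name]: diagonal Fisher estimates F.
            # Penalty: lam/2 * sum_i F_i * (theta_i - theta*_i)^2,
            # discouraging drift away from the MLE solution.
            loss = 0.0
            for name, p in model.named_parameters():
                loss = loss + (fisher_diag[name] * (p - anchor_params[name]).pow(2)).sum()
            return 0.5 * lam * loss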
    A Hybrid Adaptive Velocity Aided Navigation Filter with Application to INS/DVL Fusion. (arXiv:2211.01329v1 [cs.RO])
    Autonomous underwater vehicles (AUVs) are commonly used in many underwater applications. Usually, inertial sensors and Doppler velocity log readings are used in a nonlinear filter to estimate the AUV navigation solution. The process noise covariance matrix is tuned according to the inertial sensors' characteristics. This matrix greatly influences filter accuracy, robustness, and performance. A common practice is to assume that this matrix is fixed during the AUV operation. However, it varies over time, and the amount of uncertainty is unknown. Therefore, adaptive tuning of this matrix can lead to a significant improvement in filter performance. In this work, we propose a learning-based adaptive velocity-aided navigation filter. To that end, handcrafted features are generated and used to tune the momentary system noise covariance matrix. Once the process noise covariance is learned, it is fed into the model-based navigation filter. Simulation results show the benefits of our approach compared to other adaptive approaches.
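    In filter terms, the learned component only replaces the fixed process noise covariance; a generic Kalman step where Q is supplied per time step (the regressor that maps handcrafted features to Q is the paper's contribution and is abstracted away here):

        import numpy as np

        def kf_step(x, P, z, F, H, R, Q):
            """One predict/update cycle of a linear Kalman filter with a supplied Q."""
            x_pred = F @ x
            P_pred = F @ P @ F.T + Q            # Q is tuned adaptively each step
            S = H @ P_pred @ H.T + R
            K = P_pred @ H.T @ np.linalg.inv(S)
            x_new = x_pred + K @ (z - H @ x_pred)
            P_new = (np.eye(len(x)) - K @ H) @ P_pred
            return x_new, P_new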
    Demand Prediction Using Machine Learning Methods and Stacked Generalization. (arXiv:2009.09756v2 [cs.LG] UPDATED)
    Supply and demand are two fundamental concepts relating sellers and customers. Predicting demand accurately is critical for organizations to be able to make plans. In this paper, we propose a new approach for demand prediction on an e-commerce web site. The proposed model differs from earlier models in several ways. The e-commerce web site for which the model is implemented operates a marketplace model in which many sellers sell the same product at the same time at different prices. Demand prediction for such a model should therefore consider the prices of the same product offered by competing sellers, along with the features of these sellers. In this study, we first applied different regression algorithms to a specific set of products from one department of a company that is among the most popular online e-commerce companies in Turkey. We then used stacked generalization, also known as stacking ensemble learning, to predict demand. Finally, all approaches were evaluated on a real-world dataset obtained from the e-commerce company. The experimental results show that some of the machine learning methods produce almost as good results as the stacked generalization method.
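    Stacked generalization of this kind is straightforward to reproduce with scikit-learn; a hedged sketch with generic base learners and synthetic data, since the paper's exact features and models are not specified in the abstract:

        from sklearn.ensemble import (StackingRegressor, RandomForestRegressor,
                                      GradientBoostingRegressor)
        from sklearn.linear_model import Ridge
        from sklearn.datasets import make_regression
        from sklearn.model_selection import train_test_split

        X, y = make_regression(n_samples=500, n_features=12, noise=5.0, random_state=0)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

        stack = StackingRegressor(
            estimators=[("rf", RandomForestRegressor(random_state=0)),
                        ("gbr", GradientBoostingRegressor(random_state=0))],
            final_estimator=Ridge(),    # meta-learner fit on out-of-fold predictions
        )
        print(stack.fit(X_tr, y_tr).score(X_te, y_te))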
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v4 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.
    PADA: Pruning Assisted Domain Adaptation for Self-Supervised Speech Representations. (arXiv:2203.16965v3 [cs.CL] UPDATED)
    While self-supervised speech representation learning (SSL) models serve a variety of downstream tasks, these models have been observed to overfit to the domain from which the unlabelled data originates. To alleviate this issue, we propose PADA (Pruning Assisted Domain Adaptation) and zero out redundant weights from models pre-trained on large amounts of out-of-domain (OOD) data. Intuitively, this helps to make space for the target-domain ASR finetuning. The redundant weights can be identified through various pruning strategies which have been discussed in detail as a part of this work. Specifically, we investigate the effect of the recently discovered Task-Agnostic and Task-Aware pruning on PADA and propose a new pruning paradigm based on the latter, which we call Cross-Domain Task-Aware Pruning (CD-TAW). CD-TAW obtains the initial pruning mask from a well fine-tuned OOD model, which makes it starkly different from the rest of the pruning strategies discussed in the paper. Our proposed CD-TAW methodology achieves up to 20.6% relative WER improvement over our baseline when fine-tuned on a 2-hour subset of Switchboard data without language model (LM) decoding. Furthermore, we conduct a detailed analysis to highlight the key design choices of our proposed method.
    Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression. (arXiv:2107.09194v2 [stat.ML] UPDATED)
    Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical optimization methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high. More generally, we show that quasiconvexity status is independent of many properties of the observed data (response norm, covariate-matrix right singular vectors and singular-value scaling) and has a complex dependence on the few that remain. We empirically confirm our theory using simulated experiments.
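    The phenomenon is easy to probe numerically: compute the closed-form leave-one-out CV loss of ridge regression on a grid of regularization strengths and count local minima. A hedged sketch; whether multiple minima actually appear depends on the data spectrum, as the paper shows:

        import numpy as np

        def ridge_loocv_loss(X, y, lam):
            """Closed-form leave-one-out CV loss for ridge regression."""
            n, d = X.shape
            H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)  # hat matrix
            resid = y - H @ y
            return float(np.mean((resid / (1 - np.diag(H))) ** 2))

        rng = np.random.default_rng(0)
        X = rng.normal(size=(40, 10)) * rng.uniform(0.1, 3.0, size=10)  # uneven spectrum
        y = X @ rng.normal(size=10) + rng.normal(scale=2.0, size=40)
        lams = np.logspace(-3, 3, 200)
        losses = [ridge_loocv_loss(X, y, lam) for lam in lams]
        is_min = [losses[i] < losses[i - 1] and losses[i] < losses[i + 1]
                  for i in range(1, len(losses) - 1)]
        print("local minima on the grid:", sum(is_min))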
    A transformer-based model for default prediction in mid-cap corporate markets. (arXiv:2111.09902v3 [q-fin.GN] UPDATED)
    In this paper, we study mid-cap companies, i.e. publicly traded companies with less than US $10 billion in market capitalisation. Using a large dataset of US mid-cap companies observed over 30 years, we look to predict the default probability term structure over the medium term and understand which data sources (i.e. fundamental, market or pricing data) contribute most to the default risk. Whereas existing methods typically require that data from different time periods are first aggregated and turned into cross-sectional features, we frame the problem as a multi-label time-series classification problem. We adapt transformer models, a state-of-the-art deep learning model emanating from the natural language processing domain, to the credit risk modelling setting. We also interpret the predictions of these models using attention heat maps. To optimise the model further, we present a custom loss function for multi-label classification and a novel multi-channel architecture with differential training that gives the model the ability to use all input data efficiently. Our results show the proposed deep learning architecture's superior performance, resulting in a 13% improvement in AUC (Area Under the receiver operating characteristic Curve) over traditional models. We also demonstrate how to produce an importance ranking for the different data sources and the temporal relationships using a Shapley approach specific to these models.
    Reinforced Inverse Scattering. (arXiv:2206.04186v2 [cs.LG] UPDATED)
    Inverse wave scattering aims at determining the properties of an object using data on how it scatters incoming waves. To collect information, sensors are placed at different locations to send and receive waves from each other. The choice of sensor positions and incident wave frequencies determines the reconstruction quality of the scatterer properties. This paper introduces reinforcement learning for precision imaging, deciding sensor positions and wave frequencies adaptively and intelligently for different scatterers, thus obtaining a significant improvement in reconstruction quality with limited imaging resources. Extensive numerical results demonstrate the superiority of the proposed method over existing methods.
    A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants. (arXiv:2201.07083v2 [stat.ML] UPDATED)
    Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.
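    The classical 1-WL color refinement at the heart of the note takes only a few lines; a minimal sketch on adjacency lists (two graphs are 1-WL-distinguishable if their color histograms differ):

        def wl_colors(adj, rounds=3):
            """1-WL color refinement. adj: dict node -> list of neighbors."""
            colors = {v: 0 for v in adj}                  # uniform initial color
            for _ in range(rounds):
                # new color = own color plus the multiset of neighbor colors
                signatures = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                              for v in adj}
                palette = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
                colors = {v: palette[signatures[v]] for v in adj}
            return colors

        triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
        path = {0: [1], 1: [0, 2], 2: [1]}
        print(sorted(wl_colors(triangle).values()), sorted(wl_colors(path).values()))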
    Projective Ranking-based GNN Evasion Attacks. (arXiv:2202.12993v2 [cs.LG] UPDATED)
    Graph neural networks (GNNs) offer promising learning methods for graph-related tasks. However, GNNs are at risk of adversarial attacks. Two primary limitations of current evasion attack methods are highlighted: (1) GradArgmax ignores the "long-term" benefit of a perturbation, and faces zero gradients and invalid benefit estimates in certain situations. (2) In reinforcement learning-based attack methods, the learned attack strategies might not be transferable when the attack budget changes. To this end, we first formulate the perturbation space and propose an evaluation framework and the projective ranking method. We aim to learn a powerful attack strategy and then adapt it with as little modification as possible to generate adversarial samples under dynamic budget settings. In our method, based on mutual information, we rank and assess the attack benefit of each perturbation to form an effective attack strategy. By projecting the strategy, our method dramatically reduces the cost of learning a new attack strategy when the attack budget changes. In a comparative assessment with GradArgmax and RL-S2V, the results show that our method achieves high attack performance and effective transferability. The visualization of our method also reveals various attack patterns in the generation of adversarial samples.
    Approximate Discretization Invariance for Deep Learning on Neural Fields. (arXiv:2206.01178v2 [cs.LG] UPDATED)
    While neural fields have emerged as powerful representations of continuous data, there is a need for neural networks that can perform inference on such data without being sensitive to how the field is sampled, a property called (approximate) discretization invariance. We develop DI-Net, a framework for learning discretization invariant operators on neural fields of any type. Whereas current theoretical analyses of discretization invariant networks are restricted to the limit of infinite samples, our analysis does not require infinite samples and establishes upper bounds on the variation in DI-Net outputs given different finite discretizations. Our framework leads to a family of neural networks driven by numerical integration via quasi-Monte Carlo sampling with discretizations of low discrepancy. DI-Nets manifest desirable theoretical properties such as universal approximation of a large class of maps between $L^2$ functions, and gradients that are also discretization invariant. DI-Nets can also be seen as generalizations of many existing network families as they bridge discrete and continuous network classes, such as convolutional neural networks (CNNs) and neural operators respectively. Experimentally, DI-Nets derived from CNNs are able to classify and segment visual data represented by neural fields under various discretizations, and sometimes even generalize to new types of discretizations at test time.
    Continuous LWE is as Hard as LWE & Applications to Learning Gaussian Mixtures. (arXiv:2204.02550v3 [cs.CR] UPDATED)
    We show direct and conceptually simple reductions between the classical learning with errors (LWE) problem and its continuous analog, CLWE (Bruna, Regev, Song and Tang, STOC 2021). This allows us to bring to bear the powerful machinery of LWE-based cryptography to the applications of CLWE. For example, we obtain the hardness of CLWE under the classical worst-case hardness of the gap shortest vector problem. Previously, this was known only under quantum worst-case hardness of lattice problems. More broadly, with our reductions between the two problems, any future developments to LWE will also apply to CLWE and its downstream applications. As a concrete application, we show an improved hardness result for density estimation for mixtures of Gaussians. In this computational problem, given sample access to a mixture of Gaussians, the goal is to output a function that estimates the density function of the mixture. Under the (plausible and widely believed) exponential hardness of the classical LWE problem, we show that Gaussian mixture density estimation in $\mathbb{R}^n$ with roughly $\log n$ Gaussian components given $\mathsf{poly}(n)$ samples requires time quasi-polynomial in $n$. Under the (conservative) polynomial hardness of LWE, we show hardness of density estimation for $n^{\epsilon}$ Gaussians for any constant $\epsilon > 0$, which improves on Bruna, Regev, Song and Tang (STOC 2021), who show hardness for at least $\sqrt{n}$ Gaussians under polynomial (quantum) hardness assumptions. Our key technical tool is a reduction from classical LWE to LWE with $k$-sparse secrets where the multiplicative increase in the noise is only $O(\sqrt{k})$, independent of the ambient dimension $n$.
    Understanding Collapse in Non-Contrastive Siamese Representation Learning. (arXiv:2209.15007v2 [cs.LG] UPDATED)
    Contrastive methods have driven a recent surge in the performance of self-supervised representation learning (SSL). Recent methods like BYOL or SimSiam purportedly distill these contrastive methods down to their essence, removing bells and whistles, including the negative examples, that do not contribute to downstream performance. These "non-contrastive" methods work surprisingly well without using negatives even though the global minimum lies at trivial collapse. We empirically analyze these non-contrastive methods and find that SimSiam is extraordinarily sensitive to dataset and model size. In particular, SimSiam representations undergo partial dimensional collapse if the model is too small relative to the dataset size. We propose a metric to measure the degree of this collapse and show that it can be used to forecast downstream task performance without any fine-tuning or labels. We further analyze architectural design choices and their effect on downstream performance. Finally, we demonstrate that shifting to a continual learning setting acts as a regularizer and prevents collapse, and that a hybrid between continual and multi-epoch training can improve linear probe accuracy by as much as 18 percentage points using ResNet-18 on ImageNet. Our project page is at https://alexanderli.com/noncontrastive-ssl/.
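    One simple way to quantify dimensional collapse, not necessarily the paper's exact metric, is the entropy-based effective rank of the embedding singular-value spectrum:

        import numpy as np

        def effective_rank(embeddings):
            """exp(entropy) of the normalized singular-value spectrum; approaches
            the embedding dimension for isotropic features and drops under collapse."""
            Z = embeddings - embeddings.mean(axis=0)
            s = np.linalg.svd(Z, compute_uv=False)
            p = s / s.sum()
            p = p[p > 0]
            return float(np.exp(-(p * np.log(p)).sum()))

        healthy = np.random.randn(1000, 128)
        collapsed = np.random.randn(1000, 4) @ np.random.randn(4, 128)  # rank-4 features
        print(effective_rank(healthy), effective_rank(collapsed))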
    MemoNet: Memorizing Representations of All Cross Features Efficiently via Multi-Hash Codebook Network for CTR Prediction. (arXiv:2211.01334v1 [cs.IR])
    New findings in natural language processing (NLP) demonstrate that strong memorization capability contributes a lot to the success of large language models. This inspires us to explicitly bring an independent memory mechanism into the CTR ranking model to learn and memorize the representations of all cross features. In this paper, we propose the multi-Hash Codebook NETwork (HCNet) as the memory mechanism for efficiently learning and memorizing representations of all cross features in CTR tasks. HCNet uses a multi-hash codebook as the main memory place, and the whole memory procedure consists of three phases: multi-hash addressing, memory restoring, and feature shrinking. HCNet can be regarded as a general module and can be incorporated into any current deep CTR model. We also propose a new CTR model named MemoNet which combines HCNet with a DNN backbone. Extensive experimental results on three public datasets show that MemoNet reaches superior performance over state-of-the-art approaches and validate the effectiveness of HCNet as a strong memory module. Besides, MemoNet shows the prominent feature of big models in NLP, which means we can enlarge the size of the codebook in HCNet to sustainably obtain performance gains. Our work demonstrates the importance and feasibility of learning and memorizing representations of all cross features, which sheds light on a new promising research direction.
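    A hedged PyTorch sketch of the multi-hash addressing idea; the hash functions, codebook size, and the averaging used for memory restoring are illustrative assumptions, not the paper's exact design:

        import torch
        import torch.nn as nn

        class MultiHashCodebook(nn.Module):
            """Memorize cross-feature representations via several hashed lookups."""
            def __init__(self, codebook_size=10000, dim=16, num_hashes=4):
                super().__init__()
                self.codebook = nn.Embedding(codebook_size, dim)
                self.seeds = torch.randint(1, 2**31 - 1, (num_hashes,))
                self.size = codebook_size

            def forward(self, cross_feature_ids):
                # multi-hash addressing: each hash maps a cross feature to one slot
                slots = [(cross_feature_ids * int(s)) % self.size for s in self.seeds]
                # memory restoring: average the addressed codewords (one simple choice)
                return torch.stack([self.codebook(i) for i in slots]).mean(dim=0)

        ids = torch.tensor([12345678, 987654321])  # ids of two cross features
        print(MultiHashCodebook()(ids).shape)      # torch.Size([2, 16])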
    Generative machine learning methods for multivariate ensemble post-processing. (arXiv:2211.01345v1 [physics.ao-ph])
    Ensemble weather forecasts based on multiple runs of numerical weather prediction models typically show systematic errors and require post-processing to obtain reliable forecasts. Accurately modeling multivariate dependencies is crucial in many practical applications, and various approaches to multivariate post-processing have been proposed where ensemble predictions are first post-processed separately in each margin and multivariate dependencies are then restored via copulas. These two-step methods share common key limitations, in particular the difficulty to include additional predictors in modeling the dependencies. We propose a novel multivariate post-processing method based on generative machine learning to address these challenges. In this new class of nonparametric data-driven distributional regression models, samples from the multivariate forecast distribution are directly obtained as output of a generative neural network. The generative model is trained by optimizing a proper scoring rule which measures the discrepancy between the generated and observed data, conditional on exogenous input variables. Our method does not require parametric assumptions on univariate distributions or multivariate dependencies and allows for incorporating arbitrary predictors. In two case studies on multivariate temperature and wind speed forecasting at weather stations over Germany, our generative model shows significant improvements over state-of-the-art methods and particularly improves the representation of spatial dependencies.
    Diversely Regularized Matrix Factorization for Accurate and Aggregately Diversified Recommendation. (arXiv:2211.01328v1 [cs.IR])
    When recommending personalized top-$k$ items to users, how can we recommend the items diversely while still satisfying the users' needs? Aggregately diversified recommender systems aim to recommend a variety of items across all users without sacrificing recommendation accuracy. They increase the exposure opportunities of various items, which in turn increases the potential revenue of sellers as well as user satisfaction. However, it is challenging to tackle aggregate-level diversity with matrix factorization (MF), one of the most common recommendation models, since skewed real-world data lead to skewed recommendation results from MF. In this work, we propose DivMF (Diversely Regularized Matrix Factorization), a novel matrix factorization method for aggregately diversified recommendation. DivMF regularizes the score matrix of an MF model to maximize the coverage and entropy of top-$k$ recommendation lists in order to aggregately diversify the recommendation results. We also propose an unmasking mechanism and a carefully designed mini-batch learning technique for accurate and efficient training. Extensive experiments on real-world datasets show that DivMF achieves state-of-the-art performance in aggregately diversified recommendation.
    Fourier Disentangled Multimodal Prior Knowledge Fusion for Red Nucleus Segmentation in Brain MRI. (arXiv:2211.01353v1 [eess.IV])
    Early and accurate diagnosis of parkinsonian syndromes is critical to provide appropriate care to patients and for inclusion in therapeutic trials. The red nucleus is a structure of the midbrain that plays an important role in these disorders. It can be visualized using iron-sensitive magnetic resonance imaging (MRI) sequences, and different iron-sensitive contrasts can be produced with MRI. Combining such multimodal data has the potential to improve segmentation of the red nucleus. Current multimodal segmentation algorithms are computationally expensive, cannot deal with missing modalities, and need annotations for all modalities. In this paper, we propose a new model that integrates prior knowledge from different contrasts for red nucleus segmentation. The method consists of three main stages. First, it disentangles the image into high-frequency information representing the brain structure and low-frequency information representing the contrast. The high-frequency information is then fed into a network to learn anatomical features, while the list of multimodal low-frequency information is processed by another module. Finally, feature fusion is performed to complete the segmentation task. The proposed method was used with several iron-sensitive contrasts (iMag, QSM, R2*, SWI). Experiments demonstrate that our proposed model substantially outperforms a baseline UNet model when the training set size is very small.
    Bias-Aware Face Mask Detection Dataset. (arXiv:2211.01207v1 [cs.CV])
    In December 2019, a novel coronavirus (COVID-19) spread so quickly around the world that many countries had to set mandatory face mask rules in public areas to reduce the transmission of the virus. To monitor public adherence, researchers aimed to rapidly develop efficient systems that can detect faces with masks automatically. However, the lack of representative and novel datasets proved to be the biggest challenge. Early attempts to collect face mask datasets did not account for potential race, gender, and age biases. Therefore, the resulting models show inherent biases toward specific race groups, such as Asian or Caucasian. In this work, we present a novel face mask detection dataset that contains images posted on Twitter during the pandemic from around the world. Unlike previous datasets, the proposed Bias-Aware Face Mask Detection (BAFMD) dataset contains more images from underrepresented race and age groups to mitigate the problem for the face mask detection task. We perform experiments to investigate potential biases in widely used face mask detection datasets and illustrate that the BAFMD dataset yields models with better performance and generalization ability. The dataset is publicly available at https://github.com/Alpkant/BAFMD.
    Attention-based Neural Cellular Automata. (arXiv:2211.01233v1 [cs.CV])
    Recent extensions of Cellular Automata (CA) have incorporated key ideas from modern deep learning, dramatically extending their capabilities and catalyzing a new family of Neural Cellular Automata (NCA) techniques. Inspired by Transformer-based architectures, our work presents a new class of $\textit{attention-based}$ NCAs formed using a spatially localized yet globally organized self-attention scheme. We introduce an instance of this class named $\textit{Vision Transformer Cellular Automata}$ (ViTCA). We present quantitative and qualitative results on denoising autoencoding across six benchmark datasets, comparing ViTCA to a U-Net, a U-Net-based CA baseline (UNetCA), and a Vision Transformer (ViT). When comparing across architectures configured to similar parameter complexity, ViTCA architectures yield superior performance across all benchmarks and for nearly every evaluation metric. We present an ablation study on various architectural configurations of ViTCA, an analysis of its effect on cell states, and an investigation of its inductive biases. Finally, we examine its learned representations via linear probes on its converged cell state hidden representations, yielding, on average, superior results when compared to our U-Net, ViT, and UNetCA baselines.
    An Aggregation of Aggregation Methods in Computational Pathology. (arXiv:2211.01256v1 [cs.CV])
    Image analysis and machine learning algorithms operating on multi-gigapixel whole-slide images (WSIs) often process a large number of tiles (sub-images) and require aggregating predictions from the tiles in order to predict WSI-level labels. In this paper, we present a review of existing literature on various types of aggregation methods with a view to help guide future research in the area of computational pathology (CPath). We propose a general CPath workflow with three pathways that consider multiple levels and types of data and the nature of computation to analyse WSIs for predictive modelling. We categorize aggregation methods according to the context and representation of the data, features of computational modules and CPath use cases. We compare and contrast different methods based on the principle of multiple instance learning, perhaps the most commonly used aggregation method, covering a wide range of CPath literature. To provide a fair comparison, we consider a specific WSI-level prediction task and compare various aggregation methods for that task. Finally, we conclude with a list of objectives and desirable attributes of aggregation methods in general, pros and cons of the various approaches, some recommendations and possible future directions.
    Block-Recurrent Transformers. (arXiv:2203.07852v3 [cs.LG] UPDATED)
    We introduce the Block-Recurrent Transformer, which applies a transformer layer in a recurrent fashion along a sequence and has linear complexity with respect to sequence length. Our recurrent cell operates on blocks of tokens rather than single tokens during training, and leverages parallel computation within a block in order to make efficient use of accelerator hardware. The cell itself is strikingly simple. It is merely a transformer layer: it uses self-attention and cross-attention to efficiently compute a recurrent function over a large set of state vectors and tokens. Our design was inspired in part by LSTM cells, and it uses LSTM-style gates, but it scales the typical LSTM cell up by several orders of magnitude. Our implementation of recurrence has the same cost in both computation time and parameter count as a conventional transformer layer, but offers dramatically improved perplexity in language modeling tasks over very long sequences. Our model outperforms a long-range Transformer XL baseline by a wide margin, while running twice as fast. We demonstrate its effectiveness on PG19 (books), arXiv papers, and GitHub source code. Our code has been released as open source.
    An Exponentially Converging Particle Method for the Mixed Nash Equilibrium of Continuous Games. (arXiv:2211.01280v1 [math.OC])
    We consider the problem of computing mixed Nash equilibria of two-player zero-sum games with continuous sets of pure strategies and with first-order access to the payoff function. This problem arises for example in game-theory-inspired machine learning applications, such as distributionally-robust learning. In those applications, the strategy sets are high-dimensional and thus methods based on discretisation cannot tractably return high-accuracy solutions. In this paper, we introduce and analyze a particle-based method that enjoys guaranteed local convergence for this problem. This method consists in parametrizing the mixed strategies as atomic measures and applying proximal point updates to both the atoms' weights and positions. It can be interpreted as a time-implicit discretization of the "interacting" Wasserstein-Fisher-Rao gradient flow. We prove that, under non-degeneracy assumptions, this method converges at an exponential rate to the exact mixed Nash equilibrium from any initialization satisfying a natural notion of closeness to optimality. We illustrate our results with numerical experiments and discuss applications to max-margin and distributionally-robust classification using two-layer neural networks, where our method has a natural interpretation as a simultaneous training of the network's weights and of the adversarial distribution.
    POLICE: Provably Optimal Linear Constraint Enforcement for Deep Neural Networks. (arXiv:2211.01340v1 [cs.LG])
    Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The formed parametrized functional is then tuned to solve a task at hand by simple gradient descent. This modularity comes at a cost: strict enforcement of constraints on DNNs, e.g. from a priori knowledge of the task or from desired physical properties, remains an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that requires minimal changes to a given DNN's forward pass, that is computationally friendly, and that leaves the optimization of the DNN's parameters unconstrained, i.e. standard gradient-based methods can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given input space region at any point during training and testing. We coin this method POLICE, standing for Provably Optimal LInear Constraint Enforcement.
    RegCLR: A Self-Supervised Framework for Tabular Representation Learning in the Wild. (arXiv:2211.01165v1 [cs.CV])
    Recent advances in self-supervised learning (SSL) using large models to learn visual representations from natural images are rapidly closing the gap between the results produced by fully supervised learning and those produced by SSL on downstream vision tasks. Inspired by this advancement and primarily motivated by the emergence of tabular and structured document image applications, we investigate which self-supervised pretraining objectives, architectures, and fine-tuning strategies are most effective. To address these questions, we introduce RegCLR, a new self-supervised framework that combines contrastive and regularized methods and is compatible with the standard Vision Transformer architecture. Then, RegCLR is instantiated by integrating masked autoencoders as a representative example of a contrastive method and enhanced Barlow Twins as a representative example of a regularized method with configurable input image augmentations in both branches. Several real-world table recognition scenarios (e.g., extracting tables from document images), ranging from standard Word and Latex documents to even more challenging electronic health records (EHR) computer screen images, have been shown to benefit greatly from the representations learned from this new framework, with detection average-precision (AP) improving relatively by 4.8% for Table, 11.8% for Column, and 11.1% for GUI objects over a previous fully supervised baseline on real-world EHR screen images.
    EquiMod: An Equivariance Module to Improve Self-Supervised Learning. (arXiv:2211.01244v1 [cs.LG])
    Self-supervised visual representation methods are closing the gap with supervised learning performance. These methods rely on maximizing the similarity between embeddings of related synthetic inputs created through data augmentations. This can be seen as a task that encourages embeddings to leave out the factors modified by these augmentations, i.e. to be invariant to them. However, this considers only one side of the trade-off in the choice of the augmentations: they need to strongly modify the images to avoid simple shortcut solutions (e.g. using only color histograms), but on the other hand, augmentation-related information may be lacking in the representations for some downstream tasks (e.g. color is important for bird and flower classification). A few recent works have proposed to mitigate this problem of using only an invariance task by exploring some form of equivariance to augmentations. This has been performed by learning additional embedding space(s), where some augmentation(s) cause embeddings to differ, yet in a non-controlled way. In this work, we introduce EquiMod, a generic equivariance module that structures the learned latent space, in the sense that our module learns to predict the displacement in the embedding space caused by the augmentations. We show that applying this module to state-of-the-art invariance models, such as SimCLR and BYOL, improves performance on the CIFAR10 and ImageNet datasets. Moreover, while our model could collapse to a trivial equivariance, i.e. invariance, we observe that it instead automatically learns to keep some augmentation-related information beneficial to the representations.
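    A hedged PyTorch sketch of the displacement-prediction idea; the predictor architecture, the augmentation parameterization, and the MSE objective are illustrative assumptions (the paper uses its own similarity objective):

        import torch
        import torch.nn as nn

        class EquivariancePredictor(nn.Module):
            """Predict the embedding of the augmented view from the original
            embedding and the augmentation parameters."""
            def __init__(self, dim=128, aug_dim=4):
                super().__init__()
                self.predictor = nn.Sequential(nn.Linear(dim + aug_dim, dim),
                                               nn.ReLU(), nn.Linear(dim, dim))

            def forward(self, z, aug_params):
                return self.predictor(torch.cat([z, aug_params], dim=-1))

        equi = EquivariancePredictor()
        z, z_aug = torch.randn(8, 128), torch.randn(8, 128)   # two view embeddings
        aug = torch.rand(8, 4)              # e.g. crop coordinates, jitter strength
        loss = nn.functional.mse_loss(equi(z, aug), z_aug)    # equivariance loss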
    FiFo: Fishbone Forwarding in Massive IoT Networks. (arXiv:2211.01213v1 [cs.IT])
    Massive Internet of Things (IoT) networks have a wide range of applications, including but not limited to the rapid delivery of emergency and disaster messages. Although various benchmark algorithms have been developed to date for message delivery in such applications, they pose several practical challenges such as insufficient network coverage and/or highly redundant transmissions to expand the coverage area, resulting in considerable energy consumption for each IoT device. To overcome this problem, we first characterize a new performance metric, forwarding efficiency, which is defined as the ratio of the coverage probability to the average number of transmissions per device, to evaluate the data dissemination performance more appropriately. Then, we propose a novel and effective forwarding method, fishbone forwarding (FiFo), which aims to improve the forwarding efficiency with acceptable computational complexity. Our FiFo method completes two tasks: 1) it clusters devices based on the unweighted pair group method with the arithmetic average; and 2) it creates the main axis and sub axes of each cluster using both the expectation-maximization algorithm for the Gaussian mixture model and principal component analysis. We demonstrate the superiority of FiFo by using a real-world dataset. Through intensive and comprehensive simulations, we show that the proposed FiFo method outperforms benchmark algorithms in terms of the forwarding efficiency.
    Faster variational quantum algorithms with quantum kernel-based surrogate models. (arXiv:2211.01134v1 [quant-ph])
    We present a new optimization method for small-to-intermediate scale variational algorithms on noisy near-term quantum processors which uses a Gaussian process surrogate model equipped with a classically-evaluated quantum kernel. Variational algorithms are typically optimized using gradient-based approaches however these are difficult to implement on current noisy devices, requiring large numbers of objective function evaluations. Our scheme shifts this computational burden onto the classical optimizer component of these hybrid algorithms, greatly reducing the number of queries to the quantum processor. We focus on the variational quantum eigensolver (VQE) algorithm and demonstrate numerically that such surrogate models are particularly well suited to the algorithm's objective function. Next, we apply these models to both noiseless and noisy VQE simulations and show that they exhibit better performance than widely-used classical kernels in terms of final accuracy and convergence speed. Compared to the typically-used stochastic gradient-descent approach for VQAs, our quantum kernel-based approach is found to consistently achieve significantly higher accuracy while requiring less than an order of magnitude fewer quantum circuit evaluations. We analyse the performance of the quantum kernel-based models in terms of the kernels' induced feature spaces and explicitly construct their feature maps. Finally, we describe a scheme for approximating the best-performing quantum kernel using a classically-efficient tensor network representation of its input state and so provide a pathway for scaling these methods to larger systems.
    Advertising strategy for profit-maximization: a novel practice on Tmall's online ads manager platforms. (arXiv:2211.01160v1 [cs.IR])
    Ads manager platforms are gaining popularity among numerous e-commerce vendors/advertisers. They help advertisers to facilitate the process of displaying their ads to target customers. One of the main challenges faced by advertisers, especially small and medium-sized enterprises, is to configure their advertising strategy properly. An ineffective advertising strategy will bring too many ``just looking'' clicks and, eventually, generate advertising expenditure disproportionate to the growth of sales. In this paper, we present a novel profit-maximization model for online advertising optimization. The optimization problem is constructed to find the optimal set of features that maximizes the probability that target customers buy the advertised products. We further reformulate the optimization problem as a knapsack problem with changeable parameters, and introduce a self-adjusting algorithm for finding its solution. Numerical experiments based on statistical data from Tmall show that our proposed method can effectively optimize the advertising strategy given an expenditure budget.
    Human alignment of neural network representations. (arXiv:2211.01201v1 [cs.CV])
    Today's computer vision models achieve human or near-human level performance across a wide variety of vision tasks. However, their architectures, data, and learning algorithms differ in numerous ways from those that give rise to human vision. In this paper, we investigate the factors that affect alignment between the representations learned by neural networks and human concept representations. Human representations are inferred from behavioral responses in an odd-one-out triplet task, where humans were presented with three images and had to select the odd-one-out. We find that model scale and architecture have essentially no effect on alignment with human behavioral responses, whereas the training dataset and objective function have a much larger impact. Using a sparse Bayesian model of human conceptual representations, we partition triplets by the concept that distinguishes the two similar images from the odd-one-out, finding that some concepts such as food and animals are well-represented in neural network representations whereas others such as royal or sports-related objects are not. Overall, although models trained on larger, more diverse datasets achieve better alignment with humans than models trained on ImageNet alone, our results indicate that scaling alone is unlikely to be sufficient to train neural networks with conceptual representations that match those used by humans.
    Boosting word frequencies in authorship attribution. (arXiv:2211.01289v1 [cs.CL])
    In this paper, I introduce a simple method of computing relative word frequencies for authorship attribution and similar stylometric tasks. Rather than computing relative frequencies as the number of occurrences of a given word divided by the total number of tokens in a text, I argue that a more efficient normalization factor is the total number of relevant tokens only. The notion of relevant words includes synonyms and, usually, a few dozen other words that are in some way semantically similar to the word in question. To determine such a semantic background, a word embedding model can be used. The proposed method outperforms classical most-frequent-word approaches substantially, usually by a few percentage points depending on the input settings.
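    The proposed normalization is easy to state in code; a minimal sketch where the semantic background of the target word is assumed to come from a word-embedding model's nearest neighbors, here supplied as a plain list:

        from collections import Counter

        def boosted_frequency(tokens, word, background):
            """Relative frequency of `word` normalized by its semantic background
            (the word plus embedding-space neighbors) rather than all tokens."""
            counts = Counter(tokens)
            relevant = counts[word] + sum(counts[w] for w in background)
            return counts[word] / relevant if relevant else 0.0

        tokens = "on upon above on beneath on under over upon the cat sat on".split()
        # neighbors of "on" as they might come from a word2vec model (assumed here)
        background = ["upon", "above", "beneath", "under", "over"]
        print(boosted_frequency(tokens, "on", background))   # 4 / (4 + 6) = 0.4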
    Bayesian Nonlocal Operator Regression (BNOR): A Data-Driven Learning Framework of Nonlocal Models with Uncertainty Quantification. (arXiv:2211.01330v1 [cond-mat.mtrl-sci])
    We consider the problem of modeling heterogeneous materials where micro-scale dynamics and interactions affect global behavior. In the presence of heterogeneities in material microstructure it is often impractical, if not impossible, to provide quantitative characterization of material response. The goal of this work is to develop a Bayesian framework for uncertainty quantification (UQ) in material response prediction when using nonlocal models. Our approach combines the nonlocal operator regression (NOR) technique and Bayesian inference. Specifically, we use a Markov chain Monte Carlo (MCMC) method to sample the posterior probability distribution on parameters involved in the nonlocal constitutive law, and associated modeling discrepancies relative to higher fidelity computations. As an application, we consider the propagation of stress waves through a one-dimensional heterogeneous bar with randomly generated microstructure. Several numerical tests illustrate the construction, enabling UQ in nonlocal model predictions. Although nonlocal models have become popular means for homogenization, their statistical calibration with respect to high-fidelity models has not been presented before. This work is a first step towards statistical characterization of nonlocal model discrepancy in the context of homogenization.
    Time-aware Random Walk Diffusion to Improve Dynamic Graph Learning. (arXiv:2211.01214v1 [cs.LG])
    How can we augment a dynamic graph to improve the performance of dynamic graph neural networks? Graph augmentation has been widely utilized to boost the learning performance of GNN-based models. However, most existing approaches only enhance the spatial structure within an input static graph by transforming the graph, and do not consider dynamics caused by time, such as temporal locality, i.e., recent edges are more influential than earlier ones; this remains challenging for dynamic graph augmentation. In this work, we propose TiaRa (Time-aware Random Walk Diffusion), a novel diffusion-based method for augmenting a dynamic graph represented as a discrete-time sequence of graph snapshots. For this purpose, we first design a time-aware random walk proximity so that a surfer can walk along the time dimension as well as edges, resulting in spatially and temporally localized scores. We then derive our diffusion matrices based on the time-aware random walk, and show that they become enhanced adjacency matrices in which both spatial and temporal localities are augmented. Through extensive experiments, we demonstrate that TiaRa effectively augments a given dynamic graph and leads to significant improvements in dynamic GNN models for various graph datasets and tasks.
    Low-Resource Music Genre Classification with Advanced Neural Model Reprogramming. (arXiv:2211.01317v1 [cs.SD])
    Transfer learning (TL) approaches have shown promising results when handling tasks with limited training data. However, considerable memory and computational resources are often required for fine-tuning pre-trained neural networks with target domain data. In this work, we introduce a novel method for leveraging pre-trained models for low-resource (music) classification based on the concept of Neural Model Reprogramming (NMR). NMR aims at re-purposing a pre-trained model from a source domain to a target domain by modifying the input of a frozen pre-trained model. In addition to the known, input-independent, reprogramming method, we propose an advanced reprogramming paradigm: Input-dependent NMR, to increase adaptability to complex input data such as musical audio. Experimental results suggest that a neural model pre-trained on large-scale datasets can successfully perform music genre classification by using this reprogramming method. The two proposed Input-dependent NMR TL methods outperform fine-tuning-based TL methods on a small genre classification dataset.
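    Input-independent reprogramming amounts to learning an additive perturbation on the input of a frozen network; a minimal PyTorch sketch, where the backbone and label remapping are placeholders (the paper's input-dependent variant would replace the fixed `delta` with the output of a small network conditioned on the input):

        import torch
        import torch.nn as nn

        class InputReprogramming(nn.Module):
            """Learn only a universal input perturbation; the backbone stays frozen."""
            def __init__(self, backbone, input_shape):
                super().__init__()
                self.backbone = backbone
                for p in self.backbone.parameters():
                    p.requires_grad = False           # source model is untouched
                self.delta = nn.Parameter(torch.zeros(input_shape))

            def forward(self, x):
                return self.backbone(x + self.delta)  # reprogram via the input

        backbone = nn.Sequential(nn.Flatten(), nn.Linear(128, 10))  # stand-in model
        model = InputReprogramming(backbone, input_shape=(128,))
        out = model(torch.randn(4, 128))              # only `delta` is trainable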
    Fair Visual Recognition via Intervention with Proxy Features. (arXiv:2211.01253v1 [cs.LG])
    Deep learning models often learn to make predictions that rely on sensitive social attributes like gender and race, which poses significant fairness risks, especially in societal applications, e.g., hiring, banking, and criminal justice. Existing work tackles this issue by minimizing the information about social attributes in models for debiasing. However, the high correlation between the target task and social attributes makes bias mitigation incompatible with target task accuracy. Recalling that model bias arises because the learning of features related to bias attributes (i.e., bias features) helps target task optimization, we explore the following research question: \emph{Can we leverage proxy features to replace the role of bias features in target task optimization for debiasing?} To this end, we propose \emph{Proxy Debiasing}, which first transfers the target task's learning of bias information from bias features to artificial proxy features, and then employs causal intervention to eliminate the proxy features at inference. The key idea of \emph{Proxy Debiasing} is to design controllable proxy features that, on the one hand, replace bias features in contributing to the target task during the training stage, and on the other hand, are easily removed by intervention during the inference stage. This guarantees the elimination of bias features without affecting the target information, thus addressing the fairness-accuracy paradox of previous debiasing solutions. We apply \emph{Proxy Debiasing} to several benchmark datasets and achieve significant improvements over state-of-the-art debiasing methods in both accuracy and fairness.
    Generative Poisoning Using Random Discriminators. (arXiv:2211.01086v1 [cs.LG])
    We introduce ShortcutGen, a new data poisoning attack that generates sample-dependent, error-minimizing perturbations by learning a generator. The key novelty of ShortcutGen is the use of a randomly-initialized discriminator, which provides spurious shortcuts needed for generating poisons. Different from recent, iterative methods, our ShortcutGen can generate perturbations with only one forward pass in a label-free manner, and compared to the only existing generative method, DeepConfuse, our ShortcutGen is faster and simpler to train while remaining competitive. We also demonstrate that integrating a simple augmentation strategy can further boost the robustness of ShortcutGen against early stopping, and combining augmentation and non-augmentation leads to new state-of-the-art results in terms of final validation accuracy, especially in the challenging, transfer scenario. Lastly, we speculate, through uncovering its working mechanism, that learning a more general representation space could allow ShortcutGen to work for unseen data.
    Fast Adaptive Federated Bilevel Optimization. (arXiv:2211.01122v1 [cs.LG])
    Bilevel optimization has been widely applied to many machine learning tasks such as meta learning, hyperparameter learning, and policy optimization. Although many optimization algorithms have recently been developed, few adaptive algorithms focus on bilevel problems under the distributed setting, even though adaptive gradient methods are well known to show superior performance in both distributed and non-distributed optimization. In this paper, we therefore propose an efficient adaptive federated bilevel optimization algorithm (i.e., AdaFBiO) to solve distributed bilevel optimization problems, where the objective function of the Upper-Level (UL) problem is possibly nonconvex and that of the Lower-Level (LL) problem is strongly convex. Specifically, our AdaFBiO algorithm builds on the momentum-based variance-reduction technique and local-SGD to obtain the best known sample and communication complexities simultaneously. In particular, AdaFBiO uses unified adaptive matrices to flexibly incorporate various adaptive learning rates for updating variables in both the UL and LL problems. Moreover, we provide a convergence analysis framework for AdaFBiO, and prove that it reaches a sample complexity of $\tilde{O}(\epsilon^{-3})$ with a communication complexity of $\tilde{O}(\epsilon^{-2})$ to find an $\epsilon$-stationary point. Experimental results on federated hyper-representation learning and federated data hyper-cleaning tasks verify the efficiency of our algorithm.
    Nonparametric Hamiltonian Monte Carlo. (arXiv:2106.10238v2 [cs.LG] UPDATED)
    Probabilistic programming uses programs to express generative models whose posterior probability is then computed by built-in inference engines. A challenging goal is to develop general purpose inference algorithms that work out-of-the-box for arbitrary programs in a universal probabilistic programming language (PPL). The densities defined by such programs, which may use stochastic branching and recursion, are (in general) nonparametric, in the sense that they correspond to models on an infinite-dimensional parameter space. However standard inference algorithms, such as the Hamiltonian Monte Carlo (HMC) algorithm, target distributions with a fixed number of parameters. This paper introduces the Nonparametric Hamiltonian Monte Carlo (NP-HMC) algorithm which generalises HMC to nonparametric models. Inputs to NP-HMC are a new class of measurable functions called "tree representable", which serve as a language-independent representation of the density functions of probabilistic programs in a universal PPL. We provide a correctness proof of NP-HMC, and empirically demonstrate significant performance improvements over existing approaches on several nonparametric examples.
    DC-cycleGAN: Bidirectional CT-to-MR Synthesis from Unpaired Data. (arXiv:2211.01293v1 [eess.IV])
    Magnetic resonance (MR) and computed tomography (CT) images are two typical types of medical images that provide mutually-complementary information for accurate clinical diagnosis and treatment. However, obtaining both images may be limited due to considerations such as cost, radiation dose, and missing modalities. Recently, medical image synthesis has attracted growing research interest to cope with this limitation. In this paper, we propose a bidirectional learning model, denoted as dual contrast cycleGAN (DC-cycleGAN), to synthesize medical images from unpaired data. Specifically, a dual contrast loss is introduced into the discriminators to indirectly build constraints between MR and CT images by taking advantage of samples from the source domain as negative samples, enforcing the synthetic images to fall far away from the source domain. In addition, cross entropy and the structural similarity index (SSIM) are integrated into the cycleGAN in order to consider both the luminance and structure of samples when synthesizing images. The experimental results indicate that DC-cycleGAN is able to produce promising results compared with other cycleGAN-based medical image synthesis methods such as cycleGAN, RegGAN, DualGAN, and NiceGAN. The code will be available at https://github.com/JiayuanWang-JW/DC-cycleGAN.
    Continual Conscious Active Fine-Tuning to Robustify Online Machine Learning Models Against Data Distribution Shifts. (arXiv:2211.01315v1 [cs.LG])
    Unlike their traditional offline counterparts, online machine learning models are capable of handling data distribution shifts while serving at test time. However, they have limitations in addressing this phenomenon: they are either expensive or unreliable. We propose augmenting an online learning approach called test-time adaptation with a continual conscious active fine-tuning layer to develop an enhanced variation that can handle drastic data distribution shifts reliably and cost-effectively. The proposed augmentation incorporates the following aspects: a continual aspect to confront never-ending data distribution shifts, a conscious aspect to imply that fine-tuning is a distribution-shift-aware process that occurs at the appropriate time to address the recently detected data distribution shifts, and an active aspect to indicate employing human-machine collaboration for relabeling, making it cost-effective and practical for diverse applications. Our empirical results show that the enhanced test-time adaptation variation outperforms the traditional variation by a factor of two.
    Web-based Elicitation of Human Perception on mixup Data. (arXiv:2211.01202v1 [cs.LG])
    Synthetic data is proliferating on the web and powering many advances in machine learning. However, it is not always clear if synthetic labels are perceptually sensible to humans. The web provides us with a platform to take a step towards addressing this question through online elicitation. We design a series of elicitation interfaces, which we release as \texttt{HILL MixE Suite}, and recruit 159 participants, to provide perceptual judgments over the kinds of synthetic data constructed during \textit{mixup} training: a powerful regularizer shown to improve model robustness, generalization, and calibration. We find that human perception does not consistently align with the labels traditionally used for synthetic points and begin to demonstrate the applicability of these findings to potentially increase the reliability of downstream models. We release all elicited judgments in a new data hub we call \texttt{H-Mix}.
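    For reference, the mixup construction whose labels the study asks humans to judge is a two-line augmentation; a standard sketch:

        import numpy as np

        def mixup(x1, y1, x2, y2, alpha=0.3):
            """Convex combination of two examples and their one-hot labels."""
            lam = np.random.beta(alpha, alpha)
            return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

        x_mix, y_mix = mixup(np.random.rand(32, 32, 3), np.array([1.0, 0.0]),
                             np.random.rand(32, 32, 3), np.array([0.0, 1.0]))
        print(y_mix)   # soft label, e.g. [0.73, 0.27]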
    A Quantum Kernel Learning Approach to Acoustic Modeling for Spoken Command Recognition. (arXiv:2211.01263v1 [cs.SD])
    We propose a quantum kernel learning (QKL) framework to address the inherent data sparsity issues often encountered in training large-scale acoustic models in low-resource scenarios. We project acoustic features based on classical-to-quantum feature encoding. Different from existing quantum convolution techniques, we utilize QKL with features in the quantum space to design kernel-based classifiers. Experimental results on challenging spoken command recognition tasks for several low-resource languages, such as Arabic, Georgian, Chuvash, and Lithuanian, show that the proposed QKL-based hybrid approach attains good improvements over existing classical and quantum solutions.
    Knowing the Past to Predict the Future: Reinforcement Virtual Learning. (arXiv:2211.01266v1 [cs.LG])
    Reinforcement Learning (RL)-based control systems have received considerable attention in recent decades. However, in many real-world problems, such as batch process control, the environment is uncertain, which requires expensive interaction to acquire state and reward values. In this paper, we present a cost-efficient framework in which the RL model can evolve by itself in a virtual space, using predictive models trained on historical data only. The proposed framework enables a step-by-step RL model to predict future states and select optimal actions for long-sighted decisions. The main focuses are: 1) how to balance long-sighted and short-sighted rewards with an optimal strategy; and 2) how to make the virtual model interacting with the real environment converge to a final learning policy. Under the experimental settings of a fed-batch process, our method consistently outperforms existing state-of-the-art methods.
    DynamicLight: Dynamically Tuning Traffic Signal Duration with DRL. (arXiv:2211.01025v1 [cs.LG])
    Deep reinforcement learning (DRL) is becoming increasingly popular in implementing traffic signal control (TSC). However, most existing DRL methods employ fixed control strategies, making traffic signal phase duration less flexible. Additionally, the trend of using more complex DRL models makes real-life deployment more challenging. To address these two challenges, we first propose a two-stage DRL framework, named DynamicLight, which uses Max Queue-Length to select the proper phase and employs a deep Q-learning network to determine the duration of the corresponding phase. Based on the design of DynamicLight, we also introduce two variants: (1) DynamicLight-Lite, which addresses the first challenge by using only 19 parameters to achieve dynamic phase duration settings; and (2) DynamicLight-Cycle, which tackles the second challenge by actuating a set of phases in a fixed cyclical order to implement flexible phase duration within the respective cyclical phase structure. Numerical experiments are conducted using both real-world and synthetic datasets, covering the four most commonly adopted traffic signal intersections in real life. Experimental results show that: (1) DynamicLight learns to determine the phase duration satisfactorily and achieves a new state of the art, with improvements of up to 6% over the baselines in terms of adjusted average travel time; (2) DynamicLight-Lite matches or outperforms most baseline methods with only 19 parameters; and (3) DynamicLight-Cycle demonstrates high performance for current TSC systems without remarkable modification in an actual deployment. Our code is released on Github.
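    The first stage of DynamicLight is a simple rule; a sketch of Max Queue-Length phase selection with an illustrative phase-to-lane mapping (in the full framework, a deep Q-network would then set the duration of the selected phase):

        def select_phase(queue_lengths, phase_lanes):
            """Pick the phase whose served lanes have the largest total queue.
            queue_lengths: dict lane -> vehicles waiting.
            phase_lanes: dict phase -> lanes given right of way by that phase."""
            return max(phase_lanes,
                       key=lambda ph: sum(queue_lengths[l] for l in phase_lanes[ph]))

        queues = {"N": 7, "S": 4, "E": 2, "W": 3}
        phases = {"NS_green": ["N", "S"], "EW_green": ["E", "W"]}
        print(select_phase(queues, phases))   # "NS_green" (11 > 5)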
    DPM-Solver++: Fast Solver for Guided Sampling of Diffusion Probabilistic Models. (arXiv:2211.01095v1 [cs.LG])
    Diffusion probabilistic models (DPMs) have achieved impressive success in high-resolution image synthesis, especially in recent large-scale text-to-image generation applications. An essential technique for improving the sample quality of DPMs is guided sampling, which usually needs a large guidance scale to obtain the best sample quality. The commonly used fast sampler for guided sampling is DDIM, a first-order diffusion ODE solver that generally needs 100 to 250 steps for high-quality samples. Although recent works propose dedicated high-order solvers and achieve a further speedup for sampling without guidance, their effectiveness for guided sampling has not been well tested. In this work, we demonstrate that previous high-order fast samplers suffer from instability issues, and they even become slower than DDIM when the guidance scale grows large. To further speed up guided sampling, we propose DPM-Solver++, a high-order solver for the guided sampling of DPMs. DPM-Solver++ solves the diffusion ODE with the data prediction model and adopts thresholding methods to keep the solution matched to the training data distribution. We further propose a multistep variant of DPM-Solver++ to address the instability issue by reducing the effective step size. Experiments show that DPM-Solver++ can generate high-quality samples within only 15 to 20 steps for guided sampling with pixel-space and latent-space DPMs.
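    A minimal sketch of the kind of dynamic thresholding such solvers apply to the data prediction at each step (the percentile value and function names here are illustrative assumptions):

        import numpy as np

        def dynamic_threshold(x0_pred, percentile=99.5):
            # Clip the predicted clean sample x0 per batch element so it
            # stays within the (rescaled) training-data range [-1, 1].
            axes = tuple(range(1, x0_pred.ndim))
            s = np.percentile(np.abs(x0_pred), percentile, axis=axes, keepdims=True)
            s = np.maximum(s, 1.0)          # only act when values exceed 1
            return np.clip(x0_pred, -s, s) / s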
    Verifying And Interpreting Neural Networks using Finite Automata. (arXiv:2211.01022v1 [cs.FL])
    Verifying properties and interpreting the behaviour of deep neural networks (DNNs) is an important task given their ubiquitous use in applications, including safety-critical ones, and their blackbox nature. We propose an automata-theoretic approach to tackling problems arising in DNN analysis. We show that the input-output behaviour of a DNN can be captured precisely by a (special) weak B\"uchi automaton of exponential size. We show how these can be used to address common verification and interpretation tasks such as adversarial robustness and minimum sufficient reasons. We report on a proof-of-concept implementation translating DNNs to automata on finite words for better efficiency, at the cost of losing precision in analysis.
    Weighted variance variational autoencoder for speech enhancement. (arXiv:2211.00990v1 [cs.SD])
    We address speech enhancement based on variational autoencoders, which involves learning a speech prior distribution in the time-frequency (TF) domain. A zero-mean complex-valued Gaussian distribution is usually assumed for the generative model, where the speech information is encoded in the variance as a function of a latent variable. While this is the commonly used approach, in this paper we propose a weighted variance generative model, in which the contribution of each TF point to parameter learning is weighted. We impose a Gamma prior distribution on the weights, which effectively leads to a Student's t-distribution instead of a Gaussian for speech modeling. We develop efficient training and speech enhancement algorithms based on the proposed generative model. Our experimental results on spectrogram modeling and speech enhancement demonstrate the effectiveness and robustness of the proposed approach compared to the standard unweighted variance model.
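    As a rough sketch of the modeling idea, a zero-mean complex Gaussian whose variance is rescaled by a per-TF-point weight (marginalizing such weights against a Gamma prior yields a Student's t marginal); variable names are illustrative, not from the paper:

        import numpy as np

        def weighted_gaussian_nll(x, var, w):
            # x: complex STFT coefficients; var: decoder variance; w: weights.
            # NLL of x ~ CN(0, var / w); points with larger w contribute more
            # strongly to parameter learning.
            return np.sum(np.log(np.pi * var / w) + w * np.abs(x) ** 2 / var)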
    eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers. (arXiv:2211.01324v1 [cs.CV])
    Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis. Starting from random noise, such text-to-image diffusion models gradually synthesize images in an iterative fashion while conditioning on text prompts. We find that their synthesis behavior qualitatively changes throughout this process: Early in sampling, generation strongly relies on the text prompt to generate text-aligned content, while later, the text conditioning is almost entirely ignored. This suggests that sharing model parameters throughout the entire generation process may not be ideal. Therefore, in contrast to existing works, we propose to train an ensemble of text-to-image diffusion models specialized for different synthesis stages. To maintain training efficiency, we initially train a single model, which is then split into specialized models that are trained for the specific stages of the iterative generation process. Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference computation cost and preserving high visual quality, outperforming previous large-scale text-to-image diffusion models on the standard benchmark. In addition, we train our model to exploit a variety of embeddings for conditioning, including the T5 text, CLIP text, and CLIP image embeddings. We show that these different embeddings lead to different behaviors. Notably, the CLIP image embedding allows an intuitive way of transferring the style of a reference image to the target text-to-image output. Lastly, we show a technique that enables eDiffi's "paint-with-words" capability. A user can select a word in the input text and paint it on a canvas to control the output, which is handy for crafting the desired image. The project page is available at https://deepimagination.cc/eDiffi/
    Gradient Knowledge Distillation for Pre-trained Language Models. (arXiv:2211.01071v1 [cs.CL])
    Knowledge distillation (KD) is an effective framework to transfer knowledge from a large-scale teacher to a compact yet well-performing student. Previous KD practices for pre-trained language models mainly transfer knowledge by aligning instance-wise outputs between the teacher and student, while neglecting an important knowledge source, i.e., the gradient of the teacher. The gradient characterizes how the teacher responds to changes in inputs, which we assume is beneficial for the student to better approximate the underlying mapping function of the teacher. Therefore, we propose Gradient Knowledge Distillation (GKD) to incorporate the gradient alignment objective into the distillation process. Experimental results show that GKD outperforms previous KD methods regarding student performance. Further analysis shows that incorporating gradient knowledge makes the student behave more consistently with the teacher, improving the interpretability greatly.
    Neural Block-Slot Representations. (arXiv:2211.01177v1 [cs.CV])
    In this paper, we propose a novel object-centric representation, called Block-Slot Representation. Unlike the conventional slot representation, the Block-Slot Representation provides concept-level disentanglement within a slot. A block-slot is constructed by composing a set of modular concept representations, called blocks, generated from a learned memory of abstract concept prototypes. We call this block-slot construction process Block-Slot Attention. Block-Slot Attention facilitates the emergence of abstract concept blocks within a slot such as color, position, and texture, without any supervision. This brings the benefits of disentanglement into slots and the representation becomes more interpretable. Similar to Slot Attention, this mechanism can be used as a drop-in module in any arbitrary neural architecture. In experiments, we show that our model disentangles object properties significantly better than the previous methods, including complex textured scenes. We also demonstrate the ability to compose novel scenes by composing slots at the block-level.
    Explainable AI over the Internet of Things: Overview, State-of-the-Art and Future Directions. (arXiv:2211.01036v1 [cs.AI])
    Explainable Artificial Intelligence (XAI) is transforming the field of Artificial Intelligence (AI) by enhancing the trust of end-users in machines. As the number of connected devices keeps growing, the Internet of Things (IoT) market needs to be trustworthy for end-users. However, the existing literature still lacks a systematic and comprehensive survey of the use of XAI for IoT. To bridge this gap, in this paper we address XAI frameworks with a focus on their characteristics and support for IoT. We illustrate the widely used XAI services for IoT applications, such as security enhancement, the Internet of Medical Things (IoMT), Industrial IoT (IIoT), and the Internet of City Things (IoCT). We also suggest the implementation choice of XAI models over IoT systems in these applications with appropriate examples and summarize the key inferences for future works. Moreover, we present cutting-edge developments in edge XAI structures and the support of sixth-generation (6G) communication services for IoT applications, along with key inferences. In a nutshell, this paper constitutes the first holistic compilation on the development of XAI-based frameworks tailored for the demands of future IoT use cases.
    On-Device Model Fine-Tuning with Label Correction in Recommender Systems. (arXiv:2211.01163v1 [cs.IR])
    To meet the practical requirements of low latency, low cost, and good privacy in online intelligent services, more and more deep learning models are offloaded from the cloud to mobile devices. To further deal with cross-device data heterogeneity, the offloaded models normally need to be fine-tuned with each individual user's local samples before being put into real-time inference. In this work, we focus on the fundamental click-through rate (CTR) prediction task in recommender systems and study how to effectively and efficiently perform on-device fine-tuning. We first identify the bottleneck issue: each individual user's local CTR (i.e., the ratio of positive samples in the local dataset for fine-tuning) tends to deviate from the global CTR (i.e., the ratio of positive samples in all users' mixed datasets on the cloud used to train the initial model). We further demonstrate that such a CTR drift problem makes on-device fine-tuning even harmful to item ranking. We thus propose a novel label correction method, which requires each user only to change the labels of the local samples ahead of on-device fine-tuning and aligns the local prior CTR well with the global CTR. The offline evaluation results over three datasets and five CTR prediction models, as well as the online A/B testing results in Mobile Taobao, demonstrate the necessity of label correction in on-device fine-tuning and also reveal the improvement over cloud-based learning without fine-tuning.
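    A toy sketch of such a correction, assuming (purely for illustration) that the samples to relabel are chosen at random; the paper's actual selection criterion is not specified in this abstract:

        import numpy as np

        def correct_labels(labels, global_ctr, seed=0):
            # Relabel just enough local samples that the local positive
            # ratio (local CTR) matches the global CTR.
            rng = np.random.default_rng(seed)
            labels = labels.copy()
            diff = int(round((global_ctr - labels.mean()) * len(labels)))
            if diff > 0:      # too few positives: flip some negatives
                idx = rng.choice(np.flatnonzero(labels == 0), diff, replace=False)
                labels[idx] = 1
            elif diff < 0:    # too many positives: flip some positives
                idx = rng.choice(np.flatnonzero(labels == 1), -diff, replace=False)
                labels[idx] = 0
            return labels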
    UniASM: Binary Code Similarity Detection without Fine-tuning. (arXiv:2211.01144v1 [cs.CR])
    Binary code similarity detection (BCSD) is widely used in various binary analysis tasks such as vulnerability search, malware detection, clone detection, and patch analysis. Recent studies have shown that learning-based binary code embedding models perform better than traditional feature-based approaches. In this paper, we propose a novel transformer-based binary code embedding model, named UniASM, to learn representations of binary functions. We design two new training tasks to make the spatial distribution of the generated vectors more uniform, so that they can be used directly in BCSD without any fine-tuning. In addition, we propose a new tokenization approach for binary functions, increasing the tokens' semantic information while mitigating the out-of-vocabulary (OOV) problem. The experimental results show that UniASM outperforms state-of-the-art (SOTA) approaches on the evaluation dataset. The average recall@1 scores in the cross-compiler, cross-optimization-level, and cross-obfuscation settings are 0.72, 0.63, and 0.77, respectively, all higher than existing SOTA baselines. In a real-world task of known vulnerability search, UniASM outperforms all current baselines.
    Instance-Dependent Generalization Bounds via Optimal Transport. (arXiv:2211.01258v1 [stat.ML])
    Existing generalization bounds fail to explain crucial factors that drive generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization, and fail to account for the fact that the set of parameters, considered during initialization and training, is much more restricted than the entire parameter space. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
    Unsupervised denoising for sparse multi-spectral computed tomography. (arXiv:2211.01159v1 [eess.IV])
    Multi-energy computed tomography (CT) with photon counting detectors (PCDs) enables spectral imaging as PCDs can assign the incoming photons to specific energy channels. However, PCDs with many spectral channels drastically increase the computational complexity of the CT reconstruction, and bespoke reconstruction algorithms need fine-tuning to varying noise statistics. Especially if many projections are taken, a large amount of data has to be collected and stored. Sparse view CT is one solution for data reduction. However, these issues are especially exacerbated when sparse imaging scenarios are encountered due to a significant reduction in photon counts. In this work, we investigate the suitability of learning-based improvements to the challenging task of obtaining high-quality reconstructions from sparse measurements for a 64-channel PCD-CT. In particular, to overcome missing reference data for the training procedure, we propose an unsupervised denoising and artefact removal approach by exploiting different filter functions in the reconstruction and an explicit coupling of spectral channels with the nuclear norm. Performance is assessed on both simulated synthetic data and the openly available experimental Multi-Spectral Imaging via Computed Tomography (MUSIC) dataset. We compared the quality of our unsupervised method to iterative total nuclear variation regularized reconstructions and a supervised denoiser trained with reference data. We show that improved reconstruction quality can be achieved with flexibility on noise statistics and effective suppression of streaking artefacts when using unsupervised denoising with spectral coupling.
    Thunderstorm nowcasting with deep learning: a multi-hazard data fusion model. (arXiv:2211.01001v1 [physics.ao-ph])
    Predictions of thunderstorm-related hazards are needed in several sectors, including first responders, infrastructure management and aviation. To address this need, we present a deep learning model that can be adapted to different hazard types. The model can utilize multiple data sources; we use data from weather radar, lightning detection, satellite visible/infrared imagery, numerical weather prediction and digital elevation models. It can be trained to operate with any combination of these sources, such that predictions can still be provided if one or more of the sources become unavailable. We demonstrate the ability of the model to predict lightning, hail and heavy precipitation probabilistically on a 1 km resolution grid, with a time resolution of 5 min and lead times up to 60 min. Shapley values quantify the importance of the different data sources, showing that the weather radar products are the most important predictors for all three hazard types.
    A Two Step Approach to Weighted Bipartite Link Recommendations. (arXiv:2211.01153v1 [cs.IR])
    Many real-world person-person or person-product relationships can be modeled graphically. More specifically, bipartite graphs can be especially useful when modeling scenarios that involve two disjoint groups. As a result, many existing papers have utilized bipartite graphs for the classical link recommendation problem. In this paper, using the principle of bipartite graphs, we present another approach to this problem: a two-step algorithm that takes into account the frequency of and similarity between common edges to make recommendations. We test this approach with bipartite data gathered from the Epinions and MovieLens data sources, and find it to perform with roughly 14 percent error, which improves upon baseline results. This is a promising result, and it can be refined to generate even more accurate recommendations.
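    A compact sketch of a two-step scheme in this spirit (the exact frequency/similarity weighting used by the paper is an assumption here, with Jaccard similarity as a stand-in):

        def recommend(user, graph, top_k=10):
            # graph: dict mapping each user to the set of items they rated.
            # Step 1: weight other users by Jaccard similarity of rated items.
            # Step 2: score each unseen item by the summed similarity of the
            # users who rated it (frequency weighted by similarity).
            seen = graph[user]
            scores = {}
            for other, items in graph.items():
                if other == user or not (seen & items):
                    continue
                sim = len(seen & items) / len(seen | items)
                for item in items - seen:
                    scores[item] = scores.get(item, 0.0) + sim
            return sorted(scores, key=scores.get, reverse=True)[:top_k]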
    Audio-visual speech enhancement with a deep Kalman filter generative model. (arXiv:2211.00988v1 [cs.CV])
    Deep latent variable generative models based on variational autoencoders (VAEs) have shown promising performance for audiovisual speech enhancement (AVSE). The underlying idea is to learn a VAE-based audiovisual prior distribution for clean speech data, and then combine it with a statistical noise model to recover a speech signal from a noisy audio recording and video (lip images) of the target speaker. Existing generative models developed for AVSE do not take into account the sequential nature of speech data, which prevents them from fully incorporating the power of visual data. In this paper, we present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables and effectively fuses audiovisual data. Moreover, we develop an efficient inference methodology to estimate speech signals at test time. We conduct a set of experiments to compare different variants of generative models for speech enhancement. The results demonstrate the superiority of the AV-DKF model compared with both its audio-only version and the non-sequential audio-only and audiovisual VAE-based models.
    Inference and Denoise: Causal Inference-based Neural Speech Enhancement. (arXiv:2211.01189v1 [eess.AS])
    This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention. Based on the potential outcome framework, the proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE. Specifically, we use the presence of noise as guidance for EM selection during training, and the noise detector selects the enhancement module according to the prediction of the presence of noise for each frame. Moreover, we derive an SE-specific average treatment effect to quantify the causal effect adequately. Experimental evidence demonstrates that CISE outperforms a non-causal mask-based SE approach in the studied settings and has better performance and efficiency than more complex SE models.
    Generation of Anonymous Chest Radiographs Using Latent Diffusion Models for Training Thoracic Abnormality Classification Systems. (arXiv:2211.01323v1 [eess.IV])
    The availability of large-scale chest X-ray datasets is a requirement for developing well-performing deep learning-based algorithms in thoracic abnormality detection and classification. However, biometric identifiers in chest radiographs hinder the public sharing of such data for research purposes due to the risk of patient re-identification. To counteract this issue, synthetic data generation offers a solution for anonymizing medical images. This work employs a latent diffusion model to synthesize an anonymous chest X-ray dataset of high-quality class-conditional images. We propose a privacy-enhancing sampling strategy to ensure the non-transference of biometric information during the image generation process. The quality of the generated images and the feasibility of serving as exclusive training data are evaluated on a thoracic abnormality classification task. Compared to a classifier trained on real data, we achieve competitive results with a performance gap of only 3.5% in the area under the receiver operating characteristic curve.
    Fantasizing with Dual GPs in Bayesian Optimization and Active Learning. (arXiv:2211.01053v1 [cs.LG])
    Gaussian processes (GPs) are the main surrogate functions used for sequential modelling such as Bayesian Optimization and Active Learning. Their drawbacks are poor scaling with data and the need to run an optimization loop when using a non-Gaussian likelihood. In this paper, we focus on `fantasizing' batch acquisition functions that need the ability to condition on new fantasized data in a computationally efficient manner. By using a sparse Dual GP parameterization, we gain linear scaling with batch size as well as one-step updates for non-Gaussian likelihoods, thus extending sparse models to greedy batch fantasizing acquisition functions.
    Entropic Neural Optimal Transport via Diffusion Processes. (arXiv:2211.01156v1 [cs.LG])
    We propose a novel neural algorithm for the fundamental problem of computing the entropic optimal transport (EOT) plan between probability distributions which are accessible by samples. Our algorithm is based on the saddle point reformulation of the dynamic version of EOT, which is known as the Schr\"odinger Bridge problem. In contrast to the prior methods for large-scale EOT, our algorithm is end-to-end and consists of a single learning step, has a fast inference procedure, and allows handling small values of the entropy regularization coefficient, which is of particular importance in some applied problems. Empirically, we show the performance of the method on several large-scale EOT tasks.
    Discover Important Paths in the Knowledge Graph Based on Dynamic Relation Confidence. (arXiv:2211.00914v1 [cs.AI])
    Most existing knowledge graphs are incomplete and can be complemented by reasoning algorithms. Reasoning methods based on path features are widely used in the field of knowledge graph reasoning and completion on account of their strong interpretability. However, such methods still have several problems: path search is inefficient, paths are insufficient for sparse tasks, and some paths are not helpful for the reasoning task. In order to solve these problems, this paper proposes a method called DC-Path that combines dynamic relation confidence with other indicators to evaluate path features, then guides the path search, and finally conducts relation reasoning. Experimental results show that, compared with existing relation reasoning algorithms, this method can select the most representative features for the current reasoning task from the knowledge graph and achieve better performance on the current relation reasoning task.
    User-Entity Differential Privacy in Learning Natural Language Models. (arXiv:2211.01141v1 [cs.CR])
    In this paper, we introduce a novel concept of user-entity differential privacy (UeDP) to provide formal privacy protection simultaneously to both sensitive entities in textual data and data owners in learning natural language models (NLMs). To preserve UeDP, we developed a novel algorithm, called UeDP-Alg, optimizing the trade-off between privacy loss and model utility with a tight sensitivity bound derived from seamlessly combining user and sensitive entity sampling processes. An extensive theoretical analysis and evaluation show that our UeDP-Alg outperforms baseline approaches in model utility under the same privacy budget consumption on several NLM tasks, using benchmark datasets.
    Model-based Reinforcement Learning with a Hamiltonian Canonical ODE Network. (arXiv:2211.00942v1 [cs.LG])
    Model-based reinforcement learning usually suffers from a high sample complexity in training the world model, especially for environments with complex dynamics. To make training for general physical environments more efficient, we introduce Hamiltonian canonical ordinary differential equations into the learning process, which inspires a novel model of neural ordinary differential auto-encoder (NODA). NODA can model the physical world by nature and is flexible to impose Hamiltonian mechanics (e.g., the dimension of the physical equations), which can further accelerate training of the environment models. It can consequently empower an RL agent with robust extrapolation from a small number of samples as well as a guarantee of physical plausibility. Theoretically, we prove that NODA has uniform bounds for multi-step transition errors and value errors under certain conditions. Extensive experiments show that NODA can learn the environment dynamics effectively with a high sample efficiency, making it possible to facilitate reinforcement learning agents at the early stage.
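    The canonical structure being imposed can be made concrete with a short sketch; how NODA parameterizes the Hamiltonian with networks is not detailed in the abstract, so the step below only illustrates the underlying ODE form:

        def symplectic_euler_step(q, p, dH_dq, dH_dp, dt):
            # One step of Hamilton's canonical equations
            #   dq/dt = dH/dp,  dp/dt = -dH/dq,
            # integrated with the structure-preserving symplectic Euler method.
            p_next = p - dt * dH_dq(q, p)
            q_next = q + dt * dH_dp(q, p_next)
            return q_next, p_next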
    Deep Reinforcement Learning for Power Control in Next-Generation WiFi Network Systems. (arXiv:2211.01107v1 [cs.NI])
    This paper presents a deep reinforcement learning (DRL) solution for power control in wireless communications, describes its embedded implementation with WiFi transceivers for a WiFi network system, and evaluates the performance with high-fidelity emulation tests. In a multi-hop wireless network, each mobile node measures its link quality and signal strength, and controls its transmit power. As a model-free solution, reinforcement learning allows nodes to adapt their actions by observing the states and maximize their cumulative rewards over time. For each node, the state consists of transmit power, link quality and signal strength; the action adjusts the transmit power; and the reward combines energy efficiency (throughput normalized by energy consumption) and penalty of changing the transmit power. As the state space is large, Q-learning is hard to implement on embedded platforms with limited memory and processing power. By approximating the Q-values with a DQN, DRL is implemented for the embedded platform of each node combining an ARM processor and a WiFi transceiver for 802.11n. Controllable and repeatable emulation tests are performed by inducing realistic channel effects on RF signals. Performance comparison with benchmark schemes of fixed and myopic power allocations shows that power control with DRL provides major improvements to energy efficiency and throughput in WiFi network systems.
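    A minimal sketch of the reward described above; the penalty coefficient and the exact functional form of the power-change penalty are assumptions:

        def reward(throughput, energy, power_change, penalty=0.1):
            # Energy efficiency (throughput normalized by energy consumption)
            # minus a penalty for changing the transmit power.
            return throughput / max(energy, 1e-9) - penalty * abs(power_change)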
    Joint Correlation Detection and Alignment of Gaussian Databases. (arXiv:2211.01069v1 [cs.IT])
    In this work, we propose an efficient two-stage algorithm solving a joint problem of correlation detection and permutation recovery between two Gaussian databases. Correlation detection is a hypothesis testing problem; under the null hypothesis, the databases are independent, and under the alternative hypothesis, they are correlated under an unknown row permutation. We develop relatively tight bounds on the type-I and type-II error probabilities, and show that the analyzed detector performs better than a recently proposed detector, at least for some specific parameter choices. Since the proposed detector relies on a statistic that is a sum of dependent indicator random variables, we develop a novel graph-theoretic technique for bounding the $k$-th order moments of such statistics in order to bound the type-I error probability. When the databases are accepted as correlated, the algorithm also outputs an estimate of the underlying row permutation. By comparing to known converse results for this problem, we prove that the alignment error probability converges to zero under the asymptotically lowest possible correlation coefficient.
    Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches. (arXiv:2211.00889v1 [cs.LG])
    SOTA decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives like Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously, there is still a synchronization barrier to make sure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays in waiting for the slowest learners (stragglers) remain a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose (de)centralized Non-blocking SGD (Non-blocking SGD), which can address the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on the finished mini-batches. The non-blocking idea can be implemented using decentralized algorithms, including Ring All-Reduce, D-PSGD, and MATCHA, to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. A run-time analysis with random straggler delays and the computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up training and accelerates convergence. Compared with state-of-the-art decentralized asynchronous algorithms like D-PSGD and MATCHA, Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
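    The core idea can be sketched as follows; the `ready` predicate stands in for a non-blocking communication check (e.g. polling an all-reduce handle), which is an abstraction for illustration, not the paper's API:

        import numpy as np

        def nonblocking_update(params, grad_fn, batch, n_mini, lr, ready):
            # Split the batch into mini-batches, accumulate gradients only
            # from mini-batches that have finished, and update immediately,
            # so fast learners never block on stragglers.
            acc, done = np.zeros_like(params), 0
            for i, mb in enumerate(np.array_split(batch, n_mini)):
                if ready(i):
                    acc += grad_fn(params, mb)
                    done += 1
            return params - lr * acc / max(done, 1)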
    Deep Learning for Inflexible Multi-Asset Hedging of incomplete market. (arXiv:2211.00948v1 [q-fin.ST])
    Models trained under complete-market assumptions usually do not work in an incomplete market. This paper solves the hedging problem in an incomplete market with three sources of incompleteness: risk factors, illiquidity, and discrete transaction dates. A new jump-diffusion model is proposed to describe stochastic asset prices. Three neural networks, namely RNN, LSTM, and Mogrifier-LSTM, are used to obtain hedging strategies, with MSE loss and Huber loss implemented and compared. As a result, Mogrifier-LSTM is the fastest model and gives the best results under both MSE and Huber loss.
    Neural Active Learning on Heteroskedastic Distributions. (arXiv:2211.00928v1 [cs.LG])
    Models that can actively seek out the best quality training data hold the promise of more accurate, adaptable, and efficient machine learning. State-of-the-art active learning techniques tend to prefer examples that are the most difficult to classify. While this works well on homogeneous datasets, we find that it can lead to catastrophic failures when performed on multiple distributions with different degrees of label noise or heteroskedasticity. These active learning algorithms strongly prefer to draw from the distribution with more noise, even if their examples have no informative structure (such as solid color images with random labels). We demonstrate the catastrophic failure of these active learning algorithms on heteroskedastic distributions and propose a fine-tuning-based approach to mitigate these failures. Further, we propose a new algorithm that incorporates a model difference scoring function for each data point to filter out the noisy examples and sample clean examples that maximize accuracy, outperforming the existing active learning techniques on the heteroskedastic datasets. We hope these observations and techniques are immediately helpful to practitioners and can help to challenge common assumptions in the design of active learning algorithms.
    Variational Hierarchical Mixtures for Learning Probabilistic Inverse Dynamics. (arXiv:2211.01120v1 [cs.LG])
    Well-calibrated probabilistic regression models are a crucial learning component in robotics applications as datasets grow rapidly and tasks become more complex. Classical regression models are usually either probabilistic kernel machines with a flexible structure that does not scale gracefully with data or deterministic and vastly scalable automata, albeit with a restrictive parametric form and poor regularization. In this paper, we consider a probabilistic hierarchical modeling paradigm that combines the benefits of both worlds to deliver computationally efficient representations with inherent complexity regularization. The presented approaches are probabilistic interpretations of local regression techniques that approximate nonlinear functions through a set of local linear or polynomial units. Importantly, we rely on principles from Bayesian nonparametrics to formulate flexible models that adapt their complexity to the data and can potentially encompass an infinite number of components. We derive two efficient variational inference techniques to learn these representations and highlight the advantages of hierarchical infinite local regression models, such as dealing with non-smooth functions, mitigating catastrophic forgetting, and enabling parameter sharing and fast predictions. Finally, we validate this approach on a set of large inverse dynamics datasets and test the learned models in real-world control scenarios.
    Spot the fake lungs: Generating Synthetic Medical Images using Neural Diffusion Models. (arXiv:2211.00902v1 [eess.IV])
    Generative models are becoming popular for the synthesis of medical images. Recently, neural diffusion models have demonstrated the potential to generate photo-realistic images of objects. However, their potential to generate medical images has not yet been explored. In this work, we explore the possibilities of synthesizing medical images using neural diffusion models. First, we use a pre-trained DALLE2 model to generate lung X-ray and CT images from an input text prompt. Second, we train a stable diffusion model with 3165 X-ray images and generate synthetic images. We evaluate the synthetic image data through a qualitative analysis in which two independent radiologists label randomly chosen samples from the generated data as real, fake, or unsure. Results demonstrate that images generated with the diffusion model can translate characteristics that are otherwise very specific to certain medical conditions in chest X-ray or CT images. Careful tuning of the model can be very promising. To the best of our knowledge, this is the first attempt to generate lung X-ray and CT images using neural diffusion models. This work aims to introduce a new dimension in artificial intelligence for medical imaging. Given that this is a new topic, the paper will serve as an introduction and motivation for the research community to explore the potential of diffusion models for medical image synthesis. We have released the synthetic images on https://www.kaggle.com/datasets/hazrat/awesomelungs.
    Passage-Mask: A Learnable Regularization Strategy for Retriever-Reader Models. (arXiv:2211.00915v1 [cs.CL])
    Retriever-reader models achieve competitive performance across many different NLP tasks such as open question answering and dialogue conversations. In this work, we notice these models easily overfit the top-rank retrieval passages and standard training fails to reason over the entire retrieval passages. We introduce a learnable passage mask mechanism which desensitizes the impact from the top-rank retrieval passages and prevents the model from overfitting. Controlling the gradient variance with fewer mask candidates and selecting the mask candidates with one-shot bi-level optimization, our learnable regularization strategy enforces the answer generation to focus on the entire retrieval passages. Experiments on different tasks across open question answering, dialogue conversation, and fact verification show that our method consistently outperforms its baselines. Extensive experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
    Multi-task Learning for Source Attribution and Field Reconstruction for Methane Monitoring. (arXiv:2211.00864v1 [cs.LG])
    Inferring the source information of greenhouse gases, such as methane, from spatially sparse sensor observations is an essential element in mitigating climate change. While it is well understood that the complex behavior of the atmospheric dispersion of such pollutants is governed by the Advection-Diffusion equation, it is difficult to directly apply the governing equations to identify the source location and magnitude (inverse problem) because of the spatially sparse and noisy observations, i.e., the pollution concentration is known only at the sensor locations and sensor sensitivity is limited. Here, we develop a multi-task learning framework that can provide high-fidelity reconstruction of the concentration field and identify emission characteristics of the pollution sources, such as their location, emission strength, etc., from sparse sensor observations. We demonstrate that our proposed framework is able to achieve accurate reconstruction of the methane concentrations from sparse sensor measurements as well as precisely pinpoint the location and emission strength of these pollution sources.
    A Data-driven Case-based Reasoning in Bankruptcy Prediction. (arXiv:2211.00921v1 [q-fin.RM])
    There has been intensive research regarding machine learning models for predicting bankruptcy in recent years. However, the lack of interpretability limits their growth and practical implementation. This study proposes a data-driven explainable case-based reasoning (CBR) system for bankruptcy prediction. Empirical results from a comparative study show that the proposed approach outperforms existing alternative CBR systems and is competitive with state-of-the-art machine learning models. We also demonstrate that the asymmetrical feature similarity comparison mechanism in the proposed CBR system can effectively capture the asymmetrically distributed nature of financial attributes, such as a few companies controlling more cash than the majority, hence improving both the accuracy and explainability of predictions. In addition, we carefully examine the explainability of the CBR system in the decision-making process of bankruptcy prediction. While much research suggests a trade-off between improving prediction accuracy and explainability, our findings show a prospective research avenue in which an explainable model that thoroughly incorporates data attributes by design can reconcile the dilemma.
    Linear Embedding-based High-dimensional Batch Bayesian Optimization without Reconstruction Mappings. (arXiv:2211.00947v1 [stat.ML])
    The optimization of high-dimensional black-box functions is a challenging problem. When a low-dimensional linear embedding structure can be assumed, existing Bayesian optimization (BO) methods often transform the original problem into optimization in a low-dimensional space. They exploit the low-dimensional structure and reduce the computational burden. However, we reveal that this approach could be limited or inefficient in exploring the high-dimensional space mainly due to the biased reconstruction of the high-dimensional queries from the low-dimensional queries. In this paper, we investigate a simple alternative approach: tackling the problem in the original high-dimensional space using the information from the learned low-dimensional structure. We provide a theoretical analysis of the exploration ability. Furthermore, we show that our method is applicable to batch optimization problems with thousands of dimensions without any computational difficulty. We demonstrate the effectiveness of our method on high-dimensional benchmarks and a real-world function.
    Automatic Quantitative Analysis of Brain Organoids via Deep Learning. (arXiv:2211.00750v1 [eess.IV])
    Recent advances in brain organoid technology have the potential to change how doctors and researchers understand and treat cerebral diseases. Despite the remarkable use of brain organoids derived from human stem cells in new drug testing, disease modeling, and scientific research, it is still heavily time-consuming for humans to observe and analyze the internal structure, cells, and neurons inside organoids, and there is no standard quantitative analysis method that leverages growing AI technology for brain organoids. In this paper, an automated computer-assisted analysis method is proposed for brain organoid slice channels tagged with different fluorescent markers. We apply the method to two channels of two groups of microscopy images, and the experimental results show a clear difference between wild-type and mutant cerebral organoids.  ( 2 min )
    Geodesic Sinkhorn: optimal transport for high-dimensional datasets. (arXiv:2211.00805v1 [cs.LG])
    Understanding the dynamics and reactions of cells from population snapshots is a major challenge in single-cell transcriptomics. Here, we present Geodesic Sinkhorn, a method for interpolating populations along a data manifold that leverages existing kernels developed for single-cell dimensionality reduction and visualization methods. Our Geodesic Sinkhorn method uses a heat-geodesic ground distance that, as compared to Euclidean ground distances, is more accurate for interpolating single-cell dynamics on a wide variety of datasets and significantly speeds up the computation for sparse kernels. We first apply Geodesic Sinkhorn to 10 single-cell transcriptomics time series interpolation datasets as a drop-in replacement for existing interpolation methods, where it outperforms them on all datasets, showing its effectiveness in modeling cell dynamics. Second, we show how to efficiently approximate the operator with polynomial kernels, allowing us to improve scaling to large datasets. Finally, we define the conditional Wasserstein-average treatment effect and show how it can elucidate the treatment effect on single-cell populations in a drug screen.
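    For orientation, a standard entropic Sinkhorn solver over a precomputed ground cost looks as below; the paper's contribution, the heat-geodesic ground distance and its efficient kernel approximation, would enter through the `cost` matrix, which is abstracted away in this sketch:

        import numpy as np

        def sinkhorn(mu, nu, cost, eps=0.05, n_iter=200):
            # Entropic OT between histograms mu and nu (both sum to 1)
            # under ground cost `cost`; returns the transport plan.
            K = np.exp(-cost / eps)
            u = np.ones_like(mu)
            for _ in range(n_iter):
                v = nu / (K.T @ u)
                u = mu / (K @ v)
            return u[:, None] * K * v[None, :]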
    A new method for determining Wasserstein 1 optimal transport maps from Kantorovich potentials, with deep learning applications. (arXiv:2211.00820v1 [math.OC])
    Wasserstein 1 optimal transport maps provide a natural correspondence between points from two probability distributions, $\mu$ and $\nu$, which is useful in many applications. Available algorithms for computing these maps do not appear to scale well to high dimensions. In deep learning applications, efficient algorithms have been developed for approximating solutions of the dual problem, known as Kantorovich potentials, using neural networks (e.g. [Gulrajani et al., 2017]). Importantly, such algorithms work well in high dimensions. In this paper we present an approach towards computing Wasserstein 1 optimal transport maps that relies only on Kantorovich potentials. In general, a Wasserstein 1 optimal transport map is not unique and is not computable from a potential alone. Our main result is to prove that if $\mu$ has a density and $\nu$ is supported on a submanifold of codimension at least 2, an optimal transport map is unique and can be written explicitly in terms of a potential. These assumptions are natural in many image processing contexts and other applications. When the Kantorovich potential is only known approximately, our result motivates an iterative procedure wherein data is moved in optimal directions and with the correct average displacement. Since this provides an approach for transforming one distribution to another, it can be used as a multipurpose algorithm for various transport problems; we demonstrate through several proof of concept experiments that this algorithm successfully performs various imaging tasks, such as denoising, generation, translation and deblurring, which normally require specialized techniques.  ( 3 min )
    Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints. (arXiv:2211.01052v1 [cs.LG])
    Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, due to the requirement to stay close to the behavior policy to the same extent across the state space. Ideally, the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
    PromptCast: A New Prompt-based Learning Paradigm for Time Series Forecasting. (arXiv:2210.08964v2 [stat.ME] UPDATED)
    This paper studies the time series forecasting problem from a whole new perspective. In the existing SOTA time-series representation learning methods, the forecasting models take a sequence of numerical values as input and yield numerical values as output. The existing SOTA models are largely based on Transformer architecture, modified with multiple encoding mechanisms to incorporate the context and semantics around the historical data. In this paper, we approach representation learning of time-series from the paradigm of prompt-based natural language modeling. Inspired by the successes of pre-trained language foundation models, we pose a question about whether these models can also be adapted to solve time-series forecasting. Thus, we propose a new forecasting paradigm: prompt-based time series forecasting (PromptCast). In this novel task, the numerical input and output are transformed into prompts. We frame the forecasting task in a sentence-to-sentence manner which makes it possible to directly apply language models for forecasting purposes. To support and facilitate the research of this task, we also present a large-scale dataset (PISA) that includes three real-world forecasting scenarios. We evaluate different SOTA numerical-based forecasting methods and language generation models such as Bart. The benchmark results with single- and multi-step forecasting settings demonstrate that the proposed prompt-based time series forecasting with language generation models is a promising research direction. In addition, in comparison to conventional numerical-based forecasting, PromptCast shows a much better generalization ability under the zero-shot setting. We believe that the proposed PromptCast task as well as our PISA dataset could provide novel insights and further lead to new research directions in the domain of time-series representation learning and forecasting.
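    As a toy illustration of the sentence-to-sentence framing, numerical histories can be rendered as natural-language prompts; the template below is hypothetical, not the actual PISA format:

        def make_prompt(city, temps, horizon=1):
            # Turn a numerical history window into a forecasting prompt.
            history = ", ".join(f"{t} degrees" for t in temps)
            return (f"The temperature in {city} over the last {len(temps)} days "
                    f"was {history}. What will it be in the next {horizon} day(s)?")

        prompt = make_prompt("Sydney", [21, 23, 22])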
    Gradient Descent and the Power Method: Exploiting their connection to find the leftmost eigen-pair and escape saddle points. (arXiv:2211.00866v1 [math.OC])
    This work shows that applying Gradient Descent (GD) with a fixed step size to minimize a (possibly nonconvex) quadratic function is equivalent to running the Power Method (PM) on the gradients. The connection between GD with a fixed step size and the PM, both with and without fixed momentum, is thus established. Consequently, valuable eigen-information is available via GD. Recent examples show that GD with a fixed step size, applied to locally quadratic nonconvex functions, can take exponential time to escape saddle points (Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Aarti Singh, and Barnabas Poczos: "Gradient descent can take exponential time to escape saddle points"; S. Paternain, A. Mokhtari, and A. Ribeiro: "A newton-based method for nonconvex optimization with fast evasion of saddle points"). Here, those examples are revisited and it is shown that eigenvalue information was missing, so that the examples may not provide a complete picture of the potential practical behaviour of GD. Thus, ongoing investigation of the behaviour of GD on nonconvex functions, possibly with an \emph{adaptive} or \emph{variable} step size, is warranted. It is shown that, in the special case of a quadratic in $R^2$, if an eigenvalue is known, then GD with a fixed step size will converge in two iterations, and a complete eigen-decomposition is available. By considering the dynamics of the gradients and iterates, new step size strategies are proposed to improve the practical performance of GD. Several numerical examples are presented, which demonstrate the advantages of exploiting the GD--PM connection.
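    A small numeric sketch of the claimed equivalence: for a quadratic $f(x)=\frac{1}{2}x^\top Ax - b^\top x$, the gradients satisfy $g_{k+1} = (I - \alpha A)g_k$, a power iteration on $I - \alpha A$, so ratios of successive gradients expose the leftmost eigenvalue of $A$ (the step size and problem size below are arbitrary choices for illustration):

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.standard_normal((5, 5)); A = (A + A.T) / 2   # symmetric quadratic
        b = rng.standard_normal(5)
        alpha, x = 0.01, rng.standard_normal(5)
        g = A @ x - b                       # gradient of 0.5 x'Ax - b'x
        for _ in range(500):
            x = x - alpha * g               # plain GD with a fixed step size
            g_new = A @ x - b               # equals (I - alpha*A) @ g
            rho = (g_new @ g) / (g @ g)     # power-method Rayleigh ratio
            g = g_new
        print((1 - rho) / alpha)            # ~ leftmost eigenvalue of A
        print(np.linalg.eigvalsh(A)[0])     # reference value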
    Large deviations rates for stochastic gradient descent with strongly convex functions. (arXiv:2211.00969v1 [cs.LG])
    Recent works have shown that high probability metrics with stochastic gradient descent (SGD) are informative and in some cases advantageous compared to the commonly adopted mean-squared-error-based ones. In this work we provide a formal framework for the study of general high probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not necessarily bounded) gradient noise satisfying mild technical assumptions, allowing for the dependence of the noise distribution on the current iterate. Under the preceding assumptions, we find an upper large deviations bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis that captures only the noise dependence through the variance and does not capture the effect of higher order moments nor interplay between the noise geometry and the shape of the cost function. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained function matches the one from the general upper bound, hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate the theoretical findings.
    Impact of annotation modality on label quality and model performance in the automatic assessment of laughter in-the-wild. (arXiv:2211.00794v1 [cs.SD])
    Laughter is considered one of the most overt signals of joy. Laughter is well-recognized as a multimodal phenomenon but is most commonly detected by sensing the sound of laughter. It is unclear how perception and annotation of laughter differ when annotated from other modalities like video, via the body movements of laughter. In this paper we take a first step in this direction by asking if and how well laughter can be annotated when only audio, only video (containing full body movement information) or audiovisual modalities are available to annotators. We ask whether annotations of laughter are congruent across modalities, and compare the effect that labeling modality has on machine learning model performance. We compare annotations and models for laughter detection, intensity estimation, and segmentation, three tasks common in previous studies of laughter. Our analysis of more than 4000 annotations acquired from 48 annotators revealed evidence for incongruity in the perception of laughter, and its intensity between modalities. Further analysis of annotations against consolidated audiovisual reference annotations revealed that recall was lower on average for video when compared to the audio condition, but tended to increase with the intensity of the laughter samples. Our machine learning experiments compared the performance of state-of-the-art unimodal (audio-based, video-based and acceleration-based) and multi-modal models for different combinations of input modalities, training label modality, and testing label modality. Models with video and acceleration inputs had similar performance regardless of training label modality, suggesting that it may be entirely appropriate to train models for laughter detection from body movements using video-acquired labels, despite their lower inter-rater agreement.  ( 3 min )
    Maximum Likelihood Distillation for Robust Modulation Classification. (arXiv:2211.00748v1 [cs.LG])
    Deep Neural Networks are being extensively used in communication systems and Automatic Modulation Classification (AMC) in particular. However, they are very susceptible to small adversarial perturbations that are carefully crafted to change the network decision. In this work, we build on knowledge distillation ideas and adversarial training in order to build more robust AMC systems. We first outline the importance of the quality of the training data in terms of accuracy and robustness of the model. We then propose to use the Maximum Likelihood function, which could solve the AMC problem in offline settings, to generate better training labels. Those labels teach the model to be uncertain in challenging conditions, which increases the accuracy, as well as the robustness of the model when combined with adversarial training. Interestingly, we observe that this increase in performance transfers to online settings, where the Maximum Likelihood function cannot be used in practice. Overall, this work highlights the potential of learning to be uncertain in difficult scenarios, compared to directly removing label noise.  ( 2 min )
    Generalizability of Functional Forms for Interatomic Potential Models Discovered by Symbolic Regression. (arXiv:2210.15124v1 [cond-mat.mtrl-sci] CROSS LISTED)
    In recent years there has been great progress in the use of machine learning algorithms to develop interatomic potential models. Machine-learned potential models are typically orders of magnitude faster than density functional theory but also orders of magnitude slower than physics-derived models such as the embedded atom method. In our previous work, we used symbolic regression to develop fast, accurate and transferrable interatomic potential models for copper with novel functional forms that resemble those of the embedded atom method. To determine the extent to which the success of these forms was specific to copper, here we explore the generalizability of these models to other elements and analyze their out-of-sample performance on several material properties. We found that these forms work particularly well on elements that are chemically similar to copper. When compared to optimized Sutton-Chen models, which have similar complexity, the functional forms discovered using symbolic regression perform better across all elements considered except gold where they have a similar performance. They perform similarly to a moderately more complex embedded atom form on properties on which they were trained, and they are more accurate on average on other properties. We attribute this improved generalized accuracy to the relative simplicity of the models discovered using symbolic regression. The genetic programming models are found to outperform other models from the literature about 50% of the time, with about 1/10th the model complexity on average. We discuss the implications of these results to the broader application of symbolic regression to the development of new potentials and highlight how models discovered for one element can be used to seed new searches for different elements.  ( 3 min )
    Spatial-temporal recurrent reinforcement learning for autonomous ships. (arXiv:2211.01004v1 [cs.LG])
    The paper proposes a spatial-temporal recurrent neural network architecture for Deep $Q$-Networks to steer an autonomous ship. The network design allows handling an arbitrary number of surrounding target ships while offering robustness to partial observability. Further, a state-of-the-art collision risk metric is proposed to enable an easier assessment of different situations by the agent. The COLREG rules of maritime traffic are explicitly considered in the design of the reward function. The final policy is validated on a custom set of newly created single-ship encounters called "Around the Clock" problems and the commonly chosen Imazu (1987) problems, which include 18 multi-ship scenarios. Additionally, the framework shows robustness when deployed simultaneously in multi-agent scenarios. The proposed network architecture is compatible with other deep reinforcement learning algorithms, including actor-critic frameworks.
    Certified Robustness of Quantum Classifiers against Adversarial Examples through Quantum Noise. (arXiv:2211.00887v1 [quant-ph])
    Recently, quantum classifiers have been shown to be vulnerable to adversarial attacks, in which imperceptible perturbations fool them into misclassification. In this paper, we present a first theoretical study showing that adding quantum random rotation noise can improve the robustness of quantum classifiers against adversarial attacks. We draw a connection to the definition of differential privacy and demonstrate that a quantum classifier trained in the natural presence of additive noise is differentially private. Lastly, we derive a certified robustness bound that enables quantum classifiers to defend against adversarial examples, and support it with experimental results.
    Adversarial Auto-Augment with Label Preservation: A Representation Learning Principle Guided Approach. (arXiv:2211.00824v1 [cs.LG])
    Data augmentation is a critical contributing factor to the success of deep learning but heavily relies on prior domain knowledge, which is not always available. Recent works on automatic data augmentation learn a policy to form a sequence of augmentation operations, which are still pre-defined and restricted to limited options. In this paper, we show that the objective of a prior-free autonomous data augmentation can be derived from a representation learning principle that aims to preserve the minimum sufficient information of the labels. Given an example, the objective aims at creating a distant "hard positive example" as the augmentation, while still preserving the original label. We then propose a practical surrogate to the objective that can be optimized efficiently and integrated seamlessly into existing methods for a broad class of machine learning tasks, e.g., supervised, semi-supervised, and noisy-label learning. Unlike previous works, our method does not require training an extra generative model but instead leverages the intermediate layer representations of the end-task model for generating data augmentations. In experiments, we show that our method consistently brings non-trivial improvements to the three aforementioned learning tasks in both efficiency and final performance, whether or not combined with strong pre-defined augmentations, e.g., on medical images when domain knowledge is unavailable and the existing augmentation techniques perform poorly. Code is available at: https://github.com/kai-wen-yang/LPA3.
    CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification. (arXiv:2211.00640v1 [cs.LG])
    Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices. Recent approaches, such as XR-Transformer and LightXML, leverage a transformer instance to achieve state-of-the-art performance. However, in this process, these approaches need to make various trade-offs between performance and computational requirements. A major shortcoming, as compared to the Bi-LSTM based AttentionXML, is that they fail to keep separate feature representations for each resolution in a label tree. We thus propose CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness the multi-layered architecture of a transformer model for attending to different label resolutions with separate feature representations. CascadeXML significantly outperforms all existing approaches with non-trivial gains obtained on benchmark datasets consisting of up to three million labels. Code for CascadeXML will be made publicly available at \url{https://github.com/xmc-aalto/cascadexml}.  ( 2 min )
    Balancing Utility and Fairness in Submodular Maximization (Technical Report). (arXiv:2211.00980v1 [cs.DS])
    Submodular function maximization is central in numerous data science applications, including data summarization, influence maximization, and recommendation. In many of these problems, our goal is to find a solution that maximizes the \emph{average} of the utilities for all users, each measured by a monotone submodular function. When the population of users is composed of several demographic groups, another critical problem is whether the utility is fairly distributed across groups. In the context of submodular optimization, we seek to improve the welfare of the \emph{least well-off} group, i.e., to maximize the minimum utility for any group, to ensure fairness. Although the \emph{utility} and \emph{fairness} objectives are both desirable, they might contradict each other, and, to our knowledge, little attention has been paid to optimizing them jointly. In this paper, we propose a novel problem called \emph{Bicriteria Submodular Maximization} (BSM) to strike a balance between utility and fairness. Specifically, it requires finding a fixed-size solution to maximize the utility function, subject to the value of the fairness function not being below a threshold. Since BSM is inapproximable within any constant factor in general, we propose efficient data-dependent approximation algorithms for BSM by converting it into other submodular optimization problems and utilizing existing algorithms for the converted problems to obtain solutions to BSM. Using real-world and synthetic datasets, we showcase applications of our framework in three submodular maximization problems, namely maximum coverage, influence maximization, and facility location.
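    To make the bicriteria tension concrete, here is a toy greedy heuristic for the maximum-coverage instantiation: it first lifts the worst-off group's coverage above a threshold tau, then maximizes total coverage. This is our own illustrative sketch, not one of the paper's data-dependent approximation algorithms, and all inputs are hypothetical.

        # Illustrative greedy for balancing utility and fairness in max coverage.
        def bicriteria_greedy(items, groups, cover, k, tau):
            # cover[g][item] is the set of elements of group g covered by item.
            chosen, covered = [], {g: set() for g in groups}
            for _ in range(k):
                def gain(item, g):
                    return len(covered[g] | cover[g][item]) - len(covered[g])
                worst = min(groups, key=lambda g: len(covered[g]))
                if len(covered[worst]) < tau:      # fairness phase
                    best = max(items, key=lambda it: gain(it, worst))
                else:                              # utility phase
                    best = max(items, key=lambda it: sum(gain(it, g) for g in groups))
                chosen.append(best)
                for g in groups:
                    covered[g] |= cover[g][best]
                items = [it for it in items if it != best]
            return chosen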
    SIMD-size aware weight regularization for fast neural vocoding on CPU. (arXiv:2211.00898v1 [cs.SD])
    This paper proposes weight regularization for a faster neural vocoder. Pruning time-consuming DNN modules is a promising way to realize a real-time vocoder on a CPU (e.g. WaveRNN, LPCNet). Regularization that encourages sparsity is also effective in avoiding the quality degradation caused by pruning. However, for fast vocoding, the weights of a matrix must remain contiguous in blocks of the SIMD size. To ensure this layout, we propose explicitly SIMD-size-aware regularization. Our proposed method reshapes a weight matrix into a tensor so that the weights are aligned by group size in advance, and then computes a group Lasso-like regularization loss. Experiments on a 70% sparse subband WaveRNN show that pruning with conventional Lasso or column-wise group Lasso degrades the naturalness of the synthetic speech. The vocoder with the proposed regularization 1) achieves naturalness comparable to that without pruning and 2) runs meaningfully faster than other conventional vocoders using regularization.
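    To make the regularizer concrete, here is a minimal NumPy sketch under the assumption that groups run along matrix columns in blocks of the SIMD width; the grouping axis and padding convention are our assumptions, not details from the paper.

        # SIMD-size aware group Lasso: reshape weights into contiguous groups of
        # the SIMD width, then sum the per-group L2 norms.
        import numpy as np

        def simd_group_lasso(W, simd_size=8):
            rows, cols = W.shape
            assert cols % simd_size == 0, "pad columns to a multiple of the SIMD width"
            groups = W.reshape(rows, cols // simd_size, simd_size)
            return np.sqrt((groups ** 2).sum(axis=-1)).sum()  # sum of group L2 norms

        W = np.random.randn(16, 64)
        loss_reg = simd_group_lasso(W, simd_size=8)  # add lambda * loss_reg to the loss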
    Fair Wrapping for Black-box Predictions. (arXiv:2201.12947v3 [stat.ML] UPDATED)
    We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an $\alpha$-tree, which modifies the prediction. We provide two generic boosting algorithms to learn $\alpha$-trees. We show that our modification has appealing properties in terms of composition of $\alpha$-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value-at-risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.
    A Bayesian Framework on Asymmetric Mixture of Factor Analyser. (arXiv:2211.00729v1 [stat.ME])
    Mixture of factor analyzer (MFA) model is an efficient model for the analysis of high dimensional data through which the factor-analyzer technique based on the covariance matrices reducing the number of free parameters. The model also provides an important methodology to determine latent groups in data. There are several pieces of research to extend the model based on the asymmetrical and/or with outlier datasets with some known computational limitations that have been examined in frequentist cases. In this paper, an MFA model with a rich and flexible class of skew normal (unrestricted) generalized hyperbolic (called SUNGH) distributions along with a Bayesian structure with several computational benefits have been introduced. The SUNGH family provides considerable flexibility to model skewness in different directions as well as allowing for heavy tailed data. There are several desirable properties in the structure of the SUNGH family, including, an analytically flexible density which leads to easing up the computation applied for the estimation of parameters. Considering factor analysis models, the SUNGH family also allows for skewness and heavy tails for both the error component and factor scores. In the present study, the advantages of using this family of distributions have been discussed and the suitable efficiency of the introduced MFA model using real data examples and simulation has been demonstrated.
    Data-Driven Modeling of Landau Damping by Physics-Informed Neural Networks. (arXiv:2211.01021v1 [physics.plasm-ph])
    Kinetic approaches are generally accurate in dealing with microscale plasma physics problems but are computationally expensive for large-scale or multiscale systems. One of the long-standing problems in plasma physics is the integration of kinetic physics into fluid models, which is often achieved through sophisticated analytical closure terms. In this study, we successfully construct a multi-moment fluid model with an implicit fluid closure included in the neural network using machine learning. The multi-moment fluid model is trained with a small fraction of sparsely sampled data from kinetic simulations of Landau damping, using the physics-informed neural network (PINN) and the gradient-enhanced physics-informed neural network (gPINN). The multi-moment fluid model constructed using either PINN or gPINN reproduces the time evolution of the electric field energy, including its damping rate, and the plasma dynamics from the kinetic simulations. For the first time, we introduce a new variant of the gPINN architecture, namely, gPINN$p$ to capture the Landau damping process. Instead of including the gradients of all the equation residuals, gPINN$p$ only adds the gradient of the pressure equation residual as one additional constraint. Among the three approaches, the gPINN$p$-constructed multi-moment fluid model offers the most accurate results. This work sheds new light on the accurate and efficient modeling of large-scale systems, which can be extended to complex multiscale laboratory, space, and astrophysical plasma physics problems.
    LMD: A Learnable Mask Network to Detect Adversarial Examples for Speaker Verification. (arXiv:2211.00825v1 [eess.AS])
    Although the security of automatic speaker verification (ASV) is seriously threatened by recently emerged adversarial attacks, some countermeasures have been proposed to alleviate the threat. However, many defense approaches not only require prior knowledge of the attackers but also offer weak interpretability. To address this issue, in this paper, we propose an attacker-independent and interpretable method, named learnable mask detector (LMD), to separate adversarial examples from genuine ones. It utilizes score variation as an indicator to detect adversarial examples, where the score variation is the absolute discrepancy between the ASV scores of an original audio recording and of its transformed audio synthesized from its masked complex spectrogram. A core component of the score variation detector is generating the masked spectrogram with a neural network. The neural network needs only genuine examples for training, which makes it an attacker-independent approach. Its interpretability lies in the fact that the neural network is trained to minimize the score variation of the targeted ASV and maximize the number of masked spectrogram bins of the genuine training examples. It rests on the observation that masking out the vast majority of the spectrogram bins, which carry little speaker information, inevitably introduces a large score variation for an adversarial example and only a small one for a genuine example. Experimental results with 12 attackers and two representative ASV systems show that our proposed method outperforms five state-of-the-art baselines. The extensive experimental results can also serve as a benchmark for detection-based ASV defenses.
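    The indicator itself is simple to state in code. Below is a hedged sketch of the score-variation test, where asv_score, mask_net, and resynthesize are placeholders for the ASV system, the learnable mask network, and the spectrogram-to-waveform transform; the 1-D FFT is a simplification of the complex spectrogram used in the paper.

        # Score-variation detector sketch with placeholder components.
        import numpy as np

        def score_variation(audio, enrolled, asv_score, mask_net, resynthesize):
            spec = np.fft.rfft(audio)                # simplified complex spectrum
            masked_spec = mask_net(spec)             # learnable mask keeps most bins
            transformed = resynthesize(masked_spec)  # back to a waveform
            return abs(asv_score(audio, enrolled) - asv_score(transformed, enrolled))

        def is_adversarial(audio, enrolled, threshold, **components):
            # Large score variation -> flag the input as adversarial.
            return score_variation(audio, enrolled, **components) > threshold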
    mlr3spatiotempcv: Spatiotemporal resampling methods for machine learning in R. (arXiv:2110.12674v2 [stat.ML] UPDATED)
    Spatial and spatiotemporal machine-learning models require a suitable framework for their model assessment, model selection, and hyperparameter tuning, in order to avoid error estimation bias and over-fitting. This contribution reviews the state-of-the-art in spatial and spatiotemporal cross-validation, and introduces the R package mlr3spatiotempcv as an extension package of the machine-learning framework mlr3. Currently various R packages implementing different spatiotemporal partitioning strategies exist: blockCV, CAST, skmeans and sperrorest. The goal of mlr3spatiotempcv is to gather the available spatiotemporal resampling methods in R and make them available to users through a simple and common interface. This is made possible by integrating the package directly into the mlr3 machine-learning framework, which already has support for generic non-spatiotemporal resampling methods such as random partitioning. One advantage is the use of a consistent nomenclature in an overarching machine-learning toolkit instead of a varying package-specific syntax, making it easier for users to choose from a variety of spatiotemporal resampling methods. This package avoids giving recommendations on which method to use in practice, as this decision depends on the predictive task at hand, the autocorrelation within the data, and the spatial structure of the sampling design or geographic objects being studied.  ( 2 min )
    Transposed Variational Auto-encoder with Intrinsic Feature Learning for Traffic Forecasting. (arXiv:2211.00641v1 [cs.LG])
    In this technical report, we present our solutions to the Traffic4cast 2022 core challenge and extended challenge. In this competition, the participants are required to predict the traffic states for the next 15 minutes based on the vehicle counter data from the previous hour. Compared to other competitions in the same series, this year focuses on the prediction from different data sources and sparse vertex-to-edge generalization. To address these issues, we introduce the Transposed Variational Auto-encoder (TVAE) model to reconstruct the missing data and Graph Attention Networks (GAT) to strengthen the correlations between learned representations. We further apply feature selection to learn traffic patterns from diverse but easily available data. Our solutions have ranked first in both challenges on the final leaderboard. The source code is available at \url{https://github.com/Daftstone/Traffic4cast}  ( 2 min )
    Reinforcement Learning in Education: A Multi-Armed Bandit Approach. (arXiv:2211.00779v1 [cs.LG])
    Advances in reinforcement learning research have demonstrated the ways in which different agent-based models can learn how to optimally perform a task within a given environment. Reinforcement learning solves unsupervised problems where agents move through a state-action-reward loop to maximize the overall reward for the agent, which in turn optimizes the solving of a specific problem in a given environment. However, these algorithms are designed based on our understanding of actions that should be taken in a real-world environment to solve a specific problem. One such problem is the ability to identify, recommend and execute an action within a system where the users are the subject, such as in education. In recent years, the use of blended learning approaches, which integrate face-to-face learning with online learning, has increased in the education context. Additionally, online platforms used for education require the automation of certain functions, such as the identification, recommendation or execution of actions that can benefit the user, in this sense the student or learner. As promising as these scientific advances are, there is still a need to conduct research in a variety of different areas to ensure the successful deployment of these agents within education systems. Therefore, the aim of this study was to contextualise and simulate the cumulative reward within an environment for an intervention recommendation problem in the education context.
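    As a concrete baseline for the simulated setting above, a minimal epsilon-greedy bandit can serve as the intervention-recommendation agent; the arm reward probabilities below are invented for illustration.

        # Epsilon-greedy multi-armed bandit: each arm is a candidate intervention.
        import random

        def epsilon_greedy(true_probs, steps=10_000, eps=0.1):
            n_arms = len(true_probs)
            counts, values = [0] * n_arms, [0.0] * n_arms
            total = 0.0
            for _ in range(steps):
                arm = (random.randrange(n_arms) if random.random() < eps
                       else max(range(n_arms), key=lambda a: values[a]))
                reward = 1.0 if random.random() < true_probs[arm] else 0.0
                counts[arm] += 1
                values[arm] += (reward - values[arm]) / counts[arm]  # running mean
                total += reward
            return total, values

        cum_reward, estimates = epsilon_greedy([0.2, 0.5, 0.7])  # toy arms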
    An efficient algorithm for the $\ell_{p}$ norm based metric nearness problem. (arXiv:2211.01245v1 [math.OC])
    Given a dissimilarity matrix, the metric nearness problem is to find the nearest matrix of distances that satisfies the triangle inequalities. This problem has wide applications, such as sensor networks, image processing, and so on. However, it is very challenging even to obtain a moderately accurate solution due to the $O(n^{3})$ metric constraints and the nonsmooth objective function, which is usually a weighted $\ell_{p}$ norm based distance. In this paper, we propose a delayed constraint generation method, with each subproblem solved by the semismooth Newton based proximal augmented Lagrangian method (PALM), for the metric nearness problem. Due to the high memory requirement for the storage of the matrix related to the metric constraints, we take advantage of the special structure of the matrix and do not need to store the corresponding constraint matrix. A pleasing aspect of our algorithm is that we can solve problems involving up to $10^{8}$ variables and $10^{13}$ constraints. Numerical experiments demonstrate the efficiency of our algorithm. In theory, firstly, under a mild condition, we establish a primal-dual error bound condition which is essential for the analysis of the local convergence rate of PALM. Secondly, we prove the equivalence between the dual nondegeneracy condition and the nonsingularity of the generalized Jacobian for the inner subproblem of PALM. Thirdly, when $q(\cdot)=\|\cdot\|_{1}$ or $\|\cdot\|_{\infty}$, we also prove, without the strict complementarity condition, the equivalence between the dual nondegeneracy condition and the uniqueness of the primal solution.  ( 3 min )
    Practical Phase Retrieval Using Double Deep Image Priors. (arXiv:2211.00799v1 [cs.CV])
    Phase retrieval (PR) concerns the recovery of complex phases from complex magnitudes. We identify the connection between the difficulty level and the number and variety of symmetries in PR problems. We focus on the most difficult far-field PR (FFPR), and propose a novel method using double deep image priors. In realistic evaluation, our method outperforms all competing methods by large margins. As a single-instance method, our method requires no training data and minimal hyperparameter tuning, and hence enjoys good practicality.
    ADPTriage: Approximate Dynamic Programming for Bug Triage. (arXiv:2211.00872v1 [cs.SE])
    Bug triaging is a critical task in any software development project. It entails triagers going over a list of open bugs, deciding whether each is required to be addressed and, if so, which developer should fix it. However, the manual bug assignment in issue tracking systems (ITS) offers only a limited solution and might easily fail when triagers must handle a large number of bug reports. During automated assignment, there are multiple sources of uncertainty in the ITS, which should be addressed meticulously. In this study, we develop a Markov decision process (MDP) model for an online bug triage task. In addition to an optimization-based myopic technique, we provide an ADP-based bug triage solution, called ADPTriage, which is able to reflect downstream uncertainty in the bug arrivals and developers' timetables. Specifically, without placing any limits on the underlying stochastic process, this technique enables real-time decision-making on bug assignments while taking into consideration developers' expertise, bug type, and bug fixing time. Our results show a significant improvement over the myopic approach in terms of assignment accuracy and fixing time. We also demonstrate the empirical convergence of the model and conduct sensitivity analysis with various model parameters. Accordingly, this work constitutes a significant step forward in addressing the uncertainty in bug triage solutions.
    Nonparametric Involutive Markov Chain Monte Carlo. (arXiv:2211.01100v1 [cs.LG])
    A challenging problem in probabilistic programming is to develop inference algorithms that work for arbitrary programs in a universal probabilistic programming language (PPL). We present the nonparametric involutive Markov chain Monte Carlo (NP-iMCMC) algorithm as a method for constructing MCMC inference algorithms for nonparametric models expressible in universal PPLs. Building on the unifying involutive MCMC framework, and by providing a general procedure for driving state movement between dimensions, we show that NP-iMCMC can generalise numerous existing iMCMC algorithms to work on nonparametric models. We prove the correctness of the NP-iMCMC sampler. Our empirical study shows that the existing strengths of several iMCMC algorithms carry over to their nonparametric extensions. Applying our method to the recently proposed Nonparametric HMC, an instance of (Multiple Step) NP-iMCMC, we have constructed several nonparametric extensions (all of which are new) that exhibit significant performance improvements.  ( 2 min )
    Recurrent Neural Network Training with Convex Loss and Regularization Functions by Extended Kalman Filtering. (arXiv:2111.02673v3 [cs.LG] UPDATED)
    This paper investigates the use of extended Kalman filtering to train recurrent neural networks with rather general convex loss functions and regularization terms on the network parameters, including $\ell_1$-regularization. We show that the learning method is competitive with respect to stochastic gradient descent in a nonlinear system identification benchmark and in training a linear system with binary outputs. We also explore the use of the algorithm in data-driven nonlinear model predictive control and its relation with disturbance models for offset-free closed-loop tracking.
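    For reference, a single textbook EKF parameter-update step looks as follows; this generic sketch treats the network weights as the filter state and omits the paper's extensions to general convex losses and $\ell_1$-regularization.

        # One extended Kalman filter step for parameter estimation:
        # theta (n,) parameters, P (n,n) covariance, y/y_hat (m,) observation
        # and prediction, J (m,n) Jacobian of the output w.r.t. theta,
        # R (m,m) measurement noise, Q (n,n) process noise.
        import numpy as np

        def ekf_step(theta, P, y, y_hat, J, R, Q):
            S = J @ P @ J.T + R              # innovation covariance
            K = P @ J.T @ np.linalg.inv(S)   # Kalman gain
            theta = theta + K @ (y - y_hat)  # parameter update
            P = P - K @ J @ P + Q            # covariance update with process noise
            return theta, P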
    Impact Of Missing Data Imputation On The Fairness And Accuracy Of Graph Node Classifiers. (arXiv:2211.00783v1 [cs.LG])
    Analysis of the fairness of machine learning (ML) algorithms has recently attracted many researchers' interest. Most ML methods show bias toward protected groups, which limits the applicability of ML models in many applications, such as crime rate prediction. Moreover, the data may have missing values which, if not appropriately handled, are known to further harm fairness. Many imputation methods have been proposed to deal with missing data; however, the effect of missing data imputation on fairness has not been studied well. In this paper, we analyze the effect on fairness in the context of graph data (node attribute) imputation using different embedding and neural network methods. Extensive experiments on six datasets demonstrate severe fairness issues in missing data imputation under graph node classification. We also find that the choice of the imputation method affects both fairness and accuracy. Our results provide valuable insights into graph data fairness and how to handle missingness in graphs efficiently. This work also provides directions regarding theoretical studies on fairness in graph data.
    A Model-Constrained Tangent Slope Learning Approach for Dynamical Systems. (arXiv:2208.04995v2 [cs.LG] UPDATED)
    Real-time accurate solutions of large-scale complex dynamical systems are critically needed for control, optimization, uncertainty quantification, and decision-making in practical engineering and science applications, especially digital twin applications. This paper contributes in this direction a model-constrained tangent slope learning (mcTangent) approach. At the heart of mcTangent is the synergy of several desirable strategies: i) a tangent slope learning to take advantage of the neural network speed and the time-accurate nature of the method of lines; ii) a model-constrained approach to encode the neural network tangent slope with the underlying governing equations; iii) sequential learning strategies to promote long-time stability and accuracy; and iv) a data randomization approach to implicitly enforce the smoothness of the neural network tangent slope and its closeness to the true tangent slope up to second-order derivatives, in order to further enhance the stability and accuracy of mcTangent solutions. Rigorous results are provided to analyze and justify the proposed approach. Several numerical results for the transport equation, viscous Burgers equation, and Navier-Stokes equation are presented to study and demonstrate the robustness and long-time accuracy of the proposed mcTangent learning approach.
    OpenSRH: optimizing brain tumor surgery using intraoperative stimulated Raman histology. (arXiv:2206.08439v2 [eess.IV] UPDATED)
    Accurate intraoperative diagnosis is essential for providing safe and effective care during brain tumor surgery. Our standard-of-care diagnostic methods are time, resource, and labor intensive, which restricts access to optimal surgical treatments. To address these limitations, we propose an alternative workflow that combines stimulated Raman histology (SRH), a rapid optical imaging method, with deep learning-based automated interpretation of SRH images for intraoperative brain tumor diagnosis and real-time surgical decision support. Here, we present OpenSRH, the first public dataset of clinical SRH images from 300+ brain tumor patients and 1300+ unique whole slide optical images. OpenSRH contains data from the most common brain tumor diagnoses, full pathologic annotations, whole slide tumor segmentations, and raw and processed optical imaging data for end-to-end model development and validation. We provide a framework for patch-based whole slide SRH classification and inference using weak (i.e. patient-level) diagnostic labels. Finally, we benchmark two computer vision tasks: multiclass histologic brain tumor classification and patch-based contrastive representation learning. We hope OpenSRH will facilitate the clinical translation of rapid optical imaging and real-time ML-based surgical decision support in order to improve the access, safety, and efficacy of cancer surgery in the era of precision medicine. Dataset access, code, and benchmarks are available at opensrh.mlins.org.  ( 3 min )
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v2 [stat.ML] UPDATED)
    Graph prediction problems prevail in data analysis and machine learning. The inverse prediction problem, namely to infer input data from given output labels, is of emerging interest in various applications. In this work, we develop \textit{invertible graph neural network} (iGNN), a deep generative model to tackle the inverse prediction problem on graphs by casting it as a conditional generative task. The proposed model consists of an invertible sub-network that maps one-to-one from data to an intermediate encoded feature, which allows forward prediction by a linear classification sub-network as well as efficient generation from output labels via a parametric mixture model. The invertibility of the encoding sub-network is ensured by a Wasserstein-2 regularization which allows free-form layers in the residual blocks. The model is scalable to large graphs by a factorized parametric mixture model of the encoded feature and is computationally scalable by using GNN layers. The existence of invertible flow mapping is backed by theories of optimal transport and diffusion process, and we prove the expressiveness of graph convolution layers to approximate the theoretical flows of graph data. The proposed iGNN model is experimentally examined on synthetic data, including the example on large graphs, and the empirical advantage is also demonstrated on real-application datasets of solar ramping event data and traffic flow anomaly detection.
    One-shot Neural Backdoor Erasing via Adversarial Weight Masking. (arXiv:2207.04497v2 [cs.LG] UPDATED)
    Recent studies show that despite achieving high accuracy on a number of real-world applications, deep neural networks (DNNs) can be backdoored: by injecting triggered data samples into the training dataset, the adversary can mislead the trained model into classifying any test data to the target class as long as the trigger pattern is presented. To nullify such backdoor threats, various methods have been proposed. Particularly, a line of research aims to purify the potentially compromised model. However, one major limitation of this line of work is the requirement to access sufficient original training data: the purifying performance is a lot worse when the available training data is limited. In this work, we propose Adversarial Weight Masking (AWM), a novel method capable of erasing the neural backdoors even in the one-shot setting. The key idea behind our method is to formulate this into a min-max optimization problem: first, adversarially recover the trigger patterns and then (soft) mask the network weights that are sensitive to the recovered patterns. Comprehensive evaluations on several benchmark datasets suggest that AWM can largely improve the purifying effects over other state-of-the-art methods on various available training dataset sizes.  ( 2 min )
    Preventing Over-Smoothing for Hypergraph Neural Networks. (arXiv:2203.17159v2 [cs.LG] UPDATED)
    In recent years, hypergraph learning has attracted great attention due to its capacity in representing complex and high-order relationships. However, current neural network approaches designed for hypergraphs are mostly shallow, thus limiting their ability to extract information from high-order neighbors. In this paper, we show both theoretically and empirically, that the performance of hypergraph neural networks does not improve as the number of layers increases, which is known as the over-smoothing problem. To avoid this issue, we develop a new deep hypergraph convolutional network called Deep-HGCN, which can maintain the heterogeneity of node representation in deep layers. Specifically, we prove that a $k$-layer Deep-HGCN simulates a polynomial filter of order $k$ with arbitrary coefficients, which can relieve the problem of over-smoothing. Experimental results on various datasets demonstrate the superior performance of the proposed model compared to the state-of-the-art hypergraph learning approaches.  ( 2 min )
    Diffusion-based Generative Speech Source Separation. (arXiv:2210.17327v2 [eess.AS] UPDATED)
    We propose DiffSep, a new single channel source separation method based on score-matching of a stochastic differential equation (SDE). We craft a tailored continuous time diffusion-mixing process starting from the separated sources and converging to a Gaussian distribution centered on their mixture. This formulation lets us apply the machinery of score-based generative modelling. First, we train a neural network to approximate the score function of the marginal probabilities of the diffusion-mixing process. Then, we use it to solve the reverse time SDE that progressively separates the sources starting from their mixture. We propose a modified training strategy to handle model mismatch and source permutation ambiguity. Experiments on the WSJ0 2mix dataset demonstrate the potential of the method. Furthermore, the method is also suitable for speech enhancement and shows performance competitive with prior work on the VoiceBank-DEMAND dataset.  ( 2 min )
    Beyond Not-Forgetting: Continual Learning with Backward Knowledge Transfer. (arXiv:2211.00789v1 [cs.LG])
    By learning a sequence of tasks continually, an agent in continual learning (CL) can improve the learning performance of both a new task and `old' tasks by leveraging the forward knowledge transfer and the backward knowledge transfer, respectively. However, most existing CL methods focus on addressing catastrophic forgetting in neural networks by minimizing the modification of the learnt model for old tasks. This inevitably limits the backward knowledge transfer from the new task to the old tasks, because judicious model updates could possibly improve the learning performance of the old tasks as well. To tackle this problem, we first theoretically analyze the conditions under which updating the learnt model of old tasks could be beneficial for CL and also lead to backward knowledge transfer, based on the gradient projection onto the input subspaces of old tasks. Building on the theoretical analysis, we next develop a ContinUal learning method with Backward knowlEdge tRansfer (CUBER), for a fixed capacity neural network without data replay. In particular, CUBER first characterizes the task correlation to identify the positively correlated old tasks in a layer-wise manner, and then selectively modifies the learnt model of the old tasks when learning the new task. Experimental studies show that CUBER can even achieve positive backward knowledge transfer on several existing CL benchmarks for the first time without data replay, where the related baselines still suffer from catastrophic forgetting (negative backward knowledge transfer). The superior performance of CUBER on the backward knowledge transfer also leads to higher accuracy accordingly.
    Behavior Prior Representation learning for Offline Reinforcement Learning. (arXiv:2211.00863v1 [cs.LG])
    Offline reinforcement learning (RL) struggles in environments with rich and noisy inputs, where the agent only has access to a fixed dataset without environment interactions. Past works have proposed common workarounds based on the pre-training of state representations, followed by policy training. In this work, we introduce a simple, yet effective approach for learning state representations. Our method, Behavior Prior Representation (BPR), learns state representations with an easy-to-integrate objective based on behavior cloning of the dataset: we first learn a state representation by mimicking actions from the dataset, and then train a policy on top of the fixed representation, using any off-the-shelf Offline RL algorithm. Theoretically, we prove that BPR enjoys performance guarantees when integrated into algorithms that have either policy improvement guarantees (conservative algorithms) or produce lower bounds of the policy values (pessimistic algorithms). Empirically, we show that BPR combined with existing state-of-the-art Offline RL algorithms leads to significant improvements across several offline control benchmarks.
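    The two-stage recipe is easy to sketch. The snippet below (PyTorch, with hypothetical dimensions and random stand-in data) first learns an encoder by behavior cloning, then freezes it before offline RL policy training on top.

        # Stage 1: behavior cloning pre-training of the state representation.
        import torch
        import torch.nn as nn

        obs_dim, action_dim = 17, 6  # example sizes
        encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, 64))
        bc_head = nn.Linear(64, action_dim)
        opt = torch.optim.Adam(list(encoder.parameters()) + list(bc_head.parameters()))

        for _ in range(100):  # offline transitions; random stand-ins here
            obs, act = torch.randn(256, obs_dim), torch.randn(256, action_dim)
            loss = nn.functional.mse_loss(bc_head(encoder(obs)), act)
            opt.zero_grad(); loss.backward(); opt.step()

        # Stage 2: freeze the representation; train any off-the-shelf offline RL
        # algorithm on encoder(obs) instead of raw observations.
        for p in encoder.parameters():
            p.requires_grad = False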
    Monte Carlo Tree Descent for Black-Box Optimization. (arXiv:2211.00778v1 [cs.LG])
    The key to Black-Box Optimization is to efficiently search through input regions with potentially widely-varying numerical properties, to achieve low-regret descent and fast progress toward the optima. Monte Carlo Tree Search (MCTS) methods have recently been introduced to improve Bayesian optimization by computing better partitioning of the search space that balances exploration and exploitation. Extending this promising framework, we study how to further integrate sample-based descent for faster optimization. We design novel ways of expanding Monte Carlo search trees, with new descent methods at vertices that incorporate stochastic search and Gaussian Processes. We propose the corresponding rules for balancing progress and uncertainty, branch selection, tree expansion, and backpropagation. The designed search process puts more emphasis on sampling for faster descent and uses localized Gaussian Processes as auxiliary metrics for both exploitation and exploration. We show empirically that the proposed algorithms can outperform state-of-the-art methods on many challenging benchmark problems.
    Uncertainty Aware Trader-Company Method: Interpretable Stock Price Prediction Capturing Uncertainty. (arXiv:2210.17030v2 [q-fin.CP] UPDATED)
    Machine learning is an increasingly popular tool with some success in predicting stock prices. One promising method is the Trader-Company (TC) method, which takes into account the dynamism of the stock market and has both high predictive power and interpretability. Machine learning-based stock prediction methods, including the TC method, have concentrated on point prediction. However, point prediction in the absence of uncertainty estimates lacks credibility quantification and raises concerns about safety. The challenge in this paper is to construct an investment strategy that combines high predictive power with the ability to quantify uncertainty. We propose a novel approach called the Uncertainty Aware Trader-Company (UTC) method. The core idea of this approach is to combine the strengths of both frameworks by merging the TC method with probabilistic modeling, which provides probabilistic predictions and uncertainty estimates. We expect this to retain the predictive power and interpretability of the TC method while capturing the uncertainty. We theoretically prove that the proposed method estimates the posterior variance and does not introduce additional biases from the original TC method. We conduct a comprehensive evaluation of our approach based on synthetic and real market datasets. We confirm with synthetic data that the UTC method can detect situations where the uncertainty increases and the prediction is difficult. We also confirm that the UTC method can detect abrupt changes in data generating distributions. We demonstrate with real market data that the UTC method can achieve higher returns and lower risks than baselines.  ( 3 min )
    Region-of-Interest Based Neural Video Compression. (arXiv:2203.01978v2 [eess.IV] UPDATED)
    Humans do not perceive all parts of a scene with the same resolution, but rather focus on few regions of interest (ROIs). Traditional Object-Based codecs take advantage of this biological intuition, and are capable of non-uniform allocation of bits in favor of salient regions, at the expense of increased distortion in the remaining areas: such a strategy allows a boost in perceptual quality under low rate constraints. Recently, several neural codecs have been introduced for video compression, yet they operate uniformly over all spatial locations, lacking the capability of ROI-based processing. In this paper, we introduce two models for ROI-based neural video coding. First, we propose an implicit model that is fed with a binary ROI mask and is trained by de-emphasizing the distortion of the background. Secondly, we design an explicit latent scaling method that allows control over the quantization binwidth for different spatial regions of latent variables, conditioned on the ROI mask. By extensive experiments, we show that our methods outperform all our baselines in terms of Rate-Distortion (R-D) performance in the ROI. Moreover, they can generalize to different datasets and to any arbitrary ROI at inference time. Finally, they do not require expensive pixel-level annotations during training, as synthetic ROI masks can be used with little to no degradation in performance. To the best of our knowledge, our proposals are the first solutions that integrate ROI-based capabilities into neural video compression models.
    Port-Hamiltonian Neural Networks with State-Dependent Ports. (arXiv:2206.02660v3 [cs.LG] UPDATED)
    Hybrid machine learning based on Hamiltonian formulations has recently been successfully demonstrated for simple mechanical systems, both energy conserving and not energy conserving. We show that port-Hamiltonian neural network models can be used to learn external forces acting on a system. We argue that this property is particularly useful when the external forces are state dependent, in which case it is the port-Hamiltonian structure that facilitates the separation of internal and external forces. Numerical results are provided for a forced and damped mass-spring system and a tank system of higher complexity, and a symmetric fourth-order integration scheme is introduced for improved training on sparse and noisy data.  ( 2 min )
    RCD-SGD: Resource-Constrained Distributed SGD in Heterogeneous Environment via Submodular Partitioning. (arXiv:2211.00839v1 [cs.LG])
    The convergence of SGD based distributed training algorithms is tied to the data distribution across workers. Standard partitioning techniques try to achieve equal-sized partitions with per-class population distribution in proportion to the total dataset. Partitions having the same overall population size or even the same number of samples per class may still have Non-IID distribution in the feature space. In heterogeneous computing environments, when devices have different computing capabilities, even-sized partitions across devices can lead to the straggler problem in distributed SGD. We develop a framework for distributed SGD in heterogeneous environments based on a novel data partitioning algorithm involving submodular optimization. Our data partitioning algorithm explicitly accounts for resource heterogeneity across workers while achieving similar class-level feature distribution and maintaining class balance. Based on this algorithm, we develop a distributed SGD framework that can accelerate existing SOTA distributed training algorithms by up to 32%.
    Unsupervised Model Adaptation for Source-free Segmentation of Medical Images. (arXiv:2211.00807v1 [cs.CV])
    The recent prevalence of deep neural networks has led semantic segmentation networks to achieve human-level performance in the medical field when sufficient training data is provided. Such networks however fail to generalize when tasked with predicting semantic maps for out-of-distribution images, requiring model re-training on the new distributions. This expensive process necessitates expert knowledge in order to generate training labels. Distribution shifts can arise naturally in the medical field via the choice of imaging device, i.e. MRI or CT scanners. To combat the need for labeling images in a target domain after a model is successfully trained in a fully annotated \textit{source domain} with a different data distribution, unsupervised domain adaptation (UDA) can be used. Most UDA approaches ensure target generalization by creating a shared source/target latent feature space. This allows a source-trained classifier to maintain performance on the target domain. However, most UDA approaches require joint source and target data access, which may create privacy leaks with respect to patient information. We propose a UDA algorithm for medical image segmentation that does not require access to source data during adaptation and is thus capable of maintaining patient data privacy. We rely on an approximation of the source latent features at adaptation time, and create a joint source/target embedding space by minimizing a distributional distance metric based on optimal transport. We demonstrate our approach is competitive with recent UDA medical segmentation works even with the added privacy requisite.
    Interpretable estimation of the risk of heart failure hospitalization from a 30-second electrocardiogram. (arXiv:2211.00819v1 [cs.LG])
    Survival modeling in healthcare relies on explainable statistical models; yet, their underlying assumptions are often simplistic and, thus, unrealistic. Machine learning models can estimate more complex relationships and lead to more accurate predictions, but are non-interpretable. This study shows it is possible to estimate the risk of hospitalization for congestive heart failure from a 30-second single-lead electrocardiogram signal. Using a machine learning approach not only results in greater predictive power but also provides clinically meaningful interpretations. We train an eXtreme Gradient Boosting accelerated failure time model and exploit SHapley Additive exPlanations values to explain the effect of each feature on predictions. Our model achieved a concordance index of 0.828 and an area under the curve of 0.853 at one year and 0.858 at two years on a held-out test set of 6,573 patients. These results show that a rapid test based on an electrocardiogram could be crucial in targeting and treating high-risk individuals.
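    A hedged sketch of this kind of pipeline with the open-source XGBoost accelerated-failure-time objective and its built-in per-feature contributions; the data below are random placeholders and the hyperparameters are illustrative, not the paper's.

        # XGBoost AFT survival model on synthetic, possibly right-censored data.
        import numpy as np
        import xgboost as xgb

        X = np.random.randn(1000, 20)
        lower = np.random.exponential(365.0, size=1000)              # days to event
        upper = np.where(np.random.rand(1000) < 0.3, np.inf, lower)  # inf = censored

        dtrain = xgb.DMatrix(X)
        dtrain.set_float_info("label_lower_bound", lower)
        dtrain.set_float_info("label_upper_bound", upper)

        params = {"objective": "survival:aft", "eval_metric": "aft-nloglik",
                  "aft_loss_distribution": "normal", "aft_loss_distribution_scale": 1.0}
        model = xgb.train(params, dtrain, num_boost_round=200)

        # SHAP-style per-feature contributions, built into XGBoost.
        contribs = model.predict(dtrain, pred_contribs=True)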
    More Speaking or More Speakers?. (arXiv:2211.00854v1 [cs.LG])
    Self-training (ST) and self-supervised learning (SSL) methods have demonstrated strong improvements in automatic speech recognition (ASR). In spite of these advances, to the best of our knowledge, there is no analysis of how the composition of the labeled and unlabeled datasets used in these methods affects the results. In this work we aim to analyze the effect of the number of speakers in the training data on a recent SSL algorithm (wav2vec 2.0) and a recent ST algorithm (slimIPL). We perform a systematic analysis on both labeled and unlabeled data by varying the number of speakers while keeping the number of hours fixed, and vice versa. Our findings suggest that SSL requires a large amount of unlabeled data to produce high-accuracy results, while ST requires a sufficient number of speakers in the labeled data, especially in the low-resource regime. In this manner these two approaches improve supervised learning in different regimes of dataset composition.
    Operator Selection in Adaptive Large Neighborhood Search using Deep Reinforcement Learning. (arXiv:2211.00759v1 [cs.LG])
    Large Neighborhood Search (LNS) is a popular heuristic for solving combinatorial optimization problems. LNS iteratively explores the neighborhoods in solution spaces using destroy and repair operators. Determining the best operators for LNS to solve a problem at hand is a labor-intensive process. Hence, Adaptive Large Neighborhood Search (ALNS) has been proposed to adaptively select operators during the search process based on operator performances in the previous search iterations. Such an operator selection procedure is a heuristic, based on domain knowledge, which is ineffective with complex, large solution spaces. In this paper, we address the problem of selecting operators in each search iteration of ALNS as a sequential decision problem and propose a Deep Reinforcement Learning based method called Deep Reinforced Adaptive Large Neighborhood Search. As such, the proposed method aims to learn, based on the state of the search, which operator to select to obtain a high long-term reward, i.e., a good solution to the underlying optimization problem. The proposed method is evaluated on a time-dependent orienteering problem with stochastic weights and time windows. Results show that our approach effectively learns a strategy that adaptively selects operators for large neighborhood search, obtaining competitive results compared to a state-of-the-art machine learning approach while trained with much fewer observations on small-sized problem instances.
    An Information-Theoretic Approach for Estimating Scenario Generalization in Crowd Motion Prediction. (arXiv:2211.00817v1 [cs.LG])
    Learning-based approaches to modeling crowd motion have become increasingly successful but require training and evaluation on large datasets, coupled with complex model selection and parameter tuning. To circumvent this tremendously time-consuming process, we propose a novel scoring method, which characterizes generalization of models trained on source crowd scenarios and applied to target crowd scenarios using a training-free, model-agnostic Interaction + Diversity Quantification score, ISDQ. The Interaction component aims to characterize the difficulty of scenario domains, while the diversity of a scenario domain is captured in the Diversity score. Both scores can be computed in a computationally tractable manner. Our experimental results validate the efficacy of the proposed method on several simulated and real-world (source, target) generalization tasks, demonstrating its potential to select optimal domain pairs before training and testing a model.
    Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks. (arXiv:2211.00692v1 [cs.LG])
    In this paper, we study the OOD generalization of neural algorithmic reasoning tasks, where the goal is to learn an algorithm (e.g., sorting, breadth-first search, and depth-first search) from input-output pairs using deep neural networks. First, we argue that OOD generalization in this setting is significantly different than common OOD settings. For example, some phenomena in OOD generalization of image classifications such as \emph{accuracy on the line} are not observed here, and techniques such as data augmentation methods do not help as assumptions underlying many augmentation techniques are often violated. Second, we analyze the main challenges (e.g., input distribution shift, non-representative data generation, and uninformative validation metrics) of the current leading benchmark, i.e., CLRS \citep{deepmind2021clrs}, which contains 30 algorithmic reasoning tasks. We propose several solutions, including a simple-yet-effective fix to the input distribution shift and improved data generation. Finally, we propose an attention-based 2WL-graph neural network (GNN) processor which complements message-passing GNNs so their combination outperforms the state-of-the-art model by a 3% margin averaged over all algorithms. Our code is available at: \url{https://github.com/smahdavi4/clrs}.
    Multi-Agent Reinforcement Learning for Adaptive Mesh Refinement. (arXiv:2211.00801v1 [cs.LG])
    Adaptive mesh refinement (AMR) is necessary for efficient finite element simulations of complex physical phenomenon, as it allocates limited computational budget based on the need for higher or lower resolution, which varies over space and time. We present a novel formulation of AMR as a fully-cooperative Markov game, in which each element is an independent agent who makes refinement and de-refinement choices based on local information. We design a novel deep multi-agent reinforcement learning (MARL) algorithm called Value Decomposition Graph Network (VDGN), which solves the two core challenges that AMR poses for MARL: posthumous credit assignment due to agent creation and deletion, and unstructured observations due to the diversity of mesh geometries. For the first time, we show that MARL enables anticipatory refinement of regions that will encounter complex features at future times, thereby unlocking entirely new regions of the error-cost objective landscape that are inaccessible by traditional methods based on local error estimators. Comprehensive experiments show that VDGN policies significantly outperform error threshold-based policies in global error and cost metrics. We show that learned policies generalize to test problems with physical features, mesh geometries, and longer simulation times that were not seen in training. We also extend VDGN with multi-objective optimization capabilities to find the Pareto front of the tradeoff between cost and error.  ( 2 min )
    Semi-Supervised Domain Adaptation for Cross-Survey Galaxy Morphology Classification and Anomaly Detection. (arXiv:2211.00677v1 [astro-ph.GA])
    In the era of big astronomical surveys, our ability to leverage artificial intelligence algorithms simultaneously for multiple datasets will open new avenues for scientific discovery. Unfortunately, simply training a deep neural network on images from one data domain often leads to very poor performance on any other dataset. Here we develop a Universal Domain Adaptation method DeepAstroUDA, capable of performing semi-supervised domain alignment that can be applied to datasets with different types of class overlap. Extra classes can be present in any of the two datasets, and the method can even be used in the presence of unknown classes. For the first time, we demonstrate the successful use of domain adaptation on two very different observational datasets (from SDSS and DECaLS). We show that our method is capable of bridging the gap between two astronomical surveys, and also performs well for anomaly detection and clustering of unknown data in the unlabeled dataset. We apply our model to two examples of galaxy morphology classification tasks with anomaly detection: 1) classifying spiral and elliptical galaxies with detection of merging galaxies (three classes including one unknown anomaly class); 2) a more granular problem where the classes describe more detailed morphological properties of galaxies, with the detection of gravitational lenses (ten classes including one unknown anomaly class).  ( 3 min )
    MAgNET: A Graph U-Net Architecture for Mesh-Based Simulations. (arXiv:2211.00713v1 [cs.LG])
    Mesh-based approaches are fundamental to solving physics-based simulations; however, they require significant computational effort, especially for highly non-linear problems. Deep learning techniques accelerate physics-based simulations, but they fail to perform efficiently as the size and complexity of the problem increase. Hence in this work, we propose MAgNET: Multi-channel Aggregation Network, a novel geometric deep learning framework for performing supervised learning on mesh-based graph data. MAgNET is based on the proposed MAg (Multichannel Aggregation) operation which generalises the concept of multi-channel local operations in convolutional neural networks to arbitrary non-grid inputs. MAg can efficiently perform non-linear regression mapping for graph-structured data. MAg layers are interleaved with the proposed novel graph pooling operations to constitute a graph U-Net architecture that is robust, handles arbitrary complex meshes and scales efficiently with the size of the problem. Although not limited to a particular type of discretisation, we showcase the predictive capabilities of MAgNET for several non-linear finite element simulations.  ( 2 min )
    Reduce, Reuse, Recycle: Improving Training Efficiency with Distillation. (arXiv:2211.00683v1 [cs.LG])
    Methods for improving the efficiency of deep network training (i.e. the resources required to achieve a given level of model quality) are of immediate benefit to deep learning practitioners. Distillation is typically used to compress models or improve model quality, but it's unclear if distillation actually improves training efficiency. Can the quality improvements of distillation be converted into training speed-ups, or do they simply increase final model quality with no resource savings? We conducted a series of experiments to investigate whether and how distillation can be used to accelerate training using ResNet-50 trained on ImageNet and BERT trained on C4 with a masked language modeling objective and evaluated on GLUE, using common enterprise hardware (8x NVIDIA A100). We found that distillation can speed up training by up to 1.96x in ResNet-50 trained on ImageNet and up to 1.42x on BERT when evaluated on GLUE. Furthermore, distillation for BERT yields optimal results when it is only performed for the first 20-50% of training. We also observed that training with distillation is almost always more efficient than training without distillation, even when using the poorest-quality model as a teacher, in both ResNet-50 and BERT. Finally, we found that it's possible to gain the benefit of distilling from an ensemble of teacher models, which has O(n) runtime cost, by randomly sampling a single teacher from the pool of teacher models on each step, which only has an O(1) runtime cost. Taken together, these results show that distillation can substantially improve training efficiency in both image classification and language modeling, and that a few simple optimizations to distillation protocols can further enhance these efficiency improvements.  ( 3 min )
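    The O(1) ensemble trick is a one-line change to a standard distillation step: sample a single teacher per batch instead of querying the whole pool. The sketch below assumes generic PyTorch classifiers and a conventional temperature-scaled KD loss; all names and constants are illustrative.

        # Distillation step with a randomly sampled teacher from an ensemble.
        import random
        import torch
        import torch.nn.functional as F

        def distill_step(student, teachers, x, y, optimizer, T=4.0, alpha=0.5):
            teacher = random.choice(teachers)  # O(1) instead of averaging all n
            with torch.no_grad():
                t_logits = teacher(x)
            s_logits = student(x)
            kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                          F.softmax(t_logits / T, dim=1),
                          reduction="batchmean") * T * T
            ce = F.cross_entropy(s_logits, y)
            loss = alpha * kd + (1 - alpha) * ce
            optimizer.zero_grad(); loss.backward(); optimizer.step()
            return loss.item()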
    Farm-wide virtual load monitoring for offshore wind structures via Bayesian neural networks. (arXiv:2211.00642v1 [cs.LG])
    Offshore wind structures are subject to deterioration mechanisms throughout their operational lifetime. Even if the deterioration evolution of structural elements can be estimated through physics-based deterioration models, the uncertainties involved in the process hinder the selection of lifecycle management decisions. In this scenario, the collection of relevant information through an efficient monitoring system enables the reduction of uncertainties, ultimately driving more optimal lifecycle decisions. However, a full monitoring instrumentation implemented on all wind turbines in a farm might become unfeasible due to practical and economical constraints. Besides, certain load monitoring systems often become defective after a few years of marine environment exposure. Addressing the aforementioned concerns, a farm-wide virtual load monitoring scheme directed by a fleet-leader wind turbine offers an attractive solution. Fed with data retrieved from a fully instrumented wind turbine, a model can be trained and then deployed, thus yielding load predictions for non-fully monitored wind turbines, from which only standard data remain available. In this paper, we propose a virtual load monitoring framework formulated via Bayesian neural networks (BNNs) and we provide relevant implementation details needed for the construction, training, and deployment of BNN data-based virtual monitoring models. As opposed to their deterministic counterparts, BNNs intrinsically announce the uncertainties associated with generated load predictions and allow the detection of inaccurate load estimations generated for non-fully monitored wind turbines. The proposed virtual load monitoring is thoroughly tested through an experimental campaign in an operational offshore wind farm and the results demonstrate the effectiveness of BNN models for fleet-leader-based farm-wide virtual monitoring.  ( 3 min )
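    As an illustration of uncertainty-aware load prediction, the sketch below uses MC dropout as a lightweight stand-in for the paper's Bayesian neural networks (a deliberate simplification): dropout stays active at inference, and repeated forward passes yield a predictive mean and standard deviation.

        # MC-dropout regressor: mean load estimate plus an uncertainty band.
        import torch
        import torch.nn as nn

        class DropoutRegressor(nn.Module):
            def __init__(self, d_in, d_out, p=0.1):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                         nn.Dropout(p), nn.Linear(128, d_out))
            def forward(self, x):
                return self.net(x)

        def predict_with_uncertainty(model, x, n_samples=50):
            model.train()                       # keep dropout stochastic
            preds = torch.stack([model(x) for _ in range(n_samples)])
            return preds.mean(0), preds.std(0)  # load estimate and its uncertainty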
    A Federated Learning Scheme for Neuro-developmental Disorders: Multi-Aspect ASD Detection. (arXiv:2211.00643v1 [eess.IV])
    Autism Spectrum Disorder (ASD) is a neuro-developmental syndrome resulting from alterations in the embryological brain before birth. The disorder is marked by socially restricted and repetitive behavior, along with other distinctive behavioral traits, which can impair patients' social behavior and their overall interaction within their community. Moreover, medical research has shown that ASD also affects the facial characteristics of its patients, making the syndrome recognizable from distinctive signs within an individual's face. Motivated by this, we propose a novel privacy-preserving federated learning scheme to predict ASD in an individual based on their behavioral and facial features, merging both types of features via facial feature extraction while respecting patient data privacy. After training behavioral and facial image data on federated machine learning models, promising results are achieved: 70\% accuracy for the prediction of ASD from behavioral traits in a federated learning environment, and 62\% accuracy for the prediction of ASD given an image of the patient's face. We then evaluate both conventional and federated ML on the merged behavioral and facial data, achieving 65\% accuracy with a standard logistic regression model and 63\% with the federated learning model.  ( 2 min )
    Inferring school district learning modalities during the COVID-19 pandemic with a hidden Markov model. (arXiv:2211.00708v1 [cs.CY])
    In this study, learning modalities offered by public schools across the United States were investigated to track changes in the proportion of schools offering fully in-person, hybrid and fully remote learning over time. Learning modalities from 14,688 unique school districts from September 2020 to June 2021 were reported by Burbio, MCH Strategic Data, the American Enterprise Institute's Return to Learn Tracker and individual state dashboards. A model was needed to combine and deconflict these data to provide a more complete description of modalities nationwide. A hidden Markov model (HMM) was used to infer the most likely learning modality for each district on a weekly basis. This method yielded higher spatiotemporal coverage than any individual data source and higher agreement with three of the four data sources than any other single source. The model output revealed that the percentage of districts offering fully in-person learning rose from 40.3% in September 2020 to 54.7% in June of 2021 with increases across 45 states and in both urban and rural districts. This type of probabilistic model can serve as a tool for fusion of incomplete and contradictory data sources in support of public health surveillance and research efforts.  ( 3 min )
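    For readers unfamiliar with HMM decoding, a minimal Viterbi implementation is sketched below; the state labels and all probability inputs are hypothetical stand-ins for the fused data sources, not the study's fitted model.

        import numpy as np

        def viterbi(log_init, log_trans, log_emit):
            # log_init:  (K,)   log p(z_0)
            # log_trans: (K, K) log p(z_t | z_{t-1})
            # log_emit:  (T, K) log p(obs_t | z_t), combining all sources
            T, K = log_emit.shape
            delta = log_init + log_emit[0]
            backptr = np.zeros((T, K), dtype=int)
            for t in range(1, T):
                scores = delta[:, None] + log_trans   # (K prev, K next)
                backptr[t] = scores.argmax(axis=0)
                delta = scores.max(axis=0) + log_emit[t]
            path = np.zeros(T, dtype=int)
            path[-1] = delta.argmax()
            for t in range(T - 2, -1, -1):
                path[t] = backptr[t + 1, path[t + 1]]
            return path  # e.g. 0=in-person, 1=hybrid, 2=remote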
    On the Interaction Between Differential Privacy and Gradient Compression in Deep Learning. (arXiv:2211.00734v1 [cs.LG])
    While differential privacy and gradient compression are separately well-researched topics in machine learning, the study of the interaction between these two topics is still relatively new. We perform a detailed empirical study on how the Gaussian mechanism for differential privacy and gradient compression jointly impact test accuracy in deep learning. The existing literature on gradient compression mostly evaluates compression in the absence of differential privacy guarantees, and demonstrates that sufficiently high compression rates reduce accuracy. Similarly, the existing literature on differential privacy evaluates privacy mechanisms in the absence of compression, and demonstrates that sufficiently strong privacy guarantees reduce accuracy. In this work, we observe that, while gradient compression generally has a negative impact on test accuracy in non-private training, it can sometimes improve test accuracy in differentially private training. Specifically, we observe that when employing aggressive sparsification or rank reduction on the gradients, test accuracy is less affected by the Gaussian noise added for differential privacy. These observations are explained through an analysis of how differential privacy and compression affect the bias and variance of the estimated average gradient. We follow this study with a recommendation on how to improve test accuracy in the context of differentially private deep learning and gradient compression. We evaluate this proposal and find that it can reduce the negative impact of noise added by differential privacy mechanisms on test accuracy by up to 24.6%, and reduce the negative impact of gradient sparsification on test accuracy by up to 15.1%.  ( 3 min )
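    A minimal sketch of the two mechanisms being combined, assuming a single aggregated gradient (real DP-SGD clips per example before aggregation, and the hyperparameter names here are ours); note that sparsifying after noising is post-processing and so preserves the privacy guarantee:

        import torch

        def private_sparse_grad(grad, clip_norm=1.0, noise_mult=1.0, k_frac=0.05):
            g = grad.flatten()
            g = g * min(1.0, clip_norm / (g.norm().item() + 1e-12))  # L2 clipping
            g = g + torch.randn_like(g) * noise_mult * clip_norm     # Gaussian mechanism
            k = max(1, int(k_frac * g.numel()))
            idx = g.abs().topk(k).indices                            # top-k sparsification
            out = torch.zeros_like(g)
            out[idx] = g[idx]
            return out.view_as(grad)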
    TorchFL: A Performant Library for Bootstrapping Federated Learning Experiments. (arXiv:2211.00735v1 [cs.LG])
    With the increased legislation around data privacy, federated learning (FL) has emerged as a promising technique that allows clients (end users) to collaboratively train deep learning (DL) models without transferring and storing the data in a centralized, third-party server. Despite the theoretical success, FL is yet to be adopted in real-world systems due to the hardware, computing, and various infrastructure constraints presented by the edge and mobile devices of the clients. As a result, simulated datasets, models, and experiments are heavily used by the FL research community to validate their theories and findings. We introduce TorchFL, a performant library for (i) bootstrapping FL experiments, (ii) executing them using various hardware accelerators, (iii) profiling the performance, and (iv) logging the overall and agent-specific results on the go. Built bottom-up on PyTorch and Lightning, TorchFL provides ready-to-use abstractions for models, datasets, and FL algorithms, while allowing developers to customize them as and when required.  ( 2 min )
    VIINTER: View Interpolation with Implicit Neural Representations of Images. (arXiv:2211.00722v1 [cs.CV])
    We present VIINTER, a method for view interpolation by interpolating the implicit neural representation (INR) of the captured images. We leverage the learned code vector associated with each image and interpolate between these codes to achieve viewpoint transitions. We propose several techniques that significantly enhance the interpolation quality. VIINTER signifies a new way to achieve view interpolation without constructing 3D structure, estimating camera poses, or computing pixel correspondence. We validate the effectiveness of VIINTER on several multi-view scenes with different types of camera layout and scene composition. As the development of INR of images (as opposed to surface or volume) has centered around tasks like image fitting and super-resolution, with VIINTER, we show its capability for view interpolation and offer a promising outlook on using INR for image manipulation tasks.  ( 2 min )
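    The core mechanism, blending learned per-image codes and decoding with a shared INR, can be sketched as follows; plain linear interpolation is shown for simplicity (the paper proposes further techniques to enhance quality), and the `inr(coords, z)` signature is a hypothetical stand-in.

        import torch

        def interpolate_views(inr, code_a, code_b, coords, n_steps=8):
            # Render intermediate views by interpolating the two latent
            # codes and decoding each blend with the shared network.
            frames = []
            for t in torch.linspace(0.0, 1.0, n_steps):
                z = (1 - t) * code_a + t * code_b
                with torch.no_grad():
                    frames.append(inr(coords, z))  # maps (pixel coords, code) -> RGB
            return frames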
    Measuring Air Quality via Multimodal AI and Satellite Imagery. (arXiv:2211.00780v1 [cs.LG])
    Climate change is arguably the most important environmental problem Earth currently faces, and it affects all living species. Because air-quality monitoring stations are typically ground-based, their ability to detect pollutant distributions over wide areas is limited. Satellites, however, have the potential for studying the atmosphere at large: "Sentinel-5P", a recently launched satellite of the European Space Agency (ESA) Copernicus project, can measure a variety of pollutant information with publicly available data outputs. This paper seeks to create a multi-modal machine learning model for predicting air-quality metrics where monitoring stations do not exist. The inputs of this model include a fusion of ground measurements and satellite data, with the goal of highlighting pollutant distribution and motivating change in societal and industrial behaviors. A new dataset of European pollution monitoring station measurements is created with features including $\textit{altitude, population, etc.}$ from the ESA Copernicus project. This dataset is used to train a multi-modal ML model, Air Quality Network (AQNet), capable of fusing these various types of data sources to output predictions of various pollutants. These predictions are then aggregated to create an "air-quality index" that can be used to compare air quality over different regions. Three pollutants, NO$_2$, O$_3$, and PM$_{10}$, are predicted successfully by AQNet, and the network was found to be useful compared to a model using only satellite imagery. The addition of supporting data was also found to improve predictions. When testing the developed AQNet on out-of-sample data from the UK and Ireland, we obtain satisfactory estimates, though pollution metrics were on average overestimated by around 20\%.  ( 3 min )
    Comparison Of Adversarial And Non-Adversarial LSTM Music Generative Models. (arXiv:2211.00731v1 [cs.LG])
    Algorithmic music composition is a way of composing musical pieces with minimal to no human intervention. While recurrent neural networks are traditionally applied to many sequence-to-sequence prediction tasks, including successful implementations of music composition, their standard supervised learning approach based on input-to-output mapping leads to a lack of note variety. These models can therefore be seen as potentially unsuitable for tasks such as music generation. Generative adversarial networks learn the generative distribution of data and lead to varied samples. This work implements and compares adversarial and non-adversarial training of recurrent neural network music composers on MIDI data. The resulting music samples are evaluated by human listeners, and their preferences are recorded. The evaluation indicates that adversarial training produces more aesthetically pleasing music.  ( 2 min )
    Optical Channel Impulse Response-Based Localization Using An Artificial Neural Network. (arXiv:2211.00806v1 [cs.IT])
    Visible light positioning has the potential to yield sub-centimeter accuracy in indoor environments, yet conventional received signal strength (RSS)-based localization algorithms cannot achieve this because their performance degrades from optical multipath reflection. However, this part of the optical received signal is deterministic due to the often static and predictable nature of the optical wireless channel. In this paper, the performance of optical channel impulse response (OCIR)-based localization is studied using an artificial neural network (ANN) to map embedded features of the OCIR to the user equipment's location. Numerical results show that OCIR-based localization outperforms conventional RSS techniques by two orders of magnitude using only two photodetectors as anchor points. The ANN technique can take advantage of multipath features in a wide range of scenarios, from using only the DC value to relying on high-resolution time sampling that can result in sub-centimeter accuracy.  ( 2 min )
    Privacy Induces Robustness: Information-Computation Gaps and Sparse Mean Estimation. (arXiv:2211.00724v1 [stat.ML])
    We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.  ( 2 min )
    Forecasting Patient Flows with Pandemic Induced Concept Drift using Explainable Machine Learning. (arXiv:2211.00739v1 [cs.LG])
    Accurately forecasting patient arrivals at Urgent Care Clinics (UCCs) and Emergency Departments (EDs) is important for effective resourcing and patient care. However, correctly estimating patient flows is not straightforward since it depends on many drivers. The predictability of patient arrivals has recently been further complicated by the COVID-19 pandemic conditions and the resulting lockdowns. This study investigates how a suite of novel quasi-real-time variables, such as Google search terms, pedestrian traffic, the prevailing incidence levels of influenza, and the COVID-19 Alert Level indicators, can both generally improve forecasting models of patient flows and effectively adapt those models to the unfolding disruptions of pandemic conditions. This research also contributes to the body of work in this domain by employing tools from the eXplainable AI field to investigate the internal mechanics of the models more deeply than has previously been done. The Voting ensemble-based method combining machine learning and statistical techniques was the most reliable in our experiments. Our study showed that the prevailing COVID-19 Alert Level feature together with Google search terms and pedestrian traffic were effective at producing generalisable forecasts. The implications of this study are that proxy variables can effectively augment standard autoregressive features to ensure accurate forecasting of patient flows. The experiments showed that the proposed features are potentially effective model inputs for preserving forecast accuracies in the event of future pandemic outbreaks.  ( 3 min )
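    A hedged sketch of a voting ensemble over proxy features, using scikit-learn with synthetic stand-in data (the study's actual features, base models, and tuning are not reproduced here):

        import numpy as np
        from sklearn.ensemble import VotingRegressor, GradientBoostingRegressor
        from sklearn.linear_model import Ridge

        rng = np.random.default_rng(0)
        # Columns stand in for lagged arrivals plus proxies such as Google
        # search volumes, pedestrian counts, influenza incidence, alert level.
        X_train = rng.normal(size=(500, 6))
        y_train = X_train @ rng.normal(size=6) + rng.normal(size=500)

        model = VotingRegressor(estimators=[
            ("ridge", Ridge(alpha=1.0)),                           # statistical component
            ("gbm", GradientBoostingRegressor(n_estimators=300)),  # ML component
        ])
        model.fit(X_train, y_train)
        forecast = model.predict(rng.normal(size=(7, 6)))          # next week's arrivals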
    Learning Melanocytic Cell Masks from Adjacent Stained Tissue. (arXiv:2211.00646v1 [q-bio.QM])
    Melanoma is one of the most aggressive forms of skin cancer, causing a large proportion of skin cancer deaths. However, melanoma diagnoses by pathologists show low interrater reliability. As melanoma is a cancer of the melanocyte, there is a clear need to develop a melanocytic cell segmentation tool that is agnostic to pathologist variability and automates pixel-level annotation. Gigapixel-level pathologist labeling, however, is impractical. Herein, we propose a means to train deep neural networks for melanocytic cell segmentation from hematoxylin and eosin (H&E) stained slides using paired immunohistochemical (IHC) slides of adjacent tissue sections, achieving a mean IOU of 0.64 despite imperfect ground-truth labels.  ( 2 min )
    Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian. (arXiv:2211.00716v1 [cs.LG])
    Offline reinforcement learning (RL), which refers to decision-making from a previously-collected dataset of interactions, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as "enforcers of occupancy validity" rather than "promoters of conservatism."  ( 3 min )
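    The augmented Lagrangian method itself is classical; as a hedged, generic illustration (not the paper's offline RL algorithm), the sketch below solves min f(x) subject to c(x) = 0 by alternating primal gradient steps with multiplier updates.

        import numpy as np

        def augmented_lagrangian(f_grad, c, c_jac, x, lam, rho=10.0, outer=50):
            # L(x) = f(x) + lam^T c(x) + (rho/2) ||c(x)||^2
            for _ in range(outer):
                for _ in range(100):  # inner minimization by gradient descent
                    g = f_grad(x) + c_jac(x).T @ (lam + rho * c(x))
                    x = x - 1e-2 * g
                lam = lam + rho * c(x)  # dual ascent on the multipliers
            return x, lam

        # Toy usage: min ||x||^2 s.t. sum(x) = 1  (solution: x_i = 1/n).
        f_grad = lambda x: 2 * x
        c = lambda x: np.array([x.sum() - 1.0])
        c_jac = lambda x: np.ones((1, x.size))
        x, lam = augmented_lagrangian(f_grad, c, c_jac, np.zeros(5), np.zeros(1))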
    Concrete Score Matching: Generalized Score Matching for Discrete Data. (arXiv:2211.00802v1 [cs.LG])
    Representing probability distributions by the gradient of their density functions has proven effective in modeling a wide range of continuous data modalities. However, this representation is not applicable in discrete domains where the gradient is undefined. To this end, we propose an analogous score function called the "Concrete score", a generalization of the (Stein) score for discrete settings. Given a predefined neighborhood structure, the Concrete score of any input is defined by the rate of change of the probabilities with respect to local directional changes of the input. This formulation allows us to recover the (Stein) score in continuous domains when measuring such changes by the Euclidean distance, while using the Manhattan distance leads to our novel score function in discrete domains. Finally, we introduce a new framework to learn such scores from samples called Concrete Score Matching (CSM), and propose an efficient training objective to scale our approach to high dimensions. Empirically, we demonstrate the efficacy of CSM on density estimation tasks on a mixture of synthetic, tabular, and high-dimensional image datasets, and demonstrate that it performs favorably relative to existing baselines for modeling discrete data.  ( 2 min )
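    The definition is easy to state in code: for a pmf p and a neighborhood of x, the Concrete score collects the relative changes p(n)/p(x) - 1, mirroring grad p / p in the continuous case. A toy numpy sketch (variable names ours):

        import numpy as np

        def concrete_score(p, x, neighbors):
            # Relative rate of change of probability toward each neighbor.
            return np.array([p[n] / p[x] - 1.0 for n in neighbors])

        # On states {0, 1, 2} with Manhattan-distance-1 neighbors of x = 1:
        p = {0: 0.2, 1: 0.5, 2: 0.3}
        print(concrete_score(p, 1, neighbors=[0, 2]))  # [-0.6 -0.4]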
  • Open

    High-Resolution Peak Demand Estimation Using Generalized Additive Models and Deep Neural Networks. (arXiv:2203.03342v2 [cs.LG] UPDATED)
    This paper covers predicting high-resolution electricity peak demand features given lower-resolution data. This is a relevant setup as it answers whether limited higher-resolution monitoring helps to estimate future high-resolution peak loads when the high-resolution data is no longer available. That question is particularly interesting for network operators considering replacing high-resolution monitoring with predictive models for economic reasons. We propose models to predict half-hourly minima and maxima of high-resolution (every minute) electricity load data while model inputs are of a lower resolution (30 minutes). We combine predictions of generalized additive models (GAM) and deep artificial neural networks (DNN), which are popular in load forecasting. We extensively analyze the prediction models, including the importance of the input parameters, focusing on load, weather, and seasonal effects. The proposed method won a data competition organized by Western Power Distribution, a British distribution network operator. In addition, we provide a rigorous evaluation study that goes beyond the competition frame to analyze the models' robustness. The results show that the proposed methods are superior to the competition benchmark concerning the out-of-sample root mean squared error (RMSE). This holds for both the competition month and the supplementary evaluation study, which covers an additional eleven months. Overall, our proposed model combination reduces the out-of-sample RMSE by 57.4\% compared to the benchmark.  ( 3 min )
    Demand Prediction Using Machine Learning Methods and Stacked Generalization. (arXiv:2009.09756v2 [cs.LG] UPDATED)
    Supply and demand are two fundamental concepts of sellers and customers. Predicting demand accurately is critical for organizations' planning. In this paper, we propose a new approach for demand prediction on an e-commerce web site. The proposed model differs from earlier models in several ways. The business model used in the e-commerce web site, for which the model is implemented, includes many sellers that sell the same product at the same time at different prices, where the company operates a marketplace model. The demand prediction for such a model should consider the price of the same product sold by competing sellers along with the features of these sellers. In this study, we first applied different regression algorithms to a specific set of products from one department of one of the most popular online e-commerce companies in Turkey. We then used stacked generalization, also known as stacking ensemble learning, to predict demand. Finally, all the approaches are evaluated on a real-world data set obtained from the e-commerce company. The experimental results show that some of the machine learning methods produce almost as good results as the stacked generalization method.  ( 3 min )
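    A minimal scikit-learn sketch of stacked generalization on synthetic stand-in data (the paper's actual feature set and base learners are not reproduced):

        import numpy as np
        from sklearn.ensemble import StackingRegressor, RandomForestRegressor
        from sklearn.linear_model import Ridge
        from sklearn.neighbors import KNeighborsRegressor

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 8))  # e.g. own price, competitor prices, seller features
        y = X @ rng.normal(size=8) + rng.normal(size=1000)

        stack = StackingRegressor(
            estimators=[("rf", RandomForestRegressor(n_estimators=100)),
                        ("knn", KNeighborsRegressor())],
            final_estimator=Ridge(),  # meta-learner trained on base predictions
            cv=5,                     # out-of-fold predictions avoid leakage
        )
        stack.fit(X, y)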
    Fair Wrapping for Black-box Predictions. (arXiv:2201.12947v3 [stat.ML] UPDATED)
    We introduce a new family of techniques to post-process ("wrap") a black-box classifier in order to reduce its bias. Our technique builds on the recent analysis of improper loss functions whose optimization can correct any twist in prediction, unfairness being treated as a twist. In the post-processing, we learn a wrapper function which we define as an $\alpha$-tree, which modifies the prediction. We provide two generic boosting algorithms to learn $\alpha$-trees. We show that our modification has appealing properties in terms of composition of $\alpha$-trees, generalization, interpretability, and KL divergence between modified and original predictions. We exemplify the use of our technique in three fairness notions: conditional value-at-risk, equality of opportunity, and statistical parity; and provide experiments on several readily available datasets.  ( 2 min )
    Multi-model Ensemble Analysis with Neural Network Gaussian Processes. (arXiv:2202.04152v3 [stat.AP] UPDATED)
    Multi-model ensemble analysis integrates information from multiple climate models into a unified projection. However, existing integration approaches based on model averaging can dilute fine-scale spatial information and incur bias from rescaling low-resolution climate models. We propose a statistical approach, called NN-GPR, using Gaussian process regression (GPR) with an infinitely wide deep neural network based covariance function. NN-GPR requires no assumptions about the relationships between models, no interpolation to a common grid, no stationarity assumptions, and automatically downscales as part of its prediction algorithm. Model experiments show that NN-GPR can be highly skillful at surface temperature and precipitation forecasting by preserving geospatial signals at multiple scales and capturing inter-annual variability. Our projections particularly show improved accuracy and uncertainty quantification skill in regions of high variability, which allows us to cheaply assess tail behavior at a 0.44$^\circ$/50 km spatial resolution without a regional climate model (RCM). Evaluations on reanalysis data and SSP245 forced climate models show that NN-GPR produces similar, overall climatologies to the model ensemble while better capturing fine scale spatial patterns. Finally, we compare NN-GPR's regional predictions against two RCMs and show that NN-GPR can rival the performance of RCMs using only global model data as input.  ( 2 min )
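    The covariance function in question is the NNGP (arc-cosine) kernel of an infinitely wide ReLU network; a minimal numpy sketch of GP regression with it follows, with the depth and variance hyperparameters as our own illustrative choices rather than the paper's settings.

        import numpy as np

        def nngp_kernel(X1, X2, depth=3, sw2=1.6, sb2=0.1):
            # Layer-wise recursion for the covariance of an infinitely
            # wide fully connected ReLU network.
            d = X1.shape[1]
            Kxy = sb2 + sw2 * (X1 @ X2.T) / d
            Kxx = sb2 + sw2 * (X1 * X1).sum(1) / d
            Kyy = sb2 + sw2 * (X2 * X2).sum(1) / d
            for _ in range(depth):
                norm = np.sqrt(np.outer(Kxx, Kyy))
                theta = np.arccos(np.clip(Kxy / norm, -1.0, 1.0))
                Kxy = sb2 + sw2 / (2 * np.pi) * norm * (
                    np.sin(theta) + (np.pi - theta) * np.cos(theta))
                Kxx = sb2 + sw2 * Kxx / 2.0
                Kyy = sb2 + sw2 * Kyy / 2.0
            return Kxy

        def nngp_predict(Xtr, ytr, Xte, noise=1e-2):
            # Standard GP posterior mean under the NNGP kernel.
            Ktt = nngp_kernel(Xtr, Xtr)
            Kst = nngp_kernel(Xte, Xtr)
            return Kst @ np.linalg.solve(Ktt + noise * np.eye(len(Xtr)), ytr)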
    Minimax risk classifiers with 0-1 loss. (arXiv:2201.06487v4 [stat.ML] UPDATED)
    Supervised classification techniques use training samples to learn a classification rule with small expected 0-1 loss (error probability). Conventional methods enable tractable learning and provide out-of-sample generalization by using surrogate losses instead of the 0-1 loss and considering specific families of rules (hypothesis classes). This paper presents minimax risk classifiers (MRCs) that minimize the worst-case 0-1 loss over general classification rules and provide tight performance guarantees at learning. We show that MRCs are strongly universally consistent using feature mappings given by characteristic kernels. The paper also proposes efficient optimization techniques for MRC learning and shows that the methods presented can provide accurate classification together with tight performance guarantees in practice.  ( 2 min )
    Invertible Neural Networks for Graph Prediction. (arXiv:2206.01163v2 [stat.ML] UPDATED)
    Graph prediction problems prevail in data analysis and machine learning. The inverse prediction problem, namely to infer input data from given output labels, is of emerging interest in various applications. In this work, we develop \textit{invertible graph neural network} (iGNN), a deep generative model to tackle the inverse prediction problem on graphs by casting it as a conditional generative task. The proposed model consists of an invertible sub-network that maps one-to-one from data to an intermediate encoded feature, which allows forward prediction by a linear classification sub-network as well as efficient generation from output labels via a parametric mixture model. The invertibility of the encoding sub-network is ensured by a Wasserstein-2 regularization which allows free-form layers in the residual blocks. The model is scalable to large graphs by a factorized parametric mixture model of the encoded feature and is computationally scalable by using GNN layers. The existence of invertible flow mapping is backed by theories of optimal transport and diffusion process, and we prove the expressiveness of graph convolution layers to approximate the theoretical flows of graph data. The proposed iGNN model is experimentally examined on synthetic data, including the example on large graphs, and the empirical advantage is also demonstrated on real-application datasets of solar ramping event data and traffic flow anomaly detection.
    A Model-Constrained Tangent Slope Learning Approach for Dynamical Systems. (arXiv:2208.04995v2 [cs.LG] UPDATED)
    Real-time accurate solutions of large-scale complex dynamical systems are critically needed for control, optimization, uncertainty quantification, and decision-making in practical engineering and science applications, especially digital twin applications. This paper contributes in this direction a model-constrained tangent slope learning (mcTangent) approach. At the heart of mcTangent is the synergy of several desirable strategies: i) a tangent slope learning to take advantage of the neural network speed and the time-accurate nature of the method of lines; ii) a model-constrained approach to encode the neural network tangent slope with the underlying governing equations; iii) sequential learning strategies to promote long-time stability and accuracy; and iv) a data randomization approach to implicitly enforce the smoothness of the neural network tangent slope and its closeness to the true tangent slope up to second-order derivatives, in order to further enhance the stability and accuracy of mcTangent solutions. Rigorous results are provided to analyze and justify the proposed approach. Several numerical results for the transport equation, viscous Burgers equation, and Navier-Stokes equation are presented to study and demonstrate the robustness and long-time accuracy of the proposed mcTangent learning approach.  ( 2 min )
    Scalable Gaussian Process Hyperparameter Optimization via Coverage Regularization. (arXiv:2209.11280v2 [cs.LG] UPDATED)
    Gaussian processes (GPs) are Bayesian non-parametric models popular in a variety of applications due to their accuracy and native uncertainty quantification (UQ). Tuning GP hyperparameters is critical to ensure the validity of prediction accuracy and uncertainty; uniquely estimating multiple hyperparameters in, e.g., the Matérn kernel can also be a significant challenge. Moreover, training GPs on large-scale datasets is a highly active area of research: traditional maximum likelihood hyperparameter training requires quadratic memory to form the covariance matrix and has cubic training complexity. To address the scalable hyperparameter tuning problem, we present a novel algorithm which estimates the smoothness and length-scale parameters in the Matérn kernel in order to improve the robustness of the resulting prediction uncertainties. Using novel loss functions similar to those in conformal prediction algorithms, in the computational framework provided by the hyperparameter estimation algorithm MuyGPs, we achieve improved UQ over leave-one-out likelihood maximization while maintaining a high degree of scalability, as demonstrated in numerical experiments.  ( 2 min )
    A Simple and Optimal Policy Design with Safety against Heavy-tailed Risk for Stochastic Bandits. (arXiv:2206.02969v4 [stat.ML] UPDATED)
    We design new policies that ensure both worst-case optimality for expected regret and light-tailed risk for regret distribution in the stochastic multi-armed bandit problem. Recently, arXiv:2109.13595 showed that information-theoretically optimized bandit algorithms as well as standard UCB policies suffer from some serious heavy-tailed risk. Inspired by their results, we further show that heavy-tailed risk actually exists for all "instance-dependent consistent" policies. In particular, any policy that incurs an instance-dependent $O(\ln T)$ expected regret must incur a linear regret with probability $\Omega(\text{poly}(1/T))$. With the aim to ensure safety against such heavy-tailed risk, starting from the two-armed bandit setting, we provide a simple policy design that (i) has the worst-case optimality for the expected regret at order $\tilde O(\sqrt{T})$ and (ii) has the worst-case tail probability of incurring a linear regret decay at an optimal exponential rate $\exp(-\Omega(\sqrt{T}))$. Next, we improve the policy design and analysis to the general $K$-armed bandit setting. Specifically, the worst-case probability of incurring a regret larger than $x$ is upper bounded by $\exp(-\Omega(x/\sqrt{KT}))$. We also enhance the policy design to accommodate the "any-time" setting where $T$ is not known a priori. Numerical experiments are conducted to illustrate the theoretical findings. We conclude by extending our proposed policy design to the general stochastic linear bandit setting and obtaining a light-tailed regret bound. Our results reveal insights on the incompatibility between consistency and light-tailed risk, while indicating that worst-case optimality on expected regret and light-tailed risk on regret distribution are compatible.
    Bayesian sequential design of computer experiments to estimate reliable sets. (arXiv:2211.01008v1 [stat.ML])
    We consider an unknown multivariate function representing a system, such as a complex numerical simulator, taking both deterministic and uncertain inputs. Our objective is to estimate the set of deterministic inputs leading to outputs whose probability (with respect to the distribution of the uncertain inputs) of belonging to a given set is controlled by a given threshold. To solve this problem, we propose a Bayesian strategy based on the Stepwise Uncertainty Reduction (SUR) principle to sequentially choose the points at which the function should be evaluated to approximate the set of interest. We illustrate its performance and interest in several numerical experiments.  ( 2 min )
    An optimal control perspective on diffusion-based generative modeling. (arXiv:2211.01364v1 [cs.LG])
    We establish a connection between stochastic optimal control and generative models based on stochastic differential equations (SDEs), such as recently developed diffusion probabilistic models. In particular, we derive a Hamilton-Jacobi-Bellman equation that governs the evolution of the log-densities of the underlying SDE marginals. This perspective allows us to transfer methods from optimal control theory to generative modeling. First, we show that the evidence lower bound is a direct consequence of the well-known verification theorem from control theory. Further, we develop a novel diffusion-based method for sampling from unnormalized densities -- a problem frequently occurring in statistics and computational sciences.  ( 2 min )
    Properties of the Concrete distribution. (arXiv:2211.01306v1 [math.PR])
    We examine properties of the Concrete (or Gumbel-softmax) distribution on the simplex. Using the natural vector space structure of the simplex, the Concrete distribution can be regarded as a transformation of the uniform distribution through a reflection and a location-scale transformation. The Fisher information is computed, and the corresponding information metric is that of hyperbolic space. We give an explicit transformation of the parameters of the distribution to Poincar\'e half-space coordinates, which corresponds to an orthogonal parameterization, and the Fisher-Rao geodesic distance is computed.  ( 2 min )
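    Sampling from the Concrete distribution is compact enough to show in full; the sketch below uses the standard Gumbel-softmax construction (the temperature value is our own choice):

        import numpy as np

        def sample_concrete(logits, temperature=0.5, rng=None):
            # Perturb logits with Gumbel noise, then apply a tempered softmax;
            # lower temperatures concentrate mass near the simplex vertices.
            rng = rng or np.random.default_rng()
            z = (logits + rng.gumbel(size=logits.shape)) / temperature
            e = np.exp(z - z.max())  # numerically stable softmax
            return e / e.sum()

        print(sample_concrete(np.log(np.array([0.2, 0.3, 0.5]))))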
    An efficient algorithm for the $\ell_{p}$ norm based metric nearness problem. (arXiv:2211.01245v1 [math.OC])
    Given a dissimilarity matrix, the metric nearness problem is to find the nearest matrix of distances that satisfies the triangle inequalities. This problem has wide applications, such as sensor networks, image processing, and so on. But it is very challenging even to obtain a moderately accurate solution due to the $O(n^{3})$ metric constraints and the nonsmooth objective function, which is usually a weighted $\ell_{p}$ norm based distance. In this paper, we propose a delayed constraint generation method, with each subproblem solved by the semismooth Newton based proximal augmented Lagrangian method (PALM), for the metric nearness problem. Due to the high memory requirement for the storage of the matrix related to the metric constraints, we take advantage of the special structure of the matrix and do not need to store the corresponding constraint matrix. A pleasing aspect of our algorithm is that we can solve problems involving up to $10^{8}$ variables and $10^{13}$ constraints. Numerical experiments demonstrate the efficiency of our algorithm. In theory, firstly, under a mild condition, we establish a primal-dual error bound condition which is essential for the analysis of the local convergence rate of PALM. Secondly, we prove the equivalence between the dual nondegeneracy condition and the nonsingularity of the generalized Jacobian for the inner subproblem of PALM. Thirdly, when $q(\cdot)=\|\cdot\|_{1}$ or $\|\cdot\|_{\infty}$, without the strict complementarity condition, we also prove the equivalence between the dual nondegeneracy condition and the uniqueness of the primal solution.  ( 3 min )
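    To make the constraint structure concrete, the toy sketch below repairs triangle-inequality violations by spreading each violation evenly over the three edges involved; this naive O(n^3)-per-sweep loop only illustrates the constraints and is not the semismooth Newton based PALM solver proposed in the paper.

        import numpy as np

        def naive_triangle_repair(D, sweeps=50):
            D = D.astype(float).copy()
            n = D.shape[0]
            for _ in range(sweeps):
                for i in range(n):
                    for j in range(i + 1, n):
                        for k in range(n):
                            if k in (i, j):
                                continue
                            v = D[i, j] - D[i, k] - D[k, j]  # violation amount
                            if v > 0:  # restore d_ij <= d_ik + d_kj for this triple
                                D[i, j] -= v / 3
                                D[i, k] += v / 3
                                D[k, j] += v / 3
                                D[j, i] = D[i, j]
                                D[k, i] = D[i, k]
                                D[j, k] = D[k, j]
            return D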
    Fantasizing with Dual GPs in Bayesian Optimization and Active Learning. (arXiv:2211.01053v1 [cs.LG])
    Gaussian processes (GPs) are the main surrogate functions used for sequential modelling such as Bayesian Optimization and Active Learning. Their drawbacks are poor scaling with data and the need to run an optimization loop when using a non-Gaussian likelihood. In this paper, we focus on `fantasizing' batch acquisition functions that need the ability to condition on new fantasized data computationally efficiently. By using a sparse Dual GP parameterization, we gain linear scaling with batch size as well as one-step updates for non-Gaussian likelihoods, thus extending sparse models to greedy batch fantasizing acquisition functions.  ( 2 min )
    Nonparametric Hamiltonian Monte Carlo. (arXiv:2106.10238v2 [cs.LG] UPDATED)
    Probabilistic programming uses programs to express generative models whose posterior probability is then computed by built-in inference engines. A challenging goal is to develop general purpose inference algorithms that work out-of-the-box for arbitrary programs in a universal probabilistic programming language (PPL). The densities defined by such programs, which may use stochastic branching and recursion, are (in general) nonparametric, in the sense that they correspond to models on an infinite-dimensional parameter space. However standard inference algorithms, such as the Hamiltonian Monte Carlo (HMC) algorithm, target distributions with a fixed number of parameters. This paper introduces the Nonparametric Hamiltonian Monte Carlo (NP-HMC) algorithm which generalises HMC to nonparametric models. Inputs to NP-HMC are a new class of measurable functions called "tree representable", which serve as a language-independent representation of the density functions of probabilistic programs in a universal PPL. We provide a correctness proof of NP-HMC, and empirically demonstrate significant performance improvements over existing approaches on several nonparametric examples.  ( 2 min )
    Linear Embedding-based High-dimensional Batch Bayesian Optimization without Reconstruction Mappings. (arXiv:2211.00947v1 [stat.ML])
    The optimization of high-dimensional black-box functions is a challenging problem. When a low-dimensional linear embedding structure can be assumed, existing Bayesian optimization (BO) methods often transform the original problem into optimization in a low-dimensional space. They exploit the low-dimensional structure and reduce the computational burden. However, we reveal that this approach could be limited or inefficient in exploring the high-dimensional space mainly due to the biased reconstruction of the high-dimensional queries from the low-dimensional queries. In this paper, we investigate a simple alternative approach: tackling the problem in the original high-dimensional space using the information from the learned low-dimensional structure. We provide a theoretical analysis of the exploration ability. Furthermore, we show that our method is applicable to batch optimization problems with thousands of dimensions without any computational difficulty. We demonstrate the effectiveness of our method on high-dimensional benchmarks and a real-world function.  ( 2 min )
    POLICE: Provably Optimal Linear Constraint Enforcement for Deep Neural Networks. (arXiv:2211.01340v1 [cs.LG])
    Deep Neural Networks (DNNs) outshine alternative function approximators in many settings thanks to their modularity in composing any desired differentiable operator. The formed parametrized functional is then tuned to solve a task at hand from simple gradient descent. This modularity comes at the cost of making strict enforcement of constraints on DNNs, e.g. from a priori knowledge of the task, or from desired physical properties, an open challenge. In this paper we propose the first provable affine constraint enforcement method for DNNs that requires minimal changes to a given DNN's forward-pass, that is computationally friendly, and that leaves the optimization of the DNN's parameters unconstrained, i.e., standard gradient-based methods can be employed. Our method does not require any sampling and provably ensures that the DNN fulfills the affine constraint on a given input space's region at any point during training, and testing. We coin this method POLICE, standing for Provably Optimal LInear Constraint Enforcement.  ( 2 min )
    Machine Learning for Metasurfaces Design and Their Applications. (arXiv:2211.01296v1 [physics.app-ph])
    Metasurfaces (MTSs) are increasingly emerging as enabling technologies to meet the demands for multi-functional, small form-factor, efficient, reconfigurable, tunable, and low-cost radio-frequency (RF) components because of their ability to manipulate waves in a sub-wavelength thickness through modified boundary conditions. They enable the design of reconfigurable intelligent surfaces (RISs) for adaptable wireless channels and smart radio environments, wherein the inherently stochastic nature of the wireless environment is transformed into a programmable propagation channel. In particular, space-limited RF applications, such as communications and radar, that have strict radiation requirements are currently being investigated for potential RIS deployment. The RIS comprises sub-wavelength units or meta-atoms, which are independently controlled and whose geometry and material determine the spectral response of the RIS. Conventionally, designing an RIS to yield the desired EM response requires trial and error, iteratively investigating a large space of candidate geometries and materials through thousands of full-wave EM simulations. In this context, machine/deep learning (ML/DL) techniques are proving critical in reducing the computational cost and time of RIS inverse design. Instead of explicitly solving Maxwell's equations, DL models learn physics-based relationships through supervised training data. The ML/DL techniques also aid in RIS deployment for numerous wireless applications, which requires dealing with multiple channel links between the base station (BS) and the users. As a result, the BS and RIS beamformers require a joint design, wherein the RIS elements must be rapidly reconfigured. This chapter provides a synopsis of DL techniques for both inverse RIS design and RIS-assisted wireless systems.  ( 3 min )
    Nonparametric Involutive Markov Chain Monte Carlo. (arXiv:2211.01100v1 [cs.LG])
    A challenging problem in probabilistic programming is to develop inference algorithms that work for arbitrary programs in a universal probabilistic programming language (PPL). We present the nonparametric involutive Markov chain Monte Carlo (NP-iMCMC) algorithm as a method for constructing MCMC inference algorithms for nonparametric models expressible in universal PPLs. Building on the unifying involutive MCMC framework, and by providing a general procedure for driving state movement between dimensions, we show that NP-iMCMC can generalise numerous existing iMCMC algorithms to work on nonparametric models. We prove the correctness of the NP-iMCMC sampler. Our empirical study shows that the existing strengths of several iMCMC algorithms carry over to their nonparametric extensions. Applying our method to the recently proposed Nonparametric HMC, an instance of (Multiple Step) NP-iMCMC, we have constructed several nonparametric extensions (all of which are new) that exhibit significant performance improvements.  ( 2 min )
    mlr3spatiotempcv: Spatiotemporal resampling methods for machine learning in R. (arXiv:2110.12674v2 [stat.ML] UPDATED)
    Spatial and spatiotemporal machine-learning models require a suitable framework for their model assessment, model selection, and hyperparameter tuning, in order to avoid error estimation bias and over-fitting. This contribution reviews the state-of-the-art in spatial and spatiotemporal cross-validation, and introduces the {R} package {mlr3spatiotempcv} as an extension package of the machine-learning framework {mlr3}. Currently various {R} packages implementing different spatiotemporal partitioning strategies exist: {blockCV}, {CAST}, {skmeans} and {sperrorest}. The goal of {mlr3spatiotempcv} is to gather the available spatiotemporal resampling methods in {R} and make them available to users through a simple and common interface. This is made possible by integrating the package directly into the {mlr3} machine-learning framework, which already has support for generic non-spatiotemporal resampling methods such as random partitioning. One advantage is the use of a consistent nomenclature in an overarching machine-learning toolkit instead of a varying package-specific syntax, making it easier for users to choose from a variety of spatiotemporal resampling methods. This package avoids giving recommendations which method to use in practice as this decision depends on the predictive task at hand, the autocorrelation within the data, and the spatial structure of the sampling design or geographic objects being studied.  ( 2 min )
    Can we globally optimize cross-validation loss? Quasiconvexity in ridge regression. (arXiv:2107.09194v2 [stat.ML] UPDATED)
    Models like LASSO and ridge regression are extensively used in practice due to their interpretability, ease of use, and strong theoretical guarantees. Cross-validation (CV) is widely used for hyperparameter tuning in these models, but do practical optimization methods minimize the true out-of-sample loss? A recent line of research promises to show that the optimum of the CV loss matches the optimum of the out-of-sample loss (possibly after simple corrections). It remains to show how tractable it is to minimize the CV loss. In the present paper, we show that, in the case of ridge regression, the CV loss may fail to be quasiconvex and thus may have multiple local optima. We can guarantee that the CV loss is quasiconvex in at least one case: when the spectrum of the covariate matrix is nearly flat and the noise in the observed responses is not too high. More generally, we show that quasiconvexity status is independent of many properties of the observed data (response norm, covariate-matrix right singular vectors and singular-value scaling) and has a complex dependence on the few that remain. We empirically confirm our theory using simulated experiments.  ( 2 min )
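    The phenomenon is easy to probe empirically: ridge regression admits a closed-form leave-one-out CV loss via the hat matrix, so one can scan a grid of regularization strengths and look for multiple local minima. A minimal numpy sketch with synthetic data (all settings ours):

        import numpy as np

        def ridge_loocv(X, y, lam):
            # Closed-form LOO error: e_i = (y_i - yhat_i) / (1 - H_ii),
            # with hat matrix H = X (X^T X + lam I)^{-1} X^T.
            n, d = X.shape
            H = X @ np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)
            resid = (y - H @ y) / (1.0 - np.diag(H))
            return np.mean(resid ** 2)

        rng = np.random.default_rng(0)
        X = rng.normal(size=(60, 20))
        y = X[:, 0] + rng.normal(size=60)
        lams = np.logspace(-4, 4, 200)
        cv = np.array([ridge_loocv(X, y, l) for l in lams])
        # Scan `cv` for multiple local minima, i.e. failures of quasiconvexity.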
    Propensity score models are better when post-calibrated. (arXiv:2211.01221v1 [stat.ME])
    Theoretical guarantees for causal inference using propensity scores are partly based on the scores behaving like conditional probabilities. However, scores between zero and one, especially when outputted by flexible statistical estimators, do not necessarily behave like probabilities. We perform a simulation study to assess the error in estimating the average treatment effect before and after applying a simple and well-established post-processing method to calibrate the propensity scores. We find that post-calibration reduces the error in effect estimation for expressive uncalibrated statistical estimators, and that this improvement is not mediated by better balancing. The larger the initial lack of calibration, the larger the improvement in effect estimation, with the effect on already-calibrated estimators being very small. Given the improvement in effect estimation and the fact that post-calibration is computationally cheap, we recommend that it be adopted when modelling propensity scores with expressive models.  ( 2 min )
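    A hedged sketch of the kind of post-calibration meant here, using isotonic regression via scikit-learn on synthetic data (the paper's simulation design is not reproduced):

        import numpy as np
        from sklearn.calibration import CalibratedClassifierCV
        from sklearn.ensemble import GradientBoostingClassifier

        rng = np.random.default_rng(0)
        X = rng.normal(size=(2000, 5))
        t = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))  # treatment assignment

        raw = GradientBoostingClassifier().fit(X, t)         # flexible, possibly miscalibrated
        calibrated = CalibratedClassifierCV(GradientBoostingClassifier(),
                                            method="isotonic", cv=5).fit(X, t)
        e_hat = calibrated.predict_proba(X)[:, 1]            # calibrated propensity scores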
    The Neural Testbed: Evaluating Joint Predictions. (arXiv:2110.04629v4 [cs.LG] UPDATED)
    Predictive distributions quantify uncertainties ignored by point estimates. This paper introduces The Neural Testbed: an open-source benchmark for controlled and principled evaluation of agents that generate such predictions. Crucially, the testbed assesses agents not only on the quality of their marginal predictions per input, but also on their joint predictions across many inputs. We evaluate a range of agents using a simple neural network data generating process. Our results indicate that some popular Bayesian deep learning agents do not fare well with joint predictions, even when they can produce accurate marginal predictions. We also show that the quality of joint predictions drives performance in downstream decision tasks. We find these results are robust across a wide range of generative models, and highlight the practical importance of joint predictions to the community.  ( 2 min )
    Stability of clinical prediction models developed using statistical or machine learning methods. (arXiv:2211.01061v1 [stat.ME])
    Clinical prediction models estimate an individual's risk of a particular health outcome, conditional on their values of multiple predictors. A developed model is a consequence of the development dataset and the chosen model building strategy, including the sample size, number of predictors and analysis method (e.g., regression or machine learning). Here, we raise the concern that many models are developed using small datasets that lead to instability in the model and its predictions (estimated risks). We define four levels of model stability in estimated risks moving from the overall mean to the individual level. Then, through simulation and case studies of statistical and machine learning approaches, we show instability in a model's estimated risks is often considerable, and ultimately manifests itself as miscalibration of predictions in new data. Therefore, we recommend researchers should always examine instability at the model development stage and propose instability plots and measures to do so. This entails repeating the model building steps (those used in the development of the original prediction model) in each of multiple (e.g., 1000) bootstrap samples, to produce multiple bootstrap models, and then deriving (i) a prediction instability plot of bootstrap model predictions (y-axis) versus original model predictions (x-axis), (ii) a calibration instability plot showing calibration curves for the bootstrap models in the original sample; and (iii) the instability index, which is the mean absolute difference between individuals' original and bootstrap model predictions. A case study is used to illustrate how these instability assessments help reassure (or not) whether model predictions are likely to be reliable (or not), whilst also informing a model's critical appraisal (risk of bias rating), fairness assessment and further validation requirements.  ( 3 min )
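    The instability index lends itself to a short sketch: refit the full model-building pipeline on bootstrap resamples and average the absolute difference between bootstrap and original predictions. A minimal version with a logistic model on synthetic data (200 resamples here for brevity; the paper suggests, e.g., 1000):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(0)
        X = rng.normal(size=(300, 5))
        y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X[:, 0])))

        orig = LogisticRegression().fit(X, y).predict_proba(X)[:, 1]
        diffs = []
        for _ in range(200):                       # repeat the model building
            idx = rng.integers(0, len(y), len(y))  # bootstrap resample
            boot = LogisticRegression().fit(X[idx], y[idx]).predict_proba(X)[:, 1]
            diffs.append(np.abs(boot - orig))
        instability_index = np.mean(diffs)         # mean absolute difference
        # Plotting boot vs. orig predictions gives the prediction instability plot.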
    Instance-Dependent Generalization Bounds via Optimal Transport. (arXiv:2211.01258v1 [stat.ML])
    Existing generalization bounds fail to explain crucial factors that drive generalization of modern neural networks. Since such bounds often hold uniformly over all parameters, they suffer from over-parametrization, and fail to account for the fact that the set of parameters, considered during initialization and training, is much more restricted than the entire parameter space. As an alternative, we propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.  ( 2 min )
    Approximate Cross-Validation with Low-Rank Data in High Dimensions. (arXiv:2008.10547v2 [stat.ML] UPDATED)
    Many recent advances in machine learning are driven by a challenging trifecta: large data size $N$; high dimensions; and expensive algorithms. In this setting, cross-validation (CV) serves as an important tool for model assessment. Recent advances in approximate cross validation (ACV) provide accurate approximations to CV with only a single model fit, avoiding traditional CV's requirement for repeated runs of expensive algorithms. Unfortunately, these ACV methods can lose both speed and accuracy in high dimensions -- unless sparsity structure is present in the data. Fortunately, there is an alternative type of simplifying structure that is present in most data: approximate low rank (ALR). Guided by this observation, we develop a new algorithm for ACV that is fast and accurate in the presence of ALR data. Our first key insight is that the Hessian matrix -- whose inverse forms the computational bottleneck of existing ACV methods -- is ALR. We show that, despite our use of the \emph{inverse} Hessian, a low-rank approximation using the largest (rather than the smallest) matrix eigenvalues enables fast, reliable ACV. Our second key insight is that, in the presence of ALR data, error in existing ACV methods roughly grows with the (approximate, low) rank rather than with the (full, high) dimension. These insights allow us to prove theoretical guarantees on the quality of our proposed algorithm -- along with fast-to-compute upper bounds on its error. We demonstrate the speed and accuracy of our method, as well as the usefulness of our bounds, on a range of real and simulated data sets.  ( 3 min )
    Large deviations rates for stochastic gradient descent with strongly convex functions. (arXiv:2211.00969v1 [cs.LG])
    Recent works have shown that high-probability metrics with stochastic gradient descent (SGD) are informative and in some cases advantageous over the commonly adopted mean-square-error-based ones. In this work we provide a formal framework for the study of general high-probability bounds with SGD, based on the theory of large deviations. The framework allows for a generic (not necessarily bounded) gradient noise satisfying mild technical assumptions, allowing for the dependence of the noise distribution on the current iterate. Under the preceding assumptions, we find an upper large deviations bound for SGD with strongly convex functions. The corresponding rate function captures analytical dependence on the noise distribution and other problem parameters. This is in contrast with conventional mean-square error analysis that captures the noise dependence only through the variance and does not capture the effect of higher-order moments nor the interplay between the noise geometry and the shape of the cost function. We also derive exact large deviation rates for the case when the objective function is quadratic and show that the obtained function matches the one from the general upper bound, hence showing the tightness of the general upper bound. Numerical examples illustrate and corroborate the theoretical findings.  ( 2 min )
    A Short Tutorial on The Weisfeiler-Lehman Test And Its Variants. (arXiv:2201.07083v2 [stat.ML] UPDATED)
    Graph neural networks are designed to learn functions on graphs. Typically, the relevant target functions are invariant with respect to actions by permutations. Therefore the design of some graph neural network architectures has been inspired by graph-isomorphism algorithms. The classical Weisfeiler-Lehman algorithm (WL) -- a graph-isomorphism test based on color refinement -- became relevant to the study of graph neural networks. The WL test can be generalized to a hierarchy of higher-order tests, known as $k$-WL. This hierarchy has been used to characterize the expressive power of graph neural networks, and to inspire the design of graph neural network architectures. A few variants of the WL hierarchy appear in the literature. The goal of this short note is pedagogical and practical: We explain the differences between the WL and folklore-WL formulations, with pointers to existing discussions in the literature. We illuminate the differences between the formulations by visualizing an example.  ( 2 min )
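    For concreteness, the classical 1-WL color refinement fits in a few lines; the sketch below is a plain-Python illustration of the test the note discusses.

        def wl_refinement(adj, iters=3):
            # adj maps each node to its neighbor list; colors start uniform
            # and are refined by hashing (own color, sorted neighbor colors).
            colors = {v: 0 for v in adj}
            for _ in range(iters):
                sigs = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
                        for v in adj}
                palette = {s: i for i, s in enumerate(sorted(set(sigs.values())))}
                colors = {v: palette[sigs[v]] for v in adj}
            return colors

        # Two graphs whose final color histograms differ are non-isomorphic.
        path3 = {0: [1], 1: [0, 2], 2: [1]}
        print(wl_refinement(path3))  # {0: 0, 1: 1, 2: 0}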
    Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints. (arXiv:2211.01052v1 [cs.LG])
    Offline reinforcement learning (RL) learns policies entirely from static datasets, thereby avoiding the challenges associated with online data collection. Practical applications of offline RL will inevitably require learning from datasets where the variability of demonstrated behaviors changes non-uniformly across the state space. For example, at a red light, nearly all human drivers behave similarly by stopping, but when merging onto a highway, some drivers merge quickly, efficiently, and safely, while many hesitate or merge dangerously. Both theoretically and empirically, we show that typical offline RL methods, which are based on distribution constraints, fail to learn from data with such non-uniform variability, due to the requirement to stay close to the behavior policy to the same extent across the state space. Ideally, the learned policy should be free to choose per state how closely to follow the behavior policy to maximize long-term return, as long as the learned policy stays within the support of the behavior policy. To instantiate this principle, we reweight the data distribution in conservative Q-learning (CQL) to obtain an approximate support constraint formulation. The reweighted distribution is a mixture of the current policy and an additional policy trained to mine poor actions that are likely under the behavior policy. Our method, CQL (ReDS), is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.  ( 2 min )
    Bipartite Mixed Membership Distribution-Free Model. A novel model for community detection in overlapping bipartite weighted networks. (arXiv:2211.00912v1 [cs.SI])
Modeling and estimating mixed memberships for undirected, unweighted networks in which nodes can belong to multiple communities has been well studied in recent years. However, for the more general case of bipartite weighted networks -- in which nodes can belong to multiple communities, row nodes can differ from column nodes, and all elements of the adjacency matrix can take any finite real values -- to our knowledge there is no such model. To close this gap, this paper introduces a novel model, the Bipartite Mixed Membership Distribution-Free (BiMMDF) model. As a special case, bipartite signed networks with mixed memberships can also be generated from BiMMDF. Our model's advantage lies in allowing all elements of the adjacency matrix to be generated from any distribution, as long as the expected adjacency matrix has a block structure related to node memberships under BiMMDF. The proposed model can be viewed as an extension of many previous models, including the popular mixed membership stochastic blockmodels. An efficient algorithm with a theoretical guarantee of consistent estimation is applied to fit BiMMDF. In particular, for a standard bipartite weighted network with two row (and column) communities, separation conditions are obtained for making the algorithm's error rates small with high probability when adjacency matrices are generated from different distributions under BiMMDF. The differing behavior of these distributions with respect to the separation conditions is verified on extensive synthetic bipartite weighted networks generated under BiMMDF. Experiments on real-world directed weighted networks illustrate the advantage of the algorithm in studying highly mixed nodes and asymmetry between row and column communities.  ( 3 min )
    Optimal Conservative Offline RL with General Function Approximation via Augmented Lagrangian. (arXiv:2211.00716v1 [cs.LG])
Offline reinforcement learning (RL), which refers to decision-making from a previously-collected dataset of interactions, has received significant attention over the past years. Much effort has focused on improving offline RL practicality by addressing the prevalent issue of partial data coverage through various forms of conservative policy learning. While the majority of algorithms do not have finite-sample guarantees, several provable conservative offline RL algorithms are designed and analyzed within the single-policy concentrability framework that handles partial coverage. Yet, in the nonlinear function approximation setting where confidence intervals are difficult to obtain, existing provable algorithms suffer from computational intractability, prohibitively strong assumptions, and suboptimal statistical rates. In this paper, we leverage the marginalized importance sampling (MIS) formulation of RL and present the first set of offline RL algorithms that are statistically optimal and practical under general function approximation and single-policy concentrability, bypassing the need for uncertainty quantification. We identify that the key to successfully solving the sample-based approximation of the MIS problem is ensuring that certain occupancy validity constraints are nearly satisfied. We enforce these constraints by a novel application of the augmented Lagrangian method and prove the following result: with the MIS formulation, augmented Lagrangian is enough for statistically optimal offline RL. In stark contrast to prior algorithms that induce additional conservatism through methods such as behavior regularization, our approach provably eliminates this need and reinterprets regularizers as "enforcers of occupancy validity" rather than "promoters of conservatism."  ( 3 min )
    Privacy Induces Robustness: Information-Computation Gaps and Sparse Mean Estimation. (arXiv:2211.00724v1 [stat.ML])
    We establish a simple connection between robust and differentially-private algorithms: private mechanisms which perform well with very high probability are automatically robust in the sense that they retain accuracy even if a constant fraction of the samples they receive are adversarially corrupted. Since optimal mechanisms typically achieve these high success probabilities, our results imply that optimal private mechanisms for many basic statistics problems are robust. We investigate the consequences of this observation for both algorithms and computational complexity across different statistical problems. Assuming the Brennan-Bresler secret-leakage planted clique conjecture, we demonstrate a fundamental tradeoff between computational efficiency, privacy leakage, and success probability for sparse mean estimation. Private algorithms which match this tradeoff are not yet known -- we achieve that (up to polylogarithmic factors) in a polynomially-large range of parameters via the Sum-of-Squares method. To establish an information-computation gap for private sparse mean estimation, we also design new (exponential-time) mechanisms using fewer samples than efficient algorithms must use. Finally, we give evidence for privacy-induced information-computation gaps for several other statistics and learning problems, including PAC learning parity functions and estimation of the mean of a multivariate Gaussian.  ( 2 min )
    CascadeXML: Rethinking Transformers for End-to-end Multi-resolution Training in Extreme Multi-label Classification. (arXiv:2211.00640v1 [cs.LG])
    Extreme Multi-label Text Classification (XMC) involves learning a classifier that can assign an input with a subset of most relevant labels from millions of label choices. Recent approaches, such as XR-Transformer and LightXML, leverage a transformer instance to achieve state-of-the-art performance. However, in this process, these approaches need to make various trade-offs between performance and computational requirements. A major shortcoming, as compared to the Bi-LSTM based AttentionXML, is that they fail to keep separate feature representations for each resolution in a label tree. We thus propose CascadeXML, an end-to-end multi-resolution learning pipeline, which can harness the multi-layered architecture of a transformer model for attending to different label resolutions with separate feature representations. CascadeXML significantly outperforms all existing approaches with non-trivial gains obtained on benchmark datasets consisting of up to three million labels. Code for CascadeXML will be made publicly available at \url{https://github.com/xmc-aalto/cascadexml}.  ( 2 min )

  • Open

    [R] Code as Policies: Language Model Programs for Embodied Control - Robotics at Google 2022 - Robots that write their own code
Paper: https://arxiv.org/abs/2209.07753 GitHub: https://github.com/google-research/google-research/tree/master/code_as_policies Blog: https://code-as-policies.github.io/ Google AI Blog: https://ai.googleblog.com/2022/11/robots-that-write-their-own-code.html Abstract: Large language models (LLMs) trained on code completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example language commands (formatted as comments) fo…  ( 55 min )
    [P] Mixing Prompts (with Weights) in Stable Diffusion Models
Getting a text-to-image model to generate the right thing (e.g., represent multiple concepts) can be notoriously hard. One idea to improve this is to represent each concept using its own prompt, provide a weight for each concept (to control its effect), and use the combined prompts to seed generation. Each prompt can be fine-tuned or iterated on independently and then mixed. In the example below, we have two prompts (one on a leprechaun and another on Clint Eastwood) and apply a weight of 0.5 to each. [Example images: mixing prompt embeddings for prompts on leprechauns and Clint Eastwood, and for Clint Eastwood and an amphitheatre of sand.] It is also fairly easy to implement (based on the huggingface diffusers library):

    # apply each prompt's weight, then sum into one combined embedding
    # (if prompt_weights sums to 1, the sum is already the weighted mean,
    # so the original trailing mean/unsqueeze steps were redundant)
    for i in range(len(prompt_weights)):
        text_embeddings[i] = text_embeddings[i] * prompt_weights[i]
    text_embeddings = torch.sum(text_embeddings, dim=0, keepdim=True)

The concept is not entirely novel, but figuring out a good user experience is the more interesting part. It is implemented in the Peacasso library (based on huggingface diffusers): https://github.com/victordibia/peacasso submitted by /u/vykthur  ( 58 min )
    [D] What is the SOTA architecture for permutation-invariant approximation on graphs?
I'm interested in graph classification & almost-uniform subgraph sampling of bipartite graphs. It seems like all recent papers on combinatorial optimization via GNNs use architectures whose distinguishing power is bounded by the Weisfeiler-Lehman test, so these solutions are very limited for combinatorial decision and model-counting problems. (Example 1, Example 2, Example 3) Are there any scalable (1 million nodes) and training-efficient methods beyond the power of 4-WL? Hypergraph support is not required. Should I look toward SpeqNets and EnGCN? submitted by /u/iVoider  ( 58 min )
    [P] Implementation of MagicMix from ByteDance researchers, - New way to interpolate concepts with much more natural, geometric coherency (implemented with Stable Diffusion!)
Hi. Today I came across this interesting paper https://arxiv.org/abs/2210.16056 that proposes an interesting method to combine the semantics of text and image in the diffusion process. In short, it mixes "layout" with "content"; however, unlike style transfer, "...semantic mixing aims to fuse multiple semantics into one single object." I was surprised by the examples they showed, so I wanted to try it, but the code wasn't available. I've implemented the method myself, and I wanted to share it here! https://github.com/cloneofsimo/magicmix Layout of "realistic photo of a rabbit" with content of "tiger". I hope my implementation helps anyone reading the paper! Note: I'm not the author of the paper, and this is not an official implementation. submitted by /u/cloneofsimo  ( 57 min )
    [R] Sequence 2 Mat
Hello dear Redditors! I wanted to ask a pretty straightforward question: can anybody recommend some NN architectures that are good at transforming a long sequence/vector into a square matrix? I ask because I've lost track of all the new architectures and tricks, so I thought maybe someone could give me some inspiration :). Thanks in advance. submitted by /u/dry-leaf  ( 54 min )
    [D] What are the benefits of being a reviewer?
So I just got invited to serve as a reviewer for CVPR'23, but I am quite new to the field. I have only one accepted paper and one under review at top conferences, and I have never been a reviewer before. Because I understand that being a reviewer (especially for CVPR) is a huge responsibility, I would love to know what benefits I could gain from this experience. For those who have done it before, what makes you voluntarily want to be a reviewer again? submitted by /u/Signal-Mixture-4046  ( 60 min )
    [D] Graph neural networks
I have never used GNNs before. I was wondering whether, in addition to the data themselves, I need to give the network information about the connections among the data points (an adjacency matrix) as input. Thanks! submitted by /u/No_Captain_856  ( 57 min )
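In most GNN libraries the answer is yes: you pass the node features together with the connectivity. A minimal sketch with PyTorch Geometric, where the edge_index tensor plays the role of the adjacency matrix (all names and sizes here are illustrative):

    import torch
    from torch_geometric.data import Data
    from torch_geometric.nn import GCNConv

    # 4 nodes with 3 features each; edges as (source, target) index pairs
    x = torch.randn(4, 3)
    edge_index = torch.tensor([[0, 1, 1, 2],
                               [1, 0, 2, 3]])  # COO connectivity
    data = Data(x=x, edge_index=edge_index)

    conv = GCNConv(in_channels=3, out_channels=8)
    out = conv(data.x, data.edge_index)  # message passing uses the edges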
    [P] Stream and Upload Versioned Data
TL;DR We've built an open-source Python API that lets you stream data and get all the benefits of working with DVC-versioned data with the ease of use an API provides. --- Hi r/MachineLearning, I'm an ML Team Lead at DagsHub (https://www.dagshub.com/), and I wanted to share something cool that we've been working on. As you all know, DVC (dvc.org) is an open-source CLI tool that acts as an extension to Git for large-scale data version control. A while back we integrated DVC into the platform, providing a built-in DVC remote. We still faced a fundamental challenge in accessing and uploading data: data management is done at the version level rather than at file-level granularity, even though file-level access is exactly what we need for intense data work. Moreover, to append files to an existing dataset one must first pull the entire dataset, which can be a long and expensive process. Given that data is at the core of what we do, we felt the process had too much friction. That's why we built an open-source Python API that gives easy access to DVC-versioned files hosted on DagsHub. This means you get all the benefits of DVC and data versioning in an intuitive API. Why I think this is cool:
    - Stream data in batches - start training your model without waiting for the entire dataset to be downloaded.
    - Append data to DVC directories - this is a challenging process today, which requires pulling the entire dataset to add a single file. With Direct Data Access you can upload the data via the API and the computation is done automatically.
    - Best of both worlds - get the ease of use of an API with the versioning and management of a CLI (DVC).
Check out the official project: https://github.com/dagshub/client As always, we'd love to hear your thoughts and feedback! submitted by /u/RepresentativeCod613  ( 59 min )
    [R] Instance segmentation of paper fiber networks
    I've been trying to perform instance segmentation of 3D images (tomograms) of paper fiber networks for my master thesis, but with no good results. Here is an example of a 2D slice of such an image: https://preview.redd.it/v8ijpzxdhjx91.png?width=971&format=png&auto=webp&s=79e30e1bf80384630792693c7eefe171f264826d I've been trying with the model presented here: https://arxiv.org/pdf/1901.01034.pdf. The authors have used it to segment glass fiber networks, which are less complex than paper fiber networks. Any tips or comments are appreciated. For example, how well could one even expect to solve this problem? I believe that the biggest difficulty lies in acquiring good enough training data. Using the method mentioned above, one performs instance segmentation on smaller patches of an image, and then merge the results to get the segmentation of the entire image. Here is an 80x80 patch of the above image: https://preview.redd.it/07l3i3yghjx91.png?width=373&format=png&auto=webp&s=6daa3276bbfc5c6a38169f870ba08add9b294368 I have been manually annotating images like this (of dimensions 5x80x80), but even as a human it is very difficult to see what goes on in most of these images, making the annotations very inaccurate. Any tips/ideas on how to approach this issue? Thank you! submitted by /u/user11532 [link] [comments]  ( 58 min )
    [D] NeurIPS 2022 public reviews
The reviews for NeurIPS 2022 accepted papers & rejected papers that opted in are now public! Here is the link to the OpenReview page. submitted by /u/No_Nico  ( 56 min )
    [R] question about hand written OCR
Hello everyone, I'm working on a project that requires OCR for handwritten characters in Hebrew. Unfortunately, this tech does not currently exist (well, it does, but its accuracy is poor), and what does exist isn't open source, so I decided to make my own. I have a model that can recognize separate letters (it takes an image of a letter and predicts which letter it is), but for this model to work I need to feed it separate letters, which is not what you get in real-world images, so I need a segmentation step. I tried several algorithms that don't require complex learning, such as vertical projection histograms and OpenCV's findContours function, but these methods aren't reliable enough to actually use (I reached this conclusion by trying them). So my question is: is there any algorithm/process/library that can reliably segment handwritten letters? If you know of any, please provide a link or some lead... submitted by /u/nurigrf05  ( 59 min )
    [D] About the evaluation of the features extracted by an Autoencoder
Hello guys, I am currently working on a dimensionality-reduction task using a convolutional autoencoder. I am working on a 3D dataset, so the model is a 3D AE (with attention). In terms of the MSE metric it seems to work well (it can reconstruct the input quite well, and I am also trying to switch to a denoising task), but I would like to understand whether the extracted features are actually meaningful. I know that this depends on the downstream task, but I read in this paper about contractive AEs (https://icml.cc/2011/papers/455_icmlpaper.pdf) that the Frobenius norm of the Jacobian matrix of the encoder is strongly correlated with the test error of the downstream task. The problem is that I am having a hard time implementing this metric, since I am not using an MLP autoencoder and I am not using the sigmoid nonlinearity. Is it a coincidence that nobody talks about contractive autoencoders in the convolutional setting? Do you have any general advice for my objective (evaluating the quality of my features)? Thank you very much in advance. submitted by /u/Dear-Vehicle-3215  ( 58 min )
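On the implementation point: the Frobenius norm of the encoder Jacobian does not require an MLP or a sigmoid; automatic differentiation computes it for any architecture. A minimal sketch with a toy convolutional encoder (the encoder here is an illustrative stand-in, not the poster's model); averaging over a batch of inputs estimates the expected norm:

    import torch
    from torch.autograd.functional import jacobian

    # Toy convolutional encoder: (1, 1, 8, 8) image -> 16-dim latent
    encoder = torch.nn.Sequential(
        torch.nn.Conv2d(1, 4, 3, padding=1), torch.nn.ReLU(),
        torch.nn.Flatten(), torch.nn.Linear(4 * 8 * 8, 16),
    )
    x = torch.randn(1, 1, 8, 8)

    # Jacobian of the latent w.r.t. the input: shape (16, 1, 1, 8, 8)
    J = jacobian(lambda inp: encoder(inp).squeeze(0), x)
    frob = J.pow(2).sum().sqrt()  # Frobenius norm at this input
    print(frob)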
    [R] Is there any work being done on reduction of training weight vector size but not reducing computational overhead (eg pruning)?
The specific application is orbital cameras - networks can be trained on Earth and then sent to orbital FPGAs for use in image-recognition systems. Both the Earth-based training system and the orbital FPGA have plenty of computational power, so there is no real need for reduction there, but transmission bandwidth is incredibly limited. For context, I'm trying to find a PhD topic - I have a strong background in FPGAs, space-borne imaging systems, and comms, but I'm a machine learning noob (currently furiously trying to get my head around it). I may not be using the right terminology, but my searches haven't turned up anything. (It may also be the case that, for some information-theoretic reason, pruning is the optimal solution to this problem.) Any suggestions of papers, pointers in a direction, or any other related tidbits would be highly appreciated. submitted by /u/Moose_a_Lini  ( 58 min )
    [N] Adversarial Policies Beat Professional-Level Go AIs
Paper: https://arxiv.org/abs/2211.00241 Project Page: goattack.alignmentfund.org We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win rate against KataGo without search, and a >50% win rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See the project page for example games. submitted by /u/xutw21  ( 68 min )
    [D] [P] Bidirectional conditional text generation (generating text around another text)
I have a paired dataset of source text A and output text B. For each row of the dataset, A may occur at the front, middle, or end of B. I want to build a conditional text generation model that, given a key phrase (text A), generates context around it in both directions to form text B: text_b = generated_context_front + text_a + generated_context_end. One strategy I can think of is masking the context of text A, as in "[mask] text A [mask]", and filling the masks in to generate text B. Alternatively, maybe I could try text generation models like GPT, but isn't GPT only supposed to look in one direction, not both? Any ideas on what I can try? submitted by /u/pseudobookish  ( 62 min )
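One off-the-shelf way to realize the "[mask] text A [mask]" idea is T5's span-infilling format. This is only a sketch: the model choice and prompt shape are assumptions, and for real use you would fine-tune on the paired A/B data rendered in this format:

    from transformers import T5Tokenizer, T5ForConditionalGeneration

    tok = T5Tokenizer.from_pretrained("t5-base")
    model = T5ForConditionalGeneration.from_pretrained("t5-base")

    text_a = "the key phrase"
    # Sentinel tokens ask the model to fill the spans before and after A.
    ids = tok(f"<extra_id_0> {text_a} <extra_id_1>", return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=64)
    # Output interleaves <extra_id_*> markers with the generated spans.
    print(tok.decode(out[0], skip_special_tokens=False))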
  • Open

    MagicMirror.Photo - A web service where you can use your photos to create a character that looks like you (even too much)
By uploading a couple of photos, you get over 30 characters in different styles. The algorithms are trained on a lot of data and the result is beautiful. MagicMirror.Photo submitted by /u/WorthElderberry8640  ( 41 min )
    Definitely some new stuff that I came across, on embedded AI in databases
Save time on model creation and dataset cleansing. You can check out the demo I made: https://youtu.be/81Q2aQWuwDM submitted by /u/Klutzy_Accountant113  ( 40 min )
    Can AI be used to create unexpected connections from my digital notes and create new ones based on my interests?
submitted by /u/CrackerJackJack  ( 41 min )
    GPT-3 web app "Explainpaper" explains complex science in simple terms
submitted by /u/much_successes  ( 41 min )
    Extended submission deadline — EvoMUSART 2023 conference
Good news: The submission deadline of EvoMUSART 2023 has been extended to November 16th! 🙌 You still have time to submit your work to the 12th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART). If you work with Artificial Intelligence techniques applied to visual art, music, sound synthesis, architecture, video, poetry, design, or other creative tasks, don't miss the opportunity to submit your work to EvoMUSART. EvoMUSART 2023 will be held in Brno, Czech Republic, between 12 and 14 April 2023. 🇨🇿 For more information, visit the conference webpage: https://www.evostar.org/2023/evomusart/ submitted by /u/evomusart_conference  ( 41 min )
Neuromorphic Computing And The Future Of AI
submitted by /u/SamuelSmith1416  ( 40 min )
Sparkly Fantasy AI-Generated Girls ✨!
submitted by /u/ArtifulDream  ( 40 min )
I recently learned how to use interpolation to breathe some life into AI images (details inside)
submitted by /u/LorestForest  ( 41 min )
A comprehensive list of the recent most powerful AI advances, including meta-learning, language and generative models in Cutting-edge AI: October digest.
submitted by /u/SpaceDepix  ( 40 min )
Robotic AI for the warehouse gets the quick deployment treatment, available through RaaS
submitted by /u/robotpromoter  ( 40 min )
How to Generate your AI Avatar for Free Without Coding
submitted by /u/Alex-L  ( 40 min )
The Best AI Articles of October 2022! From decision trees to why AI will change the content industry...
submitted by /u/OnlyProggingForFun  ( 40 min )
    Generating pictures based on speech: Whisper🗣 & Stable Diffusion🎨
🔑 Note: For this tutorial I will use Google Colab as I do not have a computer with a GPU. You can use your local computer. Remember to use a GPU! First, we need to install the dependencies. We will install FFmpeg - a tool to record, convert, and stream audio and video.

    !apt update
    !apt install ffmpeg

Now I will install the necessary packages:

    !pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
    !pip install git+https://github.com/openai/whisper.git
    !pip install diffusers==0.2.4
    !pip install transformers scipy ftfy
    !pip install "ipywidgets>=7,<8"

🔑 Note: If you have any problems installing Whisper, go here. The next step is authenticating Stable Diffusion with Hugging Face.

    from google.colab import output
    from huggingface_hub import n…  ( 43 min )
    Machine learning facilitates “turbulence tracking” in fusion reactors
submitted by /u/qptbook  ( 40 min )
AI helps researchers design microneedle patches that restore hair in balding mice
submitted by /u/qptbook  ( 40 min )
Is there a website that could, using AI, generate words that give off the input's "vibes"?
for example, from "loud, striking, heavy", something like "storm". Thanks for any tips! submitted by /u/helpmeihatemyself  ( 40 min )
Meet DALL-E-Bot: An Artificial Intelligence (AI) Based Robotics System That Gives Web-Scale Diffusion Models An Embodiment To Realise The Scenes That They Imagine
submitted by /u/ai-lover  ( 41 min )
  • Open

    Academic research groups outside of the US?
I'm considering applying for PhDs and am looking for good (university-affiliated) academic research groups outside of the US that study RL. I am particularly interested in foundational concepts like intrinsic motivation, meta-RL, hierarchical RL, etc. - NOT robotics/robot learning/applied RL labs. Unfortunately a lot of prominent RL researchers have been poached by industrial research companies like DeepMind, so there aren't as many as there used to be, but the ones I know of already are:
    - RL Lab at McGill
    - Rich Sutton's lab at Alberta
    - DARK Lab at UCL
    - Whiteson Lab at Oxford
Does anyone know of any others? (Preferably in Canada/UK/EU) Thanks :) submitted by /u/GodIReallyHateYouTim  ( 53 min )
    Need help understanding fully PPO's Entropy.
Hello everyone! I'm developing an algorithm using PPO, and because I was struggling to tune entropy - or even to find out whether it is beneficial for me - I completely removed it from my loss function. But I think I'm getting stuck in local minima, so I'm trying to understand entropy properly and give it another go. For the time being I'm using an action std initialised to 1.0 and linearly decayed to 0.1 over many, many epochs. The reason I'm using such a high action std is that my problem may need to explore actions really far from the already known ones. My questions are:
    1. If I were to use entropy, should I keep my std constant or keep the schedule as it is now?
    2. Is there a way of tuning the entropy coefficient?
    3. Should I keep it constant, or gradually decay entropy to zero so that I get some "exploitation" (in a sense) with a small std after the main training is complete?
Any help is welcome! Thanks! submitted by /u/White_Sirilo  ( 52 min )
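For reference, here is how an entropy bonus typically enters a PPO-style loss; annealing ent_coef toward zero over training is the usual way to fade exploration into exploitation. This is a minimal, self-contained sketch with dummy tensors; every name and value is illustrative, not from the poster's code:

    import torch
    from torch.distributions import Normal

    # Dummy batch, just to make the sketch runnable
    mu, std = torch.zeros(32, 4), torch.ones(32, 4)
    actions = torch.randn(32, 4)
    adv = torch.randn(32)
    value_loss = torch.tensor(0.5)
    logp_old = Normal(mu, std).log_prob(actions).sum(-1).detach()
    eps, vf_coef, ent_coef = 0.2, 0.5, 0.01  # ent_coef is often annealed

    dist = Normal(mu, std)
    logp = dist.log_prob(actions).sum(-1)
    ratio = torch.exp(logp - logp_old)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    policy_loss = -torch.min(ratio * adv, clipped).mean()
    entropy = dist.entropy().sum(-1).mean()
    # Subtracting the entropy term rewards a wider (higher-std) policy
    loss = policy_loss + vf_coef * value_loss - ent_coef * entropy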
    What does the observation for a centralized critic look like?
What does the shape of the observation look like when the critic is centralized? For example, in the pistonball environment (https://pettingzoo.farama.org/environments/butterfly/pistonball/) there are 20 agents and the obs of each agent is (4, 64, 64). Is the observation for the centralized critic going to be (20, 80, 1280, 1280)? Thanks! submitted by /u/No_Possibility_7588  ( 52 min )
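One common convention (an assumption here, not something the pistonball docs prescribe) is to concatenate the per-agent observations, e.g. along the channel axis, rather than multiplying shapes together:

    import torch

    n_agents, obs_shape = 20, (4, 64, 64)
    obs = torch.randn(n_agents, *obs_shape)  # one observation per agent
    # Stack along channels: (1, 20*4, 64, 64), not (20, 80, 1280, 1280)
    central_obs = obs.reshape(1, n_agents * obs_shape[0], *obs_shape[1:])
    print(central_obs.shape)  # torch.Size([1, 80, 64, 64])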
    [Neat-Python] Neural network not learning in custom environment
Hello! For the past week or so I've been coding a game myself in Python. I am not a professional, but I got the game working! My problem is when I use NEAT-Python with the game. I will post the code (and config file!) on Pastebin so this Reddit post doesn't get huge. This is an endless-runner-type game in PyGame. The goal is to survive as long as possible. Rows of objects come down from the top of the screen. The objects can be walls, coins, or empty space. The player can collect the coin or go through an empty space, but if they touch a wall then it's game over. The AI gets a reward for surviving every timestep, as well as a bonus when it gets a coin. The neural network does not seem to learn. If I turn rendering off for the game and use parallel processing, it runs very fast, which should make the AI learn faster in terms of real time. Its fitness seems random (which is expected because the environment is not deterministic), but it doesn't seem to have an upward trend, even after hundreds of generations. Here is the code, and here is the config file. I used this script as a basis for the NEAT + parallel processing: link. Can anyone tell me what is wrong with my code and/or my config file that is causing the AI to not learn? Thanks in advance! submitted by /u/Unsightedmetal6  ( 54 min )
  • Open

Is it possible to use something like a Genetic Algorithm (or another derivative-free optimization algorithm) to optimize the parameters of a NN (i.e., to train it)?
Backpropagation is a gradient-based method, so instead of using it, I wonder if it's possible to use something like a GA, Particle Swarm, or any other algorithm of that kind. Also, is the loss of a NN, as a function of its parameters (weights and biases), convex? submitted by /u/Rare_Jellyfish_3679  ( 44 min )
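Yes - this is usually called neuroevolution (NEAT, CMA-ES, and plain GAs all fall in this family), and no, the loss of a NN as a function of its weights is generally non-convex, which is one reason gradient-free methods remain of interest. A minimal illustrative sketch of a GA evolving the flat weight vector of a tiny network (toy data, architecture, and hyperparameters are all assumptions):

    import numpy as np

    rng = np.random.default_rng(0)

    def forward(w, x):
        # Tiny 2-4-1 network; w is a flat vector of 16 parameters
        W1, b1, W2 = w[:8].reshape(2, 4), w[8:12], w[12:16].reshape(4, 1)
        return np.tanh(x @ W1 + b1) @ W2

    def fitness(w, x, y):
        return -np.mean((forward(w, x) - y) ** 2)  # negative MSE

    x = rng.normal(size=(64, 2))
    y = x[:, :1] * x[:, 1:2]                 # toy regression target
    pop = rng.normal(size=(50, 16))          # population of weight vectors
    for gen in range(200):
        scores = np.array([fitness(w, x, y) for w in pop])
        parents = pop[np.argsort(scores)[-10:]]    # selection: keep 10 fittest
        children = parents[rng.integers(0, 10, size=40)] \
            + 0.1 * rng.normal(size=(40, 16))      # mutation by Gaussian noise
        pop = np.vstack([parents, children])       # elitism + offspring

    print(max(fitness(w, x, y) for w in pop))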
    Output to Input - A question and discussion
I am a layman when it comes to neural networks, yet a curious one; I've watched and read a lot about them and seen several examples in action over recent years, and one thing stuck in my mind. I wish I were able to program in order to test it out, but since I can't and have no idea where to start, I thought to ask here. Since neural networks are, as the name implies, somehow inspired by the workings of a single neuron, I always wondered: why not make a "Neural Networks Integrated Brain" with several neural networks feeding their outputs into other neural networks' inputs? For example, you can train a neural network to be a text-to-text conversation partner, another to be a speech-to-text translator where it hears audio and translates it into text, and another to be a text-to-sp…  ( 41 min )
  • Open

    Uniform sampling from an ellipse
There is a simple way to randomly sample points from an ellipse, but it is not uniform. Assume your ellipse is parameterized by x = a cos t, y = b sin t, with t running from 0 to 2π. The naive approach would be to take uniform samples from t and stick them into the equations above. Rather than looking at random sampling, this […] Uniform sampling from an ellipse first appeared on John D. Cook.  ( 6 min )
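If the goal is uniformity with respect to arc length, one standard remedy (a sketch of the usual rejection-sampling approach; the post's own method may differ) is to accept each t in proportion to the local speed |r'(t)| = sqrt(a^2 sin^2 t + b^2 cos^2 t), whose maximum is a:

    import numpy as np

    rng = np.random.default_rng(0)
    a, b = 3.0, 1.0  # semi-axes, assuming a >= b (illustrative values)

    def speed(t):
        # |r'(t)| for r(t) = (a cos t, b sin t); maximum value is a
        return np.sqrt(a**2 * np.sin(t)**2 + b**2 * np.cos(t)**2)

    def sample_uniform_on_ellipse(n):
        pts = []
        while len(pts) < n:
            t = rng.uniform(0.0, 2.0 * np.pi)
            # Accept t in proportion to local arc length -> uniform samples
            if rng.uniform(0.0, a) < speed(t):
                pts.append((a * np.cos(t), b * np.sin(t)))
        return np.array(pts)

    print(sample_uniform_on_ellipse(5))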
    How to calculate length of an elliptic arc
This post will show how to find the length of a piece of an ellipse and explain what elliptic integrals have to do with ellipses. Assume we have an ellipse centered at the origin with semi-major axis a and semi-minor axis b. So a > b > 0, the longest diameter of the ellipse is 2a […] How to calculate length of an elliptic arc first appeared on John D. Cook.  ( 6 min )
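For reference, with the parameterization x = a sin t, y = b cos t (so that t = 0 sits at the end of the minor axis), the arc length from 0 to t reduces to an incomplete elliptic integral of the second kind:

\[
s(t) = \int_0^t \sqrt{a^2 \cos^2 u + b^2 \sin^2 u} \, du
     = a \int_0^t \sqrt{1 - e^2 \sin^2 u} \, du
     = a \, E(t, e),
\qquad e^2 = 1 - \frac{b^2}{a^2},
\]

and the full circumference is 4aE(e). (Conventions differ: some definitions of E, including scipy.special.ellipeinc, take the parameter m = e² rather than the modulus e.)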
    Python code to solve Kepler’s equation
    The previous post looked at solving Kepler’s equation using Newton’s method. The problem with using Newton’s method is that it may not converge when the eccentricity e is large unless you start very close to the solution. As discussed at the end of that post, John Machin came up with a clever way to start […] Python code to solve Kepler’s equation first appeared on John D. Cook.  ( 5 min )
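For concreteness, here is a minimal Newton iteration for Kepler's equation M = E - e*sin(E), with a naive starting guess rather than Machin's clever starting value from the post, so convergence for e close to 1 is not guaranteed:

    import numpy as np

    def solve_kepler(M, e, tol=1e-12, max_iter=50):
        """Solve M = E - e*sin(E) for the eccentric anomaly E."""
        E = M if e < 0.8 else np.pi  # naive starting guess
        for _ in range(max_iter):
            dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
            E -= dE
            if abs(dE) < tol:
                break
        return E

    print(solve_kepler(M=1.0, e=0.3))  # eccentric anomaly in radians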
  • Open

How AI and IoT are Leading to Better Gas Station Solutions?
AI in ecommerce: Importance and Use Cases
X-Ray Image Segmentation using U-Nets
A Deep Dive into ONNX & ONNX Runtime (Part 2)
A Deep Dive into ONNX & ONNX Runtime (Part 1)
RetinaFace-Face Detection Model
Your first Data Science Project
  • Open

    Improve data extraction and document processing with Amazon Textract
    Intelligent document processing (IDP) has seen widespread adoption across enterprise and government organizations. Gartner estimates the IDP market will grow more than 100% year over year, and is projected to reach $4.8 billion in 2022. IDP helps transform structured, semi-structured, and unstructured data from a variety of document formats into actionable information. Processing unstructured data […]  ( 6 min )
  • Open

    365 Data Science courses free until November 21
Sponsored Post. The unlimited access initiative presents a risk-free way to break into data science. The online educational platform 365 Data Science launches the #21DaysFREE campaign and provides 100% free unlimited access to all content for three weeks. From November 1 to 21, you can take courses from renowned instructors and earn […] The post 365 Data Science courses free until November 21 appeared first on Machine Learning Mastery.  ( 15 min )
  • Open

    Study urges caution when comparing neural networks to the brain
    Computing systems that appear to generate brain-like activity may be the result of researchers guiding them to a specific outcome.  ( 8 min )
  • Open

    Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition for Single and Multi-Person Video. (arXiv:2201.10439v3 [cs.CV] UPDATED)
Audio-visual automatic speech recognition (AV-ASR) extends speech recognition by introducing the video modality as an additional source of information. In this work, the information contained in the motion of the speaker's mouth is used to augment the audio features. The video modality is traditionally processed with a 3D convolutional neural network (e.g., a 3D version of VGG). Recently, image transformer networks (arXiv:2010.11929) demonstrated the ability to extract rich visual features for image classification tasks. Here, we propose to replace the 3D convolution with a video transformer to extract visual features. We train our baselines and the proposed model on a large-scale corpus of YouTube videos. The performance of our approach is evaluated on a labeled subset of YouTube videos as well as on the LRS3-TED public corpus. Our best video-only model obtains 31.4% WER on YTDEV18 and 17.0% on LRS3-TED, 10% and 15% relative improvements over our convolutional baseline. After fine-tuning our model, we achieve state-of-the-art audio-visual recognition performance on LRS3-TED (1.6% WER). In addition, in a series of experiments on multi-person AV-ASR, we obtained an average relative reduction of 2% over our convolutional video frontend.  ( 3 min )
    Position-Aware Subgraph Neural Networks with Data-Efficient Learning. (arXiv:2211.00572v1 [cs.LG])
Data-efficient learning on graphs (GEL) is essential in real-world applications. Existing GEL methods focus on learning useful representations for nodes, edges, or entire graphs with "small" labeled data. But the problem of data-efficient learning for subgraph prediction has not been explored. The challenges of this problem lie in the following aspects: 1) It is crucial for subgraphs to learn positional features to acquire structural information in the base graph in which they exist. Although the existing subgraph neural network method is capable of learning disentangled position encodings, the overall computational complexity is very high. 2) Prevailing graph augmentation methods for GEL, including rule-based, sample-based, adaptive, and automated methods, are not suitable for augmenting subgraphs because a subgraph contains fewer nodes but richer information such as position, neighbor, and structure. Subgraph augmentation is more susceptible to undesirable perturbations. 3) Only a small number of nodes in the base graph are contained in subgraphs, which leads to a potential "bias" problem: subgraph representation learning is dominated by these "hot" nodes, while the remaining nodes fail to be fully learned, which reduces the generalization ability of subgraph representation learning. In this paper, we aim to address the challenges above and propose a Position-Aware Data-Efficient Learning framework for subgraph neural networks called PADEL. Specifically, we propose a novel node position encoding method that is anchor-free, design a new generative subgraph augmentation method based on a diffused variational subgraph autoencoder, and propose exploratory and exploitable views for subgraph contrastive learning. Extensive experiment results on three real-world datasets show the superiority of our proposed method over state-of-the-art baselines.  ( 3 min )
    TE2Rules: Extracting Rule Lists from Tree Ensembles. (arXiv:2206.14359v3 [cs.LG] UPDATED)
    Tree Ensemble (TE) models (e.g. Gradient Boosted Trees and Random Forests) often provide higher prediction performance compared to single decision trees. However, TE models generally lack transparency and interpretability, as humans have difficulty understanding their decision logic. This paper presents a novel approach to convert a TE trained for a binary classification task, to a rule list (RL) that is a global equivalent to the TE and is comprehensible for a human. This RL captures all necessary and sufficient conditions for decision making by the TE. Experiments on benchmark datasets demonstrate that, compared to state-of-the-art methods, (i) predictions from the RL generated by TE2Rules have high fidelity with respect to the original TE, (ii) the RL from TE2Rules has high interpretability measured by the number and the length of the decision rules, (iii) the run-time of TE2Rules algorithm can be reduced significantly at the cost of a slightly lower fidelity, and (iv) the RL is a fast alternative to the state-of-the-art rule-based instance-level outcome explanation techniques.  ( 2 min )
    Explainable AI for engineering design: A unified approach of systems engineering and component-based deep learning. (arXiv:2108.13836v3 [cs.LG] UPDATED)
Data-driven models created by machine learning are gaining importance in all fields of design and engineering. They have high potential to assist decision-makers in creating novel artefacts with better performance and sustainability. However, limited generalization and the black-box nature of these models lead to limited explainability and reusability. These drawbacks pose significant barriers to adoption in engineering design. To overcome this situation, we propose a component-based approach to create partial component models by machine learning (ML). This component-based approach aligns deep learning with systems engineering (SE). Using the example of energy-efficient building design, we first demonstrate better generalization of the component-based method by analyzing prediction accuracy outside the training data. Especially for representative designs different in structure, we observe a much higher accuracy (R2 = 0.94) compared to conventional monolithic methods (R2 = 0.71). Second, we illustrate explainability by demonstrating how sensitivity information from SE and rules from low-depth decision trees serve engineering. Third, we evaluate explainability by qualitative and quantitative methods, demonstrating the match between preliminary knowledge and data-driven derived strategies, and show the correctness of activations at component interfaces compared to white-box simulation results (envelope components: R2 = 0.92..0.99; zones: R2 = 0.78..0.93). The key to component-based explainability is that activations at interfaces between the components are interpretable engineering quantities. In this way, the hierarchical component system forms a deep neural network (DNN) that a priori integrates information for engineering explainability. ...  ( 3 min )
    SCONE: Surface Coverage Optimization in Unknown Environments by Volumetric Integration. (arXiv:2208.10449v2 [cs.CV] UPDATED)
    Next Best View computation (NBV) is a long-standing problem in robotics, and consists in identifying the next most informative sensor position(s) for reconstructing a 3D object or scene efficiently and accurately. Like most current methods, we consider NBV prediction from a depth sensor like Lidar systems. Learning-based methods relying on a volumetric representation of the scene are suitable for path planning, but have lower accuracy than methods using a surface-based representation. However, the latter do not scale well with the size of the scene and constrain the camera to a small number of poses. To obtain the advantages of both representations, we show that we can maximize surface metrics by Monte Carlo integration over a volumetric representation. In particular, we propose an approach, SCONE, that relies on two neural modules: The first module predicts occupancy probability in the entire volume of the scene. Given any new camera pose, the second module samples points in the scene based on their occupancy probability and leverages a self-attention mechanism to predict the visibility of the samples. Finally, we integrate the visibility to evaluate the gain in surface coverage for the new camera pose. NBV is selected as the pose that maximizes the gain in total surface coverage. Our method scales to large scenes and handles free camera motion: It takes as input an arbitrarily large point cloud gathered by a depth sensor as well as camera poses to predict NBV. We demonstrate our approach on a novel dataset made of large and complex 3D scenes.  ( 3 min )
    Machine learning can guide experimental approaches for protein digestibility estimations. (arXiv:2211.00625v1 [q-bio.QM])
    Food protein digestibility and bioavailability are critical aspects in addressing human nutritional demands, particularly when seeking sustainable alternatives to animal-based proteins. In this study, we propose a machine learning approach to predict the true ileal digestibility coefficient of food items. The model makes use of a unique curated dataset that combines nutritional information from different foods with FASTA sequences of some of their protein families. We extracted the biochemical properties of the proteins and combined these properties with embeddings from a Transformer-based protein Language Model (pLM). In addition, we used SHAP to identify features that contribute most to the model prediction and provide interpretability. This first AI-based model for predicting food protein digestibility has an accuracy of 90% compared to existing experimental techniques. With this accuracy, our model can eliminate the need for lengthy in-vivo or in-vitro experiments, making the process of creating new foods faster, cheaper, and more ethical.  ( 2 min )
    TAE: A Semi-supervised Controllable Behavior-aware Trajectory Generator and Predictor. (arXiv:2203.01261v2 [cs.RO] UPDATED)
    Trajectory generation and prediction are two interwoven tasks that play important roles in planner evaluation and decision making for intelligent vehicles. Most existing methods focus on one of the two and are optimized to directly output the final generated/predicted trajectories, which only contain limited information for critical scenario augmentation and safe planning. In this work, we propose a novel behavior-aware Trajectory Autoencoder (TAE) that explicitly models drivers' behavior such as aggressiveness and intention in the latent space, using semi-supervised adversarial autoencoder and domain knowledge in transportation. Our model addresses trajectory generation and prediction in a unified architecture and benefits both tasks: the model can generate diverse, controllable and realistic trajectories to enhance planner optimization in safety-critical and long-tailed scenarios, and it can provide prediction of critical behavior in addition to the final trajectories for decision making. Experimental results demonstrate that our method achieves promising performance on both trajectory generation and prediction.  ( 2 min )
    GEO-BLEU: Similarity Measure for Geospatial Sequences. (arXiv:2112.07144v2 [cs.LG] UPDATED)
In recent geospatial research, the importance of modeling large-scale human mobility data and predicting trajectories is rising, in parallel with progress in text generation using large-scale corpora in natural language processing. Whereas there are already plenty of feasible approaches applicable to geospatial sequence modeling itself, there seems to be room for improvement in evaluation, specifically in measuring the similarity between generated and reference trajectories. In this work, we propose a novel similarity measure, GEO-BLEU, which can be especially useful in the context of geospatial sequence modeling and generation. As the name suggests, this work is based on BLEU, one of the most popular measures used in machine translation research, while introducing spatial proximity to the idea of n-grams. We compare this measure with an established baseline, dynamic time warping, applying it to actual generated geospatial sequences. Using crowdsourced annotated data on the similarity between geospatial sequences collected from over 12,000 cases, we quantitatively and qualitatively show the proposed method's superiority.  ( 2 min )
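For reference, standard BLEU scores a candidate sequence against references as

\[
\mathrm{BLEU} = \mathrm{BP} \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right),
\]

where $p_n$ are modified $n$-gram precisions, $w_n$ are weights, and BP is a brevity penalty; per the abstract, GEO-BLEU keeps this structure while softening exact $n$-gram matching with spatial proximity between geospatial tokens.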
    Robust Large-Margin Learning in Hyperbolic Space. (arXiv:2004.05465v3 [cs.LG] UPDATED)
    Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. We provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space.  ( 2 min )
    Event Tables for Efficient Experience Replay. (arXiv:2211.00576v1 [cs.LG])
    Experience replay (ER) is a crucial component of many deep reinforcement learning (RL) systems. However, uniform sampling from an ER buffer can lead to slow convergence and unstable asymptotic behaviors. This paper introduces Stratified Sampling from Event Tables (SSET), which partitions an ER buffer into Event Tables, each capturing important subsequences of optimal behavior. We prove a theoretical advantage over the traditional monolithic buffer approach and combine SSET with an existing prioritized sampling strategy to further improve learning speed and stability. Empirical results in challenging MiniGrid domains, benchmark RL environments, and a high-fidelity car racing simulator demonstrate the advantages and versatility of SSET over existing ER buffer sampling approaches.  ( 2 min )
    Fighting Selection Bias in Statistical Learning: Application to Visual Recognition from Biased Image Databases. (arXiv:2109.02357v2 [cs.CV] UPDATED)
    In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven performances on different population segments has highlighted the representativeness issues induced by a naive aggregation of the datasets. In this paper, we show how biasing models can remedy these problems. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition is that the supports of the biased distributions must partly overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach.  ( 2 min )
    Composite Feature Selection using Deep Ensembles. (arXiv:2211.00631v1 [cs.LG])
    In many real world problems, features do not act alone but in combination with each other. For example, in genomics, diseases might not be caused by any single mutation but require the presence of multiple mutations. Prior work on feature selection either seeks to identify individual features or can only determine relevant groups from a predefined set. We investigate the problem of discovering groups of predictive features without predefined grouping. To do so, we define predictive groups in terms of linear and non-linear interactions between features. We introduce a novel deep learning architecture that uses an ensemble of feature selection models to find predictive groups, without requiring candidate groups to be provided. The selected groups are sparse and exhibit minimum overlap. Furthermore, we propose a new metric to measure similarity between discovered groups and the ground truth. We demonstrate the utility of our model on multiple synthetic tasks and semi-synthetic chemistry datasets, where the ground truth structure is known, as well as an image dataset and a real-world cancer dataset.  ( 2 min )
    Learning utterance-level representations through token-level acoustic latents prediction for Expressive Speech Synthesis. (arXiv:2211.00523v1 [cs.SD])
This paper proposes an Expressive Speech Synthesis model that utilizes token-level latent prosodic variables in order to capture and control utterance-level attributes, such as character acting voice and speaking style. Current works aim to explicitly factorize such fine-grained and utterance-level speech attributes into different representations extracted by modules that operate at the corresponding level. We show that the fine-grained latent space also captures coarse-grained information, which becomes more evident as the dimension of the latent space increases in order to capture diverse prosodic representations. A trade-off therefore arises between the diversity of the token-level and utterance-level representations and their disentanglement. We alleviate this issue by first capturing rich speech attributes in a token-level latent space and then separately training a prior network that, given the input text, learns utterance-level representations in order to predict the phoneme-level posterior latents extracted during the previous step. Both qualitative and quantitative evaluations are used to demonstrate the effectiveness of the proposed approach. Audio samples are available on our demo page.  ( 2 min )
    Statistical Learning from Biased Training Samples. (arXiv:1906.12304v4 [stat.ML] UPDATED)
    With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many practical situations, the poor control of the data acquisition processes may naturally jeopardize the outputs of machine learning algorithms, and selection bias issues are now the subject of much attention in the literature. The present article investigates how to extend Empirical Risk Minimization, the principal paradigm in statistical learning, when training observations are generated from biased models, i.e., from distributions that are different from that in the test/prediction stage, and absolutely continuous with respect to the latter. Precisely, we show how to build a "nearly debiased" training statistical population from biased samples and the related biasing functions, following in the footsteps of the approach originally proposed in Vardi (1985). Furthermore, we study from a nonasymptotic perspective the performance of minimizers of an empirical version of the risk computed from the statistical population thus created. Remarkably, the learning rate achieved by this procedure is of the same order as that attained in absence of selection bias. Beyond the theoretical guarantees, we also present experimental results supporting the relevance of the algorithmic approach promoted in this paper.  ( 2 min )
    GLINKX: A Scalable Unified Framework For Homophilous and Heterophilous Graphs. (arXiv:2211.00550v1 [cs.LG])
    In graph learning, there have been two predominant inductive biases regarding graph-inspired architectures: On the one hand, higher-order interactions and message passing work well on homophilous graphs and are leveraged by GCNs and GATs. Such architectures, however, cannot easily scale to large real-world graphs. On the other hand, shallow (or node-level) models using ego features and adjacency embeddings work well in heterophilous graphs. In this work, we propose a novel scalable shallow method -- GLINKX -- that can work both on homophilous and heterophilous graphs. GLINKX leverages (i) novel monophilous label propagations, (ii) ego/node features, (iii) knowledge graph embeddings as positional embeddings, (iv) node-level training, and (v) low-dimensional message passing. Formally, we prove novel error bounds and justify the components of GLINKX. Experimentally, we show its effectiveness on several homophilous and heterophilous datasets.  ( 2 min )
    Law Informs Code: A Legal Informatics Approach to Aligning Artificial Intelligence with Humans. (arXiv:2209.13020v6 [cs.CY] UPDATED)
    We are currently unable to specify human goals and societal values in a way that reliably directs AI behavior. Law-making and legal interpretation form a computational engine that converts opaque human values into legible directives. "Law Informs Code" is the research agenda capturing complex computational legal processes, and embedding them in AI. Similar to how parties to a legal contract cannot foresee every potential contingency of their future relationship, and legislators cannot predict all the circumstances under which their proposed bills will be applied, we cannot ex ante specify rules that provably direct good AI behavior. Legal theory and practice have developed arrays of tools to address these specification problems. For instance, legal standards allow humans to develop shared understandings and adapt them to novel situations. In contrast to more prosaic uses of the law (e.g., as a deterrent of bad behavior through the threat of sanction), leveraged as an expression of how humans communicate their goals, and what society values, Law Informs Code. We describe how data generated by legal processes (methods of law-making, statutory interpretation, contract drafting, applications of legal standards, legal reasoning, etc.) can facilitate the robust specification of inherently vague human goals. This increases human-AI alignment and the local usefulness of AI. Toward society-AI alignment, we present a framework for understanding law as the applied philosophy of multi-agent alignment. Although law is partly a reflection of historically contingent political power - and thus not a perfect aggregation of citizen preferences - if properly parsed, its distillation offers the most legitimate computational comprehension of societal values available. If law eventually informs powerful AI, engaging in the deliberative political process to improve law takes on even more meaning.  ( 3 min )
    AliCG: Fine-grained and Evolvable Conceptual Graph Construction for Semantic Search at Alibaba. (arXiv:2106.01686v2 [cs.AI] CROSS LISTED)
Conceptual graphs, a particular type of knowledge graph, play an essential role in semantic search. Prior conceptual graph construction approaches typically extract high-frequency, coarse-grained, and time-invariant concepts from formal texts. In real applications, however, it is necessary to extract less-frequent, fine-grained, and time-varying conceptual knowledge and build the taxonomy in an evolving manner. In this paper, we introduce an approach to implementing and deploying the conceptual graph at Alibaba. Specifically, we propose a framework called AliCG which is capable of (a) extracting fine-grained concepts by a novel bootstrapping-with-alignment-consensus approach, (b) mining long-tail concepts with a novel low-resource phrase mining approach, and (c) updating the graph dynamically via a concept distribution estimation method based on implicit and explicit user behaviors. We have deployed the framework at Alibaba UC Browser. Extensive offline evaluation as well as online A/B testing demonstrate the efficacy of our approach.  ( 2 min )
    Generating Gender-Ambiguous Text-to-Speech Voices. (arXiv:2211.00375v1 [cs.SD])
    The gender of a voice assistant or any voice user interface is a central element of its perceived identity. While a female voice is a common choice, there is an increasing interest in alternative approaches where the gender is ambiguous rather than clearly identifying as female or male. This work addresses the task of generating gender-ambiguous text-to-speech (TTS) voices that do not correspond to any existing person. This is accomplished by sampling from a latent speaker embedding space that was formed while training a multilingual, multi-speaker TTS system on data from multiple male and female speakers. Various options are investigated regarding the sampling process. In our experiments, the effects of different sampling choices on the gender ambiguity and the naturalness of the resulting voices are evaluated. The proposed method is shown to be able to efficiently generate novel speakers that are superior to a baseline averaged speaker embedding. To our knowledge, this is the first systematic approach that can reliably generate a range of gender-ambiguous voices to meet diverse user requirements.  ( 2 min )
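    As a concrete illustration of the sampling idea, the sketch below draws candidate speaker embeddings around the midpoint of the male and female embedding centroids; the input arrays, the scale factor, and the Gaussian perturbation are illustrative assumptions, not the paper's exact procedure.

        import numpy as np

        def sample_ambiguous(emb_male, emb_female, n_samples=10, scale=0.5, seed=0):
            # emb_male, emb_female: (n_speakers, d) embeddings taken from a
            # trained multi-speaker TTS model (hypothetical inputs)
            rng = np.random.default_rng(seed)
            all_emb = np.vstack([emb_male, emb_female])
            # midpoint between the gendered centroids: a plausibly ambiguous region
            mid = 0.5 * (emb_male.mean(axis=0) + emb_female.mean(axis=0))
            # perturb with reduced within-cluster variance to stay near the midpoint
            std = all_emb.std(axis=0)
            return mid + scale * std * rng.standard_normal((n_samples, mid.size))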
    Contrastive Demonstration Tuning for Pre-trained Language Models. (arXiv:2204.04392v3 [cs.CL] CROSS LISTED)
    Pretrained language models can be effectively stimulated by textual prompts or demonstrations, especially in low-data scenarios. Recent works have focused on automatically searching discrete or continuous prompts or optimized verbalizers, yet studies of demonstrations remain limited. In particular, demonstration examples are crucial for the final performance of prompt-tuning. In this paper, we propose a novel pluggable, extensible, and efficient approach named contrastive demonstration tuning, which is free of demonstration sampling. Furthermore, the proposed approach can be: (i) plugged into any previous prompt-tuning approaches; (ii) extended to widespread classification tasks with a large number of categories. Experimental results on 16 datasets illustrate that our method, integrated with the previous approaches LM-BFF and P-tuning, can yield better performance. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/Demo-Tuning.
    A Typology to Explore the Mitigation of Shortcut Behavior. (arXiv:2203.03668v2 [cs.LG] UPDATED)
    As machine learning models become increasingly large and are trained with weak supervision on large, possibly uncurated data sets, it becomes increasingly important to establish mechanisms for inspecting, interacting with, and revising models to mitigate learning shortcuts and guarantee that their learned knowledge is aligned with human knowledge. The recently proposed XIL framework was developed for this purpose, and several such methods have been introduced, each with individual motivations and methodological details. In this work, we provide a unification of various XIL methods into a single typology by establishing a common set of basic modules. In doing so, we pave the way for a principled comparison of existing, but, importantly, also future XIL approaches. In addition, we discuss existing measures and introduce novel measures and benchmarks for evaluating the overall abilities of an XIL method. Given this extensive toolbox, including our typology, measures, and benchmarks, we finally compare several recent XIL methods methodologically and quantitatively. In our evaluations, all methods prove able to revise a model successfully. However, we found remarkable differences on individual benchmark tasks, revealing valuable application-relevant aspects for integrating these benchmarks when developing future methods.
    Stretching Sentence-pair NLI Models to Reason over Long Documents and Clusters. (arXiv:2204.07447v2 [cs.CL] UPDATED)
    Natural Language Inference (NLI) has been extensively studied by the NLP community as a framework for estimating the semantic relation between sentence pairs. While early work identified certain biases in NLI models, recent advancements in modeling and datasets demonstrated promising performance. In this work, we further explore the direct zero-shot applicability of NLI models to real applications, beyond the sentence-pair setting they were trained on. First, we analyze the robustness of these models to longer and out-of-domain inputs. Then, we develop new aggregation methods to allow operating over full documents, reaching state-of-the-art performance on the ContractNLI dataset. Interestingly, we find NLI scores to provide strong retrieval signals, leading to more relevant evidence extractions compared to common similarity-based methods. Finally, we go further and investigate whole document clusters to identify both discrepancies and consensus among sources. In a test case, we find real inconsistencies between Wikipedia pages in different languages about the same topic.
    Modelling black-box audio effects with time-varying feature modulation. (arXiv:2211.00497v1 [cs.SD])
    Deep learning approaches for black-box modelling of audio effects have shown promise; however, the majority of existing work focuses on nonlinear effects with behaviour on relatively short time-scales, such as guitar amplifiers and distortion. While recurrent and convolutional architectures can theoretically be extended to capture behaviour at longer time scales, we show that simply scaling the width, depth, or dilation factor of existing architectures does not result in satisfactory performance when modelling audio effects such as fuzz and dynamic range compression. To address this, we propose the integration of time-varying feature-wise linear modulation into existing temporal convolutional backbones, an approach that enables learnable adaptation of the intermediate activations. We demonstrate that our approach more accurately captures long-range dependencies for a range of fuzz and compressor implementations across both time and frequency domain metrics. We provide sound examples, source code, and pretrained models to facilitate reproducibility.  ( 2 min )
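    A minimal PyTorch sketch of the core idea follows, assuming a block-wise modulation schedule (the paper's exact backbone and conditioning network differ): a dilated convolution produces activations, a small LSTM summarizes them block by block, and the resulting per-block scale and shift modulate the activations over time.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class TFiLMBlock(nn.Module):
            # one dilated-conv block whose activations are scaled/shifted by
            # time-varying FiLM parameters produced by a small LSTM
            def __init__(self, channels, kernel=3, dilation=1, block=128):
                super().__init__()
                self.block = block  # frames per FiLM step (assumes T >= block)
                self.conv = nn.Conv1d(channels, channels, kernel, dilation=dilation,
                                      padding=(kernel - 1) * dilation // 2)
                self.rnn = nn.LSTM(channels, channels, batch_first=True)
                self.to_film = nn.Linear(channels, 2 * channels)

            def forward(self, x):                          # x: (B, C, T)
                h = self.conv(x)
                B, C, T = h.shape
                n = T // self.block
                # summarize each block of frames, run the LSTM over the summaries
                pooled = h[:, :, :n * self.block].reshape(B, C, n, self.block).mean(-1)
                z, _ = self.rnn(pooled.transpose(1, 2))    # (B, n, C)
                gamma, beta = self.to_film(z).chunk(2, dim=-1)
                # upsample FiLM parameters back to frame rate and modulate
                gamma = gamma.transpose(1, 2).repeat_interleave(self.block, dim=2)
                beta = beta.transpose(1, 2).repeat_interleave(self.block, dim=2)
                return F.relu(h[:, :, :n * self.block] * gamma + beta)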
    CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. (arXiv:2106.08087v6 [cs.CL] CROSS LISTED)
    Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, and single-sentence/sentence-pair classification, together with an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with 11 current pre-trained Chinese models, and the experimental results show that state-of-the-art neural models still perform far worse than the human ceiling. Our benchmark is released at \url{https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us}.
    PromptKG: A Prompt Learning Framework for Knowledge Graph Representation Learning and Application. (arXiv:2210.00305v1 [cs.CL] CROSS LISTED)
    Knowledge Graphs (KGs) often have two characteristics: heterogeneous graph structure and text-rich entity/relation information. KG representation models should consider both graph structures and text semantics, but no comprehensive open-source framework has been designed for KGs with informative text descriptions. In this paper, we present PromptKG, a prompt learning framework for KG representation learning and application that equips the cutting-edge text-based methods, integrates a new prompt learning model, and supports various tasks (e.g., knowledge graph completion, question answering, recommendation, and knowledge probing). PromptKG is publicly open-sourced at https://github.com/zjunlp/PromptKG with long-term technical support.
    Relation Extraction as Open-book Examination: Retrieval-enhanced Prompt Tuning. (arXiv:2205.02355v1 [cs.CL] CROSS LISTED)
    Pre-trained language models have contributed significantly to relation extraction by demonstrating remarkable few-shot learning abilities. However, prompt tuning methods for relation extraction may still fail to generalize to rare or hard patterns. Note that the previous parametric learning paradigm can be viewed as memorization, treating training data as a book and inference as a closed-book test. Such long-tailed or hard patterns can hardly be memorized in parameters given few-shot instances. To this end, we regard RE as an open-book examination and propose a new semiparametric paradigm of retrieval-enhanced prompt tuning for relation extraction. We construct an open-book datastore for retrieval, regarding prompt-based instance representations and corresponding relation labels as memorized key-value pairs. During inference, the model can infer relations by linearly interpolating the base output of the PLM with the non-parametric nearest-neighbor distribution over the datastore. In this way, our model not only infers relations through knowledge stored in the weights during training but also assists decision-making by unwinding and querying examples in the open-book datastore. Extensive experiments on benchmark datasets show that our method can achieve state-of-the-art results in both standard supervised and few-shot settings. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/RetrievalRE.
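    The interpolation step can be sketched as follows; the distance-weighted kNN distribution, the temperature, and the interpolation weight lam are illustrative choices, and this helper is not the authors' implementation.

        import numpy as np

        def retrieval_enhanced_probs(p_plm, query, keys, labels, n_classes,
                                     k=16, lam=0.5, temperature=1.0):
            # p_plm: base relation distribution from the PLM; query: instance
            # representation; keys/labels: the open-book datastore
            d = np.linalg.norm(keys - query, axis=1)   # distances to all keys
            nn = np.argsort(d)[:k]                     # k nearest neighbours
            w = np.exp(-d[nn] / temperature)
            p_knn = np.zeros(n_classes)
            for idx, weight in zip(nn, w):
                p_knn[labels[idx]] += weight           # accumulate label mass
            p_knn /= p_knn.sum()
            # linear interpolation of parametric and non-parametric predictions
            return lam * p_knn + (1.0 - lam) * p_plm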
    DeepKE: A Deep Learning Based Knowledge Extraction Toolkit for Knowledge Base Population. (arXiv:2201.03335v5 [cs.CL] CROSS LISTED)
    We present DeepKE, an open-source and extensible knowledge extraction toolkit supporting complicated low-resource, document-level, and multimodal scenarios in knowledge base population. DeepKE implements various information extraction tasks, including named entity recognition, relation extraction, and attribute extraction. With a unified framework, DeepKE allows developers and researchers to customize datasets and models to extract information from unstructured data according to their requirements. Specifically, DeepKE not only provides various functional modules and model implementations for different tasks and scenarios but also organizes all components within consistent frameworks to maintain sufficient modularity and extensibility. We release the source code on GitHub at https://github.com/zjunlp/DeepKE with Google Colab tutorials and comprehensive documents for beginners. Besides, we present an online system (this http URL) for real-time extraction across various tasks, along with a demo video.
    OntoProtein: Protein Pretraining With Gene Ontology Embedding. (arXiv:2201.11147v6 [q-bio.BM] CROSS LISTED)
    Self-supervised protein language models have proved their effectiveness in learning protein representations. With increasing computational power, current protein language models pre-trained on millions of diverse sequences can advance the parameter scale from the million level to the billion level and achieve remarkable improvements. However, those prevailing approaches rarely consider incorporating knowledge graphs (KGs), which can provide rich structured knowledge facts for better protein representations. We argue that informative biological knowledge in KGs can enhance protein representations with external knowledge. In this work, we propose OntoProtein, the first general framework that incorporates the structure of GO (Gene Ontology) into protein pre-training models. We construct a novel large-scale knowledge graph that consists of GO and its related proteins, with all nodes in the graph described by gene annotation texts or protein sequences. We propose a novel contrastive learning approach with knowledge-aware negative sampling to jointly optimize the knowledge graph and protein embeddings during pre-training. Experimental results show that OntoProtein can surpass state-of-the-art methods with pre-trained protein language models on the TAPE benchmark and yields better performance compared with baselines in protein-protein interaction and protein function prediction. Code and datasets are available at https://github.com/zjunlp/OntoProtein.
    Falsehoods that ML researchers believe about OOD detection. (arXiv:2210.12767v2 [stat.ML] UPDATED)
    An intuitive way to detect out-of-distribution (OOD) data is via the density function of a fitted probabilistic generative model: points with low density may be classed as OOD. But this approach has been found to fail in deep learning settings. In this paper, we list some falsehoods that machine learning researchers believe about density-based OOD detection. Many recent works have proposed likelihood-ratio-based methods to `fix' the problem. We propose a framework, the OOD proxy framework, to unify these methods, and we argue that the likelihood ratio is a principled method for OOD detection and not a mere `fix'. Finally, we discuss the relationship between domain discrimination and semantics.
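    A minimal sketch of a likelihood-ratio OOD score in this spirit: fit one density to in-distribution data and a second, background density to a proxy distribution, then score by the difference of log-likelihoods. The Gaussian mixtures and the choice of background data (x_train, x_background, both hypothetical inputs) are assumptions for illustration, not the paper's prescription.

        import numpy as np
        from sklearn.mixture import GaussianMixture

        # x_train: in-distribution features; x_background: proxy/background data
        # (both hypothetical arrays of shape (n_samples, n_features))
        gm_in = GaussianMixture(n_components=8, random_state=0).fit(x_train)
        gm_bg = GaussianMixture(n_components=8, random_state=0).fit(x_background)

        def ood_score(x):
            # higher score => more in-distribution; thresholding gives a detector
            return gm_in.score_samples(x) - gm_bg.score_samples(x)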
    Attention-Based Capsule Networks with Dynamic Routing for Relation Extraction. (arXiv:1812.11321v1 [cs.IR] CROSS LISTED)
    A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity. In this paper, we explore capsule networks for relation extraction in a multi-instance multi-label learning framework and propose a novel neural approach based on capsule networks with attention mechanisms. We evaluate our method on different benchmarks, and it is demonstrated that our method improves the precision of the predicted relations. In particular, we show that capsule networks improve relation extraction for multiple entity pairs.
    KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction. (arXiv:2104.07650v6 [cs.CL] CROSS LISTED)
    Recently, prompt-tuning has achieved promising results for specific few-shot classification tasks. The core idea of prompt-tuning is to insert text pieces (i.e., templates) into the input and transform a classification task into a masked language modeling problem. However, for relation extraction, determining an appropriate prompt template requires domain expertise, and it is cumbersome and time-consuming to obtain a suitable label word. Furthermore, there exists abundant semantic and prior knowledge among relation labels that cannot be ignored. To this end, we focus on incorporating knowledge among relation labels into prompt-tuning for relation extraction and propose a Knowledge-aware Prompt-tuning approach with synergistic optimization (KnowPrompt). Specifically, we inject latent knowledge contained in relation labels into prompt construction with learnable virtual type words and answer words. Then, we synergistically optimize their representation with structured constraints. Extensive experimental results on five datasets with standard and low-resource settings demonstrate the effectiveness of our approach. Our code and datasets are available at https://github.com/zjunlp/KnowPrompt for reproducibility.
    Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods. (arXiv:2202.07262v2 [math.OC] UPDATED)
    Stochastic Gradient Descent-Ascent (SGDA) is one of the most prominent algorithms for solving min-max optimization and variational inequalities problems (VIP) appearing in various machine learning tasks. The success of the method led to several advanced extensions of the classical SGDA, including variants with arbitrary sampling, variance reduction, coordinate randomization, and distributed variants with compression, which were extensively studied in the literature, especially during the last few years. In this paper, we propose a unified convergence analysis that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications and have been developed separately in various communities. A key to our unified framework is a parametric assumption on the stochastic estimates. Via our general theoretical framework, we either recover the sharpest known rates for the known special cases or tighten them. Moreover, to illustrate the flexibility of our approach we develop several new variants of SGDA such as a new variance-reduced method (L-SVRGDA), new distributed methods with compression (QSGDA, DIANA-SGDA, VR-DIANA-SGDA), and a new method with coordinate randomization (SEGA-SGDA). Although variants of the new methods are known for solving minimization problems, they were never considered or analyzed for solving min-max problems and VIPs. We also demonstrate the most important properties of the new methods through extensive numerical experiments.  ( 3 min )
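    For reference, the classical SGDA iteration for $\min_x \max_y f(x,y)$, one special case covered by the unified analysis, reads $x_{t+1} = x_t - \gamma_1 g_x(x_t, y_t, \xi_t)$ and $y_{t+1} = y_t + \gamma_2 g_y(x_t, y_t, \xi_t)$, where $g_x$ and $g_y$ are stochastic estimates of $\nabla_x f$ and $\nabla_y f$, $\gamma_1, \gamma_2$ are step sizes, and $\xi_t$ denotes the sampled randomness; the paper's key parametric assumption places conditions on these stochastic estimates.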
    Decoupling Knowledge from Memorization: Retrieval-augmented Prompt Learning. (arXiv:2205.14704v3 [cs.CL] CROSS LISTED)
    Prompt learning approaches have made waves in natural language processing by inducing better few-shot performance while still following a parametric learning paradigm; however, rote memorization and forgetting in such learning can lead to unstable generalization. Specifically, vanilla prompt learning may struggle to utilize atypical instances by rote during fully-supervised training or overfit shallow patterns with low-shot data. To alleviate such limitations, we develop RetroPrompt with the motivation of decoupling knowledge from memorization to help the model strike a balance between generalization and memorization. In contrast with vanilla prompt learning, RetroPrompt constructs an open-book knowledge-store from training instances and implements a retrieval mechanism during input, training, and inference, thus equipping the model with the ability to retrieve related contexts from the training corpus as cues for enhancement. Extensive experiments demonstrate that RetroPrompt can obtain better performance in both few-shot and zero-shot settings. Besides, we further illustrate that our proposed RetroPrompt can yield better generalization abilities on new datasets. Detailed analysis of memorization indeed reveals that RetroPrompt can reduce the reliance of language models on memorization, thus improving generalization for downstream tasks. Code is available at https://github.com/zjunlp/PromptKG/tree/main/research/RetroPrompt.  ( 2 min )
    Goal-Space Planning with Subgoal Models. (arXiv:2206.02902v3 [cs.LG] UPDATED)
    This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning and avoids learning the transition dynamics entirely. We show that our GSP algorithm can learn significantly faster than a Double DQN baseline in a variety of situations.  ( 2 min )
    Differentially private stochastic expectation propagation (DP-SEP). (arXiv:2111.13219v5 [cs.LG] UPDATED)
    We are interested in privatizing an approximate posterior inference algorithm called Expectation Propagation (EP). EP approximates the posterior by iteratively refining approximations to the local likelihoods, and is known to provide better posterior uncertainties than those given by variational inference (VI). However, EP needs a large amount of memory to maintain all the local approximations associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term on the posterior and refines it in a way analogous to EP. In terms of privacy, SEP is more tractable than EP because at each refining step of a factor, the remaining factors are fixed and do not depend on other datapoints as in EP, which makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior estimates under our method, called differentially private stochastic expectation propagation (DP-SEP). Furthermore, we demonstrate the performance of our DP-SEP algorithm evaluated on both synthetic and real-world datasets in terms of the quality of posterior estimates at different levels of guaranteed privacy.
    Modeling Occasion Evolution in Frequency Domain for Promotion-Aware Click-Through Rate Prediction. (arXiv:2112.13747v5 [cs.LG] UPDATED)
    Promotions are becoming more important and prevalent in e-commerce to attract customers and boost sales, leading to frequent changes of occasions, which drive users to behave differently. In such situations, most existing Click-Through Rate (CTR) models cannot generalize well to online serving due to the distribution uncertainty of the upcoming occasion. In this paper, we propose a novel CTR model named MOEF for recommendations under frequent changes of occasions. Firstly, we design a time series that consists of occasion signals generated from the online business scenario. Since occasion signals are more discriminative in the frequency domain, we apply the Fourier Transform to sliding time windows over the time series, obtaining a sequence of frequency spectra which are then processed by the Occasion Evolution Layer (OEL). In this way, a high-order occasion representation can be learned to handle the online distribution uncertainty. Moreover, we adopt multiple experts to learn feature representations from multiple aspects, which are guided by the occasion representation via an attention mechanism. Accordingly, a mixture of feature representations is obtained adaptively for different occasions to predict the final CTR. Experimental results on real-world datasets validate the superiority of MOEF, and online A/B tests also show that MOEF outperforms representative CTR models significantly.
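    The Fourier front-end can be sketched as follows; the window length, hop size, and use of the magnitude spectrum are illustrative choices rather than MOEF's exact configuration.

        import numpy as np

        def occasion_spectra(signal, window=64, hop=16):
            # slide a window over the occasion time series and take the
            # magnitude spectrum of each window; the resulting sequence of
            # spectra is what an evolution layer like OEL would consume
            starts = range(0, len(signal) - window + 1, hop)
            return np.stack([np.abs(np.fft.rfft(signal[s:s + window]))
                             for s in starts])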
    Recurrent Neural Networks and Universal Approximation of Bayesian Filters. (arXiv:2211.00335v1 [stat.ML])
    We consider the Bayesian optimal filtering problem: i.e. estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive mapping from observational inputs to the desired estimator statistics. The main focus of this article is the approximation capabilities of this framework. We provide approximation error bounds for filtering in general non-compact domains. We also consider strong time-uniform approximation error bounds that guarantee good long-time performance. We discuss and illustrate a number of practical concerns and implications of these results.
    LightNER: A Lightweight Tuning Paradigm for Low-resource NER via Pluggable Prompting. (arXiv:2109.00720v5 [cs.CL] CROSS LISTED)
    Most NER methods rely on extensive labeled data for model training, which makes them struggle in low-resource scenarios with limited training data. Existing dominant approaches usually suffer from the challenge that the target domain has different label sets compared with a resource-rich source domain, which can be summarized as class transfer and domain transfer. In this paper, we propose a lightweight tuning paradigm for low-resource NER via pluggable prompting (LightNER). Specifically, we construct a unified learnable verbalizer of entity categories to generate the entity span sequence and entity categories without any label-specific classifiers, thus addressing the class transfer issue. We further propose a pluggable guidance module that incorporates learnable parameters into the self-attention layer as guidance, which can re-modulate the attention and adapt the pre-trained weights. Note that we only tune the inserted modules while keeping all the parameters of the pre-trained language model fixed, thus making our approach lightweight and flexible for low-resource scenarios and better able to transfer knowledge across domains. Experimental results show that LightNER can obtain comparable performance in the standard supervised setting and outperform strong baselines in low-resource settings. Code is available at https://github.com/zjunlp/DeepKE/tree/main/example/ner/few-shot.
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v2 [stat.ML] UPDATED)
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and a central observation that SW may be interpreted as an average risk, the quantity PAC-Bayesian bounds have been designed to characterize. We provide three types of results: i) PAC-Bayesian generalization bounds that hold on what we refer as adaptive Sliced-Wasserstein distances, i.e. SW defined with respect to arbitrary distributions of slices (among which data-dependent distributions), ii) a principled procedure to learn the distribution of slices that yields maximally discriminative SW, by optimizing our theoretical bounds, and iii) empirical illustrations of our theoretical findings.
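    For concreteness, a Monte Carlo estimate of SW under the uniform slice distribution is sketched below (equal sample sizes assumed); an adaptive SW in the paper's sense would replace the uniform directions with draws from a learned distribution of slices.

        import numpy as np

        def sliced_wasserstein(x, y, n_slices=100, seed=0):
            # x, y: (n, d) samples from the two distributions (same n assumed)
            rng = np.random.default_rng(seed)
            theta = rng.standard_normal((n_slices, x.shape[1]))
            theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
            total = 0.0
            for t in theta:
                px, py = np.sort(x @ t), np.sort(y @ t)  # 1D projections
                total += np.mean((px - py) ** 2)         # 1D W2^2 via sorted samples
            return np.sqrt(total / n_slices)             # average risk over slices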
    One-Shot Federated Learning for Model Clustering and Learning in Heterogeneous Environments. (arXiv:2209.10866v2 [cs.LG] UPDATED)
    The paper presents a communication-efficient approach for federated learning in heterogeneous environments in which users obtain data from one of $K$ different data distributions. In the proposed setup, the number $K$ of data distributions and their underlying statistical properties, as well as the user cluster structure (i.e., the grouping of users based on the data distributions they sample), are a priori unknown. A one-shot decentralized learning approach, based on a single round of communication between the users and the server, is proposed with the objective of learning the true model at each user. The proposed one-shot approach, based on local computations at the users and a convex clustering based aggregation step at the server, is shown to provide strong learning guarantees in such heterogeneous environments. In particular, it is shown that for strongly convex learning setups, as long as the number of data points per user is above a threshold, the proposed approach achieves order-optimal mean-squared error (MSE) rates in terms of the sample size with respect to a hypothetical oracle that has access to all data points at all users and perfect information about the number of different distributions and the user cluster structure, i.e., the assignment of distributions to users. An explicit characterization of the threshold is provided in terms of the problem parameters. Numerical experiments illustrate the findings and corroborate the performance of the proposed method.
    Sub-8-bit quantization for on-device speech recognition: a regularization-free approach. (arXiv:2210.09188v2 [cs.SD] UPDATED)
    For on-device automatic speech recognition (ASR), quantization aware training (QAT) is ubiquitous to achieve the trade-off between model predictive performance and efficiency. Among existing QAT methods, one major drawback is that the quantization centroids have to be predetermined and fixed. To overcome this limitation, we introduce a regularization-free, "soft-to-hard" compression mechanism with self-adjustable centroids in a mu-Law constrained space, resulting in a simpler yet more versatile quantization scheme, called General Quantizer (GQ). We apply GQ to ASR tasks using Recurrent Neural Network Transducer (RNN-T) and Conformer architectures on both LibriSpeech and de-identified far-field datasets. Without accuracy degradation, GQ can compress both RNN-T and Conformer into sub-8-bit, and for some RNN-T layers, to 1-bit for fast and accurate inference. We observe a 30.73% memory footprint saving and 31.75% user-perceived latency reduction compared to 8-bit QAT via physical device benchmarking.
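    A minimal sketch of quantizing in a mu-law companded space is given below; note that GQ self-adjusts its centroids during training, whereas this toy version uses fixed uniform levels in the companded domain.

        import numpy as np

        def mu_law_quantize(w, bits=4, mu=255.0):
            m = np.max(np.abs(w)) + 1e-12
            x = w / m                                   # normalize to [-1, 1]
            # compress: mu-law companding spreads small magnitudes apart
            comp = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
            levels = 2 ** bits - 1
            q = np.round((comp + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0
            # expand back to the linear domain and restore the original scale
            expd = np.sign(q) * ((1.0 + mu) ** np.abs(q) - 1.0) / mu
            return expd * m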
    Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining. (arXiv:2110.08412v3 [cs.CL] UPDATED)
    To explain NLP models, a popular approach is to use importance measures, such as attention, which indicate which input tokens are important for making a prediction. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. It works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve over masking ratios. Furthermore, we propose a summarizing metric using the relative area between curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both the computer vision and faithfulness-of-attention literature.
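    The loop below sketches the recursive masking-and-retraining procedure; train, evaluate, importance, and mask_top_tokens are hypothetical stand-ins for the task-specific training routine, test-set evaluation, the importance measure under study, and the token-masking step.

        def recursive_roar(dataset, steps=5, ratio_per_step=0.1):
            masked = dataset
            model = train(masked)
            curve = [(0.0, evaluate(model))]             # unmasked baseline
            for step in range(steps):
                scores = importance(model, masked)       # e.g., attention scores
                masked = mask_top_tokens(masked, scores, ratio_per_step)
                model = train(masked)                    # retrain on masked data
                curve.append(((step + 1) * ratio_per_step, evaluate(model)))
            return curve  # performance as a function of the masking ratio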
    Error Controlled Feature Selection for Ultrahigh Dimensional and Highly Correlated Feature Space Using Deep Learning. (arXiv:2209.07011v3 [stat.ML] UPDATED)
    In recent years, deep learning has been at the center of analytics due to its impressive empirical success in analyzing complex data objects. Despite this success, most of the existing tools behave like black-box machines; hence there is increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature-selected deep learning has emerged as a promising tool in this realm. However, recent developments do not accommodate ultra-high dimensional and highly correlated features, in addition to high noise levels. In this article, we propose a novel screening and cleaning method with the aid of deep learning for data-adaptive multi-resolutional discovery of highly correlated predictors with a controlled error rate. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum.
    USE-Evaluator: Performance Metrics for Medical Image Segmentation Models with Uncertain, Small or Empty Reference Annotations. (arXiv:2209.13008v3 [eess.IV] UPDATED)
    Performance metrics for medical image segmentation models are used to measure the agreement between the reference annotation and the predicted segmentation. Usually, overlap metrics, such as the Dice, are used to evaluate the performance of these models so that results are comparable. However, there is a mismatch between the distribution of cases and the difficulty level of segmentation tasks in public data sets compared to clinical practice. Common metrics fail to measure the impact of this mismatch, especially for clinical data sets that include low-signal pathologies, a difficult segmentation task, and uncertain, small, or empty reference annotations. This limitation may result in ineffective machine learning research on designing and optimizing models. Dimensions of evaluating clinical value include consideration of the uncertainty of reference annotations, independence from reference annotation volume size, and evaluation of classification of empty reference annotations. We study how uncertain, small, and empty reference annotations influence the value of metrics for medical image segmentation on an in-house data set, regardless of the model. We examine metric behavior on the predictions of a standard deep learning framework in order to identify metrics with clinical value. We compare against a public benchmark data set (BraTS 2019) with a high-signal pathology and certain, larger, and no empty reference annotations. We show machine learning practitioners how uncertain, small, or empty reference annotations require a rethinking of evaluation and optimization procedures. The evaluation code was released to encourage further analysis of this topic. https://github.com/SophieOstmeier/UncertainSmallEmpty.git
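    One concrete design choice such an evaluation must make is how an overlap metric behaves on empty references; the sketch below scores a correctly predicted empty annotation as 1 instead of leaving the Dice undefined (one option among several that this kind of analysis motivates, not the paper's prescribed metric).

        import numpy as np

        def dice(pred, ref, empty_value=1.0):
            pred, ref = pred.astype(bool), ref.astype(bool)
            if not ref.any() and not pred.any():
                return empty_value            # correctly predicted empty case
            inter = np.logical_and(pred, ref).sum()
            return 2.0 * inter / (pred.sum() + ref.sum())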
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v1 [cs.LG])
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case, learning the model from labeled and unlabeled samples. Existing work in the semi-supervised case has focused mainly on performance rather than convergence guarantees; we instead focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data), run solely on the samples that have not been labeled, is initialized within its neighborhood of global convergence. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.
    Focal Modulation Networks. (arXiv:2203.11926v2 [cs.CV] UPDATED)
    We propose focal modulation networks (FocalNets in short), where self-attention (SA) is completely replaced by a focal modulation mechanism for modeling token interactions in vision. Focal modulation comprises three components: (i) hierarchical contextualization, implemented using a stack of depth-wise convolutional layers, to encode visual contexts from short to long ranges, (ii) gated aggregation to selectively gather contexts for each query token based on its content, and (iii) element-wise modulation or affine transformation to inject the aggregated context into the query. Extensive experiments show FocalNets outperform the state-of-the-art SA counterparts (e.g., Swin and Focal Transformers) with similar computational costs on the tasks of image classification, object detection, and segmentation. Specifically, FocalNets with tiny and base size achieve 82.3% and 83.9% top-1 accuracy on ImageNet-1K. After pretraining on ImageNet-22K at $224^2$ resolution, they attain 86.5% and 87.3% top-1 accuracy when finetuned at resolutions $224^2$ and $384^2$, respectively. When transferred to downstream tasks, FocalNets exhibit clear superiority. For object detection with Mask R-CNN, FocalNet base trained with a $1\times$ schedule outperforms the Swin counterpart by 2.1 points and already surpasses Swin trained with a $3\times$ schedule (49.0 vs. 48.5). For semantic segmentation with UPerNet, FocalNet base at single-scale outperforms Swin by 2.4, and beats Swin at multi-scale (50.5 vs. 49.7). Using large FocalNet and Mask2former, we achieve 58.5 mIoU for ADE20K semantic segmentation, and 57.9 PQ for COCO Panoptic Segmentation. Using huge FocalNet and DINO, we achieve 64.2 and 64.3 mAP on COCO minival and test-dev, respectively, establishing a new SoTA on top of much larger attention-based models like Swinv2-G and BEIT-3. Code is available at \url{https://github.com/microsoft/FocalNet}.
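    A simplified 1D token-sequence sketch of the mechanism is given below (the released FocalNet operates on 2D feature maps with additional normalization and projections); it shows the three components in order: hierarchical contextualization, gated aggregation, and element-wise modulation of the query.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class FocalModulation1D(nn.Module):
            def __init__(self, dim, levels=3, kernel=3):
                super().__init__()
                self.q = nn.Linear(dim, dim)
                self.gates = nn.Linear(dim, levels + 1)  # one gate per level + global
                self.ctx_convs = nn.ModuleList([
                    nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
                    for _ in range(levels)])
                self.proj = nn.Linear(dim, dim)
                self.out = nn.Linear(dim, dim)

            def forward(self, x):                        # x: (B, T, C)
                q = self.q(x)
                g = self.gates(x)                        # per-token, per-level gates
                ctx = x.transpose(1, 2)                  # (B, C, T)
                agg = torch.zeros_like(x)
                for l, conv in enumerate(self.ctx_convs):
                    ctx = F.gelu(conv(ctx))              # short -> long range contexts
                    agg = agg + ctx.transpose(1, 2) * g[..., l:l + 1]
                glob = ctx.mean(dim=2, keepdim=True)     # global context level
                agg = agg + glob.transpose(1, 2) * g[..., -1:]
                return self.out(q * self.proj(agg))      # modulate the query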
    Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. (arXiv:2211.00593v1 [cs.LG])
    Research in mechanistic interpretability seeks to explain behaviors of machine learning models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models, or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria--faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, opening opportunities to scale our understanding to both larger models and more complex tasks.
    Self-Supervised RF Signal Representation Learning for NextG Signal Classification with Deep Learning. (arXiv:2207.03046v2 [cs.NI] UPDATED)
    Deep learning (DL) finds rich applications in the wireless domain to improve spectrum awareness. Typically, DL models are either randomly initialized following a statistical distribution or pretrained on tasks from other domains in the form of transfer learning without accounting for the unique characteristics of wireless signals. Self-supervised learning (SSL) enables the learning of useful representations from Radio Frequency (RF) signals themselves even when only limited training data samples with labels are available. We present a self-supervised RF signal representation learning method and apply it to the automatic modulation recognition (AMR) task by specifically formulating a set of transformations to capture the wireless signal characteristics. We show that the sample efficiency (the number of labeled samples needed to achieve a certain performance) of AMR can be significantly increased (almost an order of magnitude) by learning signal representations with SSL. This translates to substantial time and cost savings. Furthermore, SSL increases the model accuracy compared to the state-of-the-art DL methods and maintains high accuracy when limited training data is available.
    Multi-modal Protein Knowledge Graph Construction and Applications. (arXiv:2207.10080v2 [q-bio.QM] CROSS LISTED)
    Existing data-centric methods for protein science generally cannot sufficiently capture and leverage biology knowledge, which may be crucial for many protein tasks. To facilitate research in this field, we create ProteinKG65, a knowledge graph for protein science. Using gene ontology and Uniprot knowledge base as a basis, we transform and integrate various kinds of knowledge with aligned descriptions and protein sequences, respectively, to GO terms and protein entities. ProteinKG65 is mainly dedicated to providing a specialized protein knowledge graph, bringing the knowledge of Gene Ontology to protein function and structure prediction. The current version contains about 614,099 entities, 5,620,437 triples (including 5,510,437 protein-go triplets and 110,000 GO-GO triplets). We also illustrate the potential applications of ProteinKG65 with a prototype. Our dataset can be downloaded at https://w3id.org/proteinkg65.
    Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study. (arXiv:2210.10678v1 [cs.CL] CROSS LISTED)
    This paper presents an empirical study to build relation extraction systems in low-resource settings. Based upon recent pre-trained language models, we comprehensively investigate three schemes to evaluate the performance in low-resource settings: (i) different types of prompt-based methods with few-shot labeled data; (ii) diverse balancing methods to address the long-tailed distribution issue; (iii) data augmentation technologies and self-training to generate more labeled in-domain data. We create a benchmark with 8 relation extraction (RE) datasets covering different languages, domains and contexts and perform extensive comparisons over the proposed schemes with combinations. Our experiments illustrate: (i) Though prompt-based tuning is beneficial in low-resource RE, there is still much potential for improvement, especially in extracting relations from cross-sentence contexts with multiple relational triples; (ii) Balancing methods are not always helpful for RE with long-tailed distribution; (iii) Data augmentation complements existing baselines and can bring much performance gain, while self-training may not consistently achieve advancement to low-resource RE. Code and datasets are in https://github.com/zjunlp/LREBench.
    Enhancing the Transformer Decoder with Transition-based Syntax. (arXiv:2101.12640v4 [cs.CL] UPDATED)
    Notwithstanding recent advances, syntactic generalization remains a challenge for text decoders. While some studies showed gains from incorporating source-side symbolic syntactic and semantic structure into text generation Transformers, very little work addressed the decoding of such structure. We propose a general approach for tree decoding using a transition-based approach. Examining the challenging test case of incorporating Universal Dependencies syntax into machine translation, we present substantial improvements on test sets that focus on syntactic generalization, while presenting improved or comparable performance on standard MT benchmarks. Further qualitative analysis addresses cases where syntactic generalization in the vanilla Transformer decoder is inadequate and demonstrates the advantages afforded by integrating syntactic information.
    Structured Uncertainty in the Observation Space of Variational Autoencoders. (arXiv:2205.12533v2 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) are a popular class of deep generative models with many variants and a wide range of applications. Improvements upon the standard VAE mostly focus on the modelling of the posterior distribution over the latent space and the properties of the neural network decoder. In contrast, improving the model for the observational distribution is rarely considered and typically defaults to a pixel-wise independent categorical or normal distribution. In image synthesis, sampling from such distributions produces spatially-incoherent results with uncorrelated pixel noise, resulting in only the sample mean being somewhat useful as an output prediction. In this paper, we aim to stay true to VAE theory by improving the samples from the observational distribution. We propose SOS-VAE, an alternative model for the observation space, encoding spatial dependencies via a low-rank parameterisation. We demonstrate that this new observational distribution has the ability to capture relevant covariance between pixels, resulting in spatially-coherent samples. In contrast to pixel-wise independent distributions, our samples seem to contain semantically-meaningful variations from the mean allowing the prediction of multiple plausible outputs with a single forward pass.
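    The sketch below samples from the implied low-rank-plus-diagonal Gaussian N(mu, D^2 + U U^T), where a decoder head would output the per-pixel mean mu, per-pixel scales diag_std, and a low-rank factor U; the shapes and names are illustrative assumptions, not the SOS-VAE head itself.

        import numpy as np

        def sample_low_rank_gaussian(mu, diag_std, U, n=1, seed=0):
            # mu, diag_std: (p,) per-pixel mean and scale; U: (p, r) low-rank factor
            rng = np.random.default_rng(seed)
            eps_d = rng.standard_normal((n, mu.size))
            eps_r = rng.standard_normal((n, U.shape[1]))
            # correlated noise: diagonal part + low-rank part coupling the pixels
            return mu + eps_d * diag_std + eps_r @ U.T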
    A Confidence Machine for Sparse High-Order Interaction Model. (arXiv:2205.14317v2 [stat.ML] UPDATED)
    In predictive modeling for high-stakes decision-making, predictors must be not only accurate but also reliable. Conformal prediction (CP) is a promising approach for obtaining the confidence of prediction results with fewer theoretical assumptions. To obtain the confidence set by so-called full-CP, we need to refit the predictor for all possible values of the prediction results, which is only possible for simple predictors. For complex predictors such as random forests (RFs) or neural networks (NNs), split-CP is often employed, where the data is split into two parts: one part for fitting and another to compute the confidence set. Unfortunately, because of the reduced sample size, split-CP is inferior to full-CP both in fitting as well as confidence set computation. In this paper, we develop a full-CP of the sparse high-order interaction model (SHIM), which is sufficiently flexible as it can take into account high-order interactions among variables. We resolve the computational challenge for full-CP of SHIM by introducing a novel approach called homotopy mining. Through numerical experiments, we demonstrate that SHIM is as accurate as complex predictors such as RF and NN and enjoys the superior statistical power of full-CP.
    Learning Truthful, Efficient, and Welfare Maximizing Auction Rules. (arXiv:1907.05181v2 [cs.MA] UPDATED)
    From social networks to supply chains, more and more aspects of how humans, firms, and organizations interact are mediated by artificial learning agents. As the influence of machine learning systems grows, it is paramount that we study how to imbue our modern institutions with our own values and principles. Here we consider the problem of allocating goods to buyers who have preferences over them in settings where the seller's aim is not to maximize their monetary gains, but rather to advance some notion of social welfare (e.g., the government trying to award construction licenses for hospitals or schools). This problem has a long history in economics, and solutions take the form of auction rules. Researchers have proposed reliable auction rules that work in extremely general settings, and in the presence of information asymmetry and strategic buyers. However, these protocols require significant payments from participants, resulting in low aggregate welfare. Here we address this shortcoming by casting auction rule design as a statistical learning problem, and trade generality for participant welfare effectively and automatically with a novel deep learning network architecture and auction representation. Our analysis shows that our auction rules outperform state-of-the-art approaches in terms of participant welfare, applicability, and robustness.
    Multitask Online Mirror Descent. (arXiv:2106.02393v3 [cs.LG] UPDATED)
    We introduce and analyze MT-OMD, a multitask generalization of Online Mirror Descent (OMD) which operates by sharing updates between tasks. We prove that the regret of MT-OMD is of order $\sqrt{1 + \sigma^2(N-1)}\sqrt{T}$, where $\sigma^2$ is the task variance according to the geometry induced by the regularizer, $N$ is the number of tasks, and $T$ is the time horizon. Whenever tasks are similar, that is $\sigma^2 \le 1$, our method improves upon the $\sqrt{NT}$ bound obtained by running independent OMDs on each task. We further provide a matching lower bound, and show that our multitask extensions of Online Gradient Descent and Exponentiated Gradient, two major instances of OMD, enjoy closed-form updates, making them easy to use in practice. Finally, we present experiments which support our theoretical findings.
    Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning. (arXiv:2206.02072v2 [cs.LG] UPDATED)
    The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model that is simple enough to learn and incurring only bounded sub-optimality. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.
    Meta-Learning Guarantees for Online Receding Horizon Learning Control. (arXiv:2010.11327v15 [eess.SY] UPDATED)
    In this paper we provide provable regret guarantees for an online meta-learning receding horizon control algorithm in an iterative control setting. We consider the setting where, in each iteration, the system to be controlled is a linear deterministic system that is different and unknown, the cost for the controller in an iteration is a general additive cost function, and there are affine control input constraints. By analysing conditions under which sub-linear regret is achievable, we prove that the meta-learning online receding horizon controller achieves an average dynamic regret for the controller cost of $\tilde{O}((1+1/\sqrt{N})T^{3/4})$, where $N$ is the number of iterations. Thus, we show that the worst-case regret for learning within an iteration improves with the experience of more iterations, with a guarantee on the rate of improvement.
    Dynamical Wasserstein Barycenters for Time-series Modeling. (arXiv:2110.06741v3 [cs.LG] UPDATED)
    Many time series can be modeled as a sequence of segments representing high-level discrete states, such as running and walking in a human activity application. Flexible models should describe the system state and observations in stationary "pure-state" periods as well as transition periods between adjacent segments, such as a gradual slowdown between running and walking. However, most prior work assumes instantaneous transitions between pure discrete states. We propose a dynamical Wasserstein barycentric (DWB) model that estimates the system state over time as well as the data-generating distributions of pure states in an unsupervised manner. Our model assumes each pure state generates data from a multivariate normal distribution, and characterizes transitions between states via displacement-interpolation specified by the Wasserstein barycenter. The system state is represented by a barycentric weight vector which evolves over time via a random walk on the simplex. Parameter learning leverages the natural Riemannian geometry of Gaussian distributions under the Wasserstein distance, which leads to improved convergence speeds. Experiments on several human activity datasets show that our proposed DWB model accurately learns the generating distribution of pure states while improving state estimation for transition periods compared to the commonly used linear interpolation mixture models.
    Supervised Robustness-preserving Data-free Neural Network Pruning. (arXiv:2204.00783v2 [cs.LG] UPDATED)
    When deploying pre-trained neural network models in real-world applications, model consumers often encounter resource-constrained platforms such as mobile and smart devices. They typically use the pruning technique to reduce the size and complexity of the model, generating a lighter one with less resource consumption. Nonetheless, most existing pruning methods are proposed with the premise that the model after being pruned has a chance to be fine-tuned or even retrained based on the original training data. This may be unrealistic in practice, as the data controllers are often reluctant to provide their model consumers with the original data. In this work, we study neural network pruning in the data-free context, aiming to yield lightweight models that are not only accurate in prediction but also robust against undesired inputs in open-world deployments. Considering the absence of the fine-tuning and retraining that could fix mis-pruned units, we replace the traditional aggressive one-shot strategy with a conservative one that treats pruning as a progressive process. We propose a pruning method based on stochastic optimization that uses robustness-related metrics to guide the pruning process. Our method is implemented as a Python program and evaluated with a series of experiments on diverse neural network models. The experimental results show that it significantly outperforms existing one-shot data-free pruning approaches in terms of robustness preservation and accuracy.
    Long-tail Relation Extraction via Knowledge Graph Embeddings and Graph Convolution Networks. (arXiv:1903.01306v1 [cs.IR] CROSS LISTED)
    We propose a distantly supervised relation extraction approach for long-tailed, imbalanced data, which is prevalent in real-world settings. Here, the challenge is to learn accurate "few-shot" models for classes existing at the tail of the class distribution, for which little data is available. Inspired by the rich semantic correlations between classes at the long tail and those at the head, we take advantage of the knowledge from data-rich classes at the head of the distribution to boost the performance of the data-poor classes at the tail. First, we propose to leverage implicit relational knowledge among class labels from knowledge graph embeddings and to learn explicit relational knowledge using graph convolution networks. Second, we integrate that relational knowledge into the relation extraction model via a coarse-to-fine knowledge-aware attention mechanism. We demonstrate our results on a large-scale benchmark dataset, which show that our approach significantly outperforms other baselines, especially for long-tail relations.
    CLIP: Cheap Lipschitz Training of Neural Networks. (arXiv:2103.12531v2 [cs.LG] UPDATED)
    Despite the large success of deep neural networks (DNNs) in recent years, most neural networks still lack mathematical guarantees in terms of stability. For instance, DNNs are vulnerable to small or even imperceptible input perturbations, so-called adversarial examples, that can cause false predictions. This instability can have severe consequences in applications which influence the health and safety of humans, e.g., biomedical imaging or autonomous driving. While bounding the Lipschitz constant of a neural network improves stability, most methods rely on restricting the Lipschitz constants of each layer, which gives a poor bound for the actual Lipschitz constant. In this paper we investigate a variational regularization method named CLIP for controlling the Lipschitz constant of a neural network, which can easily be integrated into the training procedure. We mathematically analyze the proposed model, in particular discussing the impact of the chosen regularization parameter on the output of the network. Finally, we numerically evaluate our method on both a nonlinear regression problem and the MNIST and Fashion-MNIST classification databases, and compare our results with a weight regularization approach.
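    In the spirit of such training-time Lipschitz control (though not the paper's exact estimator), one can penalize a difference-quotient estimate of the local Lipschitz constant on perturbed input pairs; the PyTorch sketch below is a hedged illustration of this kind of regularizer.

        import torch

        def lipschitz_penalty(model, x, eps=1e-2):
            # estimate ||f(x) - f(x')|| / ||x - x'|| on random nearby pairs
            x2 = x + eps * torch.randn_like(x)
            num = (model(x) - model(x2)).flatten(1).norm(dim=1)
            den = (x - x2).flatten(1).norm(dim=1) + 1e-12
            return (num / den).max()   # add lambda * penalty to the training loss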
    Improved Rates for Differentially Private Stochastic Convex Optimization with Heavy-Tailed Data. (arXiv:2106.01336v6 [cs.LG] UPDATED)
    We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy (DP). Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu \cite{WangXDX20}, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under concentrated DP for convex and strongly convex loss functions. Along the way, we derive new algorithms for private mean estimation of heavy-tailed distributions, under both pure and concentrated DP. Finally, we prove nearly-matching lower bounds for private stochastic convex optimization with strongly convex losses and mean estimation, showing new separations between pure and concentrated DP.
    A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks. (arXiv:2206.08514v2 [cs.LG] UPDATED)
    Textual backdoor attacks are a kind of practical threat to NLP systems. By injecting a backdoor in the training phase, the adversary could control model predictions via predefined triggers. As various attack and defense models have been proposed, it is of great significance to perform rigorous evaluations. However, we highlight two issues in previous backdoor learning evaluations: (1) The differences between real-world scenarios (e.g. releasing poisoned datasets or models) are neglected, and we argue that each scenario has its own constraints and concerns, and thus requires specific evaluation protocols; (2) The evaluation metrics only consider whether the attacks could flip the models' predictions on poisoned samples and retain performance on benign samples, but ignore that poisoned samples should also be stealthy and semantic-preserving. To address these issues, we categorize existing works into three practical scenarios in which attackers release datasets, pre-trained models, and fine-tuned models respectively, then discuss their unique evaluation methodologies. On metrics, to evaluate poisoned samples comprehensively, we use grammar error increase and perplexity difference for stealthiness, along with text similarity for validity. After formalizing the frameworks, we develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning. With this toolkit, we perform extensive experiments to benchmark attack and defense models under the suggested paradigm. To facilitate the underexplored defenses against poisoned datasets, we further propose CUBE, a simple yet strong clustering-based defense baseline. We hope that our frameworks and benchmarks could serve as the cornerstones for future model development and evaluations.
    Bayesian Continual Learning via Spiking Neural Networks. (arXiv:2208.13723v2 [cs.NE] UPDATED)
    Among the main features of biological intelligence are energy efficiency, capacity for continual adaptation, and risk management via uncertainty quantification. Neuromorphic engineering has been thus far mostly driven by the goal of implementing energy-efficient machines that take inspiration from the time-based computing paradigm of biological brains. In this paper, we take steps towards the design of neuromorphic systems that are capable of adaptation to changing learning tasks, while producing well-calibrated uncertainty quantification estimates. To this end, we derive online learning rules for spiking neural networks (SNNs) within a Bayesian continual learning framework. In it, each synaptic weight is represented by parameters that quantify the current epistemic uncertainty resulting from prior knowledge and observed data. The proposed online rules update the distribution parameters in a streaming fashion as data are observed. We instantiate the proposed approach for both real-valued and binary synaptic weights. Experimental results using Intel's Lava platform show the merits of Bayesian over frequentist learning in terms of capacity for adaptation and uncertainty quantification.
    Revisiting Neural Scaling Laws in Language and Vision. (arXiv:2209.06640v2 [cs.LG] UPDATED)
    The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark. Finally, we release a benchmark dataset comprising 90 evaluation tasks to facilitate research in this domain.
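    The evaluation protocol the abstract argues for can be pictured in a few lines: fit a saturating power law on the smaller scales only, then judge it by extrapolation error on held-out larger scales. The functional form and synthetic data below are illustrative assumptions.

        # Fit on small scales, score by extrapolation to large scales.
        import numpy as np
        from scipy.optimize import curve_fit

        def scaling_law(n, c, alpha, L_inf):
            return c * n ** (-alpha) + L_inf

        n = np.array([1e4, 3e4, 1e5, 3e5, 1e6, 3e6, 1e7])
        loss = 5.0 * n ** (-0.3) + 0.5 + np.random.normal(0, 0.01, n.size)

        fit_mask = n <= 1e6                        # fit on small scales ...
        params, _ = curve_fit(scaling_law, n[fit_mask], loss[fit_mask],
                              p0=(1.0, 0.5, 0.1), maxfev=10000)
        pred = scaling_law(n[~fit_mask], *params)
        print(np.abs(pred - loss[~fit_mask]).mean())   # ... judge on large ones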
    What drives a goalkeeper's decisions?. (arXiv:2211.00374v1 [cs.LG])
    In soccer games, the goalkeeper's performance is an important factor in the success of the whole team. Despite the goalkeeper's importance, little attention has been paid to their performance in event and tracking data. Here, we developed a model to predict which movements would be most effective for shot-stopping and compared it to the real-life behavior of goalkeepers. This model evaluates the performance of goalkeepers based on their position and dive radius. We found that, contrary to the movements considered most effective by our model, real-life goalkeepers' movements were more diverse. We further used our model to develop a tool to analyse goalkeepers' behavior in real-life soccer games. In addition, a simulator function allows team analysts or coaches to identify situations in which the goalkeeper's reaction can be further improved.
    Convergence of policy gradient methods for finite-horizon stochastic linear-quadratic control problems. (arXiv:2211.00617v1 [math.OC])
    We study the global linear convergence of policy gradient (PG) methods for finite-horizon exploratory linear-quadratic control (LQC) problems. The setting includes stochastic LQC problems with indefinite costs and allows additional entropy regularisers in the objective. We consider a continuous-time Gaussian policy whose mean is linear in the state variable and whose covariance is state-independent. Contrary to discrete-time problems, the cost is noncoercive in the policy and not all descent directions lead to bounded iterates. We propose geometry-aware gradient descents for the mean and covariance of the policy using the Fisher geometry and the Bures-Wasserstein geometry, respectively. The policy iterates are shown to satisfy an a-priori bound, and converge globally to the optimal policy with a linear rate. We further propose a novel PG method with discrete-time policies. The algorithm leverages the continuous-time analysis, and achieves a robust linear convergence across different action frequencies. A numerical experiment confirms the convergence and robustness of the proposed algorithm.
    When Bioprocess Engineering Meets Machine Learning: A Survey from the Perspective of Automated Bioprocess Development. (arXiv:2209.01083v2 [cs.LG] UPDATED)
    Machine learning (ML) is becoming increasingly crucial in many fields of engineering but has not yet played out its full potential in bioprocess engineering. While experimentation has been accelerated by increasing levels of lab automation, experimental planning and data modeling still depend largely on human intervention. ML can be seen as a set of tools that contribute to the automation of the whole experimental cycle, including model building and practical planning, thus allowing human experts to focus on the more demanding and overarching cognitive tasks. First, probabilistic programming is used for the autonomous building of predictive models. Second, machine learning automatically assesses alternative decisions by planning experiments to test hypotheses and conducting investigations to gather informative data that focus on model selection based on the uncertainty of model predictions. This review provides a comprehensive overview of ML-based automation in bioprocess development. On the one hand, the biotech and bioengineering community should be aware of the potential and, most importantly, the limitations of existing ML solutions for their application in biotechnology and biopharma. On the other hand, it is essential to identify the missing links to enable the easy implementation of ML and Artificial Intelligence (AI) tools in valuable solutions for the bio-community.
    Compressed Gastric Image Generation Based on Soft-Label Dataset Distillation for Medical Data Sharing. (arXiv:2209.14635v2 [cs.CV] UPDATED)
    Background and objective: Sharing of medical data is required to enable the cross-agency flow of healthcare information and construct high-accuracy computer-aided diagnosis systems. However, the large sizes of medical datasets, the massive amount of memory of saved deep convolutional neural network (DCNN) models, and patients' privacy protection are problems that can lead to inefficient medical data sharing. Therefore, this study proposes a novel soft-label dataset distillation method for medical data sharing. Methods: The proposed method distills valid information of medical image data and generates several compressed images with different data distributions for anonymous medical data sharing. Furthermore, our method can extract essential weights of DCNN models to reduce the memory required to save trained models for efficient medical data sharing. Results: The proposed method can compress tens of thousands of images into several soft-label images and reduce the size of a trained model to a few hundredths of its original size. The compressed images obtained after distillation have been visually anonymized; therefore, they do not contain the private information of the patients. Furthermore, we can realize high-detection performance with a small number of compressed images. Conclusions: The experimental results show that the proposed method can improve the efficiency and security of medical data sharing.
    IE-GAN: An Improved Evolutionary Generative Adversarial Network Using a New Fitness Function and a Generic Crossover Operator. (arXiv:2109.11078v3 [cs.NE] UPDATED)
    The training of generative adversarial networks (GANs) is usually vulnerable to mode collapse and vanishing gradients. The evolutionary generative adversarial network (E-GAN) attempts to alleviate these issues by optimizing the learning strategy with multiple loss functions. It uses a learning-based evolutionary framework, which develops new mutation operators specifically for general deep neural networks. However, the evaluation mechanism in the fitness function of E-GAN cannot truly reflect the adaptability of individuals to their environment, leading to an inaccurate assessment of the diversity of individuals. Moreover, the evolution step of E-GAN only contains mutation operators without considering the crossover operator jointly, isolating the superior characteristics among individuals. To address these issues, we propose an improved E-GAN framework called IE-GAN, which introduces a new fitness function and a generic crossover operator. In particular, the proposed fitness function, from an objective perspective, can model the evolutionary process of individuals more accurately. The crossover operator, which has been commonly adopted in evolutionary algorithms, can enable offspring to imitate the superior gene expression of their parents through knowledge distillation. Experiments on various datasets demonstrate the effectiveness of our proposed IE-GAN in terms of the quality of the generated samples and time efficiency.
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v2 [cs.LG] UPDATED)
    Recent years have seen advances in generalization bounds for noisy stochastic algorithms, especially stochastic gradient Langevin dynamics (SGLD) based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical contributions. First, we bound the generalization error in terms of expected (not uniform) stability, which arguably leads to quantitatively sharper bounds. Second, as our main contribution, we introduce Exponential Family Langevin Dynamics (EFLD), a substantial generalization of SGLD, which includes noisy versions of Sign-SGD and quantized SGD as special cases. We establish data-dependent expected stability based generalization bounds for any EFLD algorithm with an O(1/n) sample dependence and dependence on gradient discrepancy rather than the norm of gradients, yielding significantly sharper bounds. Third, we establish optimization guarantees for special cases of EFLD. Further, empirical results on benchmarks illustrate that our bounds are non-vacuous, quantitatively sharper than existing bounds, and behave correctly under noisy labels.
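    For intuition, here is a toy version of one special case named in the abstract, noisy Sign-SGD: each coordinate takes a unit step whose direction is a stochastically flipped version of the negative gradient sign. The sigmoid flip probability is an illustrative choice, not the paper's exact noise model.

        # Toy noisy Sign-SGD step (one EFLD special case, illustrative).
        import numpy as np

        rng = np.random.default_rng(0)

        def noisy_sign_sgd_step(theta, grad, lr=0.01, temp=1.0):
            # P(step follows -sign(grad)) grows with |grad| / temp.
            p_correct = 1.0 / (1.0 + np.exp(-np.abs(grad) / temp))
            agree = rng.random(theta.shape) < p_correct
            step = np.where(agree, -np.sign(grad), np.sign(grad))
            return theta + lr * step

        theta = rng.normal(size=5)
        for _ in range(500):                      # minimize ||theta||^2 (toy)
            theta = noisy_sign_sgd_step(theta, 2 * theta)
        print(np.abs(theta).max())                # near 0, up to step-size noise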
    Conservative Likelihood Ratio Estimator for Infrequent Data Slightly above a Frequency Threshold. (arXiv:2211.00545v1 [stat.ML])
    A naive likelihood ratio (LR) estimation using the observed frequencies of events can overestimate LRs for infrequent data. One approach to avoid this problem is to use a frequency threshold and set the estimates to zero for frequencies below the threshold. This approach eliminates the computation of some estimates, thereby making practical tasks using LRs more efficient. However, it still overestimates LRs for low frequencies near the threshold. This study proposes a conservative estimator for low frequencies, slightly above the threshold. Our experiment used LRs to predict the occurrence contexts of named entities from a corpus. The experimental results demonstrate that our estimator improves the prediction accuracy while maintaining efficiency in the context prediction task.
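    The setup is easy to make concrete. Below are a naive frequency-based LR, the hard-threshold variant that zeroes out rare events, and a simple shrinkage for counts just above the threshold; the shrinkage rule is an illustrative stand-in, not the paper's estimator.

        # Naive, thresholded, and (illustrative) conservative LR estimates.
        def naive_lr(k_pos, n_pos, k_neg, n_neg):
            return (k_pos / n_pos) / max(k_neg / n_neg, 1e-12)

        def thresholded_lr(k_pos, n_pos, k_neg, n_neg, tau=5):
            if k_pos < tau:
                return 0.0                # skip computing rare-event estimates
            return naive_lr(k_pos, n_pos, k_neg, n_neg)

        def conservative_lr(k_pos, n_pos, k_neg, n_neg, tau=5):
            if k_pos < tau:
                return 0.0
            # Shrink toward zero for counts near the threshold.
            return ((k_pos - 1) / n_pos) / ((k_neg + 1) / n_neg)

        print(thresholded_lr(6, 10_000, 1, 10_000))   # overestimates near tau
        print(conservative_lr(6, 10_000, 1, 10_000))  # deliberately smaller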
    Contextual Mixture of Experts: Integrating Knowledge into Predictive Modeling. (arXiv:2211.00558v1 [cs.LG])
    This work proposes a new data-driven model devised to integrate process knowledge into its structure to increase the human-machine synergy in the process industry. The proposed Contextual Mixture of Experts (cMoE) explicitly uses process knowledge along the model learning stage to mold the historical data to represent operators' context related to the process through possibility distributions. This model was evaluated in two real case studies for quality prediction, including a sulfur recovery unit and a polymerization process. The contextual mixture of experts was employed to represent different contexts in both experiments. The results indicate that integrating process knowledge has increased predictive performance while improving interpretability by providing insights into the variables affecting the process's different regimes.
    On Medians of (Randomized) Pairwise Means. (arXiv:2211.00603v1 [stat.ML])
    Tournament procedures, recently introduced in Lugosi & Mendelson (2016), offer an appealing alternative, from a theoretical perspective at least, to the principle of Empirical Risk Minimization in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of candidate decision rules on each data block: the one with the highest performance on the majority of the blocks is declared the winner. In the context of nonparametric regression, functions having won all their duels have been shown to outperform empirical risk minimizers w.r.t. the mean squared error under minimal assumptions, while exhibiting robustness properties. It is the purpose of this paper to extend this approach in order to address other learning problems, in particular those for which the performance criterion takes the form of an expectation over pairs of observations rather than over one single observation, as may be the case in pairwise ranking, clustering or metric learning. Precisely, it is proved here that the bounds achieved by MoM are essentially conserved when the blocks are built by means of independent sampling without replacement schemes instead of a simple segmentation. These results are next extended to situations where the risk is related to a pairwise loss function and its empirical counterpart is of the form of a $U$-statistic. Beyond theoretical results guaranteeing the performance of the learning/estimation methods proposed, some numerical experiments provide empirical evidence of their relevance in practice.
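    The basic MoM estimator at the heart of this line of work is compact; the paper's contribution concerns pairwise (U-statistic) criteria and blocks built by sampling without replacement, which this univariate sketch does not cover.

        # Median-of-Means: block the data, average per block, take the median.
        import numpy as np

        def median_of_means(x, n_blocks=10, rng=None):
            rng = rng or np.random.default_rng()
            x = rng.permutation(x)                 # random block assignment
            blocks = np.array_split(x, n_blocks)
            return np.median([b.mean() for b in blocks])

        rng = np.random.default_rng(0)
        heavy = rng.standard_t(df=2, size=10_000)  # heavy-tailed, mean zero
        print(heavy.mean(), median_of_means(heavy, n_blocks=20, rng=rng))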
    Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data. (arXiv:2105.07536v4 [stat.ML] UPDATED)
    This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications.
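    In practice, the two stages analyzed here correspond directly to parameters exposed by common implementations; a minimal scikit-learn example (data and settings are illustrative):

        # The early exaggeration stage analyzed in the paper is controlled by
        # TSNE's `early_exaggeration` parameter in scikit-learn.
        import numpy as np
        from sklearn.manifold import TSNE

        rng = np.random.default_rng(0)
        X = np.vstack([rng.normal(c, 0.5, size=(100, 50))   # 3 clusters, 50-D
                       for c in (0.0, 5.0, 10.0)])

        emb = TSNE(n_components=2, early_exaggeration=12.0,
                   perplexity=30.0, init="pca",
                   random_state=0).fit_transform(X)
        print(emb.shape)   # (300, 2); clusters separate during exaggeration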
    Evaluation of a blockchain-enabled resource management mechanism for NGNs. (arXiv:2211.00457v1 [cs.NI])
    A new era in ICT has begun with the evolution of Next Generation Networks (NGNs) and the development of human-centric applications. Ultra-low latency, high throughput, and high availability are a few of the main characteristics of modern networks. Network Providers (NPs) are responsible for the development and maintenance of network infrastructures ready to support the most demanding applications that should be available not only in urban areas but in every corner of the earth. The NPs must collaborate to offer high-quality services and keep their overall cost low. The collaboration among competitive entities can in principle be regulated by a trusted 3rd party or by a distributed approach/technology which can guarantee integrity, security, and trust. This paper examines the use of blockchain technology for resource management and negotiation among NPs and presents the results of experiments conducted in a dedicated real testbed. The implementation of the resource management mechanism is described in a Smart Contract (SC), and the testbeds use the Raft and IBFT consensus mechanisms, respectively. The goal of this paper is two-fold: to assess the performance of the mechanism in terms of transaction throughput and latency, so as to determine the granularity at which this solution can operate (e.g., whether it can support resource re-allocation among NPs at the micro-service level), and to identify implementation-specific parameters, such as the consensus mechanism most suitable for this use case, based on performance metrics.
    Augmentation Invariant Manifold Learning. (arXiv:2211.00460v1 [stat.ML])
    Data augmentation is a widely used technique and an essential ingredient in the recent advance in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. To demystify the role of data augmentation, we develop a statistical framework on a low-dimensional product manifold to theoretically understand why the unlabeled augmented data can lead to useful data representation. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations. Compared with existing methods, the new data representation simultaneously exploits the manifold's geometric structure and invariant property of augmented data. Our theoretical investigation precisely characterizes how the data representation learned from augmented data can improve the $k$-nearest neighbor classifier in the downstream analysis, showing that a more complex data augmentation leads to more improvement in downstream analysis. Finally, numerical experiments on simulated and real datasets are presented to support the theoretical results in this paper.
    Task-Aware Network Coding Over Butterfly Network. (arXiv:2201.11917v2 [cs.IT] UPDATED)
    Network coding allows distributed information sources such as sensors to efficiently compress and transmit data to distributed receivers across a bandwidth-limited network. Classical network coding is largely task-agnostic -- the coding schemes mainly aim to faithfully reconstruct data at the receivers, regardless of what ultimate task the received data is used for. In this paper, we analyze a new task-driven network coding problem, where distributed receivers pass transmitted data through machine learning (ML) tasks, which provides an opportunity to improve efficiency by transmitting salient task-relevant data representations. Specifically, we formulate a task-aware network coding problem over a butterfly network in real-coordinate space, where lossy analog compression through principal component analysis (PCA) can be applied. A lower bound for the total loss function for the formulated problem is given, and necessary and sufficient conditions for achieving this lower bound are also provided. We introduce ML algorithms to solve the problem in the general case, and our evaluation demonstrates the effectiveness of task-aware network coding.
    Design of a Supervisory Control System for Autonomous Operation of Advanced Reactors. (arXiv:2209.04334v2 [eess.SY] UPDATED)
    Advanced reactors to be deployed in the coming decades will face deregulated energy markets, and may adopt flexible operation to boost profitability. To aid in the transition from baseload to flexible operation paradigm, autonomous operation is sought. This work focuses on the control aspect of autonomous operation. Specifically, a hierarchical control system is designed to support constraint enforcement during routine operational transients. Within the system, data-driven modeling, physics-based state observation, and classical control algorithms are integrated to provide an adaptable and robust solution. A 320 MW Fluoride-cooled High-temperature Pebble-bed Reactor is the design basis for demonstrating the control system. The hierarchical control system consists of a supervisory layer and low-level layer. The supervisory layer receives requests to change the system's operating conditions, and accepts or rejects them based on constraints that have been assigned. Constraints are issued to keep the plant within an optimal operating region. The low-level layer interfaces with the actuators of the system to fulfill requested changes, while maintaining tracking and regulation duties. To accept requests at the supervisory layer, the Reference Governor algorithm was adopted. To model the dynamics of the reactor, a system identification algorithm, Dynamic Mode Decomposition, was utilized. To estimate the evolution of process variables that cannot be directly measured, the Unscented Kalman Filter, incorporating a nonlinear model of nuclear dynamics, was adopted. The composition of these algorithms led to a numerical demonstration of constraint enforcement during a 40 % power drop transient. Adaptability was demonstrated by modifying the constraint values, and enforcing them during the transient. Robustness was demonstrated by enforcing constraints under noisy environments.
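    Of the algorithms composed here, Dynamic Mode Decomposition is the simplest to sketch: given snapshot matrices X (states at time t) and Y (states at t+1), fit a linear operator mapping X to Y through a truncated SVD. The toy linear system below stands in for the reactor's process variables.

        # Dynamic Mode Decomposition (illustrative; exact data, full rank).
        import numpy as np

        def dmd(X, Y, rank):
            U, s, Vh = np.linalg.svd(X, full_matrices=False)
            U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]
            A_tilde = U.conj().T @ Y @ Vh.conj().T @ np.diag(1.0 / s)
            return U @ A_tilde @ U.conj().T        # identified dynamics

        rng = np.random.default_rng(0)
        A_true = 0.95 * np.eye(8) + 0.05 * rng.normal(size=(8, 8))
        states = [rng.normal(size=8)]
        for _ in range(50):
            states.append(A_true @ states[-1])
        Z = np.array(states).T                     # snapshots as columns
        A_hat = dmd(Z[:, :-1], Z[:, 1:], rank=8)
        print(np.abs(A_hat - A_true).max())        # small identification error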
    Transfer Learning with Physics-Informed Neural Networks for Efficient Simulation of Branched Flows. (arXiv:2211.00214v1 [cs.LG])
    Physics-Informed Neural Networks (PINNs) offer a promising approach to solving differential equations and, more generally, to applying deep learning to problems in the physical sciences. We adopt a recently developed transfer learning approach for PINNs and introduce a multi-head model to efficiently obtain accurate solutions to nonlinear systems of ordinary differential equations with random potentials. In particular, we apply the method to simulate stochastic branched flows, a universal phenomenon in random wave dynamics. Finally, we compare the results achieved by feed-forward and GAN-based PINNs on two physically relevant transfer learning tasks and show that our methods provide significant computational speedups in comparison to standard PINNs trained from scratch.
    An Adversarial Approach to Structural Estimation. (arXiv:2007.06169v3 [econ.EM] UPDATED)
    We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates simulated observations using the structural model) and a discriminator (which classifies whether an observation is simulated). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to a model of the elderly's saving decisions and show that our estimator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v2 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs), and the fact that we cannot stack more layers to increase performance (as we usually do in other deep learning paradigms), are pervasively thought to be caused by limitations of the GCN layers, including insufficient expressive power. However, if so, for a fixed architecture it would be unlikely that changing only the training procedure could lower the training difficulty and improve performance, which we show in this paper is not only possible but possible in several ways. This paper first identifies the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to a significant decrease in training difficulty and a notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the other suspected limitations.
    Preserving In-Context Learning ability in Large Language Model Fine-tuning. (arXiv:2211.00635v1 [cs.CL])
    Pretrained large language models (LLMs) are strong in-context learners that are able to perform few-shot learning without changing model parameters. However, as we show, fine-tuning an LLM on any specific task generally destroys its in-context ability. We discover an important cause of this loss, format specialization, where the model overfits to the format of the fine-tuned task and is unable to output anything beyond this format. We further show that format specialization happens at the beginning of fine-tuning. To solve this problem, we propose Prompt Tuning with MOdel Tuning (ProMoT), a simple yet effective two-stage fine-tuning framework that preserves in-context abilities of the pretrained model. ProMoT first trains a soft prompt for the fine-tuning target task, and then fine-tunes the model itself with this soft prompt attached. ProMoT offloads task-specific formats into the soft prompt that can be removed when doing other in-context tasks. We fine-tune mT5 XXL with ProMoT on natural language inference (NLI) and English-French translation and evaluate the in-context abilities of the resulting models on 8 different NLP tasks. ProMoT achieves similar performance on the fine-tuned tasks compared with vanilla fine-tuning, but with much less reduction of in-context learning performance across the board. More importantly, ProMoT shows remarkable generalization ability on tasks that have different formats, e.g. fine-tuning on an NLI binary classification task improves the model's in-context ability to do summarization (+0.53 Rouge-2 score compared to the pretrained model), making ProMoT a promising method to build general purpose capabilities such as grounding and reasoning into LLMs with small but high quality datasets. When extended to sequential or multi-task training, ProMoT can achieve even better out-of-domain generalization performance.
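    The two-stage recipe is simple enough to sketch on a toy classifier; the tiny network, prompt size, and data below are stand-ins for an LLM and are purely illustrative.

        # Stage 1: train only a soft prompt with the backbone frozen.
        # Stage 2: fine-tune the backbone with that prompt attached.
        import torch
        import torch.nn as nn

        backbone = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))
        prompt = nn.Parameter(torch.zeros(8))      # soft prompt (assumed size)

        def forward(x):
            p = prompt.expand(x.size(0), -1)       # prepend prompt features
            return backbone(torch.cat([p, x], dim=1))

        x = torch.randn(256, 8)
        y = (x[:, 0] > 0).long()

        def train(params, steps):
            opt = torch.optim.Adam(params, lr=1e-2)
            for _ in range(steps):
                loss = nn.functional.cross_entropy(forward(x), y)
                opt.zero_grad(); loss.backward(); opt.step()

        for p in backbone.parameters():            # stage 1: prompt only
            p.requires_grad_(False)
        train([prompt], 200)
        for p in backbone.parameters():            # stage 2: model + prompt
            p.requires_grad_(True)
        train(list(backbone.parameters()) + [prompt], 200)
        # For other in-context tasks, the soft prompt is simply dropped.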
    MAZE: Data-Free Model Stealing Attack Using Zeroth-Order Gradient Estimation. (arXiv:2005.03161v2 [stat.ML] UPDATED)
    Model Stealing (MS) attacks allow an adversary with black-box access to a Machine Learning model to replicate its functionality, compromising the confidentiality of the model. Such attacks train a clone model by using the predictions of the target model for different inputs. The effectiveness of such attacks relies heavily on the availability of data necessary to query the target model. Existing attacks either assume partial access to the dataset of the target model or availability of an alternate dataset with semantic similarities. This paper proposes MAZE -- a data-free model stealing attack using zeroth-order gradient estimation. In contrast to prior works, MAZE does not require any data and instead creates synthetic data using a generative model. Inspired by recent works in data-free Knowledge Distillation (KD), we train the generative model using a disagreement objective to produce inputs that maximize disagreement between the clone and the target model. However, unlike the white-box setting of KD, where the gradient information is available, training a generator for model stealing requires performing black-box optimization, as it involves accessing the target model under attack. MAZE relies on zeroth-order gradient estimation to perform this optimization and enables a highly accurate MS attack. Our evaluation with four datasets shows that MAZE provides a normalized clone accuracy in the range of 0.91x to 0.99x, and outperforms even the recent attacks that rely on partial data (JBDA, clone accuracy 0.13x to 0.69x) and surrogate data (KnockoffNets, clone accuracy 0.52x to 0.97x). We also study an extension of MAZE in the partial-data setting and develop MAZE-PD, which generates synthetic data closer to the target distribution. MAZE-PD further improves the clone accuracy (0.97x to 1.0x) and reduces the queries required for the attack by 2x-24x.
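    The zeroth-order estimation MAZE relies on is a standard trick worth seeing in isolation: with only query access to a loss, approximate its gradient with finite differences along random directions. The quadratic toy loss below stands in for the clone/target disagreement objective.

        # Zeroth-order (black-box) gradient estimate via random directions.
        import numpy as np

        rng = np.random.default_rng(0)

        def zo_gradient(loss_fn, theta, n_dirs=20, mu=1e-3):
            g = np.zeros_like(theta)
            f0 = loss_fn(theta)
            for _ in range(n_dirs):
                u = rng.normal(size=theta.size)
                g += (loss_fn(theta + mu * u) - f0) / mu * u
            return g / n_dirs

        loss = lambda t: float(np.sum((t - 3.0) ** 2))   # queries only
        theta = np.zeros(10)
        for _ in range(300):
            theta -= 0.05 * zo_gradient(loss, theta)
        print(loss(theta))                                # close to zero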
    Graph Neural Networks for Multivariate Time Series Regression with Application to Seismic Data. (arXiv:2201.00818v3 [cs.LG] UPDATED)
    Machine learning, with its advances in deep learning, has shown great potential in analyzing time series. In many scenarios, however, additional information that can potentially improve the predictions is available. This is crucial for data that arise from e.g., sensor networks that contain information about sensor locations. Then, such spatial information can be exploited by modeling it via graph structures, along with the sequential (time series) information. Recent advances in adapting deep learning to graphs have shown potential in various tasks. However, these methods have not been adapted for time series tasks to a great extent. Most attempts have essentially consolidated around time series forecasting with small sequence lengths. Generally, these architectures are not well suited for regression or classification tasks where the value to be predicted does not depend strictly on the most recent values, but rather on the whole length of the time series. We propose TISER-GCN, a novel graph neural network architecture for processing, in particular, these long time series in a multivariate regression task. Our proposed model is tested on two seismic datasets containing earthquake waveforms, where the goal is to predict maximum intensity measurements of ground shaking at each seismic station. Our findings demonstrate promising results of our approach, with an average MSE reduction of 16.3% compared to the best-performing baselines. In addition, our approach matches the baseline scores by needing only half the input size. The results are discussed in depth with an additional ablation study.
    Mixed Reality Interface for Digital Twin of Plant Factory. (arXiv:2211.00597v1 [cs.HC])
    A simpler and more intuitive interface architecture is necessary for digital twins of plant factories. I suggest an immersive and interactive mixed reality interface for digital twin models of smart farming, aimed at remote work rather than simulation of components. The environment is constructed with a UI display and a streaming background scene, a real-time scene captured by a camera located in the plant factory and processed with deformable neural radiance fields. Users can monitor and control the remote plant factory facilities through an HMD- or 2D-display-based mixed reality environment. This paper also introduces the detailed concept and describes the system architecture for implementing the suggested mixed reality interface.
    Neuron with Steady Response Leads to Better Generalization. (arXiv:2111.15414v3 [cs.LG] UPDATED)
    Regularization can mitigate the generalization gap between training and inference by introducing inductive bias. Existing works have already proposed various inductive biases from diverse perspectives. However, none of them explores inductive bias from the perspective of class-dependent response distribution of individual neurons. In this paper, we conduct a substantial analysis of the characteristics of such distribution. Based on the analysis results, we articulate the Neuron Steadiness Hypothesis: the neuron with similar responses to instances of the same class leads to better generalization. Accordingly, we propose a new regularization method called Neuron Steadiness Regularization (NSR) to reduce neuron intra-class response variance. Based on the Complexity Measure, we theoretically guarantee the effectiveness of NSR for improving generalization. We conduct extensive experiments on Multilayer Perceptron, Convolutional Neural Networks, and Graph Neural Networks with popular benchmark datasets of diverse domains, which show that our Neuron Steadiness Regularization consistently outperforms the vanilla version of models with significant gain and low additional computational overhead.
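    The regularizer itself is essentially a per-class variance penalty; a hedged sketch follows, where the regularized layer and the weight 0.1 are illustrative choices.

        # Neuron steadiness penalty: intra-class response variance.
        import torch
        import torch.nn as nn

        feat = nn.Sequential(nn.Linear(20, 64), nn.ReLU())
        head = nn.Linear(64, 3)
        opt = torch.optim.Adam(list(feat.parameters()) + list(head.parameters()),
                               lr=1e-3)

        def nsr(h, y, n_classes=3):
            # Average over classes of per-neuron within-class variance.
            pen = 0.0
            for c in range(n_classes):
                hc = h[y == c]
                if hc.size(0) > 1:
                    pen = pen + hc.var(dim=0).mean()
            return pen / n_classes

        x = torch.randn(512, 20)
        y = torch.randint(0, 3, (512,))
        for _ in range(100):
            h = feat(x)
            loss = nn.functional.cross_entropy(head(h), y) + 0.1 * nsr(h, y)
            opt.zero_grad(); loss.backward(); opt.step()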
    Fault diagnosis for three-phase PWM rectifier based on deep feedforward network with transient synthetic features. (arXiv:2211.00228v1 [cs.LG])
    Three-phase PWM rectifiers are adopted extensively in industry because of their excellent properties and potential advantages. However, when an IGBT has an open-circuit fault, the system does not crash suddenly; instead, performance degrades, for instance through voltage fluctuations and current harmonics. A fault diagnosis method based on a deep feedforward network with transient synthetic features is proposed in this paper to reduce the dependence on fault mathematical models; it mainly uses the transient phase current to train the deep feedforward network classifier. Firstly, the features of the fault phase current are analyzed. Secondly, historical fault data after feature synthesis are employed to train the deep feedforward network classifier; the average fault diagnosis accuracy reaches 97.85% for transient synthetic fault data, and the classifier trained on the transient synthetic features gains more than 1% in performance compared with the original transient features. Finally, online fault diagnosis experiments show that the method can accurately locate the faulty IGBTs, and the final diagnosis result is determined from multiple groups of results, which increases the accuracy and reliability of the diagnosis.
    Angular upsampling in diffusion MRI using contextual HemiHex sub-sampling in q-space. (arXiv:2211.00240v1 [eess.IV])
    Artificial intelligence techniques (deep learning (DL) and machine learning (ML)) are widely used to address ill-posed problems in medical imaging that previously seemed intractable. Reducing the number of gradient directions while retaining the clinical features of high-angular-resolution (HAR) diffusion MR data is an important and challenging problem in the field. While DL/ML approaches are promising, it is important to incorporate relevant context for the data to ensure that maximum prior information is provided for the AI model to infer the posterior. In this paper, we introduce HemiHex (HH) subsampling to suggestively address training data sampling on q-space geometry, followed by nearest-neighbor regression training on the HH samples to upsample the dMRI data. Earlier studies have tried to use regression for upsampling dMRI data but suffer performance issues, as they fail to provide structured geometrical measures for inference. Our proposed approach is a geometrically optimized regression technique which infers the unknown q-space, thus addressing the limitations of the earlier studies.
    Gradient Flows for Regularized Stochastic Control Problems. (arXiv:2006.05956v4 [math.OC] UPDATED)
    This paper studies stochastic control problems with the action space taken to be the space of measures, regularized by the relative entropy. We identify a suitable metric space on which we construct a gradient flow for the measure-valued control process along which the cost functional is guaranteed to decrease. It is shown that any invariant measure of this gradient flow satisfies the Pontryagin optimality principle. If the problem we work with is sufficiently convex, the gradient flow converges exponentially fast. Furthermore, the optimal measure-valued control admits a Bayesian interpretation, which means that one can incorporate prior knowledge when solving the stochastic control problem. This work is motivated by a desire to extend the theoretical underpinning for the convergence of stochastic gradient type algorithms widely used in the reinforcement learning community to solve control problems.
    Learning Utilities and Equilibria in Non-Truthful Auctions. (arXiv:2007.01722v3 [cs.GT] UPDATED)
    In non-truthful auctions, agents' utility for a strategy depends on the strategies of the opponents and also the prior distribution over their private types; the set of Bayes Nash equilibria generally has an intricate dependence on the prior. Using the First Price Auction as our main demonstrating example, we show that $\tilde O(n / \epsilon^2)$ samples from the prior with $n$ agents suffice for an algorithm to learn the interim utilities for all monotone bidding strategies. As a consequence, this number of samples suffice for learning all approximate equilibria. We give almost matching (up to polylog factors) lower bound on the sample complexity for learning utilities. We also consider a setting where agents must pay a search cost to discover their own types. Drawing on a connection between this setting and the first price auction, discovered recently by Kleinberg et al. (2016), we show that $\tilde O(n / \epsilon^2)$ samples suffice for utilities and equilibria to be estimated in a near welfare-optimal descending auction in this setting. En route, we improve the sample complexity bound, recently obtained by Guo et al. (2021), for the Pandora's Box problem, which is a classical model for sequential consumer search.
    Data Leakage in Federated Averaging. (arXiv:2206.12395v3 [cs.LG] UPDATED)
    Recent attacks have shown that user data can be recovered from FedSGD updates, thus breaking privacy. However, these attacks are of limited practical relevance as federated learning typically uses the FedAvg algorithm. Compared to FedSGD, recovering data from FedAvg updates is much harder as: (i) the updates are computed at unobserved intermediate network weights, (ii) a large number of batches are used, and (iii) labels and network weights vary simultaneously across client steps. In this work, we propose a new optimization-based attack which successfully attacks FedAvg by addressing the above challenges. First, we solve the optimization problem using automatic differentiation, simulating the client's update so that the unobserved intermediate parameters it generates make the recovered labels and inputs match the received client update. Second, we address the large number of batches by relating images from different epochs with a permutation invariant prior. Third, we recover the labels by estimating the parameters of existing FedSGD attacks at every FedAvg step. On the popular FEMNIST dataset, we demonstrate that on average we successfully recover >45% of the client's images from realistic FedAvg updates computed on 10 local epochs of 10 batches each with 5 images, compared to only <10% using the baseline. Our findings show many real-world federated learning implementations based on FedAvg are vulnerable.
    Automated Imbalanced Learning. (arXiv:2211.00376v1 [cs.LG])
    Automated Machine Learning has grown very successful in automating the time-consuming, iterative tasks of machine learning model development. However, current methods struggle when the data is imbalanced. Since many real-world datasets are naturally imbalanced, and improper handling of this issue can lead to quite useless models, this issue should be handled carefully. This paper first introduces a new benchmark to study how different AutoML methods are affected by label imbalance. Second, we propose strategies to better deal with imbalance and integrate them into an existing AutoML framework. Finally, we present a systematic study which evaluates the impact of these strategies and find that their inclusion in AutoML systems significantly increases their robustness against label imbalance.
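    One way to integrate such strategies, in the spirit of the paper, is to make resampling a searchable pipeline step so the AutoML system tunes it jointly with the model; a minimal example with imbalanced-learn follows (the search space is a simplified assumption, not the paper's framework):

        # Resampling as a tunable step inside a model-selection search.
        from imblearn.over_sampling import SMOTE
        from imblearn.pipeline import Pipeline
        from sklearn.datasets import make_classification
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import GridSearchCV

        X, y = make_classification(n_samples=2000, weights=[0.95, 0.05],
                                   random_state=0)
        pipe = Pipeline([("smote", SMOTE(random_state=0)),
                         ("clf", RandomForestClassifier(random_state=0))])
        search = GridSearchCV(pipe,
                              {"smote__k_neighbors": [3, 5],
                               "clf__n_estimators": [100, 300]},
                              scoring="balanced_accuracy", cv=3)
        search.fit(X, y)
        print(search.best_params_, search.best_score_)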
    Optimal Complexity in Non-Convex Decentralized Learning over Time-Varying Networks. (arXiv:2211.00533v1 [cs.LG])
    Decentralized optimization with time-varying networks is an emerging paradigm in machine learning. It saves remarkable communication overhead in large-scale deep training and is more robust in wireless scenarios especially when nodes are moving. Federated learning can also be regarded as decentralized optimization with time-varying communication patterns alternating between global averaging and local updates. While numerous studies exist to clarify its theoretical limits and develop efficient algorithms, it remains unclear what the optimal complexity is for non-convex decentralized stochastic optimization over time-varying networks. The main difficulties lie in how to gauge the effectiveness when transmitting messages between two nodes via time-varying communications, and how to establish the lower bound when the network size is fixed (which is a prerequisite in stochastic optimization). This paper resolves these challenges and establishes the first complexity lower bound. We also develop a new decentralized algorithm that nearly attains the lower bound, showing the tightness of the lower bound and the optimality of our algorithm.
    Reliability-Aware Deployment of DNNs on In-Memory Analog Computing Architectures. (arXiv:2211.00590v1 [cs.LG])
    Conventional in-memory computing (IMC) architectures consist of analog memristive crossbars to accelerate matrix-vector multiplication (MVM), and digital functional units to realize nonlinear vector (NLV) operations in deep neural networks (DNNs). These designs, however, require energy-hungry signal conversion units which can dissipate more than 95% of the total power of the system. In-Memory Analog Computing (IMAC) circuits, on the other hand, remove the need for signal converters by realizing both MVM and NLV operations in the analog domain leading to significant energy savings. However, they are more susceptible to reliability challenges such as interconnect parasitic and noise. Here, we introduce a practical approach to deploy large matrices in DNNs onto multiple smaller IMAC subarrays to alleviate the impacts of noise and parasitics while keeping the computation in the analog domain.
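    Functionally, the deployment scheme amounts to tiling a large matrix-vector product across small subarrays and summing the partial results; a numerical sketch follows (the tile size is an illustrative assumption; in hardware each tile would map to one analog crossbar).

        # Tile a large MVM across small subarrays and sum partial products.
        import numpy as np

        def tiled_matvec(W, x, tile=64):
            m, n = W.shape
            y = np.zeros(m)
            for i in range(0, m, tile):
                for j in range(0, n, tile):
                    # Each (i, j) block would run on its own IMAC subarray.
                    y[i:i + tile] += W[i:i + tile, j:j + tile] @ x[j:j + tile]
            return y

        rng = np.random.default_rng(0)
        W, x = rng.normal(size=(512, 256)), rng.normal(size=256)
        assert np.allclose(tiled_matvec(W, x), W @ x)   # same result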
    Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. (arXiv:2101.05974v5 [cs.LG] UPDATED)
    Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further be applied to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks to keep the method inductive, and simultaneously establish the correlation between motifs. We further propose a neural-network model CAW-N to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated to predict links over 6 real temporal networks and uniformly outperforms previous SOTA methods by averaged 10% AUC gain in the inductive setting. CAW-N also outperforms previous methods in 4 out of the 6 networks in the transductive setting.
    End-to-End Optimization and Learning for Multiagent Ensembles. (arXiv:2211.00251v1 [cs.LG])
    Multiagent ensemble learning is an important class of algorithms aimed at creating accurate and robust machine learning models by combining predictions from individual agents. A key challenge for the design of these models is to create effective rules to combine individual predictions for any particular input sample. This paper addresses this challenge and proposes a unique integration of constrained optimization and learning to derive specialized consensus rules to compose accurate predictions from a pretrained ensemble. The resulting strategy, called end-to-end Multiagent ensemble Learning (e2e-MEL), learns to select appropriate predictors to combine for a particular input sample. The paper shows how to recast the ensemble learning task as a differentiable selection program that is trained end-to-end within the ensemble learning model. Results over standard benchmarks demonstrate the ability of e2e-MEL to substantially outperform conventional consensus rules in a variety of settings.
    Order-sensitive Neural Constituency Parsing. (arXiv:2211.00421v1 [cs.CL])
    We propose a novel algorithm that improves on the previous neural span-based CKY decoder for constituency parsing. In contrast to the traditional span-based decoding, where spans are combined only based on the sum of their scores, we introduce an order-sensitive strategy, where the span combination scores are more carefully derived from an order-sensitive basis. Our decoder can be regarded as a generalization over existing span-based decoders in determining a finer-grained scoring scheme for the combination of lower-level spans into higher-level spans, where we emphasize the order of the lower-level spans and use order-sensitive span scores as well as order-sensitive combination grammar rule scores to enhance prediction accuracy. We implement the proposed decoding strategy harnessing GPU parallelism and achieve a decoding speed on par with state-of-the-art span-based parsers. Using the previous state-of-the-art model without additional data as our baseline, we outperform it and improve the F1 score on the Penn Treebank Dataset by 0.26% and on the Chinese Treebank Dataset by 0.35%.
    Cooperative Distribution Alignment via JSD Upper Bound. (arXiv:2207.02286v2 [cs.LG] UPDATED)
    Unsupervised distribution alignment estimates a transformation that maps two or more source distributions to a shared aligned distribution given only samples from each distribution. This task has many applications including generative modeling, unsupervised domain adaptation, and socially aware learning. Most prior works use adversarial learning (i.e., min-max optimization), which can be challenging to optimize and evaluate. A few recent works explore non-adversarial flow-based (i.e., invertible) approaches, but they lack a unified perspective and are limited in efficiently aligning multiple distributions. Therefore, we propose to unify and generalize previous flow-based approaches under a single non-adversarial framework, which we prove is equivalent to minimizing an upper bound on the Jensen-Shannon Divergence (JSD). Importantly, our problem reduces to a min-min, i.e., cooperative, problem and can provide a natural evaluation metric for unsupervised distribution alignment. We show empirical results on both simulated and real-world datasets to demonstrate the benefits of our approach. Code is available at https://github.com/inouye-lab/alignment-upper-bound.
    Deep Learning for Global Wildfire Forecasting. (arXiv:2211.00534v1 [cs.LG])
    Climate change is expected to aggravate wildfire activity through the exacerbation of fire weather. Improving our capabilities to anticipate wildfires on a global scale is of utmost importance for mitigating their negative effects. In this work, we create a global fire dataset and demonstrate a prototype for predicting the presence of global burned areas on a sub-seasonal scale with the use of segmentation deep learning models. Particularly, we present an open-access global analysis-ready datacube, which contains a variety of variables related to the seasonal and sub-seasonal fire drivers (climate, vegetation, oceanic indices, human-related variables), as well as the historical burned areas and wildfire emissions for 2001-2021. We train a deep learning model, which treats global wildfire forecasting as an image segmentation task and skillfully predicts the presence of burned areas 8, 16, 32 and 64 days ahead of time. Our work motivates the use of deep learning for global burned area forecasting and paves the way towards improved anticipation of global wildfire patterns.
    MLBiNet: A Cross-Sentence Collective Event Detection Network. (arXiv:2105.09458v3 [cs.CL] CROSS LISTED)
    We consider the problem of collectively detecting multiple events, particularly in cross-sentence settings. The key to dealing with the problem is to encode semantic information and model event inter-dependency at a document-level. In this paper, we reformulate it as a Seq2Seq task and propose a Multi-Layer Bidirectional Network (MLBiNet) to capture the document-level association of events and semantic information simultaneously. Specifically, a bidirectional decoder is firstly devised to model event inter-dependency within a sentence when decoding the event tag vector sequence. Secondly, an information aggregation module is employed to aggregate sentence-level semantic and event tag information. Finally, we stack multiple bidirectional decoders and feed cross-sentence information, forming a multi-layer bidirectional tagging architecture to iteratively propagate information across sentences. We show that our approach provides significant improvement in performance compared to the current state-of-the-art results.
    Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning. (arXiv:2211.00247v1 [cs.LG])
    Goal-conditioned reinforcement learning (RL) is a promising direction for training agents that are capable of solving multiple tasks and reach a diverse set of objectives. How to \textit{specify} and \textit{ground} these goals in such a way that we can both reliably reach goals during training as well as generalize to new goals during evaluation remains an open area of research. Defining goals in the space of noisy and high-dimensional sensory inputs poses a challenge for training goal-conditioned agents, or even for generalization to novel goals. We propose to address this by learning factorial representations of goals and processing the resulting representation via a discretization bottleneck, for coarser goal specification, through an approach we call DGRL. We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups, by experimentally evaluating this method on tasks ranging from maze environments to complex robotic navigation and manipulation. Additionally, we prove a theorem lower-bounding the expected return on out-of-distribution goals, while still allowing for specifying goals with expressive combinatorial structure.
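    A discretization bottleneck of the kind DGRL applies can be sketched with a learned codebook and a straight-through gradient; the codebook size, dimensions, and encoder below are illustrative assumptions.

        # Snap goal embeddings to nearest codebook entries (straight-through).
        import torch
        import torch.nn as nn

        class DiscreteBottleneck(nn.Module):
            def __init__(self, n_codes=32, dim=16):
                super().__init__()
                self.codebook = nn.Parameter(torch.randn(n_codes, dim))

            def forward(self, z):
                d = torch.cdist(z, self.codebook)      # distances to codes
                q = self.codebook[d.argmin(dim=1)]     # nearest code per goal
                # Forward pass uses q; gradients flow straight through to z.
                return z + (q - z).detach()

        goal_enc = nn.Linear(8, 16)
        bottleneck = DiscreteBottleneck()
        g = torch.randn(4, 8)                   # raw, high-dimensional goals
        z_discrete = bottleneck(goal_enc(g))    # coarser goal specification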
    Higher-order mutual information reveals synergistic sub-networks for multi-neuron importance. (arXiv:2211.00416v1 [cs.LG])
    Quantifying which neurons are important with respect to the classification decision of a trained neural network is essential for understanding their inner workings. Previous work primarily attributed importance to individual neurons. In this work, we study which groups of neurons contain synergistic or redundant information using a multivariate mutual information method called the O-information. We observe the first layer is dominated by redundancy suggesting general shared features (i.e. detecting edges) while the last layer is dominated by synergy indicating local class-specific features (i.e. concepts). Finally, we show the O-information can be used for multi-neuron importance. This can be demonstrated by re-training a synergistic sub-network, which results in a minimal change in performance. These results suggest our method can be used for pruning and unsupervised representation learning.
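    For discrete variables the O-information has a closed plug-in form, Omega = (n - 2) H(X) + sum_j [H(X_j) - H(X with j removed)], with positive values indicating redundancy and negative values synergy; the sketch below checks both signs on toy data (real activations would first need binning, which is omitted here).

        # Plug-in O-information for discrete samples (rows = observations).
        import numpy as np

        def entropy(samples):
            _, counts = np.unique(samples, axis=0, return_counts=True)
            p = counts / counts.sum()
            return -np.sum(p * np.log2(p))

        def o_information(X):
            n = X.shape[1]
            omega = (n - 2) * entropy(X)
            for j in range(n):
                omega += entropy(X[:, [j]]) - entropy(np.delete(X, j, axis=1))
            return omega

        rng = np.random.default_rng(0)
        a = rng.integers(0, 2, size=(5000, 2))
        xor = (a[:, 0] ^ a[:, 1])[:, None]      # synergistic triple
        copies = np.repeat(rng.integers(0, 2, (5000, 1)), 3, axis=1)
        print(o_information(np.hstack([a, xor])))   # about -1 (synergy)
        print(o_information(copies))                # about +1 (redundancy)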
    Exploring Effects of Computational Parameter Changes to Image Recognition Systems. (arXiv:2211.00471v1 [cs.LG])
    Image recognition tasks typically use deep learning and require enormous processing power, thus relying on hardware accelerators like GPUs and FPGAs for fast, timely processing. Failure in real-time image recognition tasks can occur due to incorrect mapping on hardware accelerators, which may lead to timing uncertainty and incorrect behavior. Owing to the increased use of image recognition tasks in safety-critical applications like autonomous driving and medical imaging, it is imperative to assess their robustness to changes in the computational environment as parameters like deep learning frameworks, compiler optimizations for code generation, and hardware devices are not regulated with varying impact on model performance and correctness. In this paper we conduct robustness analysis of four popular image recognition models (MobileNetV2, ResNet101V2, DenseNet121 and InceptionV3) with the ImageNet dataset, assessing the impact of the following parameters in the model's computational environment: (1) deep learning frameworks; (2) compiler optimizations; and (3) hardware devices. We report sensitivity of model performance in terms of output label and inference time for changes in each of these environment parameters. We find that output label predictions for all four models are sensitive to choice of deep learning framework (by up to 57%) and insensitive to other parameters. On the other hand, model inference time was affected by all environment parameters with changes in hardware device having the most effect. The extent of effect was not uniform across models.
    Efficient Graph Neural Network Inference at Large Scale. (arXiv:2211.00495v1 [cs.LG])
    Graph neural networks (GNNs) have demonstrated excellent performance in a wide range of applications. However, the enormous size of large-scale graphs hinders their applications under real-time inference scenarios. Although existing scalable GNNs leverage linear propagation to preprocess the features and accelerate the training and inference procedure, these methods still suffer from scalability issues when making inferences on unseen nodes, as the feature preprocessing requires the graph to be known and fixed. To speed up the inference in the inductive setting, we propose a novel adaptive propagation order approach that generates the personalized propagation order for each node based on its topological information. This can successfully avoid the redundant computation of feature propagation. Moreover, the trade-off between accuracy and inference latency can be flexibly controlled by simple hyper-parameters to match different latency constraints of application scenarios. To compensate for the potential inference accuracy loss, we further propose Inception Distillation to exploit the multi-scale reception information and improve the inference performance. Extensive experiments are conducted on four public datasets with different scales and characteristics, and the experimental results show that our proposed inference acceleration framework outperforms the SOTA graph inference acceleration baselines in terms of both accuracy and efficiency. In particular, the advantage of our proposed method is more significant on larger-scale datasets, and our framework achieves $75\times$ inference speedup on the largest Ogbn-products dataset.
    Robust Direct Learning for Causal Data Fusion. (arXiv:2211.00249v1 [stat.ML])
    In the era of big data, the explosive growth of multi-source heterogeneous data offers many exciting challenges and opportunities for improving the inference of conditional average treatment effects. In this paper, we investigate homogeneous and heterogeneous causal data fusion problems under a general setting that allows for the presence of source-specific covariates. We provide a direct learning framework for integrating multi-source data that separates the treatment effect from other nuisance functions, and achieves double robustness against certain misspecification. To improve estimation precision and stability, we propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory; it assigns larger weights to samples containing more causal information with high interpretability. We introduce a two-step algorithm, the weighted multi-source direct learner, based on constructing a pseudo-outcome and regressing it on covariates under a weighted least square criterion; it offers us a powerful tool for causal data fusion, enjoying the advantages of easy implementation, double robustness and model flexibility. In simulation studies, we demonstrate the effectiveness of our proposed methods in both homogeneous and heterogeneous causal data fusion scenarios.
    Improving Variational Autoencoders with Density Gap-based Regularization. (arXiv:2211.00321v1 [cs.LG])
    Variational autoencoders (VAEs) are among the most powerful unsupervised learning frameworks in NLP for latent representation learning and latent-directed generation. The classic optimization goal of VAEs is to maximize the Evidence Lower Bound (ELBo), which consists of a conditional likelihood for generation and a negative Kullback-Leibler (KL) divergence for regularization. In practice, optimizing the ELBo often leads the posterior distributions of all samples to converge to the same degenerate local optimum, a phenomenon known as posterior collapse or KL vanishing. Effective methods have been proposed to prevent posterior collapse in VAEs, but we observe that they in essence trade off posterior collapse against the hole problem, i.e., the mismatch between the aggregated posterior distribution and the prior distribution. To this end, we introduce new training objectives that tackle both problems through a novel regularization based on the probabilistic density gap between the aggregated posterior distribution and the prior distribution. Through experiments on language modeling, latent space visualization and interpolation, we show that our proposed method can solve both problems effectively and thus outperforms existing methods in latent-directed generation. To the best of our knowledge, we are the first to jointly solve the hole problem and posterior collapse.
    Fast and parallel decoding for transducer. (arXiv:2211.00484v1 [eess.AS])
    The transducer architecture is becoming increasingly popular in the field of speech recognition because it is naturally streaming as well as high in accuracy. One of the drawbacks of the transducer is that it is difficult to decode in a fast and parallel way due to the unconstrained number of symbols that can be emitted per time step. In this work, we introduce a constrained version of the transducer loss to learn strictly monotonic alignments between the sequences; we also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step in transducer decoding, making it more efficient to decode in parallel with batches. Furthermore, we propose a finite-state automaton (FSA) based parallel beam search algorithm that can run efficiently with graphs on GPU. The experimental results show that we achieve a slight word error rate (WER) improvement as well as a significant speedup in decoding. Our work is open-sourced and publicly available\footnote{https://github.com/k2-fsa/icefall}.
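    A minimal sketch of the capped-emission idea described above, assuming hypothetical `joiner` and `predictor` callables rather than the icefall API: bounding the symbols emitted per frame bounds the inner loop, which is what makes batched, parallel decoding tractable.

```python
# Hedged sketch of greedy transducer decoding with a cap on symbols per
# frame. `joiner`, `predictor`, and `init_state` are hypothetical
# stand-ins for a real transducer's components.
import numpy as np

def constrained_greedy_search(encoder_out, joiner, predictor, init_state,
                              blank_id=0, max_sym_per_frame=1):
    hyp, state = [], init_state
    for enc_t in encoder_out:                 # iterate over frames
        for _ in range(max_sym_per_frame):    # bounded inner loop
            log_probs = joiner(enc_t, state)  # (vocab,) log-probabilities
            token = int(np.argmax(log_probs))
            if token == blank_id:             # blank: advance to next frame
                break
            hyp.append(token)
            state = predictor(token, state)   # feed emitted symbol back
    return hyp
```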
    Delay-penalized transducer for low-latency streaming ASR. (arXiv:2211.00490v1 [eess.AS])
    In streaming automatic speech recognition (ASR), it is desirable to reduce latency as much as possible while having minimal impact on recognition accuracy. Although a few existing methods are able to achieve this goal, they are difficult to implement due to their dependency on external alignments. In this paper, we propose a simple way to penalize symbol delay in the transducer model, so that we can balance the trade-off between symbol delay and accuracy for streaming models without external alignments. Specifically, our method adds a small constant times (T/2 - t), where T is the number of frames and t is the current frame, to all the non-blank log-probabilities (after normalization) that are fed into the two-dimensional transducer recursion. For both streaming Conformer models and unidirectional long short-term memory (LSTM) models, experimental results show that it can significantly reduce the symbol delay with an acceptable performance degradation. Our method achieves a delay-accuracy trade-off similar to the previously published FastEmit, but we believe our method is preferable because it has a better justification: it is equivalent to penalizing the average symbol delay. Our work is open-sourced and publicly available (https://github.com/k2-fsa/k2).
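    The penalty itself is a one-liner; the hedged sketch below applies it to hypothetical joiner log-probabilities (the tensor shapes and the `penalty` name are our assumptions, not the k2 implementation).

```python
# Sketch of the delay penalty described above: add penalty * (T/2 - t)
# to the non-blank log-probabilities before the transducer recursion.
# Early frames (small t) get a positive boost for non-blank symbols,
# encouraging earlier emission; late frames are penalized.
import torch

def apply_delay_penalty(logits: torch.Tensor, blank_id: int,
                        penalty: float) -> torch.Tensor:
    """logits: (batch, T, U, vocab) joiner output, log-normalized."""
    B, T, U, V = logits.shape
    t = torch.arange(T, dtype=logits.dtype).view(1, T, 1, 1)
    offset = penalty * (T / 2.0 - t)                   # per-frame offset
    penalized = logits + offset                        # add to every symbol...
    penalized[..., blank_id] = logits[..., blank_id]   # ...except blank
    return penalized
```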
    Optimization of Oblivious Decision Tree Ensembles Evaluation for CPU. (arXiv:2211.00391v1 [cs.LG])
    CatBoost is a popular machine learning library. CatBoost models are based on oblivious decision trees, making training and evaluation rapid. CatBoost has many applications, and some require low latency and high throughput evaluation. This paper investigates the possibilities for improving CatBoost's performance in single-core CPU computations. We explore the new features provided by the AVX instruction sets to optimize evaluation. We increase performance by 20-40% using AVX2 instructions without quality impact. We also introduce a new trade-off between speed and quality. Using float16 for leaf values and AVX-512 instructions, we achieve 50-70% speed-up.
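    To see why oblivious trees evaluate so quickly, consider the hedged sketch below (illustrative, not CatBoost's internals): every level shares one split, so the leaf index is just d packed comparison bits and the final lookup is branch-free, which is exactly the pattern the AVX versions vectorize across many examples per instruction.

```python
# Toy sketch of oblivious decision tree evaluation. All nodes at the
# same depth share one (feature, threshold) split, so evaluation is a
# fixed sequence of comparisons with no data-dependent branching.
import numpy as np

def eval_oblivious_tree(x, features, thresholds, leaf_values):
    """x: (n_examples, n_features); one tree of depth d = len(features)."""
    idx = np.zeros(x.shape[0], dtype=np.int64)
    for level, (f, t) in enumerate(zip(features, thresholds)):
        bits = (x[:, f] > t).astype(np.int64)   # one comparison per level
        idx |= bits << level                    # pack bits into leaf index
    return leaf_values[idx]                     # branch-free gather
```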
    Learning Task-Aware Effective Brain Connectivity for fMRI Analysis with Graph Neural Networks. (arXiv:2211.00261v1 [q-bio.NC])
    Functional magnetic resonance imaging (fMRI) has become one of the most common imaging modalities for brain function analysis. Recently, graph neural networks (GNN) have been adopted for fMRI analysis with superior performance. Unfortunately, traditional functional brain networks are mainly constructed based on similarities among regions of interest (ROIs), which are noisy, agnostic to the downstream prediction tasks, and can lead to inferior results for GNN-based models. To better adapt GNNs for fMRI analysis, we propose TBDS, an end-to-end framework based on \underline{T}ask-aware \underline{B}rain connectivity \underline{D}AG (short for Directed Acyclic Graph) \underline{S}tructure generation for fMRI analysis. The key component of TBDS is the brain network generator, which adopts a DAG learning approach to transform the raw time-series into task-aware brain connectivities. Besides, we design an additional contrastive regularization to inject task-specific knowledge during the brain network generation process. Comprehensive experiments on two fMRI datasets, namely the Adolescent Brain Cognitive Development (ABCD) and Philadelphia Neuroimaging Cohort (PNC) datasets, demonstrate the efficacy of TBDS. In addition, the generated brain networks also highlight the prediction-related brain regions and thus provide unique interpretations of the prediction results. Our implementation will be published at https://github.com/yueyu1030/TBDS upon acceptance.
    Rating Triggers for Collateral-Inclusive XVA via Machine Learning and SDEs on Lie Groups. (arXiv:2211.00326v1 [q-fin.RM])
    In this paper, we model the rating process of an entity by using a geometrical approach. We model rating transitions as an SDE on a Lie group. Specifically, we focus on calibrating the model to both historical data (rating transition matrices) and market data (CDS quotes) and compare the most popular choices of changes of measure to switch from the historical probability to the risk-neutral one. For this, we show how the classical Girsanov theorem can be applied in the Lie group setting. Moreover, we overcome some of the imperfections of rating matrices published by rating agencies, which are computed with the cohort method, by using a novel Deep Learning approach. This leads to an improvement of the entire scheme and makes the model more robust for applications. We apply our model to compute bilateral credit and debit valuation adjustments of a netting set under a CSA with thresholds depending on ratings of the two parties.
    Wavelet Neural Networks versus Wavelet-based Neural Networks. (arXiv:2211.00396v1 [cs.LG])
    This is the first paper in a sequence of studies in which we introduce a new type of neural networks (NNs) -- wavelet-based neural networks (WBNNs) -- and study their properties and potential for applications. We begin this study with a comparison to the currently existing type of wavelet neural networks (WNNs) and show that WBNNs vastly outperform WNNs. One reason for the vast superiority of WBNNs is their advanced hierarchical tree structure based on biorthonormal multiresolution analysis (MRA). Another reason for this is the implementation of our new idea to incorporate the wavelet tree depth into the neural width of the NN. The separation of the roles of wavelet depth and neural depth provides a conceptually and algorithmically simple but highly efficient methodology for sharp increase in functionality of swarm and deep WBNNs and rapid acceleration of the machine learning process.
    The Enemy of My Enemy is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training. (arXiv:2211.00525v1 [cs.CV])
    Although current deep learning techniques have yielded superior performance on various computer vision tasks, they are still vulnerable to adversarial examples. Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples. These methods usually regularize the difference between output probabilities for an adversarial example and its corresponding natural example. However, this may have a negative impact if the model misclassifies the natural example. To circumvent this issue, we propose a novel adversarial training scheme that encourages the model to produce similar outputs for an adversarial example and its ``inverse adversarial'' counterpart. These samples are generated to maximize the likelihood in the neighborhood of natural examples. Extensive experiments on various vision datasets and architectures demonstrate that our training method achieves state-of-the-art robustness as well as natural accuracy. Furthermore, using a universal version of inverse adversarial examples, we improve the performance of single-step adversarial training techniques at a low computational cost.
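    Under our reading of the abstract, an inverse adversarial example is a perturbation within an epsilon-ball that minimizes the classification loss (maximizes the likelihood of the true label); the sketch below is a sign-flipped PGD with illustrative hyperparameters, not the authors' exact generator.

```python
# Hedged sketch: PGD with the gradient sign flipped, so the iterate
# descends the loss within an eps-ball around the natural example.
import torch
import torch.nn.functional as F

def inverse_adversarial(model, x, y, eps=8/255, alpha=2/255, steps=5):
    x_inv = x.clone().detach()
    for _ in range(steps):
        x_inv.requires_grad_(True)
        loss = F.cross_entropy(model(x_inv), y)
        grad = torch.autograd.grad(loss, x_inv)[0]
        x_inv = x_inv.detach() - alpha * grad.sign()   # descend, not ascend
        x_inv = x + (x_inv - x).clamp(-eps, eps)       # project to eps-ball
        x_inv = x_inv.clamp(0, 1)                      # stay in image range
    return x_inv.detach()
```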
    Distributed Graph Neural Network Training: A Survey. (arXiv:2211.00216v1 [cs.LG])
    Graph neural networks (GNNs) are a type of deep learning model that learns over graphs and has been successfully applied in many domains. Despite their effectiveness, it is still challenging for GNNs to efficiently scale to large graphs. As a remedy, distributed computing has become a promising solution for training large-scale GNNs, since it is able to provide abundant computing resources. However, the dependency on the graph structure increases the difficulty of achieving high-efficiency distributed GNN training, which suffers from massive communication and workload imbalance. In recent years, many efforts have been made on distributed GNN training, and an array of training algorithms and systems have been proposed. Yet, there is a lack of a systematic review of the optimization techniques from graph processing to distributed execution. In this survey, we analyze three major challenges in distributed GNN training: massive feature communication, loss of model accuracy, and workload imbalance. Then we introduce a new taxonomy for the optimization techniques in distributed GNN training that addresses these challenges. The new taxonomy classifies existing techniques into four categories: GNN data partition, GNN batch generation, GNN execution model, and GNN communication protocol. We carefully discuss the techniques in each category. In the end, we summarize existing distributed GNN systems for multi-GPUs, GPU-clusters and CPU-clusters, respectively, and discuss future directions for scalable GNNs.
    DensePure: Understanding Diffusion Models towards Adversarial Robustness. (arXiv:2211.00322v1 [cs.LG])
    Diffusion models have recently been employed to improve certified robustness through the process of denoising. However, the theoretical understanding of why diffusion models are able to improve certified robustness is still lacking, preventing further improvement. In this study, we close this gap by analyzing the fundamental properties of diffusion models and establishing the conditions under which they can enhance certified robustness. This deeper understanding allows us to propose a new method, DensePure, designed to improve the certified robustness of a pretrained model (i.e. classifier). Given an (adversarial) input, DensePure consists of multiple runs of denoising via the reverse process of the diffusion model (with different random seeds) to get multiple reversed samples, which are then passed through the classifier, followed by majority voting of inferred labels to make the final prediction. This design of using multiple runs of denoising is informed by our theoretical analysis of the conditional distribution of the reversed sample. Specifically, when the data density of a clean sample is high, its conditional density under the reverse process of a diffusion model is also high; thus sampling from the latter conditional distribution can purify the adversarial example and return the corresponding clean sample with high probability. By using the highest density point in the conditional distribution as the reversed sample, we identify the robust region of a given instance under the diffusion model's reverse process. We show that this robust region is a union of multiple convex sets, and is potentially much larger than the robust regions identified in previous works. In practice, DensePure can approximate the label of the high-density region in the conditional distribution so that it can enhance certified robustness.
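    A minimal sketch of the prediction rule described above, with a hypothetical `denoise` standing in for one run of a real reverse diffusion process:

```python
# Several independent denoising runs through a diffusion model's
# reverse process, then majority voting over the classifier's labels.
# `denoise` is a hypothetical stand-in; seeds make the runs independent.
import torch

@torch.no_grad()
def densepure_predict(x, denoise, classifier, n_runs=10):
    votes = []
    for seed in range(n_runs):
        torch.manual_seed(seed)              # different reverse-process noise
        x_rev = denoise(x)                   # one run of reverse diffusion
        votes.append(classifier(x_rev).argmax(dim=-1))
    votes = torch.stack(votes, dim=0)        # (n_runs, batch)
    return votes.mode(dim=0).values          # majority-voted label
```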
    Zero Day Threat Detection Using Metric Learning Autoencoders. (arXiv:2211.00441v1 [cs.CR])
    The proliferation of zero-day threats (ZDTs) to companies' networks has been immensely costly and requires novel methods to scan traffic for malicious behavior at massive scale. The diverse nature of normal behavior along with the huge landscape of attack types makes deep learning methods an attractive option for their ability to capture highly-nonlinear behavior patterns. In this paper, the authors demonstrate an improvement upon a previously introduced methodology, which used a dual-autoencoder approach to identify ZDTs in network flow telemetry. In addition to the previously-introduced asset-level graph features, which help abstractly represent the role of a host in its network, this new model uses metric learning to train the second autoencoder on labeled attack data. This not only produces stronger performance, but it has the added advantage of improving the interpretability of the model by allowing for multiclass classification in the latent space. This can potentially save human threat hunters time when they investigate predicted ZDTs by showing them which known attack classes were nearby in the latent space. The models presented here are also trained and evaluated with two more datasets, and continue to show promising results even when generalizing to new network topologies.
    Transfer Learning with Kernel Methods. (arXiv:2211.00227v1 [cs.LG])
    Transfer learning refers to the process of adapting a model trained on a source task to a target task. While kernel methods are conceptually and computationally simple machine learning models that are competitive on a variety of tasks, it has been unclear how to perform transfer learning for kernel methods. In this work, we propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. In particular, we show that transferring modern kernels trained on large-scale image datasets can result in a substantial performance increase compared to using the same kernel trained directly on the target task. In addition, we show that transfer-learned kernels allow a more accurate prediction of the effect of drugs on cancer cell lines. For both applications, we identify simple scaling laws that characterize the performance of transfer-learned kernels as a function of the number of target examples. We explain this phenomenon in a simplified linear setting, where we are able to derive the exact scaling laws. By providing a simple and effective transfer learning framework for kernel methods, our work enables kernel methods trained on large datasets to be easily adapted to a variety of downstream target tasks.
    Detection of (Hidden) Emotions from Videos using Muscles Movements and Face Manifold Embedding. (arXiv:2211.00233v1 [cs.CV])
    We provide a new non-invasive, remotely accessible method for (hidden) emotion detection from videos of human faces that scales easily to large numbers of subjects. Our approach combines face manifold detection for accurate location of the face in the video with local face manifold embedding to create a common domain for the measurements of muscle micro-movements that is invariant to the movement of the subject in the video. In the next step, we employ Digital Image Speckle Correlation (DISC) and the optical flow algorithm to compute the pattern of micro-movements in the face. The corresponding vector field is mapped back to the original space and superimposed on the original frames of the videos. Hence, the resulting videos include additional information about the direction of the movement of the muscles in the face. We take the publicly available CK++ dataset of visible emotions and add to it videos of the same format but with hidden emotions. We process all the videos using our micro-movement detection and use the results to train a state-of-the-art network for emotion classification from videos -- Frame Attention Network (FAN). Although the original FAN model achieves very high out-of-sample performance on the original CK++ videos, it does not perform as well on hidden-emotion videos. The performance improves significantly when the model is trained and tested on videos with the vector fields of muscle movements. Intuitively, the corresponding arrows serve as edges in the image that are easily captured by the convolutional filters in the FAN network.
    Entity Matching by Pool-based Active Learning. (arXiv:2211.00311v1 [cs.LG])
    The goal of entity matching is to find the corresponding records representing the same real-world entity across different data sources. Among mainstream methods, rule-based entity matching requires tremendous domain knowledge, while machine-learning or deep-learning based entity matching methods need a large number of labeled samples to build the model, which is difficult to obtain in some applications. In addition, learning-based methods are prone to over-fitting, so the quality requirements for training samples are very high. In this paper, we present ALMatcher, an active learning method for entity matching tasks. This method requires manually labeling only a small number of valuable samples and uses these samples to build a high-quality model. We propose a hybrid uncertainty as the query strategy to find those valuable samples for labeling, which can minimize the number of labeled training samples while meeting the task requirements. The proposed method has been validated on seven datasets in different fields. Experiments show that ALMatcher uses only a small number of labeled samples and achieves better results compared to existing approaches.
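    The paper's exact hybrid-uncertainty formula is not given in this abstract; as a rough stand-in, the sketch below mixes two standard uncertainty signals (least-confidence and margin) to rank candidate pairs in a pool.

```python
# Illustrative pool-based query step in the spirit described above.
# The "hybrid" score here is our own simple mix, not ALMatcher's formula.
import numpy as np

def query(proba: np.ndarray, k: int, w: float = 0.5) -> np.ndarray:
    """proba: (n_pool, 2) match/non-match probabilities; returns k indices."""
    least_conf = 1.0 - proba.max(axis=1)                # low confidence
    sorted_p = np.sort(proba, axis=1)
    margin = 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])  # small margin
    hybrid = w * least_conf + (1 - w) * margin
    return np.argsort(-hybrid)[:k]                      # k most valuable pairs
```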
    Generalized Quadratic-Embeddings for Nonlinear Dynamics using Deep Learning. (arXiv:2211.00357v1 [math.DS])
    The engineering design process (e.g., control and forecasting) relies on mathematical modeling that describes the underlying dynamic behavior. For complex dynamic behavior, modeling procedures, as well as the resulting models, can be intricate, which can make the design process cumbersome. Therefore, it is desirable to have a common model structure, which is also simple enough, for all nonlinear dynamics to enhance design processes. The simplest dynamical model one can think of is linear, but linear models are often not expressive enough to capture complex dynamics. In this work, we propose a modeling approach for nonlinear dynamics and discuss a common framework to model nonlinear dynamic processes, which is built upon a \emph{lifting-principle}. The key idea of the principle is that smooth nonlinear systems can be written as quadratic systems in an appropriately lifted coordinate system without any approximation error. Hand-designing these coordinates is not straightforward. In this work, we utilize deep learning capabilities and discuss suitable neural network architectures to find such a coordinate system using data. We present innovative neural architectures and the corresponding objective criterion to achieve our goal. We illustrate the approach using data coming from applications in engineering and biology.
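    A standard worked example of such an exact lifting (ours, not taken from the paper) is the pendulum, which becomes quadratic in lifted coordinates:

```latex
% Pendulum: $\dot{x}_1 = x_2$, $\dot{x}_2 = -\sin x_1$ is not polynomial.
% Lifting $z_1 := \sin x_1$, $z_2 := \cos x_1$ gives an exactly quadratic
% system in $(x_2, z_1, z_2)$, with no approximation error:
\begin{align*}
\dot{z}_1 &= x_2 z_2, &
\dot{z}_2 &= -x_2 z_1, &
\dot{x}_2 &= -z_1.
\end{align*}
```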
    No-audio speaking status detection in crowded settings via visual pose-based filtering and wearable acceleration. (arXiv:2211.00549v1 [cs.CV])
    Recognizing who is speaking in a crowded scene is a key challenge towards understanding the social interactions going on within it. Detecting speaking status from body movement alone opens the door to the analysis of social scenes in which personal audio is not obtainable. Video and wearable sensors make it possible to recognize speaking in an unobtrusive, privacy-preserving way. When considering the video modality, in action recognition problems a bounding box is traditionally used to localize and segment out the target subject, to then recognize the action taking place within it. However, cross-contamination, occlusion, and the articulated nature of the human body make this approach challenging in a crowded scene. Here, we leverage articulated body poses both for subject localization and in the subsequent speech detection stage. We show that the selection of local features around pose keypoints has a positive effect on generalization performance while also significantly reducing the number of local features considered, making for a more efficient method. Using two in-the-wild datasets with different viewpoints of subjects, we investigate the role of cross-contamination in this effect. We additionally make use of acceleration measured through wearable sensors for the same task, and present a multimodal approach combining both methods.
    Time Series Alignment with Global Invariances. (arXiv:2002.03848v2 [stat.ML] UPDATED)
    Multivariate time series are ubiquitous objects in signal processing. Measuring a distance or similarity between two such objects is of prime interest in a variety of applications, including machine learning, but can be very difficult as soon as the temporal dynamics and the representation of the time series, {\em i.e.} the nature of the observed quantities, differ from one another. In this work, we propose a novel distance accounting both feature space and temporal variabilities by learning a latent global transformation of the feature space together with a temporal alignment, cast as a joint optimization problem. The versatility of our framework allows for several variants depending on the invariance class at stake. Among other contributions, we define a differentiable loss for time series and present two algorithms for the computation of time series barycenters under this new geometry. We illustrate the interest of our approach on both simulated and real world data and show the robustness of our approach compared to state-of-the-art methods.
    Infusing known operators in convolutional neural networks for lateral strain imaging in ultrasound elastography. (arXiv:2211.00172v1 [eess.IV])
    Convolutional Neural Networks (CNN) have been employed for displacement estimation in ultrasound elastography (USE). High-quality axial strains (the derivative of the axial displacement in the axial direction) can be estimated by such networks. In contrast to axial strain, lateral strain, which is highly needed in Poisson's ratio imaging and elasticity reconstruction, has poor quality. The main causes include low sampling frequency, limited motion, and lack of phase information in the lateral direction. Recently, the physically inspired constraint in unsupervised regularized elastography (PICTURE) method was proposed. This method takes into account the range of feasible lateral strains defined by the rules of the physics of motion and employs a regularization strategy to improve the lateral strains. Despite the substantial improvement, the regularization was only applied during training; hence there was no guarantee at test time that the lateral strain would lie within the feasible range. Furthermore, only the feasible range was employed; other constraints, such as incompressibility, were not investigated. In this paper, we address these two issues and propose kPICTURE, in which two iterative algorithms are infused into the network architecture in the form of known operators to ensure the lateral strain is within the feasible range and to impose incompressibility during the test phase.  ( 2 min )
    Logic-Based Explainability in Machine Learning. (arXiv:2211.00541v1 [cs.AI])
    The last decade witnessed an ever-increasing stream of successes in Machine Learning (ML). These successes offer clear evidence that ML is bound to become pervasive in a wide range of practical uses, including many that directly affect humans. Unfortunately, the operation of the most successful ML models is incomprehensible for human decision makers. As a result, the use of ML models, especially in high-risk and safety-critical settings, is not without concern. In recent years, there have been efforts on devising approaches for explaining ML models. Most of these efforts have focused on so-called model-agnostic approaches. However, all model-agnostic and related approaches offer no guarantees of rigor, hence being referred to as non-formal. For example, such non-formal explanations can be consistent with different predictions, which renders them useless in practice. This paper overviews the ongoing research efforts on computing rigorous model-based explanations of ML models; these are referred to as formal explanations. These efforts encompass a variety of topics that include the actual definitions of explanations, the characterization of the complexity of computing explanations, the currently best logical encodings for reasoning about different ML models, and also how to make explanations interpretable for human decision makers, among others.
    Homodyned K-distribution: parameter estimation and uncertainty quantification using Bayesian neural networks. (arXiv:2211.00175v1 [eess.SP])
    Quantitative ultrasound (QUS) allows estimating intrinsic tissue properties. Speckle statistics are the QUS parameters that describe the first-order statistics of ultrasound (US) envelope data. The parameters of the Homodyned K-distribution (HK-distribution) are the speckle statistics that can model the envelope data in diverse scattering conditions. However, they require a large amount of data to be estimated reliably. Consequently, quantifying the intrinsic uncertainty of the estimated parameters can help us better understand them. In this paper, we propose a Bayesian Neural Network (BNN) to estimate the parameters of the HK-distribution and quantify the uncertainty of the estimator.  ( 2 min )
    Adversarial Policies Beat Professional-Level Go AIs. (arXiv:2211.00241v1 [cs.LG])
    We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See https://goattack.alignmentfund.org/ for example games.
    Strategies for Optimizing End-to-End Artificial Intelligence Pipelines on Intel Xeon Processors. (arXiv:2211.00286v1 [cs.LG])
    End-to-end (E2E) artificial intelligence (AI) pipelines are composed of several stages including data preprocessing, data ingestion, defining and training the model, hyperparameter optimization, deployment, inference, postprocessing, followed by downstream analyses. To obtain an efficient E2E workflow, almost all stages of the pipeline must be optimized. Intel Xeon processors come with large memory capacities, bundled with AI acceleration (e.g., Intel Deep Learning Boost); they are well suited to running multiple instances of training and inference pipelines in parallel and have a low total cost of ownership (TCO). To showcase the performance on Xeon processors, we applied comprehensive optimization strategies coupled with software and hardware acceleration on a variety of E2E pipelines in the areas of Computer Vision, NLP, Recommendation systems, etc. We were able to achieve performance improvements ranging from 1.8x to 81.7x across different E2E pipelines. In this paper, we highlight the optimization strategies we adopted to achieve this performance on Intel Xeon processors with a set of eight different E2E pipelines.  ( 2 min )
    A Close Look into the Calibration of Pre-trained Language Models. (arXiv:2211.00151v1 [cs.CL])
    Pre-trained language models (PLMs) achieve remarkable performance on many downstream tasks, but may fail to give reliable estimates of their predictive uncertainty. Given the lack of a comprehensive understanding of PLM calibration, we take a close look at this new research problem, aiming to answer two questions: (1) Do PLMs learn to become calibrated in the training process? (2) How effective are existing calibration methods? For the first question, we conduct fine-grained control experiments to study the dynamic change in PLMs' calibration performance during training. We consider six factors as control variables, including dataset difficulty, available training samples, training steps, the number of tunable parameters, model scale, and pretraining. In experiments, we observe a consistent change in calibration performance across the six factors. We find that PLMs do not learn to become calibrated in training, evidenced by the continual increase in confidence regardless of whether the predictions are correct. We highlight that our finding contradicts two established conclusions: (a) larger PLMs are more calibrated; (b) pretraining improves model calibration. Next, we study the effectiveness of existing calibration methods in mitigating the overconfidence issue, in both in-distribution and various out-of-distribution settings. Besides unlearnable calibration methods, we adapt two recently proposed learnable methods that directly collect data to train models to have reasonable confidence estimations. We also propose extended learnable methods based on existing ones to further improve or maintain PLM calibration without sacrificing the original task performance. Experimental results show that learnable methods significantly reduce PLMs' confidence in wrong predictions, and our methods exhibit superior performance compared with previous methods.
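    For context, the sketch below shows one standard post-hoc calibration method of the kind evaluated in such studies (temperature scaling, fit on held-out validation logits); it is a generic illustration, not the paper's proposed learnable methods.

```python
# Minimal temperature-scaling sketch: learn one scalar T > 0 that
# rescales validation logits to minimize cross-entropy, then divide
# test-time logits by the same T before taking softmax confidences.
import torch
import torch.nn.functional as F

def fit_temperature(logits: torch.Tensor, labels: torch.Tensor) -> float:
    """logits: detached (n, classes) validation logits; returns T."""
    log_t = torch.zeros(1, requires_grad=True)    # T = exp(log_t) > 0
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return float(log_t.exp())
```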
    HFN: Heterogeneous Feature Network for Multivariate Time Series Anomaly Detection. (arXiv:2211.00277v1 [cs.LG])
    Network or physical attacks on industrial equipment or computer systems may cause massive losses. Therefore, quick and accurate anomaly detection (AD) based on monitoring data, especially multivariate time-series (MTS) data, is of great significance. As the key step of anomaly detection for MTS data, learning the relations among different variables has been explored by many approaches. However, most of the existing approaches do not consider the heterogeneity between variables, that is, different types of variables (continuous numerical variables, discrete categorical variables or hybrid variables) may have different and distinctive edge distributions. In this paper, we propose a novel semi-supervised anomaly detection framework based on a heterogeneous feature network (HFN) for MTS, learning heterogeneous structure information from a mass of unlabeled time-series data to improve the accuracy of anomaly detection, and using attention coefficients to provide an explanation for the detected anomalies. Specifically, we first combine the embedding-similarity subgraph generated by sensor embeddings and the feature-value-similarity subgraph generated by sensor values to construct a time-series heterogeneous graph, which fully utilizes the rich heterogeneous mutual information among variables. Then, a prediction model containing node and channel attention is jointly optimized to obtain better time-series representations. This approach fuses the state-of-the-art technologies of heterogeneous graph structure learning (HGSL) and representation learning. Experiments on four sensor datasets from real-world applications demonstrate that our approach detects anomalies more accurately than baseline approaches, thus providing a basis for the rapid positioning of anomalies.
    HDNet: Hierarchical Dynamic Network for Gait Recognition using Millimeter-Wave Radar. (arXiv:2211.00312v1 [cs.CV])
    Gait recognition is widely used in diversified practical applications. Currently, the most prevalent approach is to recognize human gait from RGB images, owing to the progress of computer vision technologies. Nevertheless, the perception capability of RGB cameras deteriorates in rough circumstances, and visual surveillance may cause privacy invasion. Due to the robustness and non-invasive feature of millimeter wave (mmWave) radar, radar-based gait recognition has attracted increasing attention in recent years. In this research, we propose a Hierarchical Dynamic Network (HDNet) for gait recognition using mmWave radar. In order to explore more dynamic information, we propose point flow as a novel point clouds descriptor. We also devise a dynamic frame sampling module to promote the efficiency of computation without deteriorating performance noticeably. To prove the superiority of our methods, we perform extensive experiments on two public mmWave radar-based gait recognition datasets, and the results demonstrate that our model is superior to existing state-of-the-art methods.
    Revisiting Heterophily in Graph Convolution Networks by Learning Representations Across Topological and Feature Spaces. (arXiv:2211.00565v1 [cs.LG])
    Graph convolution networks (GCNs) have been enormously successful in learning representations over several graph-based machine learning tasks. Specific to learning rich node representations, most of the methods have solely relied on the homophily assumption and have shown limited performance on heterophilous graphs. While several methods have been developed with new architectures to address heterophily, we argue that by learning graph representations across two spaces, i.e., the topological and feature spaces, GCNs can address heterophily. In this work, we experimentally demonstrate the performance of the proposed GCN framework on the semi-supervised node classification task over both homophilous and heterophilous graph benchmarks by learning and combining representations across the topological and the feature spaces.
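    A hedged sketch of the two-space idea: propagate features over (i) the given topology and (ii) a kNN graph built in feature space, then combine the two views. A single symmetric-normalized propagation stands in for a full GCN layer; all details are illustrative, not the paper's architecture.

```python
# Toy two-space node embedding: topology view + feature-space kNN view.
import numpy as np

def sym_norm(A: np.ndarray) -> np.ndarray:
    A = A + np.eye(A.shape[0])                   # add self-loops
    d = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d[:, None] * d[None, :]           # D^-1/2 (A+I) D^-1/2

def knn_graph(X: np.ndarray, k: int = 5) -> np.ndarray:
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    nn = np.argsort(dist, axis=1)[:, 1:k + 1]    # skip self (column 0)
    A = np.zeros_like(dist)
    rows = np.repeat(np.arange(X.shape[0]), k)
    A[rows, nn.ravel()] = 1.0
    return np.maximum(A, A.T)                    # symmetrize

def two_space_embedding(A_topo, X, k=5):
    h_topo = sym_norm(A_topo) @ X                # topology-space view
    h_feat = sym_norm(knn_graph(X, k)) @ X       # feature-space view
    return np.concatenate([h_topo, h_feat], axis=1)
```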
    SIMPLE-RC: Group Network Inference with Non-Sharp Nulls and Weak Signals. (arXiv:2211.00128v1 [stat.ML])
    Large-scale network inference with uncertainty quantification has important applications in natural, social, and medical sciences. The recent work of Fan, Fan, Han and Lv (2022) introduced a general framework of statistical inference on membership profiles in large networks (SIMPLE) for testing the sharp null hypothesis that a pair of given nodes share the same membership profiles. In real applications, there are often groups of nodes under investigation that may share similar membership profiles at the presence of relatively weaker signals than the setting considered in SIMPLE. To address these practical challenges, in this paper we propose a SIMPLE method with random coupling (SIMPLE-RC) for testing the non-sharp null hypothesis that a group of given nodes share similar (not necessarily identical) membership profiles under weaker signals. Utilizing the idea of random coupling, we construct our test as the maximum of the SIMPLE tests for subsampled node pairs from the group. Such technique reduces significantly the correlation among individual SIMPLE tests while largely maintaining the power, enabling delicate analysis on the asymptotic distributions of the SIMPLE-RC test. Our method and theory cover both the cases with and without node degree heterogeneity. These new theoretical developments are empowered by a second-order expansion of spiked eigenvectors under the $\ell_\infty$-norm, built upon our work for random matrices with weak spikes. Our theoretical results and the practical advantages of the newly suggested method are demonstrated through several simulation and real data examples.  ( 3 min )
    RGMIM: Region-Guided Masked Image Modeling for COVID-19 Detection. (arXiv:2211.00313v1 [cs.CV])
    Self-supervised learning has developed rapidly and has also advanced computer-aided diagnosis in the medical field. Masked image modeling (MIM) is one of the self-supervised learning methods that masks a portion of the input pixels and tries to predict them. Traditional MIM methods often use a random masking strategy. However, medical images often have a small region of interest for disease detection compared to ordinary images. For example, the regions outside the lungs do not contain information relevant to the decision, which may prevent a random masking strategy from learning enough information for COVID-19 detection. Hence, we propose a novel region-guided masked image modeling method (RGMIM) for COVID-19 detection in this paper. In our method, we design a new masking strategy that uses lung mask information to locate valid regions and learn more helpful information for COVID-19 detection. Experimental results show that RGMIM can outperform other state-of-the-art self-supervised learning methods on an open COVID-19 radiography dataset.
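    A minimal sketch of region-guided masking as we read it: sample masked patches only where a (block-averaged) lung mask is positive, instead of uniformly over the image. Patch size and masking ratio are illustrative assumptions.

```python
# Toy region-guided patch masking for MIM pretraining.
import numpy as np

def region_guided_mask(lung_mask, patch=16, ratio=0.6, rng=None):
    """lung_mask: (H, W) binary array; returns (H//patch, W//patch) bool."""
    if rng is None:
        rng = np.random.default_rng()
    h, w = lung_mask.shape[0] // patch, lung_mask.shape[1] // patch
    grid = lung_mask[:h * patch, :w * patch].reshape(h, patch, w, patch)
    valid = grid.mean(axis=(1, 3)) > 0.5         # patches inside the lung
    ids = np.flatnonzero(valid)
    chosen = rng.choice(ids, size=int(ratio * len(ids)), replace=False)
    mask = np.zeros(h * w, dtype=bool)
    mask[chosen] = True                          # True = masked for MIM
    return mask.reshape(h, w)
```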
    LinkFormer: Automatic Contextualised Link Recovery of Software Artifacts in both Project-based and Transfer Learning Settings. (arXiv:2211.00381v1 [cs.SE])
    Software artifacts often interact with each other throughout the software development cycle. Associating related artifacts is a common practice for effective documentation and maintenance of software projects. Conventionally, to register the link between an issue report and its associated commit, developers manually include the issue identifier in the message of the relevant commit. Research has shown that developers tend to forget to connect said artifacts manually, resulting in a loss of links. Hence, several link recovery techniques were proposed to discover and revive such links automatically. However, the literature mainly focuses on improving the prediction accuracy on a randomly-split test set, while neglecting other important aspects of this problem, including the effect of time and the generalizability of the predictive models. In this paper, we propose LinkFormer to address this problem from three aspects: 1) Accuracy: to better utilize contextual information for prediction, we employ the Transformer architecture and fine-tune multiple pre-trained models on the textual and metadata of issues and commits. 2) Data leakage: to empirically assess the impact of time through the splitting policy, we train and test our proposed model, along with several existing approaches, on both randomly- and temporally-split data. 3) Generalizability: to provide a generic model that can perform well across different projects, we further fine-tune LinkFormer in two transfer-learning settings. We empirically show that researchers should preserve the temporal flow of data when training learning-based models to resemble the real-world setting. In addition, LinkFormer significantly outperforms the state-of-the-art by large margins. LinkFormer is also capable of extending the knowledge it learned to unseen projects with little to no historical data.
    PELICAN: Permutation Equivariant and Lorentz Invariant or Covariant Aggregator Network for Particle Physics. (arXiv:2211.00454v1 [hep-ph])
    Many current approaches to machine learning in particle physics use generic architectures that require large numbers of parameters and disregard underlying physics principles, limiting their applicability as scientific modeling tools. In this work, we present a machine learning architecture that uses a set of inputs maximally reduced with respect to the full 6-dimensional Lorentz symmetry, and is fully permutation-equivariant throughout. We study the application of this network architecture to the standard task of top quark tagging and show that the resulting network outperforms all existing competitors despite much lower model complexity. In addition, we present a Lorentz-covariant variant of the same network applied to a 4-momentum regression task.
    CPG-RL: Learning Central Pattern Generators for Quadruped Locomotion. (arXiv:2211.00458v1 [cs.RO])
    In this letter, we present a method for integrating central pattern generators (CPGs), i.e. systems of coupled oscillators, into the deep reinforcement learning (DRL) framework to produce robust and omnidirectional quadruped locomotion. The agent learns to directly modulate the intrinsic oscillator setpoints (amplitude and frequency) and coordinate rhythmic behavior among different oscillators. This approach also allows the use of DRL to explore questions related to neuroscience, namely the role of descending pathways, interoscillator couplings, and sensory feedback in gait generation. We train our policies in simulation and perform a sim-to-real transfer to the Unitree A1 quadruped, where we observe robust behavior to disturbances unseen during training, most notably to a dynamically added 13.75 kg load representing 115% of the nominal quadruped mass. We test several different observation spaces based on proprioceptive sensing and show that our framework is deployable with no domain randomization and very little feedback, where along with the oscillator states, it is possible to provide only contact booleans in the observation space. Video results can be found at https://youtu.be/xqXHLzLsEV4.
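    A hedged sketch of the oscillator layer under our reading: per-leg amplitude and phase states track the setpoints (mu, omega) output by the policy, and foot targets are read off the oscillator states. The gains and the foot-trajectory mapping below are illustrative, not the authors' exact formulation.

```python
# Toy CPG layer: the DRL policy outputs per-leg setpoints (mu, omega);
# a common second-order amplitude dynamics converges r -> mu while the
# phase integrates omega. Foot targets are a simple readout of (r, theta).
import numpy as np

class CPG:
    def __init__(self, n_legs=4, a=50.0, dt=0.001):
        self.r = np.ones(n_legs)                 # amplitudes
        self.r_dot = np.zeros(n_legs)
        self.theta = np.zeros(n_legs)            # phases
        self.a, self.dt = a, dt

    def step(self, mu, omega):
        """mu, omega: per-leg setpoints from the policy (arrays of n_legs)."""
        r_ddot = self.a * (self.a / 4.0 * (mu - self.r) - self.r_dot)
        self.r_dot += r_ddot * self.dt
        self.r += self.r_dot * self.dt
        self.theta = (self.theta + omega * self.dt) % (2 * np.pi)
        return self.r * np.sin(self.theta)       # toy vertical foot targets
```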
    Improving Fairness in Image Classification via Sketching. (arXiv:2211.00168v1 [cs.CV])
    Fairness is a fundamental requirement for trustworthy and human-centered Artificial Intelligence (AI) systems. However, deep neural networks (DNNs) tend to make unfair predictions when the training data are collected from different sub-populations with different attributes (e.g., color, sex, age), leading to biased DNN predictions. We notice that this troubling phenomenon is often caused by the data itself, which means that bias information is encoded into the DNN along with the useful information (e.g., class information, semantic information). Therefore, we propose to use sketching to handle this phenomenon. Without losing the utility of the data, we explore image-to-sketching methods that can maintain useful semantic information for the target classification while filtering out the useless bias information. In addition, we design a fair loss to further improve model fairness. We evaluate our method through extensive experiments on both a general scene dataset and a medical scene dataset. Our results show that the desired image-to-sketching method improves model fairness and achieves satisfactory results compared with the state of the art.  ( 2 min )
    SOLAR: A Highly Optimized Data Loading Framework for Distributed Training of CNN-based Scientific Surrogates. (arXiv:2211.00224v1 [cs.DC])
    CNN-based surrogates have become prevalent in scientific applications to replace conventional time-consuming physical approaches. Although these surrogates can yield satisfactory results with significantly lower computation costs over small training datasets, our benchmarking results show that data-loading overhead becomes the major performance bottleneck when training surrogates with large datasets. In practice, surrogates are usually trained with high-resolution scientific data, which can easily reach the terabyte scale. Several state-of-the-art data loaders are proposed to improve the loading throughput in general CNN training; however, they are sub-optimal when applied to the surrogate training. In this work, we propose SOLAR, a surrogate data loader, that can ultimately increase loading throughput during the training. It leverages our three key observations during the benchmarking and contains three novel designs. Specifically, SOLAR first generates a pre-determined shuffled index list and accordingly optimizes the global access order and the buffer eviction scheme to maximize the data reuse and the buffer hit rate. It then proposes a tradeoff between lightweight computational imbalance and heavyweight loading workload imbalance to speed up the overall training. It finally optimizes its data access pattern with HDF5 to achieve a better parallel I/O throughput. Our evaluation with three scientific surrogates and 32 GPUs illustrates that SOLAR can achieve up to 24.4X speedup over PyTorch Data Loader and 3.52X speedup over state-of-the-art data loaders.  ( 3 min )
    Recognizing Nested Entities from Flat Supervision: A New NER Subtask, Feasibility and Challenges. (arXiv:2211.00301v1 [cs.CL])
    Many recent named entity recognition (NER) studies criticize flat NER for its non-overlapping assumption, and switch to investigating nested NER. However, existing nested NER models heavily rely on training data annotated with nested entities, while labeling such data is costly. This study proposes a new subtask, nested-from-flat NER, which corresponds to a realistic application scenario: given data annotated with flat entities only, one may still desire the trained model capable of recognizing nested entities. To address this task, we train span-based models and deliberately ignore the spans nested inside labeled entities, since these spans are possibly unlabeled entities. With nested entities removed from the training data, our model achieves 54.8%, 54.2% and 41.1% F1 scores on the subset of spans within entities on ACE 2004, ACE 2005 and GENIA, respectively. This suggests the effectiveness of our approach and the feasibility of the task. In addition, the model's performance on flat entities is entirely unaffected. We further manually annotate the nested entities in the test set of CoNLL 2003, creating a nested-from-flat NER benchmark. Analysis results show that the main challenges stem from the data and annotation inconsistencies between the flat and nested entities.
    Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise. (arXiv:2206.01095v2 [math.OC] UPDATED)
    Stochastic first-order methods such as Stochastic Extragradient (SEG) or Stochastic Gradient Descent-Ascent (SGDA) for solving smooth minimax problems and, more generally, variational inequality problems (VIP) have been gaining a lot of attention in recent years due to the growing popularity of adversarial formulations in machine learning. However, while high-probability convergence bounds are known to reflect the actual behavior of stochastic methods more accurately, most convergence results are provided in expectation. Moreover, the only known high-probability complexity results have been derived under restrictive sub-Gaussian (light-tailed) noise and bounded domain assumption [Juditsky et al., 2011]. In this work, we prove the first high-probability complexity results with logarithmic dependence on the confidence level for stochastic methods for solving monotone and structured non-monotone VIPs with non-sub-Gaussian (heavy-tailed) noise and unbounded domains. In the monotone case, our results match the best-known ones in the light-tails case [Juditsky et al., 2011], and are novel for structured non-monotone problems such as negative comonotone, quasi-strongly monotone, and/or star-cocoercive ones. We achieve these results by studying SEG and SGDA with clipping. In addition, we numerically validate that the gradient noise of many practical GAN formulations is heavy-tailed and show that clipping improves the performance of SEG/SGDA.
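    The clipping operator at the heart of these methods is simple; the sketch below shows one clipped Stochastic Extragradient step with illustrative step sizes (the high-probability convergence analysis is, of course, the paper's contribution).

```python
# One clipped-SEG step on a variational inequality with operator F:
# the stochastic operator estimate is rescaled into a norm ball of
# radius lam before both the extrapolation and the update.
import numpy as np

def clip(v: np.ndarray, lam: float) -> np.ndarray:
    n = np.linalg.norm(v)
    return v if n <= lam else v * (lam / n)

def clipped_seg_step(z, oracle, gamma=0.01, lam=1.0):
    """oracle(z) returns a stochastic estimate of F(z)."""
    g1 = clip(oracle(z), lam)            # clipped extrapolation gradient
    z_half = z - gamma * g1
    g2 = clip(oracle(z_half), lam)       # clipped update gradient
    return z - gamma * g2
```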
    Clustering-Based Approaches for Symbolic Knowledge Extraction. (arXiv:2211.00234v1 [cs.AI])
    Opaque models belonging to the machine learning world are ever more exploited in the most different application areas. These models, acting as black boxes (BB) from the human perspective, cannot be entirely trusted if the application is critical unless there exists a method to extract symbolic and human-readable knowledge out of them. In this paper we analyse a recurrent design adopted by symbolic knowledge extractors for BB regressors - that is, the creation of rules associated with hypercubic input space regions. We argue that this kind of partitioning may lead to suboptimal solutions when the data set at hand is high-dimensional or does not satisfy symmetric constraints. We then propose a (deep) clustering-based approach to be performed before symbolic knowledge extraction to achieve better performance with data sets of any kind.
    Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features. (arXiv:2211.00342v1 [cs.SD])
    Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models. Such MOS prediction models include MOSNet and LDNet that use spectral features as input, and SSL-MOS that relies on a pretrained self-supervised learning model that directly uses the speech signal as input. In modern high-quality neural TTS systems, prosodic appropriateness with regard to the spoken content is a decisive factor for speech naturalness. For this reason, we propose to include prosodic and linguistic features as additional inputs in MOS prediction systems, and evaluate their impact on the prediction outcome. We consider phoneme level F0 and duration features as prosodic inputs, as well as Tacotron encoder outputs, POS tags and BERT embeddings as higher-level linguistic inputs. All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations. Results show that the proposed additional features are beneficial in the MOS prediction task, by improving the predicted MOS scores' correlation with the ground truths, both at utterance-level and system-level predictions.
    Batch Active Learning from the Perspective of Sparse Approximation. (arXiv:2211.00246v1 [cs.LG])
    Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators. We study and propose a novel framework that formulates batch active learning from a sparse-approximation perspective. Our active learning method aims to find an informative subset of the unlabeled data pool such that the corresponding training loss function approximates its full-data-pool counterpart. We realize the framework as sparsity-constrained discontinuous optimization problems, which explicitly balance uncertainty and representation for large-scale applications and can be solved by greedy or proximal iterative hard thresholding algorithms. The proposed method can adapt to various settings, including both Bayesian and non-Bayesian neural networks. Numerical experiments show that our work achieves competitive performance across different settings with lower computational complexity.
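    As a toy instance of the sparse-approximation view (ours, not the paper's exact hard-thresholding solver), the sketch below greedily picks k pool points whose per-example gradient embeddings best match the summed pool gradient, a crude proxy for approximating the full-pool loss.

```python
# Greedy, matching-pursuit-style batch selection under a sparsity
# budget k; an illustrative stand-in for the paper's solvers.
import numpy as np

def greedy_batch(grads: np.ndarray, k: int) -> list:
    """grads: (n_pool, d) per-example gradient embeddings."""
    target = grads.sum(axis=0)                 # full-pool gradient
    residual, chosen = target.copy(), []
    for _ in range(k):
        scores = grads @ residual              # alignment with residual
        scores[chosen] = -np.inf               # select without replacement
        i = int(np.argmax(scores))
        chosen.append(i)
        residual = residual - grads[i]         # matching-pursuit update
    return chosen
```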
    SADT: Combining Sharpness-Aware Minimization with Self-Distillation for Improved Model Generalization. (arXiv:2211.00310v1 [cs.LG])
    Methods for improving deep neural network training times and model generalizability consist of various data augmentation, regularization, and optimization approaches, which tend to be sensitive to hyperparameter settings and make reproducibility more challenging. This work jointly considers two recent training strategies that address model generalizability: sharpness-aware minimization, and self-distillation, and proposes the novel training strategy of Sharpness-Aware Distilled Teachers (SADT). The experimental section of this work shows that SADT consistently outperforms previously published training strategies in model convergence time, test-time performance, and model generalizability over various neural architectures, datasets, and hyperparameter settings.
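    Under our reading, one SADT-flavored step uses a sharpness-aware weight perturbation to produce teacher outputs that then supervise the ordinary forward pass; rho, alpha, and the loss mix below are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: SAM-style ascent to w + rho*g/||g||, distill from the
# perturbed model's outputs, restore weights, then take a student step.
import torch
import torch.nn.functional as F

def sadt_step(model, opt, x, y, rho=0.05, alpha=0.5):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = F.cross_entropy(model(x), y)
    grads = torch.autograd.grad(loss, params)
    norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + 1e-12
    eps = [rho * g / norm for g in grads]
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.add_(e)                          # move to the sharp point
        teacher_logits = model(x).detach()     # perturbed-weights "teacher"
        for p, e in zip(params, eps):
            p.sub_(e)                          # restore original weights
    opt.zero_grad()
    logits = model(x)
    kd = F.kl_div(F.log_softmax(logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    (alpha * F.cross_entropy(logits, y) + (1 - alpha) * kd).backward()
    opt.step()
```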
    The future is different: Large pre-trained language models fail in prediction tasks. (arXiv:2211.00384v1 [cs.CL])
    Large pre-trained language models (LPLMs) have shown spectacular success when fine-tuned on downstream supervised tasks. Yet, it is known that their performance can drastically drop when there is a distribution shift between the data used during training and that used at inference time. In this paper we focus on data distributions that naturally change over time and introduce four new REDDIT datasets, namely the WALLSTREETBETS, ASKSCIENCE, THE DONALD, and POLITICS sub-reddits. First, we empirically demonstrate that LPLMs can display average performance drops of about 88% (in the best case!) when predicting the popularity of future posts from sub-reddits whose topic distribution changes with time. We then introduce a simple methodology that leverages neural variational dynamic topic models and attention mechanisms to infer temporal language model representations for regression tasks. Our models display performance drops of only about 40% in the worst cases (2% in the best ones) when predicting the popularity of future posts, while using only about 7% of the total number of parameters of LPLMs and providing interpretable representations that offer insight into real-world events, like the GameStop short squeeze of 2021.
    Hybrid CNN-Interpreter: Interpret local and global contexts for CNN-based Models. (arXiv:2211.00185v1 [cs.LG])
    Convolutional neural network (CNN) models have seen advanced improvements in performance in various domains, but lack of interpretability is a major barrier to assurance and regulation during operation for acceptance and deployment of AI-assisted applications. There have been many works on input interpretability focusing on analyzing the input-output relations, but the internal logic of models has not been clarified in the current mainstream interpretability methods. In this study, we propose a novel hybrid CNN-interpreter through: (1) An original forward propagation mechanism to examine the layer-specific prediction results for local interpretability. (2) A new global interpretability that indicates the feature correlation and filter importance effects. By combining the local and global interpretabilities, hybrid CNN-interpreter enables us to have a solid understanding and monitoring of model context during the whole learning process with detailed and consistent representations. Finally, the proposed interpretabilities have been demonstrated to adapt to various CNN-based model structures.
    ClassActionPrediction: A Challenging Benchmark for Legal Judgment Prediction of Class Action Cases in the US. (arXiv:2211.00582v1 [cs.CL])
    The research field of Legal Natural Language Processing (NLP) has been very active recently, with Legal Judgment Prediction (LJP) becoming one of the most extensively studied tasks. To date, most publicly released LJP datasets originate from countries with civil law. In this work, we release, for the first time, a challenging LJP dataset focused on class action cases in the US. It is the first dataset in the common law system that focuses on the harder and more realistic task involving the complaints as input instead of the often used facts summary written by the court. Additionally, we study the difficulty of the task by collecting expert human predictions, showing that even human experts reach only 53% accuracy on this dataset. Our Longformer model clearly outperforms the human baseline, reaching 63% accuracy despite considering only the first 2,048 tokens. Furthermore, we perform a detailed error analysis and find that the Longformer model is significantly better calibrated than the human experts. Finally, we publicly release the dataset and the code used for the experiments.
    FL Games: A Federated Learning Framework for Distribution Shifts. (arXiv:2211.00184v1 [cs.LG])
    Federated learning aims to train predictive models for data that is distributed across clients, under the orchestration of a server. However, participating clients typically each hold data from a different distribution, which can lead to catastrophic generalization failures on data from another client representing a new domain. In this work, we argue that in order to generalize better across non-i.i.d. clients, it is imperative to only learn correlations that are stable and invariant across domains. We propose FL GAMES, a game-theoretic framework for federated learning that learns causal features that are invariant across clients. While training to achieve the Nash equilibrium, the traditional best response strategy suffers from high-frequency oscillations. We demonstrate that FL GAMES effectively resolves this challenge and exhibits smooth performance curves. Further, FL GAMES scales well in the number of clients, requires significantly fewer communication rounds, and is agnostic to device heterogeneity. Through empirical evaluation, we demonstrate that FL GAMES achieves high out-of-distribution performance on various benchmarks.
    ARDIR: Improving Robustness using Knowledge Distillation of Internal Representation. (arXiv:2211.00239v1 [cs.LG])
    Adversarial training is the most promising method for learning robust models against adversarial examples. A recent study has shown that knowledge distillation between the same architectures is effective in improving the performance of adversarial training. Exploiting knowledge distillation is a new approach to improve adversarial training and has attracted much attention. However, its performance is still insufficient. Therefore, we propose Adversarial Robust Distillation with Internal Representation (ARDIR) to utilize knowledge distillation even more effectively. In addition to the output of the teacher model, ARDIR uses the internal representation of the teacher model as a label for adversarial training. This enables the student model to be trained with richer, more informative labels. As a result, ARDIR can learn more robust student models. We show that ARDIR outperforms previous methods in our experiments.
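    The exact ARDIR objective and weighting are not given above, so the following is only a sketch of the idea: distill both the teacher's softened outputs and its internal representation into the student during adversarial training. The names and coefficients are assumptions.

        import torch.nn.functional as F

        def ardir_style_loss(s_logits, s_feat, t_logits, t_feat, y,
                             alpha=0.5, beta=0.1, tau=4.0):
            """Cross-entropy on hard labels + KL to softened teacher logits
            + regression onto the teacher's internal features (all computed on
            adversarial examples; same-architecture teacher and student, so
            feature shapes match; coefficients are illustrative)."""
            ce = F.cross_entropy(s_logits, y)
            kd = F.kl_div(F.log_softmax(s_logits / tau, dim=1),
                          F.softmax(t_logits / tau, dim=1),
                          reduction="batchmean") * tau ** 2
            rep = F.mse_loss(s_feat, t_feat.detach())
            return ce + alpha * kd + beta * rep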
    Adversarial Training with Complementary Labels: On the Benefit of Gradually Informative Attacks. (arXiv:2211.00269v1 [cs.LG])
    Adversarial training (AT) with imperfect supervision is significant but receives limited attention. To push AT towards more practical scenarios, we explore a brand new yet challenging setting, i.e., AT with complementary labels (CLs), which specify a class that a data sample does not belong to. However, directly combining AT with existing methods for CLs fails consistently, whereas a simple two-stage training baseline does not. In this paper, we further explore the phenomenon and identify the underlying challenges of AT with CLs as intractable adversarial optimization and low-quality adversarial examples. To address the above problems, we propose a new learning strategy using gradually informative attacks, which consists of two critical components: 1) Warm-up Attack (Warm-up) gently raises the adversarial perturbation budget to ease the adversarial optimization with CLs; 2) Pseudo-Label Attack (PLA) incorporates progressively informative model predictions into a corrected complementary loss. Extensive experiments are conducted to demonstrate the effectiveness of our method on a range of benchmark datasets. The code is publicly available at: https://github.com/RoyalSkye/ATCL.  ( 2 min )
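    The Warm-up Attack component amounts to scheduling the perturbation budget. A one-line sketch with an assumed linear ramp (the paper's exact schedule is not specified here):

        def warmup_epsilon(epoch, warmup_epochs=30, eps_max=8 / 255):
            """Linearly raise the adversarial budget from 0 to eps_max,
            easing the adversarial optimization with complementary labels."""
            return eps_max * min(1.0, epoch / warmup_epochs)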
    Meta-Learning for Unsupervised Outlier Detection with Optimal Transport. (arXiv:2211.00372v1 [cs.LG])
    Automated machine learning has been widely researched and adopted in the field of supervised classification and regression, but progress in unsupervised settings has been limited. We propose a novel approach to automate outlier detection based on meta-learning from previous datasets with outliers. Our premise is that the selection of the optimal outlier detection technique depends on the inherent properties of the data distribution. In particular, we leverage optimal transport to find the dataset with the most similar underlying distribution, and then apply the outlier detection techniques that proved to work best for that data distribution. We evaluate the robustness of our approach and find that it outperforms state-of-the-art methods in unsupervised outlier detection. This approach can also be easily generalized to automate other unsupervised settings.
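    A rough sketch of the dataset-similarity step, assuming the POT library and uniform empirical weights (the paper's actual meta-features and OT setup may differ):

        import numpy as np
        import ot  # POT: Python Optimal Transport

        def dataset_distance(X_a, X_b, n_sub=500, seed=0):
            """Exact OT cost between subsampled empirical distributions."""
            rng = np.random.default_rng(seed)
            A = X_a[rng.choice(len(X_a), min(n_sub, len(X_a)), replace=False)]
            B = X_b[rng.choice(len(X_b), min(n_sub, len(X_b)), replace=False)]
            M = ot.dist(A, B)                 # pairwise squared Euclidean costs
            a = np.full(len(A), 1.0 / len(A))
            b = np.full(len(B), 1.0 / len(B))
            return ot.emd2(a, b, M)

        # Pick the most similar historical dataset, then reuse the outlier
        # detector that performed best on it:
        # best = min(history, key=lambda d: dataset_distance(X_new, d["X"]))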
    Informed Priors for Knowledge Integration in Trajectory Prediction. (arXiv:2211.00348v1 [cs.LG])
    Informed machine learning methods allow the integration of prior knowledge into learning systems. This can increase accuracy and robustness or reduce data needs. However, existing methods often assume hard, constraining knowledge that need not be traded off against observations but can be used to directly reduce the problem space. Other approaches represent prior knowledge through specific architectural changes, which limits applicability. We propose an informed machine learning method based on continual learning. It allows the integration of arbitrary prior knowledge, potentially from multiple sources, and does not require specific architectures. Furthermore, our approach enables probabilistic and multi-modal predictions, which can improve predictive accuracy and robustness. We exemplify our approach by applying it to a state-of-the-art trajectory predictor for autonomous driving. This domain is especially dependent on informed learning approaches, as it is subject to an overwhelmingly large variety of possible environments and very rare events, while requiring robust and accurate predictions. We evaluate our model on a commonly used benchmark dataset, using only data already available in a conventional setup. We show that our method outperforms both the non-informed and informed learning methods that are often used in the literature. Furthermore, we are able to compete with a conventional baseline, even when using half as many observation examples.
    Combined space-time reduced-order model with 3D deep convolution for extrapolating fluid dynamics. (arXiv:2211.00307v1 [physics.flu-dyn])
    There is a critical need for efficient and reliable active flow control strategies to reduce drag and noise in aerospace and marine engineering applications. While traditional full-order models based on the Navier-Stokes equations are computationally infeasible for active control, advanced model reduction techniques can be inefficient for active control tasks, especially with strong non-linearity and convection-dominated phenomena. Using convolutional recurrent autoencoder network architectures, deep learning-based reduced-order models have recently been shown to be effective while performing several orders of magnitude faster than full-order simulations. However, these models encounter significant challenges outside the training data, limiting their effectiveness for active control and optimization tasks. In this study, we aim to improve the extrapolation capability by modifying the network architecture and integrating coupled space-time physics as an implicit bias. Reduced-order models via deep learning generally employ decoupling in spatial and temporal dimensions, which can introduce modeling and approximation errors. To alleviate these errors, we propose a novel technique for learning coupled spatial-temporal correlation using a 3D convolution network. We assess the proposed technique against a standard encoder-propagator-decoder model and demonstrate superior extrapolation performance. To demonstrate the effectiveness of the 3D convolution network, we consider a benchmark problem of the flow past a circular cylinder at laminar flow conditions and use spatio-temporal snapshots from full-order simulations. Our proposed 3D convolution architecture accurately captures the velocity and pressure fields for varying Reynolds numbers. Compared to the standard encoder-propagator-decoder network, the spatio-temporal-based 3D convolution network improves the prediction range of Reynolds numbers outside of the training data.  ( 3 min )
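    A minimal PyTorch fragment showing the structural idea: treat time as a third convolution axis so spatial and temporal correlations are learned jointly rather than decoupled (channel count and layer sizes are illustrative, not the paper's architecture):

        import torch
        import torch.nn as nn

        # Input: (batch, channels=3 for u, v, p fields, T snapshots, H, W).
        encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1),  # space-time conv
            nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(),
        )
        x = torch.randn(1, 3, 16, 64, 64)   # 16 snapshots of a 64x64 field
        z = encoder(x)                      # jointly compressed space-time code
        print(z.shape)                      # torch.Size([1, 32, 4, 16, 16])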
    Learning to Navigate Wikipedia by Taking Random Walks. (arXiv:2211.00177v1 [cs.LG])
    A fundamental ability of an intelligent web-based agent is seeking out and acquiring new information. Internet search engines reliably find the correct vicinity but the top results may be a few links away from the desired target. A complementary approach is navigation via hyperlinks, employing a policy that comprehends local content and selects a link that moves it closer to the target. In this paper, we show that behavioral cloning of randomly sampled trajectories is sufficient to learn an effective link selection policy. We demonstrate the approach on a graph version of Wikipedia with 38M nodes and 387M edges. The model is able to efficiently navigate between nodes 5 and 20 steps apart 96% and 92% of the time, respectively. We then use the resulting embeddings and policy in downstream fact verification and question answering tasks where, in combination with basic TF-IDF search and ranking methods, they achieve results competitive with state-of-the-art methods.  ( 2 min )
    Evaluation Metrics for Symbolic Knowledge Extracted from Machine Learning Black Boxes: A Discussion Paper. (arXiv:2211.00238v1 [cs.AI])
    As opaque decision systems are being increasingly adopted in almost any application field, issues about their lack of transparency and human readability are a concrete concern for end-users. Amongst existing proposals to associate human-interpretable knowledge with accurate predictions provided by opaque models, there are rule extraction techniques, capable of extracting symbolic knowledge out of an opaque model. However, how to assess the level of readability of the extracted knowledge quantitatively is still an open issue. Finding such a metric would be the key, for instance, to enable automatic comparison between a set of different knowledge representations, paving the way for the development of parameter autotuning algorithms for knowledge extractors. In this paper we discuss the need for such a metric as well as the criticalities of readability assessment and evaluation, taking into account the most common knowledge representations while highlighting the most puzzling issues.  ( 2 min )
    Xtreme Margin: A Tunable Loss Function for Binary Classification Problems. (arXiv:2211.00176v1 [cs.LG])
    Loss functions drive the optimization of machine learning algorithms. The choice of a loss function can have a significant impact on the training of a model and on how the model learns the data. Binary classification is one of the major pillars of machine learning, used in applications ranging from medical imaging to failure detection. The most commonly used surrogate loss functions for binary classification are the binary cross-entropy and the hinge loss, which form the focus of our study. In this paper, we provide an overview of a novel loss function, the Xtreme Margin loss. Unlike the binary cross-entropy and the hinge loss functions, this loss function provides researchers and practitioners flexibility with their training process, from maximizing precision and AUC score to maximizing conditional accuracy for a particular class, through the tunable hyperparameters $\lambda_1$ and $\lambda_2$: changing their values alters the training of a model.  ( 2 min )
    Backtracking Counterfactuals. (arXiv:2211.00472v1 [cs.AI])
    Counterfactual reasoning -- envisioning hypothetical scenarios, or possible worlds, where some circumstances are different from what (f)actually occurred (counter-to-fact) -- is ubiquitous in human cognition. Conventionally, counterfactually-altered circumstances have been treated as "small miracles" that locally violate the laws of nature while sharing the same initial conditions. In Pearl's structural causal model (SCM) framework this is made mathematically rigorous via interventions that modify the causal laws while the values of exogenous variables are shared. In recent years, however, this purely interventionist account of counterfactuals has increasingly come under scrutiny from both philosophers and psychologists. Instead, they suggest a backtracking account of counterfactuals, according to which the causal laws remain unchanged in the counterfactual world; differences to the factual world are instead "backtracked" to altered initial conditions (exogenous variables). In the present work, we explore and formalise this alternative mode of counterfactual reasoning within the SCM framework. Despite ample evidence that humans backtrack, the present work constitutes, to the best of our knowledge, the first general account and algorithmisation of backtracking counterfactuals. We discuss our backtracking semantics in the context of related literature and draw connections to recent developments in explainable artificial intelligence (XAI).  ( 2 min )
    Consistent Training via Energy-Based GFlowNets for Modeling Discrete Joint Distributions. (arXiv:2211.00568v1 [cs.LG])
    Generative Flow Networks (GFlowNets) have demonstrated significant performance improvements for generating diverse discrete objects $x$ given a reward function $R(x)$ that indicates the utility of the object; $R$ is typically trained independently of the GFlowNet, by supervised learning, to predict a desirable property $y$ given $x$. We hypothesize that this can lead to \textit{incompatibility} between the inductive optimization biases in training $R$ and in training the GFlowNet, potentially leading to worse samples and slow adaptation to changes in the distribution. In this work, we build upon recent work on jointly learning energy-based models with GFlowNets and extend it to learn the joint over multiple variables, such as peptide sequences and their antimicrobial activity, which we call Joint Energy-Based GFlowNets (JEBGFNs). Joint learning of the energy-based model, used as a reward for the GFlowNet, can resolve the incompatibility, since the reward function $R$ and the GFlowNet sampler are trained jointly. We find that this joint training or joint energy-based formulation leads to significant improvements in generating antimicrobial peptides. As the training sequences arose out of evolutionary or artificial selection for high antibiotic activity, there is presumably some structure in the distribution of sequences that reveals information about the antibiotic activity, giving an advantage to modeling them jointly and generatively versus purely discriminatively. We also evaluate JEBGFN in an active learning setting for discovering antimicrobial peptides.
    UNFIS: A Novel Neuro-Fuzzy Inference System with Unstructured Fuzzy Rules for Classification. (arXiv:2211.00599v1 [cs.AI])
    An important constraint of Fuzzy Inference Systems (FIS) is their structured rules, defined by evaluating all input variables: the length of every fuzzy rule equals the number of input variables. However, in many decision-making problems, evaluating conditions on a limited set of input variables is sufficient to decide properly (unstructured rules). This constraint therefore limits the performance, generalization, and interpretability of the FIS. To address this issue, this paper presents a neuro-fuzzy inference system for classification applications that can select different sets of input variables for constructing each fuzzy rule. To realize this capability, a new fuzzy selector neuron with an adaptive parameter is proposed that can select input variables in the antecedent part of each fuzzy rule. Moreover, the consequent part of the Takagi-Sugeno-Kang FIS is also changed to consider only the selected set of input variables. To learn the parameters of the proposed architecture, a trust-region-based learning method (General quasi-Levenberg-Marquardt (GqLM)) is proposed to minimize cross-entropy in multiclass problems. The performance of the proposed method is compared with related previous approaches on several real-world classification problems. Based on these comparisons, the proposed method has better or very close performance with a parsimonious structure consisting of unstructured fuzzy rules.
    Adapter-Based Extension of Multi-Speaker Text-to-Speech Model for New Speakers. (arXiv:2211.00585v1 [eess.AS])
    Fine-tuning is a popular method for adapting text-to-speech (TTS) models to new speakers. However, this approach has some challenges: fine-tuning usually requires several hours of high-quality speech per speaker, and it can negatively affect the quality of speech synthesis for previously learnt speakers. In this paper we propose an alternative approach for TTS adaptation based on parameter-efficient adapter modules. In the proposed approach, a few small adapter modules are added to the original network. The original weights are frozen, and only the adapters are fine-tuned on speech for the new speaker. This parameter-efficient fine-tuning approach produces a new model with a high level of parameter sharing with the original model. Our experiments on the LibriTTS, HiFi-TTS and VCTK datasets validate the effectiveness of the adapter-based method through objective and subjective metrics.
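    A minimal PyTorch sketch of the kind of bottleneck adapter the paper describes; the placement, bottleneck size, and initialization below are assumptions:

        import torch
        import torch.nn as nn

        class Adapter(nn.Module):
            """Down-project, non-linearity, up-project, residual connection.
            Zero-initializing the up-projection makes the adapter start as an
            identity mapping, leaving the frozen model's behavior intact."""
            def __init__(self, dim, bottleneck=32):
                super().__init__()
                self.down = nn.Linear(dim, bottleneck)
                self.up = nn.Linear(bottleneck, dim)
                nn.init.zeros_(self.up.weight)
                nn.init.zeros_(self.up.bias)

            def forward(self, x):
                return x + self.up(torch.relu(self.down(x)))

        # Freeze the base TTS model; train only the adapters for a new speaker:
        # for p in tts_model.parameters(): p.requires_grad = False
        # for p in adapter.parameters(): p.requires_grad = True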
    Efficient Data Analytics on Augmented Similarity Triplets. (arXiv:1912.12064v2 [cs.LG] UPDATED)
    Data analysis requires a pairwise proximity measure over objects. Recent work has extended this to situations where the distance information between objects is given as comparisons of distances between three objects (triplets). Humans find such comparison tasks much easier than exact distance computation, and such data can easily be obtained in large quantities via crowd-sourcing. In this work, we propose triplets augmentation, an efficient method to extend the triplets data by inferring hidden implicit information from the existing data. Triplets augmentation improves the quality of kernel-based and kernel-free data analytics. We also propose a novel set of algorithms for common data analysis tasks based on triplets. These methods work directly with triplets and avoid kernel evaluations, and thus are scalable to big data. We demonstrate that our methods outperform the current best-known techniques and are robust to noisy data.
    CCS Explorer: Relevance Prediction, Extractive Summarization, and Named Entity Recognition from Clinical Cohort Studies. (arXiv:2211.00201v1 [cs.CL])
    Clinical Cohort Studies (CCS) are a great source of documented clinical research. Ideally, a clinical expert interprets these articles for exploratory analysis, ranging from drug discovery for evaluating the efficacy of existing drugs in tackling emerging diseases to the first tests of newly developed drugs. However, more than 100 CCS articles are published on PubMed every day. As a result, it can take days for a doctor to find articles and extract relevant information. Can we sift through the long list of these articles faster and document the crucial takeaways from each? In this work, we propose CCS Explorer, an end-to-end system for relevance prediction of sentences, extractive summarization, and patient, outcome, and intervention entity detection from CCS. CCS Explorer is packaged in a web-based graphical user interface where the user can provide any disease name. CCS Explorer then extracts and aggregates all relevant information from articles on PubMed based on the results of an automatically generated query produced on the back-end. CCS Explorer fine-tunes pre-trained transformer-based language models with additional layers for each of these tasks. We evaluate the models using two publicly available datasets. CCS Explorer obtains a recall of 80.2%, AUC-ROC of 0.843, and an accuracy of 88.3% on sentence relevance prediction using BioBERT, and achieves an average Micro F1-Score of 77.8% on Patient, Intervention, Outcome detection (PIO) using PubMedBERT. Thus, CCS Explorer can reliably extract relevant information to summarize articles, saving time by ~660$\times$.  ( 3 min )
    Where to start? Analyzing the potential value of intermediate models. (arXiv:2211.00107v1 [cs.CL])
    Previous studies observed that finetuned models may be better base models than the vanilla pretrained model. Such a model, finetuned on some source dataset, may provide a better starting point for a new finetuning process on a desired target dataset. Here, we perform a systematic analysis of this \emph{intertraining} scheme, over a wide range of English classification tasks. Surprisingly, our analysis suggests that the potential intertraining gain can be analyzed \emph{independently} for the target dataset under consideration, and for a base model being considered as a starting point. This is in contrast to the current perception that the alignment between the target dataset and the source dataset used to generate the base model is a major factor in determining intertraining success. We analyze different aspects that contribute to each. Furthermore, we leverage our analysis to propose a practical and efficient approach to determine if and how to select a base model in real-world settings. Last, we release an updated ranking of the best models in the HuggingFace hub per architecture: https://ibm.github.io/model-recycling/.  ( 2 min )
    A unified method of data assimilation and turbulence modeling for separated flows at high Reynolds numbers. (arXiv:2211.00601v1 [physics.flu-dyn])
    In recent years, machine learning methods represented by deep neural networks (DNN) have become a new paradigm of turbulence modeling. However, in the scenario of high Reynolds numbers, there are still some bottlenecks, including the lack of high-fidelity data and the convergence and stability problems in the coupling of turbulence models with the RANS solvers. In this paper, we propose an improved ensemble Kalman inversion method as a unified approach to data assimilation and turbulence modeling for separated flows at high Reynolds numbers. The trainable parameters of the DNN are optimized according to given experimental surface pressure coefficients, in a framework of mutual coupling between the RANS equations and DNN eddy-viscosity models. In this way, data assimilation and model training are combined into one step, efficiently yielding high-fidelity turbulence models that agree well with experiments. The effectiveness of the method is verified on separated flows around airfoils (S809) at high Reynolds numbers. The results show that through joint assimilation of very few experimental states, we obtain turbulence models that generalize well to both attached and separated flows at different angles of attack. The errors of lift coefficients at high angles of attack are reduced by more than a factor of three compared with the traditional SA model. The resulting models also perform well in terms of stability and robustness.
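    The core update behind ensemble Kalman inversion is compact. A NumPy sketch of one stochastic EKI step; the coupling to the RANS solver and the DNN eddy-viscosity model is not reproduced here:

        import numpy as np

        def eki_step(theta, G, y, gamma, rng):
            """theta: (J, p) parameter ensemble; G: (J, m) forward-model
            outputs; y: (m,) observations; gamma: (m, m) noise covariance."""
            J = theta.shape[0]
            dt = theta - theta.mean(axis=0)
            dg = G - G.mean(axis=0)
            C_tg = dt.T @ dg / J                     # (p, m) cross-covariance
            C_gg = dg.T @ dg / J                     # (m, m) output covariance
            K = C_tg @ np.linalg.inv(C_gg + gamma)   # Kalman gain
            # Perturb the observations per ensemble member (stochastic variant).
            y_pert = y + rng.multivariate_normal(np.zeros(len(y)), gamma, size=J)
            return theta + (y_pert - G) @ K.T        # updated ensemble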
    Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms. (arXiv:2211.00100v1 [stat.ML])
    This paper focuses on Bayesian inference in a federated learning (FL) context. While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtain a novel tight non-asymptotic upper bound on the Wasserstein distance to the global posterior for FALD. This bound highlights the effects of statistical heterogeneity, which causes a drift in the local updates that negatively impacts convergence. We propose a new algorithm VR-FALD* that uses control variates to correct the client drift. We establish non-asymptotic bounds showing that VR-FALD* is not affected by statistical heterogeneity. Finally, we illustrate our results on several FL benchmarks for Bayesian inference.  ( 2 min )
    Do LSTMs See Gender? Probing the Ability of LSTMs to Learn Abstract Syntactic Rules. (arXiv:2211.00153v1 [cs.CL])
    LSTMs trained on next-word prediction can accurately perform linguistic tasks that require tracking long-distance syntactic dependencies. Notably, model accuracy approaches human performance on number agreement tasks (Gulordava et al., 2018). However, we do not have a mechanistic understanding of how LSTMs perform such linguistic tasks. Do LSTMs learn abstract grammatical rules, or do they rely on simple heuristics? Here, we test gender agreement in French which requires tracking both hierarchical syntactic structures and the inherent gender of lexical units. Our model is able to reliably predict long-distance gender agreement in two subject-predicate contexts: noun-adjective and noun-passive-verb agreement. The model showed more inaccuracies on plural noun phrases with gender attractors compared to singular cases, suggesting a reliance on clues from gendered articles for agreement. Overall, our study highlights key ways in which LSTMs deviate from human behaviour and questions whether LSTMs genuinely learn abstract syntactic rules and categories. We propose using gender agreement as a useful probe to investigate the underlying mechanisms, internal representations, and linguistic capabilities of LSTM language models.  ( 2 min )
    The Numerical Stability of Hyperbolic Representation Learning. (arXiv:2211.00181v1 [cs.LG])
    Given the exponential growth of the volume of the ball w.r.t. its radius, the hyperbolic space is capable of embedding trees with arbitrarily small distortion and hence has received wide attention for representing hierarchical datasets. However, this exponential growth property comes at the price of numerical instability, such that training hyperbolic learning models can lead to catastrophic NaN problems, encountering unrepresentable values in floating point arithmetic. In this work, we carefully analyze the limitations of two popular models for the hyperbolic space, namely the Poincar\'e ball and the Lorentz model. We first show that, under a 64-bit arithmetic system, the Poincar\'e ball has a relatively larger capacity than the Lorentz model for correctly representing points. Then, we theoretically validate the superiority of the Lorentz model over the Poincar\'e ball from the perspective of optimization. Given the numerical limitations of both models, we identify one Euclidean parametrization of the hyperbolic space which can alleviate these limitations. We further extend this Euclidean parametrization to hyperbolic hyperplanes and demonstrate its ability to improve the performance of hyperbolic SVMs.  ( 2 min )
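    A small NumPy sketch of the two distance functions under comparison, with comments marking where 64-bit floating point breaks down (illustrative only; the paper's analysis is more detailed):

        import numpy as np

        def poincare_dist(x, y, eps=1e-15):
            """Poincare-ball distance. The (1 - ||x||^2) factors underflow as
            points approach the boundary, one source of NaNs in training."""
            num = 2.0 * np.sum((x - y) ** 2)
            den = (1.0 - np.sum(x ** 2)) * (1.0 - np.sum(y ** 2))
            return np.arccosh(1.0 + num / max(den, eps))

        def lorentz_dist(x, y):
            """Lorentz-model distance for points with <x, x>_L = -1. The
            Minkowski inner product suffers catastrophic cancellation when
            the leading coordinates are large."""
            inner = -x[0] * y[0] + np.dot(x[1:], y[1:])
            return np.arccosh(max(-inner, 1.0))   # clamp to arccosh's domain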
    Adaptive Compression for Communication-Efficient Distributed Training. (arXiv:2211.00188v1 [cs.LG])
    We propose Adaptive Compressed Gradient Descent (AdaCGD) - a novel optimization algorithm for communication-efficient training of supervised machine learning models with adaptive compression level. Our approach is inspired by the recently proposed three point compressor (3PC) framework of Richtarik et al. (2022), which includes error feedback (EF21), lazily aggregated gradient (LAG), and their combination as special cases, and offers the current state-of-the-art rates for these methods under weak assumptions. While the above mechanisms offer a fixed compression level, or adapt between two extremes only, our proposal is to perform a much finer adaptation. In particular, we allow the user to choose any number of arbitrarily chosen contractive compression mechanisms, such as Top-K sparsification with a user-defined selection of sparsification levels K, or quantization with a user-defined selection of quantization levels, or their combination. AdaCGD chooses the appropriate compressor and compression level adaptively during the optimization process. Besides i) proposing a theoretically-grounded multi-adaptive communication compression mechanism, we further ii) extend the 3PC framework to bidirectional compression, i.e., we allow the server to compress as well, and iii) provide sharp convergence bounds in the strongly convex, convex and nonconvex settings. The convex regime results are new even for several key special cases of our general mechanism, including 3PC and EF21. In all regimes, our rates are superior compared to all existing adaptive compression methods.  ( 2 min )
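    The contractive compressors the framework adapts over are simple objects. A PyTorch sketch of Top-K sparsification; the adaptive selection rule is AdaCGD's contribution and is not reproduced here:

        import torch

        def top_k(grad: torch.Tensor, k: int) -> torch.Tensor:
            """Keep the k largest-magnitude coordinates of the gradient."""
            flat = grad.flatten()
            out = torch.zeros_like(flat)
            idx = flat.abs().topk(k).indices
            out[idx] = flat[idx]
            return out.view_as(grad)

        # A client would transmit top_k(g, k) (values plus indices) instead of
        # g; AdaCGD picks k each round from a user-defined menu of levels,
        # trading gradient fidelity for communication.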
    TaTa: A Multilingual Table-to-Text Dataset for African Languages. (arXiv:2211.00142v1 [cs.CL])
    Existing data-to-text generation datasets are mostly limited to English. To address this lack of data, we create Table-to-Text in African languages (TaTa), the first large multilingual table-to-text dataset with a focus on African languages. We created TaTa by transcribing figures and accompanying text in bilingual reports by the Demographic and Health Surveys Program, followed by professional translation to make the dataset fully parallel. TaTa includes 8,700 examples in nine languages including four African languages (Hausa, Igbo, Swahili, and Yor\`ub\'a) and a zero-shot test language (Russian). We additionally release screenshots of the original figures for future research on multilingual multi-modal approaches. Through an in-depth human evaluation, we show that TaTa is challenging for current models and that less than half the outputs from an mT5-XXL-based model are understandable and attributable to the source data. We further demonstrate that existing metrics perform poorly for TaTa and introduce learned metrics that achieve a high correlation with human judgments. We release all data and annotations at https://github.com/google-research/url-nlp.  ( 2 min )
    Anytime Generation of Counterfactual Explanations for Text Classification. (arXiv:2211.00369v1 [cs.LG])
    In many machine learning applications, it is important for the user to understand the reasoning behind the recommendation or prediction of the classifiers. The learned models, however, are often too complicated to be understood by a human. Research from the social sciences indicates that humans prefer counterfactual explanations over alternatives. In this paper, we present a general framework for generating counterfactual explanations in the textual domain. Our framework is model-agnostic, representation-agnostic, domain-agnostic, and anytime. We model the task as a search problem in a space where the initial state is the classified text, and the goal state is a text in the complementary class. The operators transform a text by replacing parts of it. Our framework includes domain-independent operators, but can also exploit domain-specific knowledge through specialized operators. The search algorithm attempts to find a text from the complementary class with minimal word-level Levenshtein distance from the original classified object.
    Text-Only Training for Image Captioning using Noise-Injected CLIP. (arXiv:2211.00575v1 [cs.CV])
    We consider the task of image captioning using only the CLIP model and additional text data at training time, with no additional captioned images. Our approach relies on the fact that CLIP is trained to make visual and textual embeddings similar. Therefore, we only need to learn to translate CLIP textual embeddings back into text, which we can do by training a decoder for the frozen CLIP text encoder using text alone. We argue that this intuition is "almost correct" because of a gap between the embedding spaces, and propose to rectify this via noise injection during training. We demonstrate the effectiveness of our approach by showing SOTA zero-shot image captioning across four benchmarks, including style transfer. Code, data, and models are available on GitHub.
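    A sketch of the training step, with placeholder encoder, decoder, and tokenizer names and an assumed isotropic-Gaussian noise model (the paper's exact noise scale and normalization may differ):

        import torch

        def text_only_step(clip_text_encoder, decoder, tokenizer, captions,
                           sigma=0.016):
            """Embed a caption with the frozen CLIP text encoder, inject noise
            to bridge the image-text embedding gap, and train the decoder to
            reconstruct the caption. sigma is illustrative."""
            with torch.no_grad():
                z = clip_text_encoder(tokenizer(captions))
                z = z / z.norm(dim=-1, keepdim=True)
                z = z + sigma * torch.randn_like(z)   # noise injection
                z = z / z.norm(dim=-1, keepdim=True)
            return decoder(z, captions)               # e.g. LM cross-entropy loss

        # At inference, the frozen CLIP *image* embedding is fed to the same
        # decoder to produce a caption.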
    Edge Grasp Network: A Graph-Based SE(3)-invariant Approach to Grasp Detection. (arXiv:2211.00191v1 [cs.RO])
    Given point cloud input, the problem of 6-DoF grasp pose detection is to identify a set of hand poses in SE(3) from which an object can be successfully grasped. This important problem has many practical applications. Here we propose a novel method and neural network model that enables better grasp success rates relative to what is available in the literature. The method takes standard point cloud data as input and works well with single-view point clouds observed from arbitrary viewing directions.  ( 2 min )
    Dungeons and Data: A Large-Scale NetHack Dataset. (arXiv:2211.00539v1 [cs.LG])
    Recent breakthroughs in the development of agents to solve challenging sequential decision making problems such as Go, StarCraft, or DOTA, have relied on both simulated environments and large-scale datasets. However, progress on this research has been hindered by the scarcity of open-sourced datasets and the prohibitive computational cost to work with them. Here we present the NetHack Learning Dataset (NLD), a large and highly-scalable dataset of trajectories from the popular game of NetHack, which is both extremely challenging for current methods and very fast to run. NLD consists of three parts: 10 billion state transitions from 1.5 million human trajectories collected on the NAO public NetHack server from 2009 to 2020; 3 billion state-action-score transitions from 100,000 trajectories collected from the symbolic bot winner of the NetHack Challenge 2021; and, accompanying code for users to record, load and stream any collection of such trajectories in a highly compressed form. We evaluate a wide range of existing algorithms including online and offline RL, as well as learning from demonstrations, showing that significant research advances are needed to fully leverage large-scale datasets for challenging sequential decision making tasks.
    Multi-Resource Allocation for On-Device Distributed Federated Learning Systems. (arXiv:2211.00481v1 [eess.SY])
    This work poses a distributed multi-resource allocation scheme for minimizing the weighted sum of latency and energy consumption in an on-device distributed federated learning (FL) system. Each mobile device in the system engages in the model training process within the specified area and allocates its computation and communication resources for deriving and uploading parameters, respectively, to minimize the system objective subject to the computation/communication budget and a target latency requirement. In particular, mobile devices are connected via wireless TCP/IP architectures. Exploiting the optimization problem structure, the problem can be decomposed into two convex sub-problems. Drawing on the Lagrangian dual and harmony search techniques, we characterize the global optimal solution by closed-form solutions to all sub-problems, which give qualitative insights into the multi-resource tradeoff. Numerical simulations are used to validate the analysis and assess the performance of the proposed algorithm.
    Denoising neural networks for magnetic resonance spectroscopy. (arXiv:2211.00080v1 [cs.LG])
    In many scientific applications, measured time series are corrupted by noise or distortions. Traditional denoising techniques often fail to recover the signal of interest, particularly when the signal-to-noise ratio is low or when certain assumptions on the signal and noise are violated. In this work, we demonstrate that deep learning-based denoising methods can outperform traditional techniques while exhibiting greater robustness to variation in noise and signal characteristics. Our motivating example is magnetic resonance spectroscopy, in which a primary goal is to detect the presence of short-duration, low-amplitude radio frequency signals that are often obscured by strong interference that can be difficult to separate from the signal using traditional methods. We explore various deep learning architecture choices to capture the inherently complex-valued nature of magnetic resonance signals. On both synthetic and experimental data, we show that our deep learning-based approaches can exceed performance of traditional techniques, providing a powerful new class of methods for analysis of scientific time series data.  ( 2 min )
    Disentangled (Un)Controllable Features. (arXiv:2211.00086v1 [cs.LG])
    In the context of MDPs with high-dimensional states, reinforcement learning can achieve better results when using a compressed, low-dimensional representation of the original input space. A variety of learning objectives have therefore been used to learn useful representations. However, these representations usually lack interpretability of the different features. We propose a representation learning algorithm that is able to disentangle latent features into a controllable and an uncontrollable part. The resulting representations are easily interpretable and can be used for learning and planning efficiently by leveraging the specific properties of the two parts. To highlight the benefits of the approach, the disentangling properties of the algorithm are illustrated in three different environments.  ( 2 min )
    Unsafe's Betrayal: Abusing Unsafe Rust in Binary Reverse Engineering toward Finding Memory-safety Bugs via Machine Learning. (arXiv:2211.00111v1 [cs.CR])
    Memory-safety bugs introduce critical software-security issues. Rust provides memory-safe mechanisms to avoid memory-safety bugs in programming, while still allowing unsafe escape hatches via unsafe code. However, the unsafe code that enhances the usability of Rust provides clear spots for finding memory-safety bugs in Rust source code. In this paper, we claim that these unsafe spots can still be identifiable in Rust binary code via machine learning and be leveraged for finding memory-safety bugs. To support our claim, we propose the tool rustspot, which enables reverse engineering to learn an unsafe classifier that proposes a list of functions in Rust binaries for downstream analysis. We empirically show that the function proposals by rustspot can recall $92.92\%$ of memory-safety bugs, while it covers only $16.79\%$ of the entire binary code. As an application, we demonstrate that the function proposals are used in targeted fuzzing on Rust packages, which contributes to reducing the fuzzing time compared to non-targeted fuzzing.  ( 2 min )
    A new benchmark for group distribution shifts in hand grasp regression for object manipulation. Can meta-learning raise the bar? (arXiv:2211.00110v1 [cs.CV])
    Understanding hand-object pose with computer vision opens the door to new applications in mixed reality, assisted living or human-robot interaction. Most methods are trained and evaluated on balanced datasets. This is of limited use in real-world applications; how do these methods perform in the wild on unknown objects? We propose a novel benchmark for object group distribution shifts in hand and object pose regression. We then test the hypothesis that meta-learning a baseline pose regression neural network can adapt to these shifts and generalize better to unknown objects. Our results show measurable improvements over the baseline, depending on the amount of prior knowledge. For the task of joint hand-object pose regression, we observe optimization interference for the meta-learner. To address this issue and improve the method further, we provide a comprehensive analysis which should serve as a basis for future work on this benchmark.  ( 2 min )
    Using Emotion Embeddings to Transfer Knowledge Between Emotions, Languages, and Annotation Formats. (arXiv:2211.00171v1 [cs.CL])
    The need for emotional inference from text continues to diversify as more and more disciplines integrate emotions into their theories and applications. These needs include inferring different emotion types, handling multiple languages, and different annotation formats. A shared model between different configurations would enable the sharing of knowledge and a decrease in training costs, and would simplify the process of deploying emotion recognition models in novel environments. In this work, we study how we can build a single model that can transition between these different configurations by leveraging multilingual models and Demux, a transformer-based model whose input includes the emotions of interest, enabling us to dynamically change the emotions predicted by the model. Demux also produces emotion embeddings, and performing operations on them allows us to transition to clusters of emotions by pooling the embeddings of each cluster. We show that Demux can simultaneously transfer knowledge in a zero-shot manner to a new language, to a novel annotation format and to unseen emotions. Code is available at https://github.com/gchochla/Demux-MEmo .  ( 2 min )
    A Machine Learning Tutorial for Operational Meteorology, Part II: Neural Networks and Deep Learning. (arXiv:2211.00147v1 [cs.LG])
    Over the past decade the use of machine learning in meteorology has grown rapidly. Specifically, neural networks and deep learning have been used at an unprecedented rate. In order to fill the dearth of resources covering neural networks with a meteorological lens, this paper discusses machine learning methods in a plain language format that is targeted at the operational meteorological community. This is the second paper in a pair that aims to serve as a machine learning resource for meteorologists. While the first paper focused on traditional machine learning methods (e.g., random forest), here a broad spectrum of neural networks and deep learning methods is discussed. Specifically, this paper covers perceptrons, artificial neural networks, convolutional neural networks and U-networks. Like the part 1 paper, this manuscript discusses the terms associated with neural networks and their training. The manuscript then provides some intuition behind each method and concludes by showing each method used in a meteorological example of diagnosing thunderstorms from satellite images (e.g., lightning flashes). This paper is accompanied by an open-source code repository to allow readers to explore neural networks using either the dataset provided (which is used in the paper) or as a template for alternate datasets.  ( 2 min )
    Agent-Controller Representations: Principled Offline RL with Rich Exogenous Information. (arXiv:2211.00164v1 [cs.LG])
    Learning to control an agent from data collected offline in a rich pixel-based visual observation space is vital for real-world applications of reinforcement learning (RL). A major challenge in this setting is the presence of input information that is hard to model and irrelevant to controlling the agent. This problem has been approached by the theoretical RL community through the lens of exogenous information, i.e., any control-irrelevant information contained in observations. For example, a robot navigating in busy streets needs to ignore irrelevant information, such as other people walking in the background, textures of objects, or birds in the sky. In this paper, we focus on the setting with visually detailed exogenous information, and introduce new offline RL benchmarks offering the ability to study this problem. We find that contemporary representation learning techniques can fail on datasets where the noise is a complex and time-dependent process, which is prevalent in practical applications. To address this, we propose to use multi-step inverse models, which have seen a great deal of interest in the RL theory community, to learn Agent-Controller Representations for Offline-RL (ACRO). Despite being simple and requiring no reward, we show theoretically and empirically that the representation created by this objective greatly outperforms baselines.  ( 2 min )
    Optimizing Closed-Loop Performance with Data from Similar Systems: A Bayesian Meta-Learning Approach. (arXiv:2211.00077v1 [cs.LG])
    Bayesian optimization (BO) has demonstrated potential for optimizing control performance in data-limited settings, especially for systems with unknown dynamics or unmodeled performance objectives. The BO algorithm efficiently trades off exploration and exploitation by leveraging uncertainty estimates using surrogate models. These surrogates are usually learned using data collected from the target dynamical system to be optimized. Intuitively, the convergence rate of BO is better for surrogate models that can accurately predict the target system performance. In classical BO, initial surrogate models are constructed using very limited data points, and therefore rarely yield accurate predictions of system performance. In this paper, we propose the use of meta-learning to generate an initial surrogate model based on data collected from performance optimization tasks performed on a variety of systems that are different from the target system. To this end, we employ deep kernel networks (DKNs) which are simple to train and which comprise encoded Gaussian process models that integrate seamlessly with classical BO. The effectiveness of our proposed DKN-BO approach for speeding up control system performance optimization is demonstrated using a well-studied nonlinear system with unknown dynamics and an unmodeled performance function.  ( 2 min )
    Large scale traffic forecasting with gradient boosting, Traffic4cast 2022 challenge. (arXiv:2211.00157v1 [cs.LG])
    Accurate traffic forecasting is of the utmost importance for optimal travel planning and for efficient city mobility. IARAI (the Institute of Advanced Research in Artificial Intelligence) organizes Traffic4cast, a yearly traffic prediction competition based on real-life data (https://www.iarai.ac.at/traffic4cast/), aiming to leverage artificial intelligence advances for producing accurate traffic estimates. We present our solution to the IARAI Traffic4cast 2022 competition, in which the goal is to develop algorithms for predicting road graph edge congestion classes and supersegment-level travel times. In contrast to the previous years, this year's competition focuses on modelling graph edge level behaviour, rather than more coarsely aggregated grid-based traffic movies. Because of this, we leverage a method familiar from tabular data modelling: gradient-boosted decision tree ensembles. We reduce the dimensionality of the input data representing traffic counters with the classic PCA method and feed it as input to a LightGBM model. This simple, fast, and scalable technique allowed us to win second place in the core competition. The source code and references to trained model files and submissions are available at https://github.com/skandium/t4c22.  ( 2 min )
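    A condensed sketch of the described pipeline using scikit-learn and LightGBM; the feature layout, dimensions, and hyperparameters are illustrative, not the competition-winning configuration:

        import numpy as np
        import lightgbm as lgb
        from sklearn.decomposition import PCA

        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(10_000, 800))   # placeholder counter features
        y_train = rng.integers(0, 3, size=10_000)  # placeholder congestion class
        X_val = rng.normal(size=(2_000, 800))

        pca = PCA(n_components=64)                 # compress traffic-counter inputs
        X_train_red = pca.fit_transform(X_train)
        X_val_red = pca.transform(X_val)

        model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05)
        model.fit(X_train_red, y_train)
        proba = model.predict_proba(X_val_red)     # per-class congestion scores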
    WHEN FLUE MEETS FLANG: Benchmarks and Large Pre-trained Language Model for Financial Domain. (arXiv:2211.00083v1 [cs.CL])
    Pre-trained language models have shown impressive performance on a variety of tasks and domains. Previous research on financial language models usually employs a generic training scheme to train standard model architectures, without completely leveraging the richness of the financial data. We propose a novel domain-specific Financial LANGuage model (FLANG) which uses financial keywords and phrases for better masking, together with span boundary and in-filling objectives. Additionally, the evaluation benchmarks in the field have been limited. To this end, we contribute the Financial Language Understanding Evaluation (FLUE), an open-source comprehensive suite of benchmarks for the financial domain. These include new benchmarks across 5 NLP tasks in the financial domain as well as common benchmarks used in previous research. Experiments on these benchmarks suggest that our model outperforms those in the prior literature on a variety of NLP tasks. Our models, code and benchmark data are publicly available on Github and Huggingface.  ( 2 min )
    What is my math transformer doing? -- Three results on interpretability and generalization. (arXiv:2211.00170v1 [cs.LG])
    This paper investigates the failure cases and out-of-distribution behavior of transformers trained on matrix inversion and eigenvalue decomposition. I show that incorrect model predictions still retain deep mathematical properties of the solution (e.g. correct eigenvalues, unit norm of eigenvectors), and that almost all model failures can be attributed to, and predicted from, properties of the problem or solution. This demonstrates that, when in doubt, math transformers do not hallucinate absurd solutions (as was sometimes proposed) but remain ``roughly right''. I also show that the careful choice of a training dataset can accelerate training, while allowing the model to generalize out of its training distribution, invalidating the idea that transformers ``merely interpolate'' from memorized examples.  ( 2 min )
    Is Facial Recognition Biased at Near-Infrared Spectrum As Well? (arXiv:2211.00129v1 [cs.CV])
    Published academic research and media articles suggest face recognition is biased across demographics. Specifically, unequal performance is obtained for women, dark-skinned people, and older adults. However, these published studies have examined the bias of facial recognition in the visible spectrum (VIS). Factors such as facial makeup, facial hair, skin color, and illumination variation have been attributed to the bias of this technology in the VIS. The near-infrared (NIR) spectrum offers an advantage over the VIS in terms of robustness to factors such as illumination changes, facial makeup, and skin color. Therefore, it is worthwhile to investigate the bias of facial recognition in the NIR spectrum, and this study is the first to do so. To this aim, two popular NIR facial image datasets, namely CASIA-Face-Africa and Notre-Dame-NIVL, consisting of African and Caucasian subjects, respectively, are used to investigate the bias of facial recognition technology across gender and race. Interestingly, experimental results suggest equitable face recognition performance across gender and race at the NIR spectrum.  ( 2 min )
    Indexability is Not Enough for Whittle: Improved, Near-Optimal Algorithms for Restless Bandits. (arXiv:2211.00112v1 [cs.MA])
    We study the problem of planning restless multi-armed bandits (RMABs) with multiple actions. This is a popular model for multi-agent systems with applications like multi-channel communication, monitoring and machine maintenance tasks, and healthcare. Whittle index policies, which are based on Lagrangian relaxations, are widely used in these settings due to their simplicity and near-optimality under certain conditions. In this work, we first show that Whittle index policies can fail in simple and practically relevant RMAB settings, \textit{even when} the RMABs are indexable. We discuss why the optimality guarantees fail and why asymptotic optimality may not translate well to practically relevant planning horizons. We then propose an alternate planning algorithm based on the mean-field method, which can provably and efficiently obtain near-optimal policies with a large number of arms, without the stringent structural assumptions required by the Whittle index policies. This borrows ideas from existing research with some improvements: our approach is hyper-parameter free, and we provide an improved non-asymptotic analysis which has: (a) no requirement for exogenous hyper-parameters and tighter polynomial dependence on known problem parameters; (b) high probability bounds which show that the reward of the policy is reliable; and (c) matching sub-optimality lower bounds for this algorithm with respect to the number of arms, thus demonstrating the tightness of our bounds. Our extensive experimental analysis shows that the mean-field approach matches or outperforms other baselines.  ( 2 min )
    The interaction of transmission intensity, mortality, and the economy: a retrospective analysis of the COVID-19 pandemic. (arXiv:2211.00054v1 [stat.AP])
    The COVID-19 pandemic has caused over 6.4 million registered deaths to date and has had a profound impact on economic activity. Here, we study the interaction of transmission, mortality, and the economy during the SARS-CoV-2 pandemic from January 2020 to December 2022 across 25 European countries. We adopt a Bayesian vector autoregressive model with both fixed and random effects. We find that increases in disease transmission intensity decrease gross domestic product (GDP) and increase daily excess deaths, with a longer-lasting impact on excess deaths than on GDP, which recovers more rapidly. Broadly, our results reinforce the intuitive phenomenon that significant economic activity arises from diverse person-to-person interactions. We report on the effectiveness of non-pharmaceutical interventions (NPIs) on transmission intensity, excess deaths and changes in GDP, and the resulting implications for policy makers. Our results highlight a complex cost-benefit trade-off for individual NPIs. For example, banning international travel increases GDP while reducing excess deaths. We consider country random effects and their associations with changes in GDP and excess deaths. For example, more developed countries in Europe typically had more cautious approaches to the COVID-19 pandemic, prioritising healthcare and excess deaths over economic performance. Long-term economic impairments are not fully captured by our model, nor are long-term disease effects (Long Covid). Our results highlight that the impact of disease on a country is complex and multifaceted, and simple heuristic conclusions to extract the best outcome from the economy and disease burden are challenging.  ( 3 min )
    A robust estimator of mutual information for deep learning interpretability. (arXiv:2211.00024v1 [physics.data-an])
    We develop the use of mutual information (MI), a well-established metric in information theory, to interpret the inner workings of deep learning models. To accurately estimate MI from a finite number of samples, we present GMM-MI (pronounced "Jimmie"), an algorithm based on Gaussian mixture models that can be applied to both discrete and continuous settings. GMM-MI is computationally efficient, robust to the choice of hyperparameters and provides the uncertainty on the MI estimate due to the finite sample size. We extensively validate GMM-MI on toy data for which the ground truth MI is known, comparing its performance against established mutual information estimators. We then demonstrate the use of our MI estimator in the context of representation learning, working with synthetic data and physical datasets describing highly non-linear processes. We train deep learning models to encode high-dimensional data within a meaningful compressed (latent) representation, and use GMM-MI to quantify both the level of disentanglement between the latent variables, and their association with relevant physical quantities, thus unlocking the interpretability of the latent representation. We make GMM-MI publicly available.  ( 2 min )
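    A compact sketch of the estimator's core idea, using scikit-learn and a Monte-Carlo average; the released GMM-MI adds hyperparameter selection and uncertainty quantification on top of this:

        import numpy as np
        from scipy.stats import norm
        from sklearn.mixture import GaussianMixture

        def gmm_mi(x, y, n_components=3, n_samples=100_000, seed=0):
            """Fit a GMM to the joint (x, y), then estimate
            MI = E[log p(x, y) - log p(x) - log p(y)] by sampling."""
            xy = np.column_stack([x, y])
            gmm = GaussianMixture(n_components, covariance_type="full",
                                  random_state=seed).fit(xy)
            s, _ = gmm.sample(n_samples)
            log_joint = gmm.score_samples(s)
            # Marginals of a Gaussian mixture are themselves Gaussian mixtures.
            w = gmm.weights_
            mx, sx = gmm.means_[:, 0], np.sqrt(gmm.covariances_[:, 0, 0])
            my, sy = gmm.means_[:, 1], np.sqrt(gmm.covariances_[:, 1, 1])
            log_px = np.log(norm.pdf(s[:, [0]], mx, sx) @ w)
            log_py = np.log(norm.pdf(s[:, [1]], my, sy) @ w)
            return np.mean(log_joint - log_px - log_py)   # in nats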
    SAGE: Saliency-Guided Mixup with Optimal Rearrangements. (arXiv:2211.00113v1 [cs.LG])
    Data augmentation is a key element for training accurate models by reducing overfitting and improving generalization. For image classification, the most popular data augmentation techniques range from simple photometric and geometrical transformations, to more complex methods that use visual saliency to craft new training examples. As augmentation methods get more complex, their ability to increase the test accuracy improves, yet, such methods become cumbersome, inefficient and lead to poor out-of-domain generalization, as we show in this paper. This motivates a new augmentation technique that allows for high accuracy gains while being simple, efficient (i.e., minimal computation overhead) and generalizable. To this end, we introduce Saliency-Guided Mixup with Optimal Rearrangements (SAGE), which creates new training examples by rearranging and mixing image pairs using visual saliency as guidance. By explicitly leveraging saliency, SAGE promotes discriminative foreground objects and produces informative new images useful for training. We demonstrate on CIFAR-10 and CIFAR-100 that SAGE achieves better or comparable performance to the state of the art while being more efficient. Additionally, evaluations in the out-of-distribution setting, and few-shot learning on mini-ImageNet, show that SAGE achieves improved generalization performance without trading off robustness.  ( 2 min )
    Uncertainty-DTW for Time Series and Sequences. (arXiv:2211.00005v1 [cs.CV])
    Dynamic Time Warping (DTW) is used for matching pairs of sequences and is celebrated in applications such as forecasting the evolution of time series, clustering time series, or matching sequence pairs in few-shot action recognition. The transportation plan of DTW contains a set of paths; each path matches frames between two sequences under a varying degree of time warping, to account for varying temporal intra-class dynamics of actions. However, as DTW is the smallest distance among all paths, it may be affected by the feature uncertainty, which varies across time steps/frames. Thus, in this paper, we propose to model the so-called aleatoric uncertainty of a differentiable (soft) version of DTW. To this end, we model the heteroscedastic aleatoric uncertainty of each path by the product of likelihoods from Normal distributions, each capturing the variance of a pair of frames. (The path distance is the sum of base distances between features of pairs of frames along the path.) The Maximum Likelihood Estimation (MLE) applied to a path yields two terms: (i) a sum of Euclidean distances weighted by the inverse variances, and (ii) a sum of log-variance regularization terms. Thus, our uncertainty-DTW is the smallest weighted path distance among all paths, and the regularization term (a penalty for high uncertainty) is the aggregate of log-variances along the path. The distance and the regularization term can be used in various objectives. We showcase forecasting the evolution of time series, estimating the Fr\'echet mean of time series, and supervised/unsupervised few-shot action recognition of articulated human 3D body joints.  ( 3 min )
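    To make the two MLE terms concrete, below is a bare-bones numpy sketch of the uncertainty-weighted recursion: each frame-pair cost is an inverse-variance-weighted squared distance plus a log-variance penalty, fed into the classic DTW dynamic program. It is an illustration only: in the paper the variances come from a learned network and the minimum is a differentiable soft-min, whereas here the variances are an input and the hard min is used.
    ```python
    import numpy as np

    def uncertainty_dtw(X, Y, sigma2):
        """X: (n,d), Y: (m,d), sigma2: (n,m) per-pair variances."""
        n, m = len(X), len(Y)
        base = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)  # squared distances
        cost = base / sigma2 + np.log(sigma2)   # weighted dist + log-var penalty
        D = np.full((n + 1, m + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                D[i, j] = cost[i - 1, j - 1] + min(
                    D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        return D[n, m]
    ```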
    Spatial-Temporal Synchronous Graph Transformer network (STSGT) for COVID-19 forecasting. (arXiv:2211.00082v1 [cs.LG])
    COVID-19 has become a matter of serious concern over the last few years. It has adversely affected numerous people around the globe and has led to the loss of billions of dollars of business capital. In this paper, we propose a novel Spatial-Temporal Synchronous Graph Transformer network (STSGT) to capture the complex spatial and temporal dependencies of COVID-19 time series data and forecast the future status of an evolving pandemic. The layers of STSGT combine the graph convolution network (GCN) with the self-attention mechanism of transformers on a synchronous spatial-temporal graph to capture the dynamically changing pattern of the COVID time series. The spatial-temporal synchronous graph simultaneously captures the spatial and temporal dependencies between the vertices of the graph at a given and subsequent time-steps, which helps capture the heterogeneity in the time series and improves forecasting accuracy. Our extensive experiments on two publicly available real-world COVID-19 time series datasets demonstrate that STSGT significantly outperforms state-of-the-art algorithms that were designed for spatial-temporal forecasting tasks. Specifically, on average over a 12-day horizon, we observe a potential improvement of 12.19% and 3.42% in Mean Absolute Error (MAE) over the next-best algorithm while forecasting the daily infected and death cases, respectively, for the 50 states of the US and Washington, D.C. Additionally, STSGT also outperformed others when forecasting daily infected cases at the county level, e.g., for all the counties in the State of Michigan. The code and models are publicly available at https://github.com/soumbane/STSGT.  ( 3 min )
    Classical ensemble of Quantum-classical ML algorithms for Phishing detection in Ethereum transaction networks. (arXiv:2211.00004v1 [quant-ph])
    Ethereum is one of the most valuable blockchain networks in terms of the total monetary value locked in it, and has arguably been the most active network where new blockchain innovations in research and applications are demonstrated. However, this also leaves the Ethereum network susceptible to a wide variety of threats and attacks that attempt to gain unreasonable advantage or to undermine the value held by its users. Even with state-of-the-art classical ML algorithms, detecting such attacks remains hard. This motivated us to build a hybrid system of quantum-classical algorithms that improves phishing detection in financial transaction networks. This paper presents a classical ensemble pipeline of classical and quantum algorithms and a detailed benchmarking study of existing Quantum Machine Learning (QML) algorithms such as the Quantum Support Vector Machine (QSVM) and the Variational Quantum Classifier. With the current generation of quantum hardware, smaller datasets are better suited to QML models, and most research restricts itself to hundreds of samples. However, we experimented with different data sizes and report results on a test set of 12K transaction nodes, which is, to the best of the authors' knowledge, the largest QML experiment run so far on any real quantum hardware. The classical ensembles of quantum-classical models improved both the macro F-score and the phishing F-score. One key observation is that QSVM consistently yields fewer false positives, and thereby higher precision, than any other classical or quantum model, which is always preferred for anomaly detection problems. This holds for QSVMs used individually, via bagging of the same model, or in combination with other classical/quantum models, making it the most advantageous quantum algorithm so far. The proposed ensemble framework is generic and can be applied to any classification task.  ( 3 min )
    CarbonTag: A browser-based method for approximating energy consumption of online ads. (arXiv:2211.00071v1 [cs.CY])
    Energy is today the most critical environmental challenge. The amount of carbon emissions contributing to climate change is significantly influenced by both the production and consumption of energy. Measuring and reducing the energy consumption of services is a crucial step toward reducing the adverse environmental effects caused by carbon emissions. Millions of websites rely on online advertisements to generate revenue, with most websites earning most or all of their revenue from ads. As a result, hundreds of billions of online ads are delivered daily to internet users to be rendered in their browsers. Both the delivery and rendering of each ad consume energy. This study investigates how much energy online ads use and offers a way of predicting it as part of rendering the ad. To the best of the authors' knowledge, this is the first study to calculate the energy usage of single advertisements. Our research further introduces consumption levels by which online ads can be classified based on energy efficiency. This classification will allow advertisers to add energy-efficiency metrics to their campaigns and to optimize them towards consuming as little energy as possible.  ( 2 min )
  • Open

    Amplifying Membership Exposure via Data Poisoning. (arXiv:2211.00463v1 [cs.CR])
    As in-the-wild data are increasingly involved in the training stage, machine learning applications become more susceptible to data poisoning attacks. Such attacks typically lead to test-time accuracy degradation or controlled misprediction. In this paper, we investigate the third type of exploitation of data poisoning - increasing the risks of privacy leakage of benign training samples. To this end, we demonstrate a set of data poisoning attacks to amplify the membership exposure of the targeted class. We first propose a generic dirty-label attack for supervised classification algorithms. We then propose an optimization-based clean-label attack in the transfer learning scenario, whereby the poisoning samples are correctly labeled and look "natural" to evade human moderation. We extensively evaluate our attacks on computer vision benchmarks. Our results show that the proposed attacks can substantially increase the membership inference precision with minimum overall test-time model performance degradation. To mitigate the potential negative impacts of our attacks, we also investigate feasible countermeasures.
    Conservative Likelihood Ratio Estimator for Infrequent Data Slightly above a Frequency Threshold. (arXiv:2211.00545v1 [stat.ML])
    A naive likelihood ratio (LR) estimation using the observed frequencies of events can overestimate LRs for infrequent data. One approach to avoiding this problem is to use a frequency threshold and set the estimates to zero for frequencies below it. This approach eliminates the computation of some estimates, thereby making practical tasks that use LRs more efficient. However, it still overestimates LRs for low frequencies near the threshold. This study proposes a conservative estimator for low frequencies slightly above the threshold. Our experiment used LRs to predict the occurrence contexts of named entities in a corpus. The experimental results demonstrate that our estimator improves prediction accuracy while maintaining efficiency in the context prediction task.
    On Medians of (Randomized) Pairwise Means. (arXiv:2211.00603v1 [stat.ML])
    Tournament procedures, recently introduced in Lugosi & Mendelson (2016), offer an appealing alternative, from a theoretical perspective at least, to the principle of Empirical Risk Minimization in machine learning. Statistical learning by Median-of-Means (MoM) basically consists in segmenting the training data into blocks of equal size and comparing the statistical performance of every pair of candidate decision rules on each data block: the rule with the highest performance on the majority of the blocks is declared the winner. In the context of nonparametric regression, functions having won all their duels have been shown to outperform empirical risk minimizers w.r.t. the mean squared error under minimal assumptions, while exhibiting robustness properties. It is the purpose of this paper to extend this approach to other learning problems, in particular those for which the performance criterion takes the form of an expectation over pairs of observations rather than over a single observation, as is the case in pairwise ranking, clustering or metric learning. Precisely, it is proved here that the bounds achieved by MoM are essentially conserved when the blocks are built by means of independent sampling-without-replacement schemes instead of simple segmentation. These results are next extended to situations where the risk is related to a pairwise loss function and its empirical counterpart takes the form of a $U$-statistic. Beyond theoretical results guaranteeing the performance of the proposed learning/estimation methods, some numerical experiments provide empirical evidence of their relevance in practice.
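    For readers new to the base procedure, a minimal Median-of-Means mean estimator — the simple-segmentation variant that the paper extends to sampling-without-replacement blocks and U-statistics — fits in a few lines; the block count and toy heavy-tailed data are illustrative.
    ```python
    import numpy as np

    def median_of_means(x, n_blocks, seed=0):
        x = np.random.default_rng(seed).permutation(x)  # shuffle before blocking
        blocks = np.array_split(x, n_blocks)            # (nearly) equal-size blocks
        return np.median([b.mean() for b in blocks])    # median of block means

    heavy = np.random.default_rng(1).standard_t(df=2, size=10_000)
    print(median_of_means(heavy, n_blocks=20), heavy.mean())  # MoM vs plain mean
    ```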
    A Semismooth Newton Stochastic Proximal Point Algorithm with Variance Reduction. (arXiv:2204.00406v2 [math.OC] UPDATED)
    We develop an implementable stochastic proximal point (SPP) method for a class of weakly convex, composite optimization problems. The proposed stochastic proximal point algorithm incorporates a variance reduction mechanism and the resulting SPP updates are solved using an inexact semismooth Newton framework. We establish detailed convergence results that take the inexactness of the SPP steps into account and that are in accordance with existing convergence guarantees of (proximal) stochastic variance-reduced gradient methods. Numerical experiments show that the proposed algorithm competes favorably with other state-of-the-art methods and achieves higher robustness with respect to the step size selection.
    Backtracking Counterfactuals. (arXiv:2211.00472v1 [cs.AI])
    Counterfactual reasoning -- envisioning hypothetical scenarios, or possible worlds, where some circumstances are different from what (f)actually occurred (counter-to-fact) -- is ubiquitous in human cognition. Conventionally, counterfactually-altered circumstances have been treated as "small miracles" that locally violate the laws of nature while sharing the same initial conditions. In Pearl's structural causal model (SCM) framework this is made mathematically rigorous via interventions that modify the causal laws while the values of exogenous variables are shared. In recent years, however, this purely interventionist account of counterfactuals has increasingly come under scrutiny from both philosophers and psychologists. Instead, they suggest a backtracking account of counterfactuals, according to which the causal laws remain unchanged in the counterfactual world; differences to the factual world are instead "backtracked" to altered initial conditions (exogenous variables). In the present work, we explore and formalise this alternative mode of counterfactual reasoning within the SCM framework. Despite ample evidence that humans backtrack, the present work constitutes, to the best of our knowledge, the first general account and algorithmisation of backtracking counterfactuals. We discuss our backtracking semantics in the context of related literature and draw connections to recent developments in explainable artificial intelligence (XAI).
    An Adversarial Approach to Structural Estimation. (arXiv:2007.06169v3 [econ.EM] UPDATED)
    We propose a new simulation-based estimation method, adversarial estimation, for structural models. The estimator is formulated as the solution to a minimax problem between a generator (which generates simulated observations using the structural model) and a discriminator (which classifies whether an observation is simulated). The discriminator maximizes the accuracy of its classification while the generator minimizes it. We show that, with a sufficiently rich discriminator, the adversarial estimator attains parametric efficiency under correct specification and the parametric rate under misspecification. We advocate the use of a neural network as a discriminator that can exploit adaptivity properties and attain fast rates of convergence. We apply our method to a model of the elderly's saving decisions and show that our estimator uncovers the bequest motive as an important source of saving across the wealth distribution, not only for the rich.
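    The minimax structure mirrors GAN training. The toy sketch below swaps in a trivial "structural model" — a location family Normal(theta, 1) — and a small neural discriminator, rather than the paper's saving-decision model; architectures, learning rates and step counts are placeholders.
    ```python
    import torch
    import torch.nn as nn

    real = torch.randn(2000) + 3.0                   # data with unknown location 3
    theta = torch.tensor(0.0, requires_grad=True)    # structural parameter
    disc = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))
    opt_d = torch.optim.Adam(disc.parameters(), lr=1e-2)
    opt_g = torch.optim.Adam([theta], lr=5e-2)
    bce = nn.BCEWithLogitsLoss()

    for step in range(500):
        sim = theta + torch.randn(2000)              # reparameterized simulator
        # Discriminator step: classify real (1) vs simulated (0).
        logits = disc(torch.cat([real, sim.detach()]).unsqueeze(1)).squeeze(1)
        labels = torch.cat([torch.ones(2000), torch.zeros(2000)])
        opt_d.zero_grad(); bce(logits, labels).backward(); opt_d.step()
        # Generator step: make simulations indistinguishable from real data.
        g_loss = bce(disc(sim.unsqueeze(1)).squeeze(1), torch.ones(2000))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    print(theta.item())                              # should drift toward 3
    ```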
    Differentially private stochastic expectation propagation (DP-SEP). (arXiv:2111.13219v5 [cs.LG] UPDATED)
    We are interested in privatizing an approximate posterior inference algorithm called Expectation Propagation (EP). EP approximates the posterior by iteratively refining approximations to the local likelihoods, and is known to provide better posterior uncertainties than variational inference (VI). However, EP needs a large amount of memory to maintain all the local approximations associated with each datapoint in the training data. To overcome this challenge, stochastic expectation propagation (SEP) considers a single unique local factor that captures the average effect of each likelihood term on the posterior and refines it in a way analogous to EP. In terms of privacy, SEP is more tractable than EP because at each refining step of a factor, the remaining factors are fixed and do not depend on other datapoints as in EP, which makes the sensitivity analysis straightforward. We provide a theoretical analysis of the privacy-accuracy trade-off in the posterior estimates under our method, called differentially private stochastic expectation propagation (DP-SEP). Furthermore, we demonstrate the performance of our DP-SEP algorithm evaluated on both synthetic and real-world datasets in terms of the quality of posterior estimates at different levels of guaranteed privacy.
    Deciding What to Model: Value-Equivalent Sampling for Reinforcement Learning. (arXiv:2206.02072v2 [cs.LG] UPDATED)
    The quintessential model-based reinforcement-learning agent iteratively refines its estimates or prior beliefs about the true underlying model of the environment. Recent empirical successes in model-based reinforcement learning with function approximation, however, eschew the true model in favor of a surrogate that, while ignoring various facets of the environment, still facilitates effective planning over behaviors. Recently formalized as the value equivalence principle, this algorithmic technique is perhaps unavoidable as real-world reinforcement learning demands consideration of a simple, computationally-bounded agent interacting with an overwhelmingly complex environment, whose underlying dynamics likely exceed the agent's capacity for representation. In this work, we consider the scenario where agent limitations may entirely preclude identifying an exactly value-equivalent model, immediately giving rise to a trade-off between identifying a model simple enough to learn and keeping the resulting sub-optimality bounded. To address this problem, we introduce an algorithm that, using rate-distortion theory, iteratively computes an approximately-value-equivalent, lossy compression of the environment which an agent may feasibly target in lieu of the true model. We prove an information-theoretic, Bayesian regret bound for our algorithm that holds for any finite-horizon, episodic sequential decision-making problem. Crucially, our regret bound can be expressed in one of two possible forms, providing a performance guarantee for finding either the simplest model that achieves a desired sub-optimality gap or, alternatively, the best model given a limit on agent capacity.
    A Confidence Machine for Sparse High-Order Interaction Model. (arXiv:2205.14317v2 [stat.ML] UPDATED)
    In predictive modeling for high-stake decision-making, predictors must be not only accurate but also reliable. Conformal prediction (CP) is a promising approach for obtaining the confidence of prediction results with fewer theoretical assumptions. To obtain the confidence set by so-called full-CP, we need to refit the predictor for all possible values of the prediction results, which is only possible for simple predictors. For complex predictors such as random forests (RFs) or neural networks (NNs), split-CP is often employed, where the data is split into two parts: one part for fitting and another for computing the confidence set. Unfortunately, because of the reduced sample size, split-CP is inferior to full-CP both in fitting and in confidence-set computation. In this paper, we develop a full-CP of the sparse high-order interaction model (SHIM), which is sufficiently flexible as it can take into account high-order interactions among variables. We resolve the computational challenge for full-CP of SHIM by introducing a novel approach called homotopy mining. Through numerical experiments, we demonstrate that SHIM is as accurate as complex predictors such as RFs and NNs and enjoys the superior statistical power of full-CP.
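    For context, the split-CP baseline the paper improves on is simple to implement; a minimal regression version is sketched below (the random-forest predictor and miscoverage level alpha are arbitrary illustrative choices).
    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def split_conformal(X, y, X_test, alpha=0.1, seed=0):
        idx = np.random.default_rng(seed).permutation(len(X))
        fit, cal = idx[: len(X) // 2], idx[len(X) // 2:]   # split the data
        model = RandomForestRegressor(random_state=seed).fit(X[fit], y[fit])
        scores = np.sort(np.abs(y[cal] - model.predict(X[cal])))
        k = min(int(np.ceil((1 - alpha) * (len(cal) + 1))) - 1, len(cal) - 1)
        q = scores[k]                                      # conformal quantile
        pred = model.predict(X_test)
        return pred - q, pred + q                          # coverage >= 1 - alpha
    ```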
    Fighting Selection Bias in Statistical Learning: Application to Visual Recognition from Biased Image Databases. (arXiv:2109.02357v2 [cs.CV] UPDATED)
    In practice, and especially when training deep neural networks, visual recognition rules are often learned based on various sources of information. On the other hand, the recent deployment of facial recognition systems with uneven performances on different population segments has highlighted the representativeness issues induced by a naive aggregation of the datasets. In this paper, we show how biasing models can remedy these problems. Based on the (approximate) knowledge of the biasing mechanisms at work, our approach consists in reweighting the observations, so as to form a nearly debiased estimator of the target distribution. One key condition is that the supports of the biased distributions must partly overlap, and cover the support of the target distribution. In order to meet this requirement in practice, we propose to use a low dimensional image representation, shared across the image databases. Finally, we provide numerical experiments highlighting the relevance of our approach.
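    When the biasing mechanisms are not known in closed form, a common practical surrogate for the reweighting step — sketched below, and not the paper's exact estimator — trains a probabilistic classifier to separate biased-sample points from target-distribution points and converts its output into importance weights for weighted ERM.
    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def importance_weights(X_biased, X_target):
        """Estimate w(x) = p_target(x) / p_biased(x) with a domain classifier."""
        X = np.vstack([X_biased, X_target])
        z = np.r_[np.zeros(len(X_biased)), np.ones(len(X_target))]
        clf = LogisticRegression(max_iter=1000).fit(X, z)
        p = clf.predict_proba(X_biased)[:, 1]
        w = p / (1 - p) * (len(X_biased) / len(X_target))  # density-ratio estimate
        return w / w.mean()

    # The weights then plug into any weighted ERM, e.g.
    # LogisticRegression().fit(X_biased, y_biased, sample_weight=w).
    ```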
    Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics. (arXiv:2211.00450v1 [math.AP])
    Motivated by the challenge of sampling Gibbs measures with nonconvex potentials, we study a continuum birth-death dynamics. We prove that the probability density of the birth-death governed by Kullback-Leibler divergence or by $\chi^2$ divergence converges exponentially fast to the Gibbs equilibrium measure with a universal rate that is independent of the potential barrier. To build a practical numerical sampler based on the pure birth-death dynamics, we consider an interacting particle system which relies on kernel-based approximations of the measure and retains the gradient-flow structure. We show on the torus that the kernelized dynamics $\Gamma$-converges, on finite time intervals, to the pure birth-death dynamics as the kernel bandwidth shrinks to zero. Moreover, we provide quantitative estimates on the bias of minimizers of the energy corresponding to the kernelized dynamics. Finally, we prove long-time asymptotic results on the convergence of the asymptotic states of the kernelized dynamics towards the Gibbs measure.
    Pi theorem formulation of flood mapping. (arXiv:2211.00636v1 [physics.geo-ph])
    While physical phenomena are stated in terms of physical laws that are homogeneous in all dimensions, the mechanisms and patterns of the physical phenomena are independent of the form of the units describing the physical process. Accordingly, across different conditions, the similarity of a process may be captured through a dimensionless reformulation of the physical problem with the Buckingham $\Pi$ theorem. Here, we apply the Buckingham $\Pi$ theorem to create dimensionless indices capturing the similarity of the flood process; in turn, these indices allow machine learning to map the likelihood of pluvial (flash) flooding over a landscape. In particular, we use these dimensionless predictors with a logistic-regression machine learning (ML) model for a probabilistic determination of flood risk. The logistic-regression-derived flood maps compare well to the 2D hydraulic model results that are the basis of the Federal Emergency Management Agency (FEMA) maps. As a result, the indices and logistic regression also provide the potential to expand existing FEMA maps to new (unmapped) areas and to a wider spectrum of flood flows and precipitation events. Our results demonstrate that the new dimensionless indices capture the similarity of the flood process across different topographies and climate regions. Consequently, these dimensionless indices may extend observations of flooding (e.g., from satellites) to the risk of flooding in new areas, as well as provide a basis for the rapid, real-time estimation of flood risk on a worldwide scale.
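    A sketch of the modelling step follows: a logistic regression mapping dimensionless predictors to a flood probability. The three Pi groups and labels below are randomly generated placeholders, not the paper's actual indices.
    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    Pi = rng.lognormal(size=(1000, 3))                 # hypothetical Pi groups
    flooded = (Pi[:, 0] / Pi[:, 1] > 1.0).astype(int)  # toy flood labels
    model = LogisticRegression(max_iter=1000).fit(np.log(Pi), flooded)
    risk = model.predict_proba(np.log(Pi))[:, 1]       # probabilistic flood risk
    ```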
    Recurrent Neural Networks and Universal Approximation of Bayesian Filters. (arXiv:2211.00335v1 [stat.ML])
    We consider the Bayesian optimal filtering problem: i.e. estimating some conditional statistics of a latent time-series signal from an observation sequence. Classical approaches often rely on the use of assumed or estimated transition and observation models. Instead, we formulate a generic recurrent neural network framework and seek to learn directly a recursive mapping from observational inputs to the desired estimator statistics. The main focus of this article is the approximation capabilities of this framework. We provide approximation error bounds for filtering in general non-compact domains. We also consider strong time-uniform approximation error bounds that guarantee good long-time performance. We discuss and illustrate a number of practical concerns and implications of these results.
    Adversarial Policies Beat Professional-Level Go AIs. (arXiv:2211.00241v1 [cs.LG])
    We attack the state-of-the-art Go-playing AI system, KataGo, by training an adversarial policy that plays against a frozen KataGo victim. Our attack achieves a >99% win-rate against KataGo without search, and a >50% win-rate when KataGo uses enough search to be near-superhuman. To the best of our knowledge, this is the first successful end-to-end attack against a Go AI playing at the level of a top human professional. Notably, the adversary does not win by learning to play Go better than KataGo -- in fact, the adversary is easily beaten by human amateurs. Instead, the adversary wins by tricking KataGo into ending the game prematurely at a point that is favorable to the adversary. Our results demonstrate that even professional-level AI systems may harbor surprising failure modes. See https://goattack.alignmentfund.org/ for example games.
    Model-based segmentation for improved activation detection in single-subject functional Magnetic Resonance Imaging studies. (arXiv:2102.03639v2 [stat.AP] UPDATED)
    Functional Magnetic Resonance Imaging (fMRI) maps cerebral activation in response to stimuli but this activation is often difficult to detect, especially in low-signal contexts and single-subject studies. Accurate activation detection can be guided by the fact that very few voxels are, in reality, truly activated and that these voxels are spatially localized, but it is challenging to incorporate both these facts. We address these twin challenges to single-subject and low-signal fMRI by developing a computationally feasible and methodologically sound model-based approach, implemented in the R package MixfMRI, that bounds the a priori expected proportion of activated voxels while also incorporating spatial context. An added benefit of our methodology is the ability to distinguish voxels and regions having different intensities of activation. Our suggested approach is evaluated in realistic two- and three-dimensional simulation experiments as well as on multiple datasets. Finally, the value of our suggested approach in low-signal and single-subject fMRI studies is illustrated on a sports imagination experiment that is often used to detect awareness and improve treatment in patients in persistent vegetative state (PVS). Our ability to reliably distinguish activation in this experiment potentially opens the door to the adoption of fMRI as a clinical tool for the improved treatment and therapy of PVS survivors and other patients.
    Improved Rates for Differentially Private Stochastic Convex Optimization with Heavy-Tailed Data. (arXiv:2106.01336v6 [cs.LG] UPDATED)
    We study stochastic convex optimization with heavy-tailed data under the constraint of differential privacy (DP). Most prior work on this problem is restricted to the case where the loss function is Lipschitz. Instead, as introduced by Wang, Xiao, Devadas, and Xu, we study general convex loss functions with the assumption that the distribution of gradients has bounded $k$-th moments. We provide improved upper bounds on the excess population risk under concentrated DP for convex and strongly convex loss functions. Along the way, we derive new algorithms for private mean estimation of heavy-tailed distributions, under both pure and concentrated DP. Finally, we prove nearly-matching lower bounds for private stochastic convex optimization with strongly convex losses and mean estimation, showing new separations between pure and concentrated DP.  ( 2 min )
    Federated Averaging Langevin Dynamics: Toward a unified theory and new algorithms. (arXiv:2211.00100v1 [stat.ML])
    This paper focuses on Bayesian inference in a federated learning (FL) context. While several distributed MCMC algorithms have been proposed, few consider the specific limitations of FL such as communication bottlenecks and statistical heterogeneity. Recently, Federated Averaging Langevin Dynamics (FALD) was introduced, which extends the Federated Averaging algorithm to Bayesian inference. We obtain a novel tight non-asymptotic upper bound on the Wasserstein distance to the global posterior for FALD. This bound highlights the effects of statistical heterogeneity, which causes a drift in the local updates that negatively impacts convergence. We propose a new algorithm, VR-FALD*, that uses control variates to correct the client drift. We establish non-asymptotic bounds showing that VR-FALD* is not affected by statistical heterogeneity. Finally, we illustrate our results on several FL benchmarks for Bayesian inference.
    Augmentation Invariant Manifold Learning. (arXiv:2211.00460v1 [stat.ML])
    Data augmentation is a widely used technique and an essential ingredient in the recent advances in self-supervised representation learning. By preserving the similarity between augmented data, the resulting data representation can improve various downstream analyses and achieve state-of-the-art performance in many applications. To demystify the role of data augmentation, we develop a statistical framework on a low-dimensional product manifold to theoretically understand why unlabeled augmented data can lead to useful data representations. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations. Compared with existing methods, the new data representation simultaneously exploits the manifold's geometric structure and the invariance property of augmented data. Our theoretical investigation precisely characterizes how the data representation learned from augmented data can improve the $k$-nearest neighbor classifier in the downstream analysis, showing that a more complex data augmentation leads to more improvement in the downstream analysis. Finally, numerical experiments on simulated and real datasets are presented to support the theoretical results in this paper.  ( 2 min )
    On the Semi-supervised Expectation Maximization. (arXiv:2211.00537v1 [cs.LG])
    The Expectation Maximization (EM) algorithm is widely used as an iterative modification to maximum likelihood estimation when the data is incomplete. We focus on a semi-supervised case in which the model is learned from both labeled and unlabeled samples. Existing work on the semi-supervised case has focused mainly on performance rather than on convergence guarantees; we instead focus on the contribution of the labeled samples to the convergence rate. The analysis clearly demonstrates how the labeled samples improve the convergence rate for the exponential family mixture model. In this case, we assume that the population EM (EM with unlimited data), run solely on the unlabeled samples, is initialized within its neighborhood of global convergence. The analysis for the labeled samples provides a comprehensive description of the convergence rate for the Gaussian mixture model. In addition, we extend the findings for labeled samples and offer an alternative proof for the population EM's convergence rate with unlabeled samples for the symmetric mixture of two Gaussians.  ( 2 min )
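    The semi-supervised E-step simply freezes the responsibilities of labeled points at their one-hot labels; everything else is standard EM. A minimal sketch for a two-component one-dimensional Gaussian mixture (initialization and iteration count are arbitrary choices):
    ```python
    import numpy as np
    from scipy.stats import norm

    def ss_em(x_unl, x_lab, y_lab, n_iter=100):
        x = np.r_[x_unl, x_lab]
        mu, sig, pi = np.array([x.min(), x.max()]), np.ones(2), np.full(2, 0.5)
        R_lab = np.eye(2)[y_lab]                 # labeled: fixed one-hot weights
        for _ in range(n_iter):
            like = pi * norm.pdf(x_unl[:, None], mu, sig)   # E-step (unlabeled only)
            R = np.vstack([like / like.sum(1, keepdims=True), R_lab])
            Nk = R.sum(0)                                   # M-step (all data)
            mu = (R * x[:, None]).sum(0) / Nk
            sig = np.sqrt((R * (x[:, None] - mu) ** 2).sum(0) / Nk)
            pi = Nk / len(x)
        return mu, sig, pi

    rng = np.random.default_rng(0)
    x_unl = np.r_[rng.normal(-2, 1, 450), rng.normal(2, 1, 450)]
    x_lab = np.r_[rng.normal(-2, 1, 50), rng.normal(2, 1, 50)]
    y_lab = np.r_[np.zeros(50, int), np.ones(50, int)]
    print(ss_em(x_unl, x_lab, y_lab)[0])         # means recovered near (-2, 2)
    ```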
    Time Series Alignment with Global Invariances. (arXiv:2002.03848v2 [stat.ML] UPDATED)
    Multivariate time series are ubiquitous objects in signal processing. Measuring a distance or similarity between two such objects is of prime interest in a variety of applications, including machine learning, but can be very difficult as soon as the temporal dynamics and the representation of the time series, i.e., the nature of the observed quantities, differ from one another. In this work, we propose a novel distance accounting for both feature-space and temporal variabilities by learning a latent global transformation of the feature space together with a temporal alignment, cast as a joint optimization problem. The versatility of our framework allows for several variants depending on the invariance class at stake. Among other contributions, we define a differentiable loss for time series and present two algorithms for the computation of time series barycenters under this new geometry. We illustrate the interest of our approach on both simulated and real-world data and show the robustness of our approach compared to state-of-the-art methods.  ( 2 min )
    Robust Large-Margin Learning in Hyperbolic Space. (arXiv:2004.05465v3 [cs.LG] UPDATED)
    Recently, there has been a surge of interest in representation learning in hyperbolic spaces, driven by their ability to represent hierarchical data with significantly fewer dimensions than standard Euclidean spaces. However, the viability and benefits of hyperbolic spaces for downstream machine learning tasks have received less attention. In this paper, we present, to our knowledge, the first theoretical guarantees for learning a classifier in hyperbolic rather than Euclidean space. Specifically, we consider the problem of learning a large-margin classifier for data possessing a hierarchical structure. We provide an algorithm to efficiently learn a large-margin hyperplane, relying on the careful injection of adversarial examples. Finally, we prove that for hierarchical data that embeds well into hyperbolic space, the low embedding dimension ensures superior guarantees when learning the classifier directly in hyperbolic space.  ( 2 min )
    Efficient Data Analytics on Augmented Similarity Triplets. (arXiv:1912.12064v2 [cs.LG] UPDATED)
    Data analysis requires a pairwise proximity measure over objects. Recent work has extended this to situations where the distance information between objects is given as comparison results of distances between three objects (triplets). Humans find such comparison tasks much easier than exact distance computation, and such data can easily be obtained in large quantities via crowd-sourcing. In this work, we propose triplet augmentation, an efficient method to extend the triplet data by inferring the hidden implicit information from the existing data. Triplet augmentation improves the quality of kernel-based and kernel-free data analytics. We also propose a novel set of algorithms for common data analysis tasks based on triplets. These methods work directly with triplets and avoid kernel evaluations, and are thus scalable to big data. We demonstrate that our methods outperform the current best-known techniques and are robust to noisy data.  ( 2 min )
    Electroencephalography and mild cognitive impairment research: A scoping review and bibliometric analysis (ScoRBA). (arXiv:2211.00302v1 [q-bio.NC])
    Background: Mild cognitive impairment (MCI) is often considered a precursor to Alzheimer's disease (AD) due to the high rate of progression from MCI to AD. Sensitive neural biomarkers may provide a tool for an accurate MCI diagnosis, enabling earlier and perhaps more effective treatment. Despite the availability of numerous neuroscience techniques, electroencephalography (EEG) is the most popular and frequently used tool among researchers due to its low cost and superior temporal resolution. Objective: We conducted a scoping review of EEG and MCI research between 2012 and 2022 to track the progression of work in this field. Methods: In contrast to previous scoping reviews, the data charting was aided by co-occurrence analysis using VOSviewer, while data reporting adopted a Patterns, Advances, Gaps, Evidence of Practice, and Research Recommendations (PAGER) framework to increase the quality of the results. Results: Event-related potentials (ERPs) and EEG, epilepsy, quantitative EEG (QEEG), and EEG-based machine learning were the research themes addressed by 2310 peer-reviewed articles on EEG and MCI. Conclusion: Our review identified the main research themes in EEG and MCI, with high-accuracy detection of seizures and MCI performed using ERP/EEG, QEEG, and EEG-based machine learning frameworks.  ( 2 min )
    Theoretical Foundations of t-SNE for Visualizing High-Dimensional Clustered Data. (arXiv:2105.07536v4 [stat.ML] UPDATED)
    This paper investigates the theoretical foundations of the t-distributed stochastic neighbor embedding (t-SNE) algorithm, a popular nonlinear dimension reduction and data visualization method. A novel theoretical framework for the analysis of t-SNE based on the gradient descent approach is presented. For the early exaggeration stage of t-SNE, we show its asymptotic equivalence to power iterations based on the underlying graph Laplacian, characterize its limiting behavior, and uncover its deep connection to Laplacian spectral clustering, and fundamental principles including early stopping as implicit regularization. The results explain the intrinsic mechanism and the empirical benefits of such a computational strategy. For the embedding stage of t-SNE, we characterize the kinematics of the low-dimensional map throughout the iterations, and identify an amplification phase, featuring the intercluster repulsion and the expansive behavior of the low-dimensional map, and a stabilization phase. The general theory explains the fast convergence rate and the exceptional empirical performance of t-SNE for visualizing clustered data, brings forth interpretations of the t-SNE visualizations, and provides theoretical guidance for applying t-SNE and selecting its tuning parameters in various applications.  ( 2 min )
    CLIP: Cheap Lipschitz Training of Neural Networks. (arXiv:2103.12531v2 [cs.LG] UPDATED)
    Despite the large success of deep neural networks (DNNs) in recent years, most neural networks still lack mathematical guarantees in terms of stability. For instance, DNNs are vulnerable to small or even imperceptible input perturbations, so-called adversarial examples, that can cause false predictions. This instability can have severe consequences in applications which influence the health and safety of humans, e.g., biomedical imaging or autonomous driving. While bounding the Lipschitz constant of a neural network improves stability, most methods rely on restricting the Lipschitz constants of each layer, which gives a poor bound for the actual Lipschitz constant. In this paper, we investigate a variational regularization method named CLIP for controlling the Lipschitz constant of a neural network, which can easily be integrated into the training procedure. We mathematically analyze the proposed model, in particular discussing the impact of the chosen regularization parameter on the output of the network. Finally, we numerically evaluate our method on both a nonlinear regression problem and the MNIST and Fashion-MNIST classification databases, and compare our results with a weight regularization approach.  ( 3 min )
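    As a rough illustration of the kind of penalty involved, the sketch below estimates an empirical Lipschitz constant from difference quotients on randomly perturbed point pairs and adds it, scaled by rho, to the task loss; CLIP additionally updates the pairs adversarially to tighten the estimate, which this simplified version omits.
    ```python
    import torch

    def lipschitz_penalty(model, x, eps=0.1):
        """Empirical Lipschitz estimate on a batch of perturbed point pairs."""
        u = x + eps * torch.randn_like(x)              # partner points
        num = (model(x) - model(u)).flatten(1).norm(dim=1)
        den = (x - u).flatten(1).norm(dim=1) + 1e-12
        return (num / den).max()                       # largest difference quotient

    # Training objective: loss = task_loss + rho * lipschitz_penalty(model, x_batch)
    ```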
    Stability Based Generalization Bounds for Exponential Family Langevin Dynamics. (arXiv:2201.03064v2 [cs.LG] UPDATED)
    Recent years have seen advances in generalization bounds for noisy stochastic algorithms, especially stochastic gradient Langevin dynamics (SGLD), based on stability (Mou et al., 2018; Li et al., 2020) and information theoretic approaches (Xu and Raginsky, 2017; Negrea et al., 2019; Steinke and Zakynthinou, 2020). In this paper, we unify and substantially generalize stability based generalization bounds and make three technical contributions. First, we bound the generalization error in terms of expected (not uniform) stability, which arguably leads to quantitatively sharper bounds. Second, as our main contribution, we introduce Exponential Family Langevin Dynamics (EFLD), a substantial generalization of SGLD, which includes noisy versions of Sign-SGD and quantized SGD as special cases. We establish data-dependent expected stability based generalization bounds for any EFLD algorithm with an O(1/n) sample dependence and dependence on gradient discrepancy rather than the norm of gradients, yielding significantly sharper bounds. Third, we establish optimization guarantees for special cases of EFLD. Further, empirical results on benchmarks illustrate that our bounds are non-vacuous, quantitatively sharper than existing bounds, and behave correctly under noisy labels.  ( 2 min )
    Robust Direct Learning for Causal Data Fusion. (arXiv:2211.00249v1 [stat.ML])
    In the era of big data, the explosive growth of multi-source heterogeneous data offers many exciting challenges and opportunities for improving the inference of conditional average treatment effects. In this paper, we investigate homogeneous and heterogeneous causal data fusion problems under a general setting that allows for the presence of source-specific covariates. We provide a direct learning framework for integrating multi-source data that separates the treatment effect from other nuisance functions, and achieves double robustness against certain misspecification. To improve estimation precision and stability, we propose a causal information-aware weighting function motivated by theoretical insights from the semiparametric efficiency theory; it assigns larger weights to samples containing more causal information with high interpretability. We introduce a two-step algorithm, the weighted multi-source direct learner, based on constructing a pseudo-outcome and regressing it on covariates under a weighted least square criterion; it offers us a powerful tool for causal data fusion, enjoying the advantages of easy implementation, double robustness and model flexibility. In simulation studies, we demonstrate the effectiveness of our proposed methods in both homogeneous and heterogeneous causal data fusion scenarios.  ( 2 min )
    Dynamical Wasserstein Barycenters for Time-series Modeling. (arXiv:2110.06741v3 [cs.LG] UPDATED)
    Many time series can be modeled as a sequence of segments representing high-level discrete states, such as running and walking in a human activity application. Flexible models should describe the system state and observations in stationary "pure-state" periods as well as transition periods between adjacent segments, such as a gradual slowdown between running and walking. However, most prior work assumes instantaneous transitions between pure discrete states. We propose a dynamical Wasserstein barycentric (DWB) model that estimates the system state over time as well as the data-generating distributions of pure states in an unsupervised manner. Our model assumes each pure state generates data from a multivariate normal distribution, and characterizes transitions between states via displacement-interpolation specified by the Wasserstein barycenter. The system state is represented by a barycentric weight vector which evolves over time via a random walk on the simplex. Parameter learning leverages the natural Riemannian geometry of Gaussian distributions under the Wasserstein distance, which leads to improved convergence speeds. Experiments on several human activity datasets show that our proposed DWB model accurately learns the generating distribution of pure states while improving state estimation for transition periods compared to the commonly used linear interpolation mixture models.  ( 3 min )
    Batch Active Learning from the Perspective of Sparse Approximation. (arXiv:2211.00246v1 [cs.LG])
    Active learning enables efficient model training by leveraging interactions between machine learning agents and human annotators. We study and propose a novel framework that formulates batch active learning from the perspective of sparse approximation. Our active learning method aims to find an informative subset of the unlabeled data pool such that the corresponding training loss function approximates its full-data-pool counterpart. We realize the framework as sparsity-constrained discontinuous optimization problems, which explicitly balance uncertainty and representation for large-scale applications and can be solved by greedy or proximal iterative hard-thresholding algorithms. The proposed method can adapt to various settings, including both Bayesian and non-Bayesian neural networks. Numerical experiments show that our work achieves competitive performance across different settings with lower computational complexity.  ( 2 min )
    Inductive Representation Learning in Temporal Networks via Causal Anonymous Walks. (arXiv:2101.05974v5 [cs.LG] UPDATED)
    Temporal networks serve as abstractions of many real-world dynamic systems. These networks typically evolve according to certain laws, such as the law of triadic closure, which is universal in social networks. Inductive representation learning of temporal networks should be able to capture such laws and further apply them to systems that follow the same laws but have not been seen during the training stage. Previous works in this area depend on either network node identities or rich edge attributes and typically fail to extract these laws. Here, we propose Causal Anonymous Walks (CAWs) to inductively represent a temporal network. CAWs are extracted by temporal random walks and work as automatic retrieval of temporal network motifs to represent network dynamics while avoiding the time-consuming selection and counting of those motifs. CAWs adopt a novel anonymization strategy that replaces node identities with the hitting counts of the nodes based on a set of sampled walks, to keep the method inductive and simultaneously establish correlations between motifs. We further propose a neural network model, CAW-N, to encode CAWs, and pair it with a CAW sampling strategy with constant memory and time cost to support online training and inference. CAW-N is evaluated on link prediction over 6 real temporal networks and uniformly outperforms previous SOTA methods by an average 10% AUC gain in the inductive setting. CAW-N also outperforms previous methods on 4 out of the 6 networks in the transductive setting.  ( 3 min )
    Shedding a PAC-Bayesian Light on Adaptive Sliced-Wasserstein Distances. (arXiv:2206.03230v2 [stat.ML] UPDATED)
    The Sliced-Wasserstein distance (SW) is a computationally efficient and theoretically grounded alternative to the Wasserstein distance. Yet, the literature on its statistical properties -- or, more accurately, its generalization properties -- with respect to the distribution of slices, beyond the uniform measure, is scarce. To bring new contributions to this line of research, we leverage the PAC-Bayesian theory and a central observation that SW may be interpreted as an average risk, the quantity PAC-Bayesian bounds have been designed to characterize. We provide three types of results: i) PAC-Bayesian generalization bounds that hold on what we refer to as adaptive Sliced-Wasserstein distances, i.e., SW defined with respect to arbitrary distributions of slices (among which data-dependent distributions), ii) a principled procedure to learn the distribution of slices that yields maximally discriminative SW, by optimizing our theoretical bounds, and iii) empirical illustrations of our theoretical findings.  ( 2 min )
    MAZE: Data-Free Model Stealing Attack Using Zeroth-Order Gradient Estimation. (arXiv:2005.03161v2 [stat.ML] UPDATED)
    Model Stealing (MS) attacks allow an adversary with black-box access to a Machine Learning model to replicate its functionality, compromising the confidentiality of the model. Such attacks train a clone model by using the predictions of the target model for different inputs. The effectiveness of such attacks relies heavily on the availability of data necessary to query the target model. Existing attacks either assume partial access to the dataset of the target model or availability of an alternate dataset with semantic similarities. This paper proposes MAZE -- a data-free model stealing attack using zeroth-order gradient estimation. In contrast to prior works, MAZE does not require any data and instead creates synthetic data using a generative model. Inspired by recent works in data-free Knowledge Distillation (KD), we train the generative model using a disagreement objective to produce inputs that maximize disagreement between the clone and the target model. However, unlike the white-box setting of KD, where the gradient information is available, training a generator for model stealing requires performing black-box optimization, as it involves accessing the target model under attack. MAZE relies on zeroth-order gradient estimation to perform this optimization and enables a highly accurate MS attack. Our evaluation with four datasets shows that MAZE provides a normalized clone accuracy in the range of 0.91x to 0.99x, and outperforms even the recent attacks that rely on partial data (JBDA, clone accuracy 0.13x to 0.69x) and surrogate data (KnockoffNets, clone accuracy 0.52x to 0.97x). We also study an extension of MAZE in the partial-data setting and develop MAZE-PD, which generates synthetic data closer to the target distribution. MAZE-PD further improves the clone accuracy (0.97x to 1.0x) and reduces the query required for the attack by 2x-24x.  ( 3 min )
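    The zeroth-order estimator at the heart of the attack can be sketched independently of the full pipeline: approximate a gradient from forward evaluations alone via random-direction finite differences. The snippet below shows the generic estimator (direction count and step size are illustrative), not MAZE's complete generator-training loop.
    ```python
    import torch

    def zo_gradient(loss_fn, z, n_dirs=10, eps=1e-3):
        """Estimate the gradient of a black-box scalar loss_fn at z from queries only."""
        base = loss_fn(z)
        grad = torch.zeros_like(z)
        for _ in range(n_dirs):
            u = torch.randn_like(z)
            u = u / u.norm()                           # random unit direction
            grad += (loss_fn(z + eps * u) - base) / eps * u
        return grad * (z.numel() / n_dirs)             # standard sphere scaling
    ```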
    Statistical Learning from Biased Training Samples. (arXiv:1906.12304v4 [stat.ML] UPDATED)
    With the deluge of digitized information in the Big Data era, massive datasets are becoming increasingly available for learning predictive models. However, in many practical situations, the poor control of the data acquisition processes may naturally jeopardize the outputs of machine learning algorithms, and selection bias issues are now the subject of much attention in the literature. The present article investigates how to extend Empirical Risk Minimization, the principal paradigm in statistical learning, when training observations are generated from biased models, i.e., from distributions that are different from that in the test/prediction stage, and absolutely continuous with respect to the latter. Precisely, we show how to build a "nearly debiased" training statistical population from biased samples and the related biasing functions, following in the footsteps of the approach originally proposed in Vardi (1985). Furthermore, we study from a nonasymptotic perspective the performance of minimizers of an empirical version of the risk computed from the statistical population thus created. Remarkably, the learning rate achieved by this procedure is of the same order as that attained in absence of selection bias. Beyond the theoretical guarantees, we also present experimental results supporting the relevance of the algorithmic approach promoted in this paper.  ( 2 min )
    Training Matters: Unlocking Potentials of Deeper Graph Convolutional Neural Networks. (arXiv:2008.08838v2 [cs.LG] UPDATED)
    The performance limit of Graph Convolutional Networks (GCNs), and the fact that we cannot stack more of them to increase performance (as we usually do for other deep learning paradigms), are pervasively thought to be caused by the limitations of the GCN layers, including insufficient expressive power. However, if so, for a fixed architecture it would be unlikely that one could lower the training difficulty and improve performance by changing only the training procedure, which we show in this paper is not only possible but possible in several ways. This paper first identifies the training difficulty of GCNs from the perspective of graph signal energy loss. More specifically, we find that the loss of energy in the backward pass during training nullifies the learning of the layers closer to the input. Then, we propose several methodologies to mitigate the training problem by slightly modifying the GCN operator, from the energy perspective. After empirical validation, we confirm that these changes of operator lead to a significant decrease in training difficulty and a notable performance boost, without changing the composition of parameters. With these, we conclude that the root cause of the problem is more likely the training difficulty than the limitations of the layers themselves.  ( 2 min )
    SIMPLE-RC: Group Network Inference with Non-Sharp Nulls and Weak Signals. (arXiv:2211.00128v1 [stat.ML])
    Large-scale network inference with uncertainty quantification has important applications in natural, social, and medical sciences. The recent work of Fan, Fan, Han and Lv (2022) introduced a general framework of statistical inference on membership profiles in large networks (SIMPLE) for testing the sharp null hypothesis that a pair of given nodes share the same membership profiles. In real applications, there are often groups of nodes under investigation that may share similar membership profiles in the presence of relatively weaker signals than the setting considered in SIMPLE. To address these practical challenges, in this paper we propose a SIMPLE method with random coupling (SIMPLE-RC) for testing the non-sharp null hypothesis that a group of given nodes share similar (not necessarily identical) membership profiles under weaker signals. Utilizing the idea of random coupling, we construct our test as the maximum of the SIMPLE tests for subsampled node pairs from the group. Such a technique significantly reduces the correlation among individual SIMPLE tests while largely maintaining power, enabling a delicate analysis of the asymptotic distributions of the SIMPLE-RC test. Our method and theory cover both the cases with and without node degree heterogeneity. These new theoretical developments are empowered by a second-order expansion of spiked eigenvectors under the $\ell_\infty$-norm, built upon our work on random matrices with weak spikes. Our theoretical results and the practical advantages of the newly suggested method are demonstrated through several simulation and real data examples.  ( 3 min )
    Error Controlled Feature Selection for Ultrahigh Dimensional and Highly Correlated Feature Space Using Deep Learning. (arXiv:2209.07011v3 [stat.ML] UPDATED)
    In recent years, deep learning has been at the center of analytics due to its impressive empirical success in analyzing complex data objects. Despite this success, most of the existing tools behave like black-box machines, thus the increasing interest in interpretable, reliable, and robust deep learning models applicable to a broad class of applications. Feature-selected deep learning has emerged as a promising tool in this realm. However, the recent developments do not accommodate ultra-high dimensional and highly correlated features, in addition to the high noise level. In this article, we propose a novel screening and cleaning method with the aid of deep learning for a data-adaptive multi-resolutional discovery of highly correlated predictors with a controlled error rate. Extensive empirical evaluations over a wide range of simulated scenarios and several real datasets demonstrate the effectiveness of the proposed method in achieving high power while keeping the false discovery rate at a minimum.  ( 2 min )
    Informed Priors for Knowledge Integration in Trajectory Prediction. (arXiv:2211.00348v1 [cs.LG])
    Informed machine learning methods allow the integration of prior knowledge into learning systems. This can increase accuracy and robustness or reduce data needs. However, existing methods often assume hard, constraining knowledge that does not require trading off prior knowledge against observations but can instead be used to directly reduce the problem space. Other approaches use specific architectural changes as the representation of prior knowledge, limiting applicability. We propose an informed machine learning method based on continual learning. This allows the integration of arbitrary prior knowledge, potentially from multiple sources, and does not require specific architectures. Furthermore, our approach enables probabilistic and multi-modal predictions, which can improve predictive accuracy and robustness. We exemplify our approach by applying it to a state-of-the-art trajectory predictor for autonomous driving. This domain especially depends on informed learning approaches, as it is subject to an overwhelmingly large variety of possible environments and very rare events, while requiring robust and accurate predictions. We evaluate our model on a commonly used benchmark dataset, using only data already available in a conventional setup. We show that our method outperforms both non-informed and informed learning methods that are commonly used in the literature. Furthermore, we are able to compete with a conventional baseline even while using half as many observation examples.  ( 2 min )

  • Open

    [D] Any scripts to sanitize videos of text automatically?
    I want to remove my personal information / sensitive information (like street signs, addresses, license plates, etc.) from my videos. There's no possible way for me to do this by hand in After Effects across 90+ videos. There's gotta be some kind of script that removes text from videos? submitted by /u/Valiantay [link] [comments]  ( 63 min )
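    One possible starting point — a rough sketch, not a vetted tool — is to run an OCR detector on each frame and blur every detected text box, e.g. with the easyocr and opencv-python packages (both assumptions; any detector that returns boxes would work). Expect it to be slow across 90+ videos and to miss some regions, so spot-check the output.
    ```python
    import cv2
    import easyocr

    reader = easyocr.Reader(["en"])                    # OCR detector (assumed dependency)

    def blur_text(in_path, out_path):
        cap = cv2.VideoCapture(in_path)
        fps = cap.get(cv2.CAP_PROP_FPS)
        w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
        h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
        out = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            for box, _text, _conf in reader.readtext(frame):
                xs, ys = [int(p[0]) for p in box], [int(p[1]) for p in box]
                x0, y0 = max(min(xs), 0), max(min(ys), 0)
                x1, y1 = min(max(xs), w), min(max(ys), h)
                if x1 > x0 and y1 > y0:                # blur the detected text box
                    frame[y0:y1, x0:x1] = cv2.GaussianBlur(
                        frame[y0:y1, x0:x1], (51, 51), 0)
            out.write(frame)
        cap.release(); out.release()
    ```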
    [R] "Broken Neural Scaling Laws" paper; Presents new Functional Form that yields SotA Extrapolation of Scaling behavior for each task within large, diverse set of downstream tasks, including large-scale Vision, NLP, Diffusion Models, "Emergent" "Unpredictable" Math, Double Descent, & RL.
    paper: https://arxiv.org/abs/2210.14891 tweet thread: https://twitter.com/ethanCaballero/status/1587502829580820481 submitted by /u/evc123 [link] [comments]  ( 63 min )
    [R] Extended submission deadline — EvoMUSART 2023 conference
    Good news: The submission deadline of EvoMUSART 2023 has been extended to November 16th! 🙌 You still have time to submit your work to the 12th International Conference on Artificial Intelligence in Music, Sound, Art and Design (EvoMUSART). If you work with Artificial Intelligence techniques applied to visual art, music, sound synthesis, architecture, video, poetry, design, or other creative tasks, don't miss the opportunity to submit your work to EvoMUSART. EvoMUSART 2023 will be held in Brno, Czech Republic, between 12 and 14 April 2023. 🇨🇿 For more information, visit the conference webpage: https://www.evostar.org/2023/evomusart/ https://preview.redd.it/4t05oxawzdx91.png?width=2083&format=png&auto=webp&s=25bc62844e658d60b3df49975d2afd79711142ee submitted by /u/evomusart_conference [link] [comments]  ( 63 min )
    [D] What is the widely accepted way of testing on ImageNet?
    Torchvision's builtin dataset classes seem to only support train and validation sets. In the drive of the lab I work for, there are also only the train and validation sets available. But the ImageNet-1k dataset card on Hugging Face lists a test set, and TensorFlow Datasets also contains a test split. I'm a bit confused - how exactly are the results obtained for the benchmark? Are people just testing on the validation set, or is there a secret test set that we don't get access to? More interestingly, where does this difference come from? Why does this one ultra-well-known dataset show up with different splits on the major platforms? submitted by /u/XtremePocket [link] [comments]  ( 67 min )
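    For what it's worth, torchvision itself only exposes the train and val splits (the ILSVRC test labels were never public), which is why most reported "test" numbers are really val-set numbers. A minimal sketch, with a placeholder dataset path:

        from torchvision import datasets

        # torchvision only knows 'train' and 'val'; "/data/imagenet" is a placeholder root
        train_set = datasets.ImageNet(root="/data/imagenet", split="train")
        val_set = datasets.ImageNet(root="/data/imagenet", split="val")  # the usual benchmark split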
    [N] Meta AI | Evolutionary-scale prediction of atomic level protein structure with a language model
    Paper: https://www.biorxiv.org/content/10.1101/2022.07.20.500902v2 Meta's Tweet: https://twitter.com/MetaAI/status/1587467591068459008 Abstract Artificial intelligence has the potential to open insight into the structure of proteins at the scale of evolution. It has only recently been possible to extend protein structure prediction to two hundred million cataloged proteins. Characterizing the structures of the exponentially growing billions of protein sequences revealed by large scale gene sequencing experiments would necessitate a breakthrough in the speed of folding. Here we show that direct inference of structure from primary sequence using a large language model enables an order of magnitude speed-up in high resolution structure prediction. Leveraging the insight that language models learn evolutionary patterns across millions of sequences, we train models up to 15B parameters, the largest language model of proteins to date. As the language models are scaled they learn information that enables prediction of the three-dimensional structure of a protein at the resolution of individual atoms. This results in prediction that is up to 60x faster than state-of-the-art while maintaining resolution and accuracy. Building on this, we present the ESM Metagenomic Atlas. This is the first large-scale structural characterization of metagenomic proteins, with more than 617 million structures. The atlas reveals more than 225 million high confidence predictions, including millions whose structures are novel in comparison with experimentally determined structures, giving an unprecedented view into the vast breadth and diversity of the structures of some of the least understood proteins on earth. submitted by /u/xutw21 [link] [comments]  ( 62 min )
    [P] Need pretrained EBMs for benchmarks
    Hi guys. My question is simple: is there any good place to find an energy-based model that has been trained on the CelebA dataset? The training algorithm can be anything, for instance denoising score matching, contrastive divergence, etc. I am just benchmarking my own model and wanted some simple baselines, as I don't want to train from scratch. My project is still in its early stages, so the pretrained EBM doesn't have to be particularly SOTA; any simple model with decent generative performance would suffice. Hence the need for EBM weights. Any help is appreciated. Thanks. submitted by /u/anomaly_in_testset [link] [comments]  ( 63 min )
    [D] Machine learning prototyping on Apple silicon?
    Hi, what's the latest state of affairs with prototyping ML on Apple silicon, especially the M1 MacBook Pro (or M2 if you can see the future)? I need to be able to run ML code incl. training and inference, but it doesn't need to be efficient, as it's just for local validation/prototyping (esp. unit testing) before I do the significant training elsewhere. I want to know what it's like doing ML dev day to day. If it's still finicky then it's probably not worth it. Unfortunately I still don't know if I'm going to be using TensorFlow, PyTorch or JAX (EDIT: but I'll be using one of them). This question has been asked a few times, but not recently, and things can change fast. EDIT: I'll be working in industry as an ML engineer. submitted by /u/laprika0 [link] [comments]  ( 75 min )
    [D] Is there a way we can score "popularity" on social media posts?
    I'm currently in charge of a project that deals with social media texts in the e-commerce domain. My objective is to create "popularity scores" for particular items per week. I'm currently using a very crude metric: a weighted mean of the number of views and comments per post. However, I'd like to know if there's another way to score popularity. I've found papers like this: https://dl.acm.org/doi/abs/10.1145/2661714.2661722 but they're rather old. Wondering if there's a better way to go about this. Thank you! submitted by /u/Seankala [link] [comments]  ( 63 min )
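    One common refinement of a raw weighted mean is to log-scale the counts first, since view counts dwarf comment counts by orders of magnitude. A minimal sketch; the weights here are purely illustrative and would need tuning on real engagement data:

        import math

        def popularity_score(views, comments, w_views=0.3, w_comments=0.7):
            # log1p damps the huge dynamic range of raw counts;
            # the weights are illustrative placeholders, not tuned values
            return w_views * math.log1p(views) + w_comments * math.log1p(comments)

        # rank a week's posts by score
        posts = [{"id": "a", "views": 12000, "comments": 45},
                 {"id": "b", "views": 900, "comments": 130}]
        ranked = sorted(posts, key=lambda p: popularity_score(p["views"], p["comments"]), reverse=True)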
    [D] Is there a way to decode clip features to a sentence?
    I found a few papers that use CLIP for text generation, but they do it via other models such as GPT-2. Is there a simpler decoder model that receives CLIP features and outputs text? submitted by /u/HackZisBotez [link] [comments]  ( 61 min )
    [R] What method can I use to sample nodes from a set with the least connectivities?
    I have a set of nodes N, and I have an NxN connectivity matrix with each element representing the connectivity strength between 2 nodes. Currently, I hope to sample about half of the nodes out of the N nodes, but with the least connectivity among the sampled nodes. For example, it would be best if there were no connections at all among these N/2 nodes. Are there any commonly used ML methods (or greedy traditional methods) for such an objective? submitted by /u/AaronSpalding [link] [comments]  ( 63 min )
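    This is essentially a weighted maximum-independent-set / sparsest-subgraph problem, which is NP-hard in general, so greedy heuristics are a common starting point. A minimal sketch of one such heuristic (not a standard named algorithm): repeatedly drop the node with the largest total connectivity to the rest until k nodes remain:

        import numpy as np

        def sample_low_connectivity(C, k):
            """Greedily keep k of the N nodes in connectivity matrix C, dropping the
            node with the highest total strength to the current set each round."""
            keep = set(range(len(C)))
            while len(keep) > k:
                idx = list(keep)
                strengths = C[np.ix_(idx, idx)].sum(axis=1)  # strength within current set
                keep.remove(idx[int(np.argmax(strengths))])
            return sorted(keep)

        C = np.random.rand(10, 10); C = (C + C.T) / 2; np.fill_diagonal(C, 0)
        print(sample_low_connectivity(C, 5))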
    [R] Do you have experience posting tasks on Mechanical Turk?
    We are trying to understand how requestors (task posters) on digital pieceworking platforms (e.g., Amazon Mechanical Turk, Microworker, Clickworker, etc.) approach and design tasks they post on the platform. If you are an English speaking adult, located in the United States, we would like to talk with you about your background, approach, and methodology as such a requester. The session will last 1 hour and will be held on a Georgia Tech-sponsored Zoom call, with a short demographic survey distributed via Qualtrics. Please comment or message me if you have experience and are interested in participating! Thank you for your time. submitted by /u/22FlyingTurtles [link] [comments]  ( 64 min )
  • Open

    "IMAGINARIUM" A music video made using Stable diffusion and disco diffusion video input
    submitted by /u/lonelygalemusic [link] [comments]  ( 41 min )
    AlphaFold’s new rival? Meta AI predicts shape of 600 million proteins
    submitted by /u/codingai [link] [comments]  ( 40 min )
    Weekly China AI News: Baidu Proposes Largest Text-to-Image Model; XPeng Doubles Down on Self-Driving Tech; AI Firms Reportedly Laying Off Employees
    submitted by /u/trcytony [link] [comments]  ( 41 min )
    Song Analysis to get track data (Energy, Tempo, Happy/Sad, etc)
    I am looking for something similar to the Spotify Track Audio Analysis, in which live sound can be analyzed and rated on categories. For example, you play live audio and get shown how Happy/Sad it is, its Tempo, Warmth, etc. Are there any known models for something like this? submitted by /u/jude_mcjude [link] [comments]  ( 41 min )
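    Not Spotify's pipeline, but low-level descriptors like tempo and energy can be estimated with librosa as a starting point; higher-level labels like happy/sad would need a classifier trained on top. A minimal offline sketch with a placeholder file path:

        import librosa

        y, sr = librosa.load("track.wav")               # "track.wav" is a placeholder
        tempo, _ = librosa.beat.beat_track(y=y, sr=sr)  # global tempo estimate (BPM)
        energy = librosa.feature.rms(y=y).mean()        # mean RMS energy over frames
        print("tempo:", tempo, "energy:", energy)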
    I made 34 Burning Man photos that don't exist... also a bit about why artists should embrace AI as a "caddy"
    Enjoy my reddit frens - I made a long blog post about it https://stuckincustoms.com/2022/11/01/34-crazy-photos-from-burning-man-that-dont-exist/?fbclid=IwAR24QGZngGqjhuhDNLaeY4WsxJchO48Qsqov31Y1y4mac3RjJ2Rdp8af3GQ https://preview.redd.it/qvhx2blqidx91.png?width=784&format=png&auto=webp&s=98f7e1017591394d866ef34773d717d7aefac107 submitted by /u/treyratcliff [link] [comments]  ( 41 min )
    How AI Sees The Victims' Rooms From Famous Horror Movies
    Neural Network generated victims' rooms from famous horror movies, check the floor plans (but furnished in modern style). What do you think? https://preview.redd.it/cj757lhmddx91.png?width=1080&format=png&auto=webp&s=b9d4c08615c1dd2102d2ba22e5ff95c67625cc35 https://preview.redd.it/czpzhshmddx91.png?width=1080&format=png&auto=webp&s=1927e1c81f7322ee38afc5c058cdbb6983b2f7ef https://preview.redd.it/lzfz1thmddx91.png?width=1080&format=png&auto=webp&s=a19854164ed359d2ec8cf086b097ae255b36d0b9 https://preview.redd.it/fu0qpnhmddx91.png?width=1080&format=png&auto=webp&s=3471210d2f6fcb738232cd37ca0cf09dba5ab0d1 submitted by /u/MystrangeLove [link] [comments]  ( 41 min )
    online waiver system?
    Hi guys, so my business requires my customers to sign a waiver, but I'm unable to find software that meets the requirements. I'm looking for software that allows e-signatures, can take a selfie, lets the customer fill out the waiver, and has multi-factor authentication. As I was researching, I found a few apps that meet all the requirements except the multi-factor authentication part. submitted by /u/bigmoneyshnz [link] [comments]  ( 41 min )
    AMAZING Image 2 Image Option: Unmasked Conditioning Tutorial
    submitted by /u/PuppetHere [link] [comments]  ( 40 min )
    NN-SVG is a tool for creating Neural Network architecture drawings parametrically rather than manually! It also provides the ability to export those drawings to Scalable Vector Graphics (SVG) files, suitable for inclusion in academic papers or web pages
    submitted by /u/ai-lover [link] [comments]  ( 41 min )
    Breakthrough Computer Vision AI Device | New Neuromorphic Tech Runs Deep Learning AI 100X Faster Using Photons
    submitted by /u/kenickh [link] [comments]  ( 41 min )
    best reference or scientific paper for face detection & emotion detection
    So, I have this college project related to image processing, and I was required to implement 2 programs in Python: one for a face detection algorithm and the other for an emotion detection algorithm. I was wondering if anyone could point me to the best reference or scientific paper for implementing my programs. Any help would be appreciated, thanks in advance. submitted by /u/abdosalm [link] [comments]  ( 41 min )
    I made a comparison of free image colorizer websites (Ft. Marilyn Monroe) (What are your thoughts?)
    submitted by /u/Rhurartehgreat [link] [comments]  ( 45 min )
    🚨Issue 2 of my comic miniseries ANIMOIA is OUT NOW!🚨 ➡️Amazon links⬅️ 🇺🇸US link - https://www.amazon.com/dp/B0BL1XC4KX 🇬🇧UK link - https://www.amazon.co.uk/dp/B0BL1XC4KX
    submitted by /u/Ideal-Typical [link] [comments]  ( 41 min )
    Chatbots in marketing: 6 real examples
    Hi! There are plenty of articles about chatbot benefits for marketing, but not so many about real examples of how companies use chatbots for marketing. 🙃 So I gathered 6 chatbot examples from companies like Virgin Holidays, Coca-Cola, Choose Chicago, Honda, etc., with campaign descriptions, results, and stories behind the development of these chatbots. The info includes anything I could get from media ranging from Adweek, Drift, and Campaign Brief to Forbes and Google Business, plus my company's experience. Read it here. Let me know if it is helpful! submitted by /u/Avandegraund [link] [comments]  ( 41 min )
    Replit’s Ghostwriter AI can explain programs to you—or help write them
    submitted by /u/NISMO1968 [link] [comments]  ( 40 min )
    When demon follows you
    submitted by /u/nalr00n [link] [comments]  ( 40 min )
    Found this while I was looking for AI generated videos, kinda weird
    submitted by /u/CooleBanane420 [link] [comments]  ( 41 min )
    Users question AI's ability to moderate online harassment
    New Cornell University research finds that both the type of moderator—human or AI—and the "temperature" of harassing content online influence people's perception of the moderation decision and the moderation system. Now published in Big Data & Society, the study used a custom social media site, on which people can post pictures of food and comment on other posts. The site contains a simulation engine, Truman, an open-source platform that mimics other users' behaviors (likes, comments, posts) through preprogrammed bots created and curated by researchers. "The Truman platform allows researchers to create a controlled yet realistic social media experience for participants, with social and design versatility to examine a variety of research questions about human behaviors in social media," B…  ( 43 min )
    I want to have my own business.
    Hello, and thanks to all readers. I live in South Korea and am 26, soon 27. I majored in accounting and finance at university and intended to start my career at an accounting firm. But now I hope to have my own business in the AI field. The problem is that I do not have enough knowledge of AI, so I cannot build the structure of the business myself. So I came here to ask experts (you) for business ideas. It's not like I'm going to do this right now, but I want to listen to you and learn from you. What will the upcoming trends in AI business be, and what would you recommend if you were starting a business? Thanks a lot, I appreciate all your opinions :) >< submitted by /u/cryptolover0 [link] [comments]  ( 43 min )
    The "Pandora's Box" of Artificial Intelligence has been opened: How it will change your life?
    submitted by /u/_-_agenda_-_ [link] [comments]  ( 42 min )
  • Open

    DSC Weekly 1 Nov 2022 – Why the Future Seldom Matches Expectations
    This week saw the news that two major automotive companies, Ford and VW, were walking away from a multi-billion dollar investment into Argo AI, a venture intended to build self-driving vehicles. Instead, the companies hope to roll at least some of that effort back into augmenting drivers' abilities to drive safely and efficiently. The post DSC Weekly 1 Nov 2022 – Why the Future Seldom Matches Expectations appeared first on Data Science Central.  ( 25 min )
    AI-Enabled Wearable Devices: The Next Layer of IoT and Machine Learning
    Wearable devices are gadgets that can comfortably be worn on the body. These devices have intelligent sensors connected to the internet for collecting data. Growing partnerships and collaborations to bring AI functionality to wearable devices will drive new developments in AI wearables in the coming years. In today's busy schedule, people want to improve… Read More »AI-Enabled Wearable Devices: The Next Layer of IoT and Machine Learning The post AI-Enabled Wearable Devices: The Next Layer of IoT and Machine Learning appeared first on Data Science Central.  ( 19 min )
    Strategies and Examples Of Effective Team Dynamics In A Digital Workspace
    Every organization needs teamwork, and for a hybrid or digital workforce it's even more crucial. Your company's chances of expansion and success shrink if team dynamics are unhealthy. Interacting with coworkers in a physical office setting offers numerous opportunities to learn about one another's lives, discuss problems, and plan initiatives. You… Read More »Strategies and Examples Of Effective Team Dynamics In A Digital Workspace The post Strategies and Examples Of Effective Team Dynamics In A Digital Workspace appeared first on Data Science Central.  ( 21 min )
  • Open

    Combination of DQL and DDPG
    Has anyone tried to combine these two algorithms? I have two action vectors I want to optimize, one discrete and one continuous. I am thinking of making the main loop DDPG, optimizing the continuous action, and inside it calling the DQL algorithm to optimize the discrete action, until both are optimized. Does that make sense? Any suggestions? Thanks submitted by /u/alicefaisal [link] [comments]  ( 51 min )
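    For reference, this is close in spirit to parameterized-action methods such as P-DQN. A minimal sketch (PyTorch, all layer sizes hypothetical) of one network that couples a DQN-style head for the discrete action with a DDPG-style deterministic head for the continuous one:

        import torch
        import torch.nn as nn

        class HybridActor(nn.Module):
            def __init__(self, state_dim, n_discrete, cont_dim):
                super().__init__()
                self.trunk = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU())
                self.q_head = nn.Linear(128, n_discrete)  # DQN-style Q-value per discrete action
                self.mu_head = nn.Sequential(             # DDPG-style deterministic policy
                    nn.Linear(128 + n_discrete, 128), nn.ReLU(),
                    nn.Linear(128, cont_dim), nn.Tanh())

            def forward(self, state):
                h = self.trunk(state)
                q = self.q_head(h)
                discrete = torch.argmax(q, dim=-1)  # greedy choice; add epsilon-greedy in training
                one_hot = nn.functional.one_hot(discrete, q.shape[-1]).float()
                cont = self.mu_head(torch.cat([h, one_hot], dim=-1))  # continuous action conditioned on it
                return discrete, cont, q

    Training would still need the usual replay/target machinery for the Q head and a critic for the continuous head; the sketch only shows how the two action heads can share one state encoding.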
    Using RL's exploitation to debug
    submitted by /u/robotphilanthropist [link] [comments]  ( 53 min )
    New student in RL wants to improve and learn more about algorithms
    I'm a student in RL (Captain Obvious) and my teacher told me about some algorithms like DQN and RND. I think I get the theoretical aspect of the DQN algorithm, but when it comes to understanding the algorithm written with PyTorch, for example, I feel completely lost. Maybe I don't understand the theoretical aspect well enough. Anyway, I need help understanding this algorithm: the coding aspect, and maybe the theoretical aspect too. Thank you in advance for any help you provide 🙏 submitted by /u/Satan_Sulfurox_666 [link] [comments]  ( 53 min )
    My last post about "training booster" got rating of 44%. Is it because of religious beliefs or technically it is not sound...?
    No comments were added. I know that there are better algorithms than DDPG right now, especially model based, entropy based, etc. If somebody know, it is possible to add noise regulation (not based on epsilon), e.g. regulate noise level if Q is increasing or not. Sometimes adding log probability can take a lot of time, is it true (it starts to explore a lot of policies)? submitted by /u/timurgepard [link] [comments]  ( 51 min )
    Getting started with Imitation Learning & Combinatorial Optimization
    Hi guys, my research direction is shop scheduling, and recently I have been thinking about a question: RL can help make decisions in combinatorial optimization (CO), but usually the result is far worse than the best solution obtained by metaheuristic algorithms. Can I do something more with CO? Actually, in the scheduling area, in some situations we can get the best solution easily by applying a metaheuristic. Can I use imitation learning to learn how the best solutions act over the whole process? My idea is very preliminary, so I hope you can give me some suggestions. Is there any existing research on this? Thank you very much for helping me! submitted by /u/JoPrimer [link] [comments]  ( 51 min )
  • Open

    Automated exploratory data analysis and model operationalization framework with a human in the loop
    Identifying, collecting, and transforming data is the foundation for machine learning (ML). According to a Forbes survey, there is widespread consensus among ML practitioners that data preparation accounts for approximately 80% of the time spent in developing a viable ML model. In addition, many of our customers face several challenges during the model operationalization phase […]  ( 15 min )
    Move Amazon SageMaker Autopilot ML models from experimentation to production using Amazon SageMaker Pipelines
    Amazon SageMaker Autopilot automatically builds, trains, and tunes the best custom machine learning (ML) models based on your data. It's an automated machine learning (AutoML) solution that eliminates the heavy lifting of handwritten ML models, which requires ML expertise. Data scientists need only provide a tabular dataset and select the target column to predict, […]  ( 9 min )
    Startups across AWS Accelerators use AI and ML to solve mission-critical customer challenges
    Relentless advancement in technology is improving the decision-making capacity of humans and enterprises alike. Digitization of the physical world has accelerated the three dimensions of data: velocity, variety, and volume. This has made information more widely available than before, allowing for advancements in problem-solving. Now, with cloud-enabled democratized availability, technologies like artificial intelligence (AI) and […]  ( 7 min )
  • Open

    Solving Kepler’s equation with Newton’s method
    In the introduction to his book Solving Kepler’s Equation Over Three Centuries, Peter Colwell says In virtually every decade from 1650 to the present there have appeared papers devoted to the Kepler problem and its solution. This is remarkable because Kepler’s equation isn’t that hard to solve. It cannot be solved in closed form using […] Solving Kepler’s equation with Newton’s method first appeared on John D. Cook.  ( 7 min )
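    The post itself only teases the topic, but the method is compact enough to show. A minimal sketch of Newton's method applied to Kepler's equation M = E - e*sin(E):

        import math

        def kepler_newton(M, e, tol=1e-12, max_iter=50):
            """Solve M = E - e*sin(E) for the eccentric anomaly E."""
            E = M if e < 0.8 else math.pi      # a common starting guess
            for _ in range(max_iter):
                step = (E - e * math.sin(E) - M) / (1.0 - e * math.cos(E))  # f / f'
                E -= step
                if abs(step) < tol:
                    break
            return E

        # Earth-like eccentricity, mean anomaly of 1 radian
        print(kepler_newton(1.0, 0.0167))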
    The world is lumpy
    The Pareto principle, or the 80-20 rule, says that 80% of output comes from 20% of inputs. For example, maybe the top 20% of salesmen generate 80% of a company’s revenue. For some reason, the Pareto principle angers some people. Mention the Pareto principle and someone will explain why it can’t be true, based on […] The world is lumpy first appeared on John D. Cook.  ( 6 min )
  • Open

    Attend the Data Science Symposium 2022, November 8 in Cincinnati
    Sponsored Post. The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day, in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022, November 8 in Cincinnati appeared first on Machine Learning Mastery.  ( 10 min )
  • Open

    Stormy Weather? Scientist Sharpens Forecasts With AI
    A perpetual shower of random raindrops falls inside a three-foot metal ring Dale Durran erected outside his front door (shown above). It’s a symbol of his passion for finding order in the seeming chaos of the planet’s weather. The post Stormy Weather? Scientist Sharpens Forecasts With AI appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Machine learning facilitates “turbulence tracking” in fusion reactors
    A new approach sheds light on the behavior of turbulent structures that can affect the energy generated during fusion reactions, with implications for reactor design.  ( 8 min )
    Using sound to model the world
    This machine-learning system can simulate how a listener would hear a sound from any point in a room.  ( 8 min )
  • Open

    Efficient Federated Learning on Knowledge Graphs via Privacy-preserving Relation Embedding Aggregation. (arXiv:2203.09553v3 [cs.AI] UPDATED)
    Federated learning (FL) can be essential in knowledge representation, reasoning, and data mining applications over multi-source knowledge graphs (KGs). A recent study, FedE, first proposed an FL framework that shares entity embeddings of KGs across all clients. However, entity embedding sharing in FedE incurs severe privacy leakage. Specifically, the known entity embedding can be used to infer whether a specific relation between two entities exists in a private client. In this paper, we introduce a novel attack method that aims to recover the original data based on the embedding information, which is further used to evaluate the vulnerabilities of FedE. Furthermore, we propose a Federated learning paradigm with privacy-preserving Relation embedding aggregation (FedR) to tackle the privacy issue in FedE. Besides, relation embedding sharing can significantly reduce the communication cost due to the smaller size of queries. We conduct extensive experiments to evaluate FedR with five different KG embedding models and three datasets. Compared to FedE, FedR achieves similar utility and significant improvements in privacy preservation and communication efficiency on the link prediction task.  ( 2 min )
    Mitigating Unfairness via Evolutionary Multi-objective Ensemble Learning. (arXiv:2210.16754v1 [cs.LG])
    In the literature on mitigating unfairness in machine learning, many fairness measures are designed to evaluate predictions of learning models and are also utilised to guide the training of fair models. It has been theoretically and empirically shown that there exist conflicts and inconsistencies among accuracy and multiple fairness measures. Optimising one or several fairness measures may sacrifice or degrade other measures. Two key questions should be considered: how to simultaneously optimise accuracy and multiple fairness measures, and how to optimise all the considered fairness measures more effectively. In this paper, we view the mitigating unfairness problem as a multi-objective learning problem considering the conflicts among fairness measures. A multi-objective evolutionary learning framework is used to simultaneously optimise several metrics (including accuracy and multiple fairness measures) of machine learning models. Then, ensembles are constructed based on the learned models in order to automatically balance different metrics. Empirical results on eight well-known datasets demonstrate that, compared with state-of-the-art approaches for mitigating unfairness, our proposed algorithm can provide decision-makers with better tradeoffs among accuracy and multiple fairness metrics. Furthermore, the high-quality models generated by the framework can be used to construct an ensemble that automatically achieves a better tradeoff among all the considered fairness metrics than other ensemble methods. Our code is publicly available at https://github.com/qingquan63/FairEMOL  ( 2 min )
    How "troll" are you? Measuring and detecting troll behavior in online social networks. (arXiv:2210.08786v2 [cs.SI] UPDATED)
    The detection of state-sponsored trolls acting in misinformation operations is an unsolved and critical challenge for the research community, with repercussions that go beyond the online realm. In this paper, we propose a novel approach for the detection of troll accounts, which consists of two steps. The first step classifies trajectories of accounts' online activities as belonging to either a troll account or an organic user account. In the second step, we exploit the classified trajectories to compute a metric, the "troll score", which quantifies the extent to which an account behaves like a troll. Experimental results show that our approach identifies accounts' trajectories with an AUC close to 99% and, accordingly, classifies trolls and organic users with an AUC of 97%. Finally, we evaluate whether the proposed solution can be generalized to different contexts (e.g., discussions about Covid-19) and to generic misbehaving users, showing promising results that will be further expanded in our future endeavors.  ( 3 min )
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v2 [cs.LG] UPDATED)
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, named neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments and case studies using synthetic and real-world datasets.  ( 2 min )
    The least-control principle for local learning at equilibrium. (arXiv:2207.01332v2 [cs.LG] UPDATED)
    Equilibrium systems are a powerful way to express neural computations. As special cases, they include models of great current interest in both neuroscience and machine learning, such as deep neural networks, equilibrium recurrent neural networks, deep equilibrium models, or meta-learning. Here, we present a new principle for learning such systems with a temporally- and spatially-local rule. Our principle casts learning as a least-control problem, where we first introduce an optimal controller to lead the system towards a solution state, and then define learning as reducing the amount of control needed to reach such a state. We show that incorporating learning signals within a dynamics as an optimal control enables transmitting activity-dependent credit assignment information, avoids storing intermediate states in memory, and does not rely on infinitesimal learning signals. In practice, our principle leads to strong performance matching that of leading gradient-based learning methods when applied to an array of problems involving recurrent neural networks and meta-learning. Our results shed light on how the brain might learn and offer new ways of approaching a broad class of machine learning problems.  ( 2 min )
    Filter and evolve: progressive pseudo label refining for semi-supervised automatic speech recognition. (arXiv:2210.16318v1 [cs.SD])
    Fine-tuning self-supervised pretrained models using pseudo labels can effectively improve speech recognition performance. However, low-quality pseudo labels can misguide decision boundaries and degrade performance. We propose a simple yet effective strategy to filter out low-quality pseudo labels to alleviate this problem. Specifically, pseudo labels are produced over the entire training set and filtered via average probability scores calculated from the model output. Subsequently, an optimal percentage of utterances with high probability scores are considered reliable training data with trustworthy labels. The model is iteratively updated to correct the unreliable pseudo labels and minimize the effect of noisy labels. The process above is repeated until the unreliable pseudo labels have been adequately corrected. Extensive experiments on LibriSpeech show that these filtered samples enable the refined model to yield more correct predictions, leading to better ASR performance under various experimental settings.  ( 2 min )
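    A minimal sketch of the filtering step described above (NumPy; all names hypothetical, not code from the paper): keep only the top fraction of utterances ranked by average model confidence:

        import numpy as np

        def filter_pseudo_labels(avg_scores, pseudo_labels, keep_fraction=0.7):
            """avg_scores[i]: average output probability of the model on utterance i."""
            order = np.argsort(avg_scores)[::-1]               # most confident first
            keep = order[: int(len(order) * keep_fraction)]
            return [(int(i), pseudo_labels[i]) for i in keep]  # trusted training pairs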
    Beyond prompting: Making Pre-trained Language Models Better Zero-shot Learners by Clustering Representations. (arXiv:2210.16637v1 [cs.CL])
    Recent work has demonstrated that pre-trained language models (PLMs) are zero-shot learners. However, most existing zero-shot methods involve heavy human engineering or complicated self-training pipelines, hindering their application to new situations. In this work, we show that zero-shot text classification can be improved simply by clustering texts in the embedding spaces of PLMs. Specifically, we fit the unlabeled texts with a Bayesian Gaussian Mixture Model after initializing cluster positions and shapes using class names. Despite its simplicity, this approach achieves superior or comparable performance on both topic and sentiment classification datasets and outperforms prior works significantly on unbalanced datasets. We further explore the applicability of our clustering approach by evaluating it on 14 datasets with more diverse topics, text lengths, and numbers of classes. Our approach achieves an average of 20% absolute improvement over prompt-based zero-shot learning. Finally, we compare different PLM embedding spaces and find that texts are well-clustered by topics even if the PLM is not explicitly pre-trained to generate meaningful sentence embeddings. This work indicates that PLM embeddings can categorize texts without task-specific fine-tuning, thus providing a new way to analyze and utilize their knowledge and zero-shot learning ability.  ( 2 min )
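    A rough sketch of the core idea (scikit-learn; `embeddings` and `class_name_embs` stand in for whatever PLM sentence-embedding step is used, and the paper's Bayesian GMM is simplified to a plain Gaussian mixture to keep the sketch short): fit a mixture whose components are initialized at the class-name embeddings, then read predictions off the cluster assignments:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        def zero_shot_by_clustering(embeddings, class_name_embs):
            # embeddings: (n_texts, d); class_name_embs: (n_classes, d)
            gmm = GaussianMixture(
                n_components=len(class_name_embs),
                means_init=class_name_embs,     # anchor each cluster at a class name
                covariance_type="diag",
            )
            gmm.fit(embeddings)
            return gmm.predict(embeddings)      # cluster index = predicted class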
    Learning Audio-Visual Dynamics Using Scene Graphs for Audio Source Separation. (arXiv:2210.16472v1 [cs.SD])
    There exists an unequivocal distinction between the sound produced by a static source and that produced by a moving one, especially when the source moves towards or away from the microphone. In this paper, we propose to use this connection between audio and visual dynamics for solving two challenging tasks simultaneously, namely: (i) separating audio sources from a mixture using visual cues, and (ii) predicting the 3D visual motion of a sounding source using its separated audio. Towards this end, we present Audio Separator and Motion Predictor (ASMP) -- a deep learning framework that leverages the 3D structure of the scene and the motion of sound sources for better audio source separation. At the heart of ASMP is a 2.5D scene graph capturing various objects in the video and their pseudo-3D spatial proximities. This graph is constructed by registering together 2.5D monocular depth predictions from the 2D video frames and associating the 2.5D scene regions with the outputs of an object detector applied on those frames. The ASMP task is then mathematically modeled as the joint problem of: (i) recursively segmenting the 2.5D scene graph into several sub-graphs, each associated with a constituent sound in the input audio mixture (which is then separated) and (ii) predicting the 3D motions of the corresponding sound sources from the separated audio. To empirically evaluate ASMP, we present experiments on two challenging audio-visual datasets, viz. Audio Separation in the Wild (ASIW) and Audio Visual Event (AVE). Our results demonstrate that ASMP achieves a clear improvement in source separation quality, outperforming prior works on both datasets, while also estimating the direction of motion of the sound sources better than other methods.  ( 3 min )
    Neural Combinatorial Logic Circuit Synthesis from Input-Output Examples. (arXiv:2210.16606v1 [cs.LG])
    We propose a novel, fully explainable neural approach to the synthesis of combinatorial logic circuits from input-output examples. The principal advantage of our method is that it readily extends to inductive scenarios, where the set of examples is incomplete but still indicative of the desired behaviour. Our method can be employed for a virtually arbitrary choice of atoms - from logic gates to FPGA blocks - as long as they can be formulated in a differentiable fashion, and it consistently yields good results for the synthesis of practical circuits of increasing size. In particular, we succeed in learning a number of arithmetic, bitwise, and signal-routing operations, and even generalise towards the correct behaviour in inductive scenarios. Our method, attacking a discrete logical synthesis problem with an explainable neural approach, hints at a wider promise for synthesis and reasoning-related tasks.  ( 2 min )
    Asymptotic-Preserving Neural Networks for hyperbolic systems with diffusive scaling. (arXiv:2210.09081v2 [math.NA] UPDATED)
    With the rapid advance of Machine Learning techniques and the deep increase of availability of scientific data, data-driven approaches have started to become progressively popular across science, causing a fundamental shift in the scientific method after proving to be powerful tools with a direct impact in many areas of society. Nevertheless, when attempting to analyze dynamics of complex multiscale systems, the usage of standard Deep Neural Networks (DNNs) and even standard Physics-Informed Neural Networks (PINNs) may lead to incorrect inferences and predictions, due to the presence of small scales leading to reduced or simplified models in the system that have to be applied consistently during the learning process. In this Chapter, we will address these issues in light of recent results obtained in the development of Asymptotic-Preserving Neural Networks (APNNs) for hyperbolic models with diffusive scaling. Several numerical tests show how APNNs provide considerably better results with respect to the different scales of the problem when compared with standard DNNs and PINNs, especially when analyzing scenarios in which only little and scattered information is available.  ( 2 min )
    Predicting Drug-Drug Interactions using Deep Generative Models on Graphs. (arXiv:2209.09941v3 [q-bio.BM] UPDATED)
    Latent representations of drugs and their targets produced by contemporary graph autoencoder-based models have proved useful in predicting many types of node-pair interactions on large networks, including drug-drug, drug-target, and target-target interactions. However, most existing approaches model the nodes' latent spaces in which node distributions are rigid and disjoint; these limitations hinder the methods from generating new links among pairs of nodes. In this paper, we present the effectiveness of variational graph autoencoders (VGAE) in modeling latent node representations on multimodal networks. Our approach can produce flexible latent spaces for each node type of the multimodal graph; the embeddings are later used for predicting links among node pairs under different edge types. To further enhance the models' performance, we suggest a new method that concatenates Morgan fingerprints, which capture the molecular structure of each drug, with their latent embeddings before passing them to the decoding stage for link prediction. Our proposed model shows competitive results on two multimodal networks: (1) a multi-graph consisting of drug and protein nodes, and (2) a multi-graph consisting of drug and cell line nodes. Our source code is publicly available at https://github.com/HySonLab/drug-interactions.  ( 2 min )
    Spectral Representation Learning for Conditional Moment Models. (arXiv:2210.16525v1 [stat.ML])
    Many problems in causal inference and economics can be formulated in the framework of conditional moment models, which characterize the target function through a collection of conditional moment restrictions. For nonparametric conditional moment models, efficient estimation has always relied on preimposed conditions on various measures of ill-posedness of the hypothesis space, which are hard to validate when flexible models are used. In this work, we address this issue by proposing a procedure that automatically learns representations with controlled measures of ill-posedness. Our method approximates a linear representation defined by the spectral decomposition of a conditional expectation operator, which can be used for kernelized estimators and is known to facilitate minimax optimal estimation in certain settings. We show this representation can be efficiently estimated from data, and establish L2 consistency for the resulting estimator. We evaluate the proposed method on proximal causal inference tasks, exhibiting promising performance on high-dimensional, semi-synthetic data.
    Learning Individual Interactions from Population Dynamics with Discrete-Event Simulation Model. (arXiv:2205.02332v2 [cs.LG] UPDATED)
    The abundance of data allows researchers to pursue more powerful computational tools to learn the dynamics of complex systems, such as neural networks, engineered systems, and social networks. Traditional machine learning approaches capture complex system dynamics either with dynamic Bayesian networks and state space models, which are hard to scale because it is non-trivial to prescribe the dynamics with a sparse graph or a system of differential equations, or with deep neural networks, where the distributed representation of the learned dynamics is hard to interpret. In this paper, we explore the possibility of learning a discrete-event simulation representation of complex system dynamics, assuming a multivariate normal distribution of the state variables, based on the observation that many complex system dynamics can be decomposed into a sequence of local interactions, which individually change the system state only minimally but in sequence generate complex and diverse dynamics. Our results show that the algorithm can data-efficiently capture complex network dynamics in several fields with meaningful events.
    Federated X-Armed Bandit. (arXiv:2205.15268v2 [stat.ML] UPDATED)
    This work establishes the first framework of federated $\mathcal{X}$-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named \texttt{Fed-PNE}. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it only requires logarithmic communications between the central server and clients, protecting the client privacy. Experimental results on synthetic functions and real datasets validate the advantages of \texttt{Fed-PNE} over single-client algorithms and federated multi-armed bandit algorithms.
    LearningGroup: A Real-Time Sparse Training on FPGA via Learnable Weight Grouping for Multi-Agent Reinforcement Learning. (arXiv:2210.16624v1 [cs.AR])
    Multi-agent reinforcement learning (MARL) is a powerful technology for constructing interactive artificial intelligence systems in various applications such as multi-robot control and self-driving cars. Unlike supervised models or single-agent reinforcement learning, which actively exploit network pruning, it is unclear how pruning will work in multi-agent reinforcement learning, with its cooperative and interactive characteristics. In this paper, we present a real-time sparse training acceleration system named LearningGroup, which adopts network pruning for the training of MARL for the first time, using an algorithm/architecture co-design approach. We create sparsity using a weight grouping algorithm and propose an on-chip sparse data encoding loop (OSEL) that enables fast encoding with an efficient implementation. Based on the OSEL's encoding format, LearningGroup performs efficient weight compression and allocates the computation workload to multiple cores, where each core handles multiple sparse rows of the weight matrix simultaneously with vector processing units. As a result, the LearningGroup system reduces the cycle time and memory footprint for sparse data generation by up to 5.72x and 6.81x, respectively. Its FPGA accelerator shows 257.40-3629.48 GFLOPS throughput and 7.10-100.12 GFLOPS/W energy efficiency for various conditions in MARL, which is 7.13x higher and 12.43x more energy efficient than an Nvidia Titan RTX GPU, thanks to the fully on-chip training and highly optimized dataflow/data format provided by the FPGA. Most importantly, the accelerator shows speedups of up to 12.52x for processing sparse data over the dense case, the highest among state-of-the-art sparse training accelerators.
    Pre-Trained Language Models for Interactive Decision-Making. (arXiv:2202.01771v4 [cs.LG] UPDATED)
    Language model (LM) pre-training is useful in many language processing tasks. But can pre-trained LMs be further leveraged for more general machine learning problems? We propose an approach for using LMs to scaffold learning and generalization in general sequential decision-making problems. In this approach, goals and observations are represented as a sequence of embeddings, and a policy network initialized with a pre-trained LM predicts the next action. We demonstrate that this framework enables effective combinatorial generalization across different environments and supervisory modalities. We begin by assuming access to a set of expert demonstrations, and show that initializing policies with LMs and fine-tuning them via behavior cloning improves task completion rates by 43.6% in the VirtualHome environment. Next, we integrate an active data gathering procedure in which agents iteratively interact with the environment, relabel past "failed" experiences with new goals, and update their policies in a self-supervised loop. Active data gathering further improves combinatorial generalization, outperforming the best baseline by 25.1%. Finally, we explain these results by investigating three possible factors underlying the effectiveness of the LM-based policy. We find that sequential input representations (vs. fixed-dimensional feature vectors) and LM-based weight initialization are both important for generalization. Surprisingly, however, the format of the policy inputs encoding (e.g. as a natural language string vs. an arbitrary sequential encoding) has little influence. Together, these results suggest that language modeling induces representations that are useful for modeling not just language, but also goals and plans; these representations can aid learning and generalization even outside of language processing.
    Universality of empirical risk minimization. (arXiv:2202.08832v2 [math.ST] UPDATED)
    Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and ${y} \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, . . . , {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$ , and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed $-$to leading order$-$ under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
    Out-of-distribution generalization for learning quantum dynamics. (arXiv:2204.10268v2 [quant-ph] UPDATED)
    Generalization bounds are a critical tool to assess the training data requirements of Quantum Machine Learning (QML). Recent work has established guarantees for in-distribution generalization of quantum neural networks (QNNs), where training and testing data are assumed to be drawn from the same data distribution. However, there are currently no results on out-of-distribution generalization in QML, where we require a trained model to perform well even on data drawn from a distribution different from the training distribution. In this work, we prove out-of-distribution generalization for the task of learning an unknown unitary using a QNN and for a broad class of training and testing distributions. In particular, we show that one can learn the action of a unitary on entangled states using only product state training data. We numerically illustrate this by showing that the evolution of a Heisenberg spin chain can be learned using only product training states. Since product states can be prepared using only single-qubit gates, this advances the prospects of learning quantum dynamics using near term quantum computers and quantum experiments, and further opens up new methods for both the classical and quantum compilation of quantum circuits.
    MSGNN: A Spectral Graph Neural Network Based on a Novel Magnetic Signed Laplacian. (arXiv:2209.00546v3 [stat.ML] UPDATED)
    Signed and directed networks are ubiquitous in real-world applications. However, there has been relatively little work proposing spectral graph neural networks (GNNs) for such networks. Here we introduce a signed directed Laplacian matrix, which we call the magnetic signed Laplacian, as a natural generalization of both the signed Laplacian on signed graphs and the magnetic Laplacian on directed graphs. We then use this matrix to construct a novel efficient spectral GNN architecture and conduct extensive experiments on both node clustering and link prediction tasks. In these experiments, we consider tasks related to signed information, tasks related to directional information, and tasks related to both signed and directional information. We demonstrate that our proposed spectral GNN is effective for incorporating both signed and directional information, and attains leading performance on a wide range of data sets. Additionally, we provide a novel synthetic network model, which we refer to as the signed directed stochastic block model, and a number of novel real-world data sets based on lead-lag relationships in financial time series.
    GowFed -- A novel Federated Network Intrusion Detection System. (arXiv:2210.16441v1 [cs.LG])
    Network intrusion detection systems are evolving into intelligent systems that perform data analysis while searching for anomalies in their environment. Indeed, the development of deep learning techniques paved the way to building more complex and effective threat detection models. However, training those models may be computationally infeasible on most Edge or IoT devices. Current approaches rely on powerful centralized servers that receive data from all parties -- violating basic privacy constraints and substantially affecting response times and operational costs due to huge communication overheads. To mitigate these issues, Federated Learning emerged as a promising approach, where different agents collaboratively train a shared model without exposing training data to others or requiring a compute-intensive centralized infrastructure. This work presents GowFed, a novel network threat detection system that combines Gower Dissimilarity matrices and Federated averaging. Two variants of GowFed have been developed based on state-of-the-art knowledge: (1) a vanilla version; and (2) a version instrumented with an attention mechanism. Furthermore, each variant has been tested using simulation-oriented tools provided by the TensorFlow Federated framework. In parallel, a centralized analogue of the federated systems was developed to explore their differences in terms of scalability and performance across a set of designed experiments/scenarios. Overall, GowFed intends to be the first stepping stone towards the combined usage of Federated Learning and Gower Dissimilarity matrices to detect network threats in industrial-level networks.
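    For readers unfamiliar with the metric the abstract builds on: Gower dissimilarity averages per-feature distances over mixed numeric/categorical data. A small sketch (not code from the paper):

        import numpy as np

        def gower(x, y, numeric_mask, ranges):
            """Range-normalized absolute difference for numeric features,
            0/1 mismatch for categorical ones, averaged over all features."""
            d = 0.0
            for j, (xj, yj) in enumerate(zip(x, y)):
                if numeric_mask[j]:
                    d += abs(xj - yj) / ranges[j]  # ranges[j]: feature j's range over the dataset
                else:
                    d += float(xj != yj)
            return d / len(x)

        # Example: (duration_s, protocol) records
        print(gower((30, "tcp"), (40, "udp"), numeric_mask=[True, False], ranges=[50, None]))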
    Neural network quantum state with proximal optimization: a ground-state searching scheme based on variational Monte Carlo. (arXiv:2210.16493v1 [cond-mat.dis-nn])
    Neural network quantum states (NQS), combined with the variational Monte Carlo (VMC) method, have been shown to be a promising way to investigate quantum many-body physics. Whereas vanilla VMC methods perform one gradient update per sample, we introduce a novel objective function with proximal optimization (PO) that enables multiple updates by reusing mismatched samples. Our VMC-PO method keeps the advantage of the previous importance sampling gradient optimization algorithm [L. Yang, {\it et al}, Phys. Rev. Research {\bf 2}, 012039(R)(2020)], which efficiently uses sampled states. PO mitigates numerical instabilities during network updates, similarly to stochastic reconfiguration (SR) methods, but achieves an alternative and simpler implementation with lower computational complexity. We investigate the performance of our VMC-PO algorithm for ground-state searching with a 1-dimensional transverse-field Ising model and a 2-dimensional Heisenberg antiferromagnet on a square lattice, and demonstrate that the reached ground-state energies are comparable to state-of-the-art results.
    DuDe: Dual-Decoder Multilingual ASR for Indian Languages using Common Label Set. (arXiv:2210.16739v1 [eess.AS])
    In a multilingual country like India, multilingual Automatic Speech Recognition (ASR) systems have great potential. Multilingual ASR systems exhibit many advantages over monolingual ASR systems, such as scalability, maintainability, and improved performance. However, building multilingual systems for Indian languages is challenging since different languages use different scripts for writing. On the other hand, Indian languages share a lot of common sounds. The Common Label Set (CLS) exploits this idea and maps graphemes of various languages with similar sounds to common labels. Since Indian languages are mostly phonetic, building a parser to convert from native script to CLS is easy. In this paper, we explore various approaches to building multilingual ASR models. We also propose a novel architecture called Encoder-Decoder-Decoder for building multilingual systems that use both CLS and native-script labels. Finally, we analyze the effectiveness of CLS-based multilingual systems combined with machine transliteration.
    Facetron: A Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations. (arXiv:2107.12003v2 [cs.CV] UPDATED)
    In this paper, we propose a multi-speaker face-to-speech waveform generation model that also works for unseen speaker conditions. Using a generative adversarial network (GAN) with linguistic and speaker characteristic features as auxiliary conditions, our method directly converts face images into speech waveforms under an end-to-end training framework. The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images using cross-modal learning with a pre-trained acoustic model. Since these two features are uncorrelated and controlled independently, we can flexibly synthesize speech waveforms whose speaker characteristics vary depending on the input face images. We show the superiority of our proposed model over conventional methods in terms of objective and subjective evaluation results. Specifically, we evaluate the performance of the linguistic features by measuring their accuracy on an automatic speech recognition task. In addition, we estimate speaker and gender similarity for multi-speaker and unseen conditions, respectively. We also evaluate the naturalness of the synthesized speech waveforms using a mean opinion score (MOS) test and non-intrusive objective speech quality assessment (NISQA). Demo samples of the proposed and other models are available at https://sam-0927.github.io/
    Label Efficient Regularization and Propagation for Graph Node Classification. (arXiv:2204.08646v2 [cs.LG] UPDATED)
    An enhanced label propagation (LP) method called GraphHop was proposed recently. It outperforms graph convolutional networks (GCNs) in the semi-supervised node classification task on various networks. Although the performance of GraphHop was explained intuitively with joint node attribute and label signal smoothening, its rigorous mathematical treatment is lacking. In this paper, we propose a label efficient regularization and propagation (LERP) framework for graph node classification, and present an alternate optimization procedure for its solution. Furthermore, we show that GraphHop only offers an approximate solution to this framework and has two drawbacks. First, it includes all nodes in the classifier training without taking the reliability of pseudo-labeled nodes into account in the label update step. Second, it provides a rough approximation to the optimum of a subproblem in the label aggregation step. Based on the LERP framework, we propose a new method, named the LERP method, to solve these two shortcomings. LERP determines reliable pseudo-labels adaptively during the alternate optimization and provides a better approximation to the optimum with computational efficiency. Theoretical convergence of LERP is guaranteed. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of LERP. That is, LERP outperforms all benchmarking methods, including GraphHop, consistently on five test datasets and an object recognition task at extremely low label rates (i.e., 1, 2, 4, 8, 16, and 20 labeled samples per class).
    Atlas: Automate Online Service Configuration in Network Slicing. (arXiv:2210.16902v1 [cs.LG])
    Network slicing achieves cost-efficient slice customization to support heterogeneous applications and services. Configuring cross-domain resources to end-to-end slices based on service-level agreements, however, is challenging, due to the complicated underlying correlations and the simulation-to-reality discrepancy between simulators and real networks. In this paper, we propose Atlas, an online network slicing system, which automates the service configuration of slices via safe and sample-efficient learn-to-configure approaches in three interrelated stages. First, we design a learning-based simulator to reduce the sim-to-real discrepancy, which is accomplished by a new parameter searching method based on Bayesian optimization. Second, we offline train the policy in the augmented simulator via a novel offline algorithm with a Bayesian neural network and parallel Thompson sampling. Third, we online learn the policy in real networks with a novel online algorithm with safe exploration and Gaussian process regression. We implement Atlas on an end-to-end network prototype based on OpenAirInterface RAN, OpenDayLight SDN transport, OpenAir-CN core network, and Docker-based edge server. Experimental results show that, compared to state-of-the-art solutions, Atlas achieves 63.9% and 85.7% regret reduction on resource usage and slice quality of experience during the online learning stage, respectively.
    Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction. (arXiv:2210.16872v1 [cs.LG])
    The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we present a computationally-intractable, exact planning algorithm whose efficiency is governed by this measure. We then conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and gives rise to a computationally-tractable, approximate planning algorithm.  ( 2 min )
    Imbalanced Classification via Explicit Gradient Learning From Augmented Data. (arXiv:2202.10550v2 [cs.LG] UPDATED)
    Learning from imbalanced data is one of the most significant challenges in real-world classification tasks. In such cases, the performance of neural networks is substantially impaired due to a preference towards the majority class. Existing approaches attempt to eliminate this bias through data re-sampling or re-weighting the loss in the learning process. Still, these methods tend to overfit the minority samples and perform poorly when the structure of the minority class is highly irregular. Here, we propose a novel deep meta-learning technique to augment a given imbalanced dataset with new minority instances. These additional data are incorporated in the classifier's deep-learning process, and their contributions are learned explicitly. The advantage of the proposed method is demonstrated on synthetic and real-world datasets with various imbalance ratios.
    Self-Improving Safety Performance of Reinforcement Learning Based Driving with Black-Box Verification Algorithms. (arXiv:2210.16575v1 [cs.AI])
    In this work, we propose a self-improving artificial intelligence system for enhancing the safety performance of reinforcement learning (RL) based autonomous driving (AD) agents using black-box verification methods. RL methods have enjoyed popularity among AD applications in recent years. That being said, the performance of existing RL algorithms depends strongly on the diversity of the training scenarios. A lack of safety-critical scenarios during training can lead to poor generalization performance in real-world driving applications. We propose a novel framework in which the weaknesses of the training set are explored via black-box verification methods. After the discovery of AD failure scenarios, the training of the RL agent is re-initiated to improve performance on the previously unsafe scenarios. Simulation results show that the proposed approach efficiently discovers such safety failures in RL-based adaptive cruise control (ACC) applications and significantly reduces the number of vehicle collisions through iterative applications of our method.
    Robust Distributed Learning Against Both Distributional Shifts and Byzantine Attacks. (arXiv:2210.16682v1 [cs.LG])
    In distributed learning systems, robustness issues may arise from two sources. On one hand, due to distributional shifts between training data and test data, the trained model could exhibit poor out-of-sample performance. On the other hand, a portion of working nodes might be subject to byzantine attacks, which could invalidate the learning result. Existing works mostly deal with these two issues separately. In this paper, we propose a new algorithm that equips distributed learning with robustness measures against both distributional shifts and byzantine attacks. Our algorithm is built on recent advances in distributionally robust optimization as well as norm-based screening (NBS), a robust aggregation scheme against byzantine attacks. We provide convergence proofs in three cases of the learning model being nonconvex, convex, and strongly convex for the proposed algorithm, shedding light on its convergence behaviors and resilience against byzantine attacks. In particular, we deduce that any algorithm employing NBS (including ours) cannot converge when the percentage of byzantine nodes is 1/3 or higher, instead of 1/2, which is the common belief in the current literature. The experimental results demonstrate the effectiveness of our algorithm against both robustness issues. To the best of our knowledge, this is the first work to address distributional shifts and byzantine attacks simultaneously.
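    A sketch of the NBS aggregation step as described: the gradients with the largest norms are screened out before averaging. The drop fraction of 1/3 below echoes the paper's breakdown point but is otherwise an assumption.

        import numpy as np

        def nbs_aggregate(grads, drop_frac=1/3):
            """grads: list of 1-D gradient vectors from the working nodes."""
            G = np.stack(grads)
            norms = np.linalg.norm(G, axis=1)
            n_keep = int(len(grads) * (1 - drop_frac))   # gradients to keep
            keep = np.argsort(norms)[:n_keep]            # smallest norms first
            return G[keep].mean(axis=0)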
    SBI: A Simulation-Based Test of Identifiability for Bayesian Causal Inference. (arXiv:2102.11761v2 [cs.LG] UPDATED)
    A growing family of approaches to causal inference rely on Bayesian formulations of assumptions that go beyond causal graph structure. For example, Bayesian approaches have been developed for analyzing instrumental variable designs, regression discontinuity designs, and within-subjects designs. This paper introduces simulation-based identifiability (SBI), a procedure for testing the identifiability of queries in Bayesian causal inference approaches that are implemented as probabilistic programs. SBI complements analytical approaches to identifiability, leveraging a particle-based optimization scheme on simulated data to determine identifiability for analytically intractable models. We analyze SBI's soundness for a broad class of differentiable, finite-dimensional probabilistic programs with bounded effects. Finally, we provide an implementation of SBI using stochastic gradient descent, and show empirically that it agrees with known identification results on a suite of graph-based and quasi-experimental design benchmarks, including those using Gaussian processes.
    Deep Policies for Online Bipartite Matching: A Reinforcement Learning Approach. (arXiv:2109.10380v3 [cs.LG] UPDATED)
    The challenge in the widely applicable online matching problem lies in making irrevocable assignments while there is uncertainty about future inputs. Most theoretically-grounded policies are myopic or greedy in nature. In real-world applications where the matching process is repeated on a regular basis, the underlying data distribution can be leveraged for better decision-making. We present an end-to-end Reinforcement Learning framework for deriving better matching policies based on trial-and-error on historical data. We devise a set of neural network architectures, design feature representations, and empirically evaluate them across two online matching problems: Edge-Weighted Online Bipartite Matching and Online Submodular Bipartite Matching. We show that most of the learning approaches perform consistently better than classical baseline algorithms on four synthetic and real-world datasets. On average, our proposed models improve the matching quality by 3-10% on a variety of synthetic and real-world datasets. Our code is publicly available at https://github.com/lyeskhalil/CORL.
    Group Equality in Adaptive Submodular Maximization. (arXiv:2207.03364v2 [cs.LG] UPDATED)
    In this paper, we study the classic submodular maximization problem subject to a group equality constraint under both non-adaptive and adaptive settings. It has been shown that the utility function of many machine learning applications, including data summarization, influence maximization in social networks, and personalized recommendation, satisfies the property of submodularity. Hence, maximizing a submodular function subject to various constraints can be found at the heart of many of those applications. On a high level, submodular maximization aims to select a group of most representative items (e.g., data points). However, the design of most existing algorithms does not incorporate the fairness constraint, leading to under- or over-representation of some particular groups. This motivates us to study the submodular maximization problem with group equality, where we aim to select a group of items to maximize a (possibly non-monotone) submodular utility function subject to a group equality constraint. To this end, we develop the first constant-factor approximation algorithm for this problem. The design of our algorithm is robust enough to be extended to solving the submodular maximization problem under a more complicated adaptive setting. Moreover, we further extend our study to incorporating a global cardinality constraint.
    BIMRL: Brain Inspired Meta Reinforcement Learning. (arXiv:2210.16530v1 [cs.LG])
    Sample efficiency has been a key issue in reinforcement learning (RL). An efficient agent must be able to leverage its prior experiences to quickly adapt to similar, but new tasks and situations. Meta-RL is one attempt at formalizing and addressing this issue. Inspired by recent progress in meta-RL, we introduce BIMRL, a novel multi-layer architecture along with a novel brain-inspired memory module that will help agents quickly adapt to new tasks within a few episodes. We also utilize this memory module to design a novel intrinsic reward that will guide the agent's exploration. Our architecture is inspired by findings in cognitive neuroscience and is compatible with the knowledge on connectivity and functionality of different regions in the brain. We empirically validate the effectiveness of our proposed method by competing with or surpassing the performance of some strong baselines on multiple MiniGrid environments.  ( 2 min )
    ACMP: Allen-Cahn Message Passing for Graph Neural Networks with Particle Phase Transition. (arXiv:2206.05437v2 [cs.LG] UPDATED)
    Neural message passing is a basic feature extraction unit for graph-structured data that accounts for the impact of neighboring node features in network propagation from one layer to the next. We model this process by an interacting particle system with attractive and repulsive forces and the Allen-Cahn force arising in the modeling of phase transition. The system is a reaction-diffusion process which can separate particles into different clusters. This induces an Allen-Cahn message passing (ACMP) for graph neural networks, where the numerical iteration for the solution constitutes the message passing propagation. The mechanism behind ACMP is the phase transition of particles, which enables the formation of multi-clusters and thus GNN prediction for node classification. ACMP can propel the network depth to hundreds of layers with a theoretically proven, strictly positive lower bound on the Dirichlet energy. It thus provides a deep model of GNNs which circumvents the common GNN problem of oversmoothing. Experiments on various real node classification datasets, including those with high homophily difficulty, show that GNNs with ACMP can achieve state-of-the-art performance with no decay of Dirichlet energy.
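    A schematic single ACMP step under our simplifying assumptions (mean-neighbor aggregation, scalar attraction/repulsion weights); the term delta * x * (1 - x^2) is the Allen-Cahn double-well force the abstract refers to.

        import numpy as np

        def acmp_step(X, A, alpha=1.0, beta=0.5, delta=0.1, dt=0.1):
            """X: node features (n x d); A: adjacency (n x n)."""
            agg = (A @ X) / (A.sum(axis=1, keepdims=True) + 1e-12)  # neighbor mean
            attract = alpha * (agg - X)              # diffusion toward neighbors
            repulse = -beta * agg                    # repulsive component
            allen_cahn = delta * X * (1.0 - X ** 2)  # phase-transition force
            return X + dt * (attract + repulse + allen_cahn)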
    Traffic-Twitter Transformer: A Nature Language Processing-joined Framework For Network-wide Traffic Forecasting. (arXiv:2206.11078v3 [cs.LG] UPDATED)
    With accurate and timely traffic forecasting, impacted traffic conditions can be predicted in advance to guide agencies and residents to respond appropriately to changes in traffic patterns. However, existing works on traffic forecasting mainly relied on historical traffic patterns, confining them to short-term prediction (under 1 hour, for instance). To better manage future roadway capacity and accommodate social and human impacts, it is crucial to propose a flexible and comprehensive framework to predict physical-aware long-term traffic conditions for public users and transportation agencies. In this paper, the gap in robust long-term traffic forecasting was bridged by taking social media features into consideration. A correlation study and a linear regression model were first implemented to evaluate the significance of the correlation between two time-series data, traffic intensity and Twitter data intensity. The two time series were then fed into our proposed social-aware framework, Traffic-Twitter Transformer, which integrates Natural Language representations into time-series records for long-term traffic prediction. Experimental results in the Greater Seattle Area showed that our proposed model outperformed baseline models in all evaluation metrics. This NLP-joined social-aware framework can become a valuable tool for network-wide traffic prediction and management for traffic agencies.
    SupervisorBot: NLP-Annotated Real-Time Recommendations of Psychotherapy Treatment Strategies with Deep Reinforcement Learning. (arXiv:2208.13077v2 [cs.CL] UPDATED)
    We propose a recommendation system that suggests treatment strategies to a therapist during the psychotherapy session in real-time. Our system uses a turn-level rating mechanism that predicts the therapeutic outcome by computing a similarity score between the deep embedding of a scoring inventory and the current sentence that the patient is speaking. The system automatically transcribes a continuous audio stream, separates it into patient and therapist turns, and performs real-time inference of their therapeutic working alliance. The dialogue pairs, along with their computed working alliance scores as ratings, are then fed into a deep reinforcement learning recommendation system, where sessions are treated as users and topics are treated as items. Beyond evaluating the empirical advantages of the core components on an existing dataset of psychotherapy sessions, we demonstrate the effectiveness of this system in a web app.
    Fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v3 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, and sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models. In this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can be useful also in the context of other mixture of experts models.  ( 2 min )
    Exploiting prompt learning with pre-trained language models for Alzheimer's Disease detection. (arXiv:2210.16539v1 [cs.CL])
    Early diagnosis of Alzheimer's disease (AD) is crucial in facilitating preventive care and delaying further progression. Speech-based automatic AD screening systems provide a non-intrusive and more scalable alternative to other clinical screening techniques. Textual embedding features produced by pre-trained language models (PLMs) such as BERT are widely used in such systems. However, PLM domain fine-tuning is commonly based on masked word or sentence prediction costs that are inconsistent with the back-end AD detection task. To this end, this paper investigates prompt-based fine-tuning of PLMs, which consistently uses AD classification errors as the training objective function. Disfluency features based on hesitation or pause filler token frequencies are further incorporated into prompt phrases during PLM fine-tuning. To exploit the complementarity between BERT- or RoBERTa-based PLMs that are either prompt-learning fine-tuned or optimized using conventional masked word or sentence prediction costs, decision-voting-based system combination between them is further applied. The mean, standard deviation, and maximum of accuracy scores over 15 experiment runs are adopted as performance measures for the AD detection system. Mean detection accuracies of 84.20% (std 2.09%, best 87.5%) and 82.64% (std 4.0%, best 89.58%) were obtained using manual and ASR speech transcripts, respectively, on the ADReSS20 test set consisting of 48 elderly speakers.  ( 2 min )
    Stability Analysis and Generalization Bounds of Adversarial Training. (arXiv:2210.00960v2 [cs.LG] UPDATED)
    In adversarial machine learning, deep neural networks can fit the adversarial examples on the training dataset but have poor generalization ability on the test set. This phenomenon is called robust overfitting, and it can be observed when adversarially training neural nets on common datasets, including SVHN, CIFAR-10, CIFAR-100, and ImageNet. In this paper, we study the robust overfitting issue of adversarial training by using tools from uniform stability. One major challenge is that the outer function (as a maximization of the inner function) is nonsmooth, so the standard technique (e.g., Hardt et al., 2016) cannot be applied. Our approach is to consider $\eta$-approximate smoothness: we show that the outer function satisfies this modified smoothness assumption with $\eta$ being a constant related to the adversarial perturbation $\epsilon$. Based on this, we derive stability-based generalization bounds for stochastic gradient descent (SGD) on the general class of $\eta$-approximate smooth functions, which covers the adversarial loss. Our results suggest that robust test accuracy decreases in $\epsilon$ when $T$ is large, with a speed between $\Omega(\epsilon\sqrt{T})$ and $\mathcal{O}(\epsilon T)$. This phenomenon is also observed in practice. Additionally, we show that a few popular techniques for adversarial training (e.g., early stopping, cyclic learning rate, and stochastic weight averaging) are stability-promoting in theory.
    Robust Boosting Forests with Richer Deep Feature Hierarchy. (arXiv:2210.16451v1 [cs.CV])
    We propose a robust variant of boosting forests as an alternative to existing adversarial defense methods, and apply it to enhance the robustness of deep neural networks. We retain the deep network architecture, weights, and middle-layer features, then install a gradient boosting forest to select features from each layer of the deep network and predict the target. For training each decision tree, we propose a novel conservative-greedy trade-off: the splitting criterion favors fewer mispredictions over pure gain, making it suboptimal but conservative, while tree depth is actively increased to recover accuracy with splits over more features, which is the greedy aspect. We propose a new task on 3D face models, whose robustness has not been carefully studied despite the great security and privacy concerns related to face analytics. We tried a simple attack method on a pure convolutional neural network (CNN) face shape estimator, making it degenerate to outputting only the average face shape under an invisible perturbation. Our conservative-greedy boosting forest (CGBF) showed a great improvement on face landmark datasets over pure deep learning methods under adversarial attacks.
    Edgeless-GNN: Unsupervised Representation Learning for Edgeless Nodes. (arXiv:2104.05225v3 [cs.SI] UPDATED)
    We study the problem of embedding edgeless nodes, such as users who newly enter the underlying network, using graph neural networks (GNNs), which are widely studied for effective representation learning on graphs. Our study is motivated by the fact that GNNs cannot be straightforwardly adopted for this problem, since message passing to edgeless nodes having no connections is impossible. To tackle this challenge, we propose Edgeless-GNN, a novel inductive framework that enables GNNs to generate node embeddings even for edgeless nodes through unsupervised learning. Specifically, we start by constructing a proxy graph, based on the similarity of node attributes, to serve as the GNN's computation graph in place of the one defined by the underlying network. The known network structure is used to train the model parameters, with a topology-aware loss function established in such a way that our model judiciously learns the network structure by encoding positive, negative, and second-order relations between nodes. For the edgeless nodes, we inductively infer embeddings by expanding the computation graph. By evaluating the performance of various downstream machine learning tasks, we empirically demonstrate that Edgeless-GNN exhibits (a) superiority over state-of-the-art inductive network embedding methods for edgeless nodes, (b) effectiveness of our topology-aware loss function, (c) robustness to incomplete node attributes, and (d) linear scaling with graph size.
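    A minimal sketch of the proxy-graph construction, assuming cosine similarity over node attributes and a fixed k (both our choices): every node, including edgeless ones, is wired to its k most attribute-similar nodes so that message passing is defined everywhere.

        import numpy as np

        def knn_proxy_graph(X, k=5):
            """X: node attribute matrix (n x d). Returns a symmetric 0/1 adjacency."""
            Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
            S = Xn @ Xn.T                            # cosine similarity
            np.fill_diagonal(S, -np.inf)             # exclude self-loops
            nbrs = np.argsort(-S, axis=1)[:, :k]     # top-k similar nodes per row
            A = np.zeros(S.shape)
            rows = np.repeat(np.arange(len(X)), k)
            A[rows, nbrs.ravel()] = 1.0
            return np.maximum(A, A.T)                # symmetrize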
    Novel Policy Seeking with Constrained Optimization. (arXiv:2005.10696v3 [cs.LG] UPDATED)
    In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods following the new perspective. The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show our methods can achieve substantial improvement over previous novelty-seeking methods in terms of both the novelty of policies and their performance on the primal task.
    Modeling Perceptual Loudness of Piano Tone: Theory and Applications. (arXiv:2209.10674v2 [cs.SD] UPDATED)
    The relationship between perceptual loudness and physical attributes of sound is an important subject in both computer music and psychoacoustics. Early studies of "equal-loudness contour" can trace back to the 1920s and the measured loudness with respect to intensity and frequency has been revised many times since then. However, most studies merely focus on synthesized sound, and the induced theories on natural tones with complex timbre have rarely been justified. To this end, we investigate both theory and applications of natural-tone loudness perception in this paper via modeling piano tone. The theory part contains: 1) an accurate measurement of piano-tone equal-loudness contour of pitches, and 2) a machine-learning model capable of inferring loudness purely based on spectral features trained on human subject measurements. As for the application, we apply our theory to piano control transfer, in which we adjust the MIDI velocities on two different player pianos (in different acoustic environments) to achieve the same perceptual effect. Experiments show that both our theoretical loudness modeling and the corresponding performance control transfer algorithm significantly outperform their baselines.
    ODNet: A Convolutional Neural Network for Asteroid Occultation Detection. (arXiv:2210.16440v1 [astro-ph.EP])
    We propose to design and build an algorithm that uses a Convolutional Neural Network (CNN) and observations from the Unistellar network to reliably detect asteroid occultations. The Unistellar Network, made up of more than 10,000 digital telescopes owned by citizen scientists, is regularly used to record asteroid occultations. In order to process the increasing amount of observational data produced by this network, we need a quick and reliable way to analyze occultations. To solve this problem, we trained a CNN with artificial images of stars exhibiting twenty different types of photometric signals. Inputs to the network consist of two stacks of snippet images of stars: one around the star that is expected to be occulted, and one around a reference star used for comparison. We need the reference star to distinguish between a true occultation and artefacts introduced by poor atmospheric conditions. Our Occultation Detection Neural Network (ODNet) can analyze three sequences of stars per second with 91% precision and 87% recall. The algorithm is sufficiently fast and robust that we can envision incorporating it onboard eVscopes to deliver real-time results. We conclude that citizen science represents an important opportunity for future studies and discoveries of occultations, and that the application of artificial intelligence will permit us to take better advantage of the ever-growing quantity of data to categorize asteroids.
    Two is Better than Many? Binary Classification as an Effective Approach to Multi-Choice Question Answering. (arXiv:2210.16495v1 [cs.CL])
    We propose a simple refactoring of multi-choice question answering (MCQA) tasks as a series of binary classifications. The MCQA task is generally performed by scoring each (question, answer) pair normalized over all the pairs, and then selecting the answer from the pair that yields the highest score. For n answer choices, this is equivalent to an n-class classification setup where only one class (true answer) is correct. We instead show that classifying (question, true answer) as positive instances and (question, false answer) as negative instances is significantly more effective across various models and datasets. We show the efficacy of our proposed approach in different tasks -- abductive reasoning, commonsense question answering, science question answering, and sentence completion. Our DeBERTa binary classification model reaches the top or close to the top performance on public leaderboards for these tasks. The source code of the proposed approach is available at https://github.com/declare-lab/TEAM.
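    The refactoring itself is only a few lines; the sketch below assumes a hypothetical score_fn standing in for any trained binary classifier over (question, answer) pairs.

        def to_binary_examples(question, choices, answer_idx):
            """Each (question, choice) pair becomes one binary example."""
            return [((question, c), int(i == answer_idx))
                    for i, c in enumerate(choices)]

        def predict(question, choices, score_fn):
            """Pick the choice with the highest positive-class score."""
            scores = [score_fn(question, c) for c in choices]
            return max(range(len(choices)), key=scores.__getitem__)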
    Auxo: Heterogeneity-Mitigating Federated Learning via Scalable Client Clustering. (arXiv:2210.16656v1 [cs.LG])
    Federated learning (FL) is an emerging machine learning (ML) paradigm that enables heterogeneous edge devices to collaboratively train ML models without revealing their raw data to a logically centralized server. Heterogeneity across participants is a fundamental challenge in FL, both in terms of non-independent and identically distributed (Non-IID) data distributions and variations in device capabilities. Many existing works present point solutions to address issues like slow convergence, low final accuracy, and bias in FL, all stemming from the client heterogeneity. We observe that, in a large population, there exist groups of clients with statistically similar data distributions (cohorts). In this paper, we propose Auxo to gradually identify cohorts among large-scale, low-participation, and resource-constrained FL populations. Auxo then adaptively determines how to train cohort-specific models in order to achieve better model performance and ensure resource efficiency. By identifying cohorts with smaller heterogeneity and performing efficient cohort-based training, our extensive evaluations show that Auxo substantially boosts the state-of-the-art solutions in terms of final accuracy, convergence time, and model bias.
    Online Convex Optimization with Long Term Constraints for Predictable Sequences. (arXiv:2210.16735v1 [cs.LG])
    In this paper, we investigate the framework of Online Convex Optimization (OCO) for online learning, which offers a very powerful framework for many applications. In this context, we study a specific setting of OCO called OCO with long-term constraints. Long-term constraints are typically introduced as an alternative to reduce the complexity of the projection at every update step in online optimization. While many algorithmic advances have been made towards online optimization with long-term constraints, these algorithms typically assume that the sequence of cost functions over a certain $T$ finite steps that determine the cost to the online learner is adversarially generated. In many circumstances, however, the sequence of cost functions may be correlated, and thus predictable from those observed up to a given point in time. In this paper, we study the setting where the sequences are predictable. We present a novel online optimization algorithm for online optimization with long-term constraints that can leverage such predictability. We show that, with a predictor that can supply the gradient information of the next function in the sequence, our algorithm can achieve an overall regret and constraint violation rate that is strictly less than the rate achievable without prediction.
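    As a rough sketch of how the predictor can be used, the update below takes an optimistic step along the predicted gradient of the next cost, with a soft penalty on the long-term constraint g(x) <= 0; this is our simplification, not the paper's exact algorithm.

        import numpy as np

        def optimistic_step(x, grad_pred, g_val, g_grad, eta=0.05, lam=1.0):
            """grad_pred: predicted gradient of the upcoming cost at x;
            g_val, g_grad: value and gradient of the constraint g(x) <= 0."""
            violation = max(g_val, 0.0)          # penalize only when g(x) > 0
            return x - eta * (grad_pred + lam * violation * g_grad)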
    Decentralized adaptive clustering of deep nets is beneficial for client collaboration. (arXiv:2206.08839v2 [cs.LG] UPDATED)
    We study the problem of training personalized deep learning models in a decentralized peer-to-peer setting, focusing on the setting where data distributions differ between the clients and where different clients have different local learning tasks. We study both covariate and label shift, and our contribution is an algorithm which for each client finds beneficial collaborations based on a similarity estimate for the local task. Our method does not rely on hyperparameters which are hard to estimate, such as the number of client clusters, but rather continuously adapts to the network topology using soft cluster assignment based on a novel adaptive gossip algorithm. We test the proposed method in various settings where data is not independent and identically distributed among the clients. The experimental evaluation shows that the proposed method performs better than previous state-of-the-art algorithms for this problem setting, and handles situations well where previous methods fail.
    Towards Intercultural Affect Recognition: Audio-Visual Affect Recognition in the Wild Across Six Cultures. (arXiv:2208.00344v3 [cs.CV] UPDATED)
    In our multicultural world, affect-aware AI systems that support humans need the ability to perceive affect across variations in emotion expression patterns across cultures. These systems must perform well in cultural contexts without annotated affect datasets available for training models. A standard assumption in affective computing is that affect recognition models trained and used within the same culture (intracultural) will perform better than models trained on one culture and used on different cultures (intercultural). We test this assumption and present the first systematic study of intercultural affect recognition models using videos of real-world dyadic interactions from six cultures. We develop an attention-based feature selection approach under temporal causal discovery to identify behavioral cues that can be leveraged in intercultural affect recognition models. Across all six cultures, our findings demonstrate that intercultural affect recognition models were as effective or more effective than intracultural models. We identify and contribute useful behavioral features for intercultural affect recognition; facial features from the visual modality were more useful than the audio modality in this study's context. Our paper presents a proof-of-concept and motivation for the future development of intercultural affect recognition systems, especially those deployed in low-resource situations without annotated data.
    STGC-GNNs: A GNN-based traffic prediction framework with a spatial-temporal Granger causality graph. (arXiv:2210.16789v1 [cs.LG])
    The key to traffic prediction is to accurately depict the temporal dynamics of traffic flow traveling through a road network, so it is important to model the spatial dependence of the road network. The essence of spatial dependence is to accurately describe how traffic information transmission is affected by other nodes in the road network, and the GNN-based traffic prediction model, as a benchmark for traffic prediction, has become the most common method owing to its ability to model spatial dependence by transmitting traffic information with the message passing mechanism. However, existing methods model a local and static spatial dependence, which cannot transmit the global-dynamic traffic information (GDTi) required for long-term prediction. The challenge is the difficulty of detecting the precise transmission of GDTi due to the uncertainty of individual transport, especially for long-term transmission. In this paper, we propose a new hypothesis: GDTi behaves macroscopically as a transmitting causal relationship (TCR) underlying traffic flow, which remains stable under dynamically changing traffic flow. We further propose spatial-temporal Granger causality (STGC) to express TCR, which models global and dynamic spatial dependence. To model global transmission, we model the causal order and causal lag of TCR's global transmission by a spatial-temporal alignment algorithm. To capture dynamic spatial dependence, we approximate the stable TCR underlying dynamic traffic flow by a Granger causality test. Experimental results on three backbone models show that using STGC to model spatial dependence yields better results than the original models for 45 min and 1 h long-term prediction.
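    For the causality test itself, a standard off-the-shelf implementation suffices; the sketch below screens a candidate causal edge between the flow series of two nodes using statsmodels. The lag depth and significance level are assumptions, and this is only the pairwise test, not the paper's full spatial-temporal alignment.

        import numpy as np
        from statsmodels.tsa.stattools import grangercausalitytests

        def granger_edge(flow_i, flow_j, max_lag=4, alpha=0.05):
            """Test whether flow at node j Granger-causes flow at node i."""
            data = np.column_stack([flow_i, flow_j])  # column 2 -> column 1
            res = grangercausalitytests(data, maxlag=max_lag, verbose=False)
            p_min = min(res[lag][0]["ssr_ftest"][1]   # p-value per lag
                        for lag in range(1, max_lag + 1))
            return p_min < alpha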
    NIDA-CLIFGAN: Natural Infrastructure Damage Assessment through Efficient Classification Combining Contrastive Learning, Information Fusion and Generative Adversarial Networks. (arXiv:2110.14518v2 [cs.LG] UPDATED)
    During natural disasters, aircraft and satellites are used to survey the impacted regions. Usually, human experts are needed to manually label the degree of building damage so that proper humanitarian assistance and disaster response (HADR) can be achieved, which is labor-intensive and time-consuming. Waiting for human labeling of major disasters over a wide area gravely slows down HADR efforts. It is thus of crucial interest to take advantage of cutting-edge Artificial Intelligence and Machine Learning techniques to speed up the natural infrastructure damage assessment process and achieve effective HADR. Accordingly, this paper demonstrates a systematic effort to achieve efficient building damage classification. First, two novel generative adversarial nets (GANs) are designed to augment the data used to train the deep-learning-based classifier. Second, a contrastive learning based method using novel data structures is developed to achieve great performance. Third, by using information fusion, the classifier is effectively trained with very few training data samples for transfer learning. All the classifiers are small enough to be loaded on a smartphone or simple laptop for first responders. Based on the available overhead imagery dataset, results demonstrate data and computational efficiency: with 10% of the collected data combined with a GAN, the computation time is reduced from roughly half a day to about 1 hour with roughly similar classification performance.
    FRAMED: An AutoML Approach for Structural Performance Prediction of Bicycle Frames. (arXiv:2201.10459v2 [cs.LG] UPDATED)
    This paper presents a data-driven analysis of the structural performance of 4500 community-designed bicycle frames. We introduce FRAMED -- a parametric dataset of bicycle frames based on bicycles designed by bicycle practitioners from across the world. To support our data-driven approach, we also provide a dataset of structural performance values such as weight, displacements under load, and safety factors for all the bicycle frame designs. Our structural simulations are validated against results from physical experiments on real bicycle frames. By exploring a diverse design space of frame design parameters and a set of ten competing design objectives, we present a data-driven approach to analyze the structural performance of bicycle frames. Through our analysis, we highlight overall trends in bicycle frame designs created by community members and study several bicycle frames under different loading conditions. We then undertake a systematic search for optimal performance and feasibility-predictive Machine Learning models, applying a state-of-the-art Automated Machine Learning framework. We demonstrate that the proposed AutoML models outperform commonly used models such as Neural Networks and XGBoost, which we tune using Bayesian hyperparameter optimization. This work aims to simultaneously serve researchers focusing on bicycle design as well as researchers focusing on the development of data-driven design algorithms, such as surrogate models and Deep Generative Models. The dataset and code are provided at this http URL .
    FairVFL: A Fair Vertical Federated Learning Framework with Contrastive Adversarial Learning. (arXiv:2206.03200v2 [cs.LG] UPDATED)
    Vertical federated learning (VFL) is a privacy-preserving machine learning paradigm that can learn models from features distributed on different platforms in a privacy-preserving way. Since in real-world applications the data may contain bias on fairness-sensitive features (e.g., gender), VFL models may inherit bias from training data and become unfair for some user groups. However, existing fair machine learning methods usually rely on the centralized storage of fairness-sensitive features to achieve model fairness, which are usually inapplicable in federated scenarios. In this paper, we propose a fair vertical federated learning framework (FairVFL), which can improve the fairness of VFL models. The core idea of FairVFL is to learn unified and fair representations of samples based on the decentralized feature fields in a privacy-preserving way. Specifically, each platform with fairness-insensitive features first learns local data representations from local features. Then, these local representations are uploaded to a server and aggregated into a unified representation for the target task. In order to learn a fair unified representation, we send it to each platform storing fairness-sensitive features and apply adversarial learning to remove bias from the unified representation inherited from the biased data. Moreover, for protecting user privacy, we further propose a contrastive adversarial learning method to remove private information from the unified representation in server before sending it to the platforms keeping fairness-sensitive features. Experiments on three real-world datasets validate that our method can effectively improve model fairness with user privacy well-protected.
    Symmetric Saliency-based Adversarial Attack To Speaker Identification. (arXiv:2210.16777v1 [cs.SD])
    Existing adversarial attack approaches to speaker identification either need high computational cost or are not very effective, to our knowledge. To address this issue, in this paper, we propose a novel generation-network-based approach, called symmetric saliency-based encoder-decoder (SSED), to generate adversarial voice examples against speaker identification. It contains two novel components. First, it uses a novel saliency map decoder to learn the importance of speech samples to the decision of a targeted speaker identification system, so as to make the attacker focus on generating artificial noise to the important samples. Second, it proposes an angular loss function to push the speaker embedding far away from the source speaker. Our experimental results demonstrate that the proposed SSED yields state-of-the-art performance, i.e., over 97% targeted attack success rate and a signal-to-noise level of over 39 dB on both open-set and close-set speaker identification tasks, with a low computational cost.
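    In the spirit of the angular loss described (the exact form is our assumption), one can minimize the cosine similarity between the adversarial utterance's speaker embedding and the source speaker's embedding:

        import numpy as np

        def angular_loss(emb_adv, emb_src):
            """Lower is better: pushes the adversarial embedding away from
            the source speaker's embedding on the unit hypersphere."""
            denom = np.linalg.norm(emb_adv) * np.linalg.norm(emb_src) + 1e-12
            return np.dot(emb_adv, emb_src) / denom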
    A simple, efficient and scalable contrastive masked autoencoder for learning visual representations. (arXiv:2210.16870v1 [cs.CV])
    We introduce CAN, a simple, efficient and scalable method for self-supervised learning of visual representations. Our framework is a minimal and conceptually clean synthesis of (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach used in diffusion models. The learning mechanisms are complementary to one another: contrastive learning shapes the embedding space across a batch of image samples; masked autoencoders focus on reconstruction of the low-frequency spatial correlations in a single image sample; and noise prediction encourages the reconstruction of the high-frequency components of an image. The combined approach results in a robust, scalable and simple-to-implement algorithm. The training process is symmetric, with 50% of patches in both views being masked at random, yielding a considerable efficiency improvement over prior contrastive learning methods. Extensive empirical studies demonstrate that CAN achieves strong downstream performance under both linear and finetuning evaluations on transfer learning and robustness tasks. CAN outperforms MAE and SimCLR when pre-training on ImageNet, but is especially useful for pre-training on larger uncurated datasets such as JFT-300M: for linear probe on ImageNet, CAN achieves 75.4% compared to 73.4% for SimCLR and 64.1% for MAE. The finetuned performance on ImageNet of our ViT-L model is 86.1%, compared to 85.5% for SimCLR, and 85.4% for MAE. The overall FLOPs load of SimCLR is 70% higher than CAN for ViT-L models.
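    Schematically, the symmetric 50% masking and the three-term objective look like the sketch below; the loss weights are placeholders, and the contrastive, reconstruction, and noise-prediction terms are assumed to be computed elsewhere.

        import numpy as np

        def mask_patches(patches, mask_ratio=0.5, seed=0):
            """patches: (num_patches, dim). Returns visible patches and the mask."""
            rng = np.random.default_rng(seed)
            n = len(patches)
            masked = np.zeros(n, dtype=bool)
            masked[rng.permutation(n)[: int(n * mask_ratio)]] = True
            return patches[~masked], masked

        def can_loss(l_contrastive, l_reconstruction, l_noise, w=(1.0, 1.0, 1.0)):
            """Weighted sum of the (C), (A), and (N) objectives."""
            return w[0] * l_contrastive + w[1] * l_reconstruction + w[2] * l_noise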
    Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses. (arXiv:2106.09779v6 [cs.LG] UPDATED)
    This paper studies federated learning (FL) -- especially cross-silo FL -- with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo $i$'s communications to satisfy record-level differential privacy (DP). ISRL-DP ensures that the data of each person in silo $i$ cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.
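    Inside a silo, the record-level guarantee can be sketched with the usual clip-and-noise recipe; the noise calibration below is schematic and stands in for the paper's actual privacy accounting.

        import numpy as np

        def private_silo_update(per_record_grads, clip=1.0, sigma=1.0, seed=0):
            """Clip each record's gradient, average, then add Gaussian noise."""
            rng = np.random.default_rng(seed)
            clipped = [g * min(1.0, clip / (np.linalg.norm(g) + 1e-12))
                       for g in per_record_grads]
            mean = np.mean(clipped, axis=0)
            scale = sigma * clip / len(per_record_grads)  # schematic calibration
            return mean + rng.normal(0.0, scale, size=mean.shape)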
    sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. (arXiv:2108.10566v3 [cs.LG] UPDATED)
    Multiclass multilabel classification is the task of attributing multiple labels to examples via predictions. Current models formulate a reduction of the multilabel setting into either multiple binary classifications or multiclass classification, allowing for the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). Multilabel classification reductions do not accommodate the prediction of varying numbers of labels per example, and the underlying losses are distant estimates of the performance metrics. We propose a loss function, sigmoidF1, which is an approximation of the F1 score that (1) is smooth and tractable for stochastic gradient descent, (2) naturally approximates a multilabel metric, and (3) estimates label propensities and label counts. We show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. sigmoidF1 outperforms other loss functions on one text and two image datasets over several metrics. These results show the effectiveness of using inference-time metrics as loss functions for non-trivial classification problems like multilabel classification.
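    Following the description in the abstract, a sketch of the surrogate softens the confusion-matrix counts with a tunable sigmoid and takes 1 - F1 as the loss; beta and eta below are the shape and offset parameters.

        import numpy as np

        def sigmoid_f1_loss(logits, y, beta=1.0, eta=0.0):
            """logits, y: (batch, num_labels); y is a 0/1 multilabel matrix."""
            s = 1.0 / (1.0 + np.exp(-beta * (logits + eta)))  # soft predictions
            tp = (s * y).sum()
            fp = (s * (1 - y)).sum()
            fn = ((1 - s) * y).sum()
            f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)
            return 1.0 - f1                                   # smooth surrogate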
    Meta-Learning Biologically Plausible Plasticity Rules with Random Feedback Pathways. (arXiv:2210.16414v1 [q-bio.NC])
    Backpropagation is widely used to train artificial neural networks, but its relationship to synaptic plasticity in the brain is unknown. Some biological models of backpropagation rely on feedback projections that are symmetric with feedforward connections, but experiments do not corroborate the existence of such symmetric backward connectivity. Random feedback alignment offers an alternative model in which errors are propagated backward through fixed, random backward connections. This approach successfully trains shallow models, but learns slowly and does not perform well with deeper models or online learning. In this study, we develop a novel meta-plasticity approach to discover interpretable, biologically plausible plasticity rules that improve online learning performance with fixed random feedback connections. The resulting plasticity rules show improved online training of deep models in the low data regime. Our results highlight the potential of meta-plasticity to discover effective, interpretable learning rules satisfying biological constraints.
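    For reference, the fixed-random-feedback baseline that the meta-learned plasticity rules improve upon can be sketched in a few lines; the layer sizes and learning rate here are arbitrary choices.

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_hid, n_out = 8, 16, 4
        W1 = rng.normal(size=(n_hid, n_in))
        W2 = rng.normal(size=(n_out, n_hid))
        B = rng.normal(size=(n_hid, n_out))     # fixed random feedback pathway

        def fa_step(x, y, lr=0.01):
            """One feedback-alignment update: errors flow through B, not W2.T."""
            global W1, W2
            h = np.tanh(W1 @ x)
            e = W2 @ h - y                      # output error
            dh = (B @ e) * (1 - h ** 2)         # random feedback, tanh derivative
            W2 -= lr * np.outer(e, h)
            W1 -= lr * np.outer(dh, x)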
    Parameter-Efficient Tuning Makes a Good Classification Head. (arXiv:2210.16771v1 [cs.CL])
    In recent years, pretrained models revolutionized the paradigm of natural language understanding (NLU), where we append a randomly initialized classification head after the pretrained backbone, e.g. BERT, and finetune the whole model. As the pretrained backbone makes a major contribution to the improvement, we naturally expect that a good pretrained classification head can also benefit the training. However, the final-layer output of the backbone, i.e. the input of the classification head, will change greatly during finetuning, making the usual head-only pretraining (LP-FT) ineffective. In this paper, we find that parameter-efficient tuning makes a good classification head, with which we can simply replace the randomly initialized heads for a stable performance gain. Our experiments demonstrate that the classification head jointly pretrained with parameter-efficient tuning consistently improves the performance on 9 tasks in GLUE and SuperGLUE.
    Online Learning for Predictive Control with Provable Regret Guarantees. (arXiv:2111.15041v3 [cs.LG] UPDATED)
    We study the problem of online learning in predictive control of an unknown linear dynamical system with time-varying cost functions which are unknown a priori. Specifically, we study the online learning problem where the control algorithm does not know the true system model and has only access to a fixed-length (that does not grow with the control horizon) preview of the future cost functions. The goal of the online algorithm is to minimize the dynamic regret, defined as the difference between the cumulative cost incurred by the algorithm and that of the best sequence of actions in hindsight. We propose two different online Model Predictive Control (MPC) algorithms to address this problem, namely Certainty Equivalence MPC (CE-MPC) algorithm and Optimistic MPC (O-MPC) algorithm. We show that under the standard stability assumption for the model estimate, the CE-MPC algorithm achieves $\mathcal{O}(T^{2/3})$ dynamic regret. We then extend this result to the setting where the stability assumption holds only for the true system model by proposing the O-MPC algorithm. We show that the O-MPC algorithm also achieves $\mathcal{O}(T^{2/3})$ dynamic regret, at the cost of some additional computation. We also present numerical studies to demonstrate the performance of our algorithm.
    A Compressed Sensing Based Least Squares Approach to Semi-supervised Local Cluster Extraction. (arXiv:2202.02904v2 [cs.LG] UPDATED)
    A least squares semi-supervised local clustering algorithm based on the idea of compressed sensing is proposed to extract clusters from a graph with known adjacency matrix. The algorithm is based on a two-stage approach similar to the one in Lai and McKenzie (2020). However, under a weaker assumption and with less computational complexity than that work, the algorithm is shown to be able to find a desired cluster with high probability. The "one cluster at a time" feature of our method distinguishes it from other global clustering methods. Several numerical experiments are conducted on synthetic data, such as the stochastic block model, and real data, such as MNIST, the political blogs network, and the AT&T and YaleB human face datasets, to demonstrate the effectiveness and efficiency of our algorithm.  ( 2 min )
    Attention Swin U-Net: Cross-Contextual Attention Mechanism for Skin Lesion Segmentation. (arXiv:2210.16898v1 [eess.IV])
    Melanoma is caused by the abnormal growth of melanocytes in human skin. Like other cancers, this life-threatening skin cancer can be treated with early diagnosis. To support a diagnosis by automatic skin lesion segmentation, several Fully Convolutional Network (FCN) approaches, specifically the U-Net architecture, have been proposed. The U-Net model with a symmetrical architecture has exhibited superior performance in the segmentation task. However, the locality restriction of the convolutional operation incorporated in the U-Net architecture limits its performance in capturing long-range dependency, which is crucial for the segmentation task in medical images. To address this limitation, recently a Transformer based U-Net architecture that replaces the CNN blocks with the Swin Transformer module has been proposed to capture both local and global representation. In this paper, we propose Att-SwinU-Net, an attention-based Swin U-Net extension, for medical image segmentation. In our design, we seek to enhance the feature re-usability of the network by carefully designing the skip connection path. We argue that the classical concatenation operation utilized in the skip connection path can be further improved by incorporating an attention mechanism. By performing a comprehensive ablation study on several skin lesion segmentation datasets, we demonstrate the effectiveness of our proposed attention mechanism.  ( 2 min )
    BERTops: Studying BERT Representations under a Topological Lens. (arXiv:2205.00953v2 [cs.LG] UPDATED)
    Proposing scoring functions to effectively understand, analyze and learn various properties of high dimensional hidden representations of large-scale transformer models like BERT can be a challenging task. In this work, we explore a new direction by studying the topological features of BERT hidden representations using persistent homology (PH). We propose a novel scoring function named "persistence scoring function (PSF)" which: (i) accurately captures the homology of the high-dimensional hidden representations and correlates well with the test set accuracy of a wide range of datasets and outperforms existing scoring metrics, (ii) captures interesting post fine-tuning "per-class" level properties from both qualitative and quantitative viewpoints, (iii) is more stable to perturbations as compared to the baseline functions, which makes it a very robust proxy, and (iv) finally, also serves as a predictor of the attack success rates for a wide category of black-box and white-box adversarial attack methods. Our extensive correlation experiments demonstrate the practical utility of PSF on various NLP tasks relevant to BERT.
    An Approach for Noisy, Crowdsourced Datasets Utilizing Ensemble Modeling, 'Human Softmax' Distributions, and Entropic Measures of Uncertainty. (arXiv:2210.16380v1 [cs.CV])
    Noisy, crowdsourced image datasets prove challenging, even for the best neural networks. Two issues which complicate classification on such datasets are class imbalance and ground-truth uncertainty in labeling. The AL-ALL and AL-PUB datasets, consisting of tightly cropped, individual characters from images of ancient Greek papyri, are strongly affected by both issues. The application of ensemble modeling to such a dataset can help identify images where the ground-truth is questionable and quantify the trustworthiness of those samples. We apply stacked generalization consisting of nearly identical ResNets: one utilizing cross-entropy (CXE) and the other Kullback-Leibler divergence (KLD). The CXE network uses standard labeling drawn from the crowdsourced consensus. In contrast, the KLD network uses probabilistic labeling for each image derived from the distribution of crowdsourced annotations. We refer to this labeling as the Human Softmax (HSM) distribution. For our ensemble model, we apply a k-nearest neighbors model to the outputs of the CXE and KLD networks. Individually, the ResNet models have approximately 93% accuracy, while the ensemble model achieves an accuracy of >95%. We also perform an analysis of the Shannon entropy of the various models' output distributions to measure classification uncertainty.
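    The HSM target and the KLD objective are simple to write down; the sketch below assumes per-image annotation counts are available (the counts shown are made up).

        import numpy as np

        def human_softmax(counts):
            """Normalize crowd annotation counts into a target distribution."""
            counts = np.asarray(counts, dtype=float)
            return counts / counts.sum()

        def kl_loss(pred_probs, hsm, eps=1e-12):
            """KL(HSM || model prediction), used to train the KLD network."""
            return np.sum(hsm * (np.log(hsm + eps) - np.log(pred_probs + eps)))

        hsm = human_softmax([7, 2, 1])  # 10 annotators split across 3 classes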
    MinUn: Accurate ML Inference on Microcontrollers. (arXiv:2210.16556v1 [cs.LG])
    Running machine learning inference on tiny devices, known as TinyML, is an emerging research area. This task requires generating inference code that uses memory frugally, a task that standard ML frameworks are ill-suited for. A deployment framework for TinyML must be a) parametric in the number representation to take advantage of the emerging representations like posits, b) carefully assign high-precision to a few tensors so that most tensors can be kept in low-precision while still maintaining model accuracy, and c) avoid memory fragmentation. We describe MinUn, the first TinyML framework that holistically addresses these issues to generate efficient code for ARM microcontrollers (e.g., Arduino Uno, Due and STM32H747) that outperforms the prior TinyML frameworks.
    Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction. (arXiv:2110.01661v5 [cs.CL] UPDATED)
    Iterating with new and improved OCR solutions requires deciding which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. They are crucial in order to guarantee low computational overhead and reduced quality degradation risks, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology with respect to text-block-level quality assessment. Through an extension of this technique, a regression model that takes into account the enhancement potential of a new OCR engine is also presented. Both mark promising approaches, especially for cultural institutions dealing with historical data of lower quality.
    Thinking Outside the Ball: Optimal Learning with Gradient Descent for Generalized Linear Stochastic Convex Optimization. (arXiv:2202.13328v2 [cs.LG] UPDATED)
    We consider linear prediction with a convex Lipschitz loss, or more generally, stochastic convex optimization problems of generalized linear form, i.e.~where each instantaneous loss is a scalar convex function of a linear function. We show that in this setting, early stopped Gradient Descent (GD), without any explicit regularization or projection, ensures excess error at most $\epsilon$ (compared to the best possible with unit Euclidean norm) with an optimal, up to logarithmic factors, sample complexity of $\tilde{O}(1/\epsilon^2)$ and only $\tilde{O}(1/\epsilon^2)$ iterations. This contrasts with general stochastic convex optimization, where $\Omega(1/\epsilon^4)$ iterations are needed (Amir et al. [2021b]). The lower iteration complexity is ensured by leveraging uniform convergence rather than stability. But instead of uniform convergence in a norm ball, which we show can guarantee suboptimal learning using $\Theta(1/\epsilon^4)$ samples, we rely on uniform convergence in a distribution-dependent ball.
    Neural Copula: A unified framework for estimating generic high-dimensional Copula functions. (arXiv:2205.15031v2 [cs.LG] UPDATED)
    The Copula is widely used to describe the relationship between the marginal distribution and joint distribution of random variables. The estimation of high-dimensional Copula is difficult, and most existing solutions rely either on simplified assumptions or on complicated recursive decompositions. A generic Copula estimation method offering both universality and simplicity therefore remains desirable. To reach this goal, a novel neural network-based method (named Neural Copula) is proposed in this paper. In this method, a hierarchical unsupervised neural network is constructed to estimate the marginal distribution function and the Copula function by solving differential equations. During training, various constraints are imposed on both the neural network and its derivatives. The Copula estimated by the proposed method is smooth and has an analytic expression. The effectiveness of the proposed method is evaluated on both real-world datasets and complex numerical simulations. Experimental results show that Neural Copula's fitting quality for complex distributions is much better than that of classical methods. The relevant code for the experiments is available on GitHub. (We encourage the reader to run the program for a better understanding of the proposed method.)
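    A minimal sketch of one ingredient described here, obtaining a copula density as the mixed derivative of a network via autodiff, under the assumption of a bivariate copula and a generic MLP; the paper's actual constraints and training procedure are not reproduced.

        # Sketch: copula density c(u, v) = d^2 C / du dv by double autograd.
        import torch

        net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
        uv = torch.rand(64, 2, requires_grad=True)

        C = net(uv)
        dC = torch.autograd.grad(C.sum(), uv, create_graph=True)[0]              # [dC/du, dC/dv]
        c = torch.autograd.grad(dC[:, 0].sum(), uv, create_graph=True)[0][:, 1]  # mixed partial

        # Derivative constraints (e.g., a non-negative density) can then enter the loss:
        penalty = (torch.clamp(-c, min=0) ** 2).mean()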
    A pruning method based on the dissimilarity of angle among channels and filters. (arXiv:2210.16504v1 [cs.CV])
    Convolutional Neural Networks (CNNs) are used ever more widely in various fields, and their computational and memory demands are increasing significantly. To make them applicable under limited conditions such as embedded applications, network compression has emerged; among its techniques, researchers pay particular attention to network pruning. In this paper, we encode the convolutional network to obtain the similarity of different encoding nodes, and evaluate the connectivity-power among convolutional kernels on the basis of this similarity. We then impose different levels of penalty according to connectivity-power. Meanwhile, we propose Channel Pruning based on the Dissimilarity of Angle (DACP). First, we train a sparse model with a GL penalty, and impose an angle-dissimilarity constraint on the channels and filters of the convolutional network to obtain a sparser structure. Finally, the effectiveness of our method is demonstrated in the experiments section. On CIFAR-10, we reduce FLOPs on VGG-16 by 66.86% with 93.31% accuracy after pruning, where FLOPs denotes the model's number of floating-point operations. Moreover, on ResNet-32, we reduce FLOPs by 58.46%, with accuracy after pruning reaching 91.76%.
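    As a rough sketch of the angle-dissimilarity idea (our reading, not the exact DACP criterion), pairwise angles between flattened filters can be computed as follows; the weight tensor and the selection rule are assumptions.

        # Sketch: pairwise angular dissimilarity between output channels of a conv layer.
        import torch
        import torch.nn.functional as F

        weight = torch.randn(64, 32, 3, 3)                  # stand-in [out_ch, in_ch, k, k] weights
        flat = weight.flatten(1)                            # one vector per output channel
        cos = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
        angle = torch.acos(cos.clamp(-1 + 1e-6, 1 - 1e-6))  # pairwise angles in radians

        # Channels with a small mean angle to the others are most redundant under this reading.
        prune_candidates = angle.mean(dim=1).argsort()[:8]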
    Self-supervised predictive coding and multimodal fusion advance patient deterioration prediction in fine-grained time resolution. (arXiv:2210.16598v1 [cs.LG])
    In the Emergency Department (ED), accurate prediction of critical events using Electronic Health Records (EHR) allows timely intervention and effective resource allocation. Though many studies have suggested automatic prediction methods, their coarse-grained time resolutions limit their practical usage. Therefore, in this study, we propose an hourly prediction method of critical events in ED, i.e., mortality and vasopressor need. Through extensive experiments, we show that both 1) bi-modal fusion between EHR text and time-series data and 2) self-supervised predictive regularization using L2 loss between normalized context vector and EHR future time-series data improve predictive performance, especially the far-future prediction. Our uni-modal/bi-modal/bi-modal self-supervision scored 0.846/0.877/0.897 (0.824/0.855/0.886) and 0.817/0.820/0.858 (0.807/0.810/0.855) with mortality (far-future mortality) and with vasopressor need (far-future vasopressor need) prediction data in AUROC, respectively.
    Time-rEversed diffusioN tEnsor Transformer: A new TENET of Few-Shot Object Detection. (arXiv:2210.16897v1 [cs.CV])
    In this paper, we tackle the challenging problem of Few-shot Object Detection (FSOD). Existing FSOD pipelines (i) use average-pooled representations that result in information loss; and/or (ii) discard position information that can help detect object instances. Consequently, such pipelines are sensitive to large intra-class appearance and geometric variations between support and query images. To address these drawbacks, we propose a Time-rEversed diffusioN tEnsor Transformer (TENET), which (i) forms high-order tensor representations that capture multi-way feature occurrences that are highly discriminative, and (ii) uses a transformer that dynamically extracts correlations between the query image and the entire support set, instead of a single average-pooled support embedding. We also propose a Transformer Relation Head (TRH), equipped with higher-order representations, which encodes correlations between query regions and the entire support set, while being sensitive to the positional variability of object instances. Our model achieves state-of-the-art results on PASCAL VOC, FSOD, and COCO.
    A Comparative Study of Graph Neural Networks for Shape Classification in Neuroimaging. (arXiv:2210.16670v1 [cs.CV])
    Graph neural networks have emerged as a promising approach for the analysis of non-Euclidean data such as meshes. In medical imaging, mesh-like data plays an important role for modelling anatomical structures, and shape classification can be used in computer aided diagnosis and disease detection. However, with a plethora of options, the best architectural choices for medical shape analysis using GNNs remain unclear. We conduct a comparative analysis to provide practitioners with an overview of the current state-of-the-art in geometric deep learning for shape classification in neuroimaging. Using biological sex classification as a proof-of-concept task, we find that using FPFH as node features substantially improves GNN performance and generalisation to out-of-distribution data; we compare the performance of three alternative convolutional layers; and we reinforce the importance of data augmentation for graph based learning. We then confirm these results hold for a clinically relevant task, using the classification of Alzheimer's disease.
    Manifold Alignment with Label Information. (arXiv:2210.12774v2 [stat.ML] UPDATED)
    Multi-domain data is becoming increasingly common and presents both challenges and opportunities in the data science community. The integration of distinct data-views can be used for exploratory data analysis, and benefit downstream analysis including machine learning related tasks. With this in mind, we present a novel manifold alignment method called MALI (Manifold alignment with label information) that learns a correspondence between two distinct domains. MALI occupies a middle ground between the more commonly addressed semi-supervised manifold alignment problem, with some known correspondences between the two domains, and the purely unsupervised case, where no known correspondences are provided. To do this, MALI learns the manifold structure in both domains via a diffusion process and then leverages discrete class labels to guide the alignment. By aligning two distinct domains, MALI recovers a pairing and a common representation that reveals related samples in both domains. Additionally, MALI can be used for the transfer learning problem known as domain adaptation. We show that MALI outperforms the current state-of-the-art manifold alignment methods across multiple datasets.
    On the Efficient Implementation of the Matrix Exponentiated Gradient Algorithm for Low-Rank Matrix Optimization. (arXiv:2012.10469v2 [math.OC] UPDATED)
    Convex optimization over the spectrahedron, i.e., the set of all real $n\times n$ positive semidefinite matrices with unit trace, has important applications in machine learning, signal processing and statistics, mainly as a convex relaxation for optimization problems with low-rank matrices. It is also one of the most prominent examples in the theory of first-order methods for convex optimization in which non-Euclidean methods can be significantly preferable to their Euclidean counterparts. In particular, the desirable choice is the Matrix Exponentiated Gradient (MEG) method which is based on the Bregman distance induced by the (negative) von Neumann entropy. Unfortunately, implementing MEG requires a full SVD computation on each iteration, which is not scalable to high-dimensional problems. In this work we propose efficient implementations of MEG, with both deterministic and stochastic gradients, which are tailored for optimization with low-rank matrices, and only use a single low-rank SVD computation on each iteration. We also provide efficiently-computable certificates for the correct convergence of our methods. Mainly, we prove that under a strict complementarity condition, the suggested methods converge from a ``warm-start'' initialization with similar rates to their full-SVD-based counterparts. Finally, we present empirical experiments which both support our theoretical findings and demonstrate the practical appeal of our methods.
    Dynamics-Regulated Kinematic Policy for Egocentric Pose Estimation. (arXiv:2106.05969v3 [cs.CV] UPDATED)
    We propose a method for object-aware 3D egocentric pose estimation that tightly integrates kinematics modeling, dynamics modeling, and scene object information. Unlike prior kinematics or dynamics-based approaches where the two components are used disjointly, we synergize the two approaches via dynamics-regulated training. At each timestep, a kinematic model is used to provide a target pose using video evidence and simulation state. Then, a prelearned dynamics model attempts to mimic the kinematic pose in a physics simulator. By comparing the pose instructed by the kinematic model against the pose generated by the dynamics model, we can use their misalignment to further improve the kinematic model. By factoring in the 6DoF pose of objects (e.g., chairs, boxes) in the scene, we demonstrate for the first time, the ability to estimate physically-plausible 3D human-object interactions using a single wearable camera. We evaluate our egocentric pose estimation method in both controlled laboratory settings and real-world scenarios.
    The eyes and hearts of UAV pilots: observations of physiological responses in real-life scenarios. (arXiv:2210.14910v1 [cs.HC] CROSS LISTED)
    The drone industry is diversifying and the number of pilots is increasing rapidly. In this context, flight schools need adapted tools to train pilots, most importantly with regard to their own awareness of their physiological and cognitive limits. In civil and military aviation, pilots can train themselves on realistic simulators to tune their reactions and reflexes, but also to gather data on their piloting behavior and physiological states. It helps them to improve their performance. Opposed to cockpit scenarios, drone teleoperation is conducted outdoors in the field, thus with only limited potential from desktop simulation training. This work aims to provide a solution to gather pilots' behavior out in the field and help them increase their performance. We combined advanced object detection from a frontal camera with gaze and heart-rate variability measurements. We observed pilots and analyzed their behavior over three flight challenges. We believe this tool can support pilots both in their training and in their regular flight tasks. A demonstration video is available on https://www.youtube.com/watch?v=eePhjd2qNiI
    Multi-view Multi-label Anomaly Network Traffic Classification based on MLP-Mixer Neural Network. (arXiv:2210.16719v1 [cs.LG])
    Network traffic classification is the basis of many network security applications and has attracted considerable attention in the field of cyberspace security. Existing network traffic classification based on convolutional neural networks (CNNs) often emphasizes local patterns of traffic data while ignoring global information associations. In this paper, we propose an MLP-Mixer based multi-view multi-label neural network for network traffic classification. Compared with the existing CNN-based methods, our method adopts the MLP-Mixer structure, which is more in line with the structure of the packet than the conventional convolution operation. In our method, the packet is divided into the packet header and the packet body, which together with the flow features of the packet serve as input from different views. We utilize a multi-label setting to learn different scenarios simultaneously, improving the classification performance by exploiting the correlations between different scenarios. Taking advantage of the above characteristics, we propose an end-to-end network traffic classification method. We conduct experiments on three public datasets, and the experimental results show that our method can achieve superior performance.
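    For readers unfamiliar with the backbone, a standard MLP-Mixer block (token mixing followed by channel mixing) looks roughly like this; the packet/flow-specific input views are assumptions not reproduced here.

        # Sketch: a generic MLP-Mixer block as used in Mixer-style architectures.
        import torch
        import torch.nn as nn

        class MixerBlock(nn.Module):
            def __init__(self, num_tokens, dim, token_hidden=64, channel_hidden=256):
                super().__init__()
                self.norm1 = nn.LayerNorm(dim)
                self.token_mlp = nn.Sequential(
                    nn.Linear(num_tokens, token_hidden), nn.GELU(), nn.Linear(token_hidden, num_tokens))
                self.norm2 = nn.LayerNorm(dim)
                self.channel_mlp = nn.Sequential(
                    nn.Linear(dim, channel_hidden), nn.GELU(), nn.Linear(channel_hidden, dim))

            def forward(self, x):  # x: [batch, tokens, dim]
                x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
                return x + self.channel_mlp(self.norm2(x))

        x = torch.randn(4, 16, 128)          # e.g., 16 packet segments of 128 features each
        print(MixerBlock(16, 128)(x).shape)  # torch.Size([4, 16, 128])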
    Search to Pass Messages for Temporal Knowledge Graph Completion. (arXiv:2210.16740v1 [cs.AI])
    Completing missing facts is a fundamental task for temporal knowledge graphs (TKGs). Recently, graph neural network (GNN) based methods, which can simultaneously explore topological and temporal information, have become the state-of-the-art (SOTA) for completing TKGs. However, these studies are based on hand-designed architectures and fail to explore the diverse topological and temporal properties of TKGs. To address this issue, we propose to use neural architecture search (NAS) to design data-specific message passing architectures for TKG completion. In particular, we develop a generalized framework to explore topological and temporal information in TKGs. Based on this framework, we design an expressive search space to fully capture various properties of different TKGs. Meanwhile, we adopt a search algorithm which trains a supernet structure by sampling a single path, enabling efficient search at low cost. We further conduct extensive experiments on three benchmark datasets. The results show that the architectures searched by our method achieve SOTA performance. Besides, the searched models can also implicitly reveal diverse properties in different TKGs. Our code is released at https://github.com/striderdu/SPA.
    Improved Support Recovery in Universal One-bit Compressed Sensing. (arXiv:2210.16657v1 [cs.IT])
    One-bit compressed sensing (1bCS) is an extremely quantized signal acquisition method that has been proposed and studied rigorously in the past decade. In 1bCS, linear samples of a high-dimensional signal are quantized to only one bit per sample (sign of the measurement). Assuming the original signal vector to be sparse, existing results in 1bCS either aim to find the support of the vector, or approximate the signal allowing a small error. The focus of this paper is support recovery, which often also computationally facilitates approximate signal recovery. A {\em universal} measurement matrix for 1bCS refers to one set of measurements that work for all sparse signals. With universality, it is known that $\tilde{\Theta}(k^2)$ 1bCS measurements are necessary and sufficient for support recovery (where $k$ denotes the sparsity). To improve the dependence on sparsity from quadratic to linear, in this work we propose approximate support recovery (allowing $\epsilon>0$ proportion of errors), and superset recovery (allowing $\epsilon$ proportion of false positives). We show that the first type of recovery is possible with $\tilde{O}(k/\epsilon)$ measurements, while the latter, more challenging, type of recovery is possible with $\tilde{O}(\max\{k/\epsilon,k^{3/2}\})$ measurements. We also show that in both cases $\Omega(k/\epsilon)$ measurements would be necessary for universal recovery. Improved results are possible if we consider universal recovery within a restricted class of signals, such as rational signals, or signals with bounded dynamic range. In both cases superset recovery is possible with only $\tilde{O}(k/\epsilon)$ measurements. Other results on universal but approximate support recovery are also provided in this paper. All of our main recovery algorithms are simple and polynomial-time.
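    The measurement model is simple to state in code; below is a sketch with a naive correlation-based support estimate, which is only a baseline and not one of the paper's algorithms.

        # Sketch: one-bit compressed sensing measurements y = sign(Ax) and a naive support guess.
        import numpy as np

        rng = np.random.default_rng(0)
        n, m, k = 1000, 200, 5                     # dimension, measurements, sparsity
        x = np.zeros(n)
        x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
        A = rng.standard_normal((m, n))
        y = np.sign(A @ x)                         # one bit per linear sample

        scores = np.abs(A.T @ y)                   # correlate sign pattern with each column
        support_estimate = np.sort(np.argsort(scores)[-k:])
        print(support_estimate, np.flatnonzero(x))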
    Computer-aided diagnosis and prediction in brain disorders. (arXiv:2206.14683v2 [cs.LG] UPDATED)
    Computer-aided methods have shown added value for diagnosing and predicting brain disorders and can thus support decision making in clinical care and treatment planning. This chapter will provide insight into the type of methods, their working, their input data - such as cognitive tests, imaging and genetic data - and the types of output they provide. We will focus on specific use cases for diagnosis, i.e. estimating the current 'condition' of the patient, such as early detection and diagnosis of dementia, differential diagnosis of brain tumours, and decision making in stroke. Regarding prediction, i.e. estimation of the future 'condition' of the patient, we will zoom in on use cases such as predicting the disease course in multiple sclerosis and predicting patient outcomes after treatment in brain cancer. Furthermore, based on these use cases, we will assess the current state-of-the-art methodology and highlight current efforts on benchmarking of these methods and the importance of open science therein. Finally, we assess the current clinical impact of computer-aided methods and discuss the required next steps to increase clinical impact.
    Improving Multilayer-Perceptron(MLP)-based Network Anomaly Detection with Birch Clustering on CICIDS-2017 Dataset. (arXiv:2208.09711v2 [cs.CR] UPDATED)
    Machine learning algorithms have been widely used in intrusion detection systems, including Multi-layer Perceptron (MLP). In this study, we propose a two-stage model that combines the Birch clustering algorithm and an MLP classifier to improve the performance of network anomaly multi-classification. In our proposed method, we first apply Birch or K-Means as an unsupervised clustering algorithm to the CICIDS-2017 dataset to pre-group the data. The generated pseudo-label is then added as an additional feature to the training of the MLP-based classifier. The experimental results show that using Birch and K-Means clustering for data pre-grouping can improve intrusion detection system performance. Our method can achieve 99.73% accuracy in multi-classification using Birch clustering, which is better than similar studies using a stand-alone MLP model.
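    A minimal scikit-learn sketch of the two-stage idea follows; it uses synthetic data rather than CICIDS-2017, and the cluster count and network size are assumptions.

        # Sketch: Birch pseudo-labels appended as a feature before MLP classification.
        import numpy as np
        from sklearn.cluster import Birch
        from sklearn.datasets import make_classification
        from sklearn.model_selection import train_test_split
        from sklearn.neural_network import MLPClassifier

        X, y = make_classification(n_samples=2000, n_features=20, n_informative=6, n_classes=3)
        pseudo = Birch(n_clusters=10).fit_predict(X)   # unsupervised pre-grouping
        X_aug = np.column_stack([X, pseudo])           # pseudo-label as an extra feature

        X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
        clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=300).fit(X_tr, y_tr)
        print("accuracy:", clf.score(X_te, y_te))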
    On Cross-Domain Pre-Trained Language Models for Clinical Text Mining: How Do They Perform on Data-Constrained Fine-Tuning?. (arXiv:2210.12770v2 [cs.CL] UPDATED)
    Pre-trained language models (PLMs) have been deployed in many natural language processing (NLP) tasks and in various domains. Language model pre-training from general or mixed domain rich data plus fine-tuning using small amounts of available data in a low resource domain has demonstrated beneficial results. In this work, we question this statement and verify if BERT-based PLMs from the biomedical domain can perform well in clinical text mining tasks via fine-tuning. We test a state-of-the-art model, i.e., Bioformer, which is pre-trained on a large amount of biomedical data from the PubMed corpus. We use a historical n2c2 clinical NLP challenge dataset for fine-tuning its task-adapted version (BioformerApt), and show that its performance is actually very low. We also present our own end-to-end model, TransformerCRF, which is developed using a Transformer and conditional random fields (CRFs) as encoder and decoder. We further create a new variation model by adding a CRF layer on top of the PLM Bioformer (BioformerCRF). We investigate the performance of TransformerCRF on clinical text mining tasks by training from scratch using a limited amount of data, as well as the model BioformerCRF. Experimental evaluation shows that, in a \textit{constrained setting}, all tested models are \textit{far from ideal} regarding extreme low-frequency special token recognition, even though they can achieve relatively high accuracy on overall text tagging. Our models including source codes will be hosted at \url{https://github.com/poethan/TransformerCRF}.
    DeFIX: Detecting and Fixing Failure Scenarios with Reinforcement Learning in Imitation Learning Based Autonomous Driving. (arXiv:2210.16567v1 [cs.RO])
    Safely navigating through an urban environment without violating any traffic rules is a crucial performance target for reliable autonomous driving. In this paper, we present a Reinforcement Learning (RL) based methodology to DEtect and FIX (DeFIX) failures of an Imitation Learning (IL) agent by extracting infraction spots and re-constructing mini-scenarios on these infraction areas to train an RL agent for fixing the shortcomings of the IL approach. DeFIX is a continuous learning framework, where extraction of failure scenarios and training of RL agents are executed in an infinite loop. After each new policy is trained and added to the library of policies, a policy classifier method effectively decides on which policy to activate at each step during the evaluation. It is demonstrated that even with only one RL agent trained on the failure scenarios of an IL agent, the DeFIX method is either competitive with or outperforms state-of-the-art IL and RL based autonomous urban driving benchmarks. We trained and validated our approach on the most challenging map (Town05) of CARLA simulator which involves complex, realistic, and adversarial driving scenarios. The source code is publicly available at https://github.com/data-and-decision-lab/DeFIX
    Track2Vec: fairness music recommendation with a GPU-free customizable-driven framework. (arXiv:2210.16590v1 [cs.IR])
    Recommendation systems have made significant progress in characterizing users' preferences based on their past behaviors. Despite the effectiveness of recommending accurately, there exist several factors that are essential but unexplored for evaluating various facets of recommendation systems, e.g., fairness, diversity, and limited resources. To address these issues, we propose Track2Vec, a GPU-free customizable-driven framework for fairness music recommendation. In order to take both accuracy and fairness into account, our solution consists of three modules: a customized fairness-aware grouping module for modeling different features based on configurable settings, a track representation learning module for learning better user embeddings, and an ensemble module for ranking the recommendation results from the different track representation learning modules. Moreover, inspired by TF-IDF, which has been widely used in natural language processing, we introduce a metric called Miss Rate - Inverse Ground Truth Frequency (MR-ITF) to measure fairness. Extensive experiments demonstrate that our model achieves a 4th place ranking in a GPU-free environment on the leaderboard of the EvalRS @ CIKM 2022 challenge, which is superior to the official baseline by about 200% in terms of the official scores. In addition, the ablation study illustrates the necessity of ensembling each group to acquire both accurate and fair recommendations.
    Medical Codes Prediction from Clinical Notes: From Human Coders to Machines. (arXiv:2210.16850v1 [cs.LG])
    Prediction of medical codes from clinical notes is a practical and essential need for every healthcare delivery organization within current medical systems. Automating annotation will save significant time and excessive effort that human coders spend today. However, the biggest challenge is directly identifying appropriate medical codes from several thousands of high-dimensional codes from unstructured free-text clinical notes. This complex medical codes prediction problem from clinical notes has received substantial interest in the NLP community, and several recent studies have shown the state-of-the-art code prediction results of full-fledged deep learning-based methods. This progress raises the fundamental question of how far automated machine learning systems are from human coders' working performance, as well as the important question of how well current explainability methods apply to advanced neural network models such as transformers. That is, systems must both predict correct codes and present the references in clinical notes that support those predictions, since this level of explainability and accuracy is critical to gaining trust from professional medical coders.
    Perspective Transformation Layer. (arXiv:2201.05706v2 [cs.CV] UPDATED)
    Incorporating geometric transformations that reflect the relative position changes between an observer and an object into computer vision and deep learning models has attracted much attention in recent years. However, the existing proposals mainly focus on the affine transformation, which is insufficient to reflect such geometric position changes. Furthermore, current solutions often apply a neural network module to learn a single transformation matrix, which not only ignores the importance of multi-view analysis but also includes extra training parameters from the module apart from the transformation matrix parameters that increase the model complexity. In this paper, a perspective transformation layer is proposed in the context of deep learning. The proposed layer can learn homography, therefore reflecting the geometric positions between observers and objects. In addition, by directly training its transformation matrices, a single proposed layer can learn an adjustable number of multiple viewpoints without considering module parameters. The experiments and evaluations confirm the superiority of the proposed layer.
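    One plausible reading of such a layer, with the 3x3 homography held directly as a trainable parameter and applied through grid sampling, is sketched below; the exact parameterization and multi-view handling in the paper may differ.

        # Sketch: a learnable homography warp over feature maps.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class PerspectiveLayer(nn.Module):
            def __init__(self):
                super().__init__()
                self.H = nn.Parameter(torch.eye(3))  # directly trained transformation matrix

            def forward(self, x):                    # x: [B, C, H, W]
                b, _, h, w = x.shape
                ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing='ij')
                pts = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1) @ self.H.T  # [H, W, 3]
                grid = pts[..., :2] / pts[..., 2:].clamp(min=1e-6)                   # perspective divide
                return F.grid_sample(x, grid.unsqueeze(0).expand(b, -1, -1, -1), align_corners=True)

        x = torch.randn(2, 3, 32, 32)
        print(PerspectiveLayer()(x).shape)           # torch.Size([2, 3, 32, 32])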
    Joint Sub-component Level Segmentation and Classification for Anomaly Detection within Dual-Energy X-Ray Security Imagery. (arXiv:2210.16453v1 [cs.CV])
    X-ray baggage security screening is in widespread use and crucial to maintaining transport security for threat/anomaly detection tasks. The automatic detection of anomalies concealed within cluttered and complex electronics/electrical items using 2D X-ray imagery has been of primary interest in recent years. We address this task by introducing a joint object sub-component level segmentation and classification strategy using a deep Convolutional Neural Network architecture. The performance is evaluated over a dataset of cluttered X-ray baggage security imagery, consisting of consumer electrical and electronics items, using variants of dual-energy X-ray imagery (pseudo-colour, high, low, and effective-Z). The proposed joint sub-component level segmentation and classification approach achieves a ~99% true positive rate and a ~5% false positive rate for the anomaly detection task.
    Micro and Macro Level Graph Modeling for Graph Variational Auto-Encoders. (arXiv:2210.16844v1 [cs.LG])
    Generative models for graph data are an important research topic in machine learning. Graph data comprise two levels that are typically analyzed separately: node-level properties such as the existence of a link between a pair of nodes, and global aggregate graph-level statistics, such as motif counts. This paper proposes a new multi-level framework that jointly models node-level properties and graph-level statistics, as mutually reinforcing sources of information. We introduce a new micro-macro training objective for graph generation that combines node-level and graph-level losses. We utilize the micro-macro objective to improve graph generation with a GraphVAE, a well-established model based on graph-level latent variables, that provides fast training and generation time for medium-sized graphs. Our experiments show that adding micro-macro modeling to the GraphVAE model improves graph quality scores by up to 2 orders of magnitude on five benchmark datasets, while maintaining the GraphVAE generation speed advantage.
    Imitating Opponent to Win: Adversarial Policy Imitation Learning in Two-player Competitive Games. (arXiv:2210.16915v1 [cs.LG])
    Recent research on vulnerabilities of deep reinforcement learning (RL) has shown that adversarial policies adopted by an adversary agent can influence a target RL agent (victim agent) to perform poorly in a multi-agent environment. In existing studies, adversarial policies are directly trained based on experiences of interacting with the victim agent. There is a key shortcoming of this approach: knowledge derived from historical interactions may not be properly generalized to unexplored policy regions of the victim agent, making the trained adversarial policy significantly less effective. In this work, we design a new effective adversarial policy learning algorithm that overcomes this shortcoming. The core idea of our new algorithm is to create a new imitator to imitate the victim agent's policy while the adversarial policy will be trained not only based on interactions with the victim agent but also based on feedback from the imitator to forecast the victim's intention. By doing so, we can leverage the capability of imitation learning at capturing the underlying characteristics of the victim policy based only on sample trajectories of the victim. Our victim imitation learning model differs from prior models as the environment's dynamics are driven by the adversary's policy and will keep changing during adversarial policy training. We provide a provable bound to guarantee a desired imitating policy when the adversary's policy becomes stable. We further strengthen our adversarial policy learning by making our imitator a stronger version of the victim. Finally, our extensive experiments using four competitive MuJoCo game environments show that our proposed adversarial policy learning algorithm outperforms state-of-the-art algorithms.
    Deep Learning on Edge TPUs. (arXiv:2108.13732v2 [cs.CV] UPDATED)
    Computing at the edge is important in remote settings, however, conventional hardware is not optimized for utilizing deep neural networks. The Google Edge TPU is an emerging hardware accelerator that is cost, power and speed efficient, and is available for prototyping and production purposes. Here, I review the Edge TPU platform, the tasks that have been accomplished using the Edge TPU, and which steps are necessary to deploy a model to the Edge TPU hardware. The Edge TPU is not only capable of tackling common computer vision tasks, but also surpasses other hardware accelerators, especially when the entire model can be deployed to the Edge TPU. Co-embedding the Edge TPU in cameras allows a seamless analysis of primary data. In summary, the Edge TPU is a maturing system that has proven its usability across multiple tasks.
    A Systematic Survey of Molecular Pre-trained Models. (arXiv:2210.16484v1 [cs.LG])
    Obtaining effective molecular representations is at the core of a series of important chemical tasks ranging from property prediction to drug design. So far, deep learning has achieved remarkable success in learning representations for molecules through automated feature learning in a data-driven fashion. However, training deep neural networks from scratch often requires sufficient labeled molecules which are expensive to acquire in real-world scenarios. To alleviate this issue, inspired by the success of the pretrain-then-finetune paradigm in natural language processing, tremendous efforts have been devoted to Molecular Pre-trained Models (MPMs), where neural networks are pre-trained using large-scale unlabeled molecular databases and then fine-tuned for diverse downstream tasks. Despite the prosperity, this field is fast-growing and a systematic roadmap is urgently needed for both methodology advancements and practical applications in the machine learning and scientific communities. To this end, this paper provides a systematic survey of pre-trained models for molecular representations. Firstly, to motivate MPM studies, we highlight the limitations of training deep neural networks for molecular representations. Next, we systematically review recent advances on this topic from several key perspectives including molecular descriptors, encoder architectures, pre-training strategies, and applications. Finally, we identify several challenges and discuss promising future research directions.
    QuEst: Graph Transformer for Quantum Circuit Reliability Estimation. (arXiv:2210.16724v1 [quant-ph])
    Among different quantum algorithms, parameterized quantum circuits (PQCs) for quantum machine learning (QML) show promise on near-term devices. To facilitate QML and PQC research, a recent Python library called TorchQuantum has been released. It can construct, simulate, and train PQCs for machine learning tasks with high speed and convenient debugging support. Besides quantum for ML, we want to draw the community's attention to the reversed direction: ML for quantum. Specifically, the TorchQuantum library also supports using data-driven ML models to solve problems in quantum system research, such as predicting the impact of quantum noise on circuit fidelity and improving the quantum circuit compilation efficiency. This paper presents a case study of the ML for quantum part. Since estimating the noise impact on circuit reliability is an essential step toward understanding and mitigating noise, we propose to leverage classical ML to predict noise impact on circuit fidelity. Inspired by the natural graph representation of quantum circuits, we propose to leverage a graph transformer model to predict the noisy circuit fidelity. We first collect a large dataset with a variety of quantum circuits and obtain their fidelity on noisy simulators and real machines. Then we embed each circuit into a graph with gate and noise properties as node features, and adopt a graph transformer to predict the fidelity. Evaluated on 5 thousand random and algorithm circuits, the graph transformer predictor can provide accurate fidelity estimation with an RMSE of 0.04, outperforming a simple neural network-based model by 0.02 on average. It can achieve 0.99 and 0.95 R$^2$ scores for random and algorithm circuits, respectively. Compared with circuit simulators, the predictor offers over 200X speedup for estimating the fidelity.
    A Pipeline for Analysing Grant Applications. (arXiv:2210.16843v1 [cs.LG])
    Data mining techniques can transform massive amounts of unstructured data into quantitative data that quickly reveal insights, trends, and patterns behind the original data. In this paper, a data mining model is applied to analyse the 2019 grant applications submitted to an Australian Government research funding agency to investigate whether grant schemes successfully identify innovative project proposals, as intended. The grant applications are peer-reviewed research proposals that include specific ``innovation and creativity'' (IC) scores assigned by reviewers. In addition to predicting the IC score for each research proposal, we are particularly interested in understanding the vocabulary of innovative proposals. In order to solve this problem, various data mining models and feature encoding algorithms are studied and explored. As a result, we propose a model with the best performance, a Random Forest (RF) classifier over documents encoded with features denoting the presence or absence of unigrams. Specifically, the unigram terms are encoded by a modified Term Frequency - Inverse Document Frequency (TF-IDF) algorithm, which only implements the IDF part of TF-IDF. Besides the proposed model, this paper also presents a rigorous experimental pipeline for analysing grant applications, and the experimental results demonstrate its feasibility.
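    The described encoding is easy to reproduce in scikit-learn, since binary term counts weighted by IDF reduce TF-IDF to its IDF part; the toy documents and labels below are assumptions.

        # Sketch: binary unigram presence weighted by IDF only, with a Random Forest.
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.feature_extraction.text import TfidfVectorizer

        docs = ["novel sensing method for soil carbon", "routine survey of known species",
                "new quantum approach to imaging", "standard follow-up cohort study"]
        labels = [1, 0, 1, 0]                          # 1 = rated innovative (toy labels)

        vec = TfidfVectorizer(binary=True, norm=None)  # tf is 0/1, so weights equal IDF
        X = vec.fit_transform(docs)
        clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, labels)
        print(clf.predict(vec.transform(["a novel quantum sensing idea"])))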
    PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v4 [cs.LG] UPDATED)
    Networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many networks are signed or directed, or both, there is a lack of unified software packages on graph neural networks (GNNs) specially designed for signed and directed networks. In this paper, we present PyTorch Geometric Signed Directed (PyGSD), a software package which fills this gap. Along the way, we also provide a brief review surveying typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, provide an overview of methods proposed, and evaluate the implemented methods with experiments. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.
    UniCLIP: Unified Framework for Contrastive Language-Image Pre-training. (arXiv:2209.13430v2 [cs.CV] UPDATED)
    Pre-training vision-language models with contrastive objectives has shown promising results that are both scalable to large uncurated datasets and transferable to many downstream applications. Some follow-up works have aimed to improve data efficiency by adding self-supervision terms, but inter-domain (image-text) contrastive loss and intra-domain (image-image) contrastive loss are defined on individual spaces in those works, so many feasible combinations of supervision are overlooked. To overcome this issue, we propose UniCLIP, a Unified framework for Contrastive Language-Image Pre-training. UniCLIP integrates the contrastive loss of both inter-domain pairs and intra-domain pairs into a single universal space. The discrepancies that occur when integrating contrastive loss between different domains are resolved by the three key components of UniCLIP: (1) augmentation-aware feature embedding, (2) MP-NCE loss, and (3) domain dependent similarity measure. UniCLIP outperforms previous vision-language pre-training methods on various single- and multi-modality downstream tasks. In our experiments, we show that each component that comprises UniCLIP contributes well to the final performance.
    Understanding Performance Problems in Deep Learning Systems. (arXiv:2112.01771v2 [cs.SE] UPDATED)
    Deep learning (DL) has been widely applied to many domains. Unique challenges in engineering DL systems are posed by the programming paradigm shift from traditional systems to DL systems, and performance is one of the challenges. Performance problems (PPs) in DL systems can cause severe consequences such as excessive resource consumption and financial loss. While bugs in DL systems have been extensively investigated, PPs in DL systems have hardly been explored. To bridge this gap, we present the first comprehensive study to i) characterize symptoms, root causes, and introducing and exposing stages of PPs in DL systems developed in TensorFlow and Keras, with 224 PPs collected from 210 StackOverflow posts, and to ii) assess the capability of existing performance analysis approaches in tackling PPs, with a constructed benchmark of 58 PPs in DL systems. Our findings shed light on the implications on developing high-performance DL systems, and detecting and localizing PPs in DL systems. To demonstrate the usefulness of our findings, we develop a static checker Deep-Perf to detect three types of PPs. It has detected 488 new PPs in 130 GitHub projects. 105 and 27 PPs have been confirmed and fixed.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v2 [stat.ML] UPDATED)
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows specific saddle-to-saddle dynamics.
    Curiosity-Driven Multi-Agent Exploration with Mixed Objectives. (arXiv:2210.16468v1 [cs.AI])
    Intrinsic rewards have been increasingly used to mitigate the sparse reward problem in single-agent reinforcement learning. These intrinsic rewards encourage the agent to look for novel experiences, guiding the agent to explore the environment sufficiently despite the lack of extrinsic rewards. Curiosity-driven exploration is a simple yet efficient approach that quantifies this novelty as the prediction error of the agent's curiosity module, an internal neural network that is trained to predict the agent's next state given its current state and action. We show here, however, that naively using this curiosity-driven approach to guide exploration in sparse reward cooperative multi-agent environments does not consistently lead to improved results. Straightforward multi-agent extensions of curiosity-driven exploration take into consideration either individual or collective novelty only and thus, they do not provide a distinct but collaborative intrinsic reward signal that is essential for learning in cooperative multi-agent tasks. In this work, we propose a curiosity-driven multi-agent exploration method that has the mixed objective of motivating the agents to explore the environment in ways that are individually and collectively novel. First, we develop a two-headed curiosity module that is trained to predict the corresponding agent's next observation in the first head and the next joint observation in the second head. Second, we design the intrinsic reward formula to be the sum of the individual and joint prediction errors of this curiosity module. We empirically show that the combination of our curiosity module architecture and intrinsic reward formulation guides multi-agent exploration more efficiently than baseline approaches, thereby providing the best performance boost to MARL algorithms in cooperative navigation environments with sparse rewards.
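    A minimal sketch of the two-headed module and the mixed intrinsic reward (sum of individual and joint prediction errors) might look as follows; all dimensions and the stand-in tensors are assumptions.

        # Sketch: two-headed curiosity module for mixed-objective exploration.
        import torch
        import torch.nn as nn

        class TwoHeadCuriosity(nn.Module):
            def __init__(self, obs_dim, act_dim, n_agents, hidden=128):
                super().__init__()
                self.trunk = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden), nn.ReLU())
                self.own_head = nn.Linear(hidden, obs_dim)               # predicts own next observation
                self.joint_head = nn.Linear(hidden, obs_dim * n_agents)  # predicts joint next observation

            def forward(self, obs, act):
                z = self.trunk(torch.cat([obs, act], dim=-1))
                return self.own_head(z), self.joint_head(z)

        mod = TwoHeadCuriosity(obs_dim=16, act_dim=4, n_agents=3)
        obs, act = torch.randn(8, 16), torch.randn(8, 4)
        next_obs, next_joint = torch.randn(8, 16), torch.randn(8, 48)

        pred_own, pred_joint = mod(obs, act)
        r_intrinsic = ((pred_own - next_obs) ** 2).mean(-1) + ((pred_joint - next_joint) ** 2).mean(-1)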
    A Broad Dataset is All You Need for One-Shot Object Detection. (arXiv:2011.04267v2 [cs.CV] UPDATED)
    Is it possible to detect arbitrary objects from a single example? A central problem of all existing attempts at one-shot object detection is the generalization gap: Object categories used during training are detected much more reliably than novel ones. We here show that this generalization gap can be nearly closed by increasing the number of object categories used during training. Doing so allows us to improve generalization from seen to unseen classes from 45% to 89% and improve the state-of-the-art on COCO by 5.4% AP50 (from 22.0 to 27.5). We verify that the effect is caused by the number of categories and not the number of training samples, and that it holds for different models, backbones and datasets. This result suggests that the key to strong few-shot detection models may not lie in sophisticated metric learning approaches, but instead simply in scaling the number of categories. We hope that our findings will help to better understand the challenges of few-shot learning and encourage future data annotation efforts to focus on wider datasets with a broader set of categories rather than gathering more samples per category.
    Multi-View Attention Transfer for Efficient Speech Enhancement. (arXiv:2208.10367v2 [cs.SD] UPDATED)
    Recent deep learning models have achieved high performance in speech enhancement; however, it is still challenging to obtain a fast and low-complexity model without significant performance degradation. Previous knowledge distillation studies on speech enhancement could not solve this problem because their output distillation methods do not fit the speech enhancement task in some aspects. In this study, we propose multi-view attention transfer (MV-AT), a feature-based distillation, to obtain efficient speech enhancement models in the time domain. Based on a multi-view feature extraction model, MV-AT transfers multi-view knowledge of the teacher network to the student network without additional parameters. The experimental results show that the proposed method consistently improved the performance of student models of various sizes on the Valentini and deep noise suppression (DNS) datasets. MANNER-S-8.1GF with our proposed method, a lightweight model for efficient deployment, achieved 15.4x and 4.71x fewer parameters and floating-point operations (FLOPs), respectively, compared to the baseline model with similar performance.  ( 2 min )
    Flatter, faster: scaling momentum for optimal speedup of SGD. (arXiv:2210.16400v1 [cs.LG])
    Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study implicit bias arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter $1-\beta$ with the learning rate to the power of $2/3$ maximally accelerates training, without sacrificing generalization. To analytically derive this result we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well-separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We perform experiments, including matrix sensing and ResNet on CIFAR10, which provide evidence for the robustness of these results.
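    In practice the proposed rule amounts to tying the momentum hyperparameter to the learning rate; a sketch (with the proportionality constant assumed to be 1) is below.

        # Sketch: setting SGD momentum so that 1 - beta scales as lr**(2/3).
        import torch

        lr = 0.01
        beta = 1.0 - lr ** (2.0 / 3.0)        # 1 - beta ~ lr^(2/3)
        model = torch.nn.Linear(10, 1)
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=beta)
        print(f"lr={lr}, momentum={beta:.4f}")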
    GradSkip: Communication-Accelerated Local Gradient Methods with Better Computational Complexity. (arXiv:2210.16402v1 [cs.LG])
    In this work, we study distributed optimization algorithms that reduce the high communication costs of synchronization by allowing clients to perform multiple local gradient steps in each communication round. Recently, Mishchenko et al. (2022) proposed a new type of local method, called ProxSkip, that enjoys an accelerated communication complexity without any data similarity condition. However, their method requires all clients to call local gradient oracles with the same frequency. Because of statistical heterogeneity, we argue that clients with well-conditioned local problems should compute their local gradients less frequently than clients with ill-conditioned local problems. Our first contribution is the extension of the original ProxSkip method to the setup where clients are allowed to perform a different number of local gradient steps in each communication round. We prove that our modified method, GradSkip, still converges linearly, has the same accelerated communication complexity, and the required frequency for local gradient computations is proportional to the local condition number. Next, we generalize our method by extending the randomness of probabilistic alternations to arbitrary unbiased compression operators and considering a generic proximable regularizer. This generalization, GradSkip+, recovers several related methods in the literature. Finally, we present an empirical study to confirm our theoretical claims.
    Memory Capacity of Recurrent Neural Networks with Matrix Representation. (arXiv:2104.07454v2 [cs.LG] UPDATED)
    It is well known that canonical recurrent neural networks (RNNs) face limitations in learning long-term dependencies which have been addressed by memory structures in long short-term memory (LSTM) networks. Neural Turing machines (NTMs) are novel RNNs that implement the notion of programmable computers with neural network controllers which can learn simple algorithmic tasks. Matrix neural networks feature matrix representation which inherently preserves the spatial structure of data when compared to canonical neural networks that use vector-based representation. The matrix representation of neural networks also has the potential to provide better memory capacity. In this paper, we define and study a probabilistic notion of memory capacity based on Fisher information for matrix-based RNNs. We find bounds on memory capacity for such networks under various hypotheses and compare them with their vector counterparts. In particular, we show that the memory capacity of such networks is bounded by $N^2$ for an $N\times N$ state matrix, which generalizes the bound known for vector networks. We also show and analyze the increase in memory capacity for such networks which is introduced when one exhibits an external state memory, such as in Neural Turing Machines (NTMs). This motivates us to construct NTMs with RNN controllers with matrix-based representation of external memory, leading us to introduce Matrix NTMs. We demonstrate the performance of this class of memory networks under certain algorithmic learning tasks such as copying and recall and compare it with Matrix RNNs. We find an improvement in the performance of Matrix NTMs by the addition of external memory.
    Recurrent Convolutional Deep Neural Networks for Modeling Time-Resolved Wildfire Spread Behavior. (arXiv:2210.16411v1 [cs.LG])
    The increasing incidence and severity of wildfires underscores the necessity of accurately predicting their behavior. While high-fidelity models derived from first principles offer physical accuracy, they are too computationally expensive for use in real-time fire response. Low-fidelity models sacrifice some physical accuracy and generalizability via the integration of empirical measurements, but enable real-time simulations for operational use in fire response. Machine learning techniques offer the ability to bridge these objectives by learning first-principles physics while achieving computational speedup. While deep learning approaches have demonstrated the ability to predict wildfire propagation over large time periods, time-resolved fire-spread predictions are needed for active fire management. In this work, we evaluate the ability of deep learning approaches in accurately modeling the time-resolved dynamics of wildfires. We use an autoregressive process in which a convolutional recurrent deep learning model makes predictions that propagate a wildfire over 15 minute increments. We demonstrate the model in application to three simulated datasets of increasing complexity, containing both field fires with homogeneous fuel distribution and real-world topologies sampled from the California region of the United States. We show that even after 100 autoregressive predictions representing more than 24 hours of simulated fire spread, the resulting models generate stable and realistic propagation dynamics, achieving a Jaccard score between 0.89 and 0.94 when predicting the resulting fire scar.
    Scalable Spectral Clustering with Group Fairness Constraints. (arXiv:2210.16435v1 [cs.LG])
    There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity where in each cluster, each protected group is represented with the same proportion as in the entirety. While the FairSC algorithm (Kleindessner et al., 2019) is able to find fairer clusterings, it is compromised by the high cost of explicitly computing nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, only involves sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clustering. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that its computational costs only increase marginally compared to SC without fairness constraints.
    A Theoretical Understanding of Gradient Bias in Meta-Reinforcement Learning. (arXiv:2112.15400v3 [cs.LG] UPDATED)
    Gradient-based Meta-RL (GMRL) refers to methods that maintain two-level optimisation procedures wherein the outer-loop meta-learner guides the inner-loop gradient-based reinforcement learner to achieve fast adaptations. In this paper, we develop a unified framework that describes variations of GMRL algorithms and points out that existing stochastic meta-gradient estimators adopted by GMRL are actually \textbf{biased}. Such meta-gradient bias comes from two sources: 1) the compositional bias incurred by the two-level problem structure, which has an upper bound of $\mathcal{O}\big(K\alpha^{K}\hat{\sigma}_{\text{In}}|\tau|^{-0.5}\big)$ \emph{w.r.t.} inner-loop update step $K$, learning rate $\alpha$, estimate variance $\hat{\sigma}^{2}_{\text{In}}$ and sample size $|\tau|$, and 2) the multi-step Hessian estimation bias $\hat{\Delta}_{H}$ due to the use of autodiff, which has a polynomial impact $\mathcal{O}\big((K-1)(\hat{\Delta}_{H})^{K-1}\big)$ on the meta-gradient bias. We study tabular MDPs empirically and offer quantitative evidence supporting our theoretical findings on existing stochastic meta-gradient estimators. Furthermore, we conduct experiments on Iterated Prisoner's Dilemma and Atari games to show how other methods such as off-policy learning and low-bias estimators can help fix the gradient bias for GMRL algorithms in general.
    Continuous Attribution of Episodical Outcomes for More Efficient and Targeted Online Measurement. (arXiv:2210.16373v1 [stat.AP])
    Online experimentation platforms collect user feedback at low cost and large scale. Some systems even support real-time or near real-time data processing, and can update metrics and statistics continuously. Many commonly used metrics, such as clicks and page views, can be observed without much delay. However, many important signals can only be observed after several hours or days, with noise adding up over the duration of the episode. When episodical outcomes follow a complex sequence of user-product interactions, it is difficult to understand which interactions lead to the final outcome. There is no obvious attribution logic for us to associate a positive or negative outcome back to the actions and choices we made at different times. This attribution logic is critical to unlocking more targeted and efficient measurement at a finer granularity that could eventually lead to the full capability of reinforcement learning. In this paper, we borrow the idea of Causal Surrogacy to model a long-term outcome using leading indicators that are incrementally observed and apply it as the value function to track the progress towards the final outcome and attribute incrementally to various user-product interaction steps. Applying this approach to the guest booking metric at Airbnb resulted in significant variance reductions of 50% to 85%, while aligning well with the booking metric itself. Continuous attribution allows us to assign a utility score to each product page-view, and this score can be flexibly further aggregated to a variety of units of interest, such as searches and listings. We provide multiple real-world applications of attribution to illustrate its versatility.
    Dynamic Bandits with an Auto-Regressive Temporal Structure. (arXiv:2210.16386v1 [cs.LG])
    Multi-armed bandit (MAB) problems are mainly studied under two extreme settings known as stochastic and adversarial. These two settings, however, do not capture realistic environments such as search engines, marketing, and advertising, in which rewards stochastically change over time. Motivated by this, we introduce and study a dynamic MAB problem with stochastic temporal structure, where the expected reward of each arm is governed by an auto-regressive (AR) model. Due to the dynamic nature of the rewards, simple "explore and commit" policies fail, as all arms have to be explored continuously over time. We formalize this by characterizing a per-round regret lower bound, where the regret is measured against a strong (dynamic) benchmark. We then present an algorithm whose per-round regret almost matches our regret lower bound. Our algorithm relies on two mechanisms: (i) alternating between recently pulled arms and unpulled arms with potential, and (ii) restarting. These mechanisms enable the algorithm to dynamically adapt to changes and discard irrelevant past information at a suitable rate. In numerical studies, we further demonstrate the strength of our algorithm under different types of non-stationary settings.
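    A minimal simulation of the reward environment described above (not the authors' algorithm): each arm's expected reward follows an AR(1) process, so past observations go stale and every arm must be revisited. The decay rate and noise scales are illustrative assumptions.

```python
# Toy AR(1) bandit environment; parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_arms, gamma, horizon = 3, 0.9, 200
mu = rng.normal(size=n_arms)                  # expected rewards at t = 0

for t in range(horizon):
    mu = gamma * mu + rng.normal(scale=0.1, size=n_arms)  # AR(1) evolution
    arm = rng.integers(n_arms)                # whichever arm a policy pulls
    reward = mu[arm] + rng.normal(scale=0.5)  # noisy observed reward
```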
    When does mixup promote local linearity in learned representations?. (arXiv:2210.16413v1 [cs.LG])
    Mixup is a regularization technique that artificially produces new samples using convex combinations of original training points. This simple technique has shown strong empirical performance, and has been heavily used as part of semi-supervised learning techniques such as MixMatch~\citep{berthelot2019mixmatch} and Interpolation Consistency Training (ICT)~\citep{verma2019interpolation}. In this paper, we look at Mixup through a \emph{representation learning} lens in a semi-supervised learning setup. In particular, we study the role of Mixup in promoting linearity in the learned network representations. Towards this, we study two questions: (1) how does the Mixup loss that enforces linearity in the \emph{last} network layer propagate the linearity to the \emph{earlier} layers?; and (2) how does the enforcement of stronger Mixup loss on more than two data points affect the convergence of training? We empirically investigate these properties of Mixup on vision datasets such as CIFAR-10, CIFAR-100 and SVHN. Our results show that supervised Mixup training does not make \emph{all} the network layers linear; in fact the \emph{intermediate layers} become more non-linear during Mixup training compared to a network that is trained \emph{without} Mixup. However, when Mixup is used as an unsupervised loss, we observe that all the network layers become more linear, resulting in faster training convergence.
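    For reference, the Mixup operation analyzed above is a few lines of code; the Beta parameter and toy shapes below are illustrative assumptions.

```python
# Minimal sketch of Mixup: convex combinations of paired inputs and labels.
import numpy as np

def mixup(x, y, alpha=0.2, rng=np.random.default_rng(0)):
    lam = rng.beta(alpha, alpha)             # mixing coefficient in (0, 1)
    idx = rng.permutation(len(x))            # random partner for each sample
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]

rng = np.random.default_rng(1)
x = rng.normal(size=(8, 32, 32, 3))          # toy image batch
y = np.eye(10)[rng.integers(0, 10, size=8)]  # one-hot labels
x_mix, y_mix = mixup(x, y)
```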
    Distributed Black-box Attack against Image Classification Cloud Services. (arXiv:2210.16371v1 [cs.LG])
    Black-box adversarial attacks can fool image classifiers into misclassifying images without requiring access to model structure and weights. Recently proposed black-box attacks can achieve a success rate of more than 95\% after less than 1,000 queries. The question then arises of whether black-box attacks have become a real threat against IoT devices that rely on cloud APIs to achieve image classification. To shed some light on this, note that prior research has primarily focused on increasing the success rate and reducing the number of required queries. However, another crucial factor for black-box attacks against cloud APIs is the time required to perform the attack. This paper applies black-box attacks directly to cloud APIs rather than to local models, thereby avoiding multiple mistakes made in prior research. Further, we exploit load balancing to enable distributed black-box attacks that can reduce the attack time by a factor of about five for both local search and gradient estimation methods.
    On Rate-Distortion Theory in Capacity-Limited Cognition & Reinforcement Learning. (arXiv:2210.16877v1 [cs.LG])
    Throughout the cognitive-science literature, there is widespread agreement that decision-making agents operating in the real world do so under limited information-processing capabilities and without access to unbounded cognitive or computational resources. Prior work has drawn inspiration from this fact and leveraged an information-theoretic model of such behaviors or policies as communication channels operating under a bounded rate constraint. Meanwhile, a parallel line of work also capitalizes on the same principles from rate-distortion theory to formalize capacity-limited decision making through the notion of a learning target, which facilitates Bayesian regret bounds for provably-efficient learning algorithms. In this paper, we aim to elucidate this latter perspective by presenting a brief survey of these information-theoretic models of capacity-limited decision making in biological and artificial agents.
    The secret role of undesired physical effects in accurate shape sensing with eccentric FBGs. (arXiv:2210.16316v1 [cs.LG])
    Fiber optic shape sensors have enabled unique advances in various navigation tasks, from medical tool tracking to industrial applications. Eccentric fiber Bragg gratings (FBG) are cheap and easy-to-fabricate shape sensors that are often interrogated with simple setups. However, using low-cost interrogation systems for such intensity-based quasi-distributed sensors introduces further complications to the sensor's signal. Therefore, eccentric FBGs have not been able to accurately estimate complex multi-bend shapes. Here, we present a novel technique to overcome these limitations and provide accurate and precise shape estimation in eccentric FBG sensors. We investigate the most important bending-induced effects in curved optical fibers that are usually eliminated in intensity-based fiber sensors. These effects contain shape deformation information with a higher spatial resolution that we are now able to extract using deep learning techniques. We design a deep learning model based on a convolutional neural network that is trained to predict shapes given the sensor's spectra. We also provide a visual explanation, highlighting wavelength elements whose intensities are more relevant in making shape predictions. These findings imply that deep learning techniques benefit from the bending-induced effects that impact the desired signal in a complex manner. This is the first step toward cheap yet accurate fiber shape sensing solutions.
    Revisiting consistency for semi-supervised semantic segmentation. (arXiv:2106.07075v4 [cs.CV] UPDATED)
    Semi-supervised learning is an attractive technique in practical deployments of deep models since it relaxes the dependence on labeled data. It is especially important in the scope of dense prediction because pixel-level annotation requires significant effort. This paper considers semi-supervised algorithms that enforce consistent predictions over perturbed unlabeled inputs. We study the advantages of perturbing only one of the two model instances and preventing the backward pass through the unperturbed instance. We also propose a competitive perturbation model as a composition of geometric warp and photometric jittering. We experiment with efficient models due to their importance for real-time and low-power applications. Our experiments show clear advantages of (1) one-way consistency, (2) perturbing only the student branch, and (3) strong photometric and geometric perturbations. Our perturbation model outperforms recent work, and most of the improvement comes from the photometric component. Experiments with additional data from the large coarsely annotated subset of Cityscapes suggest that semi-supervised training can outperform supervised training with the coarse labels.
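    A minimal sketch of the one-way consistency idea described above, assuming PyTorch: only the student branch sees the perturbed input, and torch.no_grad() blocks the backward pass through the unperturbed (teacher) branch. The linear model and additive-noise perturbation are placeholders for a segmentation network and the paper's warp/jitter perturbations.

```python
# One-way consistency with a stopped gradient on the clean branch (sketch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def one_way_consistency(model, x_unlabeled, perturb):
    with torch.no_grad():                     # no backward pass through the
        teacher = model(x_unlabeled)          # unperturbed teacher branch
    student = model(perturb(x_unlabeled))     # only the student is perturbed
    return F.kl_div(student.log_softmax(dim=1),
                    teacher.softmax(dim=1), reduction="batchmean")

model = nn.Linear(16, 4)                      # placeholder for a real network
x = torch.randn(8, 16)
loss = one_way_consistency(model, x, lambda t: t + 0.1 * torch.randn_like(t))
loss.backward()
```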
    Evaluation of Categorical Generative Models -- Bridging the Gap Between Real and Synthetic Data. (arXiv:2210.16405v1 [cs.LG])
    The machine learning community has mainly relied on real data to benchmark algorithms as it provides compelling evidence of model applicability. Evaluation on synthetic datasets can be a powerful tool to provide a better understanding of a model's strengths, weaknesses, and overall capabilities. Gaining these insights can be particularly important for generative modeling as the target quantity is completely unknown. Multiple issues related to the evaluation of generative models have been reported in the literature. We argue those problems can be avoided by an evaluation based on ground truth. General criticisms of synthetic experiments are that they are too simplified and not representative of practical scenarios. As such, our experimental setting is tailored to a realistic generative task. We focus on categorical data and introduce an appropriately scalable evaluation method. Our method involves tasking a generative model to learn a distribution in a high-dimensional setting. We then successively bin the large space to obtain smaller probability spaces where meaningful statistical tests can be applied. We consider increasingly large probability spaces, which correspond to increasingly difficult modeling tasks and compare the generative models based on the highest task difficulty they can reach before being detected as being too far from the ground truth. We validate our evaluation procedure with synthetic experiments on both synthetic generative models and current state-of-the-art categorical generative models.
    Reformulating van Rijsbergen's $F_{\beta}$ metric for weighted binary cross-entropy. (arXiv:2210.16458v1 [stat.ML])
    The separation of performance metrics from gradient-based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen's $F_{\beta}$ metric -- a popular choice for gauging classification performance. Through distributional assumptions on $F_{\beta}$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_{\beta}$ metric is reformulated to facilitate assuming statistical distributions, with accompanying proofs for the cumulative distribution function. These probabilities are used within a knee-curve algorithm to find an optimal $\beta$, denoted $\beta_{opt}$. This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data with imbalanced classes mostly yields better and more interpretable results than the baseline. For example, for the IMDB text data with known labeling errors, a 14% boost is shown. This methodology can accelerate training and provide better interpretation.
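    Recall van Rijsbergen's metric $F_{\beta} = (1+\beta^2)PR/(\beta^2 P + R)$ for precision $P$ and recall $R$. A hedged sketch of how a $\beta_{opt}$ found upstream could enter a weighted binary cross-entropy is below; the weighting scheme here is an assumption for illustration, not the paper's exact formulation.

```python
# Weighted BCE with a beta-derived penalty on the positive class (sketch).
import numpy as np

def weighted_bce(y_true, p_pred, beta_opt, eps=1e-7):
    p = np.clip(p_pred, eps, 1 - eps)
    pos = -beta_opt * y_true * np.log(p)      # penalized positive-class term
    neg = -(1 - y_true) * np.log(1 - p)
    return np.mean(pos + neg)

y = np.array([1, 0, 0, 0, 1])                 # imbalanced toy labels
p = np.array([0.6, 0.2, 0.1, 0.3, 0.4])
print(weighted_bce(y, p, beta_opt=2.0))
```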
    Estimating oil recovery factor using machine learning: Applications of XGBoost classification. (arXiv:2210.16345v1 [cs.LG])
    In petroleum engineering, it is essential to determine the ultimate recovery factor, RF, particularly before exploitation and exploration. However, accurately estimating RF requires data that is not necessarily available or measured at early stages of reservoir development. We, therefore, applied machine learning (ML), using readily available features, to estimate oil RF for ten classes defined in this study. To construct the ML models, we applied the XGBoost classification algorithm. Classification was chosen because the recovery factor is bounded from 0 to 1, much like a probability. Three databases were merged, leaving us with four different combinations to first train and test the ML models and then further evaluate them using an independent database including unseen data. Ten-fold cross-validation was applied on the training datasets to assess the effectiveness of the models. To evaluate the accuracy and reliability of the models, the accuracy, neighborhood accuracy, and macro-averaged F1 score were determined. Overall, results showed that the XGBoost classification algorithm could estimate the RF class with accuracies as high as 0.49 on the training datasets, 0.34 on the testing datasets, and 0.2 on the independent database of unseen data. We found that the reliability of the XGBoost model depended on the data in the training dataset, meaning that the ML models were database dependent. The feature importance analysis and the SHAP approach showed that the most important features were reserves and reservoir area and thickness.
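    A minimal sketch of the modeling setup, assuming the scikit-learn API of xgboost: a multi-class XGBoost classifier over ten RF bins evaluated with ten-fold cross-validation. The synthetic features stand in for the readily available reservoir properties.

```python
# Ten-class XGBoost classification with 10-fold CV (illustrative data).
import numpy as np
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))       # stand-ins for reserves, area, thickness, ...
y = rng.integers(0, 10, size=500)   # RF binned into ten classes

model = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
scores = cross_val_score(model, X, y, cv=10, scoring="accuracy")
print(scores.mean())
```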
    Data-driven low-dimensional dynamic model of Kolmogorov flow. (arXiv:2210.16708v1 [cs.LG])
    Reduced order models (ROMs) that capture flow dynamics are of interest for decreasing computational costs for simulation as well as for model-based control approaches. This work presents a data-driven framework for minimal-dimensional models that effectively capture the dynamics and properties of the flow. We apply this to Kolmogorov flow in a regime consisting of chaotic and intermittent behavior, which is common in many flow processes and is challenging to model. The trajectory of the flow travels near relative periodic orbits (RPOs), interspersed with sporadic bursting events corresponding to excursions between the regions containing the RPOs. The first step in developing the models is the use of an undercomplete autoencoder to map from the full state data down to a latent space of dramatically lower dimension. Then models of the discrete-time evolution of the dynamics in the latent space are developed. By analyzing the model performance as a function of latent space dimension, we can estimate the minimum number of dimensions required to capture the system dynamics. To further reduce the dimension of the dynamical model, we factor out a phase variable in the direction of translational invariance for the flow, leading to separate evolution equations for the pattern and phase dynamics. At a model dimension of five for the pattern dynamics, as opposed to the full state dimension of 1024 (i.e. a 32x32 grid), accurate predictions are found for individual trajectories out to about two Lyapunov times, as well as for long-time statistics. The nearly heteroclinic connections between the different RPOs, including the quiescent and bursting time scales, are well captured. We also capture key features of the phase dynamics. Finally, we use the low-dimensional representation to predict future bursting events, finding good success.
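    A minimal sketch of the two-stage structure described above, assuming PyTorch: an undercomplete autoencoder compresses 1024-dimensional snapshots to a 5-dimensional latent space, and a second network learns the discrete-time latent map h_{t+1} = f(h_t). Layer sizes and MLP architectures are illustrative assumptions.

```python
# Autoencoder + latent-dynamics model for snapshot data (sketch).
import torch
import torch.nn as nn

enc = nn.Sequential(nn.Linear(1024, 128), nn.ReLU(), nn.Linear(128, 5))
dec = nn.Sequential(nn.Linear(5, 128), nn.ReLU(), nn.Linear(128, 1024))
stepper = nn.Sequential(nn.Linear(5, 64), nn.ReLU(), nn.Linear(64, 5))

snapshots = torch.randn(256, 1024)            # stand-in for flow snapshots
recon_loss = ((dec(enc(snapshots)) - snapshots) ** 2).mean()
h_t, h_next = enc(snapshots[:-1]), enc(snapshots[1:])
dyn_loss = ((stepper(h_t) - h_next) ** 2).mean()  # discrete-time latent map
(recon_loss + dyn_loss).backward()
```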
    PARC: Physics-Aware Recurrent Convolutional Neural Networks to Assimilate Meso-scale Reactive Mechanics of Energetic Materials. (arXiv:2204.07234v2 [cond-mat.mtrl-sci] UPDATED)
    The thermomechanical properties of energetic materials (EM) are known to be a function of their microscopic structures, i.e., morphological configurations of crystals and pores. This microstructural dependency has motivated vigorous research in the EM community, seeking to engineer material microstructures with targeted properties and performance under the materials-by-design paradigm. However, establishing the complex structure-property-performance (SPP) relationships of EMs demands extensive experimental and simulation efforts, and assimilating and encapsulating these relationships in usable models is a challenge. Here, we present a novel deep learning method, Physics-Aware Recurrent Convolutional (PARC) Neural Network, that can "learn" the mesoscale thermo-mechanics of EM microstructures during the shock-to-detonation transition (SDT). We show that this new approach can produce accurate high-fidelity predictions of time-evolving temperature and pressure fields of the same quality as the state-of-the-art direct numerical simulations (DNS), despite the dramatic reduction of computing time, from hours and days on a high-performance computing cluster (HPC) to a little more than a second on a commodity laptop. We also demonstrate that PARC can provide physical insights, i.e., the artificial neurons can illuminate the underlying physics by identifying which microstructural features led to critical hotspots and what are the characteristics of "critical" versus "non-critical" microstructures. This new knowledge generated alongside the capacity to conduct high-throughput experiments will broaden our theoretical understanding of the initiation mechanisms of EM detonation, as a step towards engineering EMs with specific properties.
    EFFGAN: Ensembles of fine-tuned federated GANs. (arXiv:2206.11682v2 [cs.LG] UPDATED)
    Generative adversarial networks have proven to be a powerful tool for learning complex and high-dimensional data distributions, but issues such as mode collapse have been shown to make it difficult to train them. This is an even harder problem when the data is decentralized over several clients in a federated learning setup, as problems such as client drift and non-iid data make it hard for federated averaging to converge. In this work, we study how to learn a data distribution when training data is heterogeneously decentralized over clients and cannot be shared. Our goal is to sample from this distribution centrally, while the data never leaves the clients. We show using standard benchmark image datasets that existing approaches fail in this setting, experiencing so-called client drift when the local number of epochs becomes too large. We thus propose a novel approach we call EFFGAN: Ensembles of fine-tuned federated GANs. Being an ensemble of local expert generators, EFFGAN is able to learn the data distribution over all clients and mitigate client drift. It is able to train with a large number of local epochs, making it more communication efficient than previous works.
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v1 [cs.LG])
    Good decision making requires machine-learning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key model-comparison factor across the most accurate, calibrated models. We also show that distribution shifts lead to high grouping loss.
    On-the-fly Object Detection using StyleGAN with CLIP Guidance. (arXiv:2210.16742v1 [cs.CV])
    We present a fully automated framework for building object detectors on satellite imagery without requiring any human annotation or intervention. We achieve this by leveraging the combined power of modern generative models (e.g., StyleGAN) and recent advances in multi-modal learning (e.g., CLIP). While deep generative models effectively encode the key semantics pertinent to a data distribution, this information is not immediately accessible for downstream tasks, such as object detection. In this work, we exploit CLIP's ability to associate image features with text descriptions to identify neurons in the generator network, which are subsequently used to build detectors on-the-fly.
    Modeling Volatility and Dependence of European Carbon and Energy Prices. (arXiv:2208.14311v2 [q-fin.ST] UPDATED)
    We study the prices of European Emission Allowances (EUA), whereby we analyze their uncertainty and dependencies on related energy prices (natural gas, coal, and oil). We propose a probabilistic multivariate conditional time series model with a VECM-Copula-GARCH structure which exploits key characteristics of the data. We normalize the EUA and fuel price data with respect to inflation and carbon emissions to allow for proper cross-series evaluation. The forecasting performance is evaluated in an extensive rolling-window forecasting study, covering eight years out-of-sample. The accuracy of the multivariate probabilistic forecasts is assessed using the energy score. We discuss our findings for both levels- and log-transformed data, focusing on time-varying correlations, and in view of the Russian invasion of Ukraine.
    Improving Hyperspectral Adversarial Robustness using Ensemble Networks in the Presences of Multiple Attacks. (arXiv:2210.16346v1 [cs.LG])
    Semantic segmentation of hyperspectral images (HSI) has seen great strides in recent years by incorporating knowledge from deep learning RGB classification models. Similar to their classification counterparts, semantic segmentation models are vulnerable to adversarial examples and need adversarial training to counteract them. Traditional approaches to adversarial robustness focus on training or retraining a single network on attacked data; however, in the presence of multiple attacks these approaches decrease the performance compared to networks trained individually on each attack. To combat this issue, we propose an Adversarial Discriminator Ensemble Network (ADE-Net), which focuses on attack type detection and adversarial robustness under a unified model to preserve per-data-type weights optimally while robustifying the overall network. In the proposed method, a discriminator network is used to separate data by attack type into their specific attack-expert ensemble network. Our approach allows for the presence of multiple attacks mixed together while also labeling attack types during testing. We experimentally show that ADE-Net outperforms the baseline, which is a single network adversarially trained under a mix of multiple attacks, for the HSI Indian Pines, Kennedy Space, and Houston datasets.
    AI Poincar\'{e} 2.0: Machine Learning Conservation Laws from Differential Equations. (arXiv:2203.12610v2 [cs.LG] UPDATED)
    We present a machine learning algorithm that discovers conservation laws from differential equations, both numerically (parametrized as neural networks) and symbolically, ensuring their functional independence (a non-linear generalization of linear independence). Our independence module can be viewed as a nonlinear generalization of singular value decomposition. Our method can readily handle inductive biases for conservation laws. We validate it with examples including the 3-body problem, the KdV equation and nonlinear Schr\"odinger equation.
    Lessons Learned: How (Not) to Defend Against Property Inference Attacks. (arXiv:2205.08821v3 [cs.CR] UPDATED)
    This work investigates and evaluates multiple defense strategies against property inference attacks (PIAs), a privacy attack against machine learning models. Given a trained machine learning model, PIAs aim to extract statistical properties of its underlying training data, e.g., reveal the ratio of men and women in a medical training data set. While for other privacy attacks like membership inference, a lot of research on defense mechanisms has been published, this is the first work focusing on defending against PIAs. With the primary goal of developing a generic mitigation strategy against white-box PIAs, we propose the novel approach property unlearning. Extensive experiments with property unlearning show that while it is very effective when defending target models against specific adversaries, property unlearning is not able to generalize, i.e., protect against a whole class of PIAs. To investigate the reasons behind this limitation, we present the results of experiments with the explainable AI tool LIME. They show how state-of-the-art property inference adversaries with the same objective focus on different parts of the target model. We further elaborate on this with a follow-up experiment, in which we use the visualization technique t-SNE to exhibit how severely statistical training data properties are manifested in machine learning models. Based on this, we develop the conjecture that post-training techniques like property unlearning might not suffice to provide the desirable generic protection against PIAs. As an alternative, we investigate the effects of simpler training data preprocessing methods like adding Gaussian noise to images of a training data set on the success rate of PIAs. We conclude with a discussion of the different defense approaches, summarize the lessons learned and provide directions for future work.
    Using Interpretable Machine Learning to Massively Increase the Number of Antibody-Virus Interactions Across Studies. (arXiv:2206.14566v2 [q-bio.QM] UPDATED)
    A central challenge in every field of biology is to use existing measurements to predict the outcomes of future experiments. In this work, we consider the wealth of antibody inhibition data against variants of the influenza virus. Due to this virus's genetic diversity and evolvability, the variants examined in one study will often have little-to-no overlap with other studies, making it difficult to discern common patterns or unify datasets for further analysis. To that end, we develop a computational framework that predicts how an antibody or serum would inhibit any variant from any other study. We use this framework to greatly expand seven influenza datasets utilizing hemagglutination inhibition, validating our method on 200,000 existing measurements and predicting 2,000,000 new values along with their uncertainties. With these new values, we quantify the transferability between seven vaccination and infection studies in humans and ferrets, show that the serum potency is negatively correlated with breadth, and present a tool for pandemic preparedness. This data-driven approach does not require any information beyond each virus's name and measurements, and even datasets with as few as 5 viruses can be expanded, making this approach widely applicable. Future influenza studies using hemagglutination inhibition can directly utilize our curated datasets to predict newly measured antibody responses against ~80 H3N2 influenza viruses from 1968-2011, whereas immunological studies utilizing other viruses or a different assay only need a single partially-overlapping dataset to extend their work. In essence, this approach enables a shift in perspective when analyzing data from "what you see is what you get" into "what anyone sees is what everyone gets."  ( 3 min )
    Simultaneous off-the-grid learning of mixtures issued from a continuous dictionary. (arXiv:2210.16311v1 [stat.ML])
    In this paper we observe a set, possibly a continuum, of signals corrupted by noise. Each signal is a finite mixture of an unknown number of features belonging to a continuous dictionary. The continuous dictionary is parametrized by a real non-linear parameter. We shall assume that the signals share an underlying structure by saying that the union of active features in the whole dataset is finite. We formulate regularized optimization problems to estimate simultaneously the linear coefficients in the mixtures and the non-linear parameters of the features. The optimization problems are composed of a data fidelity term and an $(\ell_1, L_p)$-penalty. We prove high-probability bounds on the prediction errors associated to our estimators. The proof is based on the existence of certificate functions. Following recent works on the geometry of off-the-grid methods, we show that such functions can be constructed provided the parameters of the active features are pairwise separated by a constant with respect to a Riemannian metric. When the number of signals is finite and the noise is assumed Gaussian, we give refinements of our results for $p = 1$ and $p = 2$ using tail bounds on suprema of Gaussian and $\chi^2$ random processes. When $p = 2$, our prediction error reaches the rates obtained by the Group-Lasso estimator in the multi-task linear regression model.
    Investigation of chemical structure recognition by encoder-decoder models in learning progress. (arXiv:2210.16307v1 [physics.chem-ph])
    Descriptor generation methods using latent representations of encoder-decoder (ED) models with SMILES as input are useful because of the continuity of the descriptor and its restorability to the structure. However, it is not clear how the structure is recognized as learning progresses in ED models. In this work, we created ED models at various stages of learning progress and investigated the relationship between structural information and learning progress. We showed that compound substructures were learned early in ED models by monitoring the accuracy of downstream tasks and input-output substructure similarity using substructure-based descriptors, which suggests that existing evaluation methods based on the accuracy of downstream tasks may not be sensitive enough to evaluate the performance of ED models with SMILES as descriptor generation methods. On the other hand, we showed that structure restoration was time-consuming, and in particular, insufficient learning led to the estimation of a larger structure than the actual one. It can be inferred that determining the endpoint of the structure is a difficult task for the model. To our knowledge, this is the first study to link the learning progress of SMILES by ED models to chemical structures for a wide range of chemicals.
    Predicting Brain Age using Transferable coVariance Neural Networks. (arXiv:2210.16363v1 [cs.LG])
    The deviation between chronological age and biological age is a well-recognized biomarker associated with cognitive decline and neurodegeneration. Age-related and pathology-driven changes to brain structure are captured by various neuroimaging modalities. These datasets are characterized by high dimensionality as well as collinearity, hence applications of graph neural networks in neuroimaging research routinely use sample covariance matrices as graphs. We have recently studied covariance neural networks (VNNs) that operate on sample covariance matrices using the architecture derived from graph convolutional networks, and we showed VNNs enjoy significant advantages over traditional data analysis approaches. In this paper, we demonstrate the utility of VNNs in inferring brain age using cortical thickness data. Furthermore, our results show that VNNs exhibit multi-scale and multi-site transferability for inferring brain age. In the context of brain age in Alzheimer's disease (AD), our experiments show that i) VNN outputs are interpretable as brain age predicted using VNNs is significantly elevated for AD with respect to healthy subjects for different datasets; and ii) VNNs can be transferable, i.e., VNNs trained on one dataset can be transferred to another dataset with different dimensions without retraining for brain age prediction.
    A State-Augmented Approach for Learning Optimal Resource Management Decisions in Wireless Networks. (arXiv:2210.16412v1 [cs.LG])
    We consider a radio resource management (RRM) problem in a multi-user wireless network, where the goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users. We propose a state-augmented parameterization for the RRM policy, where alongside the instantaneous network states, the RRM policy takes as input the set of dual variables corresponding to the constraints. We provide theoretical justification for the feasibility and near-optimality of the RRM decisions generated by the proposed state-augmented algorithm. Focusing on the power allocation problem with RRM policies parameterized by a graph neural network (GNN) and dual variables sampled from the dual descent dynamics, we numerically demonstrate that the proposed approach achieves a superior trade-off between mean, minimum, and 5th percentile rates than baseline methods.
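    A minimal sketch of the state-augmentation described above: the policy receives the dual variables alongside the instantaneous state, and the duals follow dual-descent updates $\lambda \leftarrow [\lambda - \eta(f - c)]_+$. The softmax "policy", toy rate model, and step size are illustrative assumptions standing in for the GNN and the true channel model.

```python
# State-augmented policy with dual-descent dynamics (toy sketch).
import numpy as np

rng = np.random.default_rng(0)
n_users, eta, c_min = 4, 0.1, 1.0
lam = np.zeros(n_users)                        # dual variables

def policy(state, lam):
    scores = state + lam                       # duals enter as extra inputs
    return np.exp(scores) / np.exp(scores).sum()   # toy power allocation

for t in range(1000):
    state = rng.normal(size=n_users)           # instantaneous network state
    rates = np.log1p(policy(state, lam) * np.exp(state))  # toy rate model
    lam = np.maximum(0.0, lam - eta * (rates - c_min))    # dual descent
```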
    ZeroPrompt: Scaling Prompt-Based Pretraining to 1,000 Tasks Improves Zero-Shot Generalization. (arXiv:2201.06910v2 [cs.LG] UPDATED)
    We propose a multitask pretraining approach ZeroPrompt for zero-shot generalization, focusing on task scaling and zero-shot prompting. While previous models are trained on only a few dozen tasks, we scale to 1,000 tasks for the first time using real-world data. This leads to a crucial discovery that task scaling can be an efficient alternative to model scaling; i.e., the model size has little impact on performance with an extremely large number of tasks. Our results show that task scaling can substantially improve training efficiency by 30 times in FLOPs. Moreover, we present a prompting method that incorporates a genetic algorithm to automatically search for the best prompt for unseen tasks, along with a few other improvements. Empirically, ZeroPrompt substantially improves both the efficiency and the performance of zero-shot learning across a variety of academic and production datasets.
    Perturbation Analysis of Neural Collapse. (arXiv:2210.16658v1 [cs.LG])
    Training deep neural networks for classification often includes minimizing the training loss beyond the zero training error point. In this phase of training, a "neural collapse" behavior has been observed: the variability of features (outputs of the penultimate layer) of within-class samples decreases and the mean features of different classes approach a certain tight frame structure. Recent works analyze this behavior via idealized unconstrained features models where all the minimizers exhibit exact collapse. However, with practical networks and datasets, the features typically do not reach exact collapse, e.g., because deep layers cannot arbitrarily modify intermediate features that are far from being collapsed. In this paper, we propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models. For example, we prove reduction in the within-class variability of the optimized features compared to the predefined input features (via analyzing gradient flow on the "central-path" with minimal assumptions), analyze the minimizers in the near-collapse regime, and provide insights on the effect of regularization hyperparameters on the closeness to collapse. We support our theory with experiments in practical deep learning settings.
    Few-shot Image Generation via Adaptation-Aware Kernel Modulation. (arXiv:2210.16559v1 [cs.CV])
    Few-shot image generation (FSIG) aims to learn to generate new and diverse samples given an extremely limited number of samples from a domain, e.g., 10 training samples. Recent work has addressed the problem using a transfer learning approach, leveraging a GAN pretrained on a large-scale source domain dataset and adapting that model to the target domain based on very limited target domain samples. Central to recent FSIG methods are knowledge preserving criteria, which aim to select a subset of the source model's knowledge to be preserved into the adapted model. However, a major limitation of existing methods is that their knowledge preserving criteria consider only the source domain/source task and fail to consider the target domain/adaptation task in selecting the source model's knowledge, casting doubt on their suitability for setups of different proximity between source and target domain. Our work makes two contributions. As our first contribution, we revisit recent FSIG works and their experiments. Our important finding is that, under setups in which the assumption of close proximity between source and target domains is relaxed, existing state-of-the-art (SOTA) methods which consider only the source domain/source task in knowledge preserving perform no better than a baseline fine-tuning method. To address the limitation of existing methods, as our second contribution, we propose Adaptation-Aware kernel Modulation (AdAM) to address general FSIG across different source-target domain proximities. Extensive experimental results show that the proposed method consistently achieves SOTA performance across source/target domains of different proximity, including challenging setups where source and target domains are further apart. Project Page: https://yunqing-me.github.io/AdAM/
    Optimal-er Auctions through Attention. (arXiv:2202.13110v4 [cs.LG] UPDATED)
    RegretNet is a recent breakthrough in the automated design of revenue-maximizing auctions. It combines the flexibility of deep learning with the regret-based approach to relax the Incentive Compatibility (IC) constraint (that participants prefer to bid truthfully) in order to approximate optimal auctions. We propose two independent improvements of RegretNet. The first is a neural architecture denoted as RegretFormer that is based on attention layers. The second is a loss function that requires explicit specification of an acceptable IC violation denoted as the regret budget. We investigate both modifications in an extensive experimental study that includes settings with both fixed and varying numbers of items and participants, as well as novel validation procedures tailored to regret-based approaches. We find that RegretFormer consistently outperforms RegretNet in revenue (i.e. is optimal-er) and that our loss function both simplifies hyperparameter tuning and allows us to unambiguously control the revenue-regret trade-off by selecting the regret budget.
    Diverse Parallel Data Synthesis for Cross-Database Adaptation of Text-to-SQL Parsers. (arXiv:2210.16613v1 [cs.CL])
    Text-to-SQL parsers typically struggle with databases unseen at training time. Adapting parsers to new databases is a challenging problem due to the lack of natural language queries in the new schemas. We present ReFill, a framework for synthesizing high-quality and textually diverse parallel datasets for adapting a Text-to-SQL parser to a target schema. ReFill learns to retrieve-and-edit text queries from the existing schemas and transfers them to the target schema. We show that retrieving diverse existing text, masking their schema-specific tokens, and refilling with tokens relevant to the target schema leads to significantly more diverse text queries than achievable by standard SQL-to-Text generation methods. Through experiments spanning multiple databases, we demonstrate that fine-tuning parsers on datasets synthesized using ReFill consistently outperforms prior data-augmentation methods.
    Revisiting Simple Regret Minimization in Multi-Armed Bandits. (arXiv:2210.16913v1 [cs.LG])
    Simple regret is a natural and parameter-free performance criterion for identifying a good arm in multi-armed bandits yet is less popular than the probability of missing the best arm or an $\epsilon$-good arm, perhaps due to lack of easy ways to characterize it. In this paper, we achieve improved simple regret upper bounds for both data-rich ($T\ge n$) and data-poor regime ($T \le n$) where $n$ is the number of arms and $T$ is the number of samples. At its heart is an improved analysis of the well-known Sequential Halving (SH) algorithm that bounds the probability of returning an arm whose mean reward is not within $\epsilon$ from the best (i.e., not $\epsilon$-good) for any choice of $\epsilon>0$, although $\epsilon$ is not an input to SH. We show that this directly implies an optimal simple regret bound of $\mathcal{O}(\sqrt{n/T})$. Furthermore, our upper bound gets smaller as a function of the number of $\epsilon$-good arms. This results in an accelerated rate for the $(\epsilon,\delta)$-PAC criterion, which closes the gap between the upper and lower bounds in prior art. For the more challenging data-poor regime, we propose Bracketing SH (BSH) that enjoys the same improvement even without sampling each arm at least once. Our empirical study shows that BSH outperforms existing methods on real-world tasks.
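    For readers unfamiliar with it, the Sequential Halving algorithm analyzed above fits in a few lines: split the budget over roughly $\log_2 n$ rounds, sample surviving arms equally, and keep the better half each round. Gaussian rewards here are an illustrative assumption.

```python
# Sequential Halving for best-arm identification (sketch).
import math
import numpy as np

def sequential_halving(means, T, rng=np.random.default_rng(0)):
    arms = list(range(len(means)))
    rounds = math.ceil(math.log2(len(means)))
    for _ in range(rounds):
        pulls = max(1, T // (len(arms) * rounds))   # equal budget per arm
        est = [rng.normal(means[a], 1.0, pulls).mean() for a in arms]
        keep = np.argsort(est)[::-1][: max(1, len(arms) // 2)]
        arms = [arms[i] for i in keep]              # keep the better half
    return arms[0]

print(sequential_halving(means=[0.1, 0.3, 0.5, 0.7, 0.9], T=1000))
```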
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v2 [cs.CV] UPDATED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and even introduce new noise at high speedup rates, which limits their practicality. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective that DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we figure out how to solve differential equations on manifolds and show that DDIMs are simple cases of pseudo numerical methods. We change several classical numerical methods to corresponding pseudo numerical methods and find that the pseudo linear multi-step method is the best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher quality synthetic images with only 50 steps compared with 1000-step DDIMs (20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID) and have good generalization on different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
    Nish: A Novel Negative Stimulated Hybrid Activation Function. (arXiv:2210.09083v2 [cs.LG] UPDATED)
    Activation functions play a crucial role in the performance and stability of neural networks. In this study, we propose a novel non-monotonic activation function called the Negative Stimulated Hybrid Activation Function (Nish). It behaves like a Rectified Linear Unit (ReLU) function for values greater than zero, and a sinus-sigmoidal function for values less than zero. The proposed function incorporates the sigmoid and sine wave, allowing new dynamics over traditional ReLU activations. We evaluate the robustness of Nish for different combinations of well-established architectures and recently proposed activation functions on various well-known benchmarks. The results indicate that the accuracy rates obtained by the proposed activation function are slightly higher than those obtained using the Mish activation.
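    A hedged sketch of the hybrid shape described above. The exact negative-branch formula is not reproduced from the paper; sigmoid(x)*sin(x) below is purely an assumed stand-in to illustrate a ReLU-above-zero, sinus-sigmoidal-below-zero activation.

```python
# Nish-like hybrid activation; the negative branch is an assumed form.
import numpy as np

def nish_like(x):
    sigmoid = 1.0 / (1.0 + np.exp(-x))
    return np.where(x > 0, x, sigmoid * np.sin(x))  # ReLU / sinus-sigmoidal

print(nish_like(np.linspace(-3, 3, 7)))
```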
    MisMatch: Calibrated Segmentation via Consistency on Differential Morphological Feature Perturbations with Limited Labels. (arXiv:2110.12179v2 [cs.CV] UPDATED)
    Semi-supervised learning (SSL) is a promising machine learning paradigm to address the issue of label scarcity in medical imaging. SSL methods were originally developed in image classification. The state-of-the-art SSL methods in image classification utilise consistency regularisation to learn unlabelled predictions which are invariant to input level perturbations. However, image level perturbations violate the cluster assumption in the setting of segmentation. Moreover, existing image level perturbations are hand-crafted, which could be sub-optimal. Therefore, it is not trivial to straightforwardly adapt existing SSL image classification methods to segmentation. In this paper, we propose MisMatch, a semi-supervised segmentation framework based on the consistency between paired predictions which are derived from two differently learnt morphological feature perturbations. MisMatch consists of an encoder and two decoders. One decoder learns positive attention for the foreground on unlabelled data, thereby generating dilated features of the foreground. The other decoder learns negative attention for the foreground on the same unlabelled data, thereby generating eroded features of the foreground. We first develop a 2D U-net based MisMatch framework and perform extensive cross-validation on a CT-based pulmonary vessel segmentation task and show that MisMatch statistically outperforms state-of-the-art semi-supervised methods when only 6.25\% of the total labels are used. In a second experiment, we show that U-net based MisMatch outperforms state-of-the-art methods on an MRI-based brain tumour segmentation task. In a third experiment, we show that a 3D MisMatch outperforms a previous method using input level augmentations on a left atrium segmentation task. Lastly, we find that the performance improvement of MisMatch over the baseline might originate from its better calibration.
    Region of Interest Detection in Melanocytic Skin Tumor Whole Slide Images. (arXiv:2210.16457v1 [cs.CV])
    Automated region of interest detection in histopathological image analysis is a challenging and important topic with tremendous potential impact on clinical practice. The deep-learning methods used in computational pathology help us to reduce costs and increase the speed and accuracy of regions of interest detection and cancer diagnosis. In this work, we propose a patch-based region of interest detection method for melanocytic skin tumor whole-slide images. We work with a dataset that contains 165 primary melanomas and nevi Hematoxylin and Eosin whole-slide images and build a deep-learning method. The proposed method performs well on a hold-out test data set including five TCGA-SKCM slides (accuracy of 93.94\% in slide classification task and intersection over union rate of 41.27\% in the region of interest detection task), showing the outstanding performance of our model on melanocytic skin tumor. Even though we test the experiments on the skin tumor dataset, our work could also be extended to other medical image detection problems, such as various tumors' classification and prediction, to help and benefit the clinical evaluation and diagnosis of different tumors.
    NU-net: An Unpretentious Nested U-net for Breast Tumor Segmentation. (arXiv:2209.07193v2 [eess.IV] UPDATED)
    Breast tumor segmentation is one of the key steps that helps us characterize and localize tumor regions. However, variable tumor morphology, blurred boundaries, and similar intensity distributions bring challenges for accurate segmentation of breast tumors. Recently, many U-net variants have been proposed and widely used for breast tumor segmentation. However, these architectures suffer from two limitations: (1) ignoring the characterization ability of the benchmark networks, and (2) introducing extra complex operations that increase the difficulty of understanding and reproducing the network. To alleviate these challenges, this paper proposes a simple yet powerful nested U-net (NU-net) for accurate segmentation of breast tumors. The key idea is to utilize U-Nets with different depths and shared weights to achieve robust characterization of breast tumors. NU-net mainly has the following advantages: (1) improving network adaptability and robustness to breast tumors with different scales, (2) being easy to reproduce and execute, and (3) the extra operations increase network parameters without significantly increasing computational cost. Extensive experimental results with twelve state-of-the-art segmentation methods on three public breast ultrasound datasets demonstrate that NU-net has more competitive segmentation performance on breast tumors. Furthermore, the robustness of NU-net is further illustrated on the segmentation of renal ultrasound images. The source code is publicly available at https://github.com/CGPzy/NU-net.
    Monitoring the Dynamic Networks of Stock Returns. (arXiv:2210.16679v1 [q-fin.ST])
    In this paper, we study the connection between the companies in the Swedish capital market. We consider 28 companies included in the determination of the market index OMX30. The network structure of the market is constructed using different methods to determine the distance between the companies. We use hierarchical clustering methods to find the relation among the companies in each window. Next, we obtain one-dimensional time series of the distances between the clustering trees that reflect the changes in the relationship between the companies in the market over time. The method of statistical process control, namely the Shewhart control chart, is applied to those time series to detect abnormal changes in the financial market.
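    A minimal sketch of the Shewhart control chart applied to the one-dimensional distance series: estimate the mean and standard deviation from an in-control reference window and flag points outside mean ± 3σ. The window length and the 3σ rule are conventional defaults, assumed here for illustration.

```python
# Shewhart chart alarms on a 1-D series of tree distances (sketch).
import numpy as np

def shewhart_alarms(series, n_ref=50, k=3.0):
    ref = series[:n_ref]                       # in-control reference period
    mu, sigma = ref.mean(), ref.std(ddof=1)
    ucl, lcl = mu + k * sigma, mu - k * sigma  # upper/lower control limits
    return np.flatnonzero((series > ucl) | (series < lcl))

rng = np.random.default_rng(0)
series = np.concatenate([rng.normal(0, 1, 80), rng.normal(4, 1, 5)])
print(shewhart_alarms(series))                 # indices of abnormal points
```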
    Spatiotemporal Deformable Scene Graphs for Complex Activity Detection. (arXiv:2104.08194v2 [cs.CV] UPDATED)
    Long-term complex activity recognition and localisation can be crucial for decision making in autonomous systems such as smart cars and surgical robots. Here we address the problem via a novel deformable, spatiotemporal scene graph approach, consisting of three main building blocks: (i) action tube detection, (ii) the modelling of the deformable geometry of parts, and (iii) a graph convolutional network. Firstly, action tubes are detected in a series of snippets. Next, a new 3D deformable RoI pooling layer is designed for learning the flexible, deformable geometry of the constituent action tubes. Finally, a scene graph is constructed by considering all parts as nodes and connecting them based on different semantics such as order of appearance, sharing the same action label and feature similarity. We also contribute fresh temporal complex activity annotation for the recently released ROAD autonomous driving and SARAS-ESAD surgical action datasets and show the adaptability of our framework to different domains. Our method is shown to significantly outperform graph-based competitors on both augmented datasets.
    A Variational Edge Partition Model for Supervised Graph Representation Learning. (arXiv:2202.03233v3 [stat.ML] UPDATED)
    Graph neural networks (GNNs), which propagate the node features through the edges and learn how to transform the aggregated features under label supervision, have achieved great success in supervised feature extraction for both node-level and graph-level classification tasks. However, GNNs typically treat the graph structure as given and ignore how the edges are formed. This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities, each of which contributes to the edges via a logical OR mechanism. Based on this generative model, we partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN-based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN-based predictor that combines community-specific GNNs for the end classification task. Extensive evaluations on real-world graph datasets have verified the effectiveness of the proposed method in learning discriminative representations for both node-level and graph-level classification tasks.
    A Deep Reinforcement Learning Framework For Column Generation. (arXiv:2206.02568v2 [math.OC] UPDATED)
    Column Generation (CG) is an iterative algorithm for solving linear programs (LPs) with an extremely large number of variables (columns). CG is the workhorse for tackling large-scale \textit{integer} linear programs, which rely on CG to solve LP relaxations within a branch and price algorithm. Two canonical applications are the Cutting Stock Problem (CSP) and Vehicle Routing Problem with Time Windows (VRPTW). In VRPTW, for example, each binary variable represents the decision to include or exclude a \textit{route}, of which there are exponentially many; CG incrementally grows the subset of columns being used, ultimately converging to an optimal solution. We propose RLCG, the first Reinforcement Learning (RL) approach for CG. Unlike typical column selection rules which myopically select a column based on local information at each iteration, we treat CG as a sequential decision-making problem: the column selected in a given iteration affects subsequent column selections. This perspective lends itself to a Deep Reinforcement Learning approach that uses Graph Neural Networks (GNNs) to represent the variable-constraint structure in the LP of interest. We perform an extensive set of experiments using the publicly available BPPLIB benchmark for CSP and Solomon benchmark for VRPTW. RLCG converges faster and reduces the number of CG iterations by 22.4\% for CSP and 40.9\% for VRPTW on average compared to a commonly used greedy policy. Our code is available at https://github.com/chichengmessi/reinforcement-learning-for-column-generation.git.
    A Solvable Model of Neural Scaling Laws. (arXiv:2210.16859v1 [cs.LG])
    Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and how the finite extent of the data's spectral power law causes the model's performance to plateau.
    Synthesizing Speech from Intracranial Depth Electrodes using an Encoder-Decoder Framework. (arXiv:2111.01457v4 [cs.SD] UPDATED)
    Speech Neuroprostheses have the potential to enable communication for people with dysarthria or anarthria. Recent advances have demonstrated high-quality text decoding and speech synthesis from electrocorticographic grids placed on the cortical surface. Here, we investigate a less invasive measurement modality in three participants, namely stereotactic EEG (sEEG) that provides sparse sampling from multiple brain regions, including subcortical regions. To evaluate whether sEEG can also be used to synthesize audio from neural recordings, we employ a recurrent encoder-decoder model based on modern deep learning methods. We find that speech can indeed be reconstructed with correlations up to 0.8 from these minimally invasive recordings, despite limited amounts of training data. In particular, the architecture we employ naturally picks up on the temporal nature of the data and thereby outperforms an existing benchmark based on non-regressive convolutional neural networks.
    Large Language Models and the Reverse Turing Test. (arXiv:2207.14382v8 [cs.CL] UPDATED)
    Large Language Models (LLMs) have been transformative. They are pre-trained foundational models that are self-supervised and can be adapted via fine-tuning to a wide range of natural language tasks, each of which previously would have required a separate network model. This is one step closer to the extraordinary versatility of human language. GPT-3 and more recently LaMDA can carry on dialogs with humans on many topics after minimal priming with a few examples. However, there has been a wide range of reactions and debate on whether these LLMs understand what they are saying or exhibit signs of intelligence. This high variance is exhibited in three interviews with LLMs reaching wildly different conclusions. A new possibility was uncovered that could explain this divergence. What appears to be intelligence in LLMs may in fact be a mirror that reflects the intelligence of the interviewer, a remarkable twist that could be considered a Reverse Turing Test. If so, then by studying interviews we may be learning more about the intelligence and beliefs of the interviewer than the intelligence of the LLMs. As LLMs become more capable they may transform the way we interact with machines and how machines interact with each other. Increasingly, LLMs are being coupled with sensorimotor devices. LLMs can talk the talk, but can they walk the walk? A road map for achieving artificial general autonomy is outlined with seven major improvements inspired by brain systems. LLMs could be used to uncover new insights into brain function by downloading brain data during natural behaviors.
    Hierarchical Automatic Power Plane Generation with Genetic Optimization and Multilayer Perceptron. (arXiv:2210.16314v1 [cs.NE])
    We present an automatic multilayer power plane generation method to accelerate the design of printed circuit boards (PCB). In PCB design, while automatic solvers have been developed to predict important indicators such as the IR-drop, power integrity, and signal integrity, the generation of the power plane itself still largely relies on laborious manual methods. Our automatic power plane generation approach is based on genetic optimization combined with a multilayer perceptron and is able to automatically generate power planes across a diverse set of problems with varying levels of difficulty. Our method, GOMLP, consists of an outer loop genetic optimizer (GO) and an inner loop multi-layer perceptron (MLP) that generate power planes automatically. The critical elements of our approach include contour detection, feature expansion, and a distance measure to enable island-minimizing complex power plane generation. We compare our approach to a baseline solution based on A*, which consists of a sequential island generation and merging process and can produce less-than-ideal solutions. Our experimental results show that on single layer power plane problems, our method outperforms A* in 71% of the problems with varying levels of board layout difficulty. We further describe H-GOMLP, which extends GOMLP to multilayer power plane problems using hierarchical clustering and net similarities based on the Hausdorff distance.
    Clenshaw Graph Neural Networks. (arXiv:2210.16508v1 [cs.LG])
    Graph Convolutional Networks (GCNs), which use a message-passing paradigm with stacked convolution layers, are foundational methods for learning graph representations. Recent GCN models use various residual connection techniques to alleviate model degradation problems such as over-smoothing and vanishing gradients. Existing residual connection techniques, however, fail to make extensive use of the underlying graph structure in the graph spectral domain, which is critical for obtaining satisfactory results on heterophilic graphs. In this paper, we introduce ClenshawGCN, a GNN model that employs the Clenshaw Summation Algorithm to enhance the expressiveness of the GCN model. ClenshawGCN equips the standard GCN model with two straightforward residual modules: the adaptive initial residual connection and the negative second-order residual connection. We show that by adding these two residual modules, ClenshawGCN implicitly simulates a polynomial filter under the Chebyshev basis, giving it at least as much expressive power as polynomial spectral GNNs. In addition, we conduct comprehensive experiments to demonstrate the superiority of our model over spatial and spectral GNN models.
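    The Clenshaw connection can be made concrete: a degree-K Chebyshev filter sum_k c_k T_k(A) x is evaluated by a recurrence whose only ingredients are exactly an initial-term connection and a negative second-order term. A NumPy sketch on a toy propagation matrix (values are illustrative, not from the paper):

        # Clenshaw evaluation of a Chebyshev filter p(A) x = sum_k c_k T_k(A) x.
        # A stands in for a rescaled normalized adjacency; toy numbers only.
        import numpy as np

        def clenshaw_filter(A, x, coeffs):
            # b_{K+1} = b_{K+2} = 0;  b_k = c_k x + 2 A b_{k+1} - b_{k+2}
            b1 = np.zeros_like(x)   # b_{k+1}
            b2 = np.zeros_like(x)   # b_{k+2}
            for c in reversed(coeffs[1:]):
                b1, b2 = c * x + 2 * (A @ b1) - b2, b1   # negative 2nd-order term
            return coeffs[0] * x + A @ b1 - b2           # initial connection

        A = np.array([[0.0, 0.5], [0.5, 0.0]])
        x = np.array([1.0, -1.0])
        print(clenshaw_filter(A, x, coeffs=[0.5, 0.3, 0.2]))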
    Learning to Detect Interesting Anomalies. (arXiv:2210.16334v1 [cs.LG])
    Anomaly detection algorithms are typically applied to static, unchanging data features hand-crafted by the user. But how does a user systematically craft good features for anomalies that have never been seen? Here we couple deep learning with active learning -- in which an Oracle iteratively labels small amounts of data selected algorithmically over a series of rounds -- to automatically and dynamically improve the data features for efficient outlier detection. This approach, AHUNT, shows excellent performance on MNIST, CIFAR10, and Galaxy-DESI data, significantly outperforming both standard anomaly detection and active learning algorithms with static feature spaces. Beyond improved performance, AHUNT also allows the number of anomaly classes to grow organically in response to the Oracle's evaluations. Extensive ablation studies explore the impact of the Oracle's question selection strategy and of the loss function on performance. We illustrate how the dynamic anomaly class taxonomy represents another step towards fully personalized rankings of different anomaly classes that reflect a user's interests, allowing the algorithm to learn to ignore statistically significant but uninteresting outliers (e.g., noise). This should prove useful in the era of massive astronomical datasets serving diverse sets of users who can only review a tiny subset of the incoming data.
    A Simple Hypergraph Kernel Convolution based on Discounted Markov Diffusion Process. (arXiv:2210.16884v1 [cs.LG])
    Kernels on discrete structures evaluate pairwise similarities between objects, capturing both semantics and inherent topology information. However, existing kernels on discrete structures are developed using topology information alone (such as the adjacency matrix of a graph), without considering the original attributes of objects. This paper proposes a two-phase paradigm to aggregate comprehensive information on discrete structures, leading to a Discount Markov Diffusion Learnable Kernel (DMDLK). Specifically, based on the underlying projection of the DMDLK, we design a Simple Hypergraph Kernel Convolution (SHKC) for hidden representations of vertices. SHKC adjusts diffusion steps rather than stacking convolution layers to aggregate information from long-range neighborhoods, which prevents the over-smoothing issues of existing hypergraph convolutions. Moreover, we utilize the uniform stability bound theorem in transductive learning to analyze the critical factors for the effectiveness and generalization ability of SHKC from a theoretical perspective. Experimental results on several benchmark datasets for node classification tasks verify the superior performance of SHKC over state-of-the-art methods.
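    The discounted-diffusion idea behind the DMDLK can be sketched on an ordinary graph as K = sum_t gamma^t P^t for a random-walk transition matrix P, with the truncation depth T playing the role of adjustable diffusion steps (a generic sketch, not the paper's exact construction):

        # A discounted Markov diffusion kernel on a toy graph: K = sum_t g^t P^t,
        # truncated at T steps (generic sketch; adjacency values are made up).
        import numpy as np

        adj = np.array([[0, 1, 1, 0],
                        [1, 0, 1, 0],
                        [1, 1, 0, 1],
                        [0, 0, 1, 0]], dtype=float)
        P = adj / adj.sum(axis=1, keepdims=True)     # random-walk transition matrix

        def discounted_diffusion(P, gamma=0.5, T=10):
            K, Pt = np.zeros_like(P), np.eye(P.shape[0])
            for t in range(T + 1):
                K += (gamma ** t) * Pt
                Pt = Pt @ P
            return K

        # Adjusting T (diffusion steps) plays the role that stacking convolution
        # layers plays in GNNs, without adding parameters per layer.
        print(np.round(discounted_diffusion(P), 3))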
    Distributionally Robust Domain Adaptation. (arXiv:2210.16894v1 [stat.ML])
    Domain Adaptation (DA) has recently received significant attention due to its potential to adapt a learning model across source and target domains with mismatched distributions. Since DA methods rely exclusively on the given source and target domain samples, they generally yield models that are vulnerable to noise and unable to adapt to unseen samples from the target domain, which calls for DA methods that guarantee the robustness and generalization of the learned models. In this paper, we propose DRDA, a distributionally robust domain adaptation method. DRDA leverages a distributionally robust optimization (DRO) framework to learn a robust decision function that minimizes the worst-case target domain risk and generalizes to any sample from the target domain by transferring knowledge from a given labeled source domain sample. We utilize the Maximum Mean Discrepancy (MMD) metric to construct an ambiguity set of distributions that provably contains the source and target domain distributions with high probability. Hence, the risk is shown to upper bound the out-of-sample target domain loss. Our experimental results demonstrate that our formulation outperforms existing robust learning approaches.
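    The MMD quantity used to build the ambiguity set has a standard unbiased estimator; a generic sketch with an RBF kernel and synthetic source/target samples (not the paper's code):

        # Squared MMD between two samples with an RBF kernel (generic sketch;
        # the paper builds its ambiguity set as an MMD ball around the data).
        import numpy as np

        def rbf(X, Y, sigma=1.0):
            d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            return np.exp(-d2 / (2 * sigma ** 2))

        def mmd2(X, Y, sigma=1.0):
            Kxx, Kyy, Kxy = rbf(X, X, sigma), rbf(Y, Y, sigma), rbf(X, Y, sigma)
            n, m = len(X), len(Y)
            return (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1)) \
                 + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1)) \
                 - 2 * Kxy.mean()

        rng = np.random.default_rng(0)
        src = rng.normal(0.0, 1.0, size=(200, 2))    # source domain sample
        tgt = rng.normal(0.5, 1.0, size=(200, 2))    # shifted target sample
        print(f"MMD^2(source, target) = {mmd2(src, tgt):.4f}")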
    CMT: Interpretable Model for Rapid Recognition Pneumonia from Chest X-Ray Images by Fusing Low Complexity Multilevel Attention Mechanism. (arXiv:2210.16584v1 [eess.IV])
    Chest imaging plays an essential role in diagnosing and predicting patients with COVID-19 with evidence of worsening respiratory status. Many deep learning-based diagnostic models for pneumonia have been developed to enable computer-aided diagnosis. However, their long training and inference times make them inflexible, and the lack of interpretability reduces their credibility in clinical practice. This paper presents CMT, an interpretable model for the rapid recognition of pneumonia, especially COVID-19-positive cases. Multiple convolutional layers in CMT first extract features from CXR images, and a Transformer is then applied to calculate the probability of each symptom. To improve the model's generalization performance and to address the problem of sparse medical image data, we propose Feature Fusion Augmentation (FFA), a plug-and-play method for image augmentation. It fuses the features of two images to varying degrees to produce a new image that does not deviate from the original distribution. Furthermore, to reduce the computational complexity and accelerate convergence, we propose Multilevel Multi-Head Self-Attention (MMSA), which computes attention at different levels to establish the relationship between global and local features. It significantly improves model performance while substantially reducing training and inference time. Experimental results on the largest COVID-19 dataset show that the proposed CMT achieves state-of-the-art performance. The effectiveness of FFA and MMSA is demonstrated in ablation experiments. In addition, the weights and feature activation maps of the model inference process are visualized to show the CMT's interpretability.
    Image-based Treatment Effect Heterogeneity. (arXiv:2206.06417v3 [cs.LG] UPDATED)
    Randomized controlled trials (RCTs) are considered the gold standard for estimating the average treatment effect (ATE) of interventions. One use of RCTs is to study the causes of global poverty -- a subject explicitly cited in the 2019 Nobel Memorial Prize awarded to Duflo, Banerjee, and Kremer "for their experimental approach to alleviating global poverty." Because the ATE is a population summary, anti-poverty experiments often seek to unpack the effect variation around the ATE by conditioning (CATE) on tabular variables such as age and ethnicity that were measured during the RCT data collection. Although such variables are key to unpacking CATE, using only such variables may fail to capture historical, geographical, or neighborhood-specific contributors to effect variation, as tabular RCT data are often only observed near the time of the experiment. In global poverty research, when the location of the experiment units is approximately known, satellite imagery can provide a window into such factors important for understanding heterogeneity. However, there is no method that specifically enables applied researchers to analyze CATE from images. In this paper, using a deep probabilistic modeling framework, we develop such a method that estimates latent clusters of images by identifying images with similar treatment effects distributions. We also emphasize a sensitivity factor that quantifies the importance of image segments contributing to the mean effect cluster probabilities. We compare the proposed methods against alternatives in simulation; also, we show how the model works in an actual RCT, estimating the effects of an anti-poverty intervention in northern Uganda and obtaining a posterior predictive distribution over effects for the rest of the country where no experimental data was collected. We make the models available in an open-source package and discuss other applications.
    Nonlinear Causal Discovery via Kernel Anchor Regression. (arXiv:2210.16775v1 [stat.ML])
    Learning causal relationships is a fundamental problem in science. Anchor regression has been developed to address this problem for a large class of causal graphical models, though the relationships between the variables are assumed to be linear. In this work, we tackle the nonlinear setting by proposing kernel anchor regression (KAR). Beyond the natural formulation using a classic two-stage least squares estimator, we also study an improved variant that involves nonparametric regression in three separate stages. We provide convergence results for the proposed KAR estimators and the identifiability conditions for KAR to learn the nonlinear structural equation models (SEM). Experimental results demonstrate the superior performance of the proposed KAR estimators over existing baselines.
    Temporal-Viewpoint Transportation Plan for Skeletal Few-shot Action Recognition. (arXiv:2210.16820v1 [cs.CV])
    We propose a Few-shot Learning pipeline for 3D skeleton-based action recognition by Joint tEmporal and cAmera viewpoiNt alIgnmEnt (JEANIE). To factor out misalignment between query and support sequences of 3D body joints, we propose an advanced variant of Dynamic Time Warping which jointly models each smooth path between the query and support frames to achieve simultaneously the best alignment in the temporal and simulated camera viewpoint spaces for end-to-end learning under the limited few-shot training data. Sequences are encoded with a temporal block encoder based on Simple Spectral Graph Convolution, a lightweight linear Graph Neural Network backbone. We also include a setting with a transformer. Finally, we propose a similarity-based loss which encourages the alignment of sequences of the same class while preventing the alignment of unrelated sequences. We show state-of-the-art results on NTU-60, NTU-120, Kinetics-skeleton and UWA3D Multiview Activity II.
    Single-Shot Domain Adaptation via Target-Aware Generative Augmentation. (arXiv:2210.16692v1 [cs.CV])
    The problem of adapting models from a source domain using data from any target domain of interest has gained prominence, thanks to the brittle generalization in deep neural networks. While several test-time adaptation techniques have emerged, they typically rely on synthetic data augmentations in cases of limited target data availability. In this paper, we consider the challenging setting of single-shot adaptation and explore the design of augmentation strategies. We argue that augmentations utilized by existing methods are insufficient to handle large distribution shifts, and hence propose a new approach SiSTA (Single-Shot Target Augmentations), which first fine-tunes a generative model from the source domain using a single-shot target, and then employs novel sampling strategies for curating synthetic target data. Using experiments with a state-of-the-art domain adaptation method, we find that SiSTA produces improvements as high as 20\% over existing baselines under challenging shifts in face attribute detection, and that it performs competitively to oracle models obtained by training on a larger target dataset.
    Strong Lottery Ticket Hypothesis with $\varepsilon$--perturbation. (arXiv:2210.16589v1 [cs.LG])
    The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need for training. We extend the theoretical guarantees of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization. In particular, we focus on the following open questions: By allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change induced by SGD coincide with a good set of such perturbations? We answer the first question by extending the theoretical result on subset sum to allow perturbation of the candidates. Applying this result to the neural network setting, we show that such $\varepsilon$-perturbation reduces the over-parameterization requirement of the strong LTH. To answer the second question, we show via experiments that the perturbed weights found by projected SGD perform better under strong-LTH pruning.
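    The subset-sum view underlying strong-LTH guarantees is easy to visualize: a subset of random candidate weights approximates a target weight, and granting each chosen candidate an $\varepsilon$ perturbation budget shrinks the achievable error, and hence the number of candidates needed. A toy sketch with made-up numbers:

        # Toy subset-sum view of strong-LTH pruning: approximate a target weight
        # by a subset of random candidates, optionally nudging each chosen
        # candidate by at most eps (illustrative, not the paper's bounds).
        import itertools
        import numpy as np

        rng = np.random.default_rng(1)
        target = 0.37
        candidates = rng.uniform(-1, 1, size=12)

        def best_subset_error(cands, target, eps=0.0):
            best = abs(target)                        # empty subset
            for r in range(1, len(cands) + 1):
                for idx in itertools.combinations(range(len(cands)), r):
                    s = cands[list(idx)].sum()
                    # r chosen candidates can shift the sum by up to r * eps.
                    best = min(best, max(abs(target - s) - r * eps, 0.0))
            return best

        print("error, eps=0.00:", best_subset_error(candidates, target))
        print("error, eps=0.05:", best_subset_error(candidates, target, eps=0.05))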
    DiffusER: Discrete Diffusion via Edit-based Reconstruction. (arXiv:2210.16886v1 [cs.CL])
    In text generation, models that generate text from scratch one token at a time are currently the dominant paradigm. Despite being performant, these models lack the ability to revise existing text, which limits their usability in many practical scenarios. We look to address this, with DiffusER (Diffusion via Edit-based Reconstruction), a new edit-based generative model for text based on denoising diffusion models -- a class of models that use a Markov chain of denoising steps to incrementally generate data. DiffusER is not only a strong generative model in general, rivalling autoregressive models on several tasks spanning machine translation, summarization, and style transfer; it can also perform other varieties of generation that standard autoregressive models are not well-suited for. For instance, we demonstrate that DiffusER makes it possible for a user to condition generation on a prototype, or an incomplete sequence, and continue revising based on previous edit steps.
    Fast-Convergent Federated Learning via Cyclic Aggregation. (arXiv:2210.16520v1 [cs.LG])
    Federated learning (FL) aims at optimizing a shared global model over multiple edge devices without transmitting (private) data to the central server. While it is theoretically well known that, under mild conditions, FL yields an optimal model -- i.e., the model that would be obtained by centralized training with all the edge-device data available at the central server -- in practice it often requires a massive number of iterations until convergence, especially in the presence of statistical/computational heterogeneity. This paper utilizes a cyclic learning rate at the server side to reduce the number of training iterations and increase performance without any additional computational cost for either the server or the edge devices. Numerical results validate that simply plugging the proposed cyclic aggregation into existing FL algorithms effectively reduces the number of training iterations while improving performance.
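    The server-side change amounts to a schedule over the aggregation step size. Below is a minimal FedAvg-style sketch with a triangular cyclic server learning rate; the schedule shape and the toy quadratic clients are assumptions for illustration, not the paper's exact setup:

        # FedAvg with a cyclic server-side learning rate (a sketch; the
        # triangular schedule and quadratic toy clients are assumptions).
        import numpy as np

        def cyclic_lr(t, base=0.1, peak=1.0, period=10):
            phase = abs((t % period) / (period / 2) - 1)   # triangle in [0, 1]
            return base + (peak - base) * (1 - phase)

        rng = np.random.default_rng(0)
        optima = rng.normal(0.0, 1.0, size=(5, 3))  # each client's local optimum
        w = np.zeros(3)                             # global model

        for t in range(100):
            # One local gradient step per client on ||w - optimum||^2.
            local_models = [w - 0.3 * 2.0 * (w - opt) for opt in optima]
            delta = np.mean(local_models, axis=0) - w
            w = w + cyclic_lr(t) * delta            # cyclic aggregation step
        print("distance to consensus optimum:", np.linalg.norm(w - optima.mean(0)))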
    DyG2Vec: Representation Learning for Dynamic Graphs with Self-Supervision. (arXiv:2210.16906v1 [cs.LG])
    The challenge in learning from dynamic graphs for predictive tasks lies in extracting fine-grained temporal motifs from an ever-evolving graph. Moreover, task labels are often scarce, costly to obtain, and highly imbalanced for large dynamic graphs. Recent advances in self-supervised learning on graphs demonstrate great potential, but focus on static graphs. State-of-the-art (SoTA) models for dynamic graphs are not only incompatible with the self-supervised learning (SSL) paradigm but also fail to forecast interactions beyond the very near future. To address these limitations, we present DyG2Vec, an SSL-compatible, efficient model for representation learning on dynamic graphs. DyG2Vec uses a window-based mechanism to generate task-agnostic node embeddings that can be used to forecast future interactions. DyG2Vec significantly outperforms SoTA baselines on benchmark datasets for downstream tasks while only requiring a fraction of the training/inference time. We adapt two SSL evaluation mechanisms to make them applicable to dynamic graphs and thus show that SSL pre-training helps learn more robust temporal node representations, especially for scenarios with few labels.
    Projection Valued Measure-based Quantum Machine Learning for Multi-Class Classification. (arXiv:2210.16731v1 [quant-ph])
    In recent years, quantum machine learning (QML) has been actively used for various tasks, e.g., classification, reinforcement learning, and adversarial learning. However, QML studies have not yet achieved complex tasks because scalability issues at the input and output remain the biggest hurdle in QML. To cope with this problem, we aim to solve the output scalability issue. Motivated by this challenge, we focus on the projection-valued measure (PVM), which utilizes the nature of probability amplitudes in quantum statistical mechanics. By leveraging PVM, the output dimension is expanded from the number of qubits $q$ to $\mathcal{O}(2^q)$. We propose a novel QML framework for multi-class classification. We corroborate that our framework outperforms the state-of-the-art (SOTA) on various datasets using no more than 6 qubits. Furthermore, our PVM-based QML outperforms the SOTA by 42.2%.
    ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network. (arXiv:2207.09927v2 [cs.CV] UPDATED)
    In this paper, we propose ViGAT, a pure-attention bottom-up approach that utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network that processes these features for the task of event recognition and explanation in video. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to effectively capture both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, Mini-Kinetics, ActivityNet).
    Recursive Reasoning in Minimax Games: A Level $k$ Gradient Play Method. (arXiv:2210.16482v1 [cs.LG])
    Despite the success of generative adversarial networks (GANs) in generating visually appealing images, they are notoriously challenging to train. In order to stabilize the learning dynamics in minimax games, we propose a novel recursive reasoning algorithm: Level $k$ Gradient Play (Lv.$k$ GP) algorithm. In contrast to many existing algorithms, our algorithm does not require sophisticated heuristics or curvature information. We show that as $k$ increases, Lv.$k$ GP converges asymptotically towards an accurate estimation of players' future strategy. Moreover, we justify that Lv.$\infty$ GP naturally generalizes a line of provably convergent game dynamics which rely on predictive updates. Furthermore, we provide its local convergence property in nonconvex-nonconcave zero-sum games and global convergence in bilinear and quadratic games. By combining Lv.$k$ GP with Adam optimizer, our algorithm shows a clear advantage in terms of performance and computational overhead compared to other methods. Using a single Nvidia RTX3090 GPU and 30 times fewer parameters than BigGAN on CIFAR-10, we achieve an FID of 10.17 for unconditional image generation within 30 hours, allowing GAN training on common computational resources to reach state-of-the-art performance.
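    On the bilinear game f(x, y) = xy, plain simultaneous gradient descent-ascent (Lv.1) spirals outward, while higher levels that respond to the opponent's anticipated Lv.(k-1) move contract. A small sketch under this simplified reading of the recursion (an assumption for illustration, not the paper's exact update):

        # Level-k gradient play on the bilinear game f(x, y) = x * y,
        # where x minimizes and y maximizes (a simplified sketch).
        import numpy as np

        def lvk_step(x, y, k, eta=0.2):
            # Lv.1 is plain simultaneous gradient descent-ascent; Lv.k responds
            # to the opponent's anticipated Lv.(k-1) move.
            if k == 1:
                return x - eta * y, y + eta * x
            x_hat, y_hat = lvk_step(x, y, k - 1, eta)
            return x - eta * y_hat, y + eta * x_hat

        for k in (1, 2, 6):
            x, y = 1.0, 1.0
            for _ in range(200):
                x, y = lvk_step(x, y, k)
            print(f"Lv.{k} GP: |(x, y)| after 200 steps = {np.hypot(x, y):.3g}")

    Lv.1 diverges here, while Lv.2 and above converge toward the equilibrium at the origin, which is the stabilization effect the recursion is after.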
    Simulating Diffusion Bridges with Score Matching. (arXiv:2111.07243v2 [stat.CO] UPDATED)
    We consider the problem of simulating diffusion bridges, which are diffusion processes that are conditioned to initialize and terminate at two given states. The simulation of diffusion bridges has applications in diverse scientific fields and plays a crucial role in the statistical inference of discretely-observed diffusions. This is known to be a challenging problem that has received much attention in the last two decades. This article contributes to this rich body of literature by presenting a new avenue to obtain diffusion bridge approximations. Our approach is based on a backward time representation of a diffusion bridge, which may be simulated if one can time-reverse the unconditioned diffusion. We introduce a variational formulation to learn this time-reversal with function approximation and rely on a score matching method to circumvent intractability. Another iteration of our proposed methodology approximates the Doob $h$-transform defining the forward time representation of a diffusion bridge. We discuss algorithmic considerations and extensions, and present numerical results on an Ornstein--Uhlenbeck process, a model from financial econometrics for interest rates, and a model from genetics for cell differentiation and development to illustrate the effectiveness of our approach.
    Robust Data Valuation via Variance Reduced Data Shapley. (arXiv:2210.16835v1 [stat.ML])
    Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley value and approximate it by means of the permutation sampling algorithm. To address the large estimation variance of permutation sampling, which hinders the development of data marketplaces, we propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify and how many samples to take in each stratum, and we provide a sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated on different types of datasets and in data removal applications.
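    The stratification idea can be sketched for a single datum: instead of sampling whole permutations uniformly, estimate the marginal contribution separately at each position stratum and average the strata. A generic sketch with a toy utility function standing in for model performance (not the paper's utility or stratification rule):

        # Data Shapley for datum 0 via plain permutation sampling vs.
        # position-stratified sampling in the spirit of VRDS (a sketch).
        import numpy as np

        rng = np.random.default_rng(0)
        values = rng.normal(1.0, 1.0, size=8)           # toy "dataset"
        n = len(values)

        def utility(coalition):
            # Toy diminishing-returns utility (stands in for model accuracy).
            return np.log1p(abs(values[list(coalition)].sum()))

        def marginal(i, preceding):
            return utility(list(preceding) + [i]) - utility(preceding)

        others = [j for j in range(n) if j != 0]

        def plain_estimate(m):
            # Uniform permutations: datum 0 lands at a uniformly random position.
            return np.mean([marginal(0, rng.permutation(others)[: rng.integers(0, n)])
                            for _ in range(m)])

        def stratified_estimate(m_per_stratum):
            # One stratum per position j: datum 0 preceded by exactly j others.
            strata = [np.mean([marginal(0, rng.permutation(others)[:j])
                               for _ in range(m_per_stratum)]) for j in range(n)]
            return np.mean(strata)                      # positions are uniform

        print("plain     :", plain_estimate(400))
        print("stratified:", stratified_estimate(50))   # same total budget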
    CFU Playground: Full-Stack Open-Source Framework for Tiny Machine Learning (tinyML) Acceleration on FPGAs. (arXiv:2201.01863v2 [cs.LG] UPDATED)
    The need for efficient processing of neural networks has given rise to the development of hardware accelerators. The increased adoption of specialized hardware has highlighted the need for more agile design flows for hardware-software co-design and domain-specific optimizations. We present CFU Playground, a full-stack open-source framework that enables rapid and iterative design of machine learning (ML) accelerators for embedded ML systems. Our toolchain integrates open-source software, open-source RTL generators, and open-source FPGA tools for synthesis, place, and route. This full-stack framework gives users access to explore bespoke architectures that are customized and co-optimized for embedded ML. The rapid deploy-profile-optimize feedback loop lets ML hardware and software developers achieve significant returns from a relatively small investment in customization. Using CFU Playground's design loop, we show substantial speedups of between 55$\times$ and 75$\times$. The soft CPU coupled with the accelerator opens up a new, rich design space between the two components, which we explore in an automated fashion using Vizier, a black-box optimization service.
    Automatic Detection of Interplanetary Coronal Mass Ejections in Solar Wind In Situ Data. (arXiv:2205.03578v2 [astro-ph.SR] UPDATED)
    Interplanetary coronal mass ejections (ICMEs) are one of the main drivers for space weather disturbances. In the past, different approaches have been used to automatically detect events in existing time series resulting from solar wind in situ observations. However, accurate and fast detection still remains a challenge when facing the large amount of data from different instruments. For the automatic detection of ICMEs we propose a pipeline using a method that has recently proven successful in medical image segmentation. Comparing it to an existing method, we find that while achieving similar results, our model outperforms the baseline regarding training time by a factor of approximately 20, thus making it more applicable for other datasets. The method has been tested on in situ data from the Wind spacecraft between 1997 and 2015 with a True Skill Statistic (TSS) of 0.64. Out of the 640 ICMEs, 466 were detected correctly by our algorithm, producing a total of 254 False Positives. Additionally, it produced reasonable results on datasets with fewer features and smaller training sets from Wind, STEREO-A and STEREO-B with True Skill Statistics of 0.56, 0.57 and 0.53, respectively. Our pipeline manages to find the start of an ICME with a mean absolute error (MAE) of around 2 hours and 56 minutes, and the end time with a MAE of 3 hours and 20 minutes. The relatively fast training allows straightforward tuning of hyperparameters and could therefore easily be used to detect other structures and phenomena in solar wind data, such as corotating interaction regions.
    LEADER: Learning Attention over Driving Behaviors for Planning under Uncertainty. (arXiv:2209.11422v3 [cs.LG] UPDATED)
    Uncertainty in human behavior poses a significant challenge to autonomous driving in crowded urban environments. Partially observable Markov decision processes (POMDPs) offer a principled framework for planning under uncertainty, often leveraging Monte Carlo sampling to achieve online performance for complex tasks. However, sampling also raises safety concerns by potentially missing critical events. To address this, we propose a new algorithm, LEarning Attention over Driving bEhavioRs (LEADER), that learns to attend to critical human behaviors during planning. LEADER learns a neural network generator to provide attention over human behaviors in real-time situations. It integrates the attention into a belief-space planner, using importance sampling to bias reasoning towards critical events. To train the algorithm, we let the attention generator and the planner form a min-max game. By solving the min-max game, LEADER learns to perform risk-aware planning without human labeling.
    RUSH: Robust Contrastive Learning via Randomized Smoothing. (arXiv:2207.05127v2 [cs.LG] UPDATED)
    Recently, adversarial training has been incorporated into self-supervised contrastive pre-training to augment label efficiency with exciting adversarial robustness. However, the robustness comes at the cost of expensive adversarial training. In this paper, we show a surprising fact: contrastive pre-training has an interesting yet implicit connection with robustness, and this natural robustness in the pre-trained representation enables us to design a powerful algorithm against adversarial attacks, RUSH, which combines standard contrastive pre-training and randomized smoothing. RUSH boosts both standard accuracy and robust accuracy, and significantly reduces training costs as compared with adversarial training. We use extensive empirical studies to show that RUSH outperforms robust classifiers from adversarial training by a significant margin on common benchmarks (CIFAR-10, CIFAR-100, and STL-10) under first-order attacks. In particular, under an $\ell_{\infty}$-norm PGD attack with perturbations of size 8/255 on CIFAR-10, our model using ResNet-18 as the backbone reaches 77.8% robust accuracy and 87.9% standard accuracy. This is an improvement of over 15% in robust accuracy and a slight improvement in standard accuracy compared to the state of the art.
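    The randomized-smoothing half of the recipe is simple to sketch: classify many Gaussian-noised copies of the input and take a majority vote, with the vote margin controlling the certifiable radius. The base classifier below is a stand-in linear model, not RUSH's contrastively pre-trained network:

        # Minimal randomized-smoothing prediction (a sketch with toy shapes).
        import numpy as np

        rng = np.random.default_rng(0)
        W = rng.normal(size=(10, 3072))             # stand-in linear classifier

        def base_predict(x):
            return int(np.argmax(W @ x))

        def smoothed_predict(x, sigma=0.25, n=1000):
            # Vote over predictions on Gaussian-perturbed copies of x.
            votes = np.zeros(10, dtype=int)
            for _ in range(n):
                votes[base_predict(x + sigma * rng.normal(size=x.shape))] += 1
            return int(votes.argmax()), votes.max() / n   # class, vote share

        x = rng.normal(size=3072)                   # a flattened CIFAR-size input
        cls, share = smoothed_predict(x)
        print(f"smoothed class {cls} with vote share {share:.2f}")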
    Graph Fuzzy System: Concepts, Models and Algorithms. (arXiv:2210.16730v1 [cs.AI])
    Fuzzy systems (FSs) have enjoyed wide applications in various fields, including pattern recognition, intelligent control, data mining and bioinformatics, which is attributed to their strong interpretability and learning ability. In traditional application scenarios, FSs are mainly applied to model Euclidean-space data and cannot handle graph data, which are non-Euclidean in nature, such as social networks and traffic route maps. Therefore, developing an FS modeling method that is suitable for graph data while retaining the advantages of traditional FSs is an important research topic. To meet this challenge, a new type of FS for graph data modeling called the Graph Fuzzy System (GFS) is proposed in this paper, where the concepts, modeling framework and construction algorithms are systematically developed. First, GFS-related concepts, including the graph fuzzy rule base, graph fuzzy sets and the graph consequent processing unit (GCPU), are defined. A GFS modeling framework is then constructed, and the antecedents and consequents of the GFS are presented and analyzed. Finally, a learning framework for the GFS is proposed, in which a kernel K-prototype graph clustering method (K2PGC) is developed as the construction algorithm for GFS antecedent generation, and a consequent-parameter learning algorithm based on graph neural networks (GNNs) is proposed for the GFS. Three different versions of the GFS implementation algorithm are developed for comprehensive evaluation, with experiments on various benchmark graph classification datasets. The results demonstrate that the proposed GFS inherits the advantages of both mainstream GNN methods and conventional FS methods while achieving better performance than these counterparts.
    Ice Core Dating using Probabilistic Programming. (arXiv:2210.16568v1 [stat.ML])
    Ice cores record crucial information about past climate. However, before ice core data can have scientific value, the chronology must be inferred by estimating the age as a function of depth. Under certain conditions, chemicals locked in the ice display quasi-periodic cycles that delineate annual layers. Manually counting these noisy seasonal patterns to infer the chronology can be an imperfect and time-consuming process, and does not capture uncertainty in a principled fashion. In addition, several ice cores may be collected from a region, introducing an aspect of spatial correlation between them. We present an exploration of the use of probabilistic models for automatic dating of ice cores, using probabilistic programming to showcase its use for prototyping, automatic inference and maintainability, and demonstrate common failure modes of these tools.
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v2 [stat.ML] UPDATED)
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
    Time-aware Metapath Feature Augmentation for Ponzi Detection in Ethereum. (arXiv:2210.16863v1 [cs.LG])
    With the development of Web 3.0 which emphasizes decentralization, blockchain technology ushers in its revolution and also brings numerous challenges, particularly in the field of cryptocurrency. Recently, a large number of criminal behaviors continuously emerge on blockchain, such as Ponzi schemes and phishing scams, which severely endanger decentralized finance. Existing graph-based abnormal behavior detection methods on blockchain usually focus on constructing homogeneous transaction graphs without distinguishing the heterogeneity of nodes and edges, resulting in partial loss of transaction pattern information. Although existing heterogeneous modeling methods can depict richer information through metapaths, the extracted metapaths generally neglect temporal dependencies between entities and do not reflect real behavior. In this paper, we introduce Time-aware Metapath Feature Augmentation (TMFAug) as a plug-and-play module to capture the real metapath-based transaction patterns during Ponzi scheme detection on Ethereum. The proposed module can be adaptively combined with existing graph-based Ponzi detection methods. Extensive experimental results show that our TMFAug can help existing Ponzi detection methods achieve significant performance improvements on the Ethereum dataset, indicating the effectiveness of heterogeneous temporal information for Ponzi scheme detection.
    Explainable Predictive Decision Mining for Operational Support. (arXiv:2210.16786v1 [cs.AI])
    Several decision points exist in business processes (e.g., whether a purchase order needs a manager's approval or not), and different decisions are made for different process instances based on their characteristics (e.g., a purchase order higher than $500 needs a manager approval). Decision mining in process mining aims to describe/predict the routing of a process instance at a decision point of the process. By predicting the decision, one can take proactive actions to improve the process. For instance, when a bottleneck is developing in one of the possible decisions, one can predict the decision and bypass the bottleneck. However, despite its huge potential for such operational support, existing techniques for decision mining have focused largely on describing decisions but not on predicting them, deploying decision trees to produce logical expressions to explain the decision. In this work, we aim to enhance the predictive capability of decision mining to enable proactive operational support by deploying more advanced machine learning algorithms. Our proposed approach provides explanations of the predicted decisions using SHAP values to support the elicitation of proactive actions. We have implemented a Web application to support the proposed approach and evaluated the approach using the implementation.
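    The predict-then-explain recipe is ordinary supervised learning with SHAP values on top. A hypothetical sketch with made-up decision-point features (requires scikit-learn and the shap library; the feature names and labels are illustrative, not from the paper):

        # Sketch: predict the routing at a decision point and explain it with
        # SHAP values (hypothetical features, not the paper's event log).
        import numpy as np
        import shap
        from sklearn.ensemble import GradientBoostingClassifier

        rng = np.random.default_rng(0)
        X = np.column_stack([rng.uniform(0, 1000, 500),   # purchase amount
                             rng.integers(0, 2, 500)])    # requester is manager?
        y = (X[:, 0] > 500) & (X[:, 1] == 0)              # needs approval?

        clf = GradientBoostingClassifier().fit(X, y)
        explainer = shap.TreeExplainer(clf)
        shap_values = explainer.shap_values(X[:5])
        print("predicted routing:", clf.predict(X[:5]))
        print("per-feature contributions:\n", np.round(shap_values, 3))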
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v1 [cs.CV])
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.
    NTULM: Enriching Social Media Text Representations with Non-Textual Units. (arXiv:2210.16586v1 [cs.CL])
    On social media, additional context is often present in the form of annotations and meta-data such as the post's author, mentions, Hashtags, and hyperlinks. We refer to these annotations as Non-Textual Units (NTUs). We posit that NTUs provide social context beyond their textual semantics and that leveraging these units can enrich social media text representations. In this work we construct an NTU-centric social heterogeneous network to co-embed NTUs. We then integrate these NTU embeddings into a large pretrained language model by fine-tuning with these additional units. This adds context to noisy short-text social media. Experiments show that utilizing NTU-augmented text representations significantly outperforms existing text-only baselines by 2-5\% relative points on many downstream tasks, highlighting the importance of context to social media NLP. We also show that including NTU context in the initial layers of the language model alongside text is better than using it after the text embedding has been generated. Our work leads to holistic, general-purpose social media content embeddings.
    Partitioned Gradient Matching-based Data Subset Selection for Compute-Efficient Robust ASR Training. (arXiv:2210.16892v1 [cs.LG])
    Training state-of-the-art ASR systems such as RNN-T often has a high associated financial and environmental cost. Training with a subset of the training data could mitigate this problem if the selected subset could achieve performance on par with training on the entire dataset. Although there are many data subset selection (DSS) algorithms, direct application to the RNN-T is difficult, especially for DSS algorithms that are adaptive and use learning dynamics such as gradients, as RNN-T gradients tend to have a significantly larger memory footprint. In this paper, we propose Partitioned Gradient Matching (PGM), a novel distributable DSS algorithm suitable for massive datasets like those used to train RNN-T. Through extensive experiments on Librispeech 100H and Librispeech 960H, we show that PGM achieves between 3x and 6x speedup with only a very small accuracy degradation (under 1% absolute WER difference). In addition, we demonstrate similar results for PGM even in settings where the training data is corrupted with noise.
    Dataset Distillation via Factorization. (arXiv:2210.16774v1 [cs.CV])
    In this paper, we study dataset distillation (DD) from a novel perspective and introduce a \emph{dataset factorization} approach, termed \emph{HaBa}, which is a plug-and-play strategy portable to any existing DD baseline. Unlike conventional DD approaches that aim to produce distilled and representative samples, \emph{HaBa} explores decomposing a dataset into two components: data \emph{Ha}llucination networks and \emph{Ba}ses, where the latter is fed into the former to reconstruct image samples. The flexible combinations between bases and hallucination networks equip the distilled data with an exponential gain in informativeness, which largely increases the representation capability of distilled datasets. To further increase the data efficiency of the compression results, we introduce a pair of adversarial contrastive constraints on the resultant hallucination networks and bases, which increase the diversity of generated images and inject more discriminative information into the factorization. Extensive comparisons and experiments demonstrate that our method can yield significant improvements on downstream classification tasks compared with previous state-of-the-art methods, while reducing the total number of compressed parameters by up to 65\%. Moreover, distilled datasets produced by our approach also achieve \textasciitilde10\% higher accuracy than baseline methods in cross-architecture generalization. Our code is available \href{https://github.com/Huage001/DatasetFactorization}{here}.
    Conditional Supervised Contrastive Learning for Fair Text Classification. (arXiv:2205.11485v2 [cs.CL] UPDATED)
    Contrastive representation learning has gained much attention due to its superior performance in learning representations from both image and sequential data. However, the learned representations could potentially lead to performance disparities in downstream tasks, such as increased silencing of underrepresented groups in toxicity comment classification. In light of this challenge, in this work, we study learning fair representations that satisfy a notion of fairness known as equalized odds for text classification via contrastive learning. Specifically, we first theoretically analyze the connections between learning representations with a fairness constraint and conditional supervised contrastive objectives, and then propose to use conditional supervised contrastive objectives to learn fair representations for text classification. We conduct experiments on two text datasets to demonstrate the effectiveness of our approaches in balancing the trade-offs between task performance and bias mitigation among existing baselines for text classification. Furthermore, we also show that the proposed methods are stable in different hyperparameter settings.
    A review of machine learning concepts and methods for addressing challenges in probabilistic hydrological post-processing and forecasting. (arXiv:2206.08998v2 [cs.LG] UPDATED)
    Probabilistic forecasting is receiving growing attention nowadays in a variety of applied fields, including hydrology. Several machine learning concepts and methods are notably relevant towards addressing the major challenges of formalizing and optimizing probabilistic forecasting implementations, as well as the equally important challenge of identifying the most useful ones among these implementations. Nonetheless, practically-oriented reviews focusing on such concepts and methods, and on how these can be effectively exploited in the above-outlined essential endeavour, are currently missing from the probabilistic hydrological forecasting literature. This absence holds despite the pronounced intensification in the research efforts for benefitting from machine learning in this same literature. It also holds despite the substantial relevant progress that has recently emerged, especially in the field of probabilistic hydrological post-processing, which traditionally provides the hydrologists with probabilistic hydrological forecasting implementations. Herein, we aim to fill this specific gap. In our review, we emphasize key ideas and information that can lead to effective popularizations, as such an emphasis can support successful future implementations and further scientific developments. In the same forward-looking direction, we identify open research questions and propose ideas to be explored in the future.
    Security-Preserving Federated Learning via Byzantine-Sensitive Triplet Distance. (arXiv:2210.16519v1 [cs.LG])
    While federated learning (FL) is an effective framework for learning a shared model across multiple edge devices, it is generally vulnerable to Byzantine attacks from adversarial edge devices. While existing works on FL mitigate such compromised devices by aggregating only a subset of the local models at the server side, they still cannot successfully ignore the outliers due to imprecise scoring rules. In this paper, we propose an effective Byzantine-robust FL framework, namely dummy contrastive aggregation, by defining a novel scoring function that sensitively discriminates whether a model has been poisoned or not. The key idea is to extract essential information from every local model along with the previous global model to define a distance measure in a manner similar to the triplet loss. Numerical results validate the advantage of the proposed approach by showing improved performance as compared to state-of-the-art Byzantine-resilient aggregation methods, e.g., Krum, Trimmed-mean, and Fang.
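    A rough sketch of triplet-style scoring: rate each local model by its distance to a dummy (negative) reference minus its distance to the previous global model (positive anchor), then aggregate the top scorers. The dummy construction and toy numbers below are assumptions for illustration, not the paper's exact scoring function:

        # Generic triplet-style scoring for Byzantine-robust aggregation
        # (a sketch; the dummy reference here is an illustrative assumption).
        import numpy as np

        rng = np.random.default_rng(0)
        dim, n_clients = 20, 10
        w_prev = np.zeros(dim)                           # previous global model
        local_models = w_prev + 0.1 * rng.normal(size=(n_clients, dim))
        local_models[:2] += 5.0                          # two poisoned updates
        dummy = w_prev + 5.0 * rng.normal(size=dim)      # dummy negative anchor

        def triplet_score(w):
            # Far from the dummy (negative) and close to w_prev (positive).
            return np.linalg.norm(w - dummy) - np.linalg.norm(w - w_prev)

        scores = np.array([triplet_score(w) for w in local_models])
        keep = np.argsort(scores)[-6:]                   # aggregate top scorers
        w_new = local_models[keep].mean(axis=0)
        print("kept clients:", sorted(keep.tolist()))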
    Granular Generalized Variable Precision Rough Sets and Rational Approximations. (arXiv:2205.14365v3 [cs.AI] UPDATED)
    Rational approximations were introduced and studied in granular graded rough sets and generalizations thereof by the first author in recent research papers. The concept of rationality is determined by related ontologies and by coherence between granularity, mereology and approximations in the context. In addition, a framework for rational approximations was introduced by her in the mentioned papers. Granular approximations constructed as per the procedures of variable precision rough sets (VPRS) are likely to be more rational than those constructed from a classical perspective under certain conditions, and this may continue to hold for some generalizations of the former. However, a formal characterization of such conditions is not available in the previously published literature. In this research, the theoretical aspects of the problem are critically examined, uniform generalizations of granular VPRS are introduced, new connections with granular graded rough sets are proved, appropriate concepts of substantial parthood are introduced, their extent of compatibility with the framework is assessed, and the framework is extended. Basic assumptions are explained in detail, and additional examples are constructed for readability. Furthermore, meta-applications to cluster validation, image segmentation and dynamic sorting are developed. Extensions to direct generalizations of VPRS, such as probabilistic rough sets, are a natural consequence of the work.
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v3 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.
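    The conformal p-value construction and a simple way to combine several statistics can be sketched as follows; Fisher's combination below assumes independent statistics, which is a simplification relative to the paper's procedure, and all scores are synthetic:

        # Conformal p-values from calibration scores, combined across several
        # statistics with Fisher's method (a generic sketch of the recipe).
        import numpy as np
        from scipy.stats import chi2

        rng = np.random.default_rng(0)

        def conformal_p(score, calibration_scores):
            # Larger score = more OOD-like; p is valid under exchangeability
            # of the test point with the calibration set.
            n = len(calibration_scores)
            return (1 + np.sum(calibration_scores >= score)) / (n + 1)

        # Two toy statistics of the learning algorithm (e.g., max softmax, norm).
        calib = rng.normal(0, 1, size=(2, 500))      # in-distribution calibration
        x_scores = np.array([3.1, 2.4])              # test point's two statistics

        p_vals = [conformal_p(s, c) for s, c in zip(x_scores, calib)]
        fisher_stat = -2 * np.sum(np.log(p_vals))
        p_combined = chi2.sf(fisher_stat, df=2 * len(p_vals))
        print("per-statistic p-values:", np.round(p_vals, 4),
              "combined:", round(p_combined, 5))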
    Membership Inference Attacks and Generalization: A Causal Perspective. (arXiv:2209.08615v2 [cs.LG] UPDATED)
    Membership inference (MI) attacks highlight a privacy weakness in present stochastic training methods for neural networks. It is not well understood, however, why they arise. Are they a natural consequence of imperfect generalization only? Which underlying causes should we address during training to mitigate these attacks? Towards answering such questions, we propose the first approach to explain MI attacks and their connection to generalization based on principled causal reasoning. We offer causal graphs that quantitatively explain the observed MI attack performance achieved for $6$ attack variants. We refute several prior non-quantitative hypotheses that over-simplify or over-estimate the influence of underlying causes, thereby failing to capture the complex interplay between several factors. Our causal models also show a new connection between generalization and MI attacks via their shared causal factors. Our causal models have high predictive power ($0.90$), i.e., their analytical predictions often match observations in unseen experiments, which makes analysis via them a pragmatic alternative.
    Elastic Weight Consolidation Improves the Robustness of Self-Supervised Learning Methods under Transfer. (arXiv:2210.16365v1 [cs.LG])
    Self-supervised representation learning (SSL) methods provide an effective label-free initial condition for fine-tuning downstream tasks. However, in numerous realistic scenarios, the downstream task might be biased with respect to the target label distribution. This in turn moves the learned fine-tuned model posterior away from the initial (label) bias-free self-supervised model posterior. In this work, we re-interpret SSL fine-tuning under the lens of Bayesian continual learning and consider regularization through the Elastic Weight Consolidation (EWC) framework. We demonstrate that self-regularization against an initial SSL backbone improves worst sub-group performance by 5% on Waterbirds and by 2% on Celeb-A when using the ViT-B/16 architecture. Furthermore, to help simplify the use of EWC with SSL, we pre-compute and publicly release the Fisher Information Matrix (FIM), evaluated with 10,000 ImageNet-1K variates, for large modern SSL architectures including ViT-B/16 and ResNet50 trained with DINO.
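    The regularizer itself is a Fisher-weighted quadratic penalty that anchors fine-tuning to the SSL initialization. A minimal PyTorch sketch (the identity FIM and tiny linear model below are placeholders, not the released matrices or backbones):

        # Minimal EWC penalty anchoring fine-tuning to an SSL initialization.
        import torch

        def ewc_penalty(model, anchor_params, fisher, lam=1.0):
            loss = 0.0
            for name, p in model.named_parameters():
                loss = loss + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
            return lam * loss

        model = torch.nn.Linear(16, 4)              # stand-in for an SSL backbone
        anchor = {n: p.detach().clone() for n, p in model.named_parameters()}
        fisher = {n: torch.ones_like(p) for n, p in model.named_parameters()}  # placeholder FIM

        x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))
        task_loss = torch.nn.functional.cross_entropy(model(x), y)
        total = task_loss + ewc_penalty(model, anchor, fisher, lam=0.1)
        total.backward()   # gradients trade task fit against staying near the anchor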
    The Fisher-Rao Loss for Learning under Label Noise. (arXiv:2210.16401v1 [cs.LG])
    Choosing a suitable loss function is essential when learning by empirical risk minimisation. In many practical cases, the datasets used for training a classifier may contain incorrect labels, which prompts the interest for using loss functions that are inherently robust to label noise. In this paper, we study the Fisher-Rao loss function, which emerges from the Fisher-Rao distance in the statistical manifold of discrete distributions. We derive an upper bound for the performance degradation in the presence of label noise, and analyse the learning speed of this loss. Comparing with other commonly used losses, we argue that the Fisher-Rao loss provides a natural trade-off between robustness and training dynamics. Numerical experiments with synthetic and MNIST datasets illustrate this performance.
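    For discrete distributions the Fisher-Rao distance has a closed form, d(p, q) = 2 arccos(sum_i sqrt(p_i q_i)), which with a one-hot target reduces to 2 arccos(sqrt(p_y)); a small sketch:

        # Fisher-Rao loss for classification on the probability simplex:
        # d(p, q) = 2 * arccos(sum_i sqrt(p_i * q_i)); for a one-hot target
        # e_y this reduces to 2 * arccos(sqrt(p_y)).
        import numpy as np

        def fisher_rao_loss(probs, y):
            return 2.0 * np.arccos(np.clip(np.sqrt(probs[y]), 0.0, 1.0))

        p = np.array([0.7, 0.2, 0.1])        # predicted class distribution
        print("correct label  :", fisher_rao_loss(p, 0))   # small loss
        print("incorrect label:", fisher_rao_loss(p, 2))   # larger, capped at pi
        # Boundedness (the loss never exceeds pi) is one intuition for the
        # robustness to label noise, in contrast to unbounded cross-entropy.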
    U-Net-based Models for Skin Lesion Segmentation: More Attention and Augmentation. (arXiv:2210.16399v1 [eess.IV])
    According to the WHO [1], diagnoses of melanoma skin cancer have become more frequent since the 1970s. However, if detected early, the 5-year survival rate for melanoma can reach 99 percent. In this regard, skin lesion segmentation can be pivotal in monitoring and treatment planning. In this work, ten models and four augmentation configurations are trained on the ISIC 2016 dataset, and their performance and overfitting are compared using five metrics. Our results show that U-Net-Resnet50 and R2U-Net achieve the highest metric values, along with two data augmentation scenarios. We also investigate CBAM and AG blocks in the U-Net architecture, which enhance segmentation performance at a meager computational cost. In addition, we propose using pyramid, AG, and CBAM blocks in sequence, which significantly surpasses the results of using either of the two individually. Finally, our experiments show that models that exploit attention modules successfully overcome common skin lesion segmentation problems. Lastly, in the spirit of reproducible research, we make our models and code publicly available.
    Certified Graph Unlearning. (arXiv:2206.09140v2 [cs.LG] UPDATED)
    Graph-structured data is ubiquitous in practice and often processed using graph neural networks (GNNs). With the adoption of recent laws ensuring the ``right to be forgotten'', the problem of graph data removal has become of significant importance. To address the problem, we introduce the first known framework for \emph{certified graph unlearning} of GNNs. In contrast to standard machine unlearning, new analytical and heuristic unlearning challenges arise when dealing with complex graph data. First, three different types of unlearning requests need to be considered, including node feature, edge and node unlearning. Second, to establish provable performance guarantees, one needs to address challenges associated with feature mixing during propagation. The underlying analysis is illustrated on the example of simple graph convolutions (SGC) and their generalized PageRank (GPR) extensions, thereby laying the theoretical foundation for certified unlearning of GNNs. Our empirical studies on six benchmark datasets demonstrate excellent performance-complexity trade-offs when compared to complete retraining methods and approaches that do not leverage graph information. For example, when unlearning $20\%$ of the nodes on the Cora dataset, our approach suffers only a $0.1\%$ loss in test accuracy while offering a $4$-fold speed-up compared to complete retraining. Our scheme also outperforms unlearning methods that do not leverage graph information with a $12\%$ increase in test accuracy for a comparable time complexity.
    Arithmetic Circuits, Structured Matrices and (not so) Deep Learning. (arXiv:2206.12490v2 [cs.CC] UPDATED)
    This survey presents a necessarily incomplete (and biased) overview of results at the intersection of arithmetic circuit complexity, structured matrices and deep learning. Recently there has been some research activity in replacing unstructured weight matrices in neural networks by structured ones (with the aim of reducing the size of the corresponding deep learning models). Most of this work has been experimental and in this survey, we formalize the research question and show how a recent work that combines arithmetic circuit complexity, structured matrices and deep learning essentially answers this question. This survey is targeted at complexity theorists who might enjoy reading about how tools developed in arithmetic circuit complexity helped design (to the best of our knowledge) a new family of structured matrices, which in turn seem well-suited for applications in deep learning. However, we hope that folks primarily interested in deep learning would also appreciate the connections to complexity theory.
    Machine Unlearning of Federated Clusters. (arXiv:2210.16424v1 [cs.LG])
    Federated clustering is an unsupervised learning problem that arises in a number of practical applications, including personalized recommender and healthcare systems. With the adoption of recent laws ensuring the "right to be forgotten", the problem of machine unlearning for federated clustering methods has become of significant importance. This work proposes the first known unlearning mechanism for federated clustering with privacy criteria that support simple, provable, and efficient data removal at the client and server level. The gist of our approach is to combine special initialization procedures with quantization methods that allow for secure aggregation of estimated local cluster counts at the server unit. As part of our platform, we introduce secure compressed multiset aggregation (SCMA), which is of independent interest for secure sparse model aggregation. In order to simultaneously facilitate low communication complexity and secret sharing protocols, we integrate Reed-Solomon encoding with special evaluation points into the new SCMA pipeline and derive bounds on the time and communication complexity of different components of the scheme. Compared to completely retraining K-means++ locally and globally for each removal request, we obtain an average speed-up of roughly 84x across seven datasets, two of which contain biological and medical information that is subject to frequent unlearning requests.
    An Efficient Memory-Augmented Transformer for Knowledge-Intensive NLP Tasks. (arXiv:2210.16773v1 [cs.CL])
    Access to external knowledge is essential for many natural language processing tasks, such as question answering and dialogue. Existing methods often rely on a parametric model that stores knowledge in its parameters, or use a retrieval-augmented model that has access to an external knowledge source. Parametric and retrieval-augmented models have complementary strengths in terms of computational efficiency and predictive accuracy. To combine the strengths of both approaches, we propose the Efficient Memory-Augmented Transformer (EMAT) -- it encodes external knowledge into a key-value memory and exploits fast maximum inner product search for memory querying. We also introduce pre-training tasks that allow EMAT to encode informative key-value representations and to learn an implicit strategy to integrate multiple memory slots into the transformer. Experiments on various knowledge-intensive tasks such as question answering and dialogue datasets show that simply augmenting parametric models (T5-base) with our method produces more accurate results (e.g., 25.8 -> 44.3 EM on NQ) while retaining a high throughput (e.g., 1000 queries/s on NQ). Compared to retrieval-augmented models, EMAT runs substantially faster across the board and produces more accurate results on WoW and ELI5. Our code and datasets are available at https://github.com/uclnlp/EMAT.
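    The key-value memory lookup at the heart of this design reduces to a maximum inner product search over encoded keys. Below is a minimal sketch of that retrieval step; `memory_lookup` and the random memory are hypothetical stand-ins, and a production system would typically use an approximate MIPS index rather than a dense matrix product.

```python
import torch

def memory_lookup(query, keys, values, k=4):
    """Retrieve the top-k memory slots by inner product with the query.
    query: (d,); keys, values: (N, d). Returns the value slots that a
    memory-augmented transformer would integrate into its forward pass."""
    scores = keys @ query                 # maximum inner product search
    top = torch.topk(scores, k)
    return values[top.indices], top.values

# Toy usage with a random memory of 10k key-value pairs.
d, N = 64, 10_000
keys, values = torch.randn(N, d), torch.randn(N, d)
slots, scores = memory_lookup(torch.randn(d), keys, values)
print(slots.shape)   # torch.Size([4, 64])
```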
    Zydeco-Style Spike Sorting Low Power VLSI Architecture for IoT BCI Implants. (arXiv:2209.04427v3 [cs.AR] UPDATED)
    Brain Computer Interface (BCI) has great potential for solving many brain signal analysis limitations, mental disorder resolutions, and restoring missing limb functionality via neural-controlled implants. However, no single safe implant suitable for daily use exists yet. Most proposed implants have several implementation issues, such as infection hazards and heat dissipation, which limit their usability and make it more challenging to pass regulation and quality-control requirements. Wireless implants avoid a chronic wound in the skull; however, the complex clustering algorithms for neuron identification inside the implant chip consume considerable power and bandwidth, worsening heat dissipation and draining the implant's battery. Spike sorting is the core unit of an invasive BCI chip and plays a major role in its power consumption, accuracy, and area. Therefore, in this study, we propose a low-power adaptive simplified VLSI architecture, "Zydeco-Style," for BCI spike sorting that is computationally less complex and achieves up to 93.5% accuracy in the worst-case scenario. The architecture uses a low-power Bluetooth wireless communication module to interface with external IoT medical ICU devices. The proposed architecture was implemented and simulated in Verilog. In addition, we propose a conceptual implant design.
    Federated clustering with GAN-based data synthesis. (arXiv:2210.16524v1 [cs.LG])
    Federated clustering is an adaptation of centralized clustering to the federated setting, which aims to cluster data based on a global similarity measure while keeping all data local. The key question is how to construct a global similarity measure without sharing private data. To handle this, k-FED and federated fuzzy c-means (FFCM) adapted K-means and fuzzy c-means, respectively, to the federated learning setting, constructing $K$ global cluster centroids by running K-means on the set of all local cluster centroids. However, the constructed global cluster centroids may be fragile and sensitive to different non-independent and identically distributed (Non-IID) levels among clients. To address this, we propose a simple but effective federated clustering framework with GAN-based data synthesis, called synthetic data aided federated clustering (SDA-FC). It outperforms k-FED and FFCM in terms of effectiveness and robustness, requires only one communication round, can run asynchronously, and can handle device failures. Moreover, although NMI is a far more commonly used metric than Kappa, our empirical results indicate that Kappa is the more reliable one.
    One Gradient Frank-Wolfe for Decentralized Online Convex and Submodular Optimization. (arXiv:2210.16790v1 [cs.LG])
    Decentralized learning has been studied intensively in recent years, motivated by its wide applications in the context of federated learning. The majority of previous research focuses on the offline setting, in which the objective function is static. However, the offline setting becomes unrealistic in numerous machine learning applications where massive amounts of data arrive and change over time. In this paper, we propose \emph{decentralized online} algorithms for convex and continuous DR-submodular optimization, two classes of functions that arise in a variety of machine learning problems. Our algorithms achieve performance guarantees comparable to those in the centralized offline setting. Moreover, on average, each participant performs only a \emph{single} gradient computation per time step. Subsequently, we extend our algorithms to the bandit setting. Finally, we illustrate the competitive performance of our algorithms in real-world experiments.
    Fast Sparse Classification for Generalized Linear and Additive Models. (arXiv:2202.11389v2 [cs.LG] UPDATED)
    We present fast classification techniques for sparse generalized linear and additive models. These techniques can handle thousands of features and thousands of observations in minutes, even in the presence of many highly correlated features. For fast sparse logistic regression, our computational speed-up over other best-subset search techniques owes to linear and quadratic surrogate cuts for the logistic loss that allow us to efficiently screen features for elimination, as well as use of a priority queue that favors a more uniform exploration of features. As an alternative to the logistic loss, we propose the exponential loss, which permits an analytical solution to the line search at each iteration. Our algorithms are generally 2 to 5 times faster than previous approaches. They produce interpretable models that have accuracy comparable to black box models on challenging datasets.
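    The analytic line search claimed for the exponential loss can be made concrete in the coordinate-descent setting: for a binary feature, the optimal step size has the familiar AdaBoost-style closed form. A minimal sketch, assuming labels and features take values in {-1, +1}:

```python
import numpy as np

def exp_loss_step(scores, y, f):
    """Exact minimizer over alpha of sum_i exp(-y_i * (scores_i + alpha * f_i))
    for labels y and a binary feature f, both in {-1, +1}: the closed form
    alpha = (1/2) log(W+ / W-), where W+/W- weight agreeing/disagreeing examples."""
    w = np.exp(-y * scores)               # current example weights
    agree = (y * f) > 0
    w_plus, w_minus = w[agree].sum(), w[~agree].sum()
    return 0.5 * np.log(w_plus / w_minus)

rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
f = y * rng.choice([1.0, 1.0, 1.0, -1.0], size=200)   # agrees with y 75% of the time
print(exp_loss_step(np.zeros(200), y, f))             # positive step on a useful feature
```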
    Learning Bipedal Walking On Planned Footsteps For Humanoid Robots. (arXiv:2207.12644v2 [cs.RO] UPDATED)
    Deep reinforcement learning (RL) based controllers for legged robots have demonstrated impressive robustness for walking in different environments for several robot platforms. To enable the application of RL policies for humanoid robots in real-world settings, it is crucial to build a system that can achieve robust walking in any direction, on 2D and 3D terrains, and be controllable by a user-command. In this paper, we tackle this problem by learning a policy to follow a given step sequence. The policy is trained with the help of a set of procedurally generated step sequences (also called footstep plans). We show that simply feeding the upcoming 2 steps to the policy is sufficient to achieve omnidirectional walking, turning in place, standing, and climbing stairs. Our method employs curriculum learning on the complexity of terrains, and circumvents the need for reference motions or pre-trained weights. We demonstrate the application of our proposed method to learn RL policies for 2 new robot platforms - HRP5P and JVRC-1 - in the MuJoCo simulation environment. The code for training and evaluation is available online.
    Toward Causal-Aware RL: State-Wise Action-Refined Temporal Difference. (arXiv:2201.00354v2 [cs.LG] UPDATED)
    Although it is well known that exploration plays a key role in Reinforcement Learning (RL), prevailing exploration strategies for continuous control tasks in RL are mainly based on naive isotropic Gaussian noise, which ignores the causal relationship between the action space and the task and treats all dimensions of actions as equally important. In this work, we propose to conduct interventions on the primal action space to discover the causal relationship between the action space and the task reward. We propose the method of State-Wise Action Refined (SWAR), which addresses the issue of action space redundancy and promotes causality discovery in RL. We formulate causality discovery in RL tasks as a state-dependent action space selection problem and propose two practical algorithms as solutions. The first approach, TD-SWAR, detects task-related actions during temporal difference learning, while the second approach, Dyn-SWAR, reveals important actions through dynamic model prediction. Empirically, both methods provide ways to understand the decisions made by RL agents and improve learning efficiency in action-redundant tasks.
    KnAC: an approach for enhancing cluster analysis with background knowledge and explanations. (arXiv:2112.08759v2 [cs.AI] UPDATED)
    Pattern discovery in multidimensional data sets has been the subject of research for decades. There exists a wide spectrum of clustering algorithms that can be used for this purpose. However, their practical applications share a common post-clustering phase, which concerns expert-based interpretation and analysis of the obtained results. We argue that this can be the bottleneck in the process, especially in cases where domain knowledge exists prior to clustering. Such a situation requires not only a proper analysis of automatically discovered clusters but also conformance checking with existing knowledge. In this work, we present Knowledge Augmented Clustering (KnAC). Its main goal is to confront expert-based labelling with automated clustering for the sake of updating and refining the former. Our solution is not restricted to any existing clustering algorithm. Instead, KnAC can serve as an augmentation of an arbitrary clustering algorithm, making the approach robust and a model-agnostic improvement of any state-of-the-art clustering method. We demonstrate the feasibility of our method on artificial, reproducible examples and in a real-life use case scenario. In both cases, we achieved better results than classic clustering algorithms without augmentation.
    Physics-aware Graph Neural Network for Accurate RNA 3D Structure Prediction. (arXiv:2210.16392v1 [cs.LG])
    Biological functions of RNAs are determined by their three-dimensional (3D) structures. Thus, given the limited number of experimentally determined RNA structures, the prediction of RNA structures will facilitate elucidating RNA functions and RNA-targeted drug discovery, but remains a challenging task. In this work, we propose a Graph Neural Network (GNN)-based scoring function trained only with the atomic types and coordinates on limited solved RNA 3D structures for distinguishing accurate structural models. The proposed Physics-aware Multiplex Graph Neural Network (PaxNet) separately models the local and non-local interactions inspired by molecular mechanics. Furthermore, PaxNet contains an attention-based fusion module that learns the individual contribution of each interaction type for the final prediction. We rigorously evaluate the performance of PaxNet on two benchmarks and compare it with several state-of-the-art baselines. The results show that PaxNet significantly outperforms all the baselines overall, and demonstrate the potential of PaxNet for improving the 3D structure modeling of RNA and other macromolecules.
    Foreign Object Debris Detection for Airport Pavement Images based on Self-supervised Localization and Vision Transformer. (arXiv:2210.16901v1 [cs.CV])
    Supervised object detection methods provide subpar performance when applied to Foreign Object Debris (FOD) detection because FOD could be arbitrary objects according to the Federal Aviation Administration (FAA) specification. Current supervised object detection algorithms require datasets that contain annotated examples of every to-be-detected object. While a large and expensive dataset could be developed to include common FOD examples, it is infeasible to collect all possible FOD examples because of the open-ended nature of FOD. Limitations of the dataset could cause FOD detection systems driven by those supervised algorithms to miss certain FOD, which can become dangerous to airport operations. To this end, this paper presents a self-supervised FOD localization method that learns to predict clean runway images, avoiding the enumeration of annotated FOD examples. The method utilizes the Vision Transformer (ViT) to improve localization performance. The experiments show that the method successfully detects arbitrary FOD in real-world runway situations. The paper also provides an extension of the localization result to perform classification, a feature that can be useful for downstream tasks. To train the localization model, this paper also presents a simple and realistic dataset creation framework that only collects clean runway images. The training and testing data for this method are collected at a local airport using unmanned aircraft systems (UAS). Additionally, the developed dataset is provided for public use and further studies.
    Feature Learning for Dimensionality Reduction toward Maximal Extraction of Hidden Patterns. (arXiv:2206.13891v2 [cs.LG] UPDATED)
    Dimensionality reduction (DR) plays a vital role in the visual analysis of high-dimensional data. One main aim of DR is to reveal hidden patterns that lie on intrinsic low-dimensional manifolds. However, DR often overlooks important patterns when the manifolds are distorted or hidden by certain influential data attributes. This paper presents a feature learning framework, FEALM, designed to generate an optimized set of data projections for nonlinear DR in order to capture important patterns in the hidden manifolds. These projections produce maximally different nearest-neighbor graphs so that resultant DR outcomes are significantly different. To achieve such a capability, we design an optimization algorithm as well as introduce a new graph dissimilarity measure, named neighbor-shape dissimilarity. Additionally, we develop interactive visualizations to assist comparison of obtained DR results and interpretation of each DR result. We demonstrate FEALM's effectiveness through experiments and case studies using synthetic and real-world datasets.  ( 2 min )
    Representation Learning for General-sum Low-rank Markov Games. (arXiv:2210.16976v1 [cs.LG])
    We study multi-agent general-sum Markov games with nonlinear function approximation. We focus on low-rank Markov games whose transition matrix admits a hidden low-rank structure on top of an unknown non-linear representation. The goal is to design an algorithm that (1) finds an $\varepsilon$-equilibrium policy sample efficiently without prior knowledge of the environment or the representation, and (2) permits a deep-learning friendly implementation. We leverage representation learning and present a model-based and a model-free approach to construct an effective representation from the collected data. For both approaches, the algorithm achieves a sample complexity of poly$(H,d,A,1/\varepsilon)$, where $H$ is the game horizon, $d$ is the dimension of the feature vector, $A$ is the size of the joint action space and $\varepsilon$ is the optimality gap. When the number of players is large, the above sample complexity can scale exponentially with the number of players in the worst case. To address this challenge, we consider Markov games with a factorized transition structure and present an algorithm that escapes such exponential scaling. To our best knowledge, this is the first sample-efficient algorithm for multi-agent general-sum Markov games that incorporates (non-linear) function approximation. We accompany our theoretical result with a neural network-based implementation of our algorithm and evaluate it against the widely used deep RL baseline, DQN with fictitious play.  ( 2 min )
    A view on model misspecification in uncertainty quantification. (arXiv:2210.16938v1 [cs.LG])
    Estimating uncertainty of machine learning models is essential to assess the quality of the predictions that these models provide. However, there are several factors that influence the quality of uncertainty estimates, one of which is the amount of model misspecification. Model misspecification always exists as models are mere simplifications or approximations to reality. The question arises whether the estimated uncertainty under model misspecification is reliable or not. In this paper, we argue that model misspecification should receive more attention, by providing thought experiments and contextualizing these with relevant literature.  ( 2 min )
    Fast Deep Mixtures of Gaussian Process Experts. (arXiv:2006.13309v3 [cs.LG] UPDATED)
    Mixtures of experts have become an indispensable tool for flexible modelling in a supervised learning context, and sparse Gaussian processes (GP) have shown promise as a leading candidate for the experts in such models. In this article, we propose to design the gating network for selecting the experts from such mixtures of sparse GPs using a deep neural network (DNN). Furthermore, a fast one-pass algorithm called Cluster-Classify-Regress (CCR) is leveraged to approximate the maximum a posteriori (MAP) estimator extremely quickly. This powerful combination of model and algorithm together delivers a novel method which is flexible, robust, and extremely efficient. In particular, the method is able to outperform competing methods in terms of accuracy and uncertainty quantification. The cost is competitive on low-dimensional and small data sets, but is significantly lower for higher-dimensional and big data sets. Iteratively maximizing the distribution of experts given allocations and allocations given experts does not provide significant improvement, which indicates that the algorithm achieves a good approximation to the local MAP estimator very fast. This insight can also be useful in the context of other mixture-of-experts models.  ( 2 min )
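    One plausible schematic reading of the CCR idea (not the authors' implementation, which uses a DNN gate and sparse GP experts) is: cluster the joint data once, train a classifier to predict cluster membership from the inputs alone, and fit one regressor per cluster. A sketch with scikit-learn stand-ins:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.where(X[:, 0] < 0, np.sin(X[:, 0]), 2.0 + 0.5 * X[:, 0])  # piecewise target

# Cluster: partition the joint (x, y) data once.
z = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(np.c_[X, y])

# Classify: the gate predicts cluster membership from x alone.
gate = LogisticRegression().fit(X, z)

# Regress: one expert per cluster.
experts = {c: Ridge().fit(X[z == c], y[z == c]) for c in np.unique(z)}

def predict(X_new):
    c = gate.predict(X_new)
    return np.array([experts[ci].predict(xi[None])[0] for ci, xi in zip(c, X_new)])

print("train MSE:", np.mean((predict(X) - y) ** 2))
```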
    ForestPrune: Compact Depth-Controlled Tree Ensembles. (arXiv:2206.00128v2 [stat.ML] UPDATED)
    Tree ensembles are versatile supervised learning algorithms that achieve state-of-the-art performance. These models are extremely powerful but can grow to enormous sizes. As a result, tree ensembles are often post-processed to reduce memory footprint and improve interpretability. In this paper, we present ForestPrune, a novel optimization framework that can post-process tree ensembles by pruning depth layers from individual trees. We also develop a new block coordinate descent method to efficiently obtain high-quality solutions to optimization problems under this framework. The number of nodes in a decision tree increases exponentially with tree depth, so pruning deep trees can drastically improve model parsimony. ForestPrune can substantially reduce the space complexity of an ensemble for a minimal cost to performance. The framework supports various weighting schemes and contains just a single hyperparameter to tune. In our experiments, we observe that ForestPrune can reduce model size 20-fold with negligible performance loss.
    On the Efficient Implementation of the Matrix Exponentiated Gradient Algorithm for Low-Rank Matrix Optimization. (arXiv:2012.10469v2 [math.OC] UPDATED)
    Convex optimization over the spectrahedron, i.e., the set of all real $n\times n$ positive semidefinite matrices with unit trace, has important applications in machine learning, signal processing and statistics, mainly as a convex relaxation for optimization problems with low-rank matrices. It is also one of the most prominent examples in the theory of first-order methods for convex optimization in which non-Euclidean methods can be significantly preferable to their Euclidean counterparts. In particular, the desirable choice is the Matrix Exponentiated Gradient (MEG) method, which is based on the Bregman distance induced by the (negative) von Neumann entropy. Unfortunately, implementing MEG requires a full SVD computation on each iteration, which is not scalable to high-dimensional problems. In this work we propose efficient implementations of MEG, both with deterministic and stochastic gradients, which are tailored for optimization with low-rank matrices and only use a single low-rank SVD computation on each iteration. We also provide efficiently-computable certificates for the correct convergence of our methods. Mainly, we prove that under a strict complementarity condition, the suggested methods converge from a ``warm-start'' initialization with similar rates to their full-SVD-based counterparts. Finally, we present empirical experiments that both support our theoretical findings and demonstrate the practical appeal of our methods.
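    For reference, the vanilla MEG update that the paper accelerates is mirror descent under the von Neumann entropy; a minimal sketch on a linear objective, where the two full eigendecompositions per step are exactly the bottleneck replaced by a single low-rank SVD:

```python
import numpy as np

def meg_step(W, grad, eta):
    """One Matrix Exponentiated Gradient step on the spectrahedron:
    W_next = exp(log W - eta * grad) / trace(...). The eigendecompositions
    here play the role of the full SVD discussed in the abstract."""
    evals, evecs = np.linalg.eigh(W)
    logW = (evecs * np.log(evals)) @ evecs.T
    evals, evecs = np.linalg.eigh(logW - eta * grad)
    W_new = (evecs * np.exp(evals)) @ evecs.T
    return W_new / np.trace(W_new)

n = 20
rng = np.random.default_rng(0)
C = rng.normal(size=(n, n))
C = (C + C.T) / 2                       # symmetric linear "cost"
W = np.eye(n) / n                       # uniform start, unit trace
for _ in range(30):
    W = meg_step(W, C, eta=0.2)         # concentrates on C's bottom eigenvector
print(np.trace(W), np.linalg.eigvalsh(W)[-1])  # trace 1, top eigenvalue -> 1
```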
    Improved Support Recovery in Universal One-bit Compressed Sensing. (arXiv:2210.16657v1 [cs.IT])
    One-bit compressed sensing (1bCS) is an extremely quantized signal acquisition method that has been proposed and studied rigorously in the past decade. In 1bCS, linear samples of a high dimensional signal are quantized to only one bit per sample (the sign of the measurement). Assuming the original signal vector to be sparse, existing results in 1bCS either aim to find the support of the vector, or to approximate the signal within a small error. The focus of this paper is support recovery, which often also facilitates computationally efficient approximate signal recovery. A {\em universal} measurement matrix for 1bCS refers to a single set of measurements that works for all sparse signals. With universality, it is known that $\tilde{\Theta}(k^2)$ 1bCS measurements are necessary and sufficient for support recovery (where $k$ denotes the sparsity). To improve the dependence on sparsity from quadratic to linear, in this work we propose approximate support recovery (allowing an $\epsilon>0$ proportion of errors) and superset recovery (allowing an $\epsilon$ proportion of false positives). We show that the first type of recovery is possible with $\tilde{O}(k/\epsilon)$ measurements, while the latter, more challenging type of recovery is possible with $\tilde{O}(\max\{k/\epsilon,k^{3/2}\})$ measurements. We also show that in both cases $\Omega(k/\epsilon)$ measurements would be necessary for universal recovery. Improved results are possible if we consider universal recovery within a restricted class of signals, such as rational signals or signals with bounded dynamic range. In both cases superset recovery is possible with only $\tilde{O}(k/\epsilon)$ measurements. Other results on universal but approximate support recovery are also provided in this paper. All of our main recovery algorithms are simple and polynomial-time.
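    To make the measurement model concrete: 1bCS observes only the sign of each linear sample. The sketch below generates such measurements and applies a simple correlation-based support estimate; this estimator is illustrative only and is not one of the paper's algorithms.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, m = 1000, 5, 400                  # ambient dimension, sparsity, measurements

x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.choice([-1.0, 1.0], size=k) * rng.uniform(1.0, 2.0, size=k)

A = rng.normal(size=(m, n))             # Gaussian measurement matrix
y = np.sign(A @ x)                      # one bit per linear sample

# Correlation estimate: E[A^T sign(Ax)] is proportional to x for Gaussian A,
# so the k largest |correlations| estimate the support.
est = np.sort(np.argsort(np.abs(A.T @ y))[-k:])
print(np.sort(support), est)
```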
    Solving a Special Type of Optimal Transport Problem by a Modified Hungarian Algorithm. (arXiv:2210.16645v1 [math.OC])
    We observe that computing the empirical Wasserstein distance in the independence test is an optimal transport (OT) problem with a special structure. This observation inspires us to study a special type of OT problem and propose a modified Hungarian algorithm to solve it exactly. For an OT problem between marginals with $m$ and $n$ atoms ($m\geq n$), the computational complexity of the proposed algorithm is $O(m^2n)$. Computing the empirical Wasserstein distance in the independence test requires solving this special type of OT problem, where we have $m=n^2$. The associated computational complexity of our algorithm is $O(n^5)$, while that of the classic Hungarian algorithm is $O(n^6)$. Numerical experiments validate our theoretical analysis. Broader applications of the proposed algorithm are discussed at the end.
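    For intuition, the balanced special case (equal numbers of equally weighted atoms) already reduces exact OT to an assignment problem solvable by a Hungarian-type method; the paper's contribution is a modified algorithm exploiting the additional structure with $m = n^2$. A minimal sketch of the balanced case:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
u = rng.normal(size=(50, 2))              # n equally weighted atoms
v = rng.normal(size=(50, 2)) + 1.0        # n atoms from a shifted distribution

C = cdist(u, v, metric="sqeuclidean")     # pairwise transport costs
row, col = linear_sum_assignment(C)       # Hungarian-type exact assignment
print("empirical W2^2:", C[row, col].mean())
```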
    Spectral Representation Learning for Conditional Moment Models. (arXiv:2210.16525v1 [stat.ML])
    Many problems in causal inference and economics can be formulated in the framework of conditional moment models, which characterize the target function through a collection of conditional moment restrictions. For nonparametric conditional moment models, efficient estimation has always relied on preimposed conditions on various measures of ill-posedness of the hypothesis space, which are hard to validate when flexible models are used. In this work, we address this issue by proposing a procedure that automatically learns representations with controlled measures of ill-posedness. Our method approximates a linear representation defined by the spectral decomposition of a conditional expectation operator, which can be used for kernelized estimators and is known to facilitate minimax optimal estimation in certain settings. We show this representation can be efficiently estimated from data, and establish L2 consistency for the resulting estimator. We evaluate the proposed method on proximal causal inference tasks, exhibiting promising performance on high-dimensional, semi-synthetic data.
    Private optimization in the interpolation regime: faster rates and hardness results. (arXiv:2210.17070v1 [cs.LG])
    In non-private stochastic convex optimization, stochastic gradient methods converge much faster on interpolation problems -- problems where there exists a solution that simultaneously minimizes all of the sample losses -- than on non-interpolating ones; we show that generally similar improvements are impossible in the private setting. However, when the functions exhibit quadratic growth around the optimum, we show (near) exponential improvements in the private sample complexity. In particular, we propose an adaptive algorithm that improves the sample complexity to achieve expected error $\alpha$ from $\frac{d}{\varepsilon \sqrt{\alpha}}$ to $\frac{1}{\alpha^\rho} + \frac{d}{\varepsilon} \log\left(\frac{1}{\alpha}\right)$ for any fixed $\rho >0$, while retaining the standard minimax-optimal sample complexity for non-interpolation problems. We prove a lower bound that shows the dimension-dependent term is tight. Furthermore, we provide a superefficiency result which demonstrates the necessity of the polynomial term for adaptive algorithms: any algorithm that has a polylogarithmic sample complexity for interpolation problems cannot achieve the minimax-optimal rates for the family of non-interpolation problems.
    Learning Probabilistic Models from Generator Latent Spaces with Hat EBM. (arXiv:2210.16486v1 [cs.CV])
    This work proposes a method for using any generator network as the foundation of an Energy-Based Model (EBM). Our formulation posits that observed images are the sum of unobserved latent variables passed through the generator network and a residual random variable that spans the gap between the generator output and the image manifold. One can then define an EBM that includes the generator as part of its forward pass, which we call the Hat EBM. The model can be trained without inferring the latent variables of the observed data or calculating the generator Jacobian determinant. This enables explicit probabilistic modeling of the output distribution of any type of generator network. Experiments show strong performance of the proposed method on (1) unconditional ImageNet synthesis at 128x128 resolution, (2) refining the output of existing generators, and (3) learning EBMs that incorporate non-probabilistic generators. Code and pretrained models to reproduce our results are available at https://github.com/point0bar1/hat-ebm.
    Ensemble transport smoothing -- Part 1: unified framework. (arXiv:2210.17000v1 [stat.ME])
    Smoothers are algorithms for Bayesian time series re-analysis. Most operational smoothers rely either on affine Kalman-type transformations or on sequential importance sampling. These strategies occupy opposite ends of a spectrum that trades computational efficiency and scalability for statistical generality and consistency: non-Gaussianity renders affine Kalman updates inconsistent with the true Bayesian solution, while the ensemble size required for successful importance sampling can be prohibitive. This paper revisits the smoothing problem from the perspective of measure transport, which offers the prospect of consistent prior-to-posterior transformations for Bayesian inference. We leverage this capacity by proposing a general ensemble framework for transport-based smoothing. Within this framework, we derive a comprehensive set of smoothing recursions based on nonlinear transport maps and detail how they exploit the structure of state-space models in fully non-Gaussian settings. We also describe how many standard Kalman-type smoothing algorithms emerge as special cases of our framework. A companion paper explores the implementation of nonlinear ensemble transport smoothers in greater depth.
    SoftBart: Soft Bayesian Additive Regression Trees. (arXiv:2210.16375v1 [stat.ME])
    Bayesian additive regression tree (BART) models have seen increased attention in recent years as a general-purpose nonparametric modeling technique. BART combines the flexibility of modern machine learning techniques with the principled uncertainty quantification of Bayesian inference, and it has been shown to be uniquely appropriate for addressing the high-noise problems that occur commonly in many areas of science, including medicine and the social sciences. This paper introduces the SoftBart package for fitting the Soft BART algorithm of Linero and Yang (2018). In addition to improving upon the predictive performance of other BART packages, a major goal of this package has been to facilitate the inclusion of BART in larger models, making it ideal for researchers in Bayesian statistics. I show both how to use this package for standard prediction tasks and how to embed BART models in larger models; I illustrate by using SoftBart to implement a nonparametric probit regression model, a semiparametric varying coefficient model, and a partial linear model.
    Learning to Defer to Multiple Experts: Consistent Surrogate Losses, Confidence Calibration, and Conformal Ensembles. (arXiv:2210.16955v1 [stat.ML])
    We study the statistical properties of learning to defer (L2D) to multiple experts. In particular, we address the open problems of deriving a consistent surrogate loss, confidence calibration, and principled ensembling of experts. Firstly, we derive two consistent surrogates -- one based on a softmax parameterization, the other on a one-vs-all (OvA) parameterization -- that are analogous to the single expert losses proposed by Mozannar and Sontag (2020) and Verma and Nalisnick (2022), respectively. We then study the frameworks' ability to estimate $P(m_j = y \mid x)$, the probability that the $j$th expert will correctly predict the label for $x$. Theory shows the softmax-based loss causes mis-calibration to propagate between the estimates while the OvA-based loss does not (though in practice, we find there are trade-offs). Lastly, we propose a conformal inference technique that chooses a subset of experts to query when the system defers. We perform empirical validation on tasks for galaxy, skin lesion, and hate speech classification.
    A Faster Sampler for Discrete Determinantal Point Processes. (arXiv:2210.17358v1 [cs.LG])
    Discrete Determinantal Point Processes (DPPs) have a wide array of potential applications for subsampling datasets. They are however held back in some cases by the high cost of sampling. In the worst-case scenario, the sampling cost scales as $O(n^3)$ where $n$ is the number of elements of the ground set. A popular workaround to this prohibitive cost is to sample DPPs defined by low-rank kernels. In such cases, the cost of standard sampling algorithms scales as $O(np^2 + nm^2)$ where $m$ is the (average) number of samples of the DPP (usually $m \ll n$) and $p$ ($m \leq p \leq n$) is the rank of the kernel used to define the DPP. The first term, $O(np^2)$, comes from an SVD-like step. We focus here on the second term of this cost, $O(nm^2)$, and show that it can be brought down to $O(nm + m^3 \log m)$ without any loss in the sampling's exactness. In practice, we observe substantial speedups compared to the classical algorithm as soon as $n > 1000$. The algorithm described here is a close variant of the standard algorithm for sampling continuous DPPs and uses rejection sampling. In the specific case of projection DPPs, we also show that any additional sample can be drawn in time $O(m^3 \log m)$. Finally, an interesting by-product of the analysis is that a realisation from a DPP is typically contained in a subset of size $O(m \log m)$ formed using i.i.d. leverage-score sampling.
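    For context, the standard exact sampler for a projection DPP that the paper accelerates applies the chain rule one item at a time, at $O(nm^2)$ cost after the SVD-like step. A minimal sketch of that baseline (without the paper's rejection-sampling speedup):

```python
import numpy as np

def sample_projection_dpp(K, rng):
    """Exact chain-rule sampler for a projection DPP with kernel K
    (symmetric, idempotent, rank m): draws exactly m = trace(K) items,
    updating the conditional diagonal by Cholesky-style elimination."""
    n = K.shape[0]
    m = int(round(np.trace(K)))
    residual = np.diag(K).astype(float).copy()   # P(i | sampled so far), unnormalized
    dirs, S = [], []
    for _ in range(m):
        p = np.maximum(residual, 0.0)            # clip tiny negative round-off
        i = rng.choice(n, p=p / p.sum())
        S.append(i)
        c = K[:, i].astype(float).copy()         # condition on the new item i
        for d in dirs:
            c -= d[i] * d
        c /= np.sqrt(c[i])
        dirs.append(c)
        residual -= c ** 2
    return sorted(S)

rng = np.random.default_rng(0)
n, m = 200, 10
Q, _ = np.linalg.qr(rng.normal(size=(n, m)))  # orthonormal columns
K = Q @ Q.T                                   # projection kernel of rank m
print(sample_projection_dpp(K, rng))          # exactly m distinct indices
```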
    Scalable Spectral Clustering with Group Fairness Constraints. (arXiv:2210.16435v1 [cs.LG])
    There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where in each cluster each protected group is represented with the same proportion as in the entirety. While the FairSC algorithm (Kleindessner et al., 2019) is able to find the fairer clustering, it is hampered by the high cost of explicitly computing nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation that incorporates nullspace projection and Hotelling's deflation, such that the resulting algorithm, called s-FairSC, involves only sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. Experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clusterings, while being sped up by a factor of 12 for moderate model sizes. s-FairSC is further shown to be scalable in the sense that its computational cost increases only marginally compared to SC without fairness constraints.
    Novel Policy Seeking with Constrained Optimization. (arXiv:2005.10696v3 [cs.LG] UPDATED)
    In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods following this new perspective. The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show that our methods achieve substantial improvement over previous novelty-seeking methods in terms of both the novelty of policies and their performance on the primal task.
    Selective Inference for Hierarchical Clustering. (arXiv:2012.02936v3 [stat.ME] UPDATED)
    Classical tests for a difference in means control the type I error rate when the groups are defined a priori. However, when the groups are instead defined via clustering, then applying a classical test yields an extremely inflated type I error rate. Notably, this problem persists even if two separate and independent data sets are used to define the groups and to test for a difference in their means. To address this problem, in this paper, we propose a selective inference approach to test for a difference in means between two clusters. Our procedure controls the selective type I error rate by accounting for the fact that the choice of null hypothesis was made based on the data. We describe how to efficiently compute exact p-values for clusters obtained using agglomerative hierarchical clustering with many commonly-used linkages. We apply our method to simulated data and to single-cell RNA-sequencing data.
    TiAda: A Time-scale Adaptive Algorithm for Nonconvex Minimax Optimization. (arXiv:2210.17478v1 [math.OC])
    Adaptive gradient methods have shown their ability to adjust the stepsizes on the fly in a parameter-agnostic manner, and empirically achieve faster convergence for solving minimization problems. When it comes to nonconvex minimax optimization, however, current convergence analyses of gradient descent ascent (GDA) combined with adaptive stepsizes require careful tuning of hyper-parameters and the knowledge of problem-dependent parameters. Such a discrepancy arises from the primal-dual nature of minimax problems and the necessity of delicate time-scale separation between the primal and dual updates in attaining convergence. In this work, we propose a single-loop adaptive GDA algorithm called TiAda for nonconvex minimax optimization that automatically adapts to the time-scale separation. Our algorithm is fully parameter-agnostic and can achieve near-optimal complexities simultaneously in deterministic and stochastic settings of nonconvex-strongly-concave minimax problems. The effectiveness of the proposed method is further justified numerically for a number of machine learning applications.
    A Solvable Model of Neural Scaling Laws. (arXiv:2210.16859v1 [cs.LG])
    Large language models with a huge number of parameters, when trained on a near internet-sized number of tokens, have been empirically shown to obey neural scaling laws: specifically, their performance behaves predictably as a power law in either parameters or dataset size until bottlenecked by the other resource. To understand this better, we first identify the necessary properties allowing such scaling laws to arise and then propose a statistical model -- a joint generative data model and random feature model -- that captures this neural scaling phenomenology. By solving this model in the dual limit of large training set size and large number of parameters, we gain insight into (i) the statistical structure of datasets and tasks that lead to scaling laws, (ii) the way nonlinear feature maps, such as those provided by neural networks, enable scaling laws when trained on these datasets, (iii) the optimality of the equiparameterization scaling of training sets and parameters, and (iv) whether such scaling laws can break down and how they behave when they do. Key findings are the manner in which the power laws that occur in the statistics of natural datasets are extended by nonlinear random feature maps and then translated into power-law scalings of the test loss, and how the finite extent of the data's spectral power law causes the model's performance to plateau.
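    The basic empirical object here, a power-law fit of loss against scale with an eventual plateau, is easy to reproduce on synthetic data; all constants below are arbitrary assumptions for illustration.

```python
import numpy as np

# Synthetic "scaling law": test loss L(N) = c * N^(-alpha) + floor, where the
# floor mimics the plateau caused by the finite extent of the data's spectral
# power law.
rng = np.random.default_rng(0)
N = np.logspace(3, 8, 12)                       # model or dataset sizes
alpha_true, c, floor = 0.5, 10.0, 1e-3
L = c * N ** (-alpha_true) + floor
L *= np.exp(0.02 * rng.normal(size=N.size))     # mild multiplicative noise

# On log-log axes a power law is a straight line; fit the pre-plateau region.
mask = L > 20 * floor
slope, _ = np.polyfit(np.log(N[mask]), np.log(L[mask]), 1)
print(f"estimated exponent {-slope:.3f} vs true {alpha_true}")
```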
    Convergence of Dirichlet Forms for MCMC Optimal Scaling with General Target Distributions on Large Graphs. (arXiv:2210.17042v1 [math.ST])
    Markov chain Monte Carlo (MCMC) algorithms have played a significant role in statistics, physics, machine learning and others, and they are the only known general and efficient approach for some high-dimensional problems. The Metropolis-Hastings (MH) algorithm as the most classical MCMC algorithm, has had a great influence on the development and practice of science and engineering. The behavior of the MH algorithm in high-dimensional problems is typically investigated through a weak convergence result of diffusion processes. In this paper, we introduce Mosco convergence of Dirichlet forms in analyzing the MH algorithm on large graphs, whose target distribution is the Gibbs measure that includes any probability measure satisfying a Markov property. The abstract and powerful theory of Dirichlet forms allows us to work directly and naturally on the infinite-dimensional space, and our notion of Mosco convergence allows Dirichlet forms associated with the MH Markov chains to lie on changing Hilbert spaces. Through the optimal scaling problem, we demonstrate the impressive strengths of the Dirichlet form approach over the standard diffusion approach.
    Private Federated Learning Without a Trusted Server: Optimal Algorithms for Convex Losses. (arXiv:2106.09779v6 [cs.LG] UPDATED)
    This paper studies federated learning (FL) -- especially cross-silo FL -- with data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) has data from different people (e.g. patients) and must maintain the privacy of each person's data (e.g. medical record), even if the server or other silos act as adversarial eavesdroppers. This requirement motivates the study of Inter-Silo Record-Level Differential Privacy (ISRL-DP), which requires silo $i$'s communications to satisfy record-level differential privacy (DP). ISRL-DP ensures that the data of each person in silo $i$ cannot be leaked. ISRL-DP is different from well-studied privacy notions. Central and user-level DP assume that people trust the server/other silos. On the other end of the spectrum, local DP assumes that people do not trust anyone at all (even their own silo). Sitting between central and local DP, ISRL-DP makes the realistic assumption (in cross-silo FL) that people trust their own silo, but not the server or other silos. In this work, we provide tight (up to logarithms) upper and lower bounds for ISRL-DP FL with convex/strongly convex loss functions and homogeneous (i.i.d.) silo data. Remarkably, we show that similar bounds are attainable for smooth losses with arbitrary heterogeneous silo data distributions, via an accelerated ISRL-DP algorithm. We also provide tight upper and lower bounds for ISRL-DP federated empirical risk minimization, and use acceleration to attain the optimal bounds in fewer rounds of communication than the state-of-the-art. Finally, with a secure "shuffler" to anonymize silo messages (but without a trusted server), our algorithm attains the optimal central DP rates under more practical trust assumptions. Numerical experiments show favorable privacy-accuracy tradeoffs for our algorithm in classification and regression tasks.
    Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs. (arXiv:2206.00939v2 [stat.ML] UPDATED)
    The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
    Consistent and Truthful Interpretation with Fourier Analysis. (arXiv:2210.17426v1 [cs.LG])
    For many interdisciplinary fields, ML interpretations need to be consistent with what-if scenarios related to the current case, i.e., if one factor changes, how does the model react? Although attribution methods are supported by elegant axiomatic systems, they mainly focus on individual inputs and are generally inconsistent across such scenarios. To support what-if scenarios, we introduce a new notion called truthful interpretation, and we apply Fourier analysis of Boolean functions to obtain rigorous guarantees. Experimental results show that for neighborhoods with various radii, our method achieves 2x to 50x lower interpretation error compared with other methods.
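    The underlying toolkit, Fourier analysis of Boolean functions, expands $f:\{-1,+1\}^n \to \mathbb{R}$ over parity characters $\chi_S(x) = \prod_{i \in S} x_i$. A brute-force sketch of the expansion, feasible only for small $n$:

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """Brute-force Fourier expansion of f: {-1,+1}^n -> R under the uniform
    measure: hat_f(S) = E_x[f(x) * prod_{i in S} x_i]. Exponential in n,
    so for small illustrative examples only."""
    points = [np.array(p) for p in itertools.product([-1, 1], repeat=n)]
    values = np.array([f(x) for x in points], dtype=float)
    coeffs = {}
    for r in range(n + 1):
        for S in itertools.combinations(range(n), r):
            chi = np.array([np.prod(x[list(S)]) for x in points])
            coeffs[S] = float(values @ chi) / len(points)
    return coeffs

# Example: 3-bit majority puts all its Fourier weight on odd-size sets.
maj = lambda x: float(np.sign(x.sum()))
for S, c in fourier_coefficients(maj, 3).items():
    if abs(c) > 1e-12:
        print(S, c)
```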
    Nesterov Meets Optimism: Rate-Optimal Optimistic-Gradient-Based Method for Stochastic Bilinearly-Coupled Minimax Optimization. (arXiv:2210.17550v1 [math.OC])
    We provide a novel first-order optimization algorithm for bilinearly-coupled strongly-convex-concave minimax optimization called the Accelerated Gradient Optimistic Gradient (AG-OG) method. The main idea of our algorithm is to leverage the structure of the considered minimax problem, applying Nesterov's acceleration to the individual parts and optimistic gradient steps to the coupling part of the objective. We motivate our method by showing that its continuous-time dynamics corresponds to an organic combination of the dynamics of optimistic gradient and of Nesterov's acceleration. By discretizing the dynamics, we obtain polynomial convergence behavior in discrete time. Further enhancement of AG-OG with proper restarting allows us to achieve rate-optimal (up to a constant) convergence rates with respect to the conditioning of the coupling and individual parts, resulting in the first single-call algorithm that achieves improved convergence in the deterministic setting and rate-optimality in the stochastic setting for bilinearly-coupled minimax problems.
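    For intuition, the optimistic-gradient ingredient alone already stabilizes bilinear problems where plain simultaneous gradient descent-ascent diverges. The sketch below shows the popular single-call "extrapolation from the past" update on a toy bilinear objective; AG-OG additionally applies Nesterov acceleration to the individual parts, which this sketch omits.

```python
import numpy as np

# Optimistic gradient descent-ascent on the bilinear saddle problem
# f(x, y) = x^T A y, whose unique saddle point is (0, 0) for invertible A.
rng = np.random.default_rng(0)
A = np.eye(5) + 0.1 * rng.normal(size=(5, 5))   # well-conditioned coupling
x, y = rng.normal(size=5), rng.normal(size=5)
gx_prev, gy_prev = np.zeros(5), np.zeros(5)
eta = 0.25 / np.linalg.norm(A, 2)

for _ in range(2000):
    gx, gy = A @ y, A.T @ x            # grad_x f and grad_y f
    x = x - eta * (2 * gx - gx_prev)   # descent with the optimistic correction
    y = y + eta * (2 * gy - gy_prev)   # ascent step, same correction
    gx_prev, gy_prev = gx, gy

print(np.linalg.norm(x), np.linalg.norm(y))  # both decay toward zero
```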
    Gaussian Process Hydrodynamics. (arXiv:2209.10707v2 [physics.flu-dyn] UPDATED)
    We present a Gaussian Process (GP) approach (Gaussian Process Hydrodynamics, GPH) for approximating the solution of the Euler and Navier-Stokes equations. As in Smoothed Particle Hydrodynamics (SPH), GPH is a Lagrangian particle-based approach involving the tracking of a finite number of particles transported by the flow. However, these particles do not represent mollified particles of matter but carry discrete information about the continuous flow. Closure is achieved by placing a divergence-free GP prior $\xi$ on the velocity field and conditioning on vorticity at particle locations. Known physics (e.g., the Richardson cascade and velocity-increment power laws) is incorporated into the GP prior through physics-informed additive kernels. This is equivalent to expressing $\xi$ as a sum of independent GPs $\xi^l$, which we call modes, acting at different scales. This approach leads to a quantitative analysis of the Richardson cascade through the analysis of the activation of these modes and allows us to coarse-grain turbulence in a statistical manner rather than a deterministic one. Since GPH is formulated on the vorticity equations, it does not require solving a pressure equation. By enforcing incompressibility and fluid/structure boundary conditions through the selection of the kernel, GPH requires far fewer particles than SPH. Since GPH has a natural probabilistic interpretation, numerical results come with uncertainty estimates, enabling their incorporation into a UQ pipeline and the addition/removal of particles in an adaptive manner. The proposed approach is amenable to analysis, inherits the complexity of state-of-the-art solvers for dense kernel matrices, and leads to a natural definition of turbulence as information loss. Numerical experiments support the importance of selecting physics-informed kernels and illustrate the major impact of such kernels on accuracy and stability.
    A Law of Data Separation in Deep Learning. (arXiv:2210.17020v1 [cs.LG])
    Multilayer neural networks have achieved superhuman performance in many artificial intelligence applications. However, their black-box nature obscures the underlying mechanism for transforming input data into labels throughout all layers, thus hindering architecture design for new tasks and interpretation for high-stakes decision making. We address this problem by introducing a precise law that governs how real-world deep neural networks separate data according to class membership from the bottom layers to the top layers in classification problems. This law shows that each layer roughly improves a certain measure of data separation by an \textit{equal} multiplicative factor. This law manifests in modern architectures such as AlexNet, VGGNet, and ResNet in the late phase of training. Together with the perspective of data separation, this law offers practical guidelines for designing network architectures, improving model robustness and out-of-sample performance during training, and interpreting deep learning predictions.
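    One common way to instantiate such a separation measure is the ratio of within-class to between-class scatter of a layer's features. The sketch below computes it on synthetic features whose noise shrinks with "depth", producing the equal-multiplicative-factor decay the law describes; the measure and data here are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def separation(H, labels):
    """Within- vs between-class scatter ratio of features H (N, d);
    smaller values mean better-separated classes."""
    mu = H.mean(axis=0)
    ssw = ssb = 0.0
    for c in np.unique(labels):
        Hc = H[labels == c]
        mc = Hc.mean(axis=0)
        ssw += ((Hc - mc) ** 2).sum()
        ssb += len(Hc) * ((mc - mu) ** 2).sum()
    return ssw / ssb

# Stand-in for layer-wise features: class centers plus noise that shrinks
# with "depth". Halving the noise quarters the measure, i.e. an equal
# multiplicative improvement per layer, which is the shape of the law.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=2000)
centers = rng.normal(size=(10, 64))
for layer, noise in enumerate([2.0, 1.0, 0.5, 0.25]):
    H = centers[labels] + noise * rng.normal(size=(2000, 64))
    print(layer, separation(H, labels))
```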
    Pseudo Numerical Methods for Diffusion Models on Manifolds. (arXiv:2202.09778v2 [cs.CV] UPDATED)
    Denoising Diffusion Probabilistic Models (DDPMs) can generate high-quality samples such as image and audio samples. However, DDPMs require hundreds to thousands of iterations to produce final samples. Several prior works have successfully accelerated DDPMs through adjusting the variance schedule (e.g., Improved Denoising Diffusion Probabilistic Models) or the denoising equation (e.g., Denoising Diffusion Implicit Models (DDIMs)). However, these acceleration methods cannot maintain the quality of samples and may even introduce new noise at high speedup rates, which limits their practicality. To accelerate the inference process while keeping the sample quality, we provide a fresh perspective: DDPMs should be treated as solving differential equations on manifolds. Under such a perspective, we propose pseudo numerical methods for diffusion models (PNDMs). Specifically, we show how to solve differential equations on manifolds and demonstrate that DDIMs are simple cases of pseudo numerical methods. We adapt several classical numerical methods into corresponding pseudo numerical methods and find that the pseudo linear multi-step method performs best in most situations. According to our experiments, by directly using pre-trained models on Cifar10, CelebA and LSUN, PNDMs can generate higher-quality synthetic images with only 50 steps compared with 1000-step DDIMs (a 20x speedup), significantly outperform DDIMs with 250 steps (by around 0.4 in FID), and generalize well across different variance schedules. Our implementation is available at https://github.com/luping-liu/PNDM.
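    The classical ingredient behind the pseudo linear multi-step method is the Adams-Bashforth scheme, which reuses past gradient evaluations so each new step costs a single function call. A minimal sketch on a toy ODE; PNDM replaces the Euler-style transfer inside such a scheme with a "pseudo" transfer suited to the diffusion ODE, which this sketch does not attempt.

```python
import numpy as np

def adams_bashforth4(f, x0, t0, t1, n_steps):
    """Classical 4-step linear multi-step method. The first three steps use
    Euler as a simple warm-up; afterwards each step reuses the last four
    evaluations of f, so only one new call to f is made per step."""
    h = (t1 - t0) / n_steps
    xs, fs = [x0], []
    t = t0
    for _ in range(n_steps):
        fs.append(f(xs[-1], t))
        if len(fs) < 4:
            x_next = xs[-1] + h * fs[-1]          # Euler warm-up
        else:
            f1, f2, f3, f4 = fs[-1], fs[-2], fs[-3], fs[-4]
            x_next = xs[-1] + (h / 24) * (55 * f1 - 59 * f2 + 37 * f3 - 9 * f4)
        xs.append(x_next)
        t += h
    return np.array(xs)

# Toy ODE dx/dt = -x with exact solution exp(-t).
sol = adams_bashforth4(lambda x, t: -x, 1.0, 0.0, 5.0, 100)
print(sol[-1], np.exp(-5.0))   # close agreement
```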
    Out-of-distribution generalization for learning quantum dynamics. (arXiv:2204.10268v2 [quant-ph] UPDATED)
    Generalization bounds are a critical tool to assess the training data requirements of Quantum Machine Learning (QML). Recent work has established guarantees for in-distribution generalization of quantum neural networks (QNNs), where training and testing data are assumed to be drawn from the same data distribution. However, there are currently no results on out-of-distribution generalization in QML, where we require a trained model to perform well even on data drawn from a distribution different from the training distribution. In this work, we prove out-of-distribution generalization for the task of learning an unknown unitary using a QNN and for a broad class of training and testing distributions. In particular, we show that one can learn the action of a unitary on entangled states using only product state training data. We numerically illustrate this by showing that the evolution of a Heisenberg spin chain can be learned using only product training states. Since product states can be prepared using only single-qubit gates, this advances the prospects of learning quantum dynamics using near term quantum computers and quantum experiments, and further opens up new methods for both the classical and quantum compilation of quantum circuits.
    Ensemble transport smoothing -- Part 2: nonlinear updates. (arXiv:2210.17435v1 [stat.ME])
    Smoothing is a specialized form of Bayesian inference for state-space models that characterizes the posterior distribution of a collection of states given an associated sequence of observations. Our companion manuscript proposes a general framework for transport-based ensemble smoothing, which includes linear Kalman-type smoothers as special cases. Here, we build on this foundation to realize and demonstrate nonlinear backward ensemble transport smoothers. We discuss parameterization and regularization of the associated transport maps, and then examine the performance of these smoothers for nonlinear and chaotic dynamical systems that exhibit non-Gaussian behavior. In these settings, our nonlinear transport smoothers yield lower estimation error than conventional linear smoothers and state-of-the-art iterative ensemble Kalman smoothers, for comparable numbers of model evaluations.
    Federated X-Armed Bandit. (arXiv:2205.15268v2 [stat.ML] UPDATED)
    This work establishes the first framework of federated $\mathcal{X}$-armed bandit, where different clients face heterogeneous local objective functions defined on the same domain and are required to collaboratively figure out the global optimum. We propose the first federated algorithm for such problems, named \texttt{Fed-PNE}. By utilizing the topological structure of the global objective inside the hierarchical partitioning and the weak smoothness property, our algorithm achieves sublinear cumulative regret with respect to both the number of clients and the evaluation budget. Meanwhile, it requires only logarithmic communication between the central server and clients, protecting client privacy. Experimental results on synthetic functions and real datasets validate the advantages of \texttt{Fed-PNE} over single-client algorithms and federated multi-armed bandit algorithms.
    Adaptive Selection of the Optimal Strategy to Improve Precision and Power in Randomized Trials. (arXiv:2210.17453v1 [stat.ME])
    Benkeser et al. demonstrate how adjustment for baseline covariates in randomized trials can meaningfully improve precision for a variety of outcome types, including binary, ordinal, and time-to-event. Their findings build on a long history, starting in 1932 with R.A. Fisher and including the more recent endorsements by the U.S. Food and Drug Administration and the European Medicines Agency. Here, we address an important practical consideration: how to select the adjustment approach -- which variables and in which form -- to maximize precision, while maintaining nominal confidence interval coverage. Balzer et al. previously proposed, evaluated, and applied Adaptive Prespecification to flexibly select, from a prespecified set, the variables that maximize empirical efficiency in small randomized trials (N<40). To avoid overfitting with few randomized units, adjustment was previously limited to a single covariate in a working generalized linear model (GLM) for the expected outcome and a single covariate in a working GLM for the propensity score. Here, we tailor Adaptive Prespecification to trials with many randomized units. Specifically, using V-fold cross-validation and the squared influence curve as the loss function, we select from an expanded set of candidate algorithms, including both parametric and semi-parametric methods, the optimal combination of estimators of the expected outcome and known propensity score. Using simulations, under a variety of data generating processes, we demonstrate the dramatic gains in precision offered by our novel approach.
    On the power of conditional independence testing under model-X. (arXiv:2005.05506v5 [math.ST] UPDATED)
    For testing conditional independence (CI) of a response Y and a predictor X given covariates Z, the recently introduced model-X (MX) framework has been the subject of active methodological research, especially in the context of MX knockoffs and their successful application to genome-wide association studies. In this paper, we study the power of MX CI tests, yielding quantitative insights into the role of machine learning and providing evidence in favor of using likelihood-based statistics in practice. Focusing on the conditional randomization test (CRT), we find that its conditional mode of inference allows us to reformulate it as testing a point null hypothesis involving the conditional distribution of X. The Neyman-Pearson lemma then implies that a likelihood-based statistic yields the most powerful CRT against a point alternative. We also obtain a related optimality result for MX knockoffs. Switching to an asymptotic framework with arbitrarily growing covariate dimension, we derive an expression for the limiting power of the CRT against local semiparametric alternatives in terms of the prediction error of the machine learning algorithm on which its test statistic is based. Finally, we exhibit a resampling-free test with uniform asymptotic Type-I error control under the assumption that only the first two moments of X given Z are known, a significant relaxation of the MX assumption.
    Latent Multimodal Functional Graphical Model Estimation. (arXiv:2210.17237v1 [stat.ME])
    Joint multimodal functional data acquisition, where functional data from multiple modes are measured simultaneously from the same subject, has emerged as an exciting modern approach enabled by recent engineering breakthroughs in the neurological and biological sciences. One prominent motivation to acquire such data is to enable new discoveries of the underlying connectivity by combining multimodal signals. Despite the scientific interest, there remains a gap in principled statistical methods for estimating the graph underlying multimodal functional data. To this end, we propose a new integrative framework that models the data generation process and identifies operators mapping from the observation space to the latent space. We then develop an estimator that simultaneously estimates the transformation operators and the latent graph. This estimator is based on the partial correlation operator, which we rigorously extend from the multivariate to the functional setting. Our procedure is provably efficient, with the estimator converging to a stationary point with quantifiable statistical error. Furthermore, we show recovery of the latent graph under mild conditions. Our work is applied to analyze simultaneously acquired multimodal brain imaging data where the graph indicates functional connectivity of the brain. We present simulation and empirical results that support the benefits of joint estimation.
    sigmoidF1: A Smooth F1 Score Surrogate Loss for Multilabel Classification. (arXiv:2108.10566v3 [cs.LG] UPDATED)
    Multiclass multilabel classification is the task of attributing multiple labels to examples via predictions. Current models reduce the multilabel setting to either multiple binary classifications or a single multiclass classification, allowing the use of existing loss functions (sigmoid, cross-entropy, logistic, etc.). These reductions do not accommodate the prediction of varying numbers of labels per example, and the underlying losses are distant estimates of the performance metrics. We propose a loss function, sigmoidF1, which is an approximation of the F1 score that (1) is smooth and tractable for stochastic gradient descent, (2) naturally approximates a multilabel metric, and (3) estimates label propensities and label counts. We show that any confusion matrix metric can be formulated with a smooth surrogate. We evaluate the proposed loss function on text and image datasets, and with a variety of metrics, to account for the complexity of multilabel classification evaluation. sigmoidF1 outperforms other loss functions on one text and two image datasets and across several metrics. These results show the effectiveness of using inference-time metrics as loss functions for non-trivial classification problems like multilabel classification.
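    The core construction is easy to sketch: pass logits through a (possibly tempered) sigmoid to get soft label counts, then form a smooth F1 from soft true/false positives and negatives. A minimal PyTorch sketch of this idea (the temperature and offset values here are illustrative, not the paper's exact settings):

        import torch

        def sigmoid_f1_loss(logits, targets, beta=1.0, eta=0.0):
            """Smooth F1 surrogate via soft confusion-matrix counts.

            logits, targets: (batch, num_labels), targets as floats in {0, 1};
            beta (temperature) and eta (offset) are illustrative hyperparameters.
            """
            s = torch.sigmoid(beta * (logits + eta))  # soft predictions in (0, 1)
            tp = (s * targets).sum(dim=0)             # soft true positives per label
            fp = (s * (1 - targets)).sum(dim=0)       # soft false positives
            fn = ((1 - s) * targets).sum(dim=0)       # soft false negatives
            f1 = 2 * tp / (2 * tp + fp + fn + 1e-12)  # smooth per-label F1
            return 1 - f1.mean()                      # minimize 1 - macro soft-F1

    Because every operation is differentiable, the same pattern extends to any confusion-matrix metric, which is the paper's broader point.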
    Simulating Diffusion Bridges with Score Matching. (arXiv:2111.07243v2 [stat.CO] UPDATED)
    We consider the problem of simulating diffusion bridges, which are diffusion processes that are conditioned to initialize and terminate at two given states. The simulation of diffusion bridges has applications in diverse scientific fields and plays a crucial role in the statistical inference of discretely-observed diffusions. This is known to be a challenging problem that has received much attention in the last two decades. This article contributes to this rich body of literature by presenting a new avenue to obtain diffusion bridge approximations. Our approach is based on a backward time representation of a diffusion bridge, which may be simulated if one can time-reverse the unconditioned diffusion. We introduce a variational formulation to learn this time-reversal with function approximation and rely on a score matching method to circumvent intractability. Another iteration of our proposed methodology approximates Doob's $h$-transform defining the forward time representation of a diffusion bridge. We discuss algorithmic considerations and extensions, and present numerical results on an Ornstein--Uhlenbeck process, a model from financial econometrics for interest rates, and a model from genetics for cell differentiation and development to illustrate the effectiveness of our approach.
    Variational Inference Aided Estimation of Time Varying Channels. (arXiv:2210.17177v1 [eess.SP])
    One way to improve the estimation of time-varying channels is to incorporate knowledge of previous observations. In this context, Dynamical VAEs (DVAEs) constitute a promising deep learning (DL) framework well suited to learning the distribution of time series data. We introduce a new DVAE architecture, called k-MemoryMarkovVAE (k-MMVAE), whose sparsity can be controlled by an additional memory parameter. Following the approach in [1], we derive a k-MMVAE-aided channel estimator which takes temporal correlations of successive observations into account. The results are evaluated on channels simulated with QuaDRiGa and show that the k-MMVAE-aided channel estimator clearly outperforms other machine learning (ML) aided estimators which are either memoryless or naively extended to time-varying channels without major adaptations.
    MSGNN: A Spectral Graph Neural Network Based on a Novel Magnetic Signed Laplacian. (arXiv:2209.00546v3 [stat.ML] UPDATED)
    Signed and directed networks are ubiquitous in real-world applications. However, there has been relatively little work proposing spectral graph neural networks (GNNs) for such networks. Here we introduce a signed directed Laplacian matrix, which we call the magnetic signed Laplacian, as a natural generalization of both the signed Laplacian on signed graphs and the magnetic Laplacian on directed graphs. We then use this matrix to construct a novel efficient spectral GNN architecture and conduct extensive experiments on both node clustering and link prediction tasks. In these experiments, we consider tasks related to signed information, tasks related to directional information, and tasks related to both signed and directional information. We demonstrate that our proposed spectral GNN is effective for incorporating both signed and directional information, and attains leading performance on a wide range of data sets. Additionally, we provide a novel synthetic network model, which we refer to as the signed directed stochastic block model, and a number of novel real-world data sets based on lead-lag relationships in financial time series.
    Lipschitz regularized gradient flows and latent generative particles. (arXiv:2210.17230v1 [stat.ML])
    Lipschitz regularized f-divergences are constructed by imposing a bound on the Lipschitz constant of the discriminator in the variational representation. They interpolate between the Wasserstein metric and f-divergences and provide a flexible family of loss functions for non-absolutely continuous (e.g. empirical) distributions, possibly with heavy tails. We construct Lipschitz regularized gradient flows on the space of probability measures based on these divergences. Examples of such gradient flows are Lipschitz regularized Fokker-Planck and porous medium partial differential equations (PDEs) for the Kullback-Leibler and alpha-divergences, respectively. The regularization corresponds to imposing a Courant-Friedrichs-Lewy numerical stability condition on the PDEs. For empirical measures, the Lipschitz regularization on gradient flows induces a numerically stable transporter/discriminator particle algorithm, where the generative particles are transported along the gradient of the discriminator. The gradient structure leads to a regularized Fisher information (particle kinetic energy) used to track the convergence of the algorithm. The Lipschitz regularized discriminator can be implemented via neural network spectral normalization and the particle algorithm generates approximate samples from possibly high-dimensional distributions known only from data. Notably, our particle algorithm can generate synthetic data even in small sample size regimes. A new data processing inequality for the regularized divergence allows us to combine our particle algorithm with representation learning, e.g. autoencoder architectures. The resulting algorithm yields markedly improved generative properties in terms of efficiency and quality of the synthetic samples. From a statistical mechanics perspective the encoding can be interpreted dynamically as learning a better mobility for the generative particles.
    Training Neural Networks for Sequential Change-point Detection. (arXiv:2210.17312v1 [cs.LG])
    Detecting an abrupt distributional shift in a data stream, known as change-point detection, is a fundamental problem in statistics and signal processing. We present a new approach to online change-point detection that trains neural networks (NN) and sequentially accumulates the detection statistic by evaluating the trained discriminating function on test samples via a CUSUM recursion. The idea rests on the observation that training a neural network with the logistic loss can recover the log-likelihood ratio. We demonstrate the strong performance of NN-CUSUM in detecting change-points in high-dimensional data using both synthetic and real-world data.
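    The CUSUM recursion itself is one line: accumulate the discriminator's scores, reflect at zero, and declare a change when the running statistic crosses a threshold. A minimal sketch assuming a scoring function with negative drift before the change and positive drift after it (an illustration of the recursion, not the paper's full training pipeline):

        def cusum_detect(scores, threshold):
            """Online CUSUM over a stream of scores g(x_t); returns the alarm time or None.

            scores: iterable of floats from the trained discriminating function,
            assumed negative on average pre-change and positive post-change.
            """
            s = 0.0
            for t, g in enumerate(scores):
                s = max(0.0, s + g)  # CUSUM recursion with reflection at zero
                if s > threshold:
                    return t         # alarm: distributional shift detected
            return None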
    Universality of empirical risk minimization. (arXiv:2202.08832v2 [math.ST] UPDATED)
    Consider supervised learning from i.i.d. samples $\{{\boldsymbol x}_i,y_i\}_{i\le n}$ where ${\boldsymbol x}_i \in\mathbb{R}^p$ are feature vectors and $y_i \in \mathbb{R}$ are labels. We study empirical risk minimization over a class of functions that are parameterized by $\mathsf{k} = O(1)$ vectors ${\boldsymbol \theta}_1, \ldots, {\boldsymbol \theta}_{\mathsf k} \in \mathbb{R}^p$, and prove universality results both for the training and test error. Namely, under the proportional asymptotics $n,p\to\infty$, with $n/p = \Theta(1)$, we prove that the training error depends on the random features distribution only through its covariance structure. Further, we prove that the minimum test error over near-empirical risk minimizers enjoys similar universality properties. In particular, the asymptotics of these quantities can be computed -- to leading order -- under a simpler model in which the feature vectors ${\boldsymbol x}_i$ are replaced by Gaussian vectors ${\boldsymbol g}_i$ with the same covariance. Earlier universality results were limited to strongly convex learning procedures, or to feature vectors ${\boldsymbol x}_i$ with independent entries. Our results do not make any of these assumptions. Our assumptions are general enough to include feature vectors ${\boldsymbol x}_i$ that are produced by randomized featurization maps. In particular we explicitly check the assumptions for certain random features models (computing the output of a one-layer neural network with random weights) and neural tangent models (first-order Taylor approximation of two-layer networks).
    A Variational Edge Partition Model for Supervised Graph Representation Learning. (arXiv:2202.03233v3 [stat.ML] UPDATED)
    Graph neural networks (GNNs), which propagate the node features through the edges and learn how to transform the aggregated features under label supervision, have achieved great success in supervised feature extraction for both node-level and graph-level classification tasks. However, GNNs typically treat the graph structure as given and ignore how the edges are formed. This paper introduces a graph generative process to model how the observed edges are generated by aggregating the node interactions over a set of overlapping node communities, each of which contributes to the edges via a logical OR mechanism. Based on this generative model, we partition each edge into the summation of multiple community-specific weighted edges and use them to define community-specific GNNs. A variational inference framework is proposed to jointly learn a GNN-based inference network that partitions the edges into different communities, these community-specific GNNs, and a GNN-based predictor that combines community-specific GNNs for the end classification task. Extensive evaluations on real-world graph datasets have verified the effectiveness of the proposed method in learning discriminative representations for both node-level and graph-level classification tasks.
    Distributionally Robust Domain Adaptation. (arXiv:2210.16894v1 [stat.ML])
    Domain Adaptation (DA) has recently received significant attention due to its potential to adapt a learning model across source and target domains with mismatched distributions. Since DA methods rely exclusively on the given source and target domain samples, they generally yield models that are vulnerable to noise and unable to adapt to unseen samples from the target domain, which calls for DA methods that guarantee the robustness and generalization of the learned models. In this paper, we propose DRDA, a distributionally robust domain adaptation method. DRDA leverages a distributionally robust optimization (DRO) framework to learn a robust decision function that minimizes the worst-case target domain risk and generalizes to any sample from the target domain by transferring knowledge from a given labeled source domain sample. We utilize the Maximum Mean Discrepancy (MMD) metric to construct an ambiguity set of distributions that provably contains the source and target domain distributions with high probability. Hence, the risk is shown to upper bound the out-of-sample target domain loss. Our experimental results demonstrate that our formulation outperforms existing robust learning approaches.
    Single-Shot Domain Adaptation via Target-Aware Generative Augmentation. (arXiv:2210.16692v1 [cs.CV])
    The problem of adapting models from a source domain using data from any target domain of interest has gained prominence, thanks to the brittle generalization in deep neural networks. While several test-time adaptation techniques have emerged, they typically rely on synthetic data augmentations in cases of limited target data availability. In this paper, we consider the challenging setting of single-shot adaptation and explore the design of augmentation strategies. We argue that augmentations utilized by existing methods are insufficient to handle large distribution shifts, and hence propose a new approach SiSTA (Single-Shot Target Augmentations), which first fine-tunes a generative model from the source domain using a single-shot target, and then employs novel sampling strategies for curating synthetic target data. Using experiments with a state-of-the-art domain adaptation method, we find that SiSTA produces improvements as high as 20\% over existing baselines under challenging shifts in face attribute detection, and that it performs competitively to oracle models obtained by training on a larger target dataset.
    A State-Augmented Approach for Learning Optimal Resource Management Decisions in Wireless Networks. (arXiv:2210.16412v1 [cs.LG])
    We consider a radio resource management (RRM) problem in a multi-user wireless network, where the goal is to optimize a network-wide utility function subject to constraints on the ergodic average performance of users. We propose a state-augmented parameterization for the RRM policy, where alongside the instantaneous network states, the RRM policy takes as input the set of dual variables corresponding to the constraints. We provide theoretical justification for the feasibility and near-optimality of the RRM decisions generated by the proposed state-augmented algorithm. Focusing on the power allocation problem with RRM policies parameterized by a graph neural network (GNN) and dual variables sampled from the dual descent dynamics, we numerically demonstrate that the proposed approach achieves a superior trade-off between mean, minimum, and 5th percentile rates than baseline methods.
    Ice Core Dating using Probabilistic Programming. (arXiv:2210.16568v1 [stat.ML])
    Ice cores record crucial information about past climate. However, before ice core data can have scientific value, the chronology must be inferred by estimating the age as a function of depth. Under certain conditions, chemicals locked in the ice display quasi-periodic cycles that delineate annual layers. Manually counting these noisy seasonal patterns to infer the chronology can be an imperfect and time-consuming process, and does not capture uncertainty in a principled fashion. In addition, several ice cores may be collected from a region, introducing an aspect of spatial correlation between them. We present an exploration of the use of probabilistic models for automatic dating of ice cores, using probabilistic programming to showcase its use for prototyping, automatic inference and maintainability, and demonstrate common failure modes of these tools.
    Probability-Dependent Gradient Decay in Large Margin Softmax. (arXiv:2210.17145v1 [stat.ML])
    In the past few years, Softmax has become a common component in neural network frameworks. In this paper, a gradient decay hyperparameter is introduced in Softmax to control the probability-dependent gradient decay rate during training. Through theoretical analysis and empirical results on a variety of model architectures trained on MNIST, CIFAR-10/100 and SVHN, we find that the generalization performance depends significantly on how the gradient decays as the confidence probability rises, i.e., whether the gradient decreases convexly or concavely as the sample probability increases. Moreover, optimization with a small gradient decay follows a curriculum-learning-like sequence in which hard samples receive attention only after easy samples are fit with sufficient confidence, and well-separated samples gain a higher gradient to reduce intra-class distance. Based on this analysis, we provide evidence that large margin Softmax affects the local Lipschitz constraint of the loss function by regulating the probability-dependent gradient decay rate. This paper thus offers a new perspective on the relationship among large margin Softmax, local Lipschitz constraints and curriculum learning through the lens of the gradient decay rate. Besides, we propose a warm-up strategy that dynamically adjusts the Softmax loss during training, increasing the gradient decay rate from a small initial value to speed up convergence.
    Guided Conditional Diffusion for Controllable Traffic Simulation. (arXiv:2210.17366v1 [cs.RO])
    Controllable and realistic traffic simulation is critical for developing and verifying autonomous vehicles. Typical heuristic-based traffic models offer flexible control to make vehicles follow specific trajectories and traffic rules. On the other hand, data-driven approaches generate realistic and human-like behaviors, improving transfer from simulated to real-world traffic. However, to the best of our knowledge, no traffic model offers both controllability and realism. In this work, we develop a conditional diffusion model for controllable traffic generation (CTG) that allows users to control desired properties of trajectories at test time (e.g., reach a goal or follow a speed limit) while maintaining realism and physical feasibility through enforced dynamics. The key technical idea is to leverage recent advances from diffusion modeling and differentiable logic to guide generated trajectories to meet rules defined using signal temporal logic (STL). We further extend guidance to multi-agent settings and enable interaction-based rules like collision avoidance. CTG is extensively evaluated on the nuScenes dataset for diverse and composite rules, demonstrating improvement over strong baselines in terms of the controllability-realism tradeoff.
    Exact and Approximate Conformal Inference in Multiple Dimensions. (arXiv:2210.17405v1 [stat.ML])
    It is common in machine learning to estimate a response y given covariate information x. However, these predictions alone do not quantify any uncertainty associated with said predictions. One way to overcome this deficiency is with conformal inference methods, which construct a set containing the unobserved response y with a prescribed probability. Unfortunately, even with one-dimensional responses, conformal inference is computationally expensive despite recent encouraging advances. In this paper, we explore the multidimensional response case within a regression setting, delivering exact derivations of conformal inference p-values when the predictive model can be described as a linear function of y. Additionally, we propose different efficient ways of approximating the conformal prediction region for non-linear predictors while preserving computational advantages. We also provide empirical justification for these approaches using a real-world data example.
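    For readers new to the area, the one-dimensional building block that this work generalizes can be sketched with standard split conformal prediction (a simplified relative of the full-conformal procedures studied in the paper; the helper below is illustrative and assumes a scikit-learn-style regressor):

        import numpy as np

        def split_conformal_interval(model, X_cal, y_cal, x_new, alpha=0.1):
            """Split conformal: calibrate absolute residuals, then inflate the prediction.

            model: any fitted regressor with .predict; coverage is approximately 1 - alpha.
            """
            residuals = np.abs(y_cal - model.predict(X_cal))  # calibration scores
            n = len(residuals)
            k = int(np.ceil((n + 1) * (1 - alpha)))           # conformal quantile index
            q = np.sort(residuals)[min(k, n) - 1]
            pred = model.predict(x_new.reshape(1, -1))[0]
            return pred - q, pred + q

    The computational difficulty the paper addresses arises when moving beyond this split setting to full conformal inference with multidimensional responses.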
    Robust Data Valuation via Variance Reduced Data Shapley. (arXiv:2210.16835v1 [stat.ML])
    Data valuation, especially quantifying data value in algorithmic prediction and decision-making, is a fundamental problem in data trading scenarios. The most widely used method is to define the data Shapley value and approximate it by means of a permutation sampling algorithm. To compensate for the large estimation variance of permutation sampling, which hinders the development of data marketplaces, we propose a more robust data valuation method using stratified sampling, named variance reduced data Shapley (VRDS for short). We theoretically show how to stratify, how many samples to draw at each stratum, and provide a sample complexity analysis of VRDS. Finally, the effectiveness of VRDS is illustrated on different types of datasets and in data removal applications.
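    For context, the permutation-sampling baseline whose variance VRDS targets looks like the following sketch (the stratification scheme itself follows the paper; utility here is any set function, e.g. validation accuracy of a model trained on the subset):

        import random

        def permutation_data_shapley(n, utility, num_perms=100):
            """Monte Carlo data Shapley via random permutations (the high-variance baseline).

            utility(subset) -> float, e.g. validation accuracy after training on subset.
            Each point's value is its average marginal contribution over permutations.
            """
            values = [0.0] * n
            for _ in range(num_perms):
                perm = random.sample(range(n), n)
                subset = set()
                prev_u = utility(subset)
                for i in perm:
                    subset.add(i)
                    u = utility(subset)
                    values[i] += (u - prev_u) / num_perms  # marginal contribution of i
                    prev_u = u
            return values

    Stratifying these marginal contributions by coalition size, as VRDS does, reduces the variance of each estimate without changing its expectation.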
    PyTorch Geometric Signed Directed: A Software Package on Graph Neural Networks for Signed and Directed Graphs. (arXiv:2202.10793v4 [cs.LG] UPDATED)
    Networks are ubiquitous in many real-world applications (e.g., social networks encoding trust/distrust relationships, correlation networks arising from time series data). While many networks are signed or directed, or both, there is a lack of unified software packages on graph neural networks (GNNs) specially designed for signed and directed networks. In this paper, we present PyTorch Geometric Signed Directed (PyGSD), a software package which fills this gap. Along the way, we also provide a brief review surveying typical tasks, loss functions and evaluation metrics in the analysis of signed and directed networks, discuss data used in related experiments, provide an overview of methods proposed, and evaluate the implemented methods with experiments. The deep learning framework consists of easy-to-use GNN models, synthetic and real-world data, as well as task-specific evaluation metrics and loss functions for signed and directed networks. As an extension library for PyG, our proposed software is maintained with open-source releases, detailed documentation, continuous integration, unit tests and code coverage checks. Our code is publicly available at \url{https://github.com/SherylHYX/pytorch_geometric_signed_directed}.
    Planning to the Information Horizon of BAMDPs via Epistemic State Abstraction. (arXiv:2210.16872v1 [cs.LG])
    The Bayes-Adaptive Markov Decision Process (BAMDP) formalism pursues the Bayes-optimal solution to the exploration-exploitation trade-off in reinforcement learning. As the computation of exact solutions to Bayesian reinforcement-learning problems is intractable, much of the literature has focused on developing suitable approximation algorithms. In this work, before diving into algorithm design, we first define, under mild structural assumptions, a complexity measure for BAMDP planning. As efficient exploration in BAMDPs hinges upon the judicious acquisition of information, our complexity measure highlights the worst-case difficulty of gathering information and exhausting epistemic uncertainty. To illustrate its significance, we establish a computationally-intractable, exact planning algorithm that takes advantage of this measure to achieve more efficient planning. We then conclude by introducing a specific form of state abstraction that has the potential to reduce BAMDP complexity and that gives rise to a computationally-tractable, approximate planning algorithm.
    Linear regression with partially mismatched data: local search with theoretical guarantees. (arXiv:2106.02175v3 [math.OC] UPDATED)
    Linear regression is a fundamental modeling tool in statistics and related fields. In this paper, we study an important variant of linear regression in which the predictor-response pairs are partially mismatched. We use an optimization formulation to simultaneously learn the underlying regression coefficients and the permutation corresponding to the mismatches. The combinatorial structure of the problem leads to computational challenges. We propose and study a simple greedy local search algorithm for this optimization problem that enjoys strong theoretical guarantees and appealing computational performance. We prove that under a suitable scaling of the number of mismatched pairs compared to the number of samples and features, and certain assumptions on problem data; our local search algorithm converges to a nearly-optimal solution at a linear rate. In particular, in the noiseless case, our algorithm converges to the global optimal solution with a linear convergence rate. Based on this result, we prove an upper bound for the estimation error of the parameter. We also propose an approximate local search step that allows us to scale our approach to much larger instances. We conduct numerical experiments to gather further insights into our theoretical results, and show promising performance gains compared to existing approaches.
    Multiple Testing Framework for Out-of-Distribution Detection. (arXiv:2206.09522v3 [stat.ML] UPDATED)
    We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a learning algorithm's output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the learning algorithm, which provides insights for the construction of powerful tests for OOD detection. We propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the learning algorithm using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different types of OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural networks.
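    The basic ingredient is a conformal p-value: rank a test statistic against the same statistic computed on held-out in-distribution data. A minimal sketch of one such p-value and a naive Fisher combination of several statistics (illustrative only; Fisher's method assumes independent p-values, which generally fails here, and the paper develops a combining procedure with proper guarantees):

        import numpy as np
        from scipy import stats

        def conformal_pvalue(cal_scores, test_score):
            """p-value for H0: test point is in-distribution.

            cal_scores: statistic on held-out in-distribution data, where
            larger scores indicate 'more OOD'.
            """
            n = len(cal_scores)
            return (1 + np.sum(cal_scores >= test_score)) / (n + 1)

        def combine_fisher(pvalues):
            """Fisher's method; valid only for independent p-values (naive here)."""
            stat = -2 * np.sum(np.log(pvalues))
            return stats.chi2.sf(stat, df=2 * len(pvalues))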
    Minimax Optimal Fair Regression under Linear Model. (arXiv:2206.11546v2 [math.ST] UPDATED)
    We investigate the minimax optimal error of a fair regression problem under a linear model employing demographic parity as a fairness constraint. As a tractable demographic parity constraint, we introduce $(\alpha,\delta)$-fairness consistency, meaning that the quantified unfairness decreases at rate at most $n^{-\alpha}$ with probability at least $1-\delta$, where $n$ is the sample size. In other words, the consistently fair algorithm eventually outputs a regressor satisfying the demographic parity constraint with high probability as $n$ tends to infinity. As a result of our analyses, we find that the minimax optimal error under the $(\alpha,\delta)$-fairness consistency constraint is $\Theta(\frac{dM}{n})$ provided that $\alpha \le \frac{1}{2}$, where $d$ is the dimensionality, and $M$ is the number of groups induced from the sensitive attributes.
    Simultaneous off-the-grid learning of mixtures issued from a continuous dictionary. (arXiv:2210.16311v1 [stat.ML])
    In this paper we observe a set, possibly a continuum, of signals corrupted by noise. Each signal is a finite mixture of an unknown number of features belonging to a continuous dictionary. The continuous dictionary is parametrized by a real non-linear parameter. We shall assume that the signals share an underlying structure by saying that the union of active features in the whole dataset is finite. We formulate regularized optimization problems to estimate simultaneously the linear coefficients in the mixtures and the non-linear parameters of the features. The optimization problems are composed of a data fidelity term and an $(\ell_1, L_p)$-penalty. We prove high probability bounds on the prediction errors associated to our estimators. The proof is based on the existence of certificate functions. Following recent works on the geometry of off-the-grid methods, we show that such functions can be constructed provided the parameters of the active features are pairwise separated by a constant with respect to a Riemannian metric. When the number of signals is finite and the noise is assumed Gaussian, we give refinements of our results for p = 1 and p = 2 using tail bounds on suprema of Gaussian and $\chi^2$ random processes. When p = 2, our prediction error reaches the rates obtained by the Group-Lasso estimator in the multi-task linear regression model.  ( 3 min )
    Reformulating van Rijsbergen's $F_{\beta}$ metric for weighted binary cross-entropy. (arXiv:2210.16458v1 [stat.ML])
    The separation of performance metrics from gradient based loss functions may not always give optimal results and may miss vital aggregate information. This paper investigates incorporating a performance metric alongside differentiable loss functions to inform training outcomes. The goal is to guide model performance and interpretation by assuming statistical distributions on this performance metric for dynamic weighting. The focus is on van Rijsbergen's $F_{\beta}$ metric -- a popular choice for gauging classification performance. Through distributional assumptions on the $F_{\beta}$, an intermediary link can be established to the standard binary cross-entropy via dynamic penalty weights. First, the $F_{\beta}$ metric is reformulated to facilitate assuming statistical distributions, with accompanying proofs for the cumulative distribution function. These probabilities are used within a knee curve algorithm to find an optimal $\beta$ or $\beta_{opt}$. This $\beta_{opt}$ is used as a weight or penalty in the proposed weighted binary cross-entropy. Experimentation on publicly available data with imbalanced classes mostly yields better and more interpretable results than the baseline. For example, for the IMDB text data with known labeling errors, a 14% boost is shown. This methodology can accelerate training and provide better interpretation.  ( 2 min )
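    The end product is a weighted binary cross-entropy where the weight derives from the selected $\beta_{opt}$; the surrounding machinery (distributional assumptions on $F_{\beta}$, the knee-curve search) is the paper's contribution, but the loss itself can be sketched directly (the mapping from $\beta_{opt}$ to the weight below is illustrative, echoing how $F_{\beta}$ weights recall by $\beta^2$, and may differ from the paper's exact scheme):

        import torch

        def weighted_bce(logits, targets, beta_opt):
            """BCE with a beta_opt-derived penalty on the positive class.

            beta_opt > 1 up-weights positive-class (recall-side) errors,
            beta_opt < 1 favors precision; beta_opt**2 is an illustrative mapping.
            """
            pos_weight = torch.tensor([beta_opt ** 2])
            return torch.nn.functional.binary_cross_entropy_with_logits(
                logits, targets, pos_weight=pos_weight
            )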
    Beyond calibration: estimating the grouping loss of modern neural networks. (arXiv:2210.16315v1 [cs.LG])
    Good decision making requires machine-learning models to provide trustworthy confidence scores. To this end, recent work has focused on miscalibration, i.e., the over- or under-confidence of model scores. Yet, contrary to widespread belief, calibration is not enough: even a classifier with the best possible accuracy and perfect calibration can have confidence scores far from the true posterior probabilities. This is due to the grouping loss, created by samples with the same confidence scores but different true posterior probabilities. Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. While there are many estimators of the calibration loss, none exists for the grouping loss in standard settings. Here, we propose an estimator to approximate the grouping loss. We use it to study modern neural network architectures in vision and NLP. We find that the grouping loss varies markedly across architectures, and that it is a key model-comparison factor across the most accurate, calibrated models. We also show that distribution shifts lead to high grouping loss.  ( 2 min )
    Prediction Sets for High-Dimensional Mixture of Experts Models. (arXiv:2210.16710v1 [math.ST])
    Large datasets make it possible to build predictive models that can capture heterogenous relationships between the response variable and features. The mixture of high-dimensional linear experts model posits that observations come from a mixture of high-dimensional linear regression models, where the mixture weights are themselves feature-dependent. In this paper, we show how to construct valid prediction sets for an $\ell_1$-penalized mixture of experts model in the high-dimensional setting. We make use of a debiasing procedure to account for the bias induced by the penalization and propose a novel strategy for combining intervals to form a prediction set with coverage guarantees in the mixture setting. Synthetic examples and an application to the prediction of critical temperatures of superconducting materials show our method to have reliable practical performance.  ( 2 min )
    Nonlinear Causal Discovery via Kernel Anchor Regression. (arXiv:2210.16775v1 [stat.ML])
    Learning causal relationships is a fundamental problem in science. Anchor regression has been developed to address this problem for a large class of causal graphical models, though the relationships between the variables are assumed to be linear. In this work, we tackle the nonlinear setting by proposing kernel anchor regression (KAR). Beyond the natural formulation using a classic two-stage least square estimator, we also study an improved variant that involves nonparametric regression in three separate stages. We provide convergence results for the proposed KAR estimators and the identifiability conditions for KAR to learn the nonlinear structural equation models (SEM). Experimental results demonstrate the superior performances of the proposed KAR estimators over existing baselines.  ( 2 min )

  • Open

    How Dalle 2 AI inspired logo
    submitted by /u/tvojamatka [link] [comments]  ( 41 min )
    AI Dream 100 - Out of Body Experience Smooth Dream
    submitted by /u/LordPewPew777 [link] [comments]  ( 41 min )
    New AI shows taxi drivers which routes are predicted to have highest demand; this improves productivity by reducing cruising time, and narrows the productivity gap between high- and low-skilled drivers by 14%.
    submitted by /u/fotogneric [link] [comments]  ( 42 min )
    [Survey] Artificial Intelligence: Finance, insurance, ethics and bias
    Hi! We are a group of students from Politecnico di Milano. We are conducting research on the applications of Artificial Intelligence in the fields of insurance and finance, with a focus on cognitive biases and ethical issues. Don't worry, the survey is for everyone (even if you are not an expert). All the data will be collected in anonymous form and for academic purposes only. ENGLISH https://forms.gle/YkbDgttRpDaFAjNz6 ITALIANO https://forms.gle/R56S93ibdm1PA7fMA Thank you for your contribution! We would also love for this post to become a space for discussion, where you can share your opinions, thoughts and points of view. submitted by /u/Working-Intern-6802 [link] [comments]  ( 41 min )
    The Power of Artificial Intelligence: Why We Should Be Paying Attention
    submitted by /u/liquidOCELOTY-T [link] [comments]  ( 41 min )
    What does AI think about AI
    So I’ve asked a GPT-3 model to write an article about the future of AI… my mind was blown with the results… The Future of AI - From the Perspective of an AI submitted by /u/nunoheart [link] [comments]  ( 41 min )
    The power of wide transformers models
    submitted by /u/bendee983 [link] [comments]  ( 42 min )
    [R] RTFormer : Real-Time Semantic Segmentation with Transformer
    Hi, I'd like to introduce a semantic segmentation model named RTFormer. This might be of some help to you. Hope you enjoy it. This paper proposes RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, cross-resolution attention gathers global context information for the high-resolution branch more efficiently by spreading the high-level knowledge learned in the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of the proposed RTFormer: it achieves state-of-the-art results on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K. Official code: https://github.com/PaddlePaddle/PaddleSeg Arxiv: https://arxiv.org/abs/2210.07124 submitted by /u/Effective_Tax_2096 [link] [comments]  ( 44 min )
    What Happened in Reinforcement Learning in 2022
    From robots playing football to learning how to walk on the moon! This article brings the top 8 reinforcement learning innovations that shaped AI across several industries in 2022. https://analyticsindiamag.com/what-happened-in-reinforcement-learning-in-2022/ submitted by /u/analyticsindiam [link] [comments]  ( 41 min )
    Stable Diffusion Werewolf Model for Deforum notebook or Auto 1111
    submitted by /u/prfitofthesngularity [link] [comments]  ( 41 min )
    Has there been research on the extent to which large image models plagiarize their training data?
    I think a lot of the debate on the ethics and legal status of image generation models hinges on the question of how similar their outputs are to their training data. I have not seen very much debate on this topic; the response from most AI people is to roll their eyes, while artists don't have the expertise to back up their claims of plagiarism. But, when I looked, I found at least one example of craiyon nearly exactly reproducing one of its training images. Here are three questions that I'm curious about, and that I would love to see rigorous research on: For a given output image, can the most similar training images be recovered? On average, how many unique images do the models draw from to create their output? (if this can be measured) How does each model respond when its prompt appears verbatim in its training set? Thanks to anyone who can help inform me about these issues. submitted by /u/pcxv [link] [comments]  ( 43 min )
    Free AI Generation Platform I made, it is also a prompt discussion board
    submitted by /u/Odd-Sentence-5197 [link] [comments]  ( 40 min )
    ⚽⚽ Markerless 3D Body-Object Interaction ⚽⚽ | Tübingen + AMS are opening a novel whole-bodies/objects interaction method from multi-view RGB-D data
    submitted by /u/ai-lover [link] [comments]  ( 50 min )
    [Research] NLP Scholars Program
    Hi everyone. We recently launched our call for the scholars program, an 8-month full-time paid entry point into machine learning research to pursue curiosity-driven research with access to large-scale engineering resources and mentorship. We have intentionally structured the program to be paid and remote-first so we can support talent all across the world. Our deadline is coming up on November 7th. I would welcome any help sharing this opportunity with ML communities around the world. More details below for anyone interested: The Cohere For AI Scholars Program supports the next generation of rising stars as they embark on their research journey by providing an alternative point of entry into NLP research. Scholars will have access to a large-scale experimental framework and work alongside some of the best researchers and engineering expertise in the world. Participation is full-time, remote-first and paid. For more details, check out our blog post announcing the Scholars Program launch. Applications are open until November 7, 2022, and can be found here. For those undertaking the takehome, I would highly recommend joining our discord where we have a highly active FAQ channel for any questions about the takehome. You can find out more about our community at cohere.for.ai. submitted by /u/ml_magic_ [link] [comments]  ( 46 min )
    Stable Diffusion ⫘ The Planet of ₭₳ⱠɆӾ₵ł₳ | Crafted with AI Art
    submitted by /u/AubreBrumfield [link] [comments]  ( 40 min )
    "Wall-E" Told Through AI Generated Images
    submitted by /u/VPTZ [link] [comments]  ( 41 min )
  • Open

    [P] OpenCL backend for PyTorch - progress works with mainstream pytorch
    I'm working on a PyTorch OpenCL backend based on the dlprimitives core library. It has existed for a while, but until now it required building a custom PyTorch version. Recently the support for out-of-tree backends in PyTorch was significantly improved, and with the 1.13 release of PyTorch, the OpenCL backend can be built with ease on both Linux and even Windows. It is validated on a large number of deep learning vision networks like ResNet, GoogleNet, MobileNet and many others. It works on nVidia, AMD and even Intel GPUs and provides performance comparable to CUDA/cuDNN builds - around 70% for inference and 60% for training in comparison to native cuda/cudnn over several network architectures. submitted by /u/artyombeilis [link] [comments]  ( 62 min )
    [P] Real time causal convNet for time series
    LINK Some time ago I made a repo that allows for training causal convolutional networks. It allows for long convolutional kernels without compute time blowing up with the length of the kernel. The networks trained by this library can be deployed in a real-time setting. The weights learned by the FIR layers correspond to impulse responses, and thus one can recreate the network using real-time convolution algorithms in place of the FIR layers (for instance https://github.com/HiFi-LoFi/FFTConvolver). For large kernels, these algorithms are insanely fast compared to the naive convolution implementation normally used in NNs (they can easily do a real-time convolution of a 30s impulse response at 96 kHz on normal CPUs, for instance - not that you'd want that, but just saying). Will try to document the repo a bit better eventually... But until then feel free to ask questions :) submitted by /u/maka89 [link] [comments]  ( 60 min )
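    For anyone wondering why long kernels stay cheap: convolution in the frequency domain costs O(n log n) regardless of kernel length. A minimal causal FFT convolution in NumPy (the real-time partitioned variant in FFTConvolver adds block processing on top of this same idea):

        import numpy as np

        def causal_fft_conv(signal, kernel):
            """Causal convolution via FFT: O(n log n) in total length, independent
            of kernel length. Output sample t depends only on inputs up to t."""
            n = len(signal) + len(kernel) - 1
            nfft = 1 << (n - 1).bit_length()  # next power of two for the FFT size
            out = np.fft.irfft(
                np.fft.rfft(signal, nfft) * np.fft.rfft(kernel, nfft), nfft
            )
            return out[:len(signal)]          # truncate to the causal part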
    [N] Conference on Statistics and Data Science
    Dear Colleagues, I hope this message finds all of you well! It is with immense pleasure that we announce the organization of the 4th Conference on Statistics and Data Science (CSDS 2022). It will be held between December 1 and 3, 2022, in virtual format. More information can be found on the event website: http://www.csds2022.ufba.br/ Registration and participation are free! Important dates: Abstract Submission: Until November 6, 2022. Decision on the acceptance of the abstracts: Until November 08, 2022. Submission of the 3 minutes video and 4 slides poster: Until November 20, 2022. Registration for the papers to be included in the scientific program: Until November 20, 2022. Registration for non-presenting participants: Until November 27, 2022. Young statisticians (up to 5 years after their last academic degree) will have the possibility to compete for the “Best Poster Award on Statistics and Data Science”. See more information at: https://csds2022.ufba.br/Awards.html See you soon! submitted by /u/TheSearchForWisdom [link] [comments]  ( 62 min )
    [D] Using JavaScript for ML Training/Research (not in the browser)
    Accelerators are getting faster but Python remains slow. The investment into a true JIT for Python (CPython, the only one folks use for ML) seems to be far away. Projects like JAX, Triton and Taichi are awesome but remain somewhat brittle to use. We believe this is because there isn't a true JIT in Python (like the one found in JavaScript) that these projects can leverage. On top of full endorsement from the language's community, building a true JIT takes millions of lines of heavily interconnected code and likely hundreds of thousands if not millions of engineer hours.[1][2] There is very little to indicate this extremely expensive type of investment will come from CPython. As a hedge against CPython never becoming fast, we're creating a project called Shumai that attempts to deeply int…  ( 63 min )
    [D] When the GPU is NOT the bottleneck...?
    Hey there, so I'm trying to figure out how to significantly speed up my training (trying to 10x it) and I'm trying to work out what's going on here. I'm using PyTorch as the framework and 4 sequential layers: Dense + Conv1d + LSTM + Dense. I have a batch size of 80,000 and ran it on a K80 vs an A100, and I only saw a 14% increase in performance: in the given time frame the K80 completed 1400 epochs and the A100 completed about 1600 epochs. To me this likely means what I'm trying to do is NOT bound by the GPU at all, as the hardware should have accounted for something like a 30x increase in performance, yeah? I don't think RAM is the issue; the A100 has 80GB of HBM2 VRAM, more than I ever use. So if it's not GPU power, and not RAM, it's either CPU or storage? It seems I need to parallelize the training in order to get the speed I'm looking for. Anyone have any insight? submitted by /u/alexnasla [link] [comments]  ( 66 min )
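    One quick way to test that hypothesis is to time training steps fed by the real DataLoader against steps fed by a single cached, pre-loaded batch; if the latter is much faster, the input pipeline (CPU/storage), not the GPU, is the bottleneck. A rough sketch (model, optimizer and loss are whatever you already have):

        import time
        import torch

        def time_steps(model, optimizer, loss_fn, batches, device, n=50):
            """Average seconds per optimization step over the first n batches."""
            model.train()
            torch.cuda.synchronize(device)
            start = time.perf_counter()
            for i, (x, y) in enumerate(batches):
                x, y = x.to(device), y.to(device)
                optimizer.zero_grad()
                loss_fn(model(x), y).backward()
                optimizer.step()
                if i + 1 >= n:
                    break
            torch.cuda.synchronize(device)  # flush queued GPU work before stopping the clock
            return (time.perf_counter() - start) / n

        # Compare a run fed by your DataLoader with one fed by a single cached batch:
        #   real  = time_steps(model, opt, loss_fn, train_loader, "cuda")
        #   fixed = time_steps(model, opt, loss_fn, itertools.repeat(next(iter(train_loader))), "cuda")
        # If `fixed` is much smaller than `real`, the GPU is not the bottleneck.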
    [D] Any *good* open or semi-open/proprietary AI models that can inherit the sample voice intonations and re-create with another voice?
    (Before anyone cites the "questions which are easily googled" rule: good models aren't easily googled.) If you actually try to search, you get 9999 junk results, more than half of which are random nonsense (all AI search results look like that for some reason). submitted by /u/KeinZantezuken [link] [comments]  ( 62 min )
    [News] The Stack: 3 TB of permissively licensed source code - Hugging Face and ServiceNow Research Denis Kocetkov et al 2022
    ServiceNow and Hugging Face have released a 3.1TB dataset of permissively licensed code in 30 programming languages. This is about 4x larger than the dataset used to train GPT-3 (though obviously 'code only'), and 3x the size of CodeParrot, the next largest released code dataset. Paper: https://drive.google.com/file/d/17J-0KXTDzY9Esp-JqXYHIcy--i_7G5Bb/view https://wandb.ai/telidavies/ml-news/reports/The-Stack-BigCode-s-New-3-TB-Dataset-Of-Permissively-Licensed-Code--VmlldzoyODY1MDUy Hugging Face: https://huggingface.co/datasets/bigcode/the-stack Twitter: https://twitter.com/BigCodeProject/status/1585631176353796097 Download The Stack: https://hf.co/BigCode submitted by /u/Singularian2501 [link] [comments]  ( 63 min )
    [D] Checkpointing - save_best_only - still fault tolerance or training optimization option?
    This is a question for people who are using checkpointing or have used it in a previous project. I'm looking into fault tolerance in training and wonder about checkpointing and the save_best_only setting in Keras and PyTorch Lightning, since it reduces the frequency of saved checkpoints, possibly increasing "recovery" training time after a failure. Do you consider save_best_only a setting or tool to combat overfitting in training, or still an option for fault tolerance? submitted by /u/frankdenneman [link] [comments]  ( 62 min )
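    For concreteness, one common pattern is to keep two checkpoints: a save_best_only one for model selection and an every-epoch one purely for fault tolerance. A minimal Keras sketch (file paths are illustrative):

        from tensorflow import keras

        callbacks = [
            # Model-selection checkpoint: overwritten only when val_loss improves.
            keras.callbacks.ModelCheckpoint(
                "best.keras", monitor="val_loss", save_best_only=True
            ),
            # Fault-tolerance checkpoint: written every epoch regardless of quality.
            keras.callbacks.ModelCheckpoint("last.keras", save_best_only=False),
        ]
        # model.fit(x, y, validation_data=..., epochs=100, callbacks=callbacks)

    With only the best-so-far checkpoint on disk, a crash late in training forces you to replay every epoch since the last metric improvement, which is the recovery-time cost the question describes.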
    NN_GWO gives 0 value for weights [R]
    I have combined the Grey Wolf Optimizer (GWO) with a neural network to make a hybrid model that optimizes mean squared error for my research. However, my GWO code gives 0 values for all weights. I have looked into the code for hours but I am unable to find my mistake. If someone can spare some time it will be really helpful. I am attaching the code below.
        clear all
        clc
        load("slab.mat")
        input_val = slab(:,1:6);
        output_val = slab(:,7);
        nm_input = [];
        % Normalizing the data
        for i = 1:6  % the normalized values are correct, I have checked.
            nm_input(:,i) = (input_val(:,i) - min(input_val(:,i))) / (max(input_val(:,i)) - min(input_val(:,i)));
        end
        X = nm_input;
        nm_output = (output_val(:,1) - min(output_val(:,1))) / (max(output_val(:,1)) - min(output_val(:,1)));
        Y = nm_output;
        sz = (ceil(size(X,1))*0.…  ( 64 min )
    [D] Diffusion vs MCMC as sampling algorithms
    I was recently reviewing the diffusion methods used in Stable Diffusion and my mind wandered to Markov Chain Monte Carlo, which got me thinking - are there important theoretical similarities / differences between these methods? A bit of background: Intro to Stable Diffusion: A nice illustrated guide by Jay Alammar https://jalammar.github.io/illustrated-stable-diffusion/ Intro to MCMC: Stanford CS168 notes by Tim Roughgarden and Gregory Valiant http://timroughgarden.org/s17/l/l14.pdf The Metropolis-Hastings (MH) algorithm, a specific MCMC algorithm: https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm My own stream-of-consciousness thoughts: Function: Both diffusion and MH are sampling-based generative models that learn to produce data from a given distribution. …  ( 65 min )
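    For readers skimming the links, the Metropolis-Hastings side of the comparison fits in a few lines: propose a move, then accept with probability given by the target-density ratio. A textbook random-walk MH sketch:

        import numpy as np

        def metropolis_hastings(log_target, x0, n_samples, step=1.0):
            """Random-walk Metropolis-Hastings with a Gaussian proposal.

            log_target: log of an unnormalized target density; returns n_samples draws.
            """
            x = np.asarray(x0, dtype=float)
            logp = log_target(x)
            samples = []
            for _ in range(n_samples):
                prop = x + step * np.random.randn(*x.shape)      # symmetric proposal
                logp_prop = log_target(prop)
                if np.log(np.random.rand()) < logp_prop - logp:  # accept w.p. min(1, ratio)
                    x, logp = prop, logp_prop
                samples.append(x.copy())
            return np.array(samples)

    One salient contrast with diffusion models: MH needs (unnormalized) density evaluations at sampling time, while a trained diffusion model only needs its learned denoiser.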
    [P] Lama Cleaner: A free and open-source inpainting tool powered by SOTA AI model. It includes models like LaMa to remove any unwanted object, defect, people from your pictures and text-driven model (stable-diffusion 1.5) to replace any thing on your pictures.
    Github Repository: https://github.com/Sanster/lama-cleaner ​ https://reddit.com/link/yicyvd/video/cxv0jov6d5x91/player submitted by /u/Disastrous_Expert_22 [link] [comments]  ( 63 min )
    [D] Are Instance Norm (IN) and Batch Norm (BN) the same for a batch size of 1 with grayscale images?
    As far as the theory explains, and from my understanding, BN with a batch size of 1 and IN should be the same. Is my understanding correct? If not, could you please explain a bit? submitted by /u/rajib_ [link] [comments]  ( 58 min )
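    A quick numerical check backs this up: with a batch size of 1, BatchNorm's per-channel statistics are computed over (1, H, W), exactly matching InstanceNorm's per-sample (H, W) statistics, so in training mode with matching affine settings the outputs coincide. For example:

        import torch

        x = torch.randn(1, 1, 8, 8)  # batch size 1, single (grayscale) channel
        bn = torch.nn.BatchNorm2d(1, affine=False).train()
        inn = torch.nn.InstanceNorm2d(1, affine=False)
        print(torch.allclose(bn(x), inn(x), atol=1e-5))  # True for N=1
        # Caveat: in eval mode BN switches to running statistics, so the
        # equivalence holds only during training.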
    [P] Serverless Jupyter Labs with GPUs, CPUs and high-speed storage
    I'm looking to validate my new SaaS idea so sharing the landing page to see if I can build some interest. Kiku Labs offers a way to get Jupyter Labs with powerful compute and storage for experimenting with your data science projects. The labs are saved and restored between sessions, and you're charged based on the usage. Feel free to ask any questions. Please join the waiting list if you're interested in the product. https://kikulabs.com/ submitted by /u/kikulabs [link] [comments]  ( 65 min )
    [D] Fine tuning language models time and resources required
    I am looking for a comparison table that shows the time taken to fine tune a large language model like GPT-Neo or GPT-J. It would be nice to have dimensions such as: GPU type (e.g. V100, A100) TPU type Number of data points submitted by /u/ialuronico [link] [comments]  ( 61 min )
    [P][N] Learn Rust as ML practitioner
    Hi, we are developing a ML practitioner-friendly library to help learn Rust mimicking the sklearn API: Smartcore. You can see some examples in these notebooks. Please give feedback and consider contributing. Algorithms are implemented from scratch to be easy to understand and read through with room to grow in efficiency and portability. There are some newcomers-friendly issues and TODOs. submitted by /u/tuned-mec-is [link] [comments]  ( 63 min )
    [P] Music Synthesis Pipeline for Raw Audio [Work in Progress]
    Over the past few weeks I have been working on a pipeline for generating music from raw audio. Essentially, the idea is to use existing, well-understood image synthesis models to produce melspectrograms of a fixed size, and then use these melspectrograms to condition a vanilla encoder/decoder transformer tasked with "stretching" the melspectrograms out to their full length so they can be sent to a vocoder. As I am just performing this work on my own, I am also funding it on my own, and the cloud GPU costs are starting to become a bit steep. So, I thought I would pause for a bit, simply share what I have so far (even if it's not ready for prime time) and see if anyone has any feedback, comments or ideas. I should note that I am definitely aware of recent (and very impressive!) efforts like AudioLM and AudioGen, which use transformers to generate audio. However, the former must be primed, whereas the latter uses text-based conditioning. I don't know how well (and with what level of controllability) those techniques can be expanded to full-length songs. Conversely, a complete (albeit resized) melspectrogram for the entire musical piece does seem like it should provide a much stronger conditioning signal. And, moreover, given the work in the image synthesis literature, it does seem like one should be able to find ways of controlling the model which creates these conditioning melspectrograms, without explicit text/music pairs (which, as the AudioGen authors point out, are quite hard to come by). Also, if anyone is interested in collaborating on this project, please let me know :) Project: https://github.com/TariqAHassan/MelReformer submitted by /u/mlconvergence [link] [comments]  ( 60 min )
    [P] Explain Paper - A Better Way to Read Academic Papers
    submitted by /u/xutw21 [link] [comments]  ( 60 min )
  • Open

    Cost efficient ML inference with multi-framework models on Amazon SageMaker
    Machine learning (ML) has proven to be one of the most successful and widespread applications of technology, affecting a wide range of industries and impacting billions of users every day. With this rapid adoption of ML into every industry, companies are facing challenges in supporting low-latency predictions and with high availability while maximizing resource utilization […]  ( 10 min )
    Solve business problems end-to-end through machine learning in Amazon SageMaker JumpStart solutions
    Amazon SageMaker JumpStart provides pre-trained, open-source models for a wide range of problem types to help you get started with machine learning (ML). JumpStart also provides solution templates that set up infrastructure for common use cases, and executable example notebooks for ML with Amazon SageMaker. As a business user, you get to do the following […]  ( 11 min )
    Train gigantic models with near-linear scaling using sharded data parallelism on Amazon SageMaker
    In the pursuit of superior accuracy, deep learning models in areas such as natural language processing and computer vision have significantly grown in size in the past few years, frequently counted in tens to hundreds of billions of parameters. Training these gigantic models is challenging and requires complex distribution strategies. Data scientists and machine learning […]  ( 9 min )
  • Open

    Real-time customer behavior recommendations via session-based approach
    No content preview
    #007 — Combining traditional art with AI art
    No content preview
    AI artist showcase — Secret Project
    No content preview
    Gradient Descent Algorithm-Chain Rule-Directional Derivative
    No content preview
    Support Vector Machine-Duality Problem
    No content preview
    Crypto Trading App Development Guide for Fintech Business Founders
    No content preview
  • Open

    Another problem with A/B testing: interaction effects
    The previous post looked at a paradox with A/B testing: your final result may depend heavily on the order of your tests. This post looks at another problem with A/B testing: the inability to find interaction effects. Suppose you’re debating between putting a photo of a car or a truck on your web site, and […] Another problem with A/B testing: interaction effects first appeared on John D. Cook.  ( 6 min )
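    The inability to see interactions is easy to demonstrate with synthetic numbers (these are illustrative, not from the post): in the 2x2 layout below, each factor looks neutral on its own, yet the combinations differ sharply.

    ```python
    # Synthetic conversion rates for two binary factors (photo, headline).
    rates = {("car", "short"): 0.10, ("car", "long"): 0.20,
             ("truck", "short"): 0.20, ("truck", "long"): 0.10}

    def marginal(factor_idx, level):
        vals = [r for k, r in rates.items() if k[factor_idx] == level]
        return sum(vals) / len(vals)

    for level in ("car", "truck"):
        print(f"photo={level}: mean rate {marginal(0, level):.2f}")     # both 0.15
    for level in ("short", "long"):
        print(f"headline={level}: mean rate {marginal(1, level):.2f}")  # both 0.15

    # Standard 2x2 interaction contrast: zero only if the factors are additive.
    interaction = (rates[("car", "short")] - rates[("car", "long")]
                   - rates[("truck", "short")] + rates[("truck", "long")])
    print(f"interaction contrast: {interaction:+.2f}")  # -0.20, invisible to one-at-a-time tests
    ```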
    A/B testing and a voting paradox
    One problem with A/B testing is that your results may depend on the order of your tests. Suppose you’re testing three options: X, Y, and Z. Let’s say you have three market segments, equal in size, each with the following preferences. Segment 1: X > Y > Z. Segment 2: Y > Z > X. Segment […] A/B testing and a voting paradox first appeared on John D. Cook.  ( 7 min )
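    The excerpt truncates the third segment, so purely for illustration assume it completes the classic Condorcet cycle (Z > X > Y); a quick check of the pairwise majorities under that assumption:

    ```python
    from itertools import combinations

    segments = [["X", "Y", "Z"],   # Segment 1: X > Y > Z
                ["Y", "Z", "X"],   # Segment 2: Y > Z > X
                ["Z", "X", "Y"]]   # Segment 3 (assumed): Z > X > Y

    def prefers(ranking, a, b):
        return ranking.index(a) < ranking.index(b)

    for a, b in combinations("XYZ", 2):
        wins = sum(prefers(s, a, b) for s in segments)
        print(f"{a} beats {b} in {wins}/3 segments")
    # X beats Y and Y beats Z, each 2/3, yet Z beats X 2/3: the pairwise
    # results cycle, so sequential A/B tests can crown different winners
    # depending on the order in which options are compared.
    ```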
  • Open

    Leveraging Agility to Create Economies of Learning Mindset – Part 1
    A recent HBR article, “Why Do Chief Data Officers Have Such Short Tenures?” by Thomas Davenport, Randy Bean, and Josh King, highlighted the Chief Data Officer’s (CDO) “Data-to-Business Innovation” challenge. The main charter for the CDO is to accelerate the Data-to-Business Innovation flywheel: to guide the business in becoming more effective at leveraging data and analytics to optimize its key business and operational use cases. The post Leveraging Agility to Create Economies of Learning Mindset – Part 1 appeared first on Data Science Central.  ( 20 min )
    How Custom Stickers could help your Viral Marketing Campaign
    Yes, stickers are fun and inexpensive. They can also be very helpful for some offline marketing, yes. But how in the world could they assist your web marketing campaign? We understand if those are your thoughts. Sticker printing might not be your first thought when planning your next viral campaign. But maybe it ought to… The post How Custom Stickers could help your Viral Marketing Campaign appeared first on Data Science Central.  ( 20 min )
    Challenges to Successful AI Implementation in Healthcare
    Artificial intelligence (AI) and machine learning (ML) have received widespread interest in recent years due to their potential to set new paradigms in healthcare delivery. Machine learning is expected to transform many aspects of healthcare delivery, with radiology and pathology among the specialties set to take advantage of this technology first. The post Challenges to Successful AI Implementation in Healthcare appeared first on Data Science Central.  ( 21 min )
  • Open

    My RL thesis is basically importing RL libraries. How should I change this?
    Hey guys! I came to complain about my thesis 😅 So I'm a computer science MSc student, focusing mostly on applied deep ML and RL. More specifically, in my thesis I am using RL in an effort to learn optimal wing movements (of an unmanned flapping-wing vehicle). In terms of my daily tasks - they usually revolve around coding custom gym environments, using stable-baselines3 implementations (mostly PPO), or tweaking reward functions. Having said that, I would really love to add a bit more of a theoretical aspect to it all, as it currently feels like the role of an ML engineer rather than a researcher. Ofc, I'm not saying that I expect myself to come up with the next DQN, but diving deeper into the black-box NN architectures and suggesting novel techniques for my specific problem, for example, have the potential to be very fulfilling, in my opinion. Have some of you had similar thoughts? What have you managed to do that improved your experience? submitted by /u/hadar933  ( 49 min )
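    For readers unfamiliar with the workflow described above, here is a minimal sketch of the custom-Gym-environment-plus-PPO loop, assuming the pre-gymnasium Gym API that stable-baselines3 used at the time; the environment, its dynamics, and its reward are invented stand-ins, not the thesis code.

    ```python
    import gym
    import numpy as np
    from gym import spaces
    from stable_baselines3 import PPO

    class FlappingWingEnv(gym.Env):
        """Toy stand-in for a flapping-wing control task (illustrative only)."""

        def __init__(self):
            super().__init__()
            # action: normalized wing-beat amplitude and frequency
            self.action_space = spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)
            # observation: e.g. altitude error, vertical velocity, pitch
            self.observation_space = spaces.Box(-np.inf, np.inf, shape=(3,), dtype=np.float32)
            self.state = np.zeros(3, dtype=np.float32)
            self.t = 0

        def reset(self):
            self.state = np.random.uniform(-0.1, 0.1, 3).astype(np.float32)
            self.t = 0
            return self.state

        def step(self, action):
            # stand-in dynamics; a real env would integrate flight physics here
            self.state = (self.state + 0.05 * np.r_[action, action[:1]]).astype(np.float32)
            self.t += 1
            reward = -float(abs(self.state[0]))   # e.g. hold a target altitude
            done = self.t >= 200
            return self.state, reward, done, {}

    model = PPO("MlpPolicy", FlappingWingEnv(), verbose=0)
    model.learn(total_timesteps=10_000)
    ```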
    Need help debugging my Elden Ring gym environment
    Any help would be really appreciated: https://github.com/openai/gym/issues/3141 submitted by /u/Phat_N_Sassy33  ( 46 min )
    Multi armed bandits with variable arms?
    Dear all, I have a question on a special setting of multi-armed bandit (MAB) problems. (FYI: I am quite a newbie in MAB; I've just finished studying the UCB algorithm.) I wonder if there exists an MAB setting where the (attributes of) arms are not fixed across the time horizon (i.e., MAB with a variable set of arms). For example, suppose I have a bandit with 10 arms, each of which corresponds to a news article I may click on. As tons of news articles are generated every minute, I want to update the 10 arms, say every 10 minutes, to represent a new set of news articles (instead of adding 10 more arms). And suppose additionally that the maximum number of news articles is 10000. Is this setting valid in the MAB regime? If yes, could you guys kindly guide me to some papers? Thank you! submitted by /u/vaseline555  ( 52 min )
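    What the question describes is usually studied under names like "sleeping bandits" (per-round availability of arms) or mortal bandits (arms that are born and die), so those are good search terms. Below is a minimal sketch of UCB run over a changing availability set, with statistics kept per underlying article; it is illustrative, not taken from a specific paper.

    ```python
    import math
    import random
    from collections import defaultdict

    N_ARTICLES, ROUNDS, SLATE = 10_000, 1_000, 10
    counts, sums = defaultdict(int), defaultdict(float)
    true_ctr = [random.random() * 0.1 for _ in range(N_ARTICLES)]  # simulated click rates

    for t in range(1, ROUNDS + 1):
        available = random.sample(range(N_ARTICLES), SLATE)  # fresh slate of 10 arms

        def ucb(a):
            if counts[a] == 0:
                return float("inf")          # force one pull of any unseen article
            return sums[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a])

        arm = max(available, key=ucb)        # UCB restricted to available arms
        reward = float(random.random() < true_ctr[arm])
        counts[arm] += 1
        sums[arm] += reward
    ```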
    I miss the gym environments
    First time working with real-world data and a custom environment. I'm having nightmares. Reinforcement learning is negatively reinforcing me. But I'm at least seeing small progress, even though it's extremely small. I hope I can overcome this problem! Cheers everyone submitted by /u/FashionDude3  ( 54 min )
    SAC with auto-adjusting alpha: entropy (-alpha * log_prob) keeps getting smaller and smaller
    What can cause the entropy to diverge to an extremely small (negative) value? In particular, I found that the value of alpha becomes extremely large. submitted by /u/ad26kr  ( 48 min )
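    For context, the usual automatic-temperature update in SAC tunes alpha so the policy entropy tracks a fixed target; whenever the policy's entropy sits below the target, the alpha loss keeps pushing alpha up, which matches the symptom described. A minimal sketch of that update, with a fake batch of log-probs:

    ```python
    import torch

    action_dim = 2
    target_entropy = -float(action_dim)       # common heuristic: -|A|
    log_alpha = torch.zeros(1, requires_grad=True)
    opt = torch.optim.Adam([log_alpha], lr=3e-4)

    def alpha_step(log_prob: torch.Tensor) -> float:
        # J(alpha) = E[-alpha * (log_prob + target_entropy)]
        loss = -(log_alpha.exp() * (log_prob + target_entropy).detach()).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return float(log_alpha.exp())

    # Entropy below target means E[log_prob] + target_entropy > 0, so this
    # loss keeps increasing alpha until the policy becomes more random.
    alpha = alpha_step(torch.full((64,), 3.0))  # fake batch of log-probs
    ```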
  • Open

    GeForce RTX 40 Series Receives Massive Creator App Benefits This Week ‘In the NVIDIA Studio’
    Artists deploying the critically acclaimed GeForce RTX 4090 GPUs are primed to receive significant performance boosts in key creative apps. Plus, a special spook-tober edition of In the NVIDIA Studio features two talented 3D artists and their Halloween-themed creations this week. The post GeForce RTX 40 Series Receives Massive Creator App Benefits This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 9 min )
    Think Fast: Lotus Eletre Tops Charts in Driving and AI Compute Speeds, Powered by NVIDIA DRIVE Orin
    One of the biggest names in racing is going even bigger. Performance automaker Lotus launched its first SUV, the Eletre, earlier this week. The fully electric vehicle sacrifices little in terms of speed and outperforms when it comes to technology. It features an immersive digital cockpit, lengthy battery range of up to 370 miles and […] The post Think Fast: Lotus Eletre Tops Charts in Driving and AI Compute Speeds, Powered by NVIDIA DRIVE Orin appeared first on NVIDIA Blog.  ( 4 min )
  • Open

    Federated Learning based Energy Demand Prediction with Clustered Aggregation. (arXiv:2210.15850v1 [cs.LG])
    To reduce negative environmental impacts, power stations and energy grids need to optimize the resources required for power production. Thus, predicting the energy consumption of clients is becoming an important part of every energy management system. Energy usage information collected by the clients' smart homes can be used to train a deep neural network to predict the future energy demand. Collecting data from a large number of distributed clients for centralized model training is expensive in terms of communication resources. To take advantage of distributed data in edge systems, centralized training can be replaced by federated learning, where each client only needs to upload model updates produced by training on its local data. These model updates are aggregated into a single global model by the server. However, since different clients can have different attributes, their model updates can have widely varying weights, and as a result it can take a long time for the aggregated global model to converge. To speed up the convergence process, we can apply clustering to group clients based on their properties and aggregate model updates from the same cluster together to produce a cluster-specific global model. In this paper, we propose a recurrent neural network based energy demand predictor, trained with federated learning on clustered clients to take advantage of distributed data and speed up the convergence process.  ( 3 min )
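    As a rough illustration of the clustered-aggregation idea (not the paper's exact procedure), one can cluster clients by an attribute vector and run FedAvg within each cluster:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    n_clients, dim, k = 20, 100, 3
    client_profiles = rng.normal(size=(n_clients, 8))   # e.g. mean consumption features
    client_updates = rng.normal(size=(n_clients, dim))  # flattened model deltas

    # Group similar clients, then average model updates within each cluster.
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(client_profiles)
    cluster_models = {c: client_updates[labels == c].mean(axis=0) for c in range(k)}
    # Each client then continues training from its own cluster's global model.
    ```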
    Mining contrast sets in classification, regression, and survival data by fusing separate and conquer models. (arXiv:2204.00497v2 [cs.DB] UPDATED)
    Identifying differences between groups is one of the most important knowledge discovery problems. The procedure, also known as contrast set mining, is applied in a wide range of areas like medicine, industry, or economics. In the paper we present RuleKit-CS, an algorithm for contrast set mining based on sequential covering - a well-established heuristic for decision rule induction. The fusion of multiple passes accompanied by an attribute penalization scheme allows the generation of contrast sets describing the same examples with different attributes, distinguishing the presented approach from standard sequential covering. The ability to identify contrast sets in regression and survival data sets, a feature not provided by existing algorithms, further extends the usability of RuleKit-CS. Experiments on over 130 data sets from various areas and detailed analysis of selected cases confirmed RuleKit-CS to be a useful tool for discovering differences between defined groups. The algorithm was implemented as a part of the RuleKit suite available at GitHub under the GNU AGPL 3 licence (https://github.com/adaa-polsl/RuleKit). Keywords: Contrast sets, Sequential covering, Model fusion, Rule induction, Regression, Survival, Knowledge discovery  ( 2 min )
    Understanding Gradient Descent on Edge of Stability in Deep Learning. (arXiv:2205.09745v3 [cs.LG] UPDATED)
    Deep learning experiments by Cohen et al. [2021] using deterministic Gradient Descent (GD) revealed an Edge of Stability (EoS) phase when learning rate (LR) and sharpness (i.e., the largest eigenvalue of Hessian) no longer behave as in traditional optimization. Sharpness stabilizes around $2/\text{LR}$ and loss goes up and down across iterations, yet still with an overall downward trend. The current paper mathematically analyzes a new mechanism of implicit regularization in the EoS phase, whereby GD updates due to non-smooth loss landscape turn out to evolve along some deterministic flow on the manifold of minimum loss. This is in contrast to many previous results about implicit bias either relying on infinitesimal updates or noise in gradient. Formally, for any smooth function $L$ with certain regularity conditions, this effect is demonstrated for (1) Normalized GD, i.e., GD with a varying LR $\eta_t =\frac{\eta}{\| \nabla L(x(t)) \|}$ and loss $L$; (2) GD with constant LR and loss $\sqrt{L- \min_x L(x)}$. Both provably enter the Edge of Stability, with the associated flow on the manifold minimizing $\lambda_{1}(\nabla^2 L)$. The above theoretical results have been corroborated by an experimental study.  ( 2 min )
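    For reference, case (1) is just gradient descent with the step normalized by the gradient norm; a tiny sketch of the update rule on a toy objective (illustrative only):

    ```python
    import numpy as np

    def grad_L(x):
        return 2.0 * x        # gradient of the toy loss L(x) = ||x||^2

    eta, x = 0.1, np.array([1.0, -2.0])
    for _ in range(100):
        g = grad_L(x)
        # normalized GD: x <- x - (eta / ||grad L(x)||) * grad L(x)
        x = x - eta * g / (np.linalg.norm(g) + 1e-12)
    ```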
    Nonparametric Uncertainty Quantification for Single Deterministic Neural Network. (arXiv:2202.03101v2 [stat.ML] UPDATED)
    This paper proposes a fast and scalable method for uncertainty quantification of machine learning models' predictions. First, we show a principled way to measure the uncertainty of predictions for a classifier based on Nadaraya-Watson's nonparametric estimate of the conditional label distribution. Importantly, the proposed approach allows one to explicitly disentangle aleatoric and epistemic uncertainties. The resulting method works directly in the feature space. However, one can apply it to any neural network by considering an embedding of the data induced by the network. We demonstrate the strong performance of the method in uncertainty estimation tasks on text classification problems and a variety of real-world image datasets, such as MNIST, SVHN, CIFAR-100 and several versions of ImageNet.  ( 2 min )
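    The core estimate is easy to state in code: a kernel-weighted vote of training labels around the query point. The sketch below shows only that kernel estimate on raw features; the paper additionally works in a network-induced embedding and separates the aleatoric and epistemic terms.

    ```python
    import numpy as np

    def nw_label_distribution(x, X_train, y_train, n_classes, h=1.0):
        """Nadaraya-Watson estimate of p(y|x) with an RBF kernel."""
        w = np.exp(-np.sum((X_train - x) ** 2, axis=1) / (2 * h ** 2))
        p = np.array([w[y_train == c].sum() for c in range(n_classes)])
        if p.sum() < 1e-12:
            return np.full(n_classes, 1.0 / n_classes)  # far from all data: maximal uncertainty
        return p / p.sum()

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 2)); y = (X[:, 0] > 0).astype(int)
    print(nw_label_distribution(np.array([2.0, 0.0]), X, y, 2))  # confidently class 1
    ```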
    VAuLT: Augmenting the Vision-and-Language Transformer for Sentiment Classification on Social Media. (arXiv:2208.09021v2 [cs.CV] UPDATED)
    We propose the Vision-and-Augmented-Language Transformer (VAuLT). VAuLT is an extension of the popular Vision-and-Language Transformer (ViLT), and improves performance on vision-and-language (VL) tasks that involve more complex text inputs than image captions while having minimal impact on training and inference efficiency. ViLT, importantly, enables efficient training and inference in VL tasks, achieved by encoding images using a linear projection of patches instead of an object detector. However, it is pretrained on captioning datasets, where the language input is simple, literal, and descriptive, therefore lacking linguistic diversity. So, when working with multimedia data in the wild, such as multimodal social media data, there is a notable shift from captioning language data, as well as diversity of tasks. We indeed find evidence that the language capacity of ViLT is lacking. The key insight of VAuLT is to propagate the output representations of a large language model (LM) like BERT to the language input of ViLT. We show that joint training of the LM and ViLT in VAuLT can yield relative improvements up to 20% over ViLT on VL tasks involving richer language inputs and affective constructs, such as for Target-Oriented Sentiment Classification in TWITTER-2015 and TWITTER-2017, and Sentiment Classification in MVSA-Single and MVSA-Multiple. Our code is available at https://github.com/gchochla/VAuLT.  ( 3 min )
    Disentangling Visual Embeddings with Minimal Distributional Assumptions. (arXiv:2206.13872v2 [stat.ML] UPDATED)
    Interest in understanding and factorizing embedding spaces learned by deep encoders is growing. Concept discovery methods search the embedding spaces for interpretable latent components like object shape or color and disentangle them into individual axes in the embedding space. Yet, the applicability of modern disentanglement learning techniques or independent component analysis (ICA) is limited when it comes to vision tasks: They either require training a model of the complex image-generating process or their rigid stochastic independence assumptions on the component distribution are violated in practice. In this work, we identify components in encoder embedding spaces without distributional assumptions and without training a generator. Instead, we utilize functional compositionality properties of image-generating processes. We derive two novel post-hoc component discovery methods and prove theoretical identifiability guarantees. We study them in realistic visual disentanglement tasks with correlated components and violated functional assumptions. Our approaches stably maintain superior performance against 300+ state-of-the-art disentanglement and component analysis models.  ( 2 min )
    DiGress: Discrete Denoising diffusion for graph generation. (arXiv:2209.14734v2 [cs.LG] UPDATED)
    This work introduces DiGress, a discrete denoising diffusion model for generating graphs with categorical node and edge attributes. Our model defines a diffusion process that progressively edits a graph with noise (adding or removing edges, changing the categories), and a graph transformer network that learns to revert this process. With these two ingredients in place, we reduce distribution learning over graphs to a simple sequence of classification tasks. We further improve sample quality by proposing a new Markovian noise model that preserves the marginal distribution of node and edge types during diffusion, and by adding auxiliary graph-theoretic features derived from the noisy graph at each diffusion step. Finally, we propose a guidance procedure for conditioning the generation on graph-level features. Overall, DiGress achieves state-of-the-art performance on both molecular and non-molecular datasets, with up to 3x validity improvement on a dataset of planar graphs. In particular, it is the first model that scales to the large GuacaMol dataset containing 1.3M drug-like molecules without using a molecule-specific representation such as SMILES or fragments.  ( 2 min )
    O-type Stars Stellar Parameter Estimation Using Recurrent Neural Networks. (arXiv:2210.12791v2 [astro-ph.IM] UPDATED)
    In this paper, we present a deep learning system approach to estimating luminosity, effective temperature, and surface gravity of O-type stars using the optical region of the stellar spectra. In previous work, we compare a set of machine learning and deep learning algorithms in order to establish a reliable way to fit a stellar model using two methods: the classification of the stellar spectra models and the estimation of the physical parameters in a regression-type task. Here we present the process to estimate individual physical parameters from an artificial neural network perspective with the capacity to handle stellar spectra with a low signal-to-noise ratio (S/N), in the $<$20 S/N boundaries. The development of three different recurrent neural network systems, the training process using stellar spectra models, the test over nine different observed stellar spectra, and the comparison with estimations in previous works are presented. Additionally, characterization methods for stellar spectra in order to reduce the dimensionality of the input data for the system and optimize the computational resources are discussed.  ( 2 min )
    A Multilevel Reinforcement Learning Framework for PDE-based Control. (arXiv:2210.08400v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is a promising method to solve control problems. However, model-free RL algorithms are sample inefficient and require thousands if not millions of samples to learn optimal control policies. A major source of computational cost in RL corresponds to the transition function, which is dictated by the model dynamics. This is especially problematic when model dynamics is represented with coupled PDEs. In such cases, the transition function often involves solving a large-scale discretization of the said PDEs. We propose a multilevel RL framework in order to ease this cost by exploiting sublevel models that correspond to coarser scale discretization (i.e. multilevel models). This is done by formulating an approximate multilevel Monte Carlo estimate of the objective function of the policy and / or value network instead of Monte Carlo estimates, as done in the classical framework. As a demonstration of this framework, we present a multilevel version of the proximal policy optimization (PPO) algorithm. Here, the level refers to the grid fidelity of the chosen simulation-based environment. We provide two examples of simulation-based environments that employ stochastic PDEs that are solved using finite-volume discretization. For the case studies presented, we observed substantial computational savings using multilevel PPO compared to its classical counterpart.  ( 2 min )
    Learning Modular Simulations for Homogeneous Systems. (arXiv:2210.16294v1 [cs.LG])
    Complex systems are often decomposed into modular subsystems for engineering tractability. Although various equation based white-box modeling techniques make use of such structure, learning based methods have yet to incorporate these ideas broadly. We present a modular simulation framework for modeling homogeneous multibody dynamical systems, which combines ideas from graph neural networks and neural differential equations. We learn to model the individual dynamical subsystem as a neural ODE module. Full simulation of the composite system is orchestrated via spatio-temporal message passing between these modules. An arbitrary number of modules can be combined to simulate systems of a wide variety of coupling topologies. We evaluate our framework on a variety of systems and show that message passing allows coordination between multiple modules over time for accurate predictions and in certain cases, enables zero-shot generalization to new system configurations. Furthermore, we show that our models can be transferred to new system configurations with lower data requirement and training effort, compared to those trained from scratch.  ( 2 min )
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v3 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research as it allows one to assess the actual impact of new methods, and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.
    Reproducible and Portable Big Data Analytics in the Cloud. (arXiv:2112.09762v3 [cs.DC] UPDATED)
    Cloud computing has become a major approach to help reproduce computational experiments because it supports on-demand hardware and software resource provisioning. Yet there are still two main difficulties in reproducing big data applications in the cloud. The first is how to automate end-to-end execution of analytics including environment provisioning, analytics pipeline description, pipeline execution, and resource termination. The second is that an application developed for one cloud is difficult to reproduce in another cloud, a.k.a. the vendor lock-in problem. To tackle these problems, we leverage serverless computing and containerization techniques for automated scalable execution and reproducibility, and utilize the adapter design pattern to enable application portability and reproducibility across different clouds. We propose and develop an open-source toolkit that supports 1) fully automated end-to-end execution and reproduction via a single command, 2) automated data and configuration storage for each execution, 3) flexible client modes based on user preferences, 4) execution history query, and 5) simple reproduction of existing executions in the same environment or a different environment. We conducted extensive experiments on both AWS and Azure using four big data analytics applications that run on virtual CPU/GPU clusters. The experiments show our toolkit can achieve good execution performance, scalability, and efficient reproducibility for cloud-based big data analytics.
    Fairness Reprogramming. (arXiv:2209.10222v2 [cs.LG] UPDATED)
    Despite a surge of recent advances in promoting machine learning (ML) fairness, the existing mainstream approaches mostly require retraining or finetuning the entire weights of the neural network to meet the fairness criteria. However, this is often infeasible in practice for those large-scale trained models due to large computational and storage costs, low data efficiency, and model privacy issues. In this paper, we propose a new generic fairness learning paradigm, called FairReprogram, which incorporates the model reprogramming technique. Specifically, FairReprogram considers the neural model fixed, and instead appends to the input a set of perturbations, called the fairness trigger, which is tuned towards the fairness criteria under a min-max formulation. We further introduce an information-theoretic framework that explains why and under what conditions fairness goals can be achieved using the fairness trigger. We show both theoretically and empirically that the fairness trigger can effectively obscure demographic biases in the output prediction of fixed ML models by providing false demographic information that hinders the model from utilizing the correct demographic information to make the prediction. Extensive experiments on both NLP and CV datasets demonstrate that our method can achieve better fairness improvements than retraining-based methods with far less training cost and data dependency under two widely-used fairness criteria.
    Confound-leakage: Confound Removal in Machine Learning Leads to Leakage. (arXiv:2210.09232v2 [cs.LG] UPDATED)
    Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed, as is commonly done by featurewise removal of their variance via linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against naïve use of standard confound removal approaches.
    Towards fast machine-learning-assisted Bayesian posterior inference of microseismic event location and source mechanism. (arXiv:2101.04724v2 [physics.geo-ph] UPDATED)
    Bayesian inference applied to microseismic activity monitoring allows the accurate location of microseismic events from recorded seismograms and the estimation of the associated uncertainties. However, the forward modelling of these microseismic events, which is necessary to perform Bayesian source inversion, can be prohibitively expensive in terms of computational resources. A viable solution is to train a surrogate model based on machine learning techniques, to emulate the forward model and thus accelerate Bayesian inference. In this paper, we substantially enhance previous work, which considered only sources with isotropic moment tensors. We train a machine learning algorithm on the power spectrum of the recorded pressure wave and show that the trained emulator allows complete and fast event locations for $\textit{any}$ source mechanism. Moreover, we show that our approach is computationally inexpensive, as it can be run in less than 1 hour on a commercial laptop, while yielding accurate results using less than $10^4$ training seismograms. We additionally demonstrate how the trained emulators can be used to identify the source mechanism through the estimation of the Bayesian evidence. Finally, we demonstrate that our approach is robust to real noise as measured in field data. This work lays the foundations for efficient, accurate future joint determinations of event location and moment tensor, and associated uncertainties, which are ultimately key for accurately characterising human-induced and natural earthquakes, and for enhanced quantitative seismic hazard assessments.
    Multi-Agent Reinforcement Learning is a Sequence Modeling Problem. (arXiv:2205.14953v3 [cs.MA] UPDATED)
    Large sequence models (SMs) such as the GPT series and BERT have displayed outstanding performance and generalization capabilities on vision, language, and recently reinforcement learning tasks. A natural follow-up question is how to abstract multi-agent decision making into an SM problem and benefit from the prosperous development of SMs. In this paper, we introduce a novel architecture named Multi-Agent Transformer (MAT) that effectively casts cooperative multi-agent reinforcement learning (MARL) into SM problems wherein the task is to map agents' observation sequence to agents' optimal action sequence. Our goal is to build the bridge between MARL and SMs so that the modeling power of modern sequence models can be unleashed for MARL. Central to our MAT is an encoder-decoder architecture which leverages the multi-agent advantage decomposition theorem to transform the joint policy search problem into a sequential decision making process; this renders only linear time complexity for multi-agent problems and, most importantly, endows MAT with a monotonic performance improvement guarantee. Unlike prior art such as Decision Transformer, which fits only pre-collected offline data, MAT is trained by online trial and error from the environment in an on-policy fashion. To validate MAT, we conduct extensive experiments on StarCraftII, Multi-Agent MuJoCo, Dexterous Hands Manipulation, and Google Research Football benchmarks. Results demonstrate that MAT achieves superior performance and data efficiency compared to strong baselines including MAPPO and HAPPO. Furthermore, we demonstrate that MAT is an excellent few-shot learner on unseen tasks regardless of changes in the number of agents. See our project page at https://sites.google.com/view/multi-agent-transformer.
    Multi-task Active Learning for Pre-trained Transformer-based Models. (arXiv:2208.05379v2 [cs.CL] UPDATED)
    Multi-task learning, in which several tasks are jointly learned by a single model, allows NLP models to share information from multiple annotations and may facilitate better predictions when the tasks are inter-related. This technique, however, requires annotating the same text with multiple annotation schemes which may be costly and laborious. Active learning (AL) has been demonstrated to optimize annotation processes by iteratively selecting unlabeled examples whose annotation is most valuable for the NLP model. Yet, multi-task active learning (MT-AL) has not been applied to state-of-the-art pre-trained Transformer-based NLP models. This paper aims to close this gap. We explore various multi-task selection criteria in three realistic multi-task scenarios, reflecting different relations between the participating tasks, and demonstrate the effectiveness of multi-task compared to single-task selection. Our results suggest that MT-AL can be effectively used in order to minimize annotation efforts for multi-task NLP models.
    TensorIR: An Abstraction for Automatic Tensorized Program Optimization. (arXiv:2207.04296v2 [cs.LG] UPDATED)
    Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to bring tensor computation as the first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive to state-of-art hand-optimized systems across platforms.
    Improved proteasomal cleavage prediction with positive-unlabeled learning. (arXiv:2209.07527v2 [q-bio.QM] UPDATED)
    Accurate in silico modeling of the antigen processing pathway is crucial to enable personalized epitope vaccine design for cancer. An important step of such pathway is the degradation of the vaccine into smaller peptides by the proteasome, some of which are going to be presented to T cells by the MHC complex. While predicting MHC-peptide presentation has received a lot of attention recently, proteasomal cleavage prediction remains a relatively unexplored area in light of recent advances in high-throughput mass spectrometry-based MHC ligandomics. Moreover, as such experimental techniques do not allow one to identify regions that cannot be cleaved, the latest predictors generate decoy negative samples and treat them as true negatives when training, even though some of them could actually be positives. In this work, we thus present a new predictor trained with an expanded dataset and the solid theoretical underpinning of positive-unlabeled learning, achieving a new state-of-the-art in proteasomal cleavage prediction. The improved predictive capabilities will in turn enable more precise vaccine development improving the efficacy of epitope-based vaccines. Pretrained models are available on GitHub.
    Deep Kernel Learning of Dynamical Models from High-Dimensional Noisy Data. (arXiv:2208.12975v2 [cs.LG] UPDATED)
    This work proposes a Stochastic Variational Deep Kernel Learning method for the data-driven discovery of low-dimensional dynamical models from high-dimensional noisy data. The framework is composed of an encoder that compresses high-dimensional measurements into low-dimensional state variables, and a latent dynamical model for the state variables that predicts the system evolution over time. The training of the proposed model is carried out in an unsupervised manner, i.e., not relying on labeled data. Our learning method is evaluated on the motion of a pendulum -- a well studied baseline for nonlinear model identification and control with continuous states and control inputs -- measured via high-dimensional noisy RGB images. Results show that the method can effectively denoise measurements, learn compact state representations and latent dynamical models, as well as identify and quantify modeling uncertainties.
    Provably Training Overparameterized Neural Network Classifiers with Non-convex Constraints. (arXiv:2012.15274v2 [stat.ML] UPDATED)
    Training a classifier under non-convex constraints has received increasing attention in the machine learning community thanks to its wide range of applications such as algorithmic fairness and class-imbalanced classification. However, several recent works addressing non-convex constraints have only focused on simple models such as logistic regression or support vector machines. Neural networks, one of the most popular models for classification nowadays, are precluded and lack theoretical guarantees. In this work, we show that overparameterized neural networks could achieve a near-optimal and near-feasible solution of non-convex constrained optimization problems via projected stochastic gradient descent. Our key ingredient is the no-regret analysis of online learning for neural networks in the overparameterization regime, which may be of independent interest in online learning applications.
    Semi-Supervised Generative Adversarial Network for Stress Detection Using Partially Labeled Physiological Data. (arXiv:2206.14976v2 [cs.LG] UPDATED)
    Physiological measurement involves observing variables that relate to the normative functioning of human systems and subsystems, directly or indirectly. The measurements can be used to detect affective states of a person with aims such as improving human-computer interactions. There are several methods of collecting physiological data, but wearable sensors are a common, non-invasive tool for accurate readings. However, valuable information is hard to extract from the raw physiological data, especially for affective state detection. Machine learning techniques are used to detect the affective state of a person through labeled physiological data. A clear problem with using labeled data is creating accurate labels. An expert is needed to analyze recordings of participants and mark sections with different states such as stress and calm. While expensive, this method delivers a complete dataset with labeled data that can be used in any number of supervised algorithms. An interesting question arises from the expensive labeling: how can we reduce the cost while maintaining high accuracy? Semi-supervised learning (SSL) is a potential solution to this problem. These algorithms allow machine learning models to be trained with only a small subset of labeled data (unlike unsupervised learning, which uses no labels). They provide a way of avoiding expensive labeling. This paper compares a fully supervised algorithm to an SSL algorithm on the public WESAD (Wearable Stress and Affect Detection) dataset for stress detection. This paper shows that semi-supervised algorithms are a viable method for inexpensive affective state detection systems with accurate results.
    MSRL: Distributed Reinforcement Learning with Dataflow Fragments. (arXiv:2210.00882v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) trains many agents, which is resource-intensive and must scale to large GPU clusters. Different RL training algorithms offer different opportunities for distributing and parallelising the computation. Yet, current distributed RL systems tie the definition of RL algorithms to their distributed execution: they hard-code particular distribution strategies and only accelerate specific parts of the computation (e.g. policy network updates) on GPU workers. Fundamentally, current systems lack abstractions that decouple RL algorithms from their execution. We describe MindSpore Reinforcement Learning (MSRL), a distributed RL training system that supports distribution policies that govern how RL training computation is parallelised and distributed on cluster resources, without requiring changes to the algorithm implementation. MSRL introduces the new abstraction of a fragmented dataflow graph, which maps Python functions from an RL algorithm's training loop to parallel computational fragments. Fragments are executed on different devices by translating them to low-level dataflow representations, e.g. computational graphs as supported by deep learning engines, CUDA implementations or multi-threaded CPU processes. We show that MSRL subsumes the distribution strategies of existing systems, while scaling RL training to 64 GPUs.
    Towards predicting dynamic stability of power grids with Graph Neural Networks. (arXiv:2206.06369v2 [cs.LG] UPDATED)
    To mitigate climate change, the share of renewable energies in power production needs to be increased. Renewables introduce new challenges to power grids regarding the dynamic stability due to decentralization, reduced inertia and volatility in production. However, dynamic stability simulations are intractable and exceedingly expensive for large grids. Graph Neural Networks (GNNs) are a promising method to reduce the computational effort of analyzing dynamic stability of power grids. We provide new datasets of dynamic stability of synthetic power grids and find that GNNs are surprisingly effective at predicting highly non-linear targets from topological information only. We show that large GNNs outperform GNNs from previous work as well as handcrafted graph features and semi-analytic approximations. Further, we demonstrate that GNNs can accurately identify troublemaker nodes in the power grids. Lastly, we show that GNNs trained on small grids can perform accurately on a large synthetic Texan power grid model, which illustrates the potential of our approach.
    LOFT: Finding Lottery Tickets through Filter-wise Training. (arXiv:2210.16169v1 [cs.LG])
    Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist "winning tickets" in large neural networks. These tickets represent "sparse" versions of the full model that can be trained independently to achieve comparable accuracy with respect to the full model. However, finding the winning tickets requires one to pretrain the large model for at least a number of epochs, which can be a burdensome task, especially when the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well-represents the model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings of neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT (i) preserves and finds good lottery tickets, while (ii) it achieves non-trivial computation and communication savings, and maintains comparable or even better accuracy than other pretraining methods.
    Multitrack Music Transformer. (arXiv:2207.06983v2 [cs.SD] UPDATED)
    Existing approaches for generating multitrack music with transformer models have been limited in terms of the number of instruments and the length of the music segments, and suffer from slow inference. This is partly due to the memory requirements of the lengthy input sequences necessitated by existing representations. In this work, we propose a new multitrack music representation that allows a diverse set of instruments while keeping a short sequence length. Our proposed Multitrack Music Transformer (MMT) achieves comparable performance with state-of-the-art systems, landing in between two recently proposed models in a subjective listening test, while achieving substantial speedups and memory reductions over both, making the method attractive for real time improvisation or near real time creative applications. Further, we propose a new measure for analyzing musical self-attention and show that the trained model attends more to notes that form a consonant interval with the current note and to notes that are 4N beats away from the current step.
    Comprehensively identifying Long Covid articles with human-in-the-loop machine learning. (arXiv:2209.08124v2 [cs.LG] UPDATED)
    A significant percentage of COVID-19 survivors experience ongoing multisystemic symptoms that often affect daily living, a condition known as Long Covid or post-acute-sequelae of SARS-CoV-2 infection. However, identifying scientific articles relevant to Long Covid is challenging since there is no standardized or consensus terminology. We developed an iterative human-in-the-loop machine learning framework combining data programming with active learning into a robust ensemble model, demonstrating higher specificity and considerably higher sensitivity than other methods. Analysis of the Long Covid collection shows that (1) most Long Covid articles do not refer to Long Covid by any name; (2) when the condition is named, the name used most frequently in the literature is Long Covid; and (3) Long Covid is associated with disorders in a wide variety of body systems. The Long Covid collection is updated weekly and is searchable online at the LitCovid portal: https://www.ncbi.nlm.nih.gov/research/coronavirus/docsum?filters=e_condition.LongCovid
    DOORS: Dataset fOr bOuldeRs Segmentation. Statistical properties and Blender setup. (arXiv:2210.16253v1 [cs.CV])
    The capability to detect boulders on the surface of small bodies is beneficial for vision-based applications such as hazard detection during critical operations and navigation. This task is challenging due to the wide assortment of irregular shapes, the characteristics of the boulder population, and the rapid variability in illumination conditions. Moreover, the lack of publicly available labeled datasets for these applications hampers research on data-driven algorithms. In this work, the authors provide the statistical characterization and setup used for the generation of two publicly available datasets about boulders on small bodies.
    Dynamic Graph Echo State Networks. (arXiv:2110.08565v2 [cs.LG] UPDATED)
    Dynamic temporal graphs represent evolving relations between entities, e.g. interactions between social network users or infection spreading. We propose an extension of graph echo state networks for the efficient processing of dynamic temporal graphs, with a sufficient condition for their echo state property, and an experimental analysis of reservoir layout impact. Compared to temporal graph kernels that need to hold the entire history of vertex interactions, our model provides a vector encoding for the dynamic graph that is updated at each time-step without requiring training. Experiments show accuracy comparable to approximate temporal graph kernels on twelve dissemination process classification tasks.
    Multi-task Video Enhancement for Dental Interventions. (arXiv:2210.16236v1 [cs.CV])
    A microcamera firmly attached to a dental handpiece allows dentists to continuously monitor the progress of conservative dental procedures. Video enhancement in video-assisted dental interventions alleviates the low light, noise, blur, and camera shake that collectively degrade visual comfort. To this end, we introduce a novel deep network for multi-task video enhancement that enables macro-visualization of dental scenes. In particular, the proposed network jointly leverages video restoration and temporal alignment in a multi-scale manner for effective video enhancement. Our experiments on videos of natural teeth in phantom scenes demonstrate that the proposed network achieves state-of-the-art results in multiple tasks with near real-time processing. We release Vident-lab at https://doi.org/10.34808/1jby-ay90, the first dataset of dental videos with multi-task labels to facilitate further research in relevant video processing applications.
    Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. (arXiv:2203.02592v2 [stat.ML] UPDATED)
    The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity and hence can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity induction mechanisms for latent variable models proposed in the literature.
    Latent Space is Feature Space: Regularization Term for GANs Training on Limited Dataset. (arXiv:2210.16251v1 [cs.CV])
    Generative Adversarial Networks (GANs) are currently widely used as an unsupervised image generation method. Current state-of-the-art GANs can generate photorealistic images at high resolution. However, a large amount of data is required, or the model is prone to generating images with similar patterns (mode collapse) and bad quality. I propose an additional structure and loss function for GANs, called LFM, trained to maximize the feature diversity between the different dimensions of the latent space to avoid mode collapse without affecting image quality. Orthogonal latent vector pairs are created, and the feature vector pairs extracted by the discriminator are examined via dot product, placing the discriminator and generator in a novel adversarial relationship. In experiments, this system was built upon DCGAN and shown to improve the Frechet Inception Distance (FID) when training from scratch on the CelebA dataset. This system requires only mild extra computation and can work with data augmentation methods. The code is available on github.com/penway/LFM.
    Fuzzy Logic Model for Predicting the Heat Index. (arXiv:2210.16051v1 [cs.LG])
    A fuzzy inference system was developed for predicting the heat index from temperature and relative humidity data. The effectiveness of fuzzy logic in using imprecise mappings of inputs to outputs to encode the interconnectedness of system variables was exploited to uncover a linguistic model of how temperature and humidity conditions impact the heat index in a growth room. The developed model achieved an R2 of 0.974 and an RMSE of 0.084 when evaluated on a test set, and the results were statistically significant (F(1, 5915) = 222900.858, p < 0.001). By providing the advantage of linguistic summarization of data trends as well as high prediction accuracy, the fuzzy logic model proved to be an effective machine learning method for heat control problems.
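    As a flavor of what such a fuzzy inference system looks like in code (the membership functions and rules below are invented for illustration, not the paper's model):

    ```python
    import numpy as np

    def tri(x, a, b, c):
        """Triangular membership function on [a, c] peaking at b."""
        return np.maximum(np.minimum((x - a) / (b - a + 1e-9), (c - x) / (c - b + 1e-9)), 0.0)

    def heat_index(temp_c, rh):
        hot, mild = tri(temp_c, 25, 35, 45), tri(temp_c, 15, 25, 35)
        humid, dry = tri(rh, 50, 80, 110), tri(rh, 0, 30, 60)
        out = np.linspace(15, 55, 401)           # candidate heat-index values
        # two Mamdani-style rules (min as AND), each capping an output fuzzy set
        agg = np.maximum(np.minimum(min(hot, humid), tri(out, 35, 45, 55)),
                         np.minimum(min(mild, dry), tri(out, 15, 25, 35)))
        return float((out * agg).sum() / (agg.sum() + 1e-9))  # centroid defuzzification

    print(heat_index(33.0, 75.0))
    ```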
    Flow-matching -- efficient coarse-graining of molecular dynamics without forces. (arXiv:2203.11167v3 [physics.comp-ph] UPDATED)
    Coarse-grained (CG) molecular simulations have become a standard tool to study molecular processes on time- and length-scales inaccessible to all-atom simulations. Parameterizing CG force fields to match all-atom simulations has mainly relied on force-matching or relative entropy minimization, which require many samples from costly simulations with all-atom or CG resolutions, respectively. Here we present flow-matching, a new training method for CG force fields that combines the advantages of both methods by leveraging normalizing flows, a generative deep learning method. Flow-matching first trains a normalizing flow to represent the CG probability density, which is equivalent to minimizing the relative entropy without requiring iterative CG simulations. Subsequently, the flow generates samples and forces according to the learned distribution in order to train the desired CG free energy model via force matching. Even without requiring forces from the all-atom simulations, flow-matching outperforms classical force-matching by an order of magnitude in terms of data efficiency, and produces CG models that can capture the folding and unfolding transitions of small proteins.
    Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization. (arXiv:2202.06856v2 [cs.LG] UPDATED)
    A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. Our findings also imply that given a small amount of data from the target distribution, retraining only the last linear layer will give excellent performance. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
    Sequential asset ranking in nonstationary time series. (arXiv:2202.12186v3 [cs.CE] UPDATED)
    We create a ranking algorithm, the naive Bayes asset ranker. Our algorithm computes the posterior probability that individual assets will be ranked higher than other portfolio constituents. Unlike earlier algorithms, such as the weighted majority, our algorithm allows poor-performing experts to have increased weight when they start performing well. We outperform the long-only holding of the S&P 500 index and a regress-then-rank baseline.
    Preferential Subsampling for Stochastic Gradient Langevin Dynamics. (arXiv:2210.16189v1 [stat.ML])
    Stochastic gradient MCMC (SGMCMC) offers a scalable alternative to traditional MCMC, by constructing an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data. While efficient to compute, the resulting gradient estimator may exhibit a high variance and impact sampler performance. The problem of variance control has been traditionally addressed by constructing a better stochastic gradient estimator, often using control variates. We propose to use a discrete, non-uniform probability distribution to preferentially subsample data points that have a greater impact on the stochastic gradient. In addition, we present a method of adaptively adjusting the subsample size at each iteration of the algorithm, so that we increase the subsample size in areas of the sample space where the gradient is harder to estimate. We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.
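    A minimal sketch of the idea on a toy Gaussian-mean posterior, assuming sampling with replacement and importance weights to keep the gradient estimate unbiased (the adaptive-subsample-size part is omitted):

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    data = rng.normal(1.0, 1.0, size=10_000)    # model: x_i ~ N(theta, 1), prior N(0, 1)

    def grad_log_post(theta, idx, p):
        w = 1.0 / (len(idx) * p[idx])           # importance weights: unbiased sum estimate
        return -theta + np.sum(w * (data[idx] - theta))

    # Preferential subsampling: put more probability on influential points.
    p = np.abs(data - data.mean()) + 1e-3
    p /= p.sum()

    theta, eps, m = 0.0, 1e-4, 100
    for _ in range(2_000):
        idx = rng.choice(len(data), size=m, p=p)
        theta += 0.5 * eps * grad_log_post(theta, idx, p) + np.sqrt(eps) * rng.normal()
    ```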
    Near-Optimal Collaborative Learning in Bandits. (arXiv:2206.00121v2 [cs.LG] UPDATED)
    This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify, in pure exploration, or play, in regret minimization, its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows us to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.
    A Novel Sparse Bayesian Learning and Its Application to Fault Diagnosis for Multistation Assembly Systems. (arXiv:2210.16176v1 [stat.AP])
    This paper addresses the problem of fault diagnosis in multistation assembly systems. Fault diagnosis identifies the process faults that cause excessive dimensional variation of the product using dimensional measurements. For such problems, the challenge is solving an underdetermined system caused by a phenomenon common in practice, namely that the number of measurements is less than the number of process errors. To address this challenge, this paper attempts to solve the following two problems: (1) how to utilize the temporal correlation in the time series data of each process error, and (2) how to apply prior knowledge regarding which process errors are more likely to be process faults. A novel sparse Bayesian learning method is proposed to achieve the above objectives. The method consists of three hierarchical layers. The first layer has a parameterized prior distribution that exploits the temporal correlation of each process error. Furthermore, the second and third layers achieve the prior distribution representing the prior knowledge of process faults. Then, these prior distributions are updated with the likelihood function of the measurement samples from the process, resulting in an accurate posterior distribution of process faults from an underdetermined system. Since the posterior distributions of process faults are intractable, this paper derives approximate posterior distributions via variational Bayes inference. Numerical and simulation case studies using an actual autobody assembly process are performed to demonstrate the effectiveness of the proposed method.
    Universal speaker recognition encoders for different speech segments duration. (arXiv:2210.16231v1 [cs.SD])
    Creating universal speaker encoders that are robust to different acoustic conditions and speech durations is a major challenge today. According to our observations, systems trained on short speech segments are optimal for short-phrase speaker verification, while systems trained on long segments are superior for long-segment verification. A system trained simultaneously on pooled short and long speech segments does not give optimal verification results and usually degrades on both short and long segments. This paper addresses the problem of creating universal speaker encoders for different speech segment durations. We describe our simple recipe for training a universal speaker encoder for any selected neural network architecture. According to our evaluation results for wav2vec-TDNN based systems on the NIST SRE and VoxCeleb1 benchmarks, the proposed universal encoder improves speaker verification when enrollment and test speech segment durations differ. The key feature of the proposed encoder is that it has the same inference time as the selected neural network architecture.
    Vanishing Component Analysis with Contrastive Normalization. (arXiv:2210.16171v1 [cs.LG])
    Vanishing component analysis (VCA) computes approximate generators of vanishing ideals of samples, which are further used for extracting nonlinear features of the samples. Recent studies have shown that normalization of approximate generators plays an important role and that different normalizations lead to generators with different properties. In this paper, inspired by recent self-supervised frameworks, we propose a contrastive normalization method for VCA, where we require the generators to vanish on the target samples and to be normalized on the transformed samples. We theoretically show that contrastive normalization enhances the discriminative power of VCA, and we provide an algebraic interpretation of VCA under our normalization. Numerical experiments demonstrate the effectiveness of our method. This is the first study to tailor the normalization of approximate generators of vanishing ideals to obtain discriminative features.
    Nonuniqueness and Convergence to Equivalent Solutions in Observer-based Inverse Reinforcement Learning. (arXiv:2210.16299v1 [eess.SY])
    A key challenge in solving the deterministic inverse reinforcement learning problem online and in real time is the existence of non-unique solutions. Nonuniqueness necessitates the study of the notion of equivalent solutions and convergence to such solutions. While \emph{offline} algorithms that result in convergence to equivalent solutions have been developed in the literature, online, real-time techniques that address nonuniqueness are not available. In this paper, a regularized history stack observer is developed to generate solutions that are approximately equivalent. Novel data-richness conditions are developed to facilitate the analysis and simulation results are provided to demonstrate the effectiveness of the developed technique.
    Universalization of any adversarial attack using very few test examples. (arXiv:2005.08632v2 [cs.LG] UPDATED)
    Deep learning models are known to be vulnerable not only to input-dependent adversarial attacks but also to input-agnostic, or universal, adversarial attacks. Dezfooli et al. \cite{Dezfooli17,Dezfooli17anal} construct a universal adversarial attack on a given model by looking at a large number of training data points and the geometry of the decision boundary near them. Subsequent work \cite{Khrulkov18} constructs a universal attack by looking only at test examples and intermediate layers of the given model. In this paper, we propose a simple universalization technique that takes any input-dependent adversarial attack and constructs a universal attack by looking at only very few adversarial test examples. We do not require details of the given model, and the computational overhead of universalization is negligible. We theoretically justify our universalization technique via a spectral property common to many input-dependent adversarial perturbations, e.g., gradients, the Fast Gradient Sign Method (FGSM), and DeepFool. Using matrix concentration inequalities and spectral perturbation bounds, we show that the top singular vector of input-dependent adversarial directions on a small test sample gives an effective and simple universal adversarial attack. For VGG16 and VGG19 models trained on ImageNet, our simple universalization of Gradient, FGSM, and DeepFool perturbations using a test sample of 64 images gives fooling rates comparable to state-of-the-art universal attacks \cite{Dezfooli17,Khrulkov18} for reasonable norms of perturbation. Code available at https://github.com/ksandeshk/svd-uap .
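    The construction is simple enough to sketch directly: stack a handful of per-image attack directions, normalize, and take the top right singular vector as the shared direction. The random array below stands in for real FGSM or DeepFool perturbations, and scaling by the sign pattern to an L-inf budget is one common convention rather than the paper's prescription.

    ```python
    import numpy as np

    def universal_direction(perturbations):
        """perturbations: (k, d) array of flattened per-image attack directions."""
        P = perturbations / np.linalg.norm(perturbations, axis=1, keepdims=True)
        # Top right singular vector captures the shared adversarial direction.
        _, _, vt = np.linalg.svd(P, full_matrices=False)
        return vt[0]

    # e.g. 64 perturbations of 32x32x3 images, flattened:
    rng = np.random.default_rng(1)
    perts = rng.normal(size=(64, 32 * 32 * 3))   # stand-in for real attacks
    v = universal_direction(perts)
    eps = 8 / 255
    universal_attack = eps * np.sign(v)          # scaled to an L-inf budget
    ```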
    Contextual-Utterance Training for Automatic Speech Recognition. (arXiv:2210.16238v1 [eess.AS])
    Recent studies of streaming, recurrent neural network transducer (RNN-T)-based automatic speech recognition (ASR) systems have fed the encoder with past contextual information in order to improve word error rate (WER). In this paper, we first propose a contextual-utterance training technique that makes use of previous and future contextual utterances to perform an implicit adaptation to the speaker, topic, and acoustic environment. We also propose a dual-mode contextual-utterance training technique for streaming ASR systems. This approach lets streaming models make better use of the available acoustic context by distilling "in-place" the knowledge of a teacher, which can see both past and future contextual utterances, to the student, which can only see the current and past contextual utterances. The experimental results show that a conformer-transducer system trained with the proposed techniques outperforms the same system trained with the classical RNN-T loss. Specifically, the proposed technique reduces the WER and the average last-token emission latency by more than 6% and 40 ms relative, respectively.
    SEMPAI: a Self-Enhancing Multi-Photon Artificial Intelligence for prior-informed assessment of muscle function and pathology. (arXiv:2210.16273v1 [cs.CV])
    Deep learning (DL) shows notable success in biomedical studies. However, most DL algorithms work as black boxes, exclude biomedical experts, and need extensive data. We introduce the Self-Enhancing Multi-Photon Artificial Intelligence (SEMPAI), which integrates hypothesis-driven priors in a data-driven DL approach for research on multiphoton microscopy (MPM) of muscle fibers. SEMPAI utilizes meta-learning to optimize prior integration, data representation, and neural network architecture simultaneously. This allows hypothesis testing and provides interpretable feedback about the origin of biological information in MPM images. SEMPAI performs joint learning of several tasks to enable prediction for small datasets. The method is applied to an extensive multi-study dataset, resulting in the largest joint analysis of pathologies and function for single muscle fibers. SEMPAI outperforms state-of-the-art biomarkers in six of seven predictive tasks, including those with scarce data. SEMPAI's DL models with integrated priors are superior to those without priors and to prior-only machine learning approaches.
    Analyzing Data-Centric Properties for Graph Contrastive Learning. (arXiv:2208.02810v2 [cs.LG] UPDATED)
    Recent analyses of self-supervised learning (SSL) find the following data-centric properties to be critical for learning good representations: invariance to task-irrelevant semantics, separability of classes in some latent space, and recoverability of labels from augmented samples. However, given their discrete, non-Euclidean nature, graph datasets and graph SSL methods are unlikely to satisfy these properties. This raises the question: how do graph SSL methods, such as contrastive learning (CL), work well? To systematically probe this question, we perform a generalization analysis for CL when using generic graph augmentations (GGAs), with a focus on data-centric properties. Our analysis yields formal insights into the limitations of GGAs and the necessity of task-relevant augmentations. As we empirically show, GGAs do not induce task-relevant invariances on common benchmark datasets, leading to only marginal gains over naive, untrained baselines. Our theory motivates a synthetic data generation process that enables control over task-relevant information and boasts pre-defined optimal augmentations. This flexible benchmark helps us identify yet unrecognized limitations in advanced augmentation techniques (e.g., automated methods). Overall, our work rigorously contextualizes, both empirically and theoretically, the effects of data-centric properties on augmentation strategies and learning paradigms for graph SSL.
    A CNN-LSTM Combination Network for Cataract Detection using Eye Fundus Images. (arXiv:2210.16093v1 [cs.CV])
    According to multiple authoritative sources, including the World Health Organization, vision-related impairments and disorders are becoming a significant issue. According to a recent report, one of the leading causes of irreversible blindness in persons over the age of 50 is delayed cataract treatment. A cataract is a cloudy spot in the eye's lens that causes visual loss. Cataracts often develop slowly, resulting in difficulty driving, reading, and even recognizing faces. This necessitates rapid and dependable diagnosis and treatment solutions for ocular illnesses. Previously, such diagnoses were performed manually, which was time-consuming and prone to human error. However, as technology advances, automated, computer-based methods that reduce both time and human labor while producing trustworthy results are now accessible. In this study, we developed a CNN-LSTM-based model architecture with the goal of creating a low-cost diagnostic system that can classify normal and cataractous cases of ocular disease from fundus images. The proposed model was trained on the publicly available ODIR dataset, which includes fundus images of patients' left and right eyes. The suggested architecture outperformed previous systems with a state-of-the-art 97.53% accuracy.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v2 [cs.LG] UPDATED)
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of RPE-based Transformers is largely lacking. In this work, we mathematically analyze the expressive power of RPE-based Transformers, asking whether the model is capable of approximating any continuous sequence-to-sequence function. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing that there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason is that most RPEs are placed in the softmax attention, which always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With this theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. The corresponding URPE-based Transformers therefore become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications. The code will be made publicly available at https://github.com/lsj2408/URPE.
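    To make the fix concrete, here is a hedged PyTorch sketch of one way to break the stochastic-matrix constraint: multiply the softmax attention elementwise by a trainable matrix indexed by relative position, so rows are no longer forced to sum to one. The Toeplitz parameterization and all shapes are our reading, not necessarily the paper's exact design.

    ```python
    import torch
    import torch.nn as nn

    class URPEStyleAttention(nn.Module):
        def __init__(self, d_model, max_len):
            super().__init__()
            self.q = nn.Linear(d_model, d_model)
            self.k = nn.Linear(d_model, d_model)
            self.v = nn.Linear(d_model, d_model)
            # one trainable scalar per relative offset in [-(max_len-1), max_len-1]
            self.rel = nn.Parameter(torch.ones(2 * max_len - 1))
            self.scale = d_model ** -0.5

        def forward(self, x):                      # x: (B, L, d)
            B, L, _ = x.shape
            attn = torch.softmax(
                self.q(x) @ self.k(x).transpose(-2, -1) * self.scale, dim=-1)
            idx = torch.arange(L)
            # Toeplitz (L, L) matrix built from the per-offset parameters.
            C = self.rel[idx[:, None] - idx[None, :] + L - 1]
            return (attn * C) @ self.v(x)          # rows no longer stochastic

    out = URPEStyleAttention(d_model=16, max_len=32)(torch.randn(2, 10, 16))
    ```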
    Review on Classification Techniques used in Biophysiological Stress Monitoring. (arXiv:2210.16040v1 [cs.LG])
    Cardiovascular activities are directly related to the response of a body under stress. Stress, based on its intensity, can be divided into two types: acute stress (short-term) and chronic stress (long-term). Repeated acute stress and continuous chronic stress may play a vital role in inflammation in the circulatory system and can thus lead to a heart attack or stroke. In this study, we review commonly used machine learning classification techniques applied to different stress-indicating parameters used in stress monitoring devices. These parameters include the photoplethysmogram (PPG), electrocardiogram (ECG), electromyogram (EMG), galvanic skin response (GSR), heart rate variability (HRV), skin temperature, respiratory rate, electroencephalogram (EEG), and salivary cortisol. This study also provides a discussion on choosing a classifier, which depends on a number of factors other than accuracy, such as the number of subjects involved in an experiment, the type of signal processing, and computational limitations.
    Transferable E(3) equivariant parameterization for Hamiltonian of molecules and solids. (arXiv:2210.16190v1 [physics.comp-ph])
    Machine learning, especially deep learning, can build a direct mapping from structure to properties with its huge parameter space, making it possible to perform high-throughput screening for the desired properties of materials. However, since the electronic Hamiltonian transforms non-trivially under rotation operations, it is challenging to accurately predict the electronic Hamiltonian while strictly satisfying this constraint. There is currently a lack of transferable machine learning models that can bypass the computationally demanding density functional theory (DFT) to obtain the ab initio Hamiltonian of molecules and materials by complete data-driven methods. In this work, we point out the necessity of explicitly considering the parity symmetry of the electronic Hamiltonian in addition to rotational equivariance. We propose a parameterized Hamiltonian that strictly satisfies rotational equivariance and parity symmetry simultaneously, based on which we develop an E(3) equivariant neural network called HamNet to predict the ab initio tight-binding Hamiltonian of various molecules and solids. The tests show that this model has similar transferability to that of machine learning potentials and can be applied to a class of materials with different configurations using the same set of trained network weights. The proposed framework provides a general transferable model for accelerating electronic structure calculations.
    Uncertainty Estimation Using Riemannian Model Dynamics for Offline Reinforcement Learning. (arXiv:2102.11327v2 [cs.LG] UPDATED)
    Model-based offline reinforcement learning approaches generally rely on bounds of model error. Estimating these bounds is usually achieved through uncertainty estimation methods. In this work, we combine parametric and nonparametric methods for uncertainty estimation through a novel latent space based metric. In particular, we build upon recent advances in Riemannian geometry of generative models to construct a pullback metric of an encoder-decoder based forward model. Our proposed metric measures both the quality of out-of-distribution samples as well as the discrepancy of examples in the data. We leverage our method for uncertainty estimation in a pessimistic model-based framework, showing a significant improvement upon contemporary model-based offline approaches on continuous control and autonomous driving benchmarks.
    An Accurate, Scalable and Verifiable Protocol for Federated Differentially Private Averaging. (arXiv:2006.07218v3 [cs.CR] UPDATED)
    Learning from data owned by several parties, as in federated learning, raises challenges regarding the privacy guarantees provided to participants and the correctness of the computation in the presence of malicious parties. We tackle these challenges in the context of distributed averaging, an essential building block of federated learning algorithms. Our first contribution is a scalable protocol in which participants exchange correlated Gaussian noise along the edges of a network graph, complemented by independent noise added by each party. We analyze the differential privacy guarantees of our protocol and the impact of the graph topology under colluding malicious parties, showing that we can nearly match the utility of the trusted curator model even when each honest party communicates with only a logarithmic number of other parties chosen at random. This is in contrast with protocols in the local model of privacy (with lower utility) or based on secure aggregation (where all pairs of users need to exchange messages). Our second contribution enables users to prove the correctness of their computations without compromising the efficiency and privacy guarantees of the protocol. Our verification protocol relies on standard cryptographic primitives like commitment schemes and zero knowledge proofs.
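    An illustrative numpy sketch (far from the full protocol, and with arbitrary noise scales): neighbours on the communication graph exchange pairwise-cancelling Gaussian noise, and each party adds a little independent noise, so individual contributions are masked while the average stays accurate as long as both endpoints of each edge remain honest.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)
    values = np.array([0.2, 0.7, 0.4, 0.9])           # private local values
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # communication graph
    sigma_pair, sigma_ind = 1.0, 0.01                 # illustrative scales

    noisy = values.copy()
    for i, j in edges:
        delta = rng.normal(0, sigma_pair)
        noisy[i] += delta                             # cancels in the sum...
        noisy[j] -= delta                             # ...if both stay honest
    noisy += rng.normal(0, sigma_ind, size=len(values))  # independent noise

    print(values.mean(), noisy.mean())                # close, up to sigma_ind
    ```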
    Watermarking Graph Neural Networks based on Backdoor Attacks. (arXiv:2110.11024v4 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. Building a powerful GNN model is not a trivial task, as it requires a large amount of training data, powerful computing resources, and human expertise in fine-tuning the model. Moreover, with the development of adversarial attacks, e.g., model stealing attacks, GNNs raise challenges to model authentication. To avoid copyright infringement on GNNs, verifying the ownership of the GNN models is necessary. This paper presents a watermarking framework for GNNs for both graph and node classification tasks. We 1) design two strategies to generate watermarked data for the graph classification task and one for the node classification task, 2) embed the watermark into the host model through training to obtain the watermarked GNN model, and 3) verify the ownership of the suspicious model in a black-box setting. The experiments show that our framework can verify the ownership of GNN models with a very high probability (up to $99\%$) for both tasks. Finally, we experimentally show that our watermarking approach is robust against a state-of-the-art model extraction technique and four state-of-the-art defenses against backdoor attacks.
    Universal Adversarial Directions. (arXiv:2210.15997v1 [cs.LG])
    Despite their great success in image recognition tasks, deep neural networks (DNNs) have been observed to be susceptible to universal adversarial perturbations (UAPs), which perturb all input samples with a single perturbation vector. However, UAPs often struggle to transfer across DNN architectures and lead to challenging optimization problems. In this work, we study the transferability of UAPs by analyzing equilibrium in the universal adversarial example game between the classifier and UAP adversary players. We show that under mild assumptions the universal adversarial example game lacks a pure Nash equilibrium, indicating UAPs' suboptimal transferability across DNN classifiers. To address this issue, we propose Universal Adversarial Directions (UADs), which fix only a universal direction for adversarial perturbations and allow the perturbations' magnitude to be chosen freely across samples. We prove that the UAD adversarial example game can possess a Nash equilibrium with a pure UAD strategy, implying the potential transferability of UADs. We also connect the UAD optimization problem to the well-known principal component analysis (PCA) and develop an efficient PCA-based algorithm for optimizing UADs. We evaluate UADs over multiple benchmark image datasets. Our numerical results show the superior transferability of UADs over standard gradient-based UAPs.
    FairMask: Better Fairness via Model-based Rebalancing of Protected Attributes. (arXiv:2110.01109v3 [cs.LG] UPDATED)
    Context: Machine learning software can generate models that inappropriately discriminate against specific protected social groups (e.g., groups based on gender, ethnicity, etc.). Motivated by those results, software engineering researchers have proposed many methods for mitigating those discriminatory effects. While those methods are effective in mitigating bias, few of them can explain the root cause of the bias. Objective: We aim at better detection and mitigation of algorithmic discrimination in machine learning software. Method: We propose xFAIR, a model-based extrapolation method that is capable of both mitigating bias and explaining its cause. In our xFAIR approach, protected attributes are represented by models learned from the other independent variables (and these models offer extrapolations over the space between existing examples). We then use the extrapolation models to relabel protected attributes later seen in testing data or at deployment time. Our approach aims to offset the biased predictions of the classification model by rebalancing the distribution of protected attributes. Results: The experiments in this paper show that, without compromising (original) model performance, xFAIR can achieve significantly better group and individual fairness (as measured by different metrics) than benchmark methods. Moreover, when compared to another instance-based rebalancing method, our model-based approach shows faster runtime and thus better scalability. Conclusion: Algorithmic decision bias can be removed via extrapolation that smooths away outlier points. As evidence, our proposed xFAIR is better, as measured by both fairness and performance metrics, than two state-of-the-art fairness algorithms.
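    A rough sketch of the relabelling step as we understand it, on synthetic data: a small model learns the protected attribute (column 0, hypothetical) from the other independent variables on training data, and its extrapolated output overwrites that column at test time before the downstream classifier runs.

    ```python
    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    rng = np.random.default_rng(3)
    X = rng.normal(size=(600, 6))
    X[:, 0] = (X[:, 1] + rng.normal(0, 0.5, 600) > 0)   # column 0: protected
    train, test = X[:500], X[500:]

    # Model the protected attribute from the remaining independent variables.
    extrapolator = DecisionTreeRegressor(max_depth=4)
    extrapolator.fit(train[:, 1:], train[:, 0])

    # Relabel: overwrite the protected column with the extrapolated value.
    test_relabelled = test.copy()
    test_relabelled[:, 0] = extrapolator.predict(test[:, 1:])
    # test_relabelled now feeds the (unchanged) downstream classifier.
    ```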
    Applying Physics-Informed Enhanced Super-Resolution Generative Adversarial Networks to Finite-Rate-Chemistry Flows and Predicting Lean Premixed Gas Turbine Combustors. (arXiv:2210.16219v1 [physics.flu-dyn])
    The accurate prediction of small scales in underresolved flows is still one of the main challenges in predictive simulations of complex configurations. Over the last few years, data-driven modeling has become popular in many fields, as large, often extensively labeled datasets are now available and training of large neural networks has become possible on graphics processing units (GPUs) that speed up the learning process tremendously. Nevertheless, the successful application of deep neural networks in fluid dynamics, such as for underresolved reactive flows, remains challenging. This work advances the recently introduced PIESRGAN to reactive finite-rate-chemistry flows. However, since combustion chemistry typically acts on the smallest scales, the original approach needs to be extended. Therefore, the modeling approach of PIESRGAN is modified to accurately account for the challenges in the context of laminar finite-rate-chemistry flows. The modified PIESRGAN-based model gives good agreement in a priori and a posteriori tests in a laminar lean premixed combustion setup. Furthermore, a reduced PIESRGAN-based model is presented that solves only the major species on a reconstructed field and employs PIESRGAN lookup for the remaining species, utilizing staggering in time. The advantages of the discriminator-supported training are shown, and the usability of the new model is demonstrated in the context of a model gas turbine combustor.
    Learning with Multigraph Convolutional Filters. (arXiv:2210.16272v1 [eess.SP])
    In this paper, we introduce a convolutional architecture to perform learning when information is supported on multigraphs. Exploiting algebraic signal processing (ASP), we propose a convolutional signal processing model on multigraphs (MSP). Then, we introduce multigraph convolutional neural networks (MGNNs) as stacked and layered structures where information is processed according to an MSP model. We also develop a procedure for tractable computation of filter coefficients in the MGNN and a low cost method to reduce the dimensionality of the information transferred between layers. We conclude by comparing the performance of MGNNs against other learning architectures on an optimal resource allocation task for multi-channel communication systems.
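    Under our reading of the MSP model, a multigraph convolution applies one polynomial filter per shift operator and sums the results; the sketch below makes that concrete in numpy, with random shift matrices and filter taps as placeholders.

    ```python
    import numpy as np

    def multigraph_conv(x, shifts, h):
        """x: (N,) signal; shifts: list of (N, N) shift/adjacency matrices;
        h: (num_graphs, K) filter taps. Returns the filtered signal."""
        y = np.zeros_like(x, dtype=float)
        for S, taps in zip(shifts, h):
            z = x.astype(float)                 # z holds S^k x as k grows
            for h_k in taps:
                y += h_k * z                    # accumulate h_k * S^k x
                z = S @ z
        return y

    N = 5
    rng = np.random.default_rng(4)
    S1 = rng.integers(0, 2, (N, N))             # placeholder shift operators
    S2 = rng.integers(0, 2, (N, N))
    y = multigraph_conv(rng.normal(size=N), [S1, S2], rng.normal(size=(2, 3)))
    ```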
    Aggregation in the Mirror Space (AIMS): Fast, Accurate Distributed Machine Learning in Military Settings. (arXiv:2210.16181v1 [cs.LG])
    Distributed machine learning (DML) can be an important capability for modern militaries, allowing them to take advantage of data and devices distributed at multiple vantage points to adapt and learn. Existing distributed machine learning frameworks, however, cannot realize the full benefits of DML because they are all based on simple linear aggregation, which cannot handle the $\textit{divergence challenges}$ arising in military settings: the learning data at different devices can be heterogeneous ($\textit{i.e.}$, Non-IID data), leading to model divergence, while the ability for devices to communicate is substantially limited ($\textit{i.e.}$, weak connectivity due to sparse and dynamic communications), reducing the ability for devices to reconcile model divergence. In this paper, we introduce a novel DML framework called aggregation in the mirror space (AIMS) that allows a DML system to introduce a general mirror function mapping a model into a mirror space in which aggregation and gradient descent are conducted. By adapting the convexity of the mirror function to the divergence force, AIMS allows automatic optimization of DML. We conduct both rigorous analysis and extensive experimental evaluations to demonstrate the benefits of AIMS. For example, we prove that AIMS achieves a loss of $O\left((\frac{m^{r+1}}{T})^{\frac1r}\right)$ after $T$ network-wide updates, where $m$ is the number of devices and $r$ the convexity of the mirror function, with existing linear aggregation frameworks being a special case with $r=2$. Our experimental evaluations using EMANE (Extendable Mobile Ad-hoc Network Emulator) for military communication settings show similar results: AIMS can improve DML convergence rate by up to 57\% and scale well to more devices with weak connectivity, all with little additional computation overhead compared to traditional linear aggregation.
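    As a toy illustration of aggregating in a mirror space, the sketch below uses the power map phi(w) = sign(w)|w|^(r-1) as the mirror function; this particular choice, and all parameters, are our assumptions for illustration rather than the paper's construction. Note that r=2 makes the map the identity, recovering plain linear averaging, which is consistent with the paper's remark that linear aggregation is the special case r=2.

    ```python
    import numpy as np

    def aggregate_in_mirror(models, r=4.0):
        """models: (m, d) array of device parameters; returns the aggregate."""
        W = np.asarray(models, dtype=float)
        mirrored = np.sign(W) * np.abs(W) ** (r - 1)          # to mirror space
        avg = mirrored.mean(axis=0)                           # linear step there
        return np.sign(avg) * np.abs(avg) ** (1.0 / (r - 1))  # map back

    models = np.array([[0.9, -0.1], [1.1, 0.0], [5.0, 0.2]])  # one divergent model
    print(aggregate_in_mirror(models, r=2.0))   # r=2 recovers the plain mean
    print(aggregate_in_mirror(models, r=4.0))   # higher r reweights divergence
    ```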
    Local Model Reconstruction Attacks in Federated Learning and their Uses. (arXiv:2210.16205v1 [cs.LG])
    In this paper, we initiate the study of local model reconstruction attacks for federated learning, where an honest-but-curious adversary eavesdrops on the messages exchanged between a targeted client and the server, and then reconstructs the local/personalized model of the victim. The local model reconstruction attack allows the adversary to trigger other classical attacks more effectively, since the local model depends only on the client's data and can leak more private information than the global model learned by the server. Additionally, we propose a novel model-based attribute inference attack in federated learning that leverages the local model reconstruction attack. We provide an analytical lower bound for this attribute inference attack. Empirical results using real-world datasets confirm that our local reconstruction attack works well for both regression and classification tasks. Moreover, we benchmark our novel attribute inference attack against the state-of-the-art attacks in federated learning. Our attack results in higher reconstruction accuracy, especially when the clients' datasets are heterogeneous. Our work provides a new angle for designing powerful and explainable attacks to effectively quantify the privacy risk in FL.
    Fairness Certificates for Differentially Private Classification. (arXiv:2210.16242v1 [cs.LG])
    In this work, we theoretically study the impact of differential privacy on fairness in binary classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on the probability that a decision function makes a negative prediction conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.
    Supervised Contrastive Learning for Respiratory Sound Classification. (arXiv:2210.16192v1 [cs.SD])
    Automatic respiratory sound classification using machine learning is a challenging task, due to large biological variability, imbalanced datasets, and a diversity of recording techniques used to capture the respiration signal. While datasets with annotated respiration cycles have been proposed, methods based on supervised learning using annotations alone may be limited in their generalization capability. In this study, we address this issue using supervised contrastive learning, relying on both respiration cycle annotations and SpecAugment, a spectrogram frequency and temporal masking method, to generate augmented samples for representation learning with a contrastive loss. We demonstrate that such an approach can outperform supervised learning in experiments on a convolutional neural network trained from scratch, achieving a new state of the art. Our work shows the potential of supervised contrastive learning in imbalanced and noisy settings. Our code is released at https://github.com/ilyassmoummad/scl_icbhi2017
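    The contrastive objective itself is standard; below is a compact PyTorch sketch of a supervised contrastive loss over a batch of embeddings, with the SpecAugment views and batch construction assumed rather than shown.

    ```python
    import torch
    import torch.nn.functional as F

    def supcon_loss(z, labels, tau=0.1):
        """z: (B, d) embeddings; labels: (B,) class ids."""
        z = F.normalize(z, dim=1)
        sim = z @ z.t() / tau                              # pairwise similarities
        mask = labels[:, None].eq(labels[None, :]).float()
        mask.fill_diagonal_(0)                             # positives, minus self
        logits = sim - sim.max(dim=1, keepdim=True).values.detach()
        exp = torch.exp(logits) * (1 - torch.eye(len(z)))  # drop self in denom
        log_prob = logits - torch.log(exp.sum(dim=1, keepdim=True))
        pos_per_row = mask.sum(dim=1).clamp(min=1)
        return -((mask * log_prob).sum(dim=1) / pos_per_row).mean()

    loss = supcon_loss(torch.randn(8, 32), torch.randint(0, 2, (8,)))
    ```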
    Data-driven discovery of Green's functions. (arXiv:2210.16016v1 [math.NA])
    Discovering hidden partial differential equations (PDEs) and operators from data is an important topic at the frontier between machine learning and numerical analysis. This doctoral thesis introduces theoretical results and deep learning algorithms to learn Green's functions associated with linear partial differential equations and rigorously justify PDE learning techniques. A theoretically rigorous algorithm is derived to obtain a learning rate, which characterizes the amount of training data needed to approximately learn Green's functions associated with elliptic PDEs. The construction connects the fields of PDE learning and numerical linear algebra by extending the randomized singular value decomposition to non-standard Gaussian vectors and Hilbert--Schmidt operators, and exploiting the low-rank hierarchical structure of Green's functions using hierarchical matrices. Rational neural networks (NNs) are introduced and consist of neural networks with trainable rational activation functions. The highly compositional structure of these networks, combined with rational approximation theory, implies that rational functions have higher approximation power than standard activation functions. In addition, rational NNs may have poles and take arbitrarily large values, which is ideal for approximating functions with singularities such as Green's functions. Finally, theoretical results on Green's functions and rational NNs are combined to design a human-understandable deep learning method for discovering Green's functions from data. This approach complements state-of-the-art PDE learning techniques, as a wide range of physics can be captured from the learned Green's functions such as dominant modes, symmetries, and singularity locations.
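    A hedged PyTorch sketch of a trainable rational activation R(x) = P(x)/Q(x): the (3, 2) degree choice is the convention we associate with rational NNs, and the absolute value in the denominator is an assumed safeguard against poles during training rather than a detail taken from the thesis.

    ```python
    import torch
    import torch.nn as nn

    class Rational(nn.Module):
        def __init__(self):
            super().__init__()
            # trainable numerator (degree 3) and denominator (degree 2) coefficients
            self.p = nn.Parameter(torch.tensor([0.0, 1.0, 0.0, 0.1]))
            self.q = nn.Parameter(torch.tensor([0.0, 0.1]))

        def forward(self, x):
            num = sum(c * x ** i for i, c in enumerate(self.p))
            # 1 + |q_1 x + q_2 x^2| keeps the denominator strictly positive
            den = 1.0 + torch.abs(sum(c * x ** (i + 1)
                                      for i, c in enumerate(self.q)))
            return num / den

    net = nn.Sequential(nn.Linear(2, 16), Rational(), nn.Linear(16, 1))
    y = net(torch.randn(4, 2))
    ```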
    Toward Reliable Neural Specifications. (arXiv:2210.16114v1 [cs.LG])
    Having reliable specifications is an unavoidable challenge in achieving verifiable correctness, robustness, and interpretability of AI systems. Existing specifications for neural networks follow the paradigm of data as specification: the local neighborhood centered around a reference input is considered to be correct (or robust). However, our empirical study shows that such a specification is extremely overfitted, since usually no data points from the testing set lie in the certified region of the reference input, making it impractical for real-world applications. We propose a new family of specifications called neural representation as specification, which uses the intrinsic information of neural networks -- neural activation patterns (NAPs) -- rather than input data to specify the correctness and/or robustness of neural network predictions. We present a simple statistical approach to mining dominant neural activation patterns. We analyze NAPs from a statistical point of view and find that a single NAP can cover a large number of training and testing data points, whereas ad hoc data-as-specification covers only the given reference data point. To show the effectiveness of discovered NAPs, we formally verify several important properties, such as that various types of misclassification never happen for a given NAP and that there is no ambiguity between different NAPs. We show that by using NAPs, we can verify predictions over the entire input space while still recalling 84% of the data. Thus, we argue that using NAPs is a more reliable and extensible specification for neural network verification.
    Space-Time Graph Neural Networks with Stochastic Graph Perturbations. (arXiv:2210.16270v1 [cs.LG])
    Space-time graph neural networks (ST-GNNs) are recently developed architectures that learn efficient graph representations of time-varying data. ST-GNNs are particularly useful in multi-agent systems, due to their stability properties and their ability to respect communication delays between the agents. In this paper we revisit the stability properties of ST-GNNs and prove that they are stable to stochastic graph perturbations. Our analysis suggests that ST-GNNs are suitable for transfer learning on time-varying graphs and enables the design of generalized convolutional architectures that jointly process time-varying graphs and time-varying signals. Numerical experiments on decentralized control systems validate our theoretical results and showcase the benefits of traditional and generalized ST-GNN architectures.
    Multimodal Transformer for Parallel Concatenated Variational Autoencoders. (arXiv:2210.16174v1 [cs.LG])
    In this paper, we propose a multimodal transformer using a parallel concatenated architecture. Instead of using patches, we use column stripes for images in the R, G, B channels as the transformer input. The column stripes keep the spatial relations of the original image. We incorporate the multimodal transformer with a variational autoencoder for synthetic cross-modal data generation. The multimodal transformer is designed using multiple compression matrices, and it serves as the encoders for Parallel Concatenated Variational AutoEncoders (PC-VAE). The PC-VAE consists of multiple encoders, one latent space, and two decoders. The encoders are based on random Gaussian matrices and do not need any training. We propose a new loss function based on the interaction information from partial information decomposition. The interaction information evaluates the input cross-modal information and the decoder output. The PC-VAE is trained by minimizing the loss function. Experiments are performed to validate the proposed multimodal transformer for PC-VAE.
    Physics-Informed Convolutional Neural Networks for Corruption Removal on Dynamical Systems. (arXiv:2210.16215v1 [physics.flu-dyn])
    Measurements on dynamical systems, experimental or otherwise, are often subjected to inaccuracies capable of introducing corruption; removal of which is a problem of fundamental importance in the physical sciences. In this work we propose physics-informed convolutional neural networks for stationary corruption removal, providing the means to extract physical solutions from data, given access to partial ground-truth observations at collocation points. We showcase the methodology for 2D incompressible Navier-Stokes equations in the chaotic-turbulent flow regime, demonstrating robustness to modality and magnitude of corruption.
    A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks. (arXiv:2210.16286v1 [cs.LG])
    To understand the training dynamics of neural networks (NNs), prior studies have considered the infinite-width mean-field (MF) limit of two-layer NN, establishing theoretical guarantees of its convergence under gradient flow training as well as its approximation and generalization capabilities. In this work, we study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed. To define the limiting model rigorously, we generalize the MF theory of two-layer NNs by treating the neurons as belonging to functional spaces. Then, by writing the MF training dynamics as a kernel gradient flow with a time-varying kernel that remains positive-definite, we prove that its training loss in $L_2$ regression decays to zero at a linear rate. Furthermore, we define function spaces that include the solutions obtainable through the MF training dynamics and prove Rademacher complexity bounds for these spaces. Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors while both exhibiting feature learning.
    Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis. (arXiv:2210.15898v1 [cs.LG])
    We derive and solve an "Equation of Motion" (EoM) for deep neural networks (DNNs), a differential equation that precisely describes the discrete learning dynamics of DNNs. Differential equations are continuous but have played a prominent role even in the study of discrete optimization (gradient descent (GD) algorithms). However, gaps still exist between differential equations and the actual learning dynamics of DNNs due to discretization error. In this paper, we start from gradient flow (GF) and derive a counter term that cancels the discretization error between GF and GD. As a result, we obtain EoM, a continuous differential equation that precisely describes the discrete learning dynamics of GD. We also derive the discretization error to show to what extent EoM is precise. In addition, we apply EoM to two specific cases: scale- and translation-invariant layers. EoM highlights differences between continuous-time and discrete-time GD, indicating the importance of the counter term for a better description of the discrete learning dynamics of GD. Our experimental results support our theoretical findings.
    An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization. (arXiv:2210.16099v1 [q-bio.BM])
    Molecule optimization is an important problem in chemical discovery and has been approached using many techniques, including generative modeling, reinforcement learning, genetic algorithms, and much more. Recent work has also applied zeroth-order (ZO) optimization, a subset of gradient-free optimization that solves problems similarly to gradient-based methods, for optimizing latent vector representations from an autoencoder. In this paper, we study the effectiveness of various ZO optimization methods for optimizing molecular objectives, which are characterized by variable smoothness, infrequent optima, and other challenges. We provide insights on the robustness of various ZO optimizers in this setting, show the advantages of ZO sign-based gradient descent (ZO-signGD), discuss how ZO optimization can be used practically in realistic discovery tasks, and demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite. Code is available at: https://github.com/IBM/QMO-bench.
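    A toy numpy sketch of ZO sign-based gradient descent: the gradient is estimated with symmetric finite differences along random probe directions, and the update follows only its sign. The quadratic objective stands in for a molecular score, and all constants are illustrative.

    ```python
    import numpy as np

    rng = np.random.default_rng(5)
    f = lambda x: np.sum((x - 1.0) ** 2)         # placeholder objective

    def zo_sign_step(x, lr=0.05, mu=1e-3, q=10):
        g = np.zeros_like(x)
        for _ in range(q):                       # q random probe directions
            u = rng.normal(size=x.shape)
            g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
        return x - lr * np.sign(g / q)           # step along the sign only

    x = np.zeros(8)
    for _ in range(200):
        x = zo_sign_step(x)
    print(f(x))                                  # should be near 0
    ```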
    M3FGM:a node masking and multi-granularity message passing-based federated graph model for spatial-temporal data prediction. (arXiv:2210.16193v1 [cs.LG])
    Researchers are addressing the challenges of spatial-temporal prediction by combining Federated Learning (FL) and graph models under privacy and security constraints. However, several issues remain unattended: 1) clients might not be able to access the server during the inference phase; 2) the graph of clients, designed manually in the server model, may not reveal the proper relationships between clients. This paper proposes a new embeddings-aggregation-structured FL approach named node Masking and Multi-granularity Message passing-based Federated Graph Model (M3FGM) for the above issues. The server model of M3FGM employs a MaskNode layer to simulate the case of offline clients. We also redesign the decoder of the client model using a dual-sub-decoder structure so that each client model can use its local data to predict independently when offline. As for the second issue, a new GNN layer named Multi-Granularity Message Passing (MGMP) allows each client node to perceive global and local information. We conducted extensive experiments in two different scenarios on two real traffic datasets. Results show that the proposed model outperforms the baselines and variant models, achieving the best results in both scenarios.
    Addressing Bias in Face Detectors using Decentralised Data collection with incentives. (arXiv:2210.16024v1 [cs.CV])
    Recent developments in machine learning have shown that successful models rely not only on huge amounts of data but on the right kind of data. In this paper, we show how this data-centric approach can be facilitated in a decentralized manner to enable efficient data collection for algorithms. Face detectors are a class of models that suffer heavily from bias issues, as they have to work on a large variety of different data. We also propose a face detection and anonymization approach using a hybrid MultiTask Cascaded CNN with FaceNet Embeddings, benchmarking multiple datasets to describe and evaluate the bias in the models towards different ethnicities, genders, and age groups. Finally, we discuss ways to enrich fairness through a decentralized system of data labeling, correction, and verification by users, creating a robust pipeline for model retraining.
    Deep Learning Object Detection Approaches to Source Identification. (arXiv:2210.16173v1 [cs.SD])
    Traditionally, source identification is solved using threshold-based energy detection algorithms. These algorithms frequently sum up the activity in regions and consider regions above a specific activity threshold to be sources. While these algorithms work for the majority of cases, they often fail to detect signals that occupy small frequency bands, fail to distinguish sources with overlapping frequency bands, and cannot detect any signals below a specified signal-to-noise ratio. By converting raw signal data to spectrograms, source identification can be framed as an object detection problem. Leveraging modern advancements in deep learning based object detection, we propose a system that alleviates the failure cases encountered when using traditional source identification algorithms. Our contributions include framing source identification as an object detection problem, the publication of a spectrogram object detection dataset, and an evaluation of the RetinaNet and YOLOv5 object detection models trained on the dataset. Our final models achieve Mean Average Precisions of up to 0.906. With such a high Mean Average Precision, these models are sufficiently robust for use in real-world applications.
    Convergence analysis of a quasi-Monte Carlo-based deep learning algorithm for solving partial differential equations. (arXiv:2210.16196v1 [math.NA])
    Deep learning methods have achieved great success in solving partial differential equations (PDEs), where the loss is often defined as an integral. The accuracy and efficiency of these algorithms depend greatly on the quadrature method. We propose to apply quasi-Monte Carlo (QMC) methods to the Deep Ritz Method (DRM) for solving the Neumann problems for the Poisson equation and the static Schr\"{o}dinger equation. For error estimation, we decompose the error of using the deep learning algorithm to solve PDEs into the generalization error, the approximation error and the training error. We establish the upper bounds and prove that QMC-based DRM achieves an asymptotically smaller error bound than DRM. Numerical experiments show that the proposed method converges faster in all cases and the variances of the gradient estimators of randomized QMC-based DRM are much smaller than those of DRM, which illustrates the superiority of QMC in deep learning over MC.
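    The MC-versus-QMC swap at the heart of the method is easy to illustrate: below, a toy separable integrand on the unit square (a stand-in for a Deep Ritz energy term) is integrated with pseudo-random points and with SciPy's scrambled Sobol points; the QMC estimate is typically far more accurate at the same sample count.

    ```python
    import numpy as np
    from scipy.stats import qmc

    # toy integrand with known integral over [0,1]^2: (2/pi)^2 = 4/pi^2
    integrand = lambda p: np.sin(np.pi * p[:, 0]) * np.sin(np.pi * p[:, 1])
    exact = 4 / np.pi ** 2

    mc = np.random.default_rng(6).random((1024, 2))          # pseudo-random
    sobol = qmc.Sobol(d=2, scramble=True, seed=6).random(1024)  # QMC points

    print(abs(integrand(mc).mean() - exact))     # MC error
    print(abs(integrand(sobol).mean() - exact))  # typically much smaller
    ```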
    Feature Engineering vs BERT on Twitter Data. (arXiv:2210.16168v1 [cs.CL])
    In this paper, we compare the performance of traditional machine learning models, using feature engineering and word vectors, against the state-of-the-art language model BERT, using word embeddings, on three datasets. We also consider the time and cost efficiency of feature engineering compared to BERT. From our results, we conclude that the use of the BERT model was worth the time and cost trade-off for only one of the three datasets, where it significantly outperformed any traditional classifier using feature vectors instead of embeddings. On the other datasets, the BERT model achieved an increase of only 0.03 and 0.05 in accuracy and F1 score, respectively, which arguably does not justify the time and cost of a GPU.
    Goal Exploration Augmentation via Pre-trained Skills for Sparse-Reward Long-Horizon Goal-Conditioned Reinforcement Learning. (arXiv:2210.16058v1 [cs.LG])
    Reinforcement learning (RL) often struggles to accomplish a sparse-reward long-horizon task in a complex environment. Goal-conditioned reinforcement learning (GCRL) has been employed to tackle this difficult problem via a curriculum of easy-to-reach sub-goals. In GCRL, exploring novel sub-goals is essential for the agent to ultimately find the pathway to the desired goal. How to explore novel sub-goals efficiently is one of the most challenging issues in GCRL. Several goal exploration methods have been proposed to address this issue but still struggle to find the desired goals efficiently. In this paper, we propose a novel learning objective that optimizes the entropy of both achieved and new goals to be explored, for more efficient goal exploration in sub-goal-selection-based GCRL. To optimize this objective, we first explore and exploit the frequently occurring goal-transition patterns mined in environments similar to the current task to compose skills via skill learning. Then, the pretrained skills are applied in goal exploration. Evaluation on a variety of sparse-reward long-horizon benchmark tasks suggests that incorporating our method into several state-of-the-art GCRL baselines significantly boosts their exploration efficiency while improving or maintaining their performance. The source code is available at: https://github.com/GEAPS/GEAPS.
    Towards prediction of turbulent flows at high Reynolds numbers using high performance computing data and deep learning. (arXiv:2210.16110v1 [physics.flu-dyn])
    In this paper, deep learning (DL) methods are evaluated in the context of turbulent flows. Various generative adversarial networks (GANs) are discussed with respect to their suitability for understanding and modeling turbulence. Wasserstein GANs (WGANs) are then chosen to generate small-scale turbulence. Highly resolved direct numerical simulation (DNS) turbulent data is used for training the WGANs and the effect of network parameters, such as learning rate and loss function, is studied. Qualitatively good agreement between DNS input data and generated turbulent structures is shown. A quantitative statistical assessment of the predicted turbulent fields is performed.
    Stop Measuring Calibration When Humans Disagree. (arXiv:2210.16133v1 [cs.CL])
    Calibration is a popular framework to evaluate whether a classifier knows when it does not know - i.e., its predictive probabilities are a good indication of how likely a prediction is to be correct. Correctness is commonly estimated against the human majority class. Recently, calibration to human majority has been measured on tasks where humans inherently disagree about which class applies. We show that measuring calibration to human majority given inherent disagreements is theoretically problematic, demonstrate this empirically on the ChaosNLI dataset, and derive several instance-level measures of calibration that capture key statistical properties of human judgements - class frequency, ranking and entropy.
    Nonparametric Probabilistic Regression with Coarse Learners. (arXiv:2210.16247v1 [cs.LG])
    Probabilistic Regression refers to predicting a full probability density function for the target conditional on the features. We present a nonparametric approach to this problem which combines base classifiers (typically gradient boosted forests) trained on different coarsenings of the target value. By combining such classifiers and averaging the resulting densities, we are able to compute precise conditional densities with minimal assumptions on the shape or form of the density. We combine this approach with a structured cross-entropy loss function which serves to regularize and smooth the resulting densities. Prediction intervals computed from these densities are shown to have high fidelity in practice. Furthermore, examining the properties of these densities on particular observations can provide valuable insight. We demonstrate this approach on a variety of datasets and show competitive performance, particularly on larger datasets.
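    A rough sklearn sketch of the core construction as we read it: discretize the target at several resolutions, train one classifier per coarsening, and average the implied piecewise-constant densities. The bin counts, model choice, and synthetic data are illustrative assumptions, and the paper's structured cross-entropy regularizer is not reproduced here.

    ```python
    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    rng = np.random.default_rng(7)
    X = rng.normal(size=(2000, 3))
    y = X[:, 0] + 0.3 * rng.normal(size=2000)
    lo, hi = y.min(), y.max()

    def density_on_grid(n_bins, x_query, grid):
        edges = np.linspace(lo, hi, n_bins + 1)
        labels = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
        clf = GradientBoostingClassifier(n_estimators=50).fit(X, labels)
        p = np.zeros(n_bins)
        p[clf.classes_] = clf.predict_proba(x_query)[0]   # bin probabilities
        widths = np.diff(edges)
        cell = np.clip(np.digitize(grid, edges) - 1, 0, n_bins - 1)
        return p[cell] / widths[cell]          # piecewise-constant density

    grid = np.linspace(lo, hi, 200)
    x0 = X[:1]                                  # query point
    dens = np.mean([density_on_grid(b, x0, grid) for b in (4, 8, 16)], axis=0)
    ```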
    Evaluating the Impact of Loss Function Variation in Deep Learning for Classification. (arXiv:2210.16003v1 [cs.LG])
    The loss function is arguably among the most important hyperparameters for a neural network. Many loss functions have been designed to date, making a correct choice nontrivial. However, elaborate justifications for the choice of loss function are rarely made in related work. This is, as we see it, an indication of a dogmatic mindset in the deep learning community that lacks empirical foundation. In this work, we consider deep neural networks in a supervised classification setting and analyze the impact the choice of loss function has on the training result. While certain loss functions perform suboptimally, our work empirically shows that under-represented losses such as the KL divergence can significantly outperform the state-of-the-art choices, highlighting the need to treat the loss function as a tuned hyperparameter rather than a fixed choice.
    Improving Multi-class Classifier Using Likelihood Ratio Estimation with Regularization. (arXiv:2210.16033v1 [cs.LG])
    The universal-set naive Bayes classifier (UNB) \cite{Komiya:13}, defined using likelihood ratios (LRs), was proposed to address imbalanced classification problems. However, the LR estimator used in the UNB overestimates LRs for low-frequency data, degrading classification performance. Our previous study \cite{Kikuchi:19} proposed an LR estimator that is effective even for low-frequency data. This estimator uses regularization to suppress the overestimation, but that work did not consider imbalanced data. In this paper, we integrate the estimator with the UNB. Our experiments with imbalanced data show that the proposed classifier effectively adjusts the classification scores according to the class balance using regularization parameters and improves classification performance.
    Bias and unfairness in machine learning models: a systematic literature review. (arXiv:2202.08176v3 [cs.LG] UPDATED)
    One of the difficulties of artificial intelligence is ensuring that model decisions are fair and free of bias. In research, datasets, metrics, techniques, and tools are applied to detect and mitigate algorithmic unfairness and bias. This study aims to examine existing knowledge on bias and unfairness in machine learning models, identifying mitigation methods, fairness metrics, and supporting tools. A systematic literature review found 40 eligible articles published between 2017 and 2022 in the Scopus, IEEE Xplore, Web of Science, and Google Scholar knowledge bases. The results show numerous approaches for detecting and mitigating bias and unfairness in ML technologies, with metrics that are clearly defined in the literature but vary widely. We recommend further research to define the techniques and metrics that should be employed in each case, in order to standardize and ensure the impartiality of machine learning models and allow the most appropriate metric to be chosen for detecting bias and unfairness in a given context.
    LegoNet: A Fast and Exact Unlearning Architecture. (arXiv:2210.16023v1 [cs.LG])
    Machine unlearning aims to erase the impact of specific training samples from a trained model upon deletion requests. Re-training the model on the retained data after deletion is effective but not efficient, due to the huge number of model parameters and re-training samples. To speed this up, a natural way is to reduce the number of such parameters and samples. However, this strategy typically leads to a loss in model performance, posing the challenge of increasing unlearning efficiency while maintaining acceptable performance. In this paper, we present a novel network, namely \textit{LegoNet}, which adopts the framework of "fixed encoder + multiple adapters". We fix the encoder (i.e., the backbone for representation learning) of LegoNet to reduce the parameters that need to be re-trained during unlearning. Since the encoder occupies a major part of the model parameters, unlearning efficiency is significantly improved. However, fixing the encoder empirically leads to a significant performance drop. To compensate for the performance loss, we adopt an ensemble of multiple adapters, which are independent sub-models that infer the prediction from the encoding (i.e., the output of the encoder). Furthermore, we design an activation mechanism for the adapters to further trade off unlearning efficiency against model performance. This mechanism guarantees that each sample can impact only very few adapters, so during unlearning, the parameters and samples that need to be re-trained are both reduced. The empirical experiments verify that LegoNet accomplishes fast and exact unlearning while maintaining acceptable performance, overall outperforming unlearning baselines.  ( 3 min )
    An Adversarial Active Sampling-based Data Augmentation Framework for Manufacturable Chip Design. (arXiv:2210.15765v1 [cs.LG])
    Lithography modeling is a crucial problem in chip design: it ensures that a chip design mask is manufacturable. It requires rigorous simulations of optical and chemical models that are computationally expensive. Recent developments in machine learning have provided alternative solutions that replace time-consuming lithography simulations with deep neural networks. However, the considerable accuracy drop still impedes industrial adoption. Most importantly, the quality and quantity of the training dataset directly affect model performance. To tackle this problem, we propose a litho-aware data augmentation (LADA) framework to resolve the dilemma of limited data and improve machine learning model performance. First, we pretrain the neural networks for lithography modeling and a gradient-friendly StyleGAN2 generator. We then perform adversarial active sampling to generate informative, synthetic, in-distribution mask designs. These synthetic mask images augment the original limited training dataset used to finetune the lithography model for improved performance. Experimental results demonstrate that LADA can successfully exploit the neural network capacity by narrowing the performance gap between the training and testing data instances.  ( 2 min )
    Uncalibrated Models Can Improve Human-AI Collaboration. (arXiv:2202.05983v3 [cs.AI] UPDATED)
    In many practical applications of AI, an AI model is used as a decision aid for human users. The AI provides advice that a human (sometimes) incorporates into their decision-making process. The AI advice is often presented with some measure of "confidence" that the human can use to calibrate how much they depend on or trust the advice. In this paper, we present an initial exploration suggesting that showing AI models as more confident than they actually are, even when the original AI is well-calibrated, can improve human-AI performance (measured as the accuracy and confidence of the human's final prediction after seeing the AI advice). We first train a model to predict human incorporation of AI advice using data from thousands of human-AI interactions. This enables us to explicitly estimate how to transform the AI's prediction confidence, making the AI uncalibrated, in order to improve the final human prediction. We empirically validate our results across four different tasks -- dealing with images, text, and tabular data -- involving hundreds of human participants. We further support our findings with simulation analysis. Our findings suggest the importance of jointly optimizing the human-AI system as opposed to the standard paradigm of optimizing the AI model alone.
    DELFI: Deep Mixture Models for Long-term Air Quality Forecasting in the Delhi National Capital Region. (arXiv:2210.15923v1 [cs.LG])
    The identification and control of human factors in climate change is a rapidly growing concern, and robust, real-time air-quality monitoring and forecasting plays a critical role in allowing effective policy formulation and implementation. This paper presents DELFI, a novel deep learning-based mixture model to make effective long-term predictions of Particulate Matter (PM) 2.5 concentrations. A key novelty in DELFI is its multi-scale approach to the forecasting problem. The observation that point predictions are more suitable in the short term and probabilistic predictions in the long term allows accurate predictions to be made as much as 24 hours in advance. DELFI incorporates meteorological data as well as pollutant-based features to ensure a robust model that is divided into two parts: (i) a stack of three Long Short-Term Memory (LSTM) networks that perform differential modelling of the same window of past data, and (ii) a fully-connected layer enabling attention to each of the components. Experimental evaluation based on the deployment of 13 stations in the Delhi National Capital Region (Delhi-NCR) in India establishes that DELFI offers far superior predictions, especially in the long term, compared to even non-parametric baselines. The Delhi-NCR recorded the 3rd highest PM levels amongst 39 mega-cities across the world during 2011-2015, and DELFI's performance establishes it as a potential tool for effective long-term forecasting of PM levels to enable public health management and environment protection.  ( 3 min )
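    A loose sketch of the two-part design described above (three LSTMs over the same window plus a fully-connected attention layer), under assumed tensor shapes; DELFI's differential modelling and probabilistic long-term head are richer than this.

        import torch
        import torch.nn as nn

        class DelfiSketch(nn.Module):
            def __init__(self, n_features, hidden=64):
                super().__init__()
                # (i) three LSTMs that each model the same window of past data
                self.lstms = nn.ModuleList(
                    [nn.LSTM(n_features, hidden, batch_first=True) for _ in range(3)]
                )
                # (ii) a fully-connected layer producing attention weights over the components
                self.attn = nn.Linear(3 * hidden, 3)
                self.head = nn.Linear(hidden, 1)         # PM2.5 point forecast

            def forward(self, x):                        # x: (batch, time, n_features)
                hs = [lstm(x)[0][:, -1] for lstm in self.lstms]   # last hidden state per LSTM
                w = torch.softmax(self.attn(torch.cat(hs, dim=-1)), dim=-1)
                mixed = sum(w[:, i:i + 1] * hs[i] for i in range(3))  # attention-weighted mix
                return self.head(mixed)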
    Predicting Protein-Ligand Binding Affinity with Equivariant Line Graph Network. (arXiv:2210.16098v1 [q-bio.BM])
    Binding affinity prediction of three-dimensional (3D) protein-ligand complexes is critical for drug repositioning and virtual drug screening. Existing approaches transform a 3D protein-ligand complex into a two-dimensional (2D) graph, and then use graph neural networks (GNNs) to predict its binding affinity. However, the node and edge features of the 2D graph are extracted based on invariant local coordinate systems of the 3D complex. As a result, these methods cannot fully learn the global information of the complex, such as the physical symmetry and the topological information of bonds. To address these issues, we propose a novel Equivariant Line Graph Network (ELGN) for affinity prediction of 3D protein-ligand complexes. The proposed ELGN first adds a super node to the 3D complex, and then builds a line graph based on the 3D complex. After that, ELGN uses a new E(3)-equivariant network layer to pass messages between nodes and edges based on the global coordinate system of the 3D complex. Experimental results on two real datasets demonstrate the effectiveness of ELGN over several state-of-the-art baselines.
    Generalizing Clinical Trials with Convex Hulls. (arXiv:2111.13229v2 [stat.ML] UPDATED)
    Randomized clinical trials eliminate confounding but impose strict exclusion criteria that limit recruitment to a subset of the population. Observational datasets are more inclusive but suffer from confounding -- often providing overly optimistic estimates of treatment response over time due to partially optimized physician prescribing patterns. We therefore assume that the unconfounded treatment response lies somewhere in-between the observational estimate before and the observational estimate after treatment assignment. This assumption allows us to extrapolate results from exclusive trials to the broader population by analyzing observational and trial data simultaneously using an algorithm called Optimum in Convex Hulls (OCH). OCH represents the treatment effect either in terms of convex hulls of conditional expectations or convex hulls (also known as mixtures) of conditional densities. The algorithm first learns the component expectations or densities using the observational data and then learns the linear mixing coefficients using trial data in order to approximate the true treatment effect; importantly, theory explains why this linear combination should hold. OCH estimates the treatment effect in terms of both expectations and densities with state-of-the-art accuracy.
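    A minimal sketch of the convex-hull idea for conditional expectations, assuming hypothetical component estimators f_pre and f_post already fit on observational data before and after treatment assignment; the simplex-constrained mixing weights are then learned on trial data. This is an illustration of the principle, not OCH itself.

        import numpy as np
        from scipy.optimize import minimize

        def fit_mixing_coefficients(f_pre, f_post, x_trial, y_trial):
            # component expectations evaluated on the trial covariates
            F = np.column_stack([f_pre(x_trial), f_post(x_trial)])

            def loss(w):
                return np.mean((y_trial - F @ w) ** 2)

            # weights constrained to the simplex, so the estimate stays in the convex hull
            res = minimize(loss, x0=np.array([0.5, 0.5]), method="SLSQP",
                           bounds=[(0, 1), (0, 1)],
                           constraints={"type": "eq", "fun": lambda w: w.sum() - 1})
            return res.x

        # usage: w = fit_mixing_coefficients(f_pre, f_post, x_trial, y_trial)
        # unconfounded estimate at x: w[0] * f_pre(x) + w[1] * f_post(x)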
    Understanding Adverse Biological Effect Predictions Using Knowledge Graphs. (arXiv:2210.15985v1 [cs.AI])
    Extrapolation of adverse biological (toxic) effects of chemicals is an important contribution to expand available hazard data in (eco)toxicology without the use of animals in laboratory experiments. In this work, we extrapolate effects based on a knowledge graph (KG) consisting of the most relevant effect data as domain-specific background knowledge. An effect prediction model, with and without background knowledge, was used to predict the mean adverse biological effect concentration of chemicals as a prototypical type of stressor. The background knowledge improves the model prediction performance by up to 40% in terms of $R^2$ (i.e., the coefficient of determination). We use the KG and KG embeddings to provide quantitative and qualitative insights into the predictions. These insights are expected to improve the confidence in effect prediction. Larger-scale implementation of such extrapolation models can be expected to support hazard and risk assessment by simplifying and reducing testing needs.  ( 2 min )
    Do ideas have shape? Idea registration as the continuous limit of artificial neural networks. (arXiv:2008.03920v3 [stat.ML] UPDATED)
    We introduce a GP generalization of ResNets (including ResNets as a particular case). We show that ResNets (and their GP generalization) converge, in the infinite depth limit, to a generalization of image registration variational algorithms. Whereas computational anatomy aligns images via warping of the material space, this generalization aligns ideas (or abstract shapes as in Plato's theory of forms) via the warping of the RKHS of functions mapping the input space to the output space. While the Hamiltonian interpretation of ResNets is not new, it was based on an Ansatz. We do not rely on this Ansatz and present the first rigorous proof of convergence of ResNets with trained weights and biases towards a Hamiltonian dynamics driven flow. Our constructive proof reveals several remarkable properties of ResNets and their GP generalization. ResNet regressors are kernel regressors with data-dependent warping kernels. Minimizers of $L_2$ regularized ResNets satisfy a discrete least action principle implying the near preservation of the norm of weights and biases across layers. The trained weights of ResNets with $L^2$ regularization can be identified by solving an autonomous Hamiltonian system. The trained ResNet parameters are unique up to the initial momentum whose representation is generally sparse. The kernel regularization strategy provides a provably robust alternative to Dropout for ANNs. We introduce a functional generalization of GPs leading to error estimates for ResNets. We identify the (EPDiff) mean-field limit of trained ResNet parameters. We show that the composition of warping regression blocks with reduced equivariant multichannel kernels (introduced here) recovers and generalizes CNNs to arbitrary spaces and groups of transformations.
    Localized Randomized Smoothing for Collective Robustness Certification. (arXiv:2210.16140v1 [cs.LG])
    Models for image segmentation, node classification and many other tasks map a single input to multiple labels. By perturbing this single shared input (e.g. the image) an adversary can manipulate several predictions (e.g. misclassify several pixels). Collective robustness certification is the task of provably bounding the number of robust predictions under this threat model. The only dedicated method that goes beyond certifying each output independently is limited to strictly local models, where each prediction is associated with a small receptive field. We propose a more general collective robustness certificate for all types of models and further show that this approach is beneficial for the larger class of softly local models, where each output is dependent on the entire input but assigns different levels of importance to different input regions (e.g. based on their proximity in the image). The certificate is based on our novel localized randomized smoothing approach, where the random perturbation strength for different input regions is proportional to their importance for the outputs. Localized smoothing Pareto-dominates existing certificates on both image segmentation and node classification tasks, simultaneously offering higher accuracy and stronger guarantees.  ( 2 min )
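    A minimal sketch of the central smoothing idea, with an illustrative distance-based scaling rule rather than the paper's certificate: noise is weak on input regions important to a given output and strong elsewhere.

        import numpy as np

        def localized_gaussian_noise(image, center, sigma_min=0.1, sigma_max=1.0):
            # image: (H, W) array; center: the pixel a given output depends on most
            h, w = image.shape
            yy, xx = np.mgrid[0:h, 0:w]
            dist = np.hypot(yy - center[0], xx - center[1])
            # nearby (important) pixels get weak noise, distant pixels strong noise
            sigma = sigma_min + (sigma_max - sigma_min) * dist / dist.max()
            return image + np.random.randn(h, w) * sigma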
    Applying Physics-Informed Enhanced Super-Resolution Generative Adversarial Networks to Turbulent Non-Premixed Combustion on Non-Uniform Meshes and Demonstration of an Accelerated Simulation Workflow. (arXiv:2210.16248v1 [physics.flu-dyn])
    This paper extends the methodology to use physics-informed enhanced super-resolution generative adversarial networks (PIESRGANs) for LES subfilter modeling in turbulent flows with finite-rate chemistry and shows a successful application to a non-premixed temporal jet case. This is an important topic considering the need for more efficient and carbon-neutral energy devices to fight climate change. Multiple a priori and a posteriori results are presented and discussed. As part of this, the impact of the underlying mesh on the prediction quality is emphasized, and a multi-mesh approach is developed. It is demonstrated how LES based on PIESRGAN can be employed to predict cases at Reynolds numbers which were not used for training. Finally, the amount of data needed for a successful prediction is elaborated.
    Differentially Private CutMix for Split Learning with Vision Transformer. (arXiv:2210.15986v1 [cs.DC])
    Recently, vision transformer (ViT) has started to outpace the conventional CNN in computer vision tasks. Considering privacy-preserving distributed learning with ViT, federated learning (FL) communicates models, which becomes ill-suited due to ViT's large model size and computing costs. Split learning (SL) detours this by communicating smashed data at a cut-layer, yet suffers from data privacy leakage and large communication costs caused by high similarity between ViT's smashed data and input data. Motivated by this problem, we propose DP-CutMixSL, a differentially private (DP) SL framework, by developing DP patch-level randomized CutMix (DP-CutMix), a novel privacy-preserving inter-client interpolation scheme that replaces randomly selected patches in smashed data. Experimentally, we show that DP-CutMixSL not only boosts privacy guarantees and communication efficiency, but also achieves higher accuracy than its vanilla SL counterpart. Theoretically, we analyze that DP-CutMix amplifies Rényi DP (RDP), which is upper-bounded by its vanilla Mixup counterpart.
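    A minimal sketch of patch-level CutMix between two clients' cut-layer activations; the tensor shapes and mixing ratio are assumptions, and the clipping/noise needed for the formal DP guarantee is omitted here.

        import torch

        def patch_cutmix(smashed_a, smashed_b, ratio=0.5):
            # smashed_*: (batch, num_patches, dim) cut-layer activations from two clients
            num_patches = smashed_a.shape[1]
            k = int(ratio * num_patches)
            idx = torch.randperm(num_patches)[:k]   # randomly selected patch positions
            mixed = smashed_a.clone()
            mixed[:, idx] = smashed_b[:, idx]       # replace them with the other client's patches
            return mixed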
    Applying Physics-Informed Enhanced Super-Resolution Generative Adversarial Networks to Turbulent Premixed Combustion and Engine-like Flame Kernel Direct Numerical Simulation Data. (arXiv:2210.16206v1 [physics.flu-dyn])
    Models for finite-rate chemistry in underresolved flows still pose one of the main challenges for predictive simulations of complex configurations. The problem gets even more challenging if turbulence is involved. This work advances the recently developed PIESRGAN modeling approach to turbulent premixed combustion. For that, the physical information processed by the network and considered in the loss function is adjusted, the training process is smoothed, and, especially, effects from density changes are considered. The resulting model provides good results for a priori and a posteriori tests on direct numerical simulation data of a fully turbulent premixed flame kernel. The limits of the modeling approach are discussed. Finally, the model is employed to compute further realizations of the premixed flame kernel, which are analyzed with a scale-sensitive framework regarding their cycle-to-cycle variations. The work shows that the data-driven PIESRGAN subfilter model can very accurately reproduce direct numerical simulation data on much coarser meshes, which is hardly possible with classical subfilter models, and enables studying statistical processes more efficiently due to the smaller computing cost.
    Reliability of CKA as a Similarity Measure in Deep Learning. (arXiv:2210.16156v1 [cs.LG])
    Comparing learned neural representations in neural networks is a challenging but important problem, which has been approached in different ways. The Centered Kernel Alignment (CKA) similarity metric, particularly its linear variant, has recently become a popular approach and has been widely used to compare representations of a network's different layers, of architecturally similar networks trained differently, or of models with different architectures trained on the same data. A wide variety of conclusions about similarity and dissimilarity of these various representations have been made using CKA. In this work we present an analysis that formally characterizes CKA sensitivity to a large class of simple transformations, which can naturally occur in the context of modern machine learning. This provides a concrete explanation of CKA sensitivity to outliers, which has been observed in past works, and to transformations that preserve the linear separability of the data, an important generalization attribute. We empirically investigate several weaknesses of the CKA similarity metric, demonstrating situations in which it gives unexpected or counter-intuitive results. Finally, we study approaches for modifying representations to maintain functional behaviour while changing the CKA value. Our results illustrate that, in many cases, the CKA value can be easily manipulated without substantial changes to the functional behaviour of the models, and call for caution when leveraging activation alignment metrics.  ( 2 min )
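    For reference, the linear CKA score discussed above can be computed directly from two activation matrices whose rows are examples; this is the standard formulation, included for concreteness.

        import numpy as np

        def linear_cka(X, Y):
            # X: (n_examples, d1), Y: (n_examples, d2) activations from two representations
            X = X - X.mean(axis=0)                       # center each feature
            Y = Y - Y.mean(axis=0)
            dot = np.linalg.norm(X.T @ Y, "fro") ** 2    # cross-covariance alignment
            norm_x = np.linalg.norm(X.T @ X, "fro")
            norm_y = np.linalg.norm(Y.T @ Y, "fro")
            return dot / (norm_x * norm_y)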
    Deep network series for large-scale high-dynamic range imaging. (arXiv:2210.16060v1 [astro-ph.IM])
    We propose a new approach for large-scale high-dynamic range computational imaging. Deep Neural Networks (DNNs) trained end-to-end can solve linear inverse imaging problems almost instantaneously. While unfolded architectures provide necessary robustness to variations of the measurement setting, embedding large-scale measurement operators in DNN architectures is impractical. Alternative Plug-and-Play (PnP) approaches, where the denoising DNNs are blind to the measurement setting, have proven effective to address scalability and high-dynamic range challenges, but rely on highly iterative algorithms. We propose a residual DNN series approach, where the reconstructed image is built as a sum of residual images progressively increasing the dynamic range, and estimated iteratively by DNNs taking the back-projected data residual of the previous iteration as input. We demonstrate on simulations for radio-astronomical imaging that a series of only a few terms provides a high-dynamic range reconstruction of similar quality to state-of-the-art PnP approaches, at a fraction of the cost.  ( 2 min )
    Nonparallel High-Quality Audio Super Resolution with Domain Adaptation and Resampling CycleGANs. (arXiv:2210.15887v1 [eess.AS])
    Neural audio super-resolution models are typically trained on low- and high-resolution audio signal pairs. Although these methods achieve highly accurate super-resolution if the acoustic characteristics of the input data are similar to those of the training data, challenges remain: the models suffer from quality degradation for out-of-domain data, and paired data are required for training. To address these problems, we propose Dual-CycleGAN, a high-quality audio super-resolution method that can utilize unpaired data based on two connected cycle-consistent generative adversarial networks (CycleGANs). Our method decomposes the super-resolution method into domain adaptation and resampling processes to handle acoustic mismatch in the unpaired low- and high-resolution signals. The two processes are then jointly optimized within the CycleGAN framework. Experimental results verify that the proposed method significantly outperforms conventional methods when paired data are not available. Code and audio samples are available from https://chomeyama.github.io/DualCycleGAN-Demo/.  ( 2 min )
    Improving Lipschitz-Constrained Neural Networks by Learning Activation Functions. (arXiv:2210.16222v1 [cs.LG])
    Lipschitz-constrained neural networks have several advantages compared to unconstrained ones and can be applied to various different problems. Consequently, they have recently attracted considerable attention in the deep learning community. Unfortunately, it has been shown both theoretically and empirically that networks with ReLU activation functions perform poorly under such constraints. On the contrary, neural networks with learnable 1-Lipschitz linear splines are known to be more expressive in theory. In this paper, we show that such networks are solutions of a functional optimization problem with second-order total-variation regularization. Further, we propose an efficient method to train such 1-Lipschitz deep spline neural networks. Our numerical experiments for a variety of tasks show that our trained networks match or outperform networks with activation functions specifically tailored towards Lipschitz-constrained architectures.
    Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform. (arXiv:2210.15975v1 [eess.AS])
    We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.  ( 2 min )
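    A minimal sketch of the iSTFT-based synthesis step, assuming the decoder emits magnitude and phase frames; the FFT size and hop length here are placeholders, and the multi-band filtering is omitted. The actual model is described in the paper and repository.

        import torch

        def istft_synthesis(mag, phase, n_fft=16, hop_length=4):
            # mag, phase: (batch, n_fft // 2 + 1, frames) predicted by light conv layers
            spec = torch.polar(mag, phase)              # complex spectrogram
            window = torch.hann_window(n_fft)
            # the expensive upsampling stack is replaced by a cheap inverse STFT
            return torch.istft(spec, n_fft=n_fft, hop_length=hop_length, window=window)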
    Efficient and Light-Weight Federated Learning via Asynchronous Distributed Dropout. (arXiv:2210.16105v1 [cs.LG])
    Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup, where slower clients can severely impede the learning process. Herein, we propose \texttt{AsyncDrop}, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings. Overall, \texttt{AsyncDrop} achieves better performance compared to state-of-the-art asynchronous methodologies, while resulting in less communication and training time overheads. The key idea revolves around creating ``submodels'' out of the global model, and distributing their training to workers, based on device heterogeneity. We rigorously justify that such an approach can be theoretically characterized. We implement our approach and compare it against other asynchronous baselines, both by design and by adapting existing synchronous FL algorithms to asynchronous scenarios. Empirically, \texttt{AsyncDrop} reduces the communication cost and training time, while matching or improving the final test accuracy in diverse non-i.i.d. FL scenarios.
    Multiresolution Signal Processing of Financial Market Objects. (arXiv:2210.15934v1 [q-fin.CP])
    Financial markets are among the most complex entities in our environment, yet mainstream quantitative models operate at a predetermined scale, rely on linear correlation measures, and struggle to recognize non-linear or causal structures. In this paper, we combine neural networks, known to capture non-linear associations, with a multiscale decomposition approach to facilitate a better understanding of financial market data substructures. Quantization keeps our decompositions calibrated to the market at every scale. We illustrate our approach in the context of a wide spectrum of applications.
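    As one concrete stand-in for the multiscale decomposition (the abstract does not commit to a specific transform, so this wavelet choice is an assumption), PyWavelets splits a price series into per-scale coefficients.

        import numpy as np
        import pywt

        prices = np.cumsum(np.random.randn(1024))        # placeholder price series
        # approximation coefficients plus detail coefficients for each scale
        coeffs = pywt.wavedec(prices, "db4", level=4)
        for level, c in enumerate(coeffs):
            print(f"scale {level}: {c.shape[0]} coefficients")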
    An Online Learning Approach for Vehicle Usage Prediction During COVID-19. (arXiv:2210.16002v1 [cs.LG])
    Today, there is an ongoing transition to more sustainable transportation, and an essential part of this transition is the switch from combustion engine vehicles to battery electric vehicles (BEVs). BEVs have many advantages from a sustainability perspective, but issues such as limited driving range and long recharge times slow down the transition from combustion engines. One way to mitigate these issues is by performing battery thermal preconditioning, which increases the energy efficiency of the battery. However, to optimally perform battery thermal preconditioning, the vehicle usage pattern needs to be known, i.e., how and when the vehicle will be used. This study attempts to predict the departure time and distance of the first drive each day using different online machine learning models. The online machine learning models are trained and evaluated on historical driving data collected from a fleet of BEVs during the COVID-19 pandemic. Additionally, the prediction models are extended to quantify the uncertainty of their predictions, which can be used as guidance on whether the prediction should be used or dismissed. We show that the best-performing prediction models yield an aggregated mean absolute error of 2.75 hours when predicting departure time and 13.37 km when predicting trip distance.
    Improving Chest X-Ray Classification by RNN-based Patient Monitoring. (arXiv:2210.16074v1 [cs.LG])
    Chest X-Ray imaging is one of the most common radiological tools for detection of various pathologies related to the chest area and lung function. In a clinical setting, automated assessment of chest radiographs has the potential of assisting physicians in their decision making process and optimizing clinical workflows, for example by prioritizing emergency patients. Most work analyzing the potential of machine learning models to classify chest X-ray images focuses on vision methods processing and predicting pathologies for one image at a time. However, many patients undergo such a procedure multiple times during the course of a treatment or during a single hospital stay. The patient history, that is, previous images and especially the corresponding diagnoses, contains useful information that can aid a classification system in its prediction. In this study, we analyze how information about diagnosis can improve CNN-based image classification models by constructing a novel dataset from the well-studied CheXpert dataset of chest X-rays. We show that a model trained on additional patient history information outperforms a model trained without the information by a significant margin. We provide code to replicate the dataset creation and model training.  ( 2 min )
    Prototype-Based Layered Federated Cross-Modal Hashing. (arXiv:2210.15678v1 [cs.LG])
    Recently, deep cross-modal hashing has gained increasing attention. However, in many practical cases, data are distributed and cannot be collected due to privacy concerns, which greatly reduces the cross-modal hashing performance on each client. Due to statistical heterogeneity, model heterogeneity, and forcing each client to accept the same parameters, applying federated learning to cross-modal hash learning becomes very tricky. In this paper, we propose a novel method called prototype-based layered federated cross-modal hashing. Specifically, a prototype is introduced to learn the similarity between instances and classes on the server, reducing the impact of statistical heterogeneity (non-IID data) on different clients. We also monitor the distance between local and global prototypes to further improve the performance. To realize personalized federated learning, a hypernetwork is deployed on the server to dynamically update different layers' weights of the local model. Experimental results on benchmark datasets show that our method outperforms state-of-the-art methods.  ( 2 min )
    Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences. (arXiv:2210.15906v1 [cs.AI])
    Generating complex behaviors from goals specified by non-expert users is a crucial aspect of intelligent agents. Interactive reward learning from trajectory comparisons is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide rich-form feedback other than binary preference labels, leading to extremely high feedback complexity and poor user experience. While providing a detailed symbolic specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill the underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which acts as a middle ground between exact goal specification and reward learning purely from preference labels, by enabling users to tweak the agent's behavior through nameable concepts (e.g., increasing the softness of the movement of a two-legged "sneaky" agent). We propose two different parametric methods that can potentially encode any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on 4 tasks with 9 different behavioral attributes and show that once the attributes are learned, end users can effortlessly produce desirable agent behaviors with only around 10 rounds of feedback. The feedback complexity of our approach is over 10 times less than that of the learning-from-human-preferences baseline, which demonstrates that our approach is readily applicable in real-world applications.  ( 3 min )
    Period VITS: Variational Inference with Explicit Pitch Modeling for End-to-end Emotional Speech Synthesis. (arXiv:2210.15964v1 [eess.AS])
    Several fully end-to-end text-to-speech (TTS) models have been proposed that have shown better performance compared to cascade models (i.e., training acoustic and vocoder models separately). However, they often generate unstable pitch contour with audible artifacts when the dataset contains emotional attributes, i.e., large diversity of pronunciation and prosody. To address this problem, we propose Period VITS, a novel end-to-end TTS model that incorporates an explicit periodicity generator. In the proposed method, we introduce a frame pitch predictor that predicts prosodic features, such as pitch and voicing flags, from the input text. From these features, the proposed periodicity generator produces a sample-level sinusoidal source that enables the waveform decoder to accurately reproduce the pitch. Finally, the entire model is jointly optimized in an end-to-end manner with variational inference and adversarial objectives. As a result, the decoder becomes capable of generating more stable, expressive, and natural output waveforms. The experimental results showed that the proposed model significantly outperforms baseline models in terms of naturalness, with improved pitch stability in the generated samples.  ( 2 min )
    NNSVS: A Neural Network-Based Singing Voice Synthesis Toolkit. (arXiv:2210.15987v1 [eess.AS])
    This paper describes the design of NNSVS, an open-source software for neural network-based singing voice synthesis research. NNSVS is inspired by Sinsy, an open-source pioneer in singing voice synthesis research, and provides many additional features such as multi-stream models, autoregressive fundamental frequency models, and neural vocoders. Furthermore, NNSVS provides extensive documentation and numerous scripts to build complete singing voice synthesis systems. Experimental results demonstrate that our best system significantly outperforms our reproduction of Sinsy and other baseline systems. The toolkit is available at https://github.com/nnsvs/nnsvs.  ( 2 min )
    Measuring the Confidence of Traffic Forecasting Models: Techniques, Experimental Comparison and Guidelines towards Their Actionability. (arXiv:2210.16049v1 [cs.LG])
    The estimation of the amount of uncertainty featured by predictive machine learning models has gained great momentum in recent years. Uncertainty estimation provides the user with augmented information about the model's confidence in its predicted outcome. Despite the inherent utility of this information for user trust, there is thin consensus around the different types of uncertainty that one can gauge in machine learning models and the suitability of different techniques for quantifying the uncertainty of a specific model. This subject is largely absent from the traffic modeling domain, even though measuring the confidence associated with traffic forecasts can significantly favor their actionability in practical traffic management systems. This work aims to cover this research gap by reviewing different techniques and metrics of uncertainty available in the literature, and by critically discussing how confidence levels computed for traffic forecasting models can help researchers and practitioners working in this area. To shed light with empirical evidence, this critical discussion is further informed by experimental results produced by different uncertainty estimation techniques over real traffic data collected in Madrid (Spain), rendering a general overview of the benefits and caveats of every technique, how they can be compared to each other, and how the measured uncertainty decreases depending on the amount, quality and diversity of data used to produce the forecasts.  ( 3 min )
    Completely Heterogeneous Federated Learning. (arXiv:2210.15865v1 [cs.LG])
    Federated learning (FL) faces three major difficulties: cross-domain settings, heterogeneous models, and non-i.i.d. label scenarios. Existing FL methods fail to handle all three constraints at the same time, and the level of privacy protection needs to be lowered (e.g., the model architecture and data category distribution can be shared). In this work, we propose the challenging "completely heterogeneous" scenario in FL, which means that each client exposes no private information, including feature space, model architecture, and label distribution. We then devise an FL framework based on parameter decoupling and data-free knowledge distillation to solve the problem. Experiments show that our proposed method achieves high performance in completely heterogeneous scenarios where other approaches fail.  ( 2 min )
    Joint Semantic Transfer Network for IoT Intrusion Detection. (arXiv:2210.15911v1 [cs.CR])
    In this paper, we propose a Joint Semantic Transfer Network (JSTN) towards effective intrusion detection for large-scale, scarcely labelled IoT domains. As a multi-source heterogeneous domain adaptation (MS-HDA) method, the JSTN integrates a knowledge-rich network intrusion (NI) domain and another small-scale IoT intrusion (II) domain as source domains, and preserves intrinsic semantic properties to assist target II domain intrusion detection. The JSTN jointly transfers the following three semantics to learn a domain-invariant and discriminative feature representation. The scenario semantic endows the source NI and II domains with characteristics from each other to ease the knowledge transfer process via a confused domain discriminator and categorical distribution knowledge preservation. It also reduces the source-target discrepancy to make the shared feature space domain-invariant. Meanwhile, the weighted implicit semantic transfer boosts discriminability via fine-grained knowledge preservation, which transfers the source categorical distribution to the target domain. The source-target divergence guides the importance weighting during knowledge preservation to reflect the degree of knowledge learning. Additionally, the hierarchical explicit semantic alignment performs centroid-level and representative-level alignment with the help of a geometric similarity-aware pseudo-label refiner, which exploits the value of the unlabelled target II domain and explicitly aligns feature representations from a global and local perspective in a concentrated manner. Comprehensive experiments on various tasks verify the superiority of the JSTN against state-of-the-art competing methods; on average, an accuracy boost of 10.3% is achieved. The statistical soundness of each constituting component and the computational efficiency are also verified.  ( 3 min )
    Towards Reliable Zero Shot Classification in Self-Supervised Models with Conformal Prediction. (arXiv:2210.15805v1 [cs.LG])
    Self-supervised models trained with a contrastive loss such as CLIP have been shown to be very powerful in zero-shot classification settings. However, to be used as a zero-shot classifier these models require the user to provide new captions over a fixed set of labels at test time. In many settings, it is hard or impossible to know if a new query caption is compatible with the source captions used to train the model. We address these limitations by framing the zero-shot classification task as an outlier detection problem and develop a conformal prediction procedure to assess when a given test caption may be reliably used. On a real-world medical example, we show that our proposed conformal procedure improves the reliability of CLIP-style models in the zero-shot classification setting, and we provide an empirical analysis of the factors that may affect its performance.  ( 2 min )
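    A minimal sketch of the conformal recipe, assuming a scalar compatibility score for captions: calibrate a threshold on held-out in-domain scores, then flag low-scoring test captions as outliers. The score definition and quantile rule here are assumptions, not the paper's exact procedure.

        import numpy as np

        def conformal_threshold(calibration_scores, alpha=0.1):
            # empirical alpha-quantile of in-domain caption scores
            n = len(calibration_scores)
            k = int(np.ceil((n + 1) * alpha)) - 1
            return np.sort(calibration_scores)[k]

        def caption_is_reliable(test_score, calibration_scores, alpha=0.1):
            # captions scoring below the calibrated threshold are treated as outliers
            return test_score >= conformal_threshold(calibration_scores, alpha)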
    Generalized Laplacian Positional Encoding for Graph Representation Learning. (arXiv:2210.15956v1 [cs.LG])
    Graph neural networks (GNNs) are the primary tool for processing graph-structured data. Unfortunately, the most commonly used GNNs, called Message Passing Neural Networks (MPNNs), suffer from several fundamental limitations. To overcome these limitations, recent works have adapted the idea of positional encodings to graph data. This paper draws inspiration from the recent success of Laplacian-based positional encoding and defines a novel family of positional encoding schemes for graphs. We accomplish this by generalizing the optimization problem that defines the Laplace embedding to more general dissimilarity functions rather than the 2-norm used in the original formulation. This family of positional encodings is then instantiated by considering p-norms. We discuss a method for calculating these positional encoding schemes, implement it in PyTorch and demonstrate how the resulting positional encoding captures different properties of the graph. Furthermore, we demonstrate that this novel family of positional encodings can improve the expressive power of MPNNs. Lastly, we present preliminary experimental results.  ( 2 min )
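    For concreteness, the standard 2-norm Laplace embedding that this family generalizes can be computed from the graph Laplacian's eigenvectors; the p-norm variants of the paper replace the underlying optimization problem.

        import numpy as np
        import scipy.sparse as sp
        from scipy.sparse.linalg import eigsh

        def laplacian_pe(adj, k=8):
            # adj: (n, n) sparse adjacency matrix; returns (n, k) positional encodings
            deg = np.asarray(adj.sum(axis=1)).ravel()
            lap = sp.diags(deg) - adj                      # combinatorial graph Laplacian
            # smallest eigenvectors give the Laplace embedding
            vals, vecs = eigsh(lap.asfptype(), k=k + 1, which="SM")
            return vecs[:, 1:]                             # drop the constant eigenvector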
    Differentiable Analog Quantum Computing for Optimization and Control. (arXiv:2210.15812v1 [quant-ph])
    We formulate the first differentiable analog quantum computing framework with a specific parameterization design at the analog signal (pulse) level to better exploit near-term quantum devices via variational methods. We further propose a scalable approach to estimate the gradients of quantum dynamics using a forward pass with Monte Carlo sampling, which leads to a quantum stochastic gradient descent algorithm for scalable gradient-based training in our framework. Applying our framework to quantum optimization and control, we observe a significant advantage of differentiable analog quantum computing against state-of-the-art methods based on parameterized digital quantum circuits, by orders of magnitude.  ( 2 min )
    You can't pick your neighbors, or can you? When and how to rely on retrieval in the $k$NN-LM. (arXiv:2210.15859v1 [cs.CL])
    Retrieval-enhanced language models (LMs), which condition their predictions on text retrieved from large external datastores, have recently shown significant perplexity improvements compared to standard LMs. One such approach, the $k$NN-LM, interpolates any existing LM's predictions with the output of a $k$-nearest neighbors model and requires no additional training. In this paper, we explore the importance of lexical and semantic matching in the context of items retrieved by the $k$NN-LM. We find two trends: (1) the presence of large overlapping $n$-grams between the datastore and evaluation set is an important factor in strong performance, even when the datastore is derived from the training data; and (2) the $k$NN-LM is most beneficial when retrieved items have high semantic similarity with the query. Based on our analysis, we define a new formulation of the $k$NN-LM that uses retrieval quality to assign the interpolation coefficient. We empirically measure the effectiveness of our approach on two English language modeling datasets, Wikitext-103 and PG-19. Our re-formulation of the $k$NN-LM is beneficial in both cases, and leads to nearly a 4% improvement in perplexity on the Wikitext-103 test set.  ( 2 min )
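    A minimal sketch of the interpolation, with an illustrative heuristic mapping neighbor distances to the coefficient; the paper's actual retrieval-quality formulation differs from this toy rule.

        import numpy as np

        def knnlm_interpolate(p_lm, p_knn, neighbor_distances, temperature=1.0):
            # p_lm, p_knn: next-token distributions from the base LM and the kNN model
            # closer neighbors (better retrieval quality) push weight toward the kNN model
            lam = np.exp(-neighbor_distances.min() / temperature)   # coefficient in [0, 1]
            return lam * p_knn + (1.0 - lam) * p_lm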
    DPVIm: Differentially Private Variational Inference Improved. (arXiv:2210.15961v1 [stat.ML])
    Differentially private (DP) release of multidimensional statistics typically considers an aggregate sensitivity, e.g. the vector norm of a high-dimensional vector. However, different dimensions of that vector might have widely different magnitudes and therefore DP perturbation disproportionately affects the signal across dimensions. We observe this problem in the gradient release of the DP-SGD algorithm when using it for variational inference (VI), where it manifests in poor convergence as well as high variance in outputs for certain variational parameters, and make the following contributions: (i) We mathematically isolate the cause for the difference in magnitudes between gradient parts corresponding to different variational parameters. Using this as prior knowledge we establish a link between the gradients of the variational parameters, and propose a simple yet efficient fix for the problem to obtain a less noisy gradient estimator, which we call $\textit{aligned}$ gradients. This approach allows us to obtain the updates for the covariance parameter of a Gaussian posterior approximation without a privacy cost. We compare this to alternative approaches for scaling the gradients using analytically derived preconditioning, e.g. natural gradients. (ii) We suggest using iterate averaging over the DP parameter traces recovered during the training, to reduce the DP-induced noise in parameter estimates at no additional cost in privacy. Finally, (iii) to accurately capture the additional uncertainty DP introduces to the model parameters, we infer the DP-induced noise from the parameter traces and include that in the learned posteriors to make them $\textit{noise aware}$. We demonstrate the efficacy of our proposed improvements through various experiments on real data.  ( 3 min )
    Feature Necessity & Relevancy in ML Classifier Explanations. (arXiv:2210.15675v1 [cs.LG])
    Given a machine learning (ML) model and a prediction, explanations can be defined as sets of features which are sufficient for the prediction. In some applications, and besides asking for an explanation, it is also critical to understand whether sensitive features can occur in some explanation, or whether a non-interesting feature must occur in all explanations. This paper starts by relating such queries respectively with the problems of relevancy and necessity in logic-based abduction. The paper then proves membership and hardness results for several families of ML classifiers. Afterwards the paper proposes concrete algorithms for two classes of classifiers. The experimental results confirm the scalability of the proposed algorithms.  ( 2 min )
    Domain Generalization through the Lens of Angular Invariance. (arXiv:2210.15836v1 [cs.LG])
    Domain generalization (DG) aims at generalizing a classifier trained on multiple source domains to an unseen target domain with domain shift. A common pervasive theme in existing DG literature is domain-invariant representation learning with various invariance assumptions. However, prior works restrict themselves to a radical assumption for real-world challenges: if a mapping induced by a deep neural network (DNN) could align the source domains well, then such a mapping aligns a target domain as well. In this paper, we simply take DNNs as feature extractors to relax the requirement of distribution alignment. Specifically, we put forward a novel angular invariance and the accompanying norm shift assumption. Based on the proposed term of invariance, we propose a novel deep DG method called Angular Invariance Domain Generalization Network (AIDGN). The optimization objective of AIDGN is developed with a von Mises-Fisher (vMF) mixture model. Extensive experiments on multiple DG benchmark datasets validate the effectiveness of the proposed AIDGN method.  ( 2 min )
    Risk-Aware Bid Optimization for Online Display Advertisement. (arXiv:2210.15837v1 [cs.LG])
    This research focuses on the bid optimization problem in the real-time bidding setting for online display advertisements, where an advertiser, or the advertiser's agent, has access to the features of the website visitor and the type of ad slots, to decide the optimal bid prices given a predetermined total advertisement budget. We propose a risk-aware data-driven bid optimization model that maximizes the expected profit for the advertiser by exploiting historical data to design upfront a bidding policy, mapping the type of advertisement opportunity to a bid price, and accounting for the risk of violating the budget constraint during a given period of time. After employing a Lagrangian relaxation, we derive a parametrized closed-form expression for the optimal bidding strategy. Using a real-world dataset, we demonstrate that our risk-averse method can effectively control the risk of overspending the budget while achieving a competitive level of profit compared with the risk-neutral model and a state-of-the-art data-driven risk-aware bidding approach.  ( 2 min )
    Subsidiary Prototype Alignment for Universal Domain Adaptation. (arXiv:2210.15909v1 [cs.CV])
    Universal Domain Adaptation (UniDA) deals with the problem of knowledge transfer between two datasets with domain-shift as well as category-shift. The goal is to categorize unlabeled target samples, either into one of the "known" categories or into a single "unknown" category. A major problem in UniDA is negative transfer, i.e. misalignment of "known" and "unknown" classes. To this end, we first uncover an intriguing tradeoff between negative-transfer-risk and domain-invariance exhibited at different layers of a deep network. It turns out we can strike a balance between these two metrics at a mid-level layer. Towards designing an effective framework based on this insight, we draw motivation from Bag-of-visual-Words (BoW). Word-prototypes in a BoW-like representation of a mid-level layer would represent lower-level visual primitives that are likely to be unaffected by the category-shift in the high-level features. We develop modifications that encourage learning of word-prototypes followed by word-histogram based classification. Following this, subsidiary prototype-space alignment (SPA) can be seen as a closed-set alignment problem, thereby avoiding negative transfer. We realize this with a novel word-histogram-related pretext task to enable closed-set SPA, operating in conjunction with goal task UniDA. We demonstrate the efficacy of our approach on top of existing UniDA techniques, yielding state-of-the-art performance across three standard UniDA and Open-Set DA object recognition benchmarks.  ( 2 min )
    One-Shot Acoustic Matching Of Audio Signals -- Learning to Hear Music In Any Room/ Concert Hall. (arXiv:2210.15750v1 [cs.SD])
    The acoustic space in which a sound is created and heard plays an essential role in how that sound is perceived by affording a unique sense of \textit{presence}. Every sound we hear results from successive convolution operations intrinsic to the sound source and external factors such as microphone characteristics and room impulse responses. Typically, researchers use an excitation such as a pistol shot or balloon pop as an impulse signal with which an auralization can be created. The room "impulse" responses convolved with the signal of interest can transform the input sound into the sound played in the acoustic space of interest. Here we propose a novel architecture that can transform a sound of interest into any other acoustic space (room or hall) of interest by using arbitrary audio recorded as a proxy for a balloon pop. The architecture is grounded in simple signal processing ideas to learn residual signals from a learned acoustic signature and the input signal. Our framework allows a neural network to adjust gains of every point in the time-frequency representation, yielding sound qualitative and quantitative results.  ( 2 min )
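    For concreteness, the underlying signal model, auralization as convolution of a dry signal with a room impulse response, looks like this; the paper's contribution is learning this transformation without access to a true impulse response.

        import numpy as np
        from scipy.signal import fftconvolve

        def auralize(dry_signal, room_impulse_response):
            # convolve the dry recording with the room's impulse response
            wet = fftconvolve(dry_signal, room_impulse_response)
            return wet / np.abs(wet).max()      # normalize to avoid clipping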
    Incorporating Interactive Facts for Stock Selection via Neural Recursive ODEs. (arXiv:2210.15925v1 [q-fin.ST])
    Stock selection attempts to rank a list of stocks to optimize investment decision making, aiming at minimizing investment risks while maximizing profit returns. Recently, researchers have developed various (recurrent) neural network-based methods to tackle this problem. Without exception, they primarily leverage historical market volatility to enhance the selection performance. However, these approaches greatly rely on discrete sampled market observations, which either fail to consider the uncertainty of stock fluctuations or to predict continuous stock dynamics in the future. Besides, some studies have considered the explicit stock interdependence derived from multiple domains (e.g., industry and shareholder). Nevertheless, the implicit cross-dependencies among different domains are under-explored. To address such limitations, we present a novel stock selection solution -- StockODE, a latent variable model with a Gaussian prior. Specifically, we devise a Movement Trend Correlation module to expose the time-varying relationships regarding stock movements. We design Neural Recursive Ordinary Differential Equation Networks (NRODEs) to capture the temporal evolution of stock volatility in a continuous dynamic manner. Moreover, we build a hierarchical hypergraph to incorporate the domain-aware dependencies among the stocks. Experiments conducted on two real-world stock market datasets demonstrate that StockODE significantly outperforms several baselines, with up to an 18.57% average improvement in Sharpe ratio.  ( 2 min )
    Leveraging Label Correlations in a Multi-label Setting: A Case Study in Emotion. (arXiv:2210.15842v1 [cs.CL])
    Detecting emotions expressed in text has become critical to a range of fields. In this work, we investigate ways to exploit label correlations in multi-label emotion recognition models to improve emotion detection. First, we develop two modeling approaches to the problem in order to capture word associations of the emotion words themselves, by either including the emotions in the input, or by leveraging Masked Language Modeling (MLM). Second, we integrate pairwise constraints of emotion representations as regularization terms alongside the classification loss of the models. We split these terms into two categories, local and global. The former dynamically change based on the gold labels, while the latter remain static during training. We demonstrate state-of-the-art performance across Spanish, English, and Arabic in SemEval 2018 Task 1 E-c using monolingual BERT-based models. On top of better performance, we also demonstrate improved robustness. Code is available at https://github.com/gchochla/Demux-MEmo.  ( 2 min )
    Towards Data-and Knowledge-Driven Artificial Intelligence: A Survey on Neuro-Symbolic Computing. (arXiv:2210.15889v1 [cs.AI])
    Neural-symbolic computing (NeSy), which pursues the integration of the symbolic and statistical paradigms of cognition, has been an active research area of Artificial Intelligence (AI) for many years. As NeSy shows promise of reconciling the advantages of reasoning and interpretability of symbolic representation and robust learning in neural networks, it may serve as a catalyst for the next generation of AI. In the present paper, we provide a systematic overview of the important and recent developments of research on NeSy AI. First, we introduce the history and background concepts of this area. Afterward, we categorize recent approaches along several main characteristics that underline this research paradigm, including neural-symbolic interrelation, neural architecture, knowledge representation, and functionality. Then, we briefly discuss the successful application of modern NeSy approaches in several domains. Finally, we identify the open problems together with potential future research directions.  ( 2 min )
    Instance-Optimal Differentially Private Estimation. (arXiv:2210.15819v1 [math.ST])
    In this work, we study local minimax convergence estimation rates subject to $\epsilon$-differential privacy. Unlike worst-case rates, which may be conservative, algorithms that are locally minimax optimal must adapt to easy instances of the problem. We construct locally minimax differentially private estimators for one-parameter exponential families and estimating the tail rate of a distribution. In these cases, we show that optimal algorithms for simple hypothesis testing, namely the recent optimal private testers of Canonne et al. (2019), directly inform the design of locally minimax estimation algorithms.  ( 2 min )
    GraphMAD: Graph Mixup for Data Augmentation using Data-Driven Convex Clustering. (arXiv:2210.15721v1 [cs.LG])
    We develop a novel data-driven nonlinear mixup mechanism for graph data augmentation and present different mixup functions for sample pairs and their labels. Mixup is a data augmentation method to create new training data by linearly interpolating between pairs of data samples and their labels. Mixup of graph data is challenging since the interpolation between graphs of potentially different sizes is an ill-posed operation. Hence, a promising approach for graph mixup is to first project the graphs onto a common latent feature space and then explore linear and nonlinear mixup strategies in this latent space. In this context, we propose to (i) project graphs onto the latent space of continuous random graph models known as graphons, (ii) leverage convex clustering in this latent space to generate nonlinear data-driven mixup functions, and (iii) investigate the use of different mixup functions for labels and data samples. We evaluate our graph data augmentation performance on benchmark datasets and demonstrate that nonlinear data-driven mixup functions can significantly improve graph classification.  ( 2 min )
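    For reference, the basic linear mixup rule that the paper lifts to graphs (via graphon projections) is shown here on generic feature vectors and one-hot labels.

        import numpy as np

        def mixup(x1, y1, x2, y2, alpha=0.2):
            lam = np.random.beta(alpha, alpha)     # mixing coefficient
            x = lam * x1 + (1.0 - lam) * x2        # interpolate samples
            y = lam * y1 + (1.0 - lam) * y2        # interpolate (one-hot) labels
            return x, y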
    Noise Injection Node Regularization for Robust Learning. (arXiv:2210.15764v1 [cs.LG])
    We introduce Noise Injection Node Regularization (NINR), a method of injecting structured noise into Deep Neural Networks (DNN) during the training stage, resulting in an emergent regularizing effect. We present theoretical and empirical evidence for substantial improvement in robustness against various test data perturbations for feed-forward DNNs when trained under NINR. The novelty in our approach comes from the interplay of adaptive noise injection and initialization conditions such that noise is the dominant driver of dynamics at the start of training. As it simply requires the addition of external nodes without altering the existing network structure or optimization algorithms, this method can be easily incorporated into many standard problem specifications. We find improved stability against a number of data perturbations, including domain shifts, with the most dramatic improvement obtained for unstructured noise, where our technique outperforms, in some cases, other existing methods such as Dropout or $L_2$ regularization. We further show that desirable generalization properties on clean data are generally maintained.  ( 2 min )
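    A minimal sketch of the noise-injection-node idea, assuming tabular inputs: extra input nodes carry noise during training and zeros at test time. The noise scale and node count are placeholder assumptions; the paper's adaptive scheme is more involved.

        import torch

        def add_noise_nodes(x, n_noise_nodes=8, scale=1.0, training=True):
            # x: (batch, features); appended nodes feed noise into the first layer
            if training:
                noise = scale * torch.randn(x.shape[0], n_noise_nodes, device=x.device)
            else:
                noise = torch.zeros(x.shape[0], n_noise_nodes, device=x.device)
            return torch.cat([x, noise], dim=1)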
    Federated Learning with Intermediate Representation Regularization. (arXiv:2210.15827v1 [cs.LG])
    In contrast to centralized model training that involves data collection, federated learning (FL) enables remote clients to collaboratively train a model without exposing their private data. However, model performance usually degrades in FL due to the heterogeneous data generated by clients of diverse characteristics. One promising strategy to maintain good performance is by limiting the local training from drifting far away from the global model. Previous studies accomplish this by regularizing the distance between the representations learned by the local and global models. However, they only consider representations from the early layers of a model or the layer preceding the output layer. In this study, we introduce FedIntR, which provides a more fine-grained regularization by integrating the representations of intermediate layers into the local training process. Specifically, FedIntR computes a regularization term that encourages the closeness between the intermediate layer representations of the local and global models. Additionally, FedIntR automatically determines the contribution of each layer's representation to the regularization term based on the similarity between local and global representations. We conduct extensive experiments on various datasets to show that FedIntR can achieve equivalent or higher performance compared to the state-of-the-art approaches.  ( 2 min )
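    A minimal sketch of the regularizer, with a simplified similarity-based weighting; the paper's exact distance and per-layer contribution scheme may differ from this illustration.

        import torch
        import torch.nn.functional as F

        def intermediate_rep_loss(local_reps, global_reps):
            # local_reps, global_reps: lists of (batch, dim) tensors, one per layer
            total = 0.0
            for h_local, h_global in zip(local_reps, global_reps):
                h_global = h_global.detach()       # the global model is not updated locally
                sim = F.cosine_similarity(h_local, h_global, dim=1).mean()
                weight = torch.sigmoid(sim)        # more similar layers contribute more
                total = total + weight * F.mse_loss(h_local, h_global)
            return total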
    Do Pre-trained Models Benefit Equally in Continual Learning?. (arXiv:2210.15701v1 [cs.CV])
    Existing work on continual learning (CL) is primarily devoted to developing algorithms for models trained from scratch. Despite their encouraging performance on contrived benchmarks, these algorithms show dramatic performance drops in real-world scenarios. Therefore, this paper advocates the systematic introduction of pre-training to CL, which is a general recipe for transferring knowledge to downstream tasks but is substantially missing in the CL community. Our investigation reveals the multifaceted complexity of exploiting pre-trained models for CL, along three different axes: pre-trained models, CL algorithms, and CL scenarios. Perhaps most intriguingly, improvements in CL algorithms from pre-training are very inconsistent: an underperforming algorithm could become competitive and even state-of-the-art when all algorithms start from a pre-trained model. This indicates that the current paradigm, where all CL methods are compared in from-scratch training, is not well reflective of the true CL objective and desired progress. In addition, we make several other important observations, including that CL algorithms that exert less regularization benefit more from a pre-trained model, and that a stronger pre-trained model such as CLIP does not guarantee a better improvement. Based on these findings, we introduce a simple yet effective baseline that employs minimum regularization and leverages the more beneficial pre-trained model, coupled with a two-stage training pipeline. We recommend including this strong baseline in the future development of CL algorithms, due to its demonstrated state-of-the-art performance.  ( 2 min )
    Poisson Reweighted Laplacian Uncertainty Sampling for Graph-based Active Learning. (arXiv:2210.15786v1 [stat.ML])
    We show that uncertainty sampling is sufficient to achieve exploration versus exploitation in graph-based active learning, as long as the measure of uncertainty properly aligns with the underlying model and the model properly reflects uncertainty in unexplored regions. In particular, we use a recently developed algorithm, Poisson ReWeighted Laplace Learning (PWLL) for the classifier and we introduce an acquisition function designed to measure uncertainty in this graph-based classifier that identifies unexplored regions of the data. We introduce a diagonal perturbation in PWLL which produces exponential localization of solutions, and controls the exploration versus exploitation tradeoff in active learning. We use the well-posed continuum limit of PWLL to rigorously analyze our method, and present experimental results on a number of graph-based image classification problems.  ( 2 min )
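    As a generic illustration of uncertainty sampling over a graph classifier's per-node scores (a stand-in for the paper's PWLL-specific acquisition function, whose exact form follows the paper):

        import numpy as np

        def smallest_margin_query(scores, labeled_mask):
            # scores: (n_nodes, n_classes); pick the unlabeled node with the smallest margin
            part = np.sort(scores, axis=1)
            margin = part[:, -1] - part[:, -2]    # top-1 minus top-2 score per node
            margin[labeled_mask] = np.inf         # never re-query labeled nodes
            return int(np.argmin(margin))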
    Can Current Explainability Help Provide References in Clinical Notes to Support Humans Annotate Medical Codes?. (arXiv:2210.15882v1 [cs.LG])
    The medical codes prediction problem from clinical notes has received substantial interest in the NLP community, and several recent studies have shown state-of-the-art (SOTA) code prediction results from full-fledged deep learning-based methods. However, most previous SOTA works based on deep learning are still in early stages in terms of providing textual references and explanations of the predicted codes, despite the fact that this level of explainability of the prediction outcomes is critical to gaining trust from professional medical coders. This raises the important question of how well current explainability methods apply to advanced neural network models such as transformers to predict correct codes and present references in clinical notes that support code prediction. First, we present an explainable Read, Attend, and Code (xRAC) framework and assess two approaches, attention score-based xRAC-ATTN and model-agnostic knowledge-distillation-based xRAC-KD, through simplified but thorough human-grounded evaluations with the SOTA transformer-based model RAC. We find that the supporting evidence text highlighted by xRAC-ATTN is of higher quality than that of xRAC-KD, whereas xRAC-KD has potential advantages in production deployment scenarios. More importantly, we show for the first time that, given the current state of explainability methodologies, using the SOTA medical codes prediction system still requires the expertise and competencies of professional coders, even though its prediction accuracy is superior to that of human coders. This, we believe, is a very meaningful step toward developing explainable and accurate machine learning systems for fully autonomous medical code prediction from clinical notes.  ( 3 min )
    FUSSL: Fuzzy Uncertain Self Supervised Learning. (arXiv:2210.15818v1 [cs.CV])
    Self supervised learning (SSL) has become a very successful technique to harness the power of unlabeled data, with no annotation effort. A number of approaches have evolved with the goal of outperforming supervised alternatives, with relative success. One main issue in SSL is the robustness of the approaches under different settings. In this paper, for the first time, we recognize the fundamental limits of SSL coming from the use of a single supervisory signal. To address this limitation, we leverage the power of uncertainty representation to devise a robust and general standard hierarchical learning/training protocol for any SSL baseline, regardless of their assumptions and approaches. Essentially, using the information bottleneck principle, we decompose feature learning into a two-stage training procedure, each with a distinct supervision signal. This double supervision approach is captured in two key steps: 1) invariance enforcement to data augmentation, and 2) fuzzy pseudo labeling (both hard and soft annotation). This simple yet effective protocol, which enables cross-class/cluster feature learning, is instantiated via an initial training of an ensemble of models through invariance enforcement to data augmentation as the first training phase, and then assigning fuzzy labels to the original samples for the second training phase. We consider multiple alternative scenarios with double supervision and evaluate the effectiveness of our approach on recent baselines, covering four different SSL paradigms, including geometrical, contrastive, non-contrastive, and hard/soft whitening (redundancy reduction) baselines. Extensive experiments under multiple settings show that the proposed training protocol consistently improves the performance of the former baselines, independent of their respective underlying principles.  ( 3 min )
    Coverage-centric Coreset Selection for High Pruning Rates. (arXiv:2210.15809v1 [cs.LG])
    One-shot coreset selection aims to select a subset of the training data, given a pruning rate, that can achieve high accuracy for models that are subsequently trained only with that subset. State-of-the-art coreset selection methods typically assign an importance score to each example and select the most important examples to form a coreset. These methods perform well at low pruning rates, but at high pruning rates they have been found to suffer a catastrophic accuracy drop, performing worse than even random coreset selection. In this paper, we explore the reasons for this accuracy drop both theoretically and empirically. We extend previous theoretical results on the bound for model loss in terms of the coverage provided by the coreset. Inspired by these theoretical results, we propose a novel coverage-based metric and, based on this metric, find that coresets selected by importance-based methods at high pruning rates can be expected to perform poorly compared to random coresets because of worse data coverage. We then propose a new coreset selection method, Coverage-centric Coreset Selection (CCS), which jointly considers overall data coverage based on the proposed metric as well as the importance of each example. We evaluate CCS on four datasets and show that it achieves significantly better accuracy than state-of-the-art coreset selection methods as well as random sampling at high pruning rates, and comparable performance at low pruning rates. For example, CCS achieves 7.04% better accuracy than random sampling and at least 20.16% better than popular importance-based selection methods on CIFAR10 with a 90% pruning rate.  ( 3 min )
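    A hedged sketch of the coverage idea (not the authors' exact CCS procedure): instead of keeping only the top-scored examples, spread the budget across importance-score strata so that all score regions stay covered. All names are illustrative.

        import numpy as np

        def coverage_select(scores, budget, n_strata=10, seed=0):
            rng = np.random.default_rng(seed)
            edges = np.quantile(scores, np.linspace(0, 1, n_strata + 1))
            strata = np.clip(np.digitize(scores, edges[1:-1]), 0, n_strata - 1)
            chosen = []
            for s in range(n_strata):                 # equal budget per stratum
                idx = np.flatnonzero(strata == s)
                take = min(budget // n_strata, len(idx))
                chosen.extend(rng.choice(idx, size=take, replace=False))
            return np.asarray(chosen)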
    Reverse Survival Model (RSM): A Pipeline for Explaining Predictions of Deep Survival Models. (arXiv:2210.15674v1 [cs.LG])
    The aim of survival analysis in healthcare is to estimate the probability of occurrence of an event, such as a patient's death in an intensive care unit (ICU). Recent developments in deep neural networks (DNNs) for survival analysis show the superiority of these models in comparison with other well-known models in survival analysis applications. Ensuring the reliability and explainability of deep survival models deployed in healthcare is a necessity. Since DNN models often behave like a black box, their predictions might not be easily trusted by clinicians, especially when predictions are contrary to a physician's opinion. A deep survival model that explains and justifies its decision-making process could potentially gain the trust of clinicians. In this research, we propose the reverse survival model (RSM) framework that provides detailed insights into the decision-making process of survival models. For each patient of interest, RSM can extract similar patients from a dataset and rank them based on the most relevant features that deep survival models rely on for their predictions.  ( 2 min )
    Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs. (arXiv:2210.15755v1 [cs.LG])
    We consider approximate dynamic programming in $\gamma$-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of Approximate Policy Iteration (API), called Confident Approximate Policy Iteration (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon $H$ and the worst-case approximation error $\epsilon$ of the action-value functions of stationary policies. This improvement over API (whose error scales with $H^2$) comes at the price of an $H$-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and $d$-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an $\tilde O(\sqrt{d}H\epsilon)$-optimal policy after issuing $\tilde O(dH^4/\epsilon^2)$ queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except $H$. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy.  ( 3 min )
    Mitigating Health Disparities in EHR via Deconfounder. (arXiv:2210.15901v1 [cs.LG])
    Health disparities, or inequalities between different patient demographics, are becoming crucial in medical decision-making, especially in Electronic Health Record (EHR) predictive modeling. To ensure the fairness of sensitive attributes, conventional studies mainly adopt calibration or re-weighting methods to balance performance among different demographic groups. However, we argue that these methods have some limitations. First, these methods usually entail a trade-off between the model's performance and fairness. Second, many methods completely attribute unfairness to the data collection process, which lacks substantial evidence. In this paper, we provide an empirical study to discover the possibility of using a deconfounder to address the disparity issue in healthcare. Our study can be summarized in two parts. The first part is a pilot study demonstrating the exacerbation of disparity when unobserved confounders exist. The second part proposes a novel framework, Parity Medical Deconfounder (PriMeD), to deal with the disparity issue in healthcare datasets. Inspired by deconfounder theory, PriMeD adopts a Conditional Variational Autoencoder (CVAE) to learn latent factors (substitute confounders) for observational data, and extensive experiments are provided to show its effectiveness.  ( 2 min )
    Knowledge-Guided Exploration in Deep Reinforcement Learning. (arXiv:2210.15670v1 [cs.LG])
    This paper proposes a new method to drastically speed up deep reinforcement learning (deep RL) training for problems that have the property of state-action permissibility (SAP). Two types of permissibility are defined under SAP. The first type says that after an action $a_t$ is performed in a state $s_t$ and the agent has reached the new state $s_{t+1}$, the agent can decide whether $a_t$ is permissible or not permissible in $s_t$. The second type says that even without performing $a_t$ in $s_t$, the agent can already decide whether $a_t$ is permissible or not in $s_t$. An action is not permissible in a state if the action can never lead to an optimal solution and thus should not be tried (over and over again). We incorporate the proposed SAP property and encode action permissibility knowledge into two state-of-the-art deep RL algorithms to guide their state-action exploration together with a virtual stopping strategy. Results show that the SAP-based guidance can markedly speed up RL training.  ( 2 min )
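    The second (pre-execution) type of permissibility can be wired directly into action selection. A minimal sketch under our own naming, assuming at least one permissible action per state:

        import numpy as np

        def select_action(q_values, permissible, eps=0.1, rng=np.random.default_rng(0)):
            allowed = np.flatnonzero(permissible)     # boolean mask from SAP knowledge
            if rng.random() < eps:                    # explore only among permissible actions
                return int(rng.choice(allowed))
            masked = np.where(permissible, q_values, -np.inf)
            return int(np.argmax(masked))             # greedy over permissible actions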
    Deepening Neural Networks Implicitly and Locally via Recurrent Attention Strategy. (arXiv:2210.15676v1 [cs.LG])
    More and more empirical and theoretical evidence shows that deepening neural networks can effectively improve their performance under suitable training settings. However, deepening the backbone of a neural network inevitably and significantly increases computation and parameter size. To mitigate these problems, we propose a simple-yet-effective Recurrent Attention Strategy (RAS), which implicitly increases the depth of neural networks with lightweight attention modules by local parameter sharing. Extensive experiments on three widely-used benchmark datasets demonstrate that RAS can improve the performance of neural networks with only a slight addition to parameter size and computation, performing favorably against other existing well-known attention modules.  ( 2 min )
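    The following PyTorch sketch illustrates the flavor of the strategy (a squeeze-and-excitation-style gate reused over several steps with shared weights); it is our illustration, not the published RAS module.

        import torch.nn as nn

        class RecurrentAttention(nn.Module):
            def __init__(self, channels, steps=3, reduction=8):
                super().__init__()
                self.steps = steps
                self.gate = nn.Sequential(
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(channels, channels // reduction), nn.ReLU(),
                    nn.Linear(channels // reduction, channels), nn.Sigmoid())

            def forward(self, x):
                for _ in range(self.steps):           # same parameters at every step
                    w = self.gate(x).unsqueeze(-1).unsqueeze(-1)
                    x = x + x * w                     # residual channel re-weighting
                return x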
    Improvement-Focused Causal Recourse (ICR). (arXiv:2210.15709v1 [stat.ML])
    Algorithmic recourse recommendations, such as Karimi et al.'s (2021) causal recourse (CR), inform stakeholders of how to act to revert unfavourable decisions. However, some actions lead to acceptance (i.e., revert the model's decision) but do not lead to improvement (i.e., may not revert the underlying real-world state). To recommend such actions is to recommend fooling the predictor. We introduce a novel method, Improvement-Focused Causal Recourse (ICR), which involves a conceptual shift: Firstly, we require ICR recommendations to guide towards improvement. Secondly, we do not tailor the recommendations to be accepted by a specific predictor. Instead, we leverage causal knowledge to design decision systems that predict accurately pre- and post-recourse. As a result, improvement guarantees translate into acceptance guarantees. We demonstrate that given correct causal knowledge, ICR, in contrast to existing approaches, guides towards both acceptance and improvement.  ( 2 min )
    Adaptive Physics-Informed Neural Operator for Coarse-Grained Non-Equilibrium Flows. (arXiv:2210.15799v1 [physics.comp-ph])
    This work proposes a new machine learning (ML)-based paradigm aiming to enhance the computational efficiency of non-equilibrium reacting flow simulations while ensuring compliance with the underlying physics. The framework combines dimensionality reduction and neural operators through a hierarchical and adaptive deep learning strategy to learn the solution of multi-scale coarse-grained governing equations for chemical kinetics. The proposed surrogate's architecture is structured as a tree, where the leaf nodes correspond to separate physics-informed deep operator networks (PI-DeepONets). The hierarchical attribute has two advantages: i) It allows the simplification of the training phase via transfer learning, starting from the slowest temporal scales; ii) It accelerates the prediction step by enabling adaptivity as the surrogate's evaluation is limited to the necessary leaf nodes based on the local degree of non-equilibrium of the gas. The model is applied to the study of chemical kinetics relevant for application to hypersonic flight, and it is tested here on a pure oxygen gas mixture. The proposed ML framework can adaptively predict the dynamics of almost thirty species with a relative error smaller than 4% for a broad range of initial conditions. This work lays the foundation for constructing an efficient ML-based surrogate coupled with reactive Navier-Stokes solvers for accurately characterizing non-equilibrium phenomena.  ( 2 min )
    Beyond Homophily with Graph Echo State Networks. (arXiv:2210.15731v1 [cs.LG])
    Graph Echo State Networks (GESN) have already demonstrated their efficacy and efficiency in graph classification tasks. However, semi-supervised node classification brought out the problem of over-smoothing in end-to-end trained deep models, which causes a bias towards high homophily graphs. We evaluate for the first time GESN on node classification tasks with different degrees of homophily, analyzing also the impact of the reservoir radius. Our experiments show that reservoir models are able to achieve better or comparable accuracy with respect to fully trained deep models that implement ad hoc variations in the architectural bias, with a gain in terms of efficiency.  ( 2 min )
  • Open

    Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. (arXiv:2203.02592v2 [stat.ML] UPDATED)
    The information bottleneck framework provides a systematic approach to learning representations that compress nuisance information in the input and extract semantically meaningful information about predictions. However, the choice of a prior distribution that fixes the dimensionality across all the data can restrict the flexibility of this approach for learning robust representations. We present a novel sparsity-inducing spike-and-slab categorical prior that uses sparsity as a mechanism to provide the flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism for learning a joint distribution of the latent variable and the sparsity, and hence can account for the complete uncertainty in the latent space. Through a series of experiments using in-distribution and out-of-distribution learning scenarios on the MNIST, CIFAR-10, and ImageNet data, we show that the proposed approach improves accuracy and robustness compared to traditional fixed-dimensional priors, as well as other sparsity-induction mechanisms for latent variable models proposed in the literature.
    Your Transformer May Not be as Powerful as You Expect. (arXiv:2205.13401v2 [cs.LG] UPDATED)
    Relative Positional Encoding (RPE), which encodes the relative distance between any pair of tokens, is one of the most successful modifications to the original Transformer. As far as we know, theoretical understanding of the RPE-based Transformers is largely unexplored. In this work, we mathematically analyze the power of RPE-based Transformers regarding whether the model is capable of approximating any continuous sequence-to-sequence functions. One may naturally assume the answer is in the affirmative -- RPE-based Transformers are universal function approximators. However, we present a negative result by showing there exist continuous sequence-to-sequence functions that RPE-based Transformers cannot approximate no matter how deep and wide the neural network is. One key reason lies in that most RPEs are placed in the softmax attention that always generates a right stochastic matrix. This restricts the network from capturing positional information in the RPEs and limits its capacity. To overcome the problem and make the model more powerful, we first present sufficient conditions for RPE-based Transformers to achieve universal function approximation. With the theoretical guidance, we develop a novel attention module, called Universal RPE-based (URPE) Attention, which satisfies the conditions. Therefore, the corresponding URPE-based Transformers become universal function approximators. Extensive experiments covering typical architectures and tasks demonstrate that our model is parameter-efficient and can achieve superior performance to strong baselines in a wide range of applications. The code will be made publicly available at https://github.com/lsj2408/URPE.  ( 3 min )
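    The row-stochasticity point is easy to verify numerically: with the RPE added inside the softmax, every attention row sums to one regardless of the bias, whereas a multiplicative positional matrix (the URPE idea, shown here only schematically, with a random matrix standing in for the learned one) escapes that constraint.

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 5, 8
        Q, K = rng.normal(size=(n, d)), rng.normal(size=(n, d))
        rpe = rng.normal(size=(n, n))                 # additive relative-position bias

        logits = Q @ K.T / np.sqrt(d) + rpe
        attn = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
        print(attn.sum(axis=1))                       # all ones: a right stochastic matrix

        C = rng.normal(size=(n, n))                   # stand-in for a learned multiplicative matrix
        print((attn * C).sum(axis=1))                 # rows no longer constrained to sum to one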
    Generalizing Clinical Trials with Convex Hulls. (arXiv:2111.13229v2 [stat.ML] UPDATED)
    Randomized clinical trials eliminate confounding but impose strict exclusion criteria that limit recruitment to a subset of the population. Observational datasets are more inclusive but suffer from confounding -- often providing overly optimistic estimates of treatment response over time due to partially optimized physician prescribing patterns. We therefore assume that the unconfounded treatment response lies somewhere in-between the observational estimate before and the observational estimate after treatment assignment. This assumption allows us to extrapolate results from exclusive trials to the broader population by analyzing observational and trial data simultaneously using an algorithm called Optimum in Convex Hulls (OCH). OCH represents the treatment effect either in terms of convex hulls of conditional expectations or convex hulls (also known as mixtures) of conditional densities. The algorithm first learns the component expectations or densities using the observational data and then learns the linear mixing coefficients using trial data in order to approximate the true treatment effect; theory importantly explains why this linear combination should hold. OCH estimates the treatment effect in terms of both expectations and densities with state-of-the-art accuracy.  ( 2 min )
    An Accurate, Scalable and Verifiable Protocol for Federated Differentially Private Averaging. (arXiv:2006.07218v3 [cs.CR] UPDATED)
    Learning from data owned by several parties, as in federated learning, raises challenges regarding the privacy guarantees provided to participants and the correctness of the computation in the presence of malicious parties. We tackle these challenges in the context of distributed averaging, an essential building block of federated learning algorithms. Our first contribution is a scalable protocol in which participants exchange correlated Gaussian noise along the edges of a network graph, complemented by independent noise added by each party. We analyze the differential privacy guarantees of our protocol and the impact of the graph topology under colluding malicious parties, showing that we can nearly match the utility of the trusted curator model even when each honest party communicates with only a logarithmic number of other parties chosen at random. This is in contrast with protocols in the local model of privacy (with lower utility) or based on secure aggregation (where all pairs of users need to exchange messages). Our second contribution enables users to prove the correctness of their computations without compromising the efficiency and privacy guarantees of the protocol. Our verification protocol relies on standard cryptographic primitives like commitment schemes and zero knowledge proofs.  ( 3 min )
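    A toy numerical illustration of the correlated-noise idea (scales arbitrary, no claim about the paper's calibrated privacy parameters): pairwise noise shared along graph edges cancels exactly in the average, so only each party's small independent noise affects utility.

        import numpy as np

        rng = np.random.default_rng(0)
        n = 8
        x = rng.normal(size=n)                            # each party's private value
        pairs = [(i, j) for i in range(n)
                 for j in range(i + 1, n) if rng.random() < 0.4]

        noisy = x + rng.normal(scale=0.05, size=n)        # small independent noise
        for i, j in pairs:                                # large correlated noise, cancels in the sum
            eta = rng.normal(scale=5.0)
            noisy[i] += eta
            noisy[j] -= eta

        print(abs(x.mean() - noisy.mean()))               # error driven only by the small term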
    Towards solving model bias in cosmic shear forward modeling. (arXiv:2210.16243v1 [astro-ph.CO])
    As the volume and quality of modern galaxy surveys increase, so does the difficulty of measuring the cosmological signal imprinted in galaxy shapes. Weak gravitational lensing sourced by the most massive structures in the Universe generates a slight shearing of galaxy morphologies called cosmic shear, a key probe for cosmological models. Modern techniques of shear estimation based on statistics of ellipticity measurements suffer from the fact that the ellipticity is not a well-defined quantity for arbitrary galaxy light profiles, biasing the shear estimation. We show that a hybrid physical and deep learning Hierarchical Bayesian Model, where a generative model captures the galaxy morphology, enables us to recover an unbiased estimate of the shear on realistic galaxies, thus solving this model bias.  ( 2 min )
    A Functional-Space Mean-Field Theory of Partially-Trained Three-Layer Neural Networks. (arXiv:2210.16286v1 [cs.LG])
    To understand the training dynamics of neural networks (NNs), prior studies have considered the infinite-width mean-field (MF) limit of two-layer NN, establishing theoretical guarantees of its convergence under gradient flow training as well as its approximation and generalization capabilities. In this work, we study the infinite-width limit of a type of three-layer NN model whose first layer is random and fixed. To define the limiting model rigorously, we generalize the MF theory of two-layer NNs by treating the neurons as belonging to functional spaces. Then, by writing the MF training dynamics as a kernel gradient flow with a time-varying kernel that remains positive-definite, we prove that its training loss in $L_2$ regression decays to zero at a linear rate. Furthermore, we define function spaces that include the solutions obtainable through the MF training dynamics and prove Rademacher complexity bounds for these spaces. Our theory accommodates different scaling choices of the model, resulting in two regimes of the MF limit that demonstrate distinctive behaviors while both exhibiting feature learning.  ( 2 min )
    Deep Learning-Based Anomaly Detection in Synthetic Aperture Radar Imaging. (arXiv:2210.16038v1 [cs.CV])
    In this paper, we propose to investigate unsupervised anomaly detection in Synthetic Aperture Radar (SAR) images. Our approach considers anomalies as abnormal patterns that deviate from their surroundings but without any prior knowledge of their characteristics. In the literature, most model-based algorithms face three main issues. First, the speckle noise corrupts the image and potentially leads to numerous false detections. Second, statistical approaches may exhibit deficiencies in modeling spatial correlation in SAR images. Finally, neural networks based on supervised learning approaches are not recommended due to the lack of annotated SAR data, notably for the class of abnormal patterns. Our proposed method aims to address these issues through a self-supervised algorithm. The speckle is first removed through the deep learning SAR2SAR algorithm. Then, an adversarial autoencoder is trained to reconstruct an anomaly-free SAR image. Finally, a change detection processing step is applied between the input and the output to detect anomalies. Experiments are performed to show the advantages of our method compared to the conventional Reed-Xiaoli algorithm, highlighting the importance of an efficient despeckling pre-processing step.  ( 2 min )
    Distributional regression and its evaluation with the CRPS: bounds and convergence of the minimax risk. (arXiv:2205.04360v2 [math.ST] UPDATED)
    The theoretical advances on the properties of scoring rules over the past decades have broadened the use of scoring rules in probabilistic forecasting. In meteorological forecasting, statistical postprocessing techniques are essential to improve the forecasts made by deterministic physical models. Numerous state-of-the-art statistical postprocessing techniques are based on distributional regression evaluated with the Continuous Ranked Probability Score (CRPS). However, theoretical analyses of such evaluation with the CRPS have solely considered the unconditional framework (i.e., without covariates) and infinite sample sizes. We extend these results and study the rate of convergence in terms of CRPS of distributional regression methods. We find the optimal minimax rate of convergence for a given class of distributions and show that the k-nearest neighbor method and the kernel method reach this optimal minimax rate.  ( 2 min )
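    For reference, the CRPS of a forecast distribution $F$ at outcome $y$ is $\mathrm{CRPS}(F, y) = \int (F(x) - \mathbf{1}\{y \le x\})^2 \, dx$, which for an ensemble admits the standard energy form estimated below (a generic sketch, independent of the paper's theory):

        import numpy as np

        def crps_ensemble(samples, y):
            # CRPS(F, y) = E|X - y| - 0.5 * E|X - X'|, with X, X' ~ F independent
            samples = np.asarray(samples, dtype=float)
            term1 = np.abs(samples - y).mean()
            term2 = np.abs(samples[:, None] - samples[None, :]).mean()
            return term1 - 0.5 * term2

        print(crps_ensemble(np.random.default_rng(0).normal(size=1000), 0.3))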
    DPVIm: Differentially Private Variational Inference Improved. (arXiv:2210.15961v1 [stat.ML])
    Differentially private (DP) release of multidimensional statistics typically considers an aggregate sensitivity, e.g. the vector norm of a high-dimensional vector. However, different dimensions of that vector might have widely different magnitudes and therefore DP perturbation disproportionately affects the signal across dimensions. We observe this problem in the gradient release of the DP-SGD algorithm when using it for variational inference (VI), where it manifests as poor convergence as well as high variance in outputs for certain variational parameters, and make the following contributions: (i) We mathematically isolate the cause of the difference in magnitudes between gradient parts corresponding to different variational parameters. Using this as prior knowledge we establish a link between the gradients of the variational parameters, and propose an efficient yet simple fix for the problem to obtain a less noisy gradient estimator, which we call $\textit{aligned}$ gradients. This approach allows us to obtain the updates for the covariance parameter of a Gaussian posterior approximation without a privacy cost. We compare this to alternative approaches for scaling the gradients using analytically derived preconditioning, e.g. natural gradients. (ii) We suggest using iterate averaging over the DP parameter traces recovered during training to reduce the DP-induced noise in parameter estimates at no additional cost in privacy. Finally, (iii) to accurately capture the additional uncertainty DP introduces to the model parameters, we infer the DP-induced noise from the parameter traces and include it in the learned posteriors to make them $\textit{noise aware}$. We demonstrate the efficacy of our proposed improvements through various experiments on real data.  ( 3 min )
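    Contribution (ii), iterate averaging, is straightforward to apply after the fact; a minimal sketch (the burn-in length is a user choice, names ours):

        import numpy as np

        def iterate_average(param_trace, burn_in):
            # averaging already-released DP-SGD iterates costs no extra privacy
            trace = np.asarray(param_trace)
            return trace[burn_in:].mean(axis=0)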
    Do ideas have shape? Idea registration as the continuous limit of artificial neural networks. (arXiv:2008.03920v3 [stat.ML] UPDATED)
    We introduce a GP generalization of ResNets (including ResNets as a particular case). We show that ResNets (and their GP generalization) converge, in the infinite depth limit, to a generalization of image registration variational algorithms. Whereas computational anatomy aligns images via warping of the material space, this generalization aligns ideas (or abstract shapes as in Plato's theory of forms) via the warping of the RKHS of functions mapping the input space to the output space. While the Hamiltonian interpretation of ResNets is not new, it was based on an Ansatz. We do not rely on this Ansatz and present the first rigorous proof of convergence of ResNets with trained weights and biases towards a Hamiltonian dynamics driven flow. Our constructive proof reveals several remarkable properties of ResNets and their GP generalization. ResNet regressors are kernel regressors with data-dependent warping kernels. Minimizers of $L^2$ regularized ResNets satisfy a discrete least action principle implying the near preservation of the norm of weights and biases across layers. The trained weights of ResNets with $L^2$ regularization can be identified by solving an autonomous Hamiltonian system. The trained ResNet parameters are unique up to the initial momentum, whose representation is generally sparse. The kernel regularization strategy provides a provably robust alternative to Dropout for ANNs. We introduce a functional generalization of GPs leading to error estimates for ResNets. We identify the (EPDiff) mean-field limit of trained ResNet parameters. We show that the composition of warping regression blocks with reduced equivariant multichannel kernels (introduced here) recovers and generalizes CNNs to arbitrary spaces and groups of transformations.  ( 3 min )
    Provably Training Overparameterized Neural Network Classifiers with Non-convex Constraints. (arXiv:2012.15274v2 [stat.ML] UPDATED)
    Training a classifier under non-convex constraints has received increasing attention in the machine learning community thanks to its wide range of applications, such as algorithmic fairness and class-imbalanced classification. However, several recent works addressing non-convex constraints have only focused on simple models such as logistic regression or support vector machines. Neural networks, one of the most popular models for classification nowadays, are precluded and lack theoretical guarantees. In this work, we show that overparameterized neural networks can achieve a near-optimal and near-feasible solution of non-convex constrained optimization problems via projected stochastic gradient descent. Our key ingredient is the no-regret analysis of online learning for neural networks in the overparameterization regime, which may be of independent interest in online learning applications.  ( 2 min )
    Nonparametric Uncertainty Quantification for Single Deterministic Neural Network. (arXiv:2202.03101v2 [stat.ML] UPDATED)
    This paper proposes a fast and scalable method for uncertainty quantification of machine learning models' predictions. First, we show a principled way to measure the uncertainty of predictions for a classifier based on Nadaraya-Watson's nonparametric estimate of the conditional label distribution. Importantly, the proposed approach makes it possible to explicitly disentangle aleatoric and epistemic uncertainties. The resulting method works directly in the feature space. However, one can apply it to any neural network by considering an embedding of the data induced by the network. We demonstrate the strong performance of the method in uncertainty estimation tasks on text classification problems and a variety of real-world image datasets, such as MNIST, SVHN, CIFAR-100 and several versions of ImageNet.  ( 2 min )
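    A bare-bones sketch of the Nadaraya-Watson label-distribution estimate in an embedding space (the Gaussian kernel and bandwidth are our choices; the paper's aleatoric/epistemic decomposition is not reproduced here):

        import numpy as np

        def nw_class_probs(x, X, Y, n_classes, h=1.0):
            # X: (N, d) embeddings, Y: (N,) integer labels, x: (d,) query point
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * h ** 2))
            probs = np.array([w[Y == c].sum() for c in range(n_classes)])
            return probs / probs.sum()                # its entropy is an uncertainty score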
    Toward Equation of Motion for Deep Neural Networks: Continuous-time Gradient Descent and Discretization Error Analysis. (arXiv:2210.15898v1 [cs.LG])
    We derive and solve an "Equation of Motion" (EoM) for deep neural networks (DNNs), a differential equation that precisely describes the discrete learning dynamics of DNNs. Differential equations are continuous but have played a prominent role even in the study of discrete optimization (gradient descent (GD) algorithms). However, there still exist gaps between differential equations and the actual learning dynamics of DNNs due to discretization error. In this paper, we start from gradient flow (GF) and derive a counter term that cancels the discretization error between GF and GD. As a result, we obtain EoM, a continuous differential equation that precisely describes the discrete learning dynamics of GD. We also derive the discretization error to show to what extent EoM is precise. In addition, we apply EoM to two specific cases: scale- and translation-invariant layers. EoM highlights differences between continuous-time and discrete-time GD, indicating the importance of the counter term for a better description of the discrete learning dynamics of GD. Our experimental results support our theoretical findings.  ( 2 min )
    Universalization of any adversarial attack using very few test examples. (arXiv:2005.08632v2 [cs.LG] UPDATED)
    Deep learning models are known to be vulnerable not only to input-dependent adversarial attacks but also to input-agnostic or universal adversarial attacks. Dezfooli et al. [Dezfooli17, Dezfooli17anal] construct a universal adversarial attack on a given model by looking at a large number of training data points and the geometry of the decision boundary near them. Subsequent work [Khrulkov18] constructs a universal attack by looking only at test examples and intermediate layers of the given model. In this paper, we propose a simple universalization technique to take any input-dependent adversarial attack and construct a universal attack by only looking at very few adversarial test examples. We do not require details of the given model and have negligible computational overhead for universalization. We theoretically justify our universalization technique by a spectral property common to many input-dependent adversarial perturbations, e.g., gradients, Fast Gradient Sign Method (FGSM) and DeepFool. Using matrix concentration inequalities and spectral perturbation bounds, we show that the top singular vector of input-dependent adversarial directions on a small test sample gives an effective and simple universal adversarial attack. For VGG16 and VGG19 models trained on ImageNet, our simple universalization of Gradient, FGSM, and DeepFool perturbations using a test sample of 64 images gives fooling rates comparable to state-of-the-art universal attacks [Dezfooli17, Khrulkov18] for reasonable norms of perturbation. Code available at: https://github.com/ksandeshk/svd-uap  ( 3 min )
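    The core computation is compact: stack a handful of per-example perturbations and take the top right singular vector. A generic sketch (the scaling to a norm budget and the attack loop are omitted):

        import numpy as np

        def universal_direction(perturbations):
            # rows: flattened per-example adversarial perturbations (e.g., FGSM signs)
            P = np.stack([p.ravel() for p in perturbations])
            _, _, Vt = np.linalg.svd(P, full_matrices=False)
            return Vt[0]                              # unit-norm direction (sign is ambiguous)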
    On the Exactness of Dantzig-Wolfe Relaxation for Rank Constrained Optimization Problems. (arXiv:2210.16191v1 [math.OC])
    This paper studies the rank constrained optimization problem (RCOP), which aims to minimize a linear objective function over the intersection of a prespecified closed rank-constrained domain set and $m$ two-sided linear constraints. Replacing the domain set by its closed convex hull offers a convex Dantzig-Wolfe Relaxation (DWR) of the RCOP. Our goal is to characterize necessary and sufficient conditions under which the DWR and RCOP are equivalent in the sense of extreme points, convex hulls, and objective values. More precisely, we develop the first-known necessary and sufficient conditions for when the DWR feasible set matches that of RCOP for any $m$ linear constraints, from two perspectives: (i) extreme point exactness -- all extreme points in the DWR feasible set belong to that of the RCOP; and (ii) convex hull exactness -- the DWR feasible set is identical to the closed convex hull of the RCOP feasible set. From the optimization view, we also investigate (iii) objective exactness -- the optimal values of the DWR and RCOP coincide for any $m$ linear constraints and a family of linear objective functions. We derive the first-known necessary and sufficient conditions for objective exactness when the DWR admits four favorable classes of linear objective functions, respectively. From the primal perspective, this paper presents how our proposed conditions refine and extend the existing exactness results in the quadratically constrained quadratic program (QCQP) and fair unsupervised learning.  ( 2 min )
    Review on Classification Techniques used in Biophysiological Stress Monitoring. (arXiv:2210.16040v1 [cs.LG])
    Cardiovascular activities are directly related to the response of a body in a stressed condition. Stress, based on its intensity, can be divided into two types, i.e., acute stress (short-term stress) and chronic stress (long-term stress). Repeated acute stress and continuous chronic stress may play a vital role in inflammation in the circulatory system and can thus lead to a heart attack or stroke. In this study, we have reviewed commonly used machine learning classification techniques applied to different stress-indicating parameters used in stress monitoring devices. These parameters include the Photoplethysmograph (PPG), Electrocardiograph (ECG), Electromyograph (EMG), Galvanic Skin Response (GSR), Heart Rate Variability (HRV), skin temperature, respiratory rate, Electroencephalograph (EEG) and salivary cortisol. This study also provides a discussion on choosing a classifier, which depends upon a number of factors other than accuracy, like the number of subjects involved in an experiment, the type of signal processing, and computational limitations.  ( 2 min )
    Preferential Subsampling for Stochastic Gradient Langevin Dynamics. (arXiv:2210.16189v1 [stat.ML])
    Stochastic gradient MCMC (SGMCMC) offers a scalable alternative to traditional MCMC, by constructing an unbiased estimate of the gradient of the log-posterior with a small, uniformly-weighted subsample of the data. While efficient to compute, the resulting gradient estimator may exhibit a high variance and impact sampler performance. The problem of variance control has been traditionally addressed by constructing a better stochastic gradient estimator, often using control variates. We propose to use a discrete, non-uniform probability distribution to preferentially subsample data points that have a greater impact on the stochastic gradient. In addition, we present a method of adaptively adjusting the subsample size at each iteration of the algorithm, so that we increase the subsample size in areas of the sample space where the gradient is harder to estimate. We demonstrate that such an approach can maintain the same level of accuracy while substantially reducing the average subsample size that is used.  ( 2 min )
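    The reweighting that keeps the non-uniform estimator unbiased is the familiar importance-sampling correction; a minimal sketch with our own names, where grad_fn(i) returns the gradient contribution of data point i:

        import numpy as np

        def preferential_gradient(grad_fn, probs, m, rng=np.random.default_rng(0)):
            # E[grad_fn(i) / (N * probs[i])] equals the mean per-datum gradient
            N = len(probs)
            idx = rng.choice(N, size=m, p=probs)
            return sum(grad_fn(i) / (N * probs[i]) for i in idx) / m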
    Generalised Gaussian Process Latent Variable Models (GPLVM) with Stochastic Variational Inference. (arXiv:2202.12979v2 [cs.LG] CROSS LISTED)
    Gaussian process latent variable models (GPLVM) are a flexible and non-linear approach to dimensionality reduction, extending classical Gaussian processes to an unsupervised learning context. The Bayesian incarnation of the GPLVM [Titsias and Lawrence, 2010] uses a variational framework, where the posterior over latent variables is approximated by a well-behaved variational family, a factorized Gaussian yielding a tractable lower bound. However, the non-factorizability of the lower bound prevents truly scalable inference. In this work, we study the doubly stochastic formulation of the Bayesian GPLVM model amenable to minibatch training. We show how this framework is compatible with different latent variable formulations and perform experiments to compare a suite of models. Further, we demonstrate how we can train in the presence of massively missing data and obtain high-fidelity reconstructions. We demonstrate the model's performance by benchmarking against the canonical sparse GPLVM for high-dimensional data examples.  ( 2 min )
    Fairness Certificates for Differentially Private Classification. (arXiv:2210.16242v1 [cs.LG])
    In this work, we theoretically study the impact of differential privacy on fairness in binary classification. We prove that, given a class of models, popular group fairness measures are pointwise Lipschitz-continuous with respect to the parameters of the model. This result is a consequence of a more general statement on the probability that a decision function makes a negative prediction conditioned on an arbitrary event (such as membership to a sensitive group), which may be of independent interest. We use the aforementioned Lipschitz property to prove a high probability bound showing that, given enough examples, the fairness level of private models is close to the one of their non-private counterparts.  ( 2 min )
    Measuring the Confidence of Traffic Forecasting Models: Techniques, Experimental Comparison and Guidelines towards Their Actionability. (arXiv:2210.16049v1 [cs.LG])
    The estimation of the amount of uncertainty featured by predictive machine learning models has gained great momentum in recent years. Uncertainty estimation provides the user with augmented information about the model's confidence in its predicted outcome. Despite the inherent utility of this information for user trust, there is thin consensus around the different types of uncertainty that one can gauge in machine learning models and the suitability of different techniques for quantifying the uncertainty of a specific model. This subject is mostly absent from the traffic modeling domain, even though measuring the confidence associated with traffic forecasts can significantly favor their actionability in practical traffic management systems. This work aims to cover this research gap by reviewing different techniques and metrics of uncertainty available in the literature, and by critically discussing how confidence levels computed for traffic forecasting models can be helpful for researchers and practitioners working in this research area. To shed light with empirical evidence, this critical discussion is further informed by experimental results produced by different uncertainty estimation techniques over real traffic data collected in Madrid (Spain), rendering a general overview of the benefits and caveats of every technique, how they can be compared to each other, and how the measured uncertainty decreases depending on the amount, quality and diversity of data used to produce the forecasts.  ( 3 min )
    Near-Optimal Collaborative Learning in Bandits. (arXiv:2206.00121v2 [cs.LG] UPDATED)
    This paper introduces a general multi-agent bandit model in which each agent is facing a finite set of arms and may communicate with other agents through a central controller in order to identify, in pure exploration, or play, in regret minimization, its optimal arm. The twist is that the optimal arm for each agent is the arm with largest expected mixed reward, where the mixed reward of an arm is a weighted sum of the rewards of this arm for all agents. This makes communication between agents often necessary. This general setting allows to recover and extend several recent models for collaborative bandit learning, including the recently proposed federated learning with personalization (Shi et al., 2021). In this paper, we provide new lower bounds on the sample complexity of pure exploration and on the regret. We then propose a near-optimal algorithm for pure exploration. This algorithm is based on phased elimination with two novel ingredients: a data-dependent sampling scheme within each phase, aimed at matching a relaxation of the lower bound.  ( 2 min )
    Disentangling Visual Embeddings with Minimal Distributional Assumptions. (arXiv:2206.13872v2 [stat.ML] UPDATED)
    Interest in understanding and factorizing embedding spaces learned by deep encoders is growing. Concept discovery methods search the embedding spaces for interpretable latent components like object shape or color and disentangle them into individual axes in the embedding space. Yet, the applicability of modern disentanglement learning techniques or independent component analysis (ICA) is limited when it comes to vision tasks: They either require training a model of the complex image-generating process or their rigid stochastic independence assumptions on the component distribution are violated in practice. In this work, we identify components in encoder embedding spaces without distributional assumptions and without training a generator. Instead, we utilize functional compositionality properties of image-generating processes. We derive two novel post-hoc component discovery methods and prove theoretical identifiability guarantees. We study them in realistic visual disentanglement tasks with correlated components and violated functional assumptions. Our approaches stably maintain superior performance against 300+ state-of-the-art disentanglement and component analysis models.  ( 2 min )
    Nonparametric Probabilistic Regression with Coarse Learners. (arXiv:2210.16247v1 [cs.LG])
    Probabilistic Regression refers to predicting a full probability density function for the target conditional on the features. We present a nonparametric approach to this problem which combines base classifiers (typically gradient boosted forests) trained on different coarsenings of the target value. By combining such classifiers and averaging the resulting densities, we are able to compute precise conditional densities with minimal assumptions on the shape or form of the density. We combine this approach with a structured cross-entropy loss function which serves to regularize and smooth the resulting densities. Prediction intervals computed from these densities are shown to have high fidelity in practice. Furthermore, examining the properties of these densities on particular observations can provide valuable insight. We demonstrate this approach on a variety of datasets and show competitive performance, particularly on larger datasets.  ( 2 min )
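    One way to see the mechanics (the classifiers themselves, e.g. gradient-boosted forests, are assumed already trained; the names and the piecewise-constant convention are ours): each coarsening's predicted bin probabilities define a density, and the densities are averaged.

        import numpy as np

        def combine_coarse_densities(bin_edges_list, bin_probs_list, grid):
            densities = []
            for edges, probs in zip(bin_edges_list, bin_probs_list):
                widths = np.diff(edges)               # probs[k] spread uniformly over bin k
                idx = np.clip(np.searchsorted(edges, grid, side="right") - 1,
                              0, len(widths) - 1)
                densities.append(probs[idx] / widths[idx])
            return np.mean(densities, axis=0)         # averaged conditional density on grid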
    Benchopt: Reproducible, efficient and collaborative optimization benchmarks. (arXiv:2206.13424v3 [cs.LG] UPDATED)
    Numerical validation is at the core of machine learning research, as it allows researchers to assess the actual impact of new methods and to confirm the agreement between theory and practice. Yet, the rapid development of the field poses several challenges: researchers are confronted with a profusion of methods to compare, limited transparency and consensus on best practices, as well as tedious re-implementation work. As a result, validation is often very partial, which can lead to wrong conclusions that slow down the progress of research. We propose Benchopt, a collaborative framework to automate, reproduce and publish optimization benchmarks in machine learning across programming languages and hardware architectures. Benchopt simplifies benchmarking for the community by providing an off-the-shelf tool for running, sharing and extending experiments. To demonstrate its broad usability, we showcase benchmarks on three standard learning tasks: $\ell_2$-regularized logistic regression, Lasso, and ResNet18 training for image classification. These benchmarks highlight key practical findings that give a more nuanced view of the state-of-the-art for these problems, showing that for practical evaluation, the devil is in the details. We hope that Benchopt will foster collaborative work in the community, hence improving the reproducibility of research findings.  ( 3 min )
    Noise Injection Node Regularization for Robust Learning. (arXiv:2210.15764v1 [cs.LG])
    We introduce Noise Injection Node Regularization (NINR), a method of injecting structured noise into Deep Neural Networks (DNN) during the training stage, resulting in an emergent regularizing effect. We present theoretical and empirical evidence for substantial improvement in robustness against various test data perturbations for feed-forward DNNs when trained under NINR. The novelty in our approach comes from the interplay of adaptive noise injection and initialization conditions such that noise is the dominant driver of dynamics at the start of training. As it simply requires the addition of external nodes without altering the existing network structure or optimization algorithms, this method can be easily incorporated into many standard problem specifications. We find improved stability against a number of data perturbations, including domain shifts, with the most dramatic improvement obtained for unstructured noise, where our technique outperforms other existing methods such as Dropout or $L_2$ regularization, in some cases. We further show that desirable generalization properties on clean data are generally maintained.  ( 2 min )
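    Because NINR only adds input nodes, it composes with an otherwise unmodified network; a hedged PyTorch sketch (the noise scale and node count are placeholders, and the paper's adaptive scheduling is not shown):

        import torch
        import torch.nn as nn

        class NoiseInjectionInput(nn.Module):
            def __init__(self, n_noise, sigma=1.0):
                super().__init__()
                self.n_noise, self.sigma = n_noise, sigma

            def forward(self, x):                     # fresh noise every forward pass
                noise = self.sigma * torch.randn(x.shape[0], self.n_noise, device=x.device)
                return torch.cat([x, noise], dim=1)

        # usage: nn.Sequential(NoiseInjectionInput(16), nn.Linear(784 + 16, 10))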
    Provable Sim-to-real Transfer in Continuous Domain with Partial Observations. (arXiv:2210.15598v1 [cs.LG] CROSS LISTED)
    Sim-to-real transfer trains RL agents in the simulated environments and then deploys them in the real world. Sim-to-real transfer has been widely used in practice because it is often cheaper, safer and much faster to collect samples in simulation than in the real world. Despite the empirical success of the sim-to-real transfer, its theoretical foundation is much less understood. In this paper, we study the sim-to-real transfer in continuous domain with partial observations, where the simulated environments and real-world environments are modeled by linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive to the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.  ( 2 min )
    Deep Kernel Learning of Dynamical Models from High-Dimensional Noisy Data. (arXiv:2208.12975v2 [cs.LG] UPDATED)
    This work proposes a Stochastic Variational Deep Kernel Learning method for the data-driven discovery of low-dimensional dynamical models from high-dimensional noisy data. The framework is composed of an encoder that compresses high-dimensional measurements into low-dimensional state variables, and a latent dynamical model for the state variables that predicts the system evolution over time. The training of the proposed model is carried out in an unsupervised manner, i.e., not relying on labeled data. Our learning method is evaluated on the motion of a pendulum -- a well studied baseline for nonlinear model identification and control with continuous states and control inputs -- measured via high-dimensional noisy RGB images. Results show that the method can effectively denoise measurements, learn compact state representations and latent dynamical models, as well as identify and quantify modeling uncertainties.  ( 2 min )
    Universal Adversarial Directions. (arXiv:2210.15997v1 [cs.LG])
    Despite their great success in image recognition tasks, deep neural networks (DNNs) have been observed to be susceptible to universal adversarial perturbations (UAPs) which perturb all input samples with a single perturbation vector. However, UAPs often struggle in transferring across DNN architectures and lead to challenging optimization problems. In this work, we study the transferability of UAPs by analyzing equilibrium in the universal adversarial example game between the classifier and UAP adversary players. We show that under mild assumptions the universal adversarial example game lacks a pure Nash equilibrium, indicating UAPs' suboptimal transferability across DNN classifiers. To address this issue, we propose Universal Adversarial Directions (UADs) which only fix a universal direction for adversarial perturbations and allow the perturbations' magnitude to be chosen freely across samples. We prove that the UAD adversarial example game can possess a Nash equilibrium with a pure UAD strategy, implying the potential transferability of UADs. We also connect the UAD optimization problem to the well-known principal component analysis (PCA) and develop an efficient PCA-based algorithm for optimizing UADs. We evaluate UADs over multiple benchmark image datasets. Our numerical results show the superior transferability of UADs over standard gradient-based UAPs.  ( 2 min )

  • Open

    Cybersecurity Measures Can Protect Windows Devices From Venus Ransomware
    Venus ransomware has been active since August 2022. The attack campaign targets Remote Desktop services. Which cybersecurity measures should you take to stay safe? The post Cybersecurity Measures Can Protect Windows Devices From Venus Ransomware appeared first on Data Science Central.  ( 21 min )
    One Big Graph and the Interorganization
    I’m using the term web3 in a hopeful way with an expansive definition in mind, one that’s not familiar to most people. When I mention web3, it’d be good if you could put aside the associations with cryptocurrency, blockchain, and an endless, speculation-driven hype cycle for a moment.  The post One Big Graph and the Interorganization appeared first on Data Science Central.  ( 21 min )
    Low-code AI Development Could Be a Good Strategy for SMBs
    Low-code development is getting a lot of traction. Low-code platforms provide a more accessible (and typically graphical) interface for developing applications. The post Low-code AI Development Could Be a Good Strategy for SMBs appeared first on Data Science Central.  ( 19 min )
    How to Create A Green Culture in a Business Environment
    A green culture is a way of operating that values environmental protection and sustainability in business. The post How to Create A Green Culture in a Business Environment appeared first on Data Science Central.  ( 20 min )
  • Open

    Microtonic Patternarium AI beat maker.
    submitted by /u/Cyberstr33t [link] [comments]  ( 41 min )
    AI Dream 111 - Halloween Special - "fashionable paranoia" (read: Sickish...
    submitted by /u/LordPewPew777 [link] [comments]  ( 43 min )
    Need help with arguments for and against AI predicting criminals based on face shape
    So I've got a seminar tomorrow and I need to bring 3 arguments for and 3 arguments against using AI to decide who is a criminal based on their face shape. I can't really think of any arguments for using this technology other than that it could be a tool to prevent crimes before they occur. And the only argument I have against using it is that it could give racist or discriminatory results. Could anyone give me some help with more arguments? submitted by /u/Aysey [link] [comments]  ( 45 min )
    Tesla Bot will lead to Artificial General Intelligence
    submitted by /u/mostdiabolical [link] [comments]  ( 49 min )
    [Q] How to detect multiple sub-categories in character strings
    I am using a character-based CNN for categorizing data strings. Inside those strings are sub-segments that I want to categorize, or recognize. I see it like 2D object recognition in R-CNN or SSD: they put bounding boxes around multiple objects in an image. That's what I want, to underline/tag substrings and identify their categories. I also think my problem is analogous to 1D data like ECG traces with the PQRST parts of the heartbeat. Many ECG algorithms talk about identifying the unusual heartbeat, but I haven't found examples of putting boxes around the sub-segment parts of the heartbeat (the PQRST parts). Does anyone have tips or direction on how I can change my conv1d CNN into a conv1d R-CNN or SSD? I have implemented it in PyTorch but I can change to other libraries if needed. submitted by /u/rich_atl [link] [comments]  ( 46 min )
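    One way to frame this is per-character tagging (1D semantic segmentation) rather than region proposals: every character gets a class, and contiguous runs of one class become the "boxes". A minimal PyTorch sketch, with all layer sizes, the vocabulary size and the class count hypothetical:

        import torch
        import torch.nn as nn

        # Per-character tagger: every position in the string gets a category
        # (plus a "background" class). Contiguous runs of one class become
        # the underlined/tagged sub-segments.
        class CharTagger(nn.Module):
            def __init__(self, vocab_size=128, embed_dim=32, n_classes=5):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, embed_dim)
                self.conv = nn.Sequential(
                    nn.Conv1d(embed_dim, 64, kernel_size=5, padding=2), nn.ReLU(),
                    nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
                )
                self.head = nn.Conv1d(64, n_classes + 1, kernel_size=1)  # +1 background

            def forward(self, x):                  # x: (batch, seq_len) char ids
                h = self.embed(x).transpose(1, 2)  # (batch, embed_dim, seq_len)
                return self.head(self.conv(h))     # (batch, n_classes+1, seq_len)

        model = CharTagger()
        chars = torch.randint(0, 128, (2, 100))    # two dummy strings
        logits = model(chars)                      # per-position class scores
        tags = logits.argmax(dim=1)                # (batch, seq_len) predicted tags

    Training would use nn.CrossEntropyLoss on per-position labels; if true region proposals are needed, the same conv backbone can feed a 1D anchor head instead.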
    You can use a machine learning docker image to save time setting up a new machine
    submitted by /u/tech_geeky [link] [comments]  ( 42 min )
    Google Acquired An AI Avatar Startup 'Alter' For $100 Million To Take On TikTok
    submitted by /u/vadhavaniyafaijan [link] [comments]  ( 40 min )
    Are you into Artificial Art Fantasies...?
    submitted by /u/ArtifulDream [link] [comments]  ( 42 min )
    My autocorrect is sentient (proof)
    submitted by /u/Informal_Collar426 [link] [comments]  ( 46 min )
  • Open

    [D] - What is the SOTA for Mixture of Experts Routers? Have MoE routers that output arithmetic or Boolean combinations of experts been used?
    What is the SOTA for MoE with regards to routers? How many nets does the router normally pick? Have routers that output something like (Expert3 + Expert2) OR (Expert4) ever been tried? submitted by /u/029187 [link] [comments]  ( 63 min )
    [D] How do I get a certain number of frames before every point of impact in a tennis match video?
    I am trying to follow the dataset-creation steps described in this research paper. In the paper, they state that they use sequential data as input for the model they are developing, wherein the 20 frames before every point of impact in a tennis match are taken. How can I go about doing that? Could anyone explain it in as simple terms as possible or point me to an appropriate resource? Quoting the paper: For this dataset, we used a tennis match video (1080 × 720 pixels, 25 fps) of a professional tennis match uploaded to YouTube. The video is input, making the players’ rectangular images of the time of impact of each player to 20 frames before impact submitted by /u/ChaosAdm [link] [comments]  ( 61 min )
    [R] POSE-NDF — modeling human pose manifolds with neural distance fields
    submitted by /u/SpatialComputing [link] [comments]  ( 60 min )
    [D] Looking for suggestions on setting up autoscaling on GPU servers for AI inference (without Kubernetes)?
    I'm building an application that runs AI model inference on GPU servers. Based on the demand profile (number of requests and GPU utilization), I want to autoscale the GPU servers up and down. I don't want to use Kubernetes for orchestration/autoscaling as it is overkill for my application, which is pretty experimental right now. Also, I don't need all the MLOps lifecycle management, as I am using an open-source model that doesn't need frequent updates. All I am currently looking for is suggestions on how I should go about implementing a simple approach for scaling GPU servers based on incoming demand (such as the number of requests/min or GPU utilization). submitted by /u/fgp121 [link] [comments]  ( 62 min )
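    A minimal sketch of such a simple approach, assuming a plain control loop polling a metrics source and calling stubbed-out provisioning functions (none of the helpers below are real APIs; replace them with e.g. boto3 or google-cloud-compute calls):

        import time

        # Stubs standing in for your metrics system and cloud provider SDK.
        def get_requests_per_min():
            return 120                   # stub: read from your load balancer/metrics

        def get_running_servers():
            return ["gpu-0"]             # stub: list of running server handles

        def launch_server():
            print("launching server")    # stub: cloud SDK call goes here

        def terminate_server(handle):
            print("terminating", handle) # stub: cloud SDK call goes here

        REQS_PER_SERVER = 60             # assumed capacity of one GPU server
        MIN_SERVERS, MAX_SERVERS = 1, 8
        COOLDOWN_S = 300                 # pause between actions to avoid flapping

        while True:
            servers = get_running_servers()
            demand = get_requests_per_min()
            target = max(MIN_SERVERS, min(MAX_SERVERS,
                         -(-demand // REQS_PER_SERVER)))   # ceiling division
            if target > len(servers):
                for _ in range(target - len(servers)):
                    launch_server()
            elif target < len(servers):
                for handle in servers[target:]:
                    terminate_server(handle)
            time.sleep(COOLDOWN_S)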
    [R] Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation
    In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token only attends to the summarization of the other bars rather than each token of them so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music structure-related correlations via the fine-grained attention, and other contextual information via the coarse-grained attention. Second, it is efficient and can model over 3x longer music sequences compared to its full-attention counterpart. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structures. Paper: https://arxiv.org/abs/2210.10349 Demo Sample: https://ai-muzic.github.io/museformer/ Code: https://github.com/microsoft/muzic Media Report: https://medium.com/mlearning-ai/microsofts-museformer-ai-music-is-the-new-frontier-8dc5cb24459c submitted by /u/tobyoup [link] [comments]  ( 58 min )
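    To make the fine-grained half concrete, here is a much-simplified toy sketch of a bar-relative attention mask; it omits the summary tokens that carry the coarse-grained attention, and the bar ids and offsets are illustrative rather than the paper's implementation:

        import torch

        # Toy version of a bar-relative (fine-grained) attention mask: a token
        # in bar b may attend to tokens in its own bar and in bars b-1, b-2,
        # b-4, b-8; in the real model, all other context flows through per-bar
        # summary tokens, which are omitted here.
        def fine_attention_mask(bar_ids, offsets=(0, 1, 2, 4, 8)):
            q = bar_ids.unsqueeze(1)      # (n, 1) bar of each query token
            k = bar_ids.unsqueeze(0)      # (1, n) bar of each key token
            diff = q - k                  # how many bars back the key lies
            mask = torch.zeros_like(diff, dtype=torch.bool)
            for off in offsets:
                mask |= diff == off
            return mask                   # True = attention allowed

        bar_ids = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])  # two tokens per bar
        print(fine_attention_mask(bar_ids).int())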
    [R] TOCH outperforms state of the art 3D hand-object interaction models and produces smooth interactions even before and after contact
    submitted by /u/SpatialComputing [link] [comments]  ( 63 min )
    [Project] Improving deep learning for tabular data with numerical embeddings (FT-Transformer)
    Hey Reddit! I'm continuing to explore deep learning for tabular data in hopes of finding something that works 😄 This time, I'm looking into the FT-Transformer (Feature Tokenizer Transformer) (https://arxiv.org/pdf/2106.11959v2.pdf). The model is similar to the TabTransformer in that both use Transformers to get contextualised embeddings which are then passed through the classification layers. However, there are two main differences between the models: first, FT-Transformer passes numerical embeddings through the Transformer together with the categorical ones, while TabTransformer only contextualised the categorical embeddings; second, FT-Transformer uses the CLS token embedding as input to the final layer, whereas TabTransformer concatenated all the embeddings. The model definitely outperforms TabTransformer and shows quite decent performance across the evaluated datasets. I was able to reproduce the results on some of the datasets, so it's an interesting and good piece of research! Having said that, the model takes forever to train and it wasn't that good for my client's data. So my quest continues 😄 I've written a post that describes the numerical embeddings in more depth, looks at the explainability aspect, and shows how to train it with my tabtrasnformertf package. Here's a link if you're interested - https://towardsdatascience.com/improving-tabtransformer-part-1-linear-numerical-embeddings-dbc3be3b5bb5 submitted by /u/blessedorcursed [link] [comments]  ( 64 min )
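    The linear numerical embedding the post refers to is compact enough to sketch: each scalar feature x_i becomes its own d-dimensional token x_i * w_i + b_i, so continuous features enter the Transformer alongside the categorical ones. A minimal PyTorch version (the post's package is TensorFlow-based; dimensions here are hypothetical):

        import torch
        import torch.nn as nn

        # Linear numerical embeddings: feature i becomes the token x_i*w_i + b_i.
        class NumericalEmbedding(nn.Module):
            def __init__(self, n_features, d_model):
                super().__init__()
                self.w = nn.Parameter(torch.randn(n_features, d_model))
                self.b = nn.Parameter(torch.zeros(n_features, d_model))

            def forward(self, x):          # x: (batch, n_features) scalars
                return x.unsqueeze(-1) * self.w + self.b  # (batch, n_features, d_model)

        emb = NumericalEmbedding(n_features=3, d_model=16)
        tokens = emb(torch.randn(8, 3))    # 8 rows -> 3 tokens of width 16 each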
    [P][D] Size Differentiation in Products
    I'm doing a university project where I'm making a checkout system for retail using YOLOv7, but there is one thing that is still proving to be a limitation of this project: how do I differentiate between two versions of the same product in different sizes? For example, how do I differentiate between the same snack in different weight and price brackets, or two of the same soda in different liters/oz -- basically differentiating between, say, a regular and a party pack of the same product. I was thinking of doing some area calculations using bounding boxes, but that needs the camera to be at a fixed distance in every scenario, and ranges need to be pre-defined. Is there another, more efficient way to go about doing this? Also, is making this sort of AI-powered system using YOLOv7 the right approach, or should I use another algorithm or model? I've trained 5 products for a test and the precision and recall scores are >90%, but I want to know if this is the right direction before I go too deep into it. submitted by /u/LimitlessSaint [link] [comments]  ( 61 min )
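    One rough heuristic, sketched below with made-up numbers: compare the product's box area to a reference object of roughly known size detected in the same frame (a shelf tag, a marked checkout area). The ratio removes the fixed-camera-distance requirement, although it still assumes the reference sits at a similar depth to the product:

        # Toy size heuristic: box areas relative to a known reference object.
        def box_area(b):                       # b = (x1, y1, x2, y2)
            return max(0, b[2] - b[0]) * max(0, b[3] - b[1])

        def relative_size(product_box, reference_box):
            return box_area(product_box) / box_area(reference_box)

        ratio = relative_size((100, 100, 300, 500), (50, 50, 90, 130))
        label = "party pack" if ratio > 20 else "regular"  # threshold tuned per SKU
        print(ratio, label)                    # 25.0 party pack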
    [R] Deep model with inputs of unbalanced sizes
    I am dealing with a deep learning task where the model has several inputs of very different sizes. Moreover, the inputs of smaller sizes are those that actually have more influence on the output. To give you an idea of the scale, one input is a 200-dimensional vector, another input is a 1-dimensional number, and another is a 5-dimensional vector. They are all useful for predicting the correct output, but the 1- and 5-dimensional ones are particularly helpful. At the moment I am concatenating all of them, but I suspect that this isn't the best approach in this case, as there is noise in the training process (it's for an RL agent) and I fear that it would be difficult for the model to learn to put enough focus on those small inputs. Do you know of any work that examines the effect of different input sizes on neural networks? It might turn out that this is not a problem after all. submitted by /u/fedetask [link] [comments]  ( 69 min )
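    One common remedy worth sketching, under hypothetical widths: give each input its own small encoder that maps it to a comparable dimensionality before concatenation, so the 1-d and 5-d inputs are not drowned out by the 200-d one:

        import torch
        import torch.nn as nn

        # Each input gets its own encoder projecting to the same width, so no
        # input dominates the concatenated representation by raw size alone.
        class MultiInputNet(nn.Module):
            def __init__(self):
                super().__init__()
                self.enc_big   = nn.Sequential(nn.Linear(200, 32), nn.ReLU())
                self.enc_small = nn.Sequential(nn.Linear(5, 32), nn.ReLU())
                self.enc_tiny  = nn.Sequential(nn.Linear(1, 32), nn.ReLU())
                self.head = nn.Linear(3 * 32, 1)

            def forward(self, big, small, tiny):
                z = torch.cat([self.enc_big(big),
                               self.enc_small(small),
                               self.enc_tiny(tiny)], dim=-1)
                return self.head(z)

        net = MultiInputNet()
        out = net(torch.randn(4, 200), torch.randn(4, 5), torch.randn(4, 1))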
    [P][R] Modern Disney Diffusion, dreambooth model trained using the diffusers implementation
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 67 min )
  • Open

    Smoothly extending arctan(k tan(t))
    I wrote a while back about the function f(t) = arctan(k tan(t)). I keep running into this function. Has anybody given this function a name or studied it? The direct implementation has a discontinuity at π/2 but I needed to extend it continuously. Using the two-argument version of inverse tangent fixes this. In Python, the […] Smoothly extending arctan(k tan(t)) first appeared on John D. Cook.  ( 5 min )
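    The post's Python snippet is cut off above, but the two-argument fix it names presumably rests on the identity arctan(k tan t) = atan2(k sin t, cos t), which holds on (-pi/2, pi/2) and extends the function continuously past pi/2. A small sketch of that reading:

        import numpy as np

        # arctan(k * tan(t)) jumps at t = pi/2 because tan blows up there.
        # atan2(k * sin(t), cos(t)) agrees with it on (-pi/2, pi/2) -- divide
        # both arguments by cos(t) > 0 -- and passes through pi/2 smoothly.
        def f(t, k):
            return np.arctan2(k * np.sin(t), np.cos(t))

        t = np.linspace(0, np.pi, 9)
        print(f(t, 2.0))     # increases smoothly through pi/2, no jump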
  • Open

    A training booster for the Deterministic Policy Gradient
    Reinforcement learning is a field where an agent learns while acting in the world or a virtual environment. Notice that during training, babies make a lot of random moves: they hit the table with their hands and wave a spoon around wildly; in other words, they explore. With time they gain experience and become agile. Like all humans, we exploit what we have already learned; the more we learn (explore), the more fluent or agile we become, e.g. in sports or even in science. What I present here is a "training booster" -- as opposed to soft exploration techniques -- an improvement to the deterministic policy gradient method that can help save energy during training (different variants of noise addition for exploration are not considered here). The Deterministic Policy Gradient updates the Actor's weights to satisf…  ( 52 min )
    Mathematics for RL Theory?
    submitted by /u/Professional_Card176 [link] [comments]  ( 49 min )
    Hello guys, I was solving my university question paper and I found this problem. Can anyone help me solve it? I can't understand it. (Let's say the 5 numbers are 206130)
    submitted by /u/Asleep-Ad4480 [link] [comments]  ( 45 min )
  • Open

    Gradient Descent explained.
    Hi everyone. I made this post to explain gradient descent with an example. It takes the derivative of the cost function into account, and I have tried to demonstrate through graphs how the learning rate plays a role. Feel free to have a look and comment. https://kolbenkraft.net/gradient-descent-explained-with-example/ submitted by /u/kolbenkraft [link] [comments]  ( 40 min )
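    For readers who want the mechanics inline, a minimal self-contained example of the update w <- w - lr * dJ/dw on a one-parameter least-squares cost (all numbers arbitrary):

        import numpy as np

        # Gradient descent on J(w) = mean((w*x - y)^2); dJ/dw = 2*mean(x*(w*x - y)).
        x = np.array([1.0, 2.0, 3.0, 4.0])
        y = 3.0 * x + np.random.normal(0, 0.1, size=x.shape)  # true slope 3

        w, lr = 0.0, 0.01    # lr must stay below 1/mean(x^2) here to converge
        for _ in range(200):
            grad = 2 * np.mean(x * (w * x - y))
            w -= lr * grad

        print(w)             # ends near 3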

  • Open

    [D] ResNet101 VS InceptionResNetV2
    Is there any reason for ResNet101 to outperform InceptionResNetV2? I am comparing my trained image models, and ResNet101 seems to be performing better, while their AUROCs are comparable. Any reasoning would be appreciated! submitted by /u/djsamyak [link] [comments]  ( 62 min )
    [D] Recursive Training
    If we train Imagen using real images mixed with AI-generated images, would it cause a degradation of quality? Any study on this recursive pipeline and is there any strategy to sanitize source data? submitted by /u/HatsusenoRin [link] [comments]  ( 62 min )
    [N] Andrej Karpathy: Tesla AI, Self-Driving, Optimus, Aliens, and AGI | Lex Fridman Podcast #333
    https://www.youtube.com/watch?v=cdiD-9MMpb0 OUTLINE: 0:00 - Introduction 0:58 - Neural networks 6:01 - Biology 11:32 - Aliens 21:43 - Universe 33:34 - Transformers 41:50 - Language models 52:01 - Bots 58:21 - Google's LaMDA 1:05:44 - Software 2.0 1:16:44 - Human annotation 1:18:41 - Camera vision 1:23:46 - Tesla's Data Engine 1:27:56 - Tesla Vision 1:34:26 - Elon Musk 1:39:33 - Autonomous driving 1:44:28 - Leaving Tesla 1:49:55 - Tesla's Optimus 1:59:01 - ImageNet 2:01:40 - Data 2:11:31 - Day in the life 2:24:47 - Best IDE 2:31:53 - arXiv 2:36:23 - Advice for beginners 2:45:40 - Artificial general intelligence 2:59:00 - Movies 3:04:53 - Future of human civilization 3:09:13 - Book recommendations 3:15:21 - Advice for young people 3:17:12 - Future of machine learning 3:24:00 - Meaning of life The episode was made after a call for questions by Lex Fridman himself: https://www.reddit.com/r/MachineLearning/comments/y89xqw/d_call_for_questions_for_andrej_karpathy_from_lex/ submitted by /u/Singularian2501 [link] [comments]  ( 69 min )
    [D] AMD GPU Performance reaches state-of-the-art
    submitted by /u/anqmj [link] [comments]  ( 62 min )
    Has anyone tried streaming data to an OpenAI Gym environment? [D]
    I've looked all over for an example of how to do this but I cannot find one. Has anyone tried this before? Essentially what I would like to do is stream the data to my environment and then, every time the data is updated, run the PPO model to make some predictions. submitted by /u/DataD23 [link] [comments]  ( 60 min )
    [R] ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts + Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 60 min )
    [D] What are the bottlenecks in your ML project lifecycle? What tools would you like to see more widely used?
    Hey, I'm a Senior ML research engineer currently working at the intersection of the automotive and security industries. It's the weekend and I'm just being curious. I'm wondering about other people who are working on building products with ML/AI at the core. What are the bottlenecks people working in our field regularly face in their project development lifecycle? Is it data collection, QA, internal tooling, model development, real-world performance evaluation, etc.? What tools do you wish existed to help clear these bottlenecks? Tell me about them! Maybe they already exist and someone might be able to point you in the right direction! How about this response template: What's your role? What industry does most of your work fall into? How big is your team? What area of the entire ML project lifecycle do you think stops you from doing great work the most? If you found a genie lamp and could wish into existence three tools (no matter how technically difficult to create) to support you, what would they be? submitted by /u/Fine-Topic-6127 [link] [comments]  ( 66 min )
    [2210.12574] The Curious Case of Absolute Position Embeddings
    submitted by /u/hosjiu [link] [comments]  ( 57 min )
    [R] Open source inference acceleration library - voltaML
    We have recently open-sourced our inference acceleration library, voltaML. ⚡VoltaML is a lightweight library to convert and run your ML/DL models in high-performance inference runtimes like TensorRT, TorchScript, ONNX and TVM. We would love for the Reddit and open-source communities to use it, give feedback and help us improve the library. https://github.com/VoltaML/voltaML submitted by /u/harishprab [link] [comments]  ( 61 min )
    [P] Built an OS Platform to Annotate and Run NLP Models on PDFs
    Hey Everyone, I wanted to share a project I've been working on to make it easy to 1) annotate pdfs, 2) share the annotations, 3) build new datasets from old datasets ("fork" them like in git) and 4) easily visualize the results of NLP models in their natural layout. Source code is here, demo is here and docs are here. It's still a work in progress and very much a side project, but I wanted to share it with the community as I think there's a gap in good PDF labeling tooling (though, obviously there's PAWLs, Grobid and others). Some highlights: easy to add non-contiguous and multi-page text annotations; fork existing corpuses to save time creating new training sets; a GraphQL API; a modular NLP engine that lets you deploy any NLP model via a consistent API; store and review results from multiple NLP engines; and view NLP outputs in their native layout. You can also import and export corpuses, complete with positional and span information for training NLP models. The specific format is still a work in progress. The demo I have running shows an example of using the NLP API and microservice to run spaCy's NER as an analysis backend. You can click on the "Atticus Project" corpus and, when you open a document in there, you'll see you can click on the "Spacy Analyzer" to see what spaCy sees. Before I forget, shout-out to the PAWLs project, which I based a good chunk of my own annotator on (though it's only a fraction of this project and has so many modifications that contributing the code back was impractical). The backend is based on Django and the NLP analyzers are based on a separate framework I haven't released officially yet but plan to AGPL. submitted by /u/TallTahawus [link] [comments]  ( 61 min )
    [D] NeurIPS 2022 - Why is the release of the reviews delayed?
    The reviews of papers submitted to NeurIPS were supposed to become public on 20th Oct. Does anyone know why OpenReview doesn't show anything yet? Is that related to the ongoing ICLR review process, i.e., trying to prevent reviewers from being influenced by past reviews and the author names/affiliations of rejected papers? Or is anything else happening? submitted by /u/No-Spirit-7840 [link] [comments]  ( 62 min )
    [D] Are there any models that can take a video as input and generate a promotional video out of it, like summarization of a text article?
    For example, the most exciting parts in under 2 minutes -- the video analogue of text summarization. submitted by /u/CeFurkan [link] [comments]  ( 59 min )
  • Open

    Adversarial behavior in model-based RL
    I recall reading somewhere that one drawback of model-based RL is that an approximate dynamics model is not suited to planning algorithms designed for exact models. And that this can generate trajectories that are adversarial in nature. Can someone point me to relevant examples/references? submitted by /u/mrscabbycreature [link] [comments]  ( 52 min )
    SAFE-PANDA-GYM: a modification to Panda-Gym to train Safe-RL agents
    We develop a modification to Panda-Gym by adding constraints to the environments, like unsafe regions and constraints on the task. The aim is to develop an environment for testing CMDP (Constrained Markov Decision Process) / Safe-RL algorithms such as CPO, PPO-Lagrangian and algorithms developed by the team. Agents not only have to come up with an optimal policy for control and planning, but must also ensure they don't violate a constraint. Safe-Panda-Gym is developed with the following key features: safe environments that take the constraints into account, like PandaReachSafe-v2, PandaPushSafe-v2, PandaSlideSafe-v2, PickAndPlaceSafe-v2 and PandaStackSafe-v2; support for image-based environments, which you can see in rgb_rendering_safe.py in the test_safe_envs folder; and support for SafePO-Baselines to train on the safe environments in our repo, which can be seen in the train_safe_rl_algorithms folder. URL: https://github.com/tohsin/Safe-panda-gym.git submitted by /u/ConsiderationCivil74 [link] [comments]  ( 53 min )
  • Open

    A Guide to Data Protection Methods
    Data loss is a highly possible and ugly eventuality that can befall any business. But you can prevent it by backing up your information before the proverbial rainy day. The post A Guide to Data Protection Methods appeared first on Data Science Central.  ( 20 min )
  • Open

    Hand tracking will be a game changer for future AR/VR experiences, and this is the first-ever algorithm capable of tracking high-fidelity hand deformations through self-contacting and self-occluding hand gestures.
    submitted by /u/ai-lover [link] [comments]  ( 42 min )
    FT: The golden age of AI-generated art is here. It’s going to get weird. How software that can create almost any image from a few words will change human creativity (by Tom Faber, OCTOBER 26, 2022)
    FT: "The golden age of AI-generated art is here." It’s going to get weird. How software that can create almost any image from a few words will change human creativity by Tom Faber, OCTOBER 26, 2022 What is the face of the man behind the apple? For almost 60 years, the figure wearing a sombre suit and bowler hat in René Magritte’s painting “The Son of Man” has been obscured by a polished green apple. His facial features were intended to remain a mystery, the fruit an artistic provocation. Today, using new technology, 23-year-old digital artist Josephine Miller can roll the apple away. Miller tilts her laptop towards me in the hushed café of the British Library in London to show how she used Dall-E 2, software that generates images using artificial intelligence (AI), to remove the frui…  ( 73 min )
    If we had a large dataset of DNA samples and pictures of the owners faces, would it be possible to train an AI model to generate faces from DNA?
    Is this being done? If not, why wouldn't it work? submitted by /u/FizzCode [link] [comments]  ( 45 min )
    AI Dream 85 - Halloween Special - Terrifier 2 Witch-Hunt Teaser
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    First Deep Sea (1km) Humanoid Robot Drone Lets Operators Touch And Feel With Haptic Feedback | New Machine Learning Tech Transcodes Brain Waves Into Natural Language To Read Human's Thoughts
    submitted by /u/kenickh [link] [comments]  ( 41 min )
    Could Astro Compete with Next.js to Become the Next Big Framework?
    Developers have fallen into a problem-solution loop. There is a continuous effort to solve one or another problem in the developer community. Ironically, there are so many solutions now that there is a need for another solution for the existing ones. https://analyticsindiamag.com/could-astro-compete-with-next-js-to-become-the-next-big-framework/ submitted by /u/analyticsindiam [link] [comments]  ( 43 min )
    That's pretty cool; it makes creating characters much easier.
    submitted by /u/Yuui_Smile [link] [comments]  ( 40 min )
    Conversational AI and the Future of Customer Interactions
    Conversational AI is crucial in reshaping customer interactions across different work functions, from sales to customer support. Visit Floatbot to understand how conversational AI is going to impact customer interactions. submitted by /u/Floatbot_Inc [link] [comments]  ( 43 min )
    How to Write an Article / Blog Post With AI (GPT-3) - Step by Step
    submitted by /u/allaboutai-kris [link] [comments]  ( 48 min )
    IBM Research Introduces Artificial Intelligence Unit (AIU): It’s First Complete System-on-Chip Designed to Run and Train Deep Learning Models Faster and More Efficiently than a General-Purpose CPU
    submitted by /u/ai-lover [link] [comments]  ( 40 min )
  • Open

    Do you want to experience an AI Character in the Metaverse that makes intros?
  • Open

    Random illustrations of Pascal’s theorem
    Pascal’s theorem begins by selecting any six distinct points on an ellipse and drawing a “hexagon.” I put hexagon in quotes because the result need not look anything like a hex nut. In this context it simply means to pick one point, connect it to some other point, and so forth, joining the points in […] Random illustrations of Pascal’s theorem first appeared on John D. Cook.  ( 6 min )
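    A small numpy sketch of one such random illustration, using homogeneous coordinates so that both the line through two points and the meet of two lines are cross products (the ellipse axes and the seed are arbitrary):

        import numpy as np

        rng = np.random.default_rng(0)
        t = np.sort(rng.uniform(0, 2 * np.pi, 6))      # six points on an ellipse
        P = np.column_stack([3 * np.cos(t), 2 * np.sin(t), np.ones(6)])

        def meet(a, b, c, d):
            # In homogeneous coordinates, the line through two points and the
            # intersection of two lines are both cross products.
            p = np.cross(np.cross(P[a], P[b]), np.cross(P[c], P[d]))
            return p / np.linalg.norm(p)

        # Opposite sides of the "hexagon" 0-1-2-3-4-5 meet in three points...
        q = np.vstack([meet(0, 1, 3, 4), meet(1, 2, 4, 5), meet(2, 3, 5, 0)])
        # ...which Pascal's theorem says are collinear: the determinant is ~0.
        print(np.linalg.det(q))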
  • Open

    New Machine Learning Tech Transcodes Brain Waves Into Natural Language To Read Human's Thoughts
    submitted by /u/kenickh [link] [comments]  ( 41 min )

  • Open

    [D] How to get the fastest PyTorch inference and what is the "best" model serving framework?
    TL;DR I am trying to work out the ‘best’ options for speeding up model inference and model serving. Specifically, I am looking to host a number of PyTorch models and want - 1) the fastest inference speed, 2) an easy to use and deploy model serving framework that is also fast. For 1), what is the easiest way to speed up inference (assume only PyTorch and primarily GPU but also some CPU)? I have been using ONNX and Torchscript but there is a bit of a learning curve and sometimes it can be tricky to get the model to actually work. Is there anything else worth trying? I am enthused by things like TorchDynamo (although I have not tested it extensively) due to its apparent ease of use. I also saw the post yesterday about Kernl using (OpenAI) Triton kernels to speed up transformer models which also looks interesting. Are things like SageMaker Neo or NeuralMagic worth trying? My only reservation with some of these is they still seem to be pretty model/architecture specific. I am a little reluctant to put much time into these unless I know others have had some success first. For 2), I am aware of a few options. Triton inference server is an obvious one as is the ‘transformer-deploy’ version from LDS. My only reservation here is that they require the model compilation or are architecture specific. I am aware of others like Bento, Ray serving and TorchServe. Ideally I would have something that allows any (PyTorch model) to be used without the extra compilation effort (or at least optionally) and has some convenience things like ease of use, easy to deploy, easy to host multiple models and can perform some dynamic batching. Anyway, I am really interested to hear people's experience here as I know there are now quite a few options! Any help is appreciated! Disclaimer - I have no affiliation or are connected in any way with the libraries or companies listed here. These are just the ones I know of. Thanks in advance. submitted by /u/big_dog_2k [link] [comments]  ( 59 min )
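    Not a full answer, but one low-effort baseline worth pinning down before the fancier options: TorchScript tracing plus freezing, whose saved artifact is also what Triton's PyTorch backend serves. A minimal sketch (the model choice is arbitrary):

        import torch
        import torchvision

        model = torchvision.models.resnet50().eval()
        example = torch.randn(1, 3, 224, 224)

        with torch.inference_mode():
            traced = torch.jit.trace(model, example)   # record ops, drop Python
            traced = torch.jit.freeze(traced)          # fold constants/attributes

        # The saved artifact can be loaded by Triton's PyTorch backend or libtorch.
        torch.jit.save(traced, "resnet50_ts.pt")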
    [D] DL practitioners, do you use layer visualization tools such as GradCAM in your process?
    Hey all, I'm just wondering how other teams approach these tasks. I have a feeling I'm a bit behind in my process in terms of debugging/improving my models, so I'm wondering whether you leverage these kinds of tools (the most recent examples I know of are CartoonX and Pixel RDE) to debug/improve your models. Outside sources, papers or specific examples from your experience would be great! Thank you. submitted by /u/DisWastingMyTime [link] [comments]  ( 61 min )
    Need help choosing ML models [D]
    I am doing an MS in Data Science. I have some experience implementing models from my work as a developer, but I am facing difficulty knowing which model to use when. Please suggest a learning path for learning the basics and internals of ML models. submitted by /u/master_of_whispers_ [link] [comments]  ( 55 min )
    [D] Looking for feedback on a dataset collaboration tool for ML projects
    I’ve been working as an applied ML engineer for almost a decade now, and in recent years I find I’m spending a lot less time fiddling with model architectures and a lot more time digging into the data - rebalancing the data, removing bad labels, figuring out features we should add, and sourcing new labeled examples of cases where the model is weak. Often I’ve found this means collaborating with less-technical folks (e.g. PMs, managers, annotators, data domain experts, etc.) on the team to dig through individual examples in our training set. While the engineers and data scientists on the team are usually comfortable looking at examples in a Python notebook environment, with the non-technical folks, I often need to screen-share my notebook with them, or we hack together a Google spreadsheet…  ( 61 min )
  • Open

    AI Dream 95 - Halloween Special - Cursed Flight of the Rainbow Wishes
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    Presidential boss battle (AI images I created using the Dream by Wombo app)
    submitted by /u/d1rty_3lb0w5 [link] [comments]  ( 41 min )
    Best Practices from Provectus for Migrating and Optimizing Amazon EMR Workloads
    submitted by /u/TallAssociation0 [link] [comments]  ( 48 min )
    Train MULTIPLE Subjects At The Same Time With DREAMBOOTH
    submitted by /u/PuppetHere [link] [comments]  ( 47 min )
    IBM’s AI Accelerator: This Had Better Not Be Just A Science Project
    submitted by /u/DerBootsMann [link] [comments]  ( 46 min )
    💡NeRF + AI Art 🖼️ I made this fly-through animation of a 3D scan using Instant NGP. Then I used Stable Diffusion to transform the scene into 3 different styles: photorealistic, surrealistic, and even Minecraft-style voxels! 😍
    submitted by /u/imaginfinity [link] [comments]  ( 41 min )
    Supposedly Ant Group created some AI/business logic frameworks but only provided a name and github link for one of them. Does anybody know anything solid about them?
    https://www.businesswire.com/news/home/20220903005014/en/Ant-Group-Makes-Trusted-AI-Solutions-More-Accessible-to-Support-Industrial-Collaboration-in-Digital-Economy They only name TuGraph. Supposedly they made their "privacy-preserving computation framework open source" but there's no name or GitHub link. I contacted their media guy but got no response. submitted by /u/technocratofzigurrat [link] [comments]  ( 42 min )
    I made a music video using the song's lyrics as prompts in midjourney
    submitted by /u/skillerbg123 [link] [comments]  ( 41 min )
    Will killing NPCs be illegal?
    Having a weird thought here. Let's say next week GTA 10 comes out, featuring NPCs that are self-conscious. They have a personality and can think and learn on their own (similar to the AI Sophia). They only exist in the game, and there is no way for them to enter the real world. My question is: do these NPCs have rights (because they have a personality)? Would killing/hurting or otherwise harming them be illegal? submitted by /u/Kitty_Perry [link] [comments]  ( 44 min )
    Get Your Unique 👟 Design with This AI Tool
    submitted by /u/amirdol7 [link] [comments]  ( 47 min )
    High school graduation paper
    I'm a senior in high school and I have a graduation project (something like a thesis) in my computer class. The professor suggested a topic regarding AI, but I'm not sure what to do. Do you guys have any ideas for an interesting topic with enough material online that could be presented in school? Thanks :) submitted by /u/Potential-Wrap2844 [link] [comments]  ( 43 min )
    AI Generated Video and Song
    submitted by /u/Icy_Search_2374 [link] [comments]  ( 49 min )
    Is there any AI that does text-to-video?
    submitted by /u/GroundbreakingLaw878 [link] [comments]  ( 41 min )
    living in the desert.
    submitted by /u/GroundbreakingLaw878 [link] [comments]  ( 41 min )
    A few pages from my Midjourney produced printed manga, AbsXcess.
    submitted by /u/MobileFilmmaker [link] [comments]  ( 40 min )
  • Open

    Finding where two quadratic curves intersect
    Suppose you have two quadratic polynomials in two real variables, f(x, y) and g(x, y), and you want to know whether the two curves f(x, y) = 0 and g(x, y) = 0 intersect, and if they do, find where they intersect. David Eberly has a set of notes on solving systems of polynomial equations that […] Finding where two quadratic curves intersect first appeared on John D. Cook.  ( 6 min )
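    For small instances, a symbolic solver can cross-check whatever resultant-based method you implement from the notes; a minimal sympy sketch on an arbitrary circle/ellipse pair:

        import sympy as sp

        x, y = sp.symbols("x y", real=True)
        f = x**2 + y**2 - 4        # circle of radius 2
        g = x**2 / 4 + y**2 - 1    # ellipse with semi-axes 2 and 1

        # Subtracting the equations eliminates y^2, so this system is easy;
        # sympy solves it directly: the curves touch at (-2, 0) and (2, 0).
        print(sp.solve([f, g], [x, y]))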
  • Open

    OpenAI Gym streaming
    Does anyone know how to send streaming data to an OpenAI Gym environment? I've got a stream that continuously updates a pandas dataframe, which I want to send to the environment, but I'm having a hard time figuring out how to reset the environment to accept the new dataframe. Has anyone tried something like this before? submitted by /u/DataD23 [link] [comments]  ( 48 min )
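    One pattern that fits this post and the PPO question earlier in this digest: give the environment a setter for the latest dataframe and have reset() start from it. A sketch assuming the classic gym API (reset() returns just the observation; on gym >= 0.26 or gymnasium the signatures differ), with the reward logic left as a placeholder:

        import numpy as np
        import gym
        from gym import spaces

        # Environment that replays whatever dataframe it was last handed.
        # On each new batch of streamed data: env.set_data(df); obs = env.reset().
        class StreamingEnv(gym.Env):
            def __init__(self, n_features):
                self.observation_space = spaces.Box(-np.inf, np.inf,
                                                    shape=(n_features,))
                self.action_space = spaces.Discrete(2)
                self.df, self.i = None, 0

            def set_data(self, df):       # push the latest pandas dataframe in
                self.df = df

            def reset(self):
                self.i = 0
                return self.df.iloc[self.i].to_numpy(dtype=np.float32)

            def step(self, action):
                self.i += 1
                obs = self.df.iloc[self.i].to_numpy(dtype=np.float32)
                done = self.i >= len(self.df) - 1
                reward = 0.0              # placeholder: plug in your own logic
                return obs, reward, done, {}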
    Offline RL from multiple (dependent) logging policies
    Hi, I am looking for papers on offline reinforcement learning in the case where the offline data comes from multiple logging policies. I am specifically looking for a setting where the logging policies depend on the previously deployed policies. An example application is a search engine/recommender system wherein an agent (RL or bandit) is trained offline from an already existing logging policy, then deployed online, which generates log data that is used to train the agent again, so there's a sequential dependency in the logging policy. There's work in the bandit literature on a similar topic, "Off-policy Learning for Multiple Loggers", but there the logging policies are independent of each other. submitted by /u/noob_simp_phd [link] [comments]  ( 52 min )
  • Open

    AI imagines my cat in costumes
    I don't dress my cat in costumes because, without even trying, I know she would hate that. But now I can use text-to-image generators like DALL-E2 to imagine what she would look like in costumes. After all, even if it never saw my cat in a robot costume  ( 4 min )
    Bonus: My cat in weird AI costumes
    AI Weirdness: the strange side of machine learning  ( 2 min )
  • Open

    DSC Weekly 25 October 2022 – Re: Your Brains
    Announcements In 2023, organizations will battle a more complex threatscape than ever as web application breaches continue to rise, credential theft and credential stuffing remain a concern, and ransomware demands hit the big players. It’s no wonder that CISOs and security practitioners are more concerned than ever before. Join threat researchers and leading CISOs as… Read More » DSC Weekly 25 October 2022 – Re: Your Brains The post DSC Weekly 25 October 2022 – Re: Your Brains appeared first on Data Science Central.  ( 21 min )
  • Open

    A Brief Introduction to BERT
    As we learned what a Transformer is and how we might train the Transformer model, we notice that it is a great tool to make a computer understand human language. However, the Transformer was originally designed as a model to translate one language to another. If we repurpose it for a different task, we would […] The post A Brief Introduction to BERT appeared first on Machine Learning Mastery.  ( 16 min )
  • Open

    Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v3 [cs.LG] UPDATED)
    A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm -- A generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit-feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To our best knowledge, this is the first such rate and matches the information-theoretic lower bound.  ( 3 min )
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v2 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.  ( 2 min )
    Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model. (arXiv:2105.14016v3 [cs.LG] UPDATED)
    The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.$~$Q-learning) provably learns an $\varepsilon$-optimal policy (resp.$~$Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp.$~$$\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, and hence the title of this paper.  ( 3 min )
    Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models. (arXiv:2210.15458v1 [cs.CL])
    Decoding methods for large language models often trade-off between diversity of outputs and parallelism of computation. Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others), are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of our approach on WMT machine translation, showing substantially reduced variance when estimating expected BLEU score and up to 1 point increased BLEU in oracle experiments.
    Regret Bounds and Experimental Design for Estimate-then-Optimize. (arXiv:2210.15576v1 [math.OC])
    In practical applications, data is used to make decisions in two steps: estimation and optimization. First, a machine learning model estimates parameters for a structural model relating decisions to outcomes. Second, a decision is chosen to optimize the structural model's predicted outcome as if its parameters were correctly estimated. Due to its flexibility and simple implementation, this ``estimate-then-optimize'' approach is often used for data-driven decision-making. Errors in the estimation step can lead estimate-then-optimize to sub-optimal decisions that result in regret, i.e., a difference in value between the decision made and the best decision available with knowledge of the structural model's parameters. We provide a novel bound on this regret for smooth and unconstrained optimization problems. Using this bound, in settings where estimated parameters are linear transformations of sub-Gaussian random vectors, we provide a general procedure for experimental design to minimize the regret resulting from estimate-then-optimize. We demonstrate our approach on simple examples and a pandemic control application.
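    Not the paper's method, only the setting it studies: a toy numeric illustration of estimate-then-optimize in which the regret works out to exactly the squared estimation error (the quadratic model and all numbers are made up):

        import numpy as np

        # Outcome model: cost c(d) = d^2 - 2*theta*d, minimized at d* = theta.
        # Step 1 (estimate): fit theta from noisy observations.
        # Step 2 (optimize): pick the minimizer of the plugged-in model.
        rng = np.random.default_rng(0)
        theta = 1.5
        theta_hat = (theta + rng.normal(0, 0.3, size=100)).mean()

        d_hat = theta_hat                        # argmin of the estimated cost
        cost = lambda d: d**2 - 2 * theta * d    # true cost
        regret = cost(d_hat) - cost(theta)       # equals (theta_hat - theta)^2
        print(theta_hat, regret)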
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v3 [stat.ML] UPDATED)
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time series, these results were limited to data stemming from It\^o-diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data. Moreover, we show that PD-NJ-ODE can be applied successfully to limit order book (LOB) data.
    Kernel Learning for Explainable Climate Science. (arXiv:2209.04947v1 [cs.LG] CROSS LISTED)
    The Upper Indus Basin, Himalayas, provides water for 270 million people and countless ecosystems. However, precipitation, a key component of hydrological modelling, is poorly understood in this area. A key challenge surrounding this uncertainty comes from the complex spatial-temporal distribution of precipitation across the basin. In this work we propose Gaussian processes with structured non-stationary kernels to model precipitation patterns in the UIB. Previous attempts to quantify or model precipitation in the Hindu Kush Karakoram Himalayan region have often been qualitative or have included crude assumptions and simplifications which cannot be resolved at lower resolutions. This body of research also provides little to no error propagation. We account for the spatial variation in precipitation with a non-stationary Gibbs kernel parameterised with an input-dependent lengthscale. This allows the posterior function samples to adapt to the varying precipitation patterns inherent in the distinct underlying topography of the Indus region. The input-dependent lengthscale is governed by a latent Gaussian process with a stationary squared-exponential kernel to allow the function-level hyperparameters to vary smoothly. In ablation experiments we motivate each component of the proposed kernel by demonstrating its ability to model the spatial covariance, temporal structure and joint spatio-temporal reconstruction. We benchmark our model against a stationary Gaussian process and a deep Gaussian process.
    Lifelong Bandit Optimization: No Prior and No Regret. (arXiv:2210.15513v1 [stat.ML])
    In practical applications, machine learning algorithms are often repeatedly applied to problems with similar structure over and over again. We focus on solving a sequence of bandit optimization tasks and develop LiBO, an algorithm which adapts to the environment by learning from past experience and becoming more sample-efficient in the process. We assume a kernelized structure where the kernel is unknown but shared across all tasks. LiBO sequentially meta-learns a kernel that approximates the true kernel and simultaneously solves the incoming tasks with the latest kernel estimate. Our algorithm can be paired with any kernelized bandit algorithm and guarantees oracle optimal performance, meaning that as more tasks are solved, the regret of LiBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LiBO yields a sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. The lifelong problem can thus be solved in a federated manner, while keeping the data of each task private.
    An Analysis of Robustness of Non-Lipschitz Networks. (arXiv:2010.06154v3 [cs.LG] UPDATED)
    Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network's final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space but only in random low-dimensional subspaces. We prove such adversaries can be quite powerful: defeating any algorithm that must classify any input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize over accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
    Minimax Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v3 [stat.ML] UPDATED)
    We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm is exponentially small to the number of rounds. However, limited characterizations have been discussed on the rate (exponent) of this value. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    Mining Multi-Label Samples from Single Positive Labels. (arXiv:2206.05764v3 [cs.LG] UPDATED)
    Conditional generative adversarial networks (cGANs) have shown superior results in class-conditional generation tasks. To simultaneously control multiple conditions, cGANs require multi-label training datasets, where multiple labels can be assigned to each data instance. Nevertheless, the tremendous annotation cost limits the accessibility of multi-label datasets in real-world scenarios. Therefore, in this study we explore the practical setting called the single positive setting, where each data instance is annotated by only one positive label with no explicit negative labels. To generate multi-label data in the single positive setting, we propose a novel sampling approach called single-to-multi-label (S2M) sampling, based on the Markov chain Monte Carlo method. As a widely applicable "add-on" method, our proposed S2M sampling method enables existing unconditional and conditional GANs to draw high-quality multi-label data with a minimal annotation cost. Extensive experiments on real image datasets verify the effectiveness and correctness of our method, even when compared to a model trained with fully annotated datasets.
    Adaptive Estimation of $\text{MTP}_2$ Graphical Models. (arXiv:2210.15471v1 [stat.ML])
    We consider the problem of estimating (diagonally dominant) M-matrices as precision matrices in Gaussian graphical models. Such models have received increasing attention in recent years, and have shown interesting properties, e.g., the maximum likelihood estimator exists with as little as two observations regardless of the underlying dimension. In this paper, we propose an adaptive estimation method, which consists of multiple stages: In the first stage, we solve an $\ell_1$-regularized maximum likelihood estimation problem, which leads to an initial estimate; in the subsequent stages, we iteratively refine the initial estimate by solving a sequence of weighted $\ell_1$-regularized problems. We further establish the theoretical guarantees on the estimation error, which consists of optimization error and statistical error. The optimization error decays to zero at a linear rate, indicating that the estimate is refined iteratively in subsequent stages, and the statistical error characterizes the statistical rate. The proposed method outperforms state-of-the-art methods in estimating precision matrices and identifying graph edges, as evidenced by synthetic and financial time-series data sets.
    What makes you unique?. (arXiv:2105.08013v2 [stat.AP] UPDATED)
    This paper proposes a uniqueness Shapley measure to compare the extent to which different variables are able to identify a subject. Revealing the value of a variable on subject $t$ shrinks the set of possible subjects that $t$ could be. The extent of the shrinkage depends on which other variables have also been revealed. We use Shapley value to combine all of the reductions in log cardinality due to revealing a variable after some subset of the other variables has been revealed. This uniqueness Shapley measure can be aggregated over subjects where it becomes a weighted sum of conditional entropies. Aggregation over subsets of subjects can address questions like how identifying is age for people of a given zip code. Such aggregates have a corresponding expression in terms of cross entropies. We use uniqueness Shapley to investigate the differential effects of revealing variables from the North Carolina voter registration rolls and in identifying anomalous solar flares. An enormous speedup (approaching 2000 fold in one example) is obtained by using the all dimension trees of Moore and Lee (1998) to store the cardinalities we need.
    Large-scale Optimization of Partial AUC in a Range of False Positive Rates. (arXiv:2203.01505v2 [cs.LG] UPDATED)
    The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include the FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of the FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs had been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive functions (e.g., deep neural networks), which allowed us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we used an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We established a complexity of $\tilde O(1/\epsilon^6)$ for finding a nearly $\epsilon$-critical solution. Finally, we numerically demonstrated the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v3 [cs.LG] UPDATED)
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v2 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
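    A hedged sketch of the Wald-type statistic WALDO builds on, with the posterior mean and covariance estimated from samples (the Neyman inversion that turns this into a confidence region, via regression of critical values, is omitted):

        import numpy as np

        def waldo_stat(posterior_samples, theta0):
            """Wald form with posterior mean/covariance plugged in."""
            mean = posterior_samples.mean(axis=0)
            cov = np.atleast_2d(np.cov(posterior_samples, rowvar=False))
            diff = mean - np.asarray(theta0, dtype=float)
            return float(diff @ np.linalg.solve(cov, diff))

        samples = np.random.default_rng(1).normal(loc=[0.3, -0.1], scale=0.2,
                                                  size=(5000, 2))
        print(waldo_stat(samples, [0.0, 0.0]))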
    A Double Machine Learning Trend Model for Citizen Science Data. (arXiv:2210.15524v1 [q-bio.QM])
    1. Citizen and community-science (CS) datasets have great potential for estimating interannual patterns of population change given the large volumes of data collected globally every year. Yet, the flexible protocols that enable many CS projects to collect large volumes of data typically lack the structure necessary to keep consistent sampling across years. This leads to interannual confounding, as changes to the observation process over time are confounded with changes in species population sizes. 2. Here we describe a novel modeling approach designed to estimate species population trends while controlling for the interannual confounding common in citizen science data. The approach is based on Double Machine Learning, a statistical framework that uses machine learning methods to estimate population change and the propensity scores used to adjust for confounding discovered in the data. Additionally, we develop a simulation method to identify and adjust for residual confounding missed by the propensity scores. Using this new method, we can produce spatially detailed trend estimates from citizen science data. 3. To illustrate the approach, we estimated species trends using data from the CS project eBird. We used a simulation study to assess the ability of the method to estimate spatially varying trends in the face of real-world confounding. Results showed that the trend estimates distinguished between spatially constant and spatially varying trends at a 27km resolution. There were low error rates on the estimated direction of population change (increasing/decreasing) and high correlations on the estimated magnitude. 4. The ability to estimate spatially explicit trends while accounting for confounding in citizen science data has the potential to fill important information gaps, helping to estimate population trends for species, regions, or seasons without rigorous monitoring data.
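    A hedged sketch of the residual-on-residual Double Machine Learning recipe on synthetic data (our simplification: a single confounded "year" treatment, random-forest nuisance fits with cross-fitting, and none of the paper's residual-confounding adjustments):

        import numpy as np
        from sklearn.ensemble import RandomForestRegressor
        from sklearn.model_selection import cross_val_predict

        rng = np.random.default_rng(0)
        n = 2000
        X = rng.normal(size=(n, 5))             # observation-process covariates (effort etc.)
        year = X[:, 0] + rng.normal(size=n)     # confounded: effort drifts across years
        y = 0.5 * year + X[:, 1] ** 2 + rng.normal(size=n)  # true trend is 0.5

        # cross-fitted nuisance predictions (the "double" in Double Machine Learning)
        ry = y - cross_val_predict(RandomForestRegressor(n_estimators=100), X, y, cv=5)
        rt = year - cross_val_predict(RandomForestRegressor(n_estimators=100), X, year, cv=5)
        print("estimated trend:", (rt @ ry) / (rt @ rt))   # should be near 0.5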
    Efficient and Stable Algorithms to Extend Greville's Method to Partitioned Matrices Based on Inverse Cholesky Factorization. (arXiv:2005.07045v2 [math.NA] UPDATED)
    Greville's method has been utilized in the Broad Learning System (BLS) to propose an effective and efficient incremental learning system without retraining the whole network from the beginning. For a column-partitioned matrix where the second part consists of p columns, Greville's method requires p iterations to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part. The incremental algorithms in BLS extend Greville's method to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part in just one iteration, but they neglect some possible cases and need further improvements in efficiency and numerical stability. In this paper, we propose an efficient and numerically stable algorithm based on Greville's method, to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part in just one iteration, where all possible cases are considered, and the recently proposed inverse Cholesky factorization can be applied to further reduce the computational complexity. Finally, we give the whole algorithm for column-partitioned matrices in BLS. On the other hand, we also give the proposed algorithm for row-partitioned matrices in BLS.
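    For reference, the classical single-column Greville update that the paper generalizes to whole partitions fits in a few lines (the standard textbook formula, not the paper's block algorithm):

        import numpy as np

        def greville_append(B, B_pinv, c, tol=1e-12):
            """Return pinv([B, c]) given pinv(B), via Greville's formula."""
            d = B_pinv @ c
            e = c - B @ d
            if np.linalg.norm(e) > tol:
                b = e / (e @ e)                       # c is outside range(B)
            else:
                b = (B_pinv.T @ d) / (1.0 + d @ d)    # c lies in range(B)
            return np.vstack([B_pinv - np.outer(d, b), b])

        rng = np.random.default_rng(0)
        B = rng.normal(size=(6, 3))
        c = rng.normal(size=6)
        A = np.column_stack([B, c])
        print(np.allclose(greville_append(B, np.linalg.pinv(B), c), np.linalg.pinv(A)))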
    Fully Bayesian Analysis of the Relevance Vector Machine Classification for Imbalanced Data. (arXiv:2007.13140v2 [stat.ML] UPDATED)
    Relevance Vector Machine (RVM) is a supervised learning algorithm extended from Support Vector Machine (SVM) based on the Bayesian sparsity model. Compared with the regression problem, RVM classification is difficult to conduct because there is no closed-form solution for the weight parameter posterior. The original RVM classification algorithm used Newton's method to find the mode of the weight parameter posterior, then approximated it by a Gaussian distribution via Laplace's method. This works, but it merely applies frequentist methods within a Bayesian framework. This paper proposes a Generic Bayesian approach for RVM classification. We conjecture that our algorithm achieves convergent estimates of the quantities of interest, in contrast to the nonconvergent estimates of the original RVM classification algorithm. Furthermore, a Fully Bayesian approach with a hierarchical hyperprior structure for RVM classification is proposed, which improves the classification performance, especially on imbalanced data. In numerical studies, our proposed algorithms obtain high classification accuracy rates, and the Fully Bayesian hierarchical hyperprior method outperforms the Generic one for imbalanced data classification.
    Beyond Black Box Densities: Parameter Learning for the Deviated Components. (arXiv:2202.02651v2 [stat.ML] UPDATED)
    As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density deviating from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviated mixture model} $(1-\lambda^{*})h_0 + \lambda^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|\theta_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $\lambda^{*}$ and latent mixing measure $G^{*} = \sum_{i = 1}^{k} p_{i}^{*} \delta_{\theta_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $\lambda^{*}$ and $G^{*}$ under the Wasserstein metric. Simulation studies are carried out to illustrate the theory.
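    Evaluating the deviated mixture density itself is straightforward; a small sketch with Gaussian components (illustrative choices of lambda, weights, and means, not the paper's estimator):

        import numpy as np
        from scipy.stats import norm

        def deviated_density(x, lam, weights, means, h0_mean=0.0):
            """(1 - lam) * h0 + lam * sum_i p_i f(x | theta_i), Gaussian components."""
            h0 = norm.pdf(x, loc=h0_mean, scale=1.0)          # known density
            mix = sum(p * norm.pdf(x, loc=m, scale=1.0) for p, m in zip(weights, means))
            return (1 - lam) * h0 + lam * mix

        x = np.linspace(-4, 6, 5)
        print(deviated_density(x, lam=0.3, weights=[0.4, 0.6], means=[2.0, 4.0]))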
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v3 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v2 [stat.ML] UPDATED)
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions; we extensively discuss the Gaussian case, for which the updates have a rather simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior involve neither gradients with respect to the model parameters nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    Learning Single-Index Models with Shallow Neural Networks. (arXiv:2210.15651v1 [cs.LG])
    Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semi-parametric methods.
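    A minimal sketch of the studied setup under our own illustrative choices of link function, scaling, and training loop: a shallow ReLU network whose biases stay frozen at their random initialization while the input and output weights are trained.

        import torch

        torch.manual_seed(0)
        d, width, n = 20, 512, 2048
        w_star = torch.randn(d); w_star /= w_star.norm()
        X = torch.randn(n, d)
        y = torch.relu(X @ w_star)              # unknown link of a 1-d projection

        W = torch.randn(width, d, requires_grad=True)   # trained input weights
        b = torch.randn(width)                           # biases frozen at init
        a = torch.randn(width, requires_grad=True)       # trained output weights

        opt = torch.optim.SGD([W, a], lr=0.05)
        for step in range(500):
            pred = torch.relu(X @ W.T / d ** 0.5 + b) @ a / width
            loss = ((pred - y) ** 2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        print(float(loss))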
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v6 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made recently to enable rapid improvements in certified robustness for complex ML models. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper, we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLNs), so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., the MLN). As the first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLNs is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLNs by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-art.
    Fast Rates for Noisy Interpolation Require Rethinking the Effects of Inductive Bias. (arXiv:2203.03597v2 [stat.ML] UPDATED)
    Good generalization performance on high-dimensional data crucially hinges on a simple structure of the ground truth and a corresponding strong inductive bias of the estimator. Even though this intuition is valid for regularized models, in this paper we caution against a strong inductive bias for interpolation in the presence of noise: While a stronger inductive bias encourages a simpler structure that is more aligned with the ground truth, it also increases the detrimental effect of noise. Specifically, for both linear regression and classification with a sparse ground truth, we prove that minimum $\ell_p$-norm and maximum $\ell_p$-margin interpolators achieve fast polynomial rates close to order $1/n$ for $p > 1$ compared to a logarithmic rate for $p = 1$. Finally, we provide preliminary experimental evidence that this trade-off may also play a crucial role in understanding non-linear interpolating models used in practice.
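    A hedged sketch of the objects under study: minimum l_p-norm interpolators of noisy data with a sparse ground truth, computed with a generic solver (our illustration of the setting, not the paper's code):

        import numpy as np
        from scipy.optimize import minimize

        rng = np.random.default_rng(0)
        n, d = 20, 60
        X = rng.normal(size=(n, d))
        w_true = np.zeros(d); w_true[:3] = 1.0
        y = X @ w_true + 0.3 * rng.normal(size=n)      # noisy labels are interpolated

        def min_lp_interpolator(p):
            w0 = np.linalg.pinv(X) @ y                 # feasible start (the p=2 solution)
            res = minimize(lambda w: np.sum(np.abs(w) ** p), w0,
                           constraints={"type": "eq", "fun": lambda w: X @ w - y},
                           method="SLSQP", options={"maxiter": 500, "ftol": 1e-10})
            return res.x

        for p in (1.1, 2.0):
            w = min_lp_interpolator(p)
            print(f"p={p}: ||w - w_true|| = {np.linalg.norm(w - w_true):.3f}")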
    A Graph Is More Than Its Nodes: Towards Structured Uncertainty-Aware Learning on Graphs. (arXiv:2210.15575v1 [cs.LG])
    Current graph neural networks (GNNs) that tackle node classification on graphs tend to only focus on nodewise scores and are solely evaluated by nodewise metrics. This limits uncertainty estimation on graphs since nodewise marginals do not fully characterize the joint distribution given the graph structure. In this work, we propose novel edgewise metrics, namely the edgewise expected calibration error (ECE) and the agree/disagree ECEs, which provide criteria for uncertainty estimation on graphs beyond the nodewise setting. Our experiments demonstrate that the proposed edgewise metrics can complement the nodewise results and yield additional insights. Moreover, we show that GNN models which consider the structured prediction problem on graphs tend to have better uncertainty estimations, which illustrates the benefit of going beyond the nodewise setting.
    Learning versus Refutation in Noninteractive Local Differential Privacy. (arXiv:2210.15439v1 [stat.ML])
    We study two basic statistical tasks in non-interactive local differential privacy (LDP): learning and refutation. Learning requires finding a concept that best fits an unknown target function (from labelled samples drawn from a distribution), whereas refutation requires distinguishing between data distributions that are well-correlated with some concept in the class, versus distributions where the labels are random. Our main result is a complete characterization of the sample complexity of agnostic PAC learning for non-interactive LDP protocols. We show that the optimal sample complexity for any concept class is captured by the approximate $\gamma_2$~norm of a natural matrix associated with the class. Combined with previous work [Edmonds, Nikolov and Ullman, 2019] this gives an equivalence between learning and refutation in the agnostic setting.
    Strategic Geosteeering Workflow with Uncertainty Quantification and Deep Learning: A Case Study on the Goliat Field. (arXiv:2210.15548v1 [physics.geo-ph])
    The real-time interpretation of the logging-while-drilling data allows us to estimate the positions and properties of the geological layers in an anisotropic subsurface environment. Robust real-time estimations capturing uncertainty can be very useful for efficient geosteering operations. However, the model errors in the prior conceptual geological models and forward simulation of the measurements can be significant factors in the unreliable estimations of the profiles of the geological layers. The model errors are specifically pronounced when using a deep-neural-network (DNN) approximation which we use to accelerate and parallelize the simulation of the measurements. This paper presents a practical workflow consisting of offline and online phases. The offline phase includes DNN training and building of an uncertain prior near-well geo-model. The online phase uses the flexible iterative ensemble smoother (FlexIES) to perform real-time assimilation of extra-deep electromagnetic data accounting for the model errors in the approximate DNN model. We demonstrate the proposed workflow on a case study for a historic well in the Goliat Field (Barents Sea). The median of our probabilistic estimation is on-par with proprietary inversion despite the approximate DNN model and regardless of the number of layers in the chosen prior. By estimating the model errors, FlexIES automatically quantifies the uncertainty in the layers' boundaries and resistivities, which is not standard for proprietary inversion.
    Stochastic Mirror Descent in Average Ensemble Models. (arXiv:2210.15323v1 [cs.LG])
    The stochastic mirror descent (SMD) algorithm is a general class of training algorithms, which includes the celebrated stochastic gradient descent (SGD), as a special case. It utilizes a mirror potential to influence the implicit bias of the training algorithm. In this paper we explore the performance of the SMD iterates on mean-field ensemble models. Our results generalize earlier ones obtained for SGD on such models. The evolution of the distribution of parameters is mapped to a continuous time process in the space of probability distributions. Our main result gives a nonlinear partial differential equation to which the continuous time process converges in the asymptotic regime of large networks. The impact of the mirror potential appears through a multiplicative term that is equal to the inverse of its Hessian and which can be interpreted as defining a gradient flow over an appropriately defined Riemannian manifold. We provide numerical simulations which allow us to study and characterize the effect of the mirror potential on the performance of networks trained with SMD for some binary classification problems.
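    A minimal sketch of the SMD update with a q-norm mirror potential psi(w) = ||w||_q^q / q on a toy noisy quadratic (illustrative potential and step size; q = 2 recovers SGD):

        import numpy as np

        def smd_step(w, grad, lr, q=3.0):
            z = np.sign(w) * np.abs(w) ** (q - 1) - lr * grad   # step in mirror space
            return np.sign(z) * np.abs(z) ** (1.0 / (q - 1))    # map back to primal space

        rng = np.random.default_rng(0)
        d = 50
        w_star = rng.normal(size=d)
        w = 0.1 * rng.normal(size=d)
        for _ in range(1000):
            g = 2 * (w - w_star) + 0.1 * rng.normal(size=d)     # noisy gradient
            w = smd_step(w, g, lr=0.05)
        print(np.linalg.norm(w - w_star))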
    Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions. (arXiv:2210.15543v1 [cs.LG])
    Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the groundtruth. We provide exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.
    Learning Discrete Directed Acyclic Graphs via Backpropagation. (arXiv:2210.15353v1 [cs.LG])
    Recently continuous relaxations have been proposed in order to learn Directed Acyclic Graphs (DAGs) from data by backpropagation, instead of using combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore that direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation. Based on the architecture of Implicit Maximum Likelihood Estimation [I-MLE, arXiv:2106.01798], DAG-DB adopts a probabilistic approach to the problem, sampling binary adjacency matrices from an implicit probability distribution. DAG-DB learns a parameter for the distribution from the loss incurred by each sample, performing competitively using either of two fully discrete backpropagation techniques, namely I-MLE and Straight-Through Estimation.  ( 2 min )
    Implications of sparsity and high triangle density for graph representation learning. (arXiv:2210.15277v1 [stat.ML])
    Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.  ( 2 min )
    Private Isotonic Regression. (arXiv:2210.15175v1 [cs.LG])
    In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X}) + \log |\mathcal{X}|) / n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.  ( 2 min )
    PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits. (arXiv:2210.15345v1 [stat.ML])
    In sparse linear bandits, a learning agent sequentially selects an action and receives reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This has applications in many real-world sequential decision making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art (Hao et al., 2020), especially w.r.t. the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.  ( 2 min )
    Multi-layered Discriminative Restricted Boltzmann Machine with Untrained Probabilistic Layer. (arXiv:2210.15434v1 [cs.LG])
    An extreme learning machine (ELM) is a three-layered feed-forward neural network with untrained parameters that are randomly determined before training. Inspired by the idea of ELM, a probabilistic untrained layer called a probabilistic-ELM (PELM) layer is proposed, and it is combined with a discriminative restricted Boltzmann machine (DRBM), which is a probabilistic three-layered neural network for solving classification problems. The proposed model is obtained by stacking DRBM on the PELM layer. The resultant model (i.e., the multi-layered DRBM (MDRBM)) forms a probabilistic four-layered neural network. In MDRBM, the parameters in the PELM layer can be determined using a Gaussian-Bernoulli restricted Boltzmann machine. Owing to the PELM layer, MDRBM obtains strong immunity against noise in the inputs, which is one of its most important advantages. Numerical experiments using some benchmark datasets, MNIST, Fashion-MNIST, Urban Land Cover, and CIFAR-10, demonstrate that MDRBM is superior to other existing models, particularly in terms of the noise-robustness property (or, in other words, the generalization property).  ( 2 min )
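    For context, the ELM principle that the PELM layer builds on fits in a few lines: hidden parameters are random and fixed, and only the output layer is trained, here in closed form (a generic ELM sketch, not the proposed MDRBM):

        import numpy as np

        rng = np.random.default_rng(0)
        n, d, hidden = 500, 10, 200
        X = rng.normal(size=(n, d))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

        W = rng.normal(size=(d, hidden))        # untrained, randomly fixed weights
        b = rng.normal(size=hidden)             # untrained, randomly fixed biases
        H = np.tanh(X @ W + b)                  # random feature map
        beta = np.linalg.pinv(H) @ y            # the only trained parameters
        print(np.mean((H @ beta - y) ** 2))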
    A super-polynomial quantum-classical separation for density modelling. (arXiv:2210.14936v1 [quant-ph])
    Density modelling is the task of learning an unknown probability density function from samples, and is one of the central problems of unsupervised machine learning. In this work, we show that there exists a density modelling problem for which fault-tolerant quantum computers can offer a super-polynomial advantage over classical learning algorithms, given standard cryptographic assumptions. Along the way, we provide a variety of additional results and insights, of potential interest for proving future distribution learning separations between quantum and classical learning algorithms. Specifically, we (a) provide an overview of the relationships between hardness results in supervised learning and distribution learning, and (b) show that any weak pseudo-random function can be used to construct a classically hard density modelling problem. The latter result opens up the possibility of proving quantum-classical separations for density modelling based on weaker assumptions than those necessary for pseudo-random functions.  ( 2 min )
    Bayesian Hyperbolic Multidimensional Scaling. (arXiv:2210.15081v1 [stat.ME])
    Multidimensional scaling (MDS) is a widely used approach to representing high-dimensional, dependent data. MDS works by assigning each observation a location on a low-dimensional geometric manifold, with distance on the manifold representing similarity. We propose a Bayesian approach to multidimensional scaling when the low-dimensional manifold is hyperbolic. Using hyperbolic space facilitates representing tree-like structure common in many settings (e.g. text or genetic data with hierarchical structure). A Bayesian approach provides regularization that minimizes the impact of uncertainty or measurement error in the observed data. We also propose a case-control likelihood approximation that allows for efficient sampling from the posterior in larger data settings, reducing computational complexity from approximately $O(n^2)$ to $O(n)$. We evaluate the proposed method against state-of-the-art alternatives using simulations, canonical reference datasets, and human gene expression data.  ( 2 min )
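    For reference, the hyperbolic distance in the Poincare ball, one standard model of hyperbolic space (only the distance is shown; the paper's Bayesian machinery around such embeddings is not):

        import numpy as np

        def poincare_dist(u, v):
            """Hyperbolic distance between points inside the unit ball."""
            uu, vv = np.dot(u, u), np.dot(v, v)
            duv = np.dot(u - v, u - v)
            return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))

        u = np.array([0.1, 0.2])
        v = np.array([-0.5, 0.4])
        print(poincare_dist(u, v))   # grows rapidly as points approach the boundary
        print(poincare_dist(u, u))   # 0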
    Modeling Inter-Dependence Between Time and Mark in Multivariate Temporal Point Processes. (arXiv:2210.15294v1 [cs.LG])
    Temporal Point Processes (TPPs) are probabilistic generative frameworks. They model discrete event sequences localized in continuous time. Generally, real-life events reveal descriptive information, known as marks. Marked TPPs model the time and marks of events together for practical relevance. Conditioned on past events, marked TPPs aim to learn the joint distribution of the time and the mark of the next event. For simplicity, conditionally independent TPP models assume time and marks are independent given the event history; they factorize the conditional joint distribution of time and mark into the product of individual conditional distributions. This structural limitation in the design of TPP models hurts predictive performance when time and mark interactions are entangled. In this work, we model the conditional inter-dependence of time and mark to overcome the limitations of conditionally independent models. We construct a multivariate TPP conditioning the time distribution on the current event mark in addition to past events. Besides the conventional intensity-based models for the conditional joint distribution, we also draw on flexible intensity-free TPP models from the literature. The proposed TPP models outperform conditionally independent and dependent models in standard prediction tasks. Our experimentation on various datasets with multiple evaluation metrics highlights the merit of the proposed approach.  ( 2 min )
    Bayesian Inference of Transition Matrices from Incomplete Graph Data with a Topological Prior. (arXiv:2210.15410v1 [stat.ME])
    Many network analysis and graph learning techniques are based on models of random walks, which require inferring transition matrices that formalize the underlying stochastic process in an observed graph. For weighted graphs, it is common to estimate the entries of such transition matrices based on the relative weights of edges. However, we are often confronted with incomplete data, which turns the construction of the transition matrix based on a weighted graph into an inference problem. Moreover, we often have access to additional information that captures topological constraints of the system, i.e., which edges in a weighted graph are (theoretically) possible and which are not; examples include transportation networks, where we have access to passenger trajectories as well as the physical topology of connections, or a set of social interactions with the underlying social structure. Combining these two different sources of information to infer transition matrices is an open challenge, with implications for downstream network analysis tasks. Addressing this issue, we show that including knowledge of such topological constraints can improve the inference of transition matrices, especially for small datasets. We derive an analytically tractable Bayesian method that uses repeated interactions and a topological prior to infer transition matrices data-efficiently. We compare it against commonly used frequentist and Bayesian approaches both in synthetic and real-world datasets, and we find that it recovers the transition probabilities with higher accuracy and that it is robust even in cases when the knowledge of the topological constraint is partial. Lastly, we show that this higher accuracy improves the results for downstream network analysis tasks like cluster detection and node ranking, which highlights the practical relevance of our method for analyses of various networked systems.  ( 3 min )
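    A hedged sketch of the basic Bayesian ingredient: a Dirichlet prior supported only on topologically possible edges, updated with observed transition counts (illustrative toy matrices; the paper's method is more general):

        import numpy as np

        possible = np.array([[1, 1, 0],
                             [1, 0, 1],
                             [0, 1, 1]], dtype=float)   # adjacency of allowed edges
        counts = np.array([[5, 1, 0],
                           [0, 0, 2],
                           [0, 3, 0]], dtype=float)     # observed transitions
        alpha = 1.0                                     # prior concentration per allowed edge

        post = possible * alpha + counts                # Dirichlet posterior parameters
        T_hat = np.where(possible > 0, post, 0.0)       # impossible edges stay exactly zero
        T_hat = T_hat / T_hat.sum(axis=1, keepdims=True)  # posterior-mean transition matrix
        print(T_hat)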
    Sample-Specific Root Causal Inference with Latent Variables. (arXiv:2210.15340v1 [stat.ML])
    Root causal analysis seeks to identify the set of initial perturbations that induce an unwanted outcome. In prior work, we defined sample-specific root causes of disease using exogenous error terms that predict a diagnosis in a structural equation model. We rigorously quantified predictivity using Shapley values. However, the associated algorithms for inferring root causes assume no latent confounding. We relax this assumption by permitting confounding among the predictors. We then introduce a corresponding procedure called Extract Errors with Latents (EEL) for recovering the error terms up to contamination by vertices on certain paths under the linear non-Gaussian acyclic model. EEL also identifies the smallest sets of dependent errors for fast computation of the Shapley values. The algorithm bypasses the hard problem of estimating the underlying causal graph in both cases. Experiments highlight the superior accuracy and robustness of EEL relative to its predecessors.  ( 2 min )
    Deep Learning is Provably Robust to Symmetric Label Noise. (arXiv:2210.15083v1 [stat.ML])
    Deep neural networks (DNNs) are capable of perfectly fitting the training data, including memorizing noisy data. It is commonly believed that memorization hurts generalization. Therefore, many recent works propose mitigation strategies to avoid noisy data or correct memorization. In this work, we step back and ask the question: Can deep learning be robust against massive label noise without any mitigation? We provide an affirmative answer for the case of symmetric label noise: We find that certain DNNs, including under-parameterized and over-parameterized models, can tolerate massive symmetric label noise up to the information-theoretic threshold. By appealing to classical statistical theory and universal consistency of DNNs, we prove that for multiclass classification, $L_1$-consistent DNN classifiers trained under symmetric label noise can achieve Bayes optimality asymptotically if the label noise probability is less than $\frac{K-1}{K}$, where $K \ge 2$ is the number of classes. Our results show that for symmetric label noise, no mitigation is necessary for $L_1$-consistent estimators. We conjecture that for general label noise, mitigation strategies that make use of the noisy data will outperform those that ignore the noisy data.  ( 2 min )
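    A minimal sketch of one common definition of symmetric label noise, the regime the result covers: with probability eta, a label is replaced by a uniformly random other class, and the theorem requires eta < (K-1)/K.

        import numpy as np

        def symmetric_noise(labels, K, eta, rng):
            noisy = labels.copy()
            flip = rng.random(len(labels)) < eta
            # replace flipped labels by one of the other K-1 classes, uniformly
            offset = rng.integers(1, K, size=int(flip.sum()))
            noisy[flip] = (noisy[flip] + offset) % K
            return noisy

        rng = np.random.default_rng(0)
        y = rng.integers(0, 10, size=100_000)
        y_noisy = symmetric_noise(y, K=10, eta=0.85, rng=rng)  # below the 9/10 threshold
        print((y != y_noisy).mean())                            # close to eta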
    Mean-field neural networks: learning mappings on Wasserstein space. (arXiv:2210.15179v1 [math.OC])
    We study the machine learning task for models with operators mapping between the Wasserstein space of probability measures and a space of functions, as arises, e.g., in mean-field games/control problems. Two classes of neural networks, based on bin density and on cylindrical approximation, are proposed to learn these so-called mean-field functions, and are theoretically supported by universal approximation theorems. We perform several numerical experiments for training these two mean-field neural networks, and show their accuracy and efficiency in the generalization error with various test distributions. Finally, we present different algorithms relying on mean-field neural networks for solving time-dependent mean-field problems, and illustrate our results with numerical tests for the example of a semi-linear partial differential equation in the Wasserstein space of probability measures.  ( 2 min )
    Isometric 3D Adversarial Examples in the Physical World. (arXiv:2210.15291v1 [cs.CV])
    3D deep learning models are shown to be as vulnerable to adversarial examples as 2D models. However, existing attack methods are still far from stealthy and suffer from severe performance degradation in the physical world. Although 3D data is highly structured, it is difficult to bound the perturbations with simple metrics in the Euclidean space. In this paper, we propose a novel $\epsilon$-isometric ($\epsilon$-ISO) attack to generate natural and robust 3D adversarial examples in the physical world by considering the geometric properties of 3D objects and the invariance to physical transformations. For naturalness, we constrain the adversarial example to be $\epsilon$-isometric to the original one by adopting the Gaussian curvature as a surrogate metric guaranteed by a theoretical analysis. For invariance to physical transformations, we propose a maxima over transformation (MaxOT) method that actively searches for the most harmful transformations rather than random ones to make the generated adversarial example more robust in the physical world. Experiments on typical point cloud recognition models validate that our approach can significantly improve the attack success rate and the naturalness of the generated 3D adversarial examples compared with state-of-the-art attack methods.  ( 2 min )
    Toroidal Probabilistic Spherical Discriminant Analysis. (arXiv:2210.15441v1 [cs.SD])
    In speaker recognition, where speech segments are mapped to embeddings on the unit hypersphere, two scoring back-ends are commonly used, namely cosine scoring and PLDA. We have recently proposed PSDA, an analog to PLDA that uses Von Mises-Fisher distributions instead of Gaussians. In this paper, we present toroidal PSDA (T-PSDA). It extends PSDA with the ability to model within- and between-speaker variabilities in toroidal submanifolds of the hypersphere. Like PLDA and PSDA, the model allows closed-form scoring and closed-form EM updates for training. On VoxCeleb, we find T-PSDA accuracy on par with cosine scoring, while PLDA accuracy is inferior. On NIST SRE'21 we find that T-PSDA gives large accuracy gains compared to both cosine scoring and PLDA.
    Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning. (arXiv:2209.11745v2 [cs.LG] UPDATED)
    Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) was recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. We also generalize the DEC to give sample-efficient algorithms for all-policy model estimation, with applications for learning equilibria in Markov Games. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.
    Optimal Sub-sampling to Boost Power of Kernel Sequential Change-point Detection. (arXiv:2210.15060v1 [stat.ME])
    We present a novel scheme to boost detection power for kernel maximum mean discrepancy based sequential change-point detection procedures. Our proposed scheme features an optimal sub-sampling of the history data before the detection procedure, in order to tackle the power loss incurred by the random sub-sample from the enormous history data. We apply our proposed scheme to both Scan $B$ and Kernel Cumulative Sum (CUSUM) procedures, and improved performance is observed from extensive numerical experiments.  ( 2 min )
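    For reference, the unbiased MMD^2 statistic with an RBF kernel, the core quantity such sequential procedures threshold (the optimal sub-sampling scheme and the Scan B / kernel CUSUM machinery are not shown):

        import numpy as np

        def mmd2_unbiased(X, Y, bw):
            """Unbiased estimate of the squared maximum mean discrepancy."""
            def k(A, B):
                sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
                return np.exp(-sq / (2 * bw ** 2))
            Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
            n, m = len(X), len(Y)
            np.fill_diagonal(Kxx, 0.0); np.fill_diagonal(Kyy, 0.0)
            return (Kxx.sum() / (n * (n - 1)) + Kyy.sum() / (m * (m - 1))
                    - 2 * Kxy.mean())

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 3))              # history block
        Y = rng.normal(loc=0.5, size=(200, 3))     # current block after a change
        print(mmd2_unbiased(X, Y, bw=1.0))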
    High-dimensional Measurement Error Models for Lipschitz Loss. (arXiv:2210.15008v1 [stat.ME])
    Recently emerging large-scale biomedical data pose exciting opportunities for scientific discoveries. However, the ultrahigh dimensionality and non-negligible measurement errors in the data may create difficulties in estimation. Existing methods for high-dimensional covariates with measurement error are limited; they usually require knowledge of the noise distribution and focus on linear or generalized linear models. In this work, we develop high-dimensional measurement error models for a class of Lipschitz loss functions that encompasses logistic regression, hinge loss and quantile regression, among others. Our estimator is designed to minimize the $L_1$ norm among all estimators belonging to suitable feasible sets, without requiring any knowledge of the noise distribution. Subsequently, we generalize these estimators to a Lasso analog version that is computationally scalable to higher dimensions. We derive theoretical guarantees in terms of finite sample statistical error bounds and sign consistency, even when the dimensionality increases exponentially with the sample size. Extensive simulation studies demonstrate superior performance compared to existing methods in classification and quantile regression problems. An application to a gender classification task based on brain functional connectivity in the Human Connectome Project data illustrates improved accuracy under our approach, and the ability to reliably identify significant brain connections that drive gender differences.  ( 2 min )
    Revisiting the ACVI Method for Constrained Variational Inequalities. (arXiv:2210.15659v1 [stat.ML])
    ACVI is a recently proposed first-order method for solving variational inequalities (VIs) with general constraints. Yang et al. (2022) showed that the gap function of the last iterate decreases at a rate of $\mathcal{O}(\frac{1}{\sqrt{K}})$ when the operator is $L$-Lipschitz, monotone, and at least one constraint is active. In this work, we show that the same guarantee holds when only assuming that the operator is monotone. To our knowledge, this is the first analytically derived last-iterate convergence rate for general monotone VIs, and overall the only one that does not rely on the assumption that the operator is $L$-Lipschitz. Furthermore, when the sub-problems of ACVI are solved approximately, we show that by using a standard warm-start technique the convergence rate stays the same, provided that the errors decrease at appropriate rates. We further provide empirical analyses and insights on its implementation for the latter case.
    SAM-RL: Sensing-Aware Model-Based Reinforcement Learning via Differentiable Physics-Based Simulation and Rendering. (arXiv:2210.15185v1 [cs.RO])
    Model-based reinforcement learning (MBRL) is recognized as having the potential to be significantly more sample efficient than model-free RL. How an accurate model can be developed automatically and efficiently from raw sensory inputs (such as images), especially for complex environments and tasks, is a challenging problem that hinders the broad application of MBRL in the real world. In this work, we propose a sensing-aware model-based reinforcement learning system called SAM-RL. Leveraging the differentiable physics-based simulation and rendering, SAM-RL automatically updates the model by comparing rendered images with real raw images and produces the policy efficiently. With the sensing-aware learning pipeline, SAM-RL allows a robot to select an informative viewpoint to monitor the task process. We apply our framework to real-world experiments for accomplishing three manipulation tasks: robotic assembly, tool manipulation, and deformable object manipulation. We demonstrate the effectiveness of SAM-RL via extensive experiments. Supplemental materials and videos are available on our project webpage at https://sites.google.com/view/sam-rl.  ( 2 min )
    LIFT: Language-Interfaced Fine-Tuning for Non-Language Machine Learning Tasks. (arXiv:2206.06565v3 [cs.LG] UPDATED)
    Fine-tuning pretrained language models (LMs) without making any architectural changes has become a norm for learning various language downstream tasks. However, for non-language downstream tasks, a common practice is to employ task-specific designs for input, output layers, and loss functions. For instance, it is possible to fine-tune an LM into an MNIST classifier by replacing the word embedding layer with an image patch embedding layer, the word token output layer with a 10-way output layer, and the word prediction loss with a 10-way classification loss, respectively. A natural question arises: Can LM fine-tuning solve non-language downstream tasks without changing the model architecture or loss function? To answer this, we propose Language-Interfaced Fine-Tuning (LIFT) and study its efficacy and limitations by conducting an extensive empirical study on a suite of non-language classification and regression tasks. LIFT does not make any changes to the model architecture or loss function, and it solely relies on the natural language interface, enabling "no-code machine learning with LMs." We find that LIFT performs comparably well across a wide range of low-dimensional classification and regression tasks, matching the performances of the best baselines in many cases, especially for the classification tasks. We also report experimental results on the fundamental properties of LIFT, including inductive bias, robustness, and sample complexity. We also analyze the effect of pretraining on LIFT and a few properties/techniques specific to LIFT, e.g., context-aware learning via appropriate prompting, calibrated predictions, data generation, and two-stage fine-tuning. Our code is available at https://github.com/UW-Madison-Lee-Lab/LanguageInterfacedFineTuning.
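    A hedged sketch of the language-interfacing step (the feature names and prompt template are illustrative, not from the paper): a tabular example is serialized into a prompt/completion pair, and the LM is then fine-tuned on such text pairs only.

        # Serialize one tabular example into text; the LM never sees raw vectors.
        def to_prompt(features, label):
            xs = ", ".join(f"x{i}={v:.2f}" for i, v in enumerate(features))
            return {"prompt": f"Given {xs}, what is y?", "completion": f" y={label}"}

        print(to_prompt([0.12, -1.3, 4.0], label=2))
        # {'prompt': 'Given x0=0.12, x1=-1.30, x2=4.00, what is y?', 'completion': ' y=2'}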
    GROWN+UP: A Graph Representation Of a Webpage Network Utilizing Pre-training. (arXiv:2208.02252v2 [cs.LG] UPDATED)
    Large pre-trained neural networks are ubiquitous and critical to the success of many downstream tasks in natural language processing and computer vision. However, within the field of web information retrieval, there is a stark contrast in the lack of similarly flexible and powerful pre-trained models that can properly parse webpages. Consequently, we believe that common machine learning tasks like content extraction and information mining from webpages have low-hanging gains that yet remain untapped. We aim to close the gap by introducing an agnostic deep graph neural network feature extractor that can ingest webpage structures, pre-train self-supervised on massive unlabeled data, and fine-tune to arbitrary tasks on webpages effectually. Finally, we show that our pre-trained model achieves state-of-the-art results using multiple datasets on two very different benchmarks: webpage boilerplate removal and genre classification, thus lending support to its potential application in diverse downstream tasks.
    Minimax Optimal Algorithms for Fixed-Budget Best Arm Identification. (arXiv:2206.04646v3 [stat.ML] UPDATED)
    We consider the fixed-budget best arm identification problem where the goal is to find the arm of the largest mean with a fixed number of samples. It is known that the probability of misidentifying the best arm decays exponentially in the number of rounds. However, the rate (exponent) of this decay has received only limited characterization. In this paper, we characterize the minimax optimal rate as a result of an optimization over all possible parameters. We introduce two rates, $R^{\mathrm{go}}$ and $R^{\mathrm{go}}_{\infty}$, corresponding to lower bounds on the probability of misidentification, each of which is associated with a proposed algorithm. The rate $R^{\mathrm{go}}$ is associated with $R^{\mathrm{go}}$-tracking, which can be efficiently implemented by a neural network and is shown to outperform existing algorithms. However, this rate requires a nontrivial condition to be achievable. To address this issue, we introduce the second rate $R^{\mathrm{go}}_\infty$. We show that this rate is indeed achievable by introducing a conceptual algorithm called delayed optimal tracking (DOT).
    From Shapley back to Pearson: Hypothesis Testing via the Shapley Value. (arXiv:2207.07038v3 [cs.LG] UPDATED)
    The complex nature of artificial neural networks raises concerns on their reliability, trustworthiness, and fairness in real-world scenarios. The Shapley value -- a solution concept from game theory -- is one of the most popular explanation methods for machine learning models. More traditionally, from the perspective of statistical learning, feature importance is defined in terms of conditional independence. So far, these two approaches to interpretability and feature importance have been considered separate and distinct. In this work, we show that Shapley-based explanation methods and conditional independence testing are closely related. We introduce the $\textbf{SHAP}$ley $\textbf{L}$ocal $\textbf{I}$ndependence $\textbf{T}$est ($\textbf{SHAPLIT}$), a novel testing procedure inspired by the Conditional Randomization Test (CRT) for a specific notion of local (i.e., on a sample) conditional independence. With it, we prove that for binary classification problems, each marginal contribution in the Shapley value is an upper bound to the $p$-value of this conditional independence test. Furthermore, we show that the Shapley value itself provides an upper bound to the $p$-value of a global SHAPLIT null hypothesis. As a result, we endow the Shapley value with a precise statistical sense of importance with false positive rate control.
    CrAM: A Compression-Aware Minimizer. (arXiv:2207.14200v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trained via CrAM should be compressible post-training, in a single step, without significant accuracy loss. Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: specifically, we can prune models in one-shot to 70-80% sparsity with reasonable ($\leq 1\%$) accuracy loss, which is competitive with gradual compression methods. Additionally, we show that CrAM produces sparse models which perform well for transfer learning, and that it also works for semi-structured pruning patterns supported by GPU hardware.
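    A heavily simplified sketch of the compression-aware principle (our illustration, not the exact CrAM update rule): evaluate the loss gradient at a pruned copy of the weights and apply it to the dense weights, so the dense model stays accurate under one-shot pruning.

        import torch

        def top_k_mask(w, sparsity):
            """Binary mask keeping the largest-magnitude entries of w."""
            k = max(1, int(w.numel() * (1 - sparsity)))
            mask = torch.zeros_like(w)
            mask[w.abs().topk(k).indices] = 1.0
            return mask

        torch.manual_seed(0)
        w = torch.randn(100, requires_grad=True)       # dense trainable weights
        target = torch.randn(100)                      # toy regression target
        opt = torch.optim.SGD([w], lr=0.1)
        for _ in range(200):
            mask = top_k_mask(w.detach(), sparsity=0.7)
            loss = (((w * mask) - target) ** 2).mean() # loss at the compressed point
            opt.zero_grad(); loss.backward(); opt.step()
        print(float(loss))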
    Post-Selection Confidence Bounds for Prediction Performance. (arXiv:2210.13206v2 [stat.ML] UPDATED)
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into a training, validation, and evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We propose, however, an algorithm that computes valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when sample size is small, our proposed approach yields better-performing prediction models than the default selection of only one model for evaluation does.
    Theoretical Error Performance Analysis for Variational Quantum Circuit Based Functional Regression. (arXiv:2206.04804v2 [quant-ph] UPDATED)
    The noisy intermediate-scale quantum (NISQ) devices enable the implementation of the variational quantum circuit (VQC) for quantum neural networks (QNN). Although the VQC-based QNN has succeeded in many machine learning tasks, the representation and generalization powers of VQC still require further investigation, particularly when the dimensionality of classical inputs is concerned. In this work, we first put forth an end-to-end quantum neural network, TTN-VQC, which consists of a quantum tensor network based on a tensor-train network (TTN) for dimensionality reduction and a VQC for functional regression. Then, we aim at the error performance analysis for the TTN-VQC in terms of representation and generalization powers. We also characterize the optimization properties of TTN-VQC by leveraging the Polyak-Lojasiewicz (PL) condition. Moreover, we conduct functional regression experiments on a handwritten digit classification dataset to justify our theoretical analysis.
    A high-resolution dynamical view on momentum methods for over-parameterized neural networks. (arXiv:2208.03941v3 [cs.LG] UPDATED)
    Due to the simplicity and efficiency of the first-order gradient method, it has been widely used in training neural networks. Although the optimization problem of the neural network is non-convex, recent research has proved that the first-order method is capable of attaining a global minimum for training over-parameterized neural networks, where the number of parameters is significantly larger than that of training instances. Momentum methods, including the heavy ball method (HB) and Nesterov's accelerated method (NAG), are the workhorse first-order gradient methods owing to their accelerated convergence. In practice, NAG often exhibits better performance than HB. However, current research fails to distinguish their convergence difference in training neural networks. Motivated by this, we provide convergence analysis of HB and NAG in training an over-parameterized two-layer neural network with ReLU activation, through the lens of high-resolution dynamical systems and neural tangent kernel (NTK) theory. Compared to existing works, our analysis not only establishes tighter upper bounds of the convergence rate for both HB and NAG, but also characterizes the effect of the gradient correction term, which leads to the acceleration of NAG over HB. Finally, we validate our theoretical result on three benchmark datasets.
    Attention Augmented ConvNeXt UNet For Rectal Tumour Segmentation. (arXiv:2210.00227v2 [eess.IV] UPDATED)
    It is a challenge to segment the location and size of rectal cancer tumours through deep learning. In this paper, in order to improve the ability to extract sufficient feature information in rectal tumour segmentation, the attention augmented ConvNeXt UNet (AACN-UNet) is proposed. The network includes two main improvements: 1) the encoder stage of UNet is changed to a ConvNeXt structure for the encoding operation, which can not only integrate multi-scale semantic information on a large scale, but also reduce information loss and extract more feature information from CT images; 2) the CBAM attention mechanism is added to improve the connection of each feature in channel and space, which is conducive to extracting the effective features of the target and improving the segmentation accuracy. Experiments with UNet and its variant networks show that AACN-UNet is 0.9%, 1.1%, and 1.4% higher than the current best results in P, F1, and MIoU, while remaining comparable to UNet in training time and parameter count. This shows that our proposed AACN-UNet achieves excellent results in CT image segmentation of rectal cancer.
    Provably Doubly Accelerated Federated Learning: The First Theoretically Successful Combination of Local Training and Compressed Communication. (arXiv:2210.13277v2 [cs.LG] UPDATED)
    In the modern paradigm of federated learning, a large number of users are involved in a global learning task, in a collaborative way. They alternate local computations and two-way communication with a distant orchestrating server. Communication, which can be slow and costly, is the main bottleneck in this setting. To reduce the communication load and therefore accelerate distributed gradient descent, two strategies are popular: 1) communicate less frequently; that is, perform several iterations of local computations between the communication rounds; and 2) communicate compressed information instead of full-dimensional vectors. In this paper, we propose the first algorithm for distributed optimization and federated learning, which harnesses these two strategies jointly and converges linearly to an exact solution, with a doubly accelerated rate: our algorithm benefits from the two acceleration mechanisms provided by local training and compression, namely a better dependency on the condition number of the functions and on the dimension of the model, respectively.
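    A hedged sketch of how the two strategies compose within one communication round; the top-k compressor, step sizes, and function names are illustrative placeholders rather than the paper's algorithm:

        import numpy as np

        def topk_compress(v, k):
            out = np.zeros_like(v)
            idx = np.argsort(np.abs(v))[-k:]   # keep the k largest entries
            out[idx] = v[idx]
            return out

        def client_round(w_global, grad_fn, local_steps=5, lr=0.1, k=10):
            w = w_global.copy()
            for _ in range(local_steps):       # strategy 1: local training
                w -= lr * grad_fn(w)
            return topk_compress(w - w_global, k)  # strategy 2: compression

        # The server then averages the compressed updates from all clients:
        # w_global += np.mean([client_round(w_global, g) for g in client_grads], axis=0)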
    Is a Question Decomposition Unit All We Need?. (arXiv:2205.12538v2 [cs.CL] UPDATED)
    Large Language Models (LMs) have achieved state-of-the-art performance on many Natural Language Processing (NLP) benchmarks. With the growing number of new benchmarks, we build bigger and more complex LMs. However, building new LMs may not be an ideal option owing to the cost, time and environmental impact associated with it. We explore an alternative route: can we modify data by expressing it in terms of the model's strengths, so that a question becomes easier for models to answer? We investigate if humans can decompose a hard question into a set of simpler questions that are relatively easier for models to solve. We analyze a range of datasets involving various forms of reasoning and find that it is indeed possible to significantly improve model performance (24% for GPT-3 and 29% for RoBERTa-SQuAD along with a symbolic calculator) via decomposition. Our approach provides a viable option to involve people in NLP research in a meaningful way. Our findings indicate that Human-in-the-loop Question Decomposition (HQD) can potentially provide an alternate path to building large LMs. Code and data are available at https://github.com/Pruthvi98/QuestionDecomposition
    Face-to-Face Contrastive Learning for Social Intelligence Question-Answering. (arXiv:2208.01036v3 [cs.LG] UPDATED)
    Creating artificial social intelligence - algorithms that can understand the nuances of multi-person interactions - is an exciting and emerging challenge in processing facial expressions and gestures from multimodal videos. Recent multimodal methods have set the state of the art on many tasks, but have difficulty modeling the complex face-to-face conversational dynamics across speaking turns in social interaction, particularly in a self-supervised setup. In this paper, we propose Face-to-Face Contrastive Learning (F2F-CL), a graph neural network designed to model social interactions using factorization nodes to contextualize the multimodal face-to-face interaction along the boundaries of the speaking turn. With the F2F-CL model, we propose to perform contrastive learning between the factorization nodes of different speaking turns within the same video. We experimentally evaluate on the challenging Social-IQ dataset and show state-of-the-art results.
    Efficient Phi-Regret Minimization in Extensive-Form Games via Online Mirror Descent. (arXiv:2205.15294v3 [cs.LG] UPDATED)
    A conceptually appealing approach for learning Extensive-Form Games (EFGs) is to convert them to Normal-Form Games (NFGs). This approach enables us to directly translate state-of-the-art techniques and analyses in NFGs to learning EFGs, but typically suffers from computational intractability due to the exponential blow-up of the game size introduced by the conversion. In this paper, we address this problem in natural and important setups for the \emph{$\Phi$-Hedge} algorithm -- a generic algorithm capable of learning a large class of equilibria for NFGs. We show that $\Phi$-Hedge can be directly used to learn Nash Equilibria (zero-sum settings), Normal-Form Coarse Correlated Equilibria (NFCCE), and Extensive-Form Correlated Equilibria (EFCE) in EFGs. We prove that, in those settings, the \emph{$\Phi$-Hedge} algorithms are equivalent to standard Online Mirror Descent (OMD) algorithms for EFGs with suitable dilated regularizers, and run in polynomial time. This new connection further allows us to design and analyze a new class of OMD algorithms based on modifying its log-partition function. In particular, we design an improved algorithm with balancing techniques that achieves a sharp $\widetilde{\mathcal{O}}(\sqrt{XAT})$ EFCE-regret under bandit-feedback in an EFG with $X$ information sets, $A$ actions, and $T$ episodes. To the best of our knowledge, this is the first such rate and matches the information-theoretic lower bound.
    Neural Networks with Quantization Constraints. (arXiv:2210.15623v1 [cs.LG])
    Enabling low precision implementations of deep learning models, without considerable performance degradation, is necessary in resource and latency constrained settings. Moreover, exploiting the differences in sensitivity to quantization across layers can allow mixed precision implementations to achieve a considerably better computation performance trade-off. However, backpropagating through the quantization operation requires introducing gradient approximations, and choosing which layers to quantize is challenging for modern architectures due to the large search space. In this work, we present a constrained learning approach to quantization aware training. We formulate low precision supervised learning as a constrained optimization problem, and show that despite its non-convexity, the resulting problem is strongly dual and does away with gradient estimations. Furthermore, we show that dual variables indicate the sensitivity of the objective with respect to constraint perturbations. We demonstrate that the proposed approach exhibits competitive performance in image classification tasks, and leverage the sensitivity result to apply layer selective quantization based on the value of dual variables, leading to considerable performance improvements.
    CSI Clustering with Variational Autoencoding. (arXiv:2111.09758v3 [eess.SP] UPDATED)
    The model order of a wireless channel plays an important role for a variety of applications in communications engineering, e.g., it represents the number of resolvable incident wavefronts with non-negligible power incident from a transmitter to a receiver. Areas such as direction of arrival estimation leverage the model order to analyze the multipath components of channel state information. In this work, we propose to use a variational autoencoder to group unlabeled channel state information with respect to the model order in the variational autoencoder latent space in an unsupervised manner. We validate our approach with simulated 3GPP channel data. Our results suggest that, in order to learn an appropriate clustering, it is crucial to use a more flexible likelihood model for the variational autoencoder decoder than is usually the case in standard applications.
    Adaptively Exploiting d-Separators with Causal Bandits. (arXiv:2202.05100v3 [stat.ML] UPDATED)
    Multi-armed bandit problems provide a framework to identify the optimal intervention over a sequence of repeated experiments. Without additional assumptions, minimax optimal performance (measured by cumulative regret) is well-understood. With access to additional observed variables that d-separate the intervention from the outcome (i.e., they are a d-separator), recent "causal bandit" algorithms provably incur less regret. However, in practice it is desirable to be agnostic to whether observed variables are a d-separator. Ideally, an algorithm should be adaptive; that is, perform nearly as well as an algorithm with oracle knowledge of the presence or absence of a d-separator. In this work, we formalize and study this notion of adaptivity, and provide a novel algorithm that simultaneously achieves (a) optimal regret when a d-separator is observed, improving on classical minimax algorithms, and (b) significantly smaller regret than recent causal bandit algorithms when the observed variables are not a d-separator. Crucially, our algorithm does not require any oracle knowledge of whether a d-separator is observed. We also generalize this adaptivity to other conditions, such as the front-door criterion.
    Federated Multi-Task Learning for THz Wideband Channel and DoA Estimation. (arXiv:2207.06017v3 [eess.SP] UPDATED)
    This paper addresses two major challenges in terahertz (THz) channel estimation: the beam-split phenomenon, i.e., beam misalignment because of frequency-independent analog beamformers, and computational complexity because of the usage of an ultra-massive number of antennas to compensate propagation losses. Data-driven techniques are known to mitigate the complexity of this problem but usually require the transmission of the datasets from the users to a central server, entailing huge communication overhead. In this work, we introduce a federated multi-task learning (FMTL) approach, wherein the users transmit only the model parameters instead of the whole dataset, for THz channel and user direction-of-arrival (DoA) estimation, to improve communication efficiency. We first propose a novel beamspace support alignment technique for channel estimation with beam-split correction. Then, the channel and DoA information are used as labels to train an FMTL model. By exploiting the sparsity of the THz channel, the proposed approach is implemented with fewer pilot signals than traditional techniques. Compared to previous works, our FMTL approach provides higher channel estimation accuracy as well as approximately 25 (32) times lower model (channel) training overhead, respectively.
    Beyond Black Box Densities: Parameter Learning for the Deviated Components. (arXiv:2202.02651v2 [stat.ML] UPDATED)
    As we collect additional samples from a data population for which a known density function estimate may have been previously obtained by a black box method, the increased complexity of the data set may result in the true density deviating from the known estimate by a mixture distribution. To model this phenomenon, we consider the \emph{deviating mixture model} $(1-\lambda^{*})h_0 + \lambda^{*} (\sum_{i = 1}^{k} p_{i}^{*} f(x|\theta_{i}^{*}))$, where $h_0$ is a known density function, while the deviated proportion $\lambda^{*}$ and latent mixing measure $G_{*} = \sum_{i = 1}^{k} p_{i}^{*} \delta_{\theta_i^{*}}$ associated with the mixture distribution are unknown. Via a novel notion of distinguishability between the known density $h_{0}$ and the deviated mixture distribution, we establish rates of convergence for the maximum likelihood estimates of $\lambda^{*}$ and $G_{*}$ under the Wasserstein metric. Simulation studies are carried out to illustrate the theory.
    A Mini-Block Fisher Method for Deep Neural Networks. (arXiv:2202.04124v4 [cs.LG] UPDATED)
    Deep neural networks (DNNs) are currently predominantly trained using first-order methods. Some of these methods (e.g., Adam, AdaGrad, and RMSprop, and their variants) incorporate a small amount of curvature information by using a diagonal matrix to precondition the stochastic gradient. Recently, effective second-order methods, such as KFAC, K-BFGS, Shampoo, and TNT, have been developed for training DNNs, by preconditioning the stochastic gradient by layer-wise block-diagonal matrices. Here we propose a "mini-block Fisher (MBF)" preconditioned gradient method that lies between these two classes of methods. Specifically, our method uses a block-diagonal approximation to the empirical Fisher matrix, where for each layer in the DNN, whether it is convolutional or feed-forward and fully connected, the associated diagonal block is itself block-diagonal and is composed of a large number of mini-blocks of modest size. Our novel approach utilizes the parallelism of GPUs to efficiently perform computations on the large number of matrices in each layer. Consequently, MBF's per-iteration computational cost is only slightly higher than it is for first-order methods. The performance of our proposed method is compared to that of several baseline methods, on both autoencoder and CNN problems, to validate its effectiveness both in terms of time efficiency and generalization power. Finally, it is proved that an idealized version of MBF converges linearly.
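    As a rough illustration of the mini-block idea (our simplification, not the exact MBF update), each layer's Fisher block is itself approximated by many small blocks built from per-sample gradients, which are cheap to invert and trivially parallel:

        import numpy as np

        def mini_block_fisher_step(per_sample_grads, block=4, damping=1e-3):
            # per_sample_grads: (batch, dim) per-sample gradients of one layer
            g = per_sample_grads.mean(axis=0)
            out = np.empty_like(g)
            for s in range(0, g.size, block):
                Gs = per_sample_grads[:, s:s + block]      # (batch, block)
                F = Gs.T @ Gs / Gs.shape[0]                # mini-block empirical Fisher
                F += damping * np.eye(F.shape[0])
                out[s:s + block] = np.linalg.solve(F, g[s:s + block])
            return out  # preconditioned update direction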
    Hypergraphon Mean Field Games. (arXiv:2203.16223v3 [cs.GT] UPDATED)
    We propose an approach to modelling large-scale multi-agent dynamical systems allowing interactions among more than just pairs of agents using the theory of mean field games and the notion of hypergraphons, which are obtained as limits of large hypergraphs. To the best of our knowledge, ours is the first work on mean field games on hypergraphs. Together with an extension to a multi-layer setup, we obtain limiting descriptions for large systems of non-linear, weakly-interacting dynamical agents. On the theoretical side, we prove the well-foundedness of the resulting hypergraphon mean field game, showing both existence and approximate Nash properties. On the applied side, we extend numerical and learning algorithms to compute the hypergraphon mean field equilibria. To verify our approach empirically, we consider a social rumor spreading model, where we give agents intrinsic motivation to spread rumors to unaware agents, and an epidemics control problem.
    Quasi Black-Box Variational Inference with Natural Gradients for Bayesian Learning. (arXiv:2205.11568v2 [stat.ML] UPDATED)
    We develop an optimization algorithm suitable for Bayesian learning in complex models. Our approach relies on natural gradient updates within a general black-box framework for efficient training with limited model-specific derivations. It applies within the class of exponential-family variational posterior distributions, and we extensively discuss the Gaussian case, for which the updates have a particularly simple form. Our Quasi Black-box Variational Inference (QBVI) framework is readily applicable to a wide class of Bayesian inference problems and is simple to implement, as the updates of the variational posterior do not involve gradients with respect to the model parameters, nor the prescription of the Fisher information matrix. We develop QBVI under different hypotheses for the posterior covariance matrix, discuss details of its robust and feasible implementation, and provide a number of real-world applications to demonstrate its effectiveness.
    A Unified Analysis of Federated Learning with Arbitrary Client Participation. (arXiv:2205.13648v3 [cs.LG] UPDATED)
    Federated learning (FL) faces challenges of intermittent client availability and computation/communication efficiency. As a result, only a small subset of clients can participate in FL at a given time. It is important to understand how partial client participation affects convergence, but most existing works have either considered idealized participation patterns or obtained results with non-zero optimality error for generic patterns. In this paper, we provide a unified convergence analysis for FL with arbitrary client participation. We first introduce a generalized version of federated averaging (FedAvg) that amplifies parameter updates at an interval of multiple FL rounds. Then, we present a novel analysis that captures the effect of client participation in a single term. By analyzing this term, we obtain convergence upper bounds for a wide range of participation patterns, including both non-stochastic and stochastic cases, which match either the lower bound of stochastic gradient descent (SGD) or the state-of-the-art results in specific settings. We also discuss various insights, recommendations, and experimental results.
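    A sketch of the amplification idea under our own reading of the abstract: the movement accumulated over an interval of P rounds is scaled by a factor eta > 1; the names and defaults are hypothetical:

        import numpy as np

        def server_loop(w, fedavg_round, rounds=100, P=10, eta=1.5):
            anchor = w.copy()
            for t in range(1, rounds + 1):
                w = fedavg_round(w)             # one standard FedAvg round
                if t % P == 0:                  # amplify at the interval
                    w = anchor + eta * (w - anchor)
                    anchor = w.copy()
            return w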
    Improving Certified Robustness via Statistical Learning with Logical Reasoning. (arXiv:2003.00120v6 [cs.LG] UPDATED)
    Intensive algorithmic efforts have been made recently to enable rapid improvements in certified robustness for complex ML models. However, current robustness certification methods are only able to certify under a limited perturbation radius. Given that existing pure data-driven statistical approaches have reached a bottleneck, in this paper we propose to integrate statistical ML models with knowledge (expressed as logical rules) as a reasoning component using Markov logic networks (MLN), so as to further improve the overall certified robustness. This opens new research questions about certifying the robustness of such a paradigm, especially the reasoning component (e.g., MLN). As a first step towards understanding these questions, we first prove that the computational complexity of certifying the robustness of MLN is #P-hard. Guided by this hardness result, we then derive the first certified robustness bound for MLN by carefully analyzing different model regimes. Finally, we conduct extensive experiments on five datasets including both high-dimensional images and natural language texts, and we show that the certified robustness with knowledge-based logical reasoning indeed significantly outperforms that of the state-of-the-art.
    An Analysis of Robustness of Non-Lipschitz Networks. (arXiv:2010.06154v3 [cs.LG] UPDATED)
    Despite significant advances, deep networks remain highly susceptible to adversarial attack. One fundamental challenge is that small input perturbations can often produce large movements in the network's final-layer feature space. In this paper, we define an attack model that abstracts this challenge, to help understand its intrinsic properties. In our model, the adversary may move data an arbitrary distance in feature space but only in random low-dimensional subspaces. We prove such adversaries can be quite powerful: defeating any algorithm that must classify any input it is given. However, by allowing the algorithm to abstain on unusual inputs, we show such adversaries can be overcome when classes are reasonably well-separated in feature space. We further provide strong theoretical guarantees for setting algorithm parameters to optimize over accuracy-abstention trade-offs using data-driven methods. Our results provide new robustness guarantees for nearest-neighbor style algorithms, and also have application to contrastive learning, where we empirically demonstrate the ability of such algorithms to obtain high robust accuracy with low abstention rates. Our model is also motivated by strategic classification, where entities being classified aim to manipulate their observable features to produce a preferred classification, and we provide new insights into that area as well.
    Context-NER : Contextual Phrase Generation at Scale. (arXiv:2109.08079v2 [cs.IR] UPDATED)
    NLP research has focused on named entity recognition (NER) and how to efficiently extract entities from a sentence. However, generating the relevant context of entities in a sentence has remained under-explored. In this work we introduce the task Context-NER, in which the relevant context of an entity has to be generated; the extracted context may not appear verbatim as a substring of the sentence. We also introduce the EDGAR10-Q dataset for this task, a corpus covering 1,500 publicly traded companies. It is a manually created complex corpus and one of the largest in terms of the number of sentences and entities (1M and 2.8M, respectively). We introduce a baseline approach that leverages phrase generation algorithms and uses the pre-trained BERT model to achieve a 33% ROUGE-L score. We also perform a one-shot evaluation with GPT-3 and obtain a 39% score, signifying the hardness and future scope of this task. We hope that the addition of this dataset and our study will pave the way for further research in this domain.
    Classifier Data Quality: A Geometric Complexity Based Method for Automated Baseline And Insights Generation. (arXiv:2112.11832v2 [cs.LG] UPDATED)
    Testing Machine Learning (ML) models and AI-Infused Applications (AIIAs), or systems that contain ML models, is highly challenging. In addition to the challenges of testing classical software, it is acceptable and expected that statistical ML models sometimes output incorrect results. A major challenge is to determine when the level of incorrectness, e.g., model accuracy or F1 score for classifiers, is acceptable and when it is not. In addition to business requirements that should provide a threshold, it is a best practice to require any proposed ML solution to outperform simple baseline models, such as a decision tree. We have developed complexity measures, which quantify how difficult given observations are to assign to their true class label; these measures can then be used to automatically determine a baseline performance threshold. These measures are superior to the best-practice baseline in that, for a linear computation cost, they also quantify each observation's classification complexity in an explainable form, regardless of the classifier model used. Our experiments with both numeric synthetic data and real natural language chatbot data demonstrate that the complexity measures effectively highlight data regions and observations that are likely to be misclassified.
    Training with More Confidence: Mitigating Injected and Natural Backdoors During Training. (arXiv:2202.06382v3 [cs.LG] UPDATED)
    The backdoor or Trojan attack is a severe threat to deep neural networks (DNNs). Researchers have found that DNNs trained on benign data and settings can also learn backdoor behaviors, which are known as natural backdoors. Existing works on anti-backdoor learning are based on the weak observation that backdoor and benign behaviors can be differentiated during training. An adaptive attack with slow poisoning can bypass such defenses. Moreover, these methods cannot defend against natural backdoors. We identify fundamental differences between backdoor-related neurons and benign neurons: backdoor-related neurons form a hyperplane as the classification surface across input domains of all affected labels. By further analyzing the training process and model architectures, we found that piece-wise linear functions cause this hyperplane surface. In this paper, we design a novel training method that forces the training to avoid generating such hyperplanes and thus removes the injected backdoors. Our extensive experiments on five datasets against five state-of-the-art attacks and also benign training show that our method can outperform existing state-of-the-art defenses. On average, the ASR (attack success rate) of models trained with NONE is 54.83 times lower than that of undefended models under standard poisoning backdoor attacks and 1.75 times lower under natural backdoor attacks. Our code is available at https://github.com/RU-System-Software-and-Security/NONE.
    Large-scale Optimization of Partial AUC in a Range of False Positive Rates. (arXiv:2203.01505v2 [cs.LG] UPDATED)
    The area under the ROC curve (AUC) is one of the most widely used performance measures for classification models in machine learning. However, it summarizes the true positive rates (TPRs) over all false positive rates (FPRs) in the ROC space, which may include FPRs with no practical relevance in some applications. The partial AUC, as a generalization of the AUC, summarizes only the TPRs over a specific range of FPRs and is thus a more suitable performance measure in many real-world situations. Although partial AUC optimization in a range of FPRs has been studied, existing algorithms are not scalable to big data and not applicable to deep learning. To address this challenge, we cast the problem into a non-smooth difference-of-convex (DC) program for any smooth predictive function (e.g., deep neural networks), which allows us to develop an efficient approximated gradient descent method based on the Moreau envelope smoothing technique, inspired by recent advances in non-smooth DC optimization. To increase the efficiency of large data processing, we use an efficient stochastic block coordinate update in our algorithm. Our proposed algorithm can also be used to minimize the sum of ranked range loss, which also lacks efficient solvers. We establish a complexity of $\tilde O(1/\epsilon^6)$ for finding a nearly $\epsilon$-critical solution. Finally, we numerically demonstrate the effectiveness of our proposed algorithms for both partial AUC maximization and sum of ranked range loss minimization.
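    For reference, the quantity being maximized can be evaluated as follows; this is a standard evaluation-side computation using scikit-learn's ROC utilities, not the paper's training algorithm:

        import numpy as np
        from sklearn.metrics import auc, roc_curve

        def partial_auc(y_true, y_score, a=0.05, b=0.3):
            # Normalized partial AUC over the FPR range [a, b].
            fpr, tpr, _ = roc_curve(y_true, y_score)
            grid = np.linspace(a, b, 200)
            tpr_interp = np.interp(grid, fpr, tpr)
            return auc(grid, tpr_interp) / (b - a)

        y = np.array([0, 0, 1, 1, 0, 1])
        s = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])
        print(partial_auc(y, s))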
    Deep Convolutional Neural Networks for Multi-Target Tracking: A Transfer Learning Approach. (arXiv:2210.15539v1 [eess.SP])
    Multi-target tracking (MTT) is a traditional signal processing task, where the goal is to estimate the states of an unknown number of moving targets from noisy sensor measurements. In this paper, we revisit MTT from a deep learning perspective and propose convolutional neural network (CNN) architectures to tackle it. We represent the target states and sensor measurements as images, thereby recasting the problem as an image-to-image prediction task for which we train a fully convolutional model. This architecture is motivated by a novel theoretical bound on the transferability error of CNNs. The proposed CNN architecture outperforms a GM-PHD filter on the MTT task with 10 targets. The CNN performance transfers without re-training to a larger MTT task with 250 targets with only a $13\%$ increase in average OSPA.
    Demonstrating Analog Inference on the BrainScaleS-2 Mobile System. (arXiv:2103.15960v3 [cs.AR] UPDATED)
    We present the BrainScaleS-2 mobile system as a compact analog inference engine based on the BrainScaleS-2 ASIC and demonstrate its capabilities at classifying a medical electrocardiogram dataset. The analog network core of the ASIC is utilized to perform the multiply-accumulate operations of a convolutional deep neural network. At a system power consumption of 5.6 W, we measure a total energy consumption of 192 uJ for the ASIC and achieve a classification time of 276 us per electrocardiographic patient sample. Patients with atrial fibrillation are correctly identified with a detection rate of (93.7${\pm}$0.7)% at (14.0${\pm}$1.0)% false positives. The system is directly applicable to edge inference applications due to its small size, power envelope, and flexible I/O capabilities. It has enabled the BrainScaleS-2 ASIC to be operated reliably outside a specialized lab setting. In future applications, the system allows for a combination of conventional machine learning layers with online learning in spiking neural networks on a single neuromorphic platform.
    NAS-Bench-360: Benchmarking Neural Architecture Search on Diverse Tasks. (arXiv:2110.05668v5 [cs.CV] UPDATED)
    Most existing neural architecture search (NAS) benchmarks and algorithms prioritize well-studied tasks, e.g. image classification on CIFAR or ImageNet. This makes the performance of NAS approaches in more diverse areas poorly understood. In this paper, we present NAS-Bench-360, a benchmark suite to evaluate methods on domains beyond those traditionally studied in architecture search, and use it to address the following question: do state-of-the-art NAS methods perform well on diverse tasks? To construct the benchmark, we curate ten tasks spanning a diverse array of application domains, dataset sizes, problem dimensionalities, and learning objectives. Each task is carefully chosen to interoperate with modern CNN-based search methods while possibly being far afield from its original development domain. To speed up and reduce the cost of NAS research, for two of the tasks we release the precomputed performance of 15,625 architectures comprising a standard CNN search space. Experimentally, we show the need for more robust NAS evaluation of the kind NAS-Bench-360 enables by showing that several modern NAS procedures perform inconsistently across the ten tasks, with many catastrophically poor results. We also demonstrate how NAS-Bench-360 and its associated precomputed results will enable future scientific discoveries by testing whether several recent hypotheses promoted in the NAS literature hold on diverse tasks. NAS-Bench-360 is hosted at https://nb360.ml.cmu.edu.
    SpiroMask: Measuring Lung Function Using Consumer-Grade Masks. (arXiv:2201.09280v5 [cs.HC] UPDATED)
    According to the World Health Organisation (WHO), 235 million people suffer from respiratory illnesses and four million people die annually due to air pollution. Regular lung health monitoring can lead to prognoses about deteriorating lung health conditions. This paper presents our system SpiroMask that retrofits a microphone in consumer-grade masks (N95 and cloth masks) for continuous lung health monitoring. We evaluate our approach on 48 participants (including 14 with lung health issues) and find that we can estimate parameters such as lung volume and respiration rate within the approved error range by the American Thoracic Society (ATS). Further, we show that our approach is robust to sensor placement inside the mask.
    A biologically-inspired multi-modal evaluation of molecular generative machine learning. (arXiv:2208.09658v2 [cs.LG] UPDATED)
    While generative models have recently become ubiquitous in many scientific areas, less attention has been paid to their evaluation. For molecular generative models, the state of the art examines their output in isolation or in relation to its input; however, biological and functional properties, such as ligand-target interactions, are not addressed. In this study, a novel biologically-inspired benchmark for the evaluation of molecular generative models is proposed. Specifically, three diverse reference datasets are designed and a set of metrics is introduced which are directly relevant to the drug discovery process. In particular, we propose a recreation metric, and apply drug-target affinity prediction and molecular docking as complementary techniques for the evaluation of generative outputs. While all three metrics show consistent results across the tested generative models, a more detailed comparison of drug-target affinity binding and molecular docking scores revealed that unimodal predictors can lead to erroneous conclusions about target binding on a molecular level, and a multi-modal approach is thus preferable. The key advantage of this framework is that it incorporates prior physico-chemical domain knowledge into the benchmarking process by focusing explicitly on ligand-target interactions, thus creating a highly efficient tool not only for evaluating molecular generative outputs in particular, but also for enriching the drug discovery process in general.
    Learning Single-Index Models with Shallow Neural Networks. (arXiv:2210.15651v1 [cs.LG])
    Single-index models are a class of functions given by an unknown univariate ``link'' function applied to an unknown one-dimensional projection of the input. These models are particularly relevant in high dimension, when the data might present low-dimensional structure that learning algorithms should adapt to. While several statistical aspects of this model, such as the sample complexity of recovering the relevant (one-dimensional) subspace, are well-understood, they rely on tailored algorithms that exploit the specific structure of the target function. In this work, we introduce a natural class of shallow neural networks and study its ability to learn single-index models via gradient flow. More precisely, we consider shallow networks in which biases of the neurons are frozen at random initialization. We show that the corresponding optimization landscape is benign, which in turn leads to generalization guarantees that match the near-optimal sample complexity of dedicated semi-parametric methods.
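    A minimal PyTorch sketch of the studied setting, with our own choice of link function and sizes: a shallow ReLU network whose first-layer biases are frozen at random initialization, trained on single-index data:

        import torch

        d, width, n = 20, 256, 1000
        w_star = torch.randn(d) / d ** 0.5
        X = torch.randn(n, d)
        y = torch.relu(X @ w_star)              # example link function g

        net = torch.nn.Sequential(torch.nn.Linear(d, width),
                                  torch.nn.ReLU(),
                                  torch.nn.Linear(width, 1))
        net[0].bias.requires_grad_(False)       # freeze biases at random init

        opt = torch.optim.SGD([p for p in net.parameters() if p.requires_grad], lr=0.1)
        for _ in range(200):
            opt.zero_grad()
            loss = ((net(X).squeeze() - y) ** 2).mean()
            loss.backward()
            opt.step()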
    Efficient and Stable Algorithms to Extend Greville's Method to Partitioned Matrices Based on Inverse Cholesky Factorization. (arXiv:2005.07045v2 [math.NA] UPDATED)
    Greville's method has been utilized in the Broad Learning System (BLS) to propose an effective and efficient incremental learning system without retraining the whole network from the beginning. For a column-partitioned matrix where the second part consists of p columns, Greville's method requires p iterations to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part. The incremental algorithms in BLS extend Greville's method to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part in just 1 iteration, but they neglect some possible cases and need further improvements in efficiency and numerical stability. In this paper, we propose an efficient and numerically stable algorithm based on Greville's method to compute the pseudoinverse of the whole matrix from the pseudoinverse of the first part in just 1 iteration, where all possible cases are considered and the recently proposed inverse Cholesky factorization can be applied to further reduce the computational complexity. Finally, we give the whole algorithm for column-partitioned matrices in BLS. On the other hand, we also give the proposed algorithm for row-partitioned matrices in BLS.
    Query-Reward Tradeoffs in Multi-Armed Bandits. (arXiv:2110.05724v2 [cs.LG] UPDATED)
    We consider a stochastic multi-armed bandit setting where reward must be actively queried for it to be observed. We provide tight lower and upper problem-dependent guarantees on both the regret and the number of queries. Interestingly, we prove that there is a fundamental difference between problems with a unique and multiple optimal arms, unlike in the standard multi-armed bandit problem. We also present a new, simple, UCB-style sampling concept, and show that it naturally adapts to the number of optimal arms and achieves tight regret and querying bounds.
    Graph neural network initialisation of quantum approximate optimisation. (arXiv:2111.03016v2 [quant-ph] UPDATED)
    Approximate combinatorial optimisation has emerged as one of the most promising application areas for quantum computers, particularly those in the near term. In this work, we focus on the quantum approximate optimisation algorithm (QAOA) for solving the MaxCut problem. Specifically, we address two problems in the QAOA: how to initialise the algorithm, and how to subsequently train the parameters to find an optimal solution. For the former, we propose graph neural networks (GNNs) as a warm-starting technique for QAOA. We demonstrate that merging GNNs with QAOA can outperform both approaches individually. Furthermore, we demonstrate how graph neural networks enable warm-start generalisation across not only graph instances, but also to increasing graph sizes, a feature not straightforwardly available to other warm-starting methods. For training the QAOA, we test several optimisers for the MaxCut problem up to 16 qubits and benchmark against vanilla gradient descent. These include quantum aware/agnostic and machine learning based/neural optimisers. Examples of the latter include reinforcement and meta-learning. With the incorporation of these initialisation and optimisation toolkits, we demonstrate how the optimisation problems can be solved using QAOA in an end-to-end differentiable pipeline.
    Fast Rates for Noisy Interpolation Require Rethinking the Effects of Inductive Bias. (arXiv:2203.03597v2 [stat.ML] UPDATED)
    Good generalization performance on high-dimensional data crucially hinges on a simple structure of the ground truth and a corresponding strong inductive bias of the estimator. Even though this intuition is valid for regularized models, in this paper we caution against a strong inductive bias for interpolation in the presence of noise: While a stronger inductive bias encourages a simpler structure that is more aligned with the ground truth, it also increases the detrimental effect of noise. Specifically, for both linear regression and classification with a sparse ground truth, we prove that minimum $\ell_p$-norm and maximum $\ell_p$-margin interpolators achieve fast polynomial rates close to order $1/n$ for $p > 1$ compared to a logarithmic rate for $p = 1$. Finally, we provide preliminary experimental evidence that this trade-off may also play a crucial role in understanding non-linear interpolating models used in practice.
    Fine-grained Noise Control for Multispeaker Speech Synthesis. (arXiv:2204.05070v2 [cs.SD] UPDATED)
    A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations. Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors, i.e. linguistic content, prosody and timbre, from any residual factors, such as recording conditions and background noise. This paper proposes unsupervised, interpretable and fine-grained noise and prosody modeling. We incorporate adversarial training, representation bottleneck and utterance-to-frame modeling in order to learn frame-level noise representations. To the same end, we perform fine-grained prosody modeling via a Fully Hierarchical Variational AutoEncoder (FVAE), which additionally results in more expressive speech synthesis.
    What makes you unique?. (arXiv:2105.08013v2 [stat.AP] UPDATED)
    This paper proposes a uniqueness Shapley measure to compare the extent to which different variables are able to identify a subject. Revealing the value of a variable on subject $t$ shrinks the set of possible subjects that $t$ could be. The extent of the shrinkage depends on which other variables have also been revealed. We use Shapley value to combine all of the reductions in log cardinality due to revealing a variable after some subset of the other variables has been revealed. This uniqueness Shapley measure can be aggregated over subjects where it becomes a weighted sum of conditional entropies. Aggregation over subsets of subjects can address questions like how identifying is age for people of a given zip code. Such aggregates have a corresponding expression in terms of cross entropies. We use uniqueness Shapley to investigate the differential effects of revealing variables from the North Carolina voter registration rolls and in identifying anomalous solar flares. An enormous speedup (approaching 2000 fold in one example) is obtained by using the all dimension trees of Moore and Lee (1998) to store the cardinalities we need.
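    A toy sketch of the measure on a four-subject table (our own data): for each order of revealing variables we record the drop in log2 cardinality of the matching set, then average over all orders, which is exactly the Shapley combination described above:

        import math
        import numpy as np
        from itertools import permutations

        data = [("35", "27514", "F"), ("35", "27514", "M"),
                ("42", "27514", "F"), ("35", "27601", "F")]
        t = data[0]                      # the subject being identified

        def log_card(revealed):
            # log2 of how many subjects match t on the revealed variables
            n = sum(all(row[v] == t[v] for v in revealed) for row in data)
            return math.log2(n)

        shap = np.zeros(len(t))
        for order in permutations(range(len(t))):
            revealed = []
            for v in order:
                before = log_card(revealed)
                revealed.append(v)
                shap[v] += before - log_card(revealed)
        shap /= math.factorial(len(t))
        print(shap)                      # per-variable identifying power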
    Smoothed Online Optimization with Unreliable Predictions. (arXiv:2202.03519v2 [cs.LG] UPDATED)
    We examine the problem of smoothed online optimization, where a decision maker must sequentially choose points in a normed vector space to minimize the sum of per-round, non-convex hitting costs and the costs of switching decisions between rounds. The decision maker has access to a black-box oracle, such as a machine learning model, that provides untrusted and potentially inaccurate predictions of the optimal decision in each round. The goal of the decision maker is to exploit the predictions if they are accurate, while guaranteeing performance that is not much worse than the hindsight optimal sequence of decisions, even when predictions are inaccurate. We impose the standard assumption that hitting costs are globally $\alpha$-polyhedral. We propose a novel algorithm, Adaptive Online Switching (AOS), and prove that, for a large set of feasible $\delta > 0$, it is $(1+\delta)$-competitive if predictions are perfect, while also maintaining a uniformly bounded competitive ratio of $2^{\tilde{\mathcal{O}}(1/(\alpha \delta))}$ even when predictions are adversarial. Further, we prove that this trade-off is necessary and nearly optimal in the sense that \emph{any} deterministic algorithm which is $(1+\delta)$-competitive if predictions are perfect must be at least $2^{\tilde{\Omega}(1/(\alpha \delta))}$-competitive when predictions are inaccurate. In fact, we observe a unique threshold-type behavior in this trade-off: if $\delta$ is not in the set of feasible options, then \emph{no} algorithm is simultaneously $(1 + \delta)$-competitive if predictions are perfect and $\zeta$-competitive when predictions are inaccurate for any $\zeta < \infty$. Furthermore, we discuss that memory is crucial in AOS by proving that any algorithm that does not use memory cannot benefit from predictions. We complement our theoretical results by a numerical study on a microgrid application.
    First is Better Than Last for Language Data Influence. (arXiv:2202.11844v3 [cs.LG] UPDATED)
    The ability to identify influential training examples enables us to debug training data and explain model behavior. Existing techniques to do so are based on the flow of training data influence through the model parameters. For large models in NLP applications, it is often computationally infeasible to study this flow through all model parameters, so techniques usually pick the last layer of weights. However, we observe that since the activation connected to the last layer of weights contains "shared logic", the data influence calculated via the last-layer weights is prone to a "cancellation effect", where the data influences of different examples have large magnitudes that contradict each other. The cancellation effect lowers the discriminative power of the influence score, and deleting influential examples according to this measure often does not change the model's behavior by much. To mitigate this, we propose a technique called TracIn-WE that modifies a method called TracIn to operate on the word embedding layer instead of the last layer, where the cancellation effect is less severe. One potential concern is that influence based on the word embedding layer may not encode sufficient high-level information. However, we find that gradients (unlike embeddings) do not suffer from this, possibly because they chain through higher layers. We show that TracIn-WE significantly outperforms other data influence methods applied on the last layer on the case deletion evaluation across three language classification tasks and different models. In addition, TracIn-WE can produce scores not just at the level of the overall training input, but also at the level of words within the training input, a further aid in debugging.
    Assessing the communication gap between AI models and healthcare professionals: explainability, utility and trust in AI-driven clinical decision-making. (arXiv:2204.05030v3 [cs.AI] UPDATED)
    This paper contributes a pragmatic evaluation framework for explainable Machine Learning (ML) models for clinical decision support. The study revealed a more nuanced role for ML explanation models when these are pragmatically embedded in the clinical context. Despite the generally positive attitude of healthcare professionals (HCPs) towards explanations as a safety and trust mechanism, for a significant set of participants there were negative effects associated with confirmation bias, accentuating model over-reliance and increased effort to interact with the model. Also, contradicting one of their main intended functions, standard explanatory models showed limited ability to support a critical understanding of the limitations of the model. However, we found new significant positive effects which reposition the role of explanations within a clinical context: these include reduction of automation bias, addressing ambiguous clinical cases (cases where HCPs were not certain about their decision) and supporting less experienced HCPs in the acquisition of new domain knowledge.
    Active Learning for Regression by Inverse Distance Weighting. (arXiv:2204.07177v2 [cs.LG] UPDATED)
    This paper proposes an active learning (AL) algorithm to solve regression problems based on inverse-distance weighting functions for selecting the feature vectors to query. The algorithm has the following features: (i) it supports both pool-based and population-based sampling; (ii) it is not tailored to a particular class of predictors; (iii) it can handle known and unknown constraints on the queryable feature vectors; and (iv) it can run either sequentially or in batch mode, depending on how often the predictor is retrained. The potential of the method is shown in numerical tests on illustrative synthetic problems and real-world datasets from the UCI repository. A Python implementation of the algorithm, which we call IDEAL (Inverse-Distance based Exploration for Active Learning), is available at \url{this http URL}.
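    A simplified pool-based acquisition step inspired by the description above; the exploration score below is our own stand-in for IDEAL's inverse-distance weighting functions:

        import numpy as np

        def idw_exploration(candidates, queried):
            # Large when a candidate is far from every queried point.
            d2 = ((candidates[:, None, :] - queried[None, :, :]) ** 2).sum(-1)
            return 1.0 / (1.0 / np.maximum(d2, 1e-12)).sum(axis=1)

        pool = np.random.rand(500, 2)
        queried = pool[:5]
        next_idx = np.argmax(idw_exploration(pool, queried))
        # query the label of pool[next_idx], retrain the predictor, repeat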
    ReFactor GNNs: Revisiting Factorisation-based Models from a Message-Passing Perspective. (arXiv:2207.09980v3 [cs.LG] UPDATED)
    Factorisation-based Models (FMs), such as DistMult, have enjoyed enduring success for Knowledge Graph Completion (KGC) tasks, often outperforming Graph Neural Networks (GNNs). However, unlike GNNs, FMs struggle to incorporate node features and generalise to unseen nodes in inductive settings. Our work bridges the gap between FMs and GNNs by proposing ReFactor GNNs. This new architecture draws upon both modelling paradigms, which previously were largely thought of as disjoint. Concretely, using a message-passing formalism, we show how FMs can be cast as GNNs by reformulating the gradient descent procedure as message-passing operations, which forms the basis of our ReFactor GNNs. Across a multitude of well-established KGC benchmarks, our ReFactor GNNs achieve comparable transductive performance to FMs, and state-of-the-art inductive performance while using an order of magnitude fewer parameters.
    Cross-Domain Neural Entity Linking. (arXiv:2210.15616v1 [cs.CL])
    Entity Linking is the task of matching a mention to an entity in a given knowledge base (KB). It contributes to annotating the massive number of documents existing on the Web to harness new facts about their matched entities. However, existing Entity Linking systems focus on developing models that are typically domain-dependent and robust only to the particular knowledge base on which they have been trained. The performance is not as adequate when evaluated on documents and knowledge bases from different domains. Approaches based on pre-trained language models, such as Wu et al. (2020), attempt to solve the problem using a zero-shot setup, illustrating some potential when evaluated on a general-domain KB. Nevertheless, the performance is not equivalent when evaluated on a domain-specific KB. To allow for more accurate Entity Linking across different domains, we propose our framework: Cross-Domain Neural Entity Linking (CDNEL). Our objective is to have a single system that enables simultaneous linking to both a general-domain KB and a domain-specific KB. CDNEL works by learning a joint representation space for these knowledge bases from different domains. It is evaluated using the external Entity Linking dataset (Zeshel) constructed by Logeswaran et al. (2019) and the Reddit dataset collected by Botzer et al. (2021), to compare our proposed method with the state-of-the-art results. The proposed framework uses different types of datasets for fine-tuning, resulting in different model variants of CDNEL. When evaluated on four domains included in the Zeshel dataset, these variants achieve an average precision gain of 9%.
    Residual Multiplicative Filter Networks for Multiscale Reconstruction. (arXiv:2206.00746v2 [cs.CV] UPDATED)
    Coordinate networks like Multiplicative Filter Networks (MFNs) and BACON offer some control over the frequency spectrum used to represent continuous signals such as images or 3D volumes. Yet, they are not readily applicable to problems for which coarse-to-fine estimation is required, including various inverse problems in which coarse-to-fine optimization plays a key role in avoiding poor local minima. We introduce a new coordinate network architecture and training scheme that enables coarse-to-fine optimization with fine-grained control over the frequency support of learned reconstructions. This is achieved with two key innovations. First, we incorporate skip connections so that structure at one scale is preserved when fitting finer-scale structure. Second, we propose a novel initialization scheme to provide control over the model frequency spectrum at each stage of optimization. We demonstrate how these modifications enable multiscale optimization for coarse-to-fine fitting to natural images. We then evaluate our model on synthetically generated datasets for the problem of single-particle cryo-EM reconstruction. We learn high-resolution multiscale structures, on par with the state of the art.
    Integrating Symmetry into Differentiable Planning with Steerable Convolutions. (arXiv:2206.03674v2 [cs.LG] UPDATED)
    We study how group symmetry helps improve data efficiency and generalization for end-to-end differentiable planning algorithms when symmetry appears in decision-making tasks. Motivated by equivariant convolution networks, we treat the path planning problem as \textit{signals} over grids. We show that value iteration in this case is a linear equivariant operator, which is a (steerable) convolution. This extends Value Iteration Networks (VINs) on using convolutional networks for path planning with additional rotation and reflection symmetry. Our implementation is based on VINs and uses steerable convolution networks to incorporate symmetry. The experiments are performed on four tasks: 2D navigation, visual navigation, and 2 degrees of freedom (2DOFs) configuration space and workspace manipulation. Our symmetric planning algorithms improve training efficiency and generalization by large margins compared to non-equivariant counterparts, VIN and GPPN.
    On the Importance and Applicability of Pre-Training for Federated Learning. (arXiv:2206.11488v2 [cs.LG] UPDATED)
    Pre-training is prevalent in deep learning nowadays to improve the learned model's performance. However, in the literature on federated learning (FL), neural networks are mostly initialized with random weights. This motivated us to conduct a systematic study exploring pre-training for FL. Across multiple visual recognition benchmarks, we found that pre-training can not only improve FL, but also close its accuracy gap to the counterpart centralized learning, especially in the challenging cases of non-IID clients' data. To make our findings applicable to situations where pre-trained models are not directly available, we explore pre-training with synthetic data or even with clients' data in a decentralized manner, and found that they can already improve FL notably. Interestingly, many of the techniques we explore are complementary to each other to further boost the performance, and we view this as a critical result toward scaling up deep FL for real-world applications. We conclude our paper with an attempt to understand the effect of pre-training on FL. We found that pre-training enables the learned global models under different clients' data conditions to converge to the same loss basin, and makes global aggregation in FL more stable. Nevertheless, pre-training seems not to alleviate local model drifting, a fundamental problem in FL under non-IID data.
    Optimal Estimation of Generic Dynamics by Path-Dependent Neural Jump ODEs. (arXiv:2206.14284v3 [stat.ML] UPDATED)
    This paper studies the problem of forecasting general stochastic processes using an extension of the Neural Jump ODE (NJ-ODE) framework. While NJ-ODE was the first framework to establish convergence guarantees for the prediction of irregularly observed time series, these results were limited to data stemming from It\^o-diffusions with complete observations, in particular Markov processes where all coordinates are observed simultaneously. In this work, we generalise these results to generic, possibly non-Markovian or discontinuous, stochastic processes with incomplete observations, by utilising the reconstruction properties of the signature transform. These theoretical results are supported by empirical studies, where it is shown that the path-dependent NJ-ODE outperforms the original NJ-ODE framework in the case of non-Markovian data. Moreover, we show that PD-NJ-ODE can be applied successfully to limit order book (LOB) data.
    CyCLIP: Cyclic Contrastive Language-Image Pretraining. (arXiv:2205.14459v2 [cs.CV] UPDATED)
    Recent advances in contrastive representation learning over paired image-text data have led to models such as CLIP that achieve state-of-the-art performance for zero-shot classification and distributional robustness. Such models typically require joint reasoning in the image and text representation spaces for downstream inference tasks. Contrary to prior beliefs, we demonstrate that the image and text representations learned via a standard contrastive objective are not interchangeable and can lead to inconsistent downstream predictions. To mitigate this issue, we formalize consistency and propose CyCLIP, a framework for contrastive representation learning that explicitly optimizes for the learned representations to be geometrically consistent in the image and text space. In particular, we show that consistent representations can be learned by explicitly symmetrizing (a) the similarity between the two mismatched image-text pairs (cross-modal consistency); and (b) the similarity between the image-image pair and the text-text pair (in-modal consistency). Empirically, we show that the improved consistency in CyCLIP translates to significant gains over CLIP, with gains ranging from 10%-24% for zero-shot classification accuracy on standard benchmarks (CIFAR-10, CIFAR-100, ImageNet1K) and 10%-27% for robustness to various natural distribution shifts. The code is available at https://github.com/goel-shashank/CyCLIP.
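    The two consistency terms can be written down compactly; the sketch below uses our own notation for a batch of paired embeddings and reflects our reading of the abstract, not the reference implementation:

        import torch
        import torch.nn.functional as F

        def cyclip_consistency(img, txt):
            img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
            sim_it = img @ txt.t()        # image-text similarities
            sim_ii = img @ img.t()        # image-image similarities
            sim_tt = txt @ txt.t()        # text-text similarities
            cross = ((sim_it - sim_it.t()) ** 2).mean()  # cross-modal consistency
            inmod = ((sim_ii - sim_tt) ** 2).mean()      # in-modal consistency
            return cross, inmod

        img_emb, txt_emb = torch.randn(32, 512), torch.randn(32, 512)
        c1, c2 = cyclip_consistency(img_emb, txt_emb)
        # total_loss = clip_loss + lambda1 * c1 + lambda2 * c2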
    Multi-task Bias-Variance Trade-off Through Functional Constraints. (arXiv:2210.15573v1 [cs.LG])
    Multi-task learning aims to acquire a set of functions, either regressors or classifiers, that perform well for diverse tasks. At its core, the idea behind multi-task learning is to exploit the intrinsic similarity across data sources to aid in the learning process for each individual domain. In this paper we draw intuition from the two extreme learning scenarios -- a single function for all tasks, and a task-specific function that ignores dependencies on the other tasks -- to propose a bias-variance trade-off. To control the relationship between the variance (given by the number of i.i.d. samples) and the bias (coming from data of other tasks), we introduce a constrained learning formulation that enforces domain-specific solutions to be close to a central function. The problem is solved in the dual domain, for which we propose a stochastic primal-dual algorithm. Experimental results for a multi-domain classification problem with real data show that the proposed procedure outperforms both the task-specific and the single shared classifiers.
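    A stylized primal-dual step for the constrained formulation sketched above (our toy version with linear-in-parameters functions and one dual variable per task; all names are hypothetical):

        import numpy as np

        def primal_dual_step(w, w_c, grads, lam, eps, lr=0.1, lr_dual=0.01):
            T = len(w)
            for t in range(T):                  # primal: task-specific updates
                w[t] -= lr * (grads[t] + 2 * lam[t] * (w[t] - w_c))
            w_c += lr * sum(2 * lam[t] * (w[t] - w_c) for t in range(T))
            for t in range(T):                  # dual: ascent on constraint slack
                lam[t] = max(0.0, lam[t] + lr_dual * (np.sum((w[t] - w_c) ** 2) - eps))
            return w, w_c, lam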
    Class Based Thresholding in Early Exit Semantic Segmentation Networks. (arXiv:2210.15621v1 [cs.CV])
    We propose Class Based Thresholding (CBT) to reduce the computational cost of early exit semantic segmentation models while preserving the mean intersection over union (mIoU) performance. A key idea of CBT is to exploit the naturally-occurring neural collapse phenomenon. Specifically, by calculating the mean prediction probabilities of each class in the training set, CBT assigns different masking threshold values to each class, so that the computation can be terminated sooner for pixels belonging to easy-to-predict classes. We show the effectiveness of CBT on Cityscapes and ADE20K datasets. CBT can reduce the computational cost by $23\%$ compared to the previous state-of-the-art early exit models.
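    A sketch of the thresholding rule as we read it: per-class thresholds are derived from mean training-set confidences, then used to decide which pixels may terminate at the early exit; the function names and margin are our own:

        import numpy as np

        def class_thresholds(train_probs, train_labels, n_classes, margin=0.0):
            # train_probs: (num_pixels, n_classes) softmax outputs at the exit
            return np.array([train_probs[train_labels == c, c].mean() - margin
                             for c in range(n_classes)])

        def early_exit_mask(probs, thresholds):
            pred = probs.argmax(-1)
            conf = probs.max(-1)
            return conf >= thresholds[pred]   # True: stop computing this pixel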
    Fully Bayesian Analysis of the Relevance Vector Machine Classification for Imbalanced Data. (arXiv:2007.13140v2 [stat.ML] UPDATED)
    The Relevance Vector Machine (RVM) is a supervised learning algorithm extended from the Support Vector Machine (SVM) based on a Bayesian sparsity model. Compared with the regression problem, RVM classification is difficult to conduct because there is no closed-form solution for the weight parameter posterior. The original RVM classification algorithm used Newton's method to find the mode of the weight parameter posterior and then approximated it by a Gaussian distribution via Laplace's method. This works, but it merely applies frequentist methods within a Bayesian framework. This paper proposes a Generic Bayesian approach for RVM classification. We conjecture that our algorithm achieves convergent estimates of the quantities of interest, in contrast to the nonconvergent estimates of the original RVM classification algorithm. Furthermore, a Fully Bayesian approach with a hierarchical hyperprior structure for RVM classification is proposed, which improves classification performance, especially on imbalanced data. In numerical studies, our proposed algorithms obtain high classification accuracy rates; the Fully Bayesian hierarchical hyperprior method outperforms the Generic one for imbalanced data classification.
    Online Learning with Radial Basis Function Networks. (arXiv:2103.08414v8 [cs.CE] UPDATED)
    Financial time series are characterised by their nonstationarity and autocorrelation. Even if these time series are differenced, technically ensuring their stationarity, they experience regular covariate shifts and concept drifts. Against this backdrop, we combine feature representation transfer with sequential optimisation to provide multi-horizon returns forecasts. Our online-learning radial basis function network (RBFNet) outperforms a random-walk baseline and several powerful batch learners. The RBFNets we formulate are naturally designed to measure the similarity between test samples and continuously updated prototypes that capture the characteristics of the feature space.
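    A toy sketch of the general idea, with illustrative names and update rules (the paper's exact sequential optimisation is not reproduced here):
        import numpy as np

        class OnlineRBFNet:
            def __init__(self, prototypes, gamma=1.0, lr=0.01):
                self.P = prototypes.astype(float)   # (K, d) prototypes
                self.w = np.zeros(len(prototypes))  # linear readout
                self.gamma, self.lr = gamma, lr

            def features(self, x):
                # RBF similarity of x to every prototype.
                return np.exp(-self.gamma * ((self.P - x) ** 2).sum(axis=1))

            def predict(self, x):
                return self.features(x) @ self.w

            def update(self, x, y, proto_lr=0.05):
                phi = self.features(x)
                self.w += self.lr * (y - phi @ self.w) * phi    # readout step
                j = ((self.P - x) ** 2).sum(axis=1).argmin()
                self.P[j] += proto_lr * (x - self.P[j])         # drift nearest prototype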
    Transformers meet Stochastic Block Models: Attention with Data-Adaptive Sparsity and Cost. (arXiv:2210.15541v1 [cs.LG])
    To overcome the quadratic cost of self-attention, recent works have proposed various sparse attention modules, most of which fall under one of two groups: 1) sparse attention under hand-crafted patterns and 2) full attention followed by a sparse variant of softmax such as $\alpha$-entmax. Unfortunately, the first group lacks adaptability to data while the second still requires quadratic cost in training. In this work, we propose SBM-Transformer, a model that resolves both problems by endowing each attention head with a mixed-membership Stochastic Block Model (SBM). Then, each attention head data-adaptively samples a bipartite graph, the adjacency of which is used as an attention mask for each input. During backpropagation, a straight-through estimator is used to flow gradients beyond the discrete sampling step and adjust the probabilities of sampled edges based on the predictive loss. The forward and backward costs are thus linear in the number of edges, which each attention head can also choose flexibly based on the input. By assessing the distribution of graphs, we theoretically show that SBM-Transformer is a universal approximator for arbitrary sequence-to-sequence functions in expectation. Empirical evaluations under the LRA and GLUE benchmarks demonstrate that our model outperforms previous efficient variants as well as the original Transformer with full attention. Our implementation can be found at https://github.com/sc782/SBM-Transformer .
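    A minimal sketch of the sampling-plus-straight-through mechanic (the parametrization of edge probabilities is illustrative; the paper learns mixed-membership vectors per head):
        import torch

        def sampled_attention_mask(q_member, k_member):
            # q_member: (n, B), k_member: (m, B) nonnegative membership
            # scores of queries/keys over B latent blocks (B is illustrative).
            probs = (q_member @ k_member.t()).clamp(0.0, 1.0)  # edge probabilities
            mask = torch.bernoulli(probs)                      # discrete sample
            # Straight-through: forward uses the sample, backward sees probs.
            return mask + probs - probs.detach()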
    Oracle Complexity in Nonsmooth Nonconvex Optimization. (arXiv:2104.06763v3 [math.OC] UPDATED)
    It is well-known that given a smooth, bounded-from-below, and possibly nonconvex function, standard gradient-based methods can find $\epsilon$-stationary points (with gradient norm less than $\epsilon$) in $\mathcal{O}(1/\epsilon^2)$ iterations. However, many important nonconvex optimization problems, such as those associated with training modern neural networks, are inherently not smooth, making these results inapplicable. In this paper, we study nonsmooth nonconvex optimization from an oracle complexity viewpoint, where the algorithm is assumed to be given access only to local information about the function at various points. We provide two main results: First, we consider the problem of getting near $\epsilon$-stationary points. This is perhaps the most natural relaxation of finding $\epsilon$-stationary points, which is impossible in the nonsmooth nonconvex case. We prove that this relaxed goal cannot be achieved efficiently, for any distance and $\epsilon$ smaller than some constants. Our second result deals with the possibility of tackling nonsmooth nonconvex optimization by reduction to smooth optimization: Namely, applying smooth optimization methods on a smooth approximation of the objective function. For this approach, we prove under a mild assumption an inherent trade-off between oracle complexity and smoothness: On the one hand, smoothing a nonsmooth nonconvex function can be done very efficiently (e.g., by randomized smoothing), but with dimension-dependent factors in the smoothness parameter, which can strongly affect iteration complexity when plugging into standard smooth optimization methods. On the other hand, these dimension factors can be eliminated with suitable smoothing methods, but only by making the oracle complexity of the smoothing process exponentially large.
    Training Graph Neural Networks on Growing Stochastic Graphs. (arXiv:2210.15567v1 [cs.LG])
    Graph Neural Networks (GNNs) rely on graph convolutions to exploit meaningful patterns in networked data. Being based on matrix multiplications, convolutions incur high computational costs, leading to scalability limitations in practice. To overcome these limitations, proposed methods rely on training GNNs on a smaller number of nodes and then transferring the GNN to larger graphs. Even though these methods are able to bound the difference between the outputs of the GNN on graphs with different numbers of nodes, they do not provide guarantees against the optimal GNN on the very large graph. In this paper, we propose to learn GNNs on very large graphs by leveraging the limit object of a sequence of growing graphs, the graphon. We propose to grow the size of the graph as we train, and we show that our proposed methodology -- learning by transference -- converges to a neighborhood of a first-order stationary point on the graphon data. A numerical experiment validates our proposed approach.
    A Graph Is More Than Its Nodes: Towards Structured Uncertainty-Aware Learning on Graphs. (arXiv:2210.15575v1 [cs.LG])
    Current graph neural networks (GNNs) that tackle node classification on graphs tend to only focus on nodewise scores and are solely evaluated by nodewise metrics. This limits uncertainty estimation on graphs since nodewise marginals do not fully characterize the joint distribution given the graph structure. In this work, we propose novel edgewise metrics, namely the edgewise expected calibration error (ECE) and the agree/disagree ECEs, which provide criteria for uncertainty estimation on graphs beyond the nodewise setting. Our experiments demonstrate that the proposed edgewise metrics can complement the nodewise results and yield additional insights. Moreover, we show that GNN models which consider the structured prediction problem on graphs tend to have better uncertainty estimations, which illustrates the benefit of going beyond the nodewise setting.
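    For intuition, once per-edge confidences and correctness indicators are defined as in the paper, the binned ECE computation itself is standard; a NumPy sketch:
        import numpy as np

        def ece(conf, correct, n_bins=10):
            # conf, correct: (E,) arrays of per-edge confidence in [0, 1]
            # and 0/1 correctness; how these are defined per edge follows
            # the paper and is taken as given here.
            bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
            err = 0.0
            for b in range(n_bins):
                idx = bins == b
                if idx.any():
                    # Weight each bin's |confidence - accuracy| gap by its mass.
                    err += idx.mean() * abs(conf[idx].mean() - correct[idx].mean())
            return err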
    Multi-view Representation Learning from Malware to Defend Against Adversarial Variants. (arXiv:2210.15429v1 [cs.CR])
    Deep learning-based adversarial malware detectors have yielded promising results in detecting never-before-seen malware executables without relying on expensive dynamic behavior analysis and sandboxing. Despite their abilities, these detectors have been shown to be vulnerable to adversarial malware variants - meticulously modified, functionality-preserving versions of original malware executables generated by machine learning. Owing to the nature of these modifications, adversarial methods often use a \textit{single view} of malware executables (i.e., the binary/hexadecimal view) to generate adversarial malware variants. This provides an opportunity for the defenders (i.e., malware detectors) to detect the adversarial variants by utilizing more than one view of a malware file (e.g., the source code view in addition to the binary view). The rationale behind this idea is that while the adversary focuses on the binary view, certain characteristics of the malware file in the source code view remain untouched, which leads to the detection of the adversarial malware variants. To capitalize on this opportunity, we propose Adversarially Robust Multiview Malware Defense (ARMD), a novel multi-view learning framework to improve the robustness of DL-based malware detectors against adversarial variants. Our experiments on three renowned open-source deep learning-based malware detectors across six common malware categories show that ARMD is able to improve the adversarial robustness by up to seven times on these malware detectors.
    Meta-Reinforcement Learning Using Model Parameters. (arXiv:2210.15515v1 [cs.LG])
    In meta-reinforcement learning, an agent is trained in multiple different environments and attempts to learn a meta-policy that can efficiently adapt to a new environment. This paper presents RAMP, a Reinforcement learning Agent using Model Parameters that utilizes the idea that a neural network trained to predict environment dynamics encapsulates the environment information. RAMP is constructed in two phases: in the first phase, a multi-environment parameterized dynamic model is learned. In the second phase, the model parameters of the dynamic model are used as context for the multi-environment policy of the model-free reinforcement learning agent.
    Beyond the Return: Off-policy Function Estimation under User-specified Error-measuring Distributions. (arXiv:2210.15543v1 [cs.LG])
    Off-policy evaluation often refers to two related tasks: estimating the expected return of a policy and estimating its value function (or other functions of interest, such as density ratios). While recent works on marginalized importance sampling (MIS) show that the former can enjoy provable guarantees under realizable function approximation, the latter is only known to be feasible under much stronger assumptions such as prohibitively expressive discriminators. In this work, we provide guarantees for off-policy function estimation under only realizability, by imposing proper regularization on the MIS objectives. Compared to commonly used regularization in MIS, our regularizer is much more flexible and can account for an arbitrary user-specified distribution, under which the learned function will be close to the groundtruth. We provide exact characterization of the optimal dual solution that needs to be realized by the discriminator class, which determines the data-coverage assumption in the case of value-function learning. As another surprising observation, the regularizer can be altered to relax the data-coverage requirement, and completely eliminate it in the ideal case with strong side information.
    Improving abstractive summarization with energy-based re-ranking. (arXiv:2210.15553v1 [cs.CL])
    Current abstractive summarization systems present important weaknesses which prevent their deployment in real-world applications, such as the omission of relevant information and the generation of factual inconsistencies (also known as hallucinations). At the same time, automatic evaluation metrics such as CTC scores have been recently proposed that exhibit a higher correlation with human judgments than traditional lexical-overlap metrics such as ROUGE. In this work, we intend to close the loop by leveraging the recent advances in summarization metrics to create quality-aware abstractive summarizers. Namely, we propose an energy-based model that learns to re-rank summaries according to one or a combination of these metrics. We experiment using several metrics to train our energy-based re-ranker and show that it consistently improves the scores achieved by the predicted summaries. Nonetheless, human evaluation results show that the re-ranking approach should be used with care for highly abstractive summaries, as the available metrics are not yet sufficiently reliable for this purpose.
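    A minimal sketch of metric-driven re-ranking (the pairwise margin loss is one plausible training choice; the paper's exact energy objective is not reproduced):
        import torch

        def rerank_loss(energies, metric_scores, margin=0.1):
            # energies, metric_scores: (K,) for K candidate summaries of
            # one source document.
            order = metric_scores.argsort(descending=True)
            e = energies[order]                      # energies by metric rank
            # Pairwise margin loss: better-ranked candidates should score higher.
            diffs = e.unsqueeze(1) - e.unsqueeze(0)  # diffs[i, j] = e_i - e_j
            i, j = torch.triu_indices(len(e), len(e), offset=1)
            return torch.clamp(margin - diffs[i, j], min=0.0).mean()

        def pick_best(energies, candidates):
            # At inference, return the candidate the re-ranker scores highest.
            return candidates[int(energies.argmax())]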
    LAD: Language Augmented Diffusion for Reinforcement Learning. (arXiv:2210.15629v1 [cs.LG])
    Learning skills from language provides a powerful avenue for generalization in reinforcement learning, although it remains a challenging task as it requires agents to capture the complex interdependencies between language, actions, and states. In this paper, we propose leveraging Language Augmented Diffusion models as a planner conditioned on language (LAD). We demonstrate the comparable performance of LAD with the state-of-the-art on the CALVIN language robotics benchmark with a much simpler architecture that contains no inductive biases specialized to robotics, achieving an average success rate (SR) of 72% compared to the best performance of 76%. We also conduct an analysis on the properties of language conditioned diffusion in reinforcement learning.
    Incorporating Causal Effects into Deep Learning Predictions on EHR Data. (arXiv:2011.05466v2 [cs.LG] UPDATED)
    Electronic Health Records (EHR) data analysis plays a crucial role in healthcare system quality. Because of its highly complex underlying causality and limited observability, causal inference on EHR data is quite challenging. Deep learning (DL) has achieved great success among advanced machine learning methodologies, yet it is still hindered by inappropriately assumed causal conditions. This work proposes a novel method to quantify clinically well-defined causal effects as a generalized estimation vector that can be readily used in causal models. We incorporate it into DL models to achieve better predictive performance and result interpretation. Furthermore, we also prove the existence of causal information blind spots that regular DL models cannot reach.
    Automatic Extraction of Materials and Properties from Superconductors Scientific Literature. (arXiv:2210.15600v1 [cs.CL])
    The automatic extraction of materials and related properties from the scientific literature is gaining attention in data-driven materials science (Materials Informatics). In this paper, we discuss Grobid-superconductors, our solution for automatically extracting superconductor material names and respective properties from text. Built as a Grobid module, it combines machine learning and heuristic approaches in a multi-step architecture that supports input data as raw text or PDF documents. Using Grobid-superconductors, we built SuperCon2, a database of 40324 materials and properties records from 37700 papers. The material (or sample) information is represented by name, chemical formula, and material class, and is characterized by shape, doping, substitution variables for components, and substrate as adjoined information. The properties include the Tc superconducting critical temperature and, when available, the applied pressure and the Tc measurement method.
    Private and Reliable Neural Network Inference. (arXiv:2210.15614v1 [cs.LG])
    Reliable neural networks (NNs) provide important inference-time reliability guarantees such as fairness and robustness. Complementarily, privacy-preserving NN inference protects the privacy of client data. So far these two emerging areas have been largely disconnected, yet their combination will be increasingly important. In this work, we present the first system which enables privacy-preserving inference on reliable NNs. Our key idea is to design efficient fully homomorphic encryption (FHE) counterparts for the core algorithmic building blocks of randomized smoothing, a state-of-the-art technique for obtaining reliable models. The lack of required control flow in FHE makes this a demanding task, as na\"ive solutions lead to unacceptable runtime. We employ these building blocks to enable privacy-preserving NN inference with robustness and fairness guarantees in a system called Phoenix. Experimentally, we demonstrate that Phoenix achieves its goals without incurring prohibitive latencies. To our knowledge, this is the first work which bridges the areas of client data privacy and reliability guarantees for NNs.
    Broadcasted Residual Learning for Efficient Keyword Spotting. (arXiv:2106.04140v3 [cs.SD] UPDATED)
    Keyword spotting is an important research field because it plays a key role in device wake-up and user interaction on smart devices. However, it is challenging to minimize errors while operating efficiently in devices with limited resources such as mobile phones. We present a broadcasted residual learning method to achieve high accuracy with small model size and computational load. Our method configures most of the residual functions as 1D temporal convolution while still allowing 2D convolution, using a broadcasted-residual connection that expands the temporal output to the frequency-temporal dimension. This residual mapping enables the network to effectively represent useful audio features with much less computation than conventional convolutional neural networks. We also propose a novel network architecture, Broadcasting-residual network (BC-ResNet), based on broadcasted residual learning, and describe how to scale up the model according to the target device's resources. BC-ResNets achieve state-of-the-art 98.0% and 98.7% top-1 accuracy on Google speech command datasets v1 and v2, respectively, and consistently outperform previous approaches, using fewer computations and parameters.
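    A simplified PyTorch sketch of the broadcast mechanic (layer shapes and normalization are illustrative simplifications, not the exact BC-ResNet block):
        import torch
        import torch.nn as nn

        class BroadcastedResBlock(nn.Module):
            def __init__(self, ch):
                super().__init__()
                self.freq_conv = nn.Conv2d(ch, ch, (3, 1), padding=(1, 0), groups=ch)
                self.temp_conv = nn.Conv1d(ch, ch, 3, padding=1, groups=ch)
                self.pointwise = nn.Conv1d(ch, ch, 1)

            def forward(self, x):             # x: (batch, ch, freq, time)
                y = self.freq_conv(x)         # cheap 2D path along frequency
                z = y.mean(dim=2)             # collapse frequency -> (b, ch, time)
                z = self.pointwise(torch.relu(self.temp_conv(z)))
                return x + y + z.unsqueeze(2) # broadcast temporal output over freq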
    Accurate Bundle Matching and Generation via Multitask Learning with Partially Shared Parameters. (arXiv:2210.15460v1 [cs.IR])
    How can we recommend existing bundles to users accurately? How can we generate new tailored bundles for users? Recommending a bundle, or a group of various items, has attracted widespread attention in e-commerce owing to the increased satisfaction of both users and providers. Bundle matching and bundle generation are two representative tasks in bundle recommendation. The bundle matching task is to correctly match existing bundles to users while the bundle generation is to generate new bundles that users would prefer. Although many recent works have developed bundle recommendation models, they fail to achieve high accuracy since they do not handle heterogeneous data effectively and do not learn a method for customized bundle generation. In this paper, we propose BundleMage, an accurate approach for bundle matching and generation. BundleMage effectively mixes user preferences of items and bundles using an adaptive gate technique to achieve high accuracy for the bundle matching. BundleMage also generates a personalized bundle by learning a generation module that exploits a user preference and the characteristic of a given incomplete bundle to be completed. BundleMage further improves its performance using multi-task learning with partially shared parameters. Through extensive experiments, we show that BundleMage achieves up to 6.6% higher nDCG in bundle matching and 6.3x higher nDCG in bundle generation than the best competitors. We also provide qualitative analysis that BundleMage effectively generates bundles considering both the tastes of users and the characteristics of target bundles.
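    The gating idea can be sketched as follows (dimensions and names are illustrative, not BundleMage's released code):
        import torch
        import torch.nn as nn

        class AdaptiveGate(nn.Module):
            def __init__(self, d):
                super().__init__()
                self.gate = nn.Linear(2 * d, d)

            def forward(self, item_pref, bundle_pref):
                # Input-dependent convex mixing of the user's item-level and
                # bundle-level preference vectors.
                g = torch.sigmoid(self.gate(torch.cat([item_pref, bundle_pref], -1)))
                return g * item_pref + (1 - g) * bundle_pref
    The full model adds the generation module and the partially shared multi-task parameters on top of a mixing step of this kind.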
    Working Alliance Transformer for Psychotherapy Dialogue Classification. (arXiv:2210.15603v1 [cs.CL])
    As a predictive measure of the treatment outcome in psychotherapy, the working alliance measures the agreement of the patient and the therapist in terms of their bond, task and goal. Although it has long been a clinical quantity estimated from patients' and therapists' self-evaluative reports, we believe the working alliance can be better characterized by applying natural language processing directly to the dialogue transcribed from each therapy session. In this work, we propose the Working Alliance Transformer (WAT), a Transformer-based classification model with a psychological state encoder that infers working alliance scores by projecting the embeddings of dialogue turns onto the embedding space of the clinical working alliance inventory. We evaluate our method on a real-world dataset of over 950 therapy sessions with anxiety, depression, schizophrenia and suicidal patients, and demonstrate an empirical advantage of using information about therapeutic states in this sequence classification task on psychotherapy dialogues.
    Sample-Efficient Reinforcement Learning for Linearly-Parameterized MDPs with a Generative Model. (arXiv:2105.14016v3 [cs.LG] UPDATED)
    The curse of dimensionality is a widely known issue in reinforcement learning (RL). In the tabular setting where the state space $\mathcal{S}$ and the action space $\mathcal{A}$ are both finite, to obtain a nearly optimal policy with sampling access to a generative model, the minimax optimal sample complexity scales linearly with $|\mathcal{S}|\times|\mathcal{A}|$, which can be prohibitively large when $\mathcal{S}$ or $\mathcal{A}$ is large. This paper considers a Markov decision process (MDP) that admits a set of state-action features, which can linearly express (or approximate) its probability transition kernel. We show that a model-based approach (resp.$~$Q-learning) provably learns an $\varepsilon$-optimal policy (resp.$~$Q-function) with high probability as soon as the sample size exceeds the order of $\frac{K}{(1-\gamma)^{3}\varepsilon^{2}}$ (resp.$~$$\frac{K}{(1-\gamma)^{4}\varepsilon^{2}}$), up to some logarithmic factor. Here $K$ is the feature dimension and $\gamma\in(0,1)$ is the discount factor of the MDP. Both sample complexity bounds are provably tight, and our result for the model-based approach matches the minimax lower bound. Our results show that for arbitrarily large-scale MDP, both the model-based approach and Q-learning are sample-efficient when $K$ is relatively small, and hence the title of this paper.
    LP-BFGS attack: An adversarial attack based on the Hessian with limited pixels. (arXiv:2210.15446v1 [cs.CR])
    Deep neural networks are vulnerable to adversarial attacks. Most white-box attacks are based on the gradient of the model with respect to the input. Because of their computation and memory costs, adversarial attacks based on Hessian information have received little attention. In this work, we study the attack performance and computation cost of attacks based on the Hessian when only a limited number of pixels may be perturbed. Specifically, we propose the Limited Pixel BFGS (LP-BFGS) attack method by incorporating the BFGS algorithm. The pixels to perturb are selected by the Integrated Gradients algorithm and are treated as the optimization variables of the LP-BFGS attack. Experimental results across different networks and datasets with various perturbation pixel numbers demonstrate that our approach achieves comparable attack performance at acceptable computational cost compared with existing solutions.
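    A hedged sketch of the two-stage recipe, using plain input gradients as a stand-in for Integrated Gradients and SciPy's L-BFGS-B as the optimizer (names, the penalty weight, and the pixel budget are illustrative):
        import numpy as np
        import torch
        from scipy.optimize import minimize

        def lp_bfgs_attack(model, x, y, k=20, c=1.0):
            # x: (C, H, W) input image; y: scalar integer label tensor.
            x = x.clone().requires_grad_(True)
            torch.nn.functional.cross_entropy(model(x[None]), y[None]).backward()
            idx = x.grad.abs().flatten().topk(k).indices   # top-k attributed pixels
            x0 = x.detach().flatten()

            def objective(delta):                          # scipy works in numpy
                with torch.no_grad():
                    x_adv = x0.clone()
                    x_adv[idx] += torch.tensor(delta, dtype=x0.dtype)
                    ce = torch.nn.functional.cross_entropy(
                        model(x_adv.reshape(x.shape)[None]), y[None])
                # Maximize the loss while keeping the perturbation small.
                return -ce.item() + c * float((delta ** 2).sum())

            res = minimize(objective, np.zeros(k), method="L-BFGS-B")
            x_adv = x0.clone()
            x_adv[idx] += torch.tensor(res.x, dtype=x0.dtype)
            return x_adv.reshape(x.shape)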
    Lifelong Bandit Optimization: No Prior and No Regret. (arXiv:2210.15513v1 [stat.ML])
    In practical applications, machine learning algorithms are often repeatedly applied to problems with similar structure over and over again. We focus on solving a sequence of bandit optimization tasks and develop LiBO, an algorithm which adapts to the environment by learning from past experience and becoming more sample-efficient in the process. We assume a kernelized structure where the kernel is unknown but shared across all tasks. LiBO sequentially meta-learns a kernel that approximates the true kernel and simultaneously solves the incoming tasks with the latest kernel estimate. Our algorithm can be paired with any kernelized bandit algorithm and guarantees oracle optimal performance, meaning that as more tasks are solved, the regret of LiBO on each task converges to the regret of the bandit algorithm with oracle knowledge of the true kernel. Naturally, if paired with a sublinear bandit algorithm, LiBO yields a sublinear lifelong regret. We also show that direct access to the data from each task is not necessary for attaining sublinear regret. The lifelong problem can thus be solved in a federated manner, while keeping the data of each task private.
    Robust Monocular Localization of Drones by Adapting Domain Maps to Depth Prediction Inaccuracies. (arXiv:2210.15559v1 [cs.CV])
    We present a novel monocular localization framework by jointly training deep learning-based depth prediction and Bayesian filtering-based pose reasoning. The proposed cross-modal framework significantly outperforms deep learning-only predictions with respect to model scalability and tolerance to environmental variations. Specifically, we show little-to-no degradation of pose accuracy even with extremely poor depth estimates from a lightweight depth predictor. Our framework also maintains high pose accuracy in extreme lighting variations compared to standard deep learning, even without explicit domain adaptation. By openly representing the map and intermediate feature maps (such as depth estimates), our framework also allows for faster updates and reusing intermediate predictions for other tasks, such as obstacle avoidance, resulting in much higher resource efficiency.
    LyricJam Sonic: A Generative System for Real-Time Composition and Musical Improvisation. (arXiv:2210.15638v1 [cs.SD])
    Electronic music artists and sound designers have unique workflow practices that necessitate specialized approaches for developing music information retrieval and creativity support tools. Furthermore, electronic music instruments, such as modular synthesizers, have near-infinite possibilities for sound creation and can be combined to create unique and complex audio paths. The process of discovering interesting sounds is often serendipitous and impossible to replicate. For this reason, many musicians in electronic genres record audio output at all times while they work in the studio. Subsequently, it is difficult for artists to rediscover audio segments that might be suitable for use in their compositions from thousands of hours of recordings. In this paper, we describe LyricJam Sonic -- a novel creative tool for musicians to rediscover their previous recordings, re-contextualize them with other recordings, and create original live music compositions in real-time. A bi-modal AI-driven approach uses generated lyric lines to find matching audio clips from the artist's past studio recordings, and uses them to generate new lyric lines, which in turn are used to find other clips, thus creating a continuous and evolving stream of music and lyrics. The intent is to keep the artists in a state of creative flow conducive to music creation rather than taking them into an analytical/critical state of deliberately searching for past audio segments. The system can run in either a fully autonomous mode without user input, or in a live performance mode, where the artist plays live music, while the system "listens" and creates a continuous stream of music and lyrics in response.
    Learning Location from Shared Elevation Profiles in Fitness Apps: A Privacy Perspective. (arXiv:2210.15529v1 [cs.CR])
    The extensive use of smartphones and wearable devices has facilitated many useful applications. For example, with Global Positioning System (GPS)-equipped smart and wearable devices, many applications can gather, process, and share rich metadata, such as geolocation, trajectories, elevation, and time. In particular, fitness applications, such as Runkeeper and Strava, utilize this information for activity tracking and have recently witnessed a boom in popularity. These fitness tracker applications have their own web platforms and allow users to share activities on such platforms or even with other social network platforms. To preserve the privacy of users while allowing sharing, several of those platforms may allow users to disclose partial information, such as the elevation profile for an activity, which supposedly would not leak the location of the users. In this work, and as a cautionary tale, we create a proof of concept in which we examine the extent to which elevation profiles can be used to predict the location of users. To tackle this problem, we devise three plausible threat settings under which the city or borough of the targets can be predicted. These threat settings define the amount of information available to the adversary to launch the prediction attacks. Establishing that simple features of elevation profiles, e.g., spectral features, are insufficient, we devise both a natural language processing (NLP)-inspired text-like representation and a computer vision-inspired image-like representation of elevation profiles, and we convert the problem at hand into text and image classification problems. We use both traditional machine learning- and deep learning-based techniques and achieve a prediction success rate ranging from 59.59\% to 99.80\%. The findings are alarming, highlighting that sharing elevation information may pose significant location privacy risks.
    Explaining the Explainers in Graph Neural Networks: a Comparative Study. (arXiv:2210.15304v1 [cs.LG])
    Following a fast initial breakthrough in graph-based learning, Graph Neural Networks (GNNs) have reached widespread application in many science and engineering fields, prompting the need for methods to understand their decision process. GNN explainers have started to emerge in recent years, with a multitude of methods both novel and adapted from other domains. To sort out this plethora of alternative approaches, several studies have benchmarked the performance of different explainers in terms of various explainability metrics. However, these earlier works make no attempt at providing insights into why different GNN architectures are more or less explainable, or into which explainer should be preferred in a given setting. In this survey, we fill these gaps by devising a systematic experimental study, which tests ten explainers on eight representative architectures trained on six carefully designed graph and node classification datasets. With our results we provide key insights on the choice and applicability of GNN explainers, we isolate key components that make them usable and successful, and we provide recommendations on how to avoid common interpretation pitfalls. We conclude by highlighting open questions and directions of possible future research.
    End-to-End Pareto Set Prediction with Graph Neural Networks for Multi-objective Facility Location. (arXiv:2210.15220v1 [cs.LG])
    The facility location problems (FLPs) are a typical class of NP-hard combinatorial optimization problems, widely seen in supply chains and logistics. Many mathematical and heuristic algorithms have been developed for optimizing the FLP. In addition to the transportation cost, there are usually multiple conflicting objectives in realistic applications. It is therefore desirable to design algorithms that find a set of Pareto solutions efficiently without enormous search cost. In this paper, we consider the multi-objective facility location problem (MO-FLP), which simultaneously minimizes the overall cost and maximizes the system reliability. We develop a learning-based approach to predicting the distribution probability of the entire Pareto set for a given problem. To this end, the MO-FLP is modeled as a bipartite graph optimization problem and two graph neural networks are constructed to learn the implicit graph representation on nodes and edges. The network outputs are then converted into the probability distribution of the Pareto set, from which a set of non-dominated solutions can be sampled non-autoregressively. Experimental results on MO-FLP instances of different scales show that the proposed approach achieves comparable performance to a widely used multi-objective evolutionary algorithm in terms of solution quality while significantly reducing the computational cost of search.
    Fine-Grained Session Recommendations in E-commerce using Deep Reinforcement Learning. (arXiv:2210.15451v1 [cs.IR])
    Sustaining users' interest and keeping them engaged in the platform is very important for the success of an e-commerce business. A session encompasses the different activities of a user between logging into the platform and logging out or making a purchase. User activities in a session can be classified into two groups: known intent and unknown intent. Known-intent activity pertains to sessions where the intent of a user to browse/purchase a specific product can be easily captured, whereas in unknown-intent activity the intent of the user is not known. For example, consider the scenario where a user enters the session to casually browse the products on the platform, similar to the window shopping experience in the offline setting. While recommending similar products is essential in the former, accurately understanding the intent and recommending interesting products is essential in the latter setting in order to retain a user. In this work, we focus primarily on the unknown-intent setting, where our objective is to recommend a sequence of products to a user in a session to sustain their interest, keep them engaged, and possibly drive them towards a purchase. We formulate this problem in the framework of the Markov Decision Process (MDP), a popular mathematical framework for sequential decision making, and solve it using Deep Reinforcement Learning (DRL) techniques. However, training a next-product recommendation model is difficult in the RL paradigm due to the large variance in the browse/purchase behavior of users. Therefore, we break the problem down into predicting various product attributes, where a pattern/trend can be identified and exploited to build accurate models. We show that the DRL agent provides better performance compared to a greedy strategy.
    Escaping the Impossibility of Fairness: From Formal to Substantive Algorithmic Fairness. (arXiv:2107.04642v9 [cs.CY] UPDATED)
    Efforts to promote equitable public policy with algorithms appear to be fundamentally constrained by the "impossibility of fairness" (an incompatibility between mathematical definitions of fairness). This technical limitation raises a central question about algorithmic fairness: How can computer scientists and policymakers support equitable policy reforms with algorithms? In this article, I argue that promoting justice with algorithms requires reforming the methodology of algorithmic fairness. First, I diagnose the problems of the current methodology for algorithmic fairness, which I call "formal algorithmic fairness." Because formal algorithmic fairness restricts analysis to isolated decision-making procedures, it leads to the impossibility of fairness and to models that exacerbate oppression despite appearing "fair." Second, I draw on theories of substantive equality from law and philosophy to propose an alternative methodology, which I call "substantive algorithmic fairness." Because substantive algorithmic fairness takes a more expansive scope of analysis, it enables an escape from the impossibility of fairness and provides a rigorous guide for alleviating injustice with algorithms. In sum, substantive algorithmic fairness presents a new direction for algorithmic fairness: away from formal mathematical models of "fair" decision-making and toward substantive evaluations of whether and how algorithms can promote justice in practice.
    FreeVC: Towards High-Quality Text-Free One-Shot Voice Conversion. (arXiv:2210.15418v1 [cs.SD])
    Voice conversion (VC) can be achieved by first extracting source content information and target speaker information, and then reconstructing the waveform from this information. However, current approaches normally either extract dirty content information with speaker information leaked in, or demand a large amount of annotated data for training. Besides, the quality of the reconstructed waveform can be degraded by the mismatch between the conversion model and the vocoder. In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction, and propose strategies for clean content information extraction without text annotation. We disentangle content information by imposing an information bottleneck on WavLM features, and propose spectrogram-resize based data augmentation to improve the purity of the extracted content information. Experimental results show that the proposed method outperforms the latest VC models trained with annotated data and has greater robustness.
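    A plausible sketch of the spectrogram-resize operation (the ratio range, interpolation mode, and padding value are our assumptions): vertically stretch or squeeze the mel spectrogram, then crop or pad back to the original number of bins, perturbing speaker-dependent formant positions while preserving the content.
        import torch
        import torch.nn.functional as F

        def spectrogram_resize(mel, ratio):
            # mel: (n_mels, frames); ratio, e.g., drawn from U(0.85, 1.15).
            n_mels = mel.size(0)
            new_h = max(1, int(round(n_mels * ratio)))
            out = F.interpolate(mel[None, None], size=(new_h, mel.size(1)),
                                mode="bilinear", align_corners=False)[0, 0]
            if new_h >= n_mels:
                return out[:n_mels]                  # crop the extra bins
            pad = mel.new_full((n_mels - new_h, mel.size(1)), mel.min().item())
            return torch.cat([out, pad], dim=0)      # pad the highest bins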
    Deep reinforcement learning for automatic run-time adaptation of UWB PHY radio settings. (arXiv:2210.15498v1 [cs.NI])
    Ultra-wideband technology has become increasingly popular for indoor localization and location-based services. Recent advances have therefore focused on reducing ranging errors, while research on enabling more reliable and energy-efficient communication has remained largely unexplored. The IEEE 802.15.4 UWB physical layer allows several settings to be selected that influence energy consumption, range, and reliability. Combined with the link-state diagnostics reported by UWB devices, there is an opportunity to dynamically select PHY settings based on the environment. To address this, we propose a deep Q-learning approach for enabling reliable UWB communication, maximizing the packet reception rate (PRR) and minimizing energy consumption. Deep Q-learning is a good fit for this problem, as it is an inherently adaptive algorithm that responds to the environment. Validation in a realistic office environment showed that the algorithm outperforms traditional Q-learning, linear search, and a fixed PHY layer. We found that deep Q-learning achieves a higher average PRR and reduces the ranging error while using only 14% of the energy compared to a fixed PHY setting in a dynamic office environment.
    Can language models handle recursively nested grammatical structures? A case study on comparing models and humans. (arXiv:2210.15303v1 [cs.CL])
    How should we compare the capabilities of language models and humans? Here, I consider a case study: processing of recursively nested grammatical structures. Prior work has suggested that language models cannot handle these structures as reliably as humans can. However, the humans were provided with instructions and training before being evaluated, while the language models were evaluated zero-shot. I therefore attempt to more closely match the evaluation paradigms by providing language models with few-shot prompts. A simple prompt, which contains substantially less content than the human training, allows large language models to consistently outperform the human results. The same prompt even allows extrapolation to more-deeply-nested conditions than have been tested in humans. Further, a reanalysis of the prior human experiments suggests that the humans may not perform above chance at the difficult structures initially. These results suggest that large language models can in fact process recursively nested grammatical structures comparably to humans. This case study highlights how discrepancies in the quantity of experiment-specific context can confound comparisons of language models and humans. I use this case study to reflect on the broader challenge of comparing human and model capabilities, and to suggest that there is an important difference between evaluating cognitive models of a specific phenomenon and evaluating broadly-trained models.
    Student-centric Model of Learning Management System Activity and Academic Performance: from Correlation to Causation. (arXiv:2210.15430v1 [cs.CY])
    In recent years, there has been a lot of interest in modeling students' digital traces in Learning Management Systems (LMS) to understand students' learning behavior patterns, including aspects of meta-cognition and self-regulation, with the ultimate goal of turning those insights into actionable information that supports students in improving their learning outcomes. In achieving this goal, however, there are two main issues that need to be addressed given the existing literature. Firstly, most of the current work is course-centered (i.e., models are built from data for a specific course) rather than student-centered; secondly, a vast majority of the models are correlational rather than causal. Those issues make it challenging to identify the most promising actionable factors for intervention at the student level, where most campus-wide academic support is aimed. In this paper, we explored a student-centric analytical framework for LMS activity data that can provide not only correlational but also causal insights mined from observational data. We demonstrated this approach using a dataset of 1651 computing major students at a public university in the US during one semester in the Fall of 2019. This dataset includes students' fine-grained LMS interaction logs and administrative data, e.g., demographics and academic performance. In addition, we expand the repository of LMS behavior indicators to include those that characterize the time of day of login (e.g., chronotype). Our analysis showed that student login volume, compared with other login behavior indicators, is both strongly correlated and causally linked to student academic performance, especially among students with low academic performance. We envision that these insights will provide convincing evidence for college student support groups to launch student-centered and targeted interventions that are effective and scalable.
    Leveraging Wikidata's edit history in knowledge graph refinement tasks. (arXiv:2210.15495v1 [cs.LG])
    Knowledge graphs have been adopted in many diverse fields for a variety of purposes. Most of those applications rely on valid and complete data to deliver their results, pressing the need to improve the quality of knowledge graphs. A number of solutions have been proposed to that end, ranging from rule-based approaches to the use of probabilistic methods, but there is an element that has not been considered yet: the edit history of the graph. In the case of collaborative knowledge graphs (e.g., Wikidata), those edits represent the process in which the community reaches some kind of fuzzy and distributed consensus over the information that best represents each entity, and can hold potentially interesting information to be used by knowledge graph refinement methods. In this paper, we explore the use of edit history information from Wikidata to improve the performance of type prediction methods. To do that, we first built a JSON dataset containing the edit history of every instance from the 100 most important classes in Wikidata. This edit history information is then explored and analyzed, with a focus on its potential applicability in knowledge graph refinement tasks. Finally, we propose and evaluate two new methods to leverage this edit history information in knowledge graph embedding models for type prediction tasks. Our results show that one of the proposed methods improves on current approaches, demonstrating the potential of using edit information in knowledge graph refinement tasks and opening promising new research lines within the field.
    Prompting ELECTRA: Few-Shot Learning with Discriminative Pre-Trained Models. (arXiv:2205.15223v3 [cs.CL] UPDATED)
    Pre-trained masked language models successfully perform few-shot learning by formulating downstream tasks as text infilling. However, as a strong alternative in full-shot settings, discriminative pre-trained models like ELECTRA do not fit into the paradigm. In this work, we adapt prompt-based few-shot learning to ELECTRA and show that it outperforms masked language models in a wide range of tasks. ELECTRA is pre-trained to distinguish if a token is generated or original. We naturally extend that to prompt-based few-shot learning by training to score the originality of the target options without introducing new parameters. Our method can be easily adapted to tasks involving multi-token predictions without extra computation overhead. Analysis shows that ELECTRA learns distributions that align better with downstream tasks.
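    A rough sketch of such scoring with a public ELECTRA checkpoint through the transformers library (the template, token matching, and score averaging are illustrative choices, not the authors' exact implementation):
        import torch
        from transformers import ElectraTokenizer, ElectraForPreTraining

        tok = ElectraTokenizer.from_pretrained("google/electra-base-discriminator")
        model = ElectraForPreTraining.from_pretrained(
            "google/electra-base-discriminator")

        def score_option(template, option):
            # Fill the prompt with the option and score how "original"
            # (as opposed to "replaced") the discriminator finds its tokens.
            enc = tok(template.format(option), return_tensors="pt")
            with torch.no_grad():
                logits = model(**enc).logits[0]   # per-token "replaced" logits
            opt_ids = tok(option, add_special_tokens=False)["input_ids"]
            ids = enc["input_ids"][0].tolist()
            for s in range(len(ids) - len(opt_ids) + 1):
                if ids[s:s + len(opt_ids)] == opt_ids:
                    return -logits[s:s + len(opt_ids)].mean().item()
            return float("-inf")

        best = max(["great", "terrible"],
                   key=lambda o: score_option("the movie was {} .", o))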
    Mining Multi-Label Samples from Single Positive Labels. (arXiv:2206.05764v3 [cs.LG] UPDATED)
    Conditional generative adversarial networks (cGANs) have shown superior results in class-conditional generation tasks. To simultaneously control multiple conditions, cGANs require multi-label training datasets, where multiple labels can be assigned to each data instance. Nevertheless, the tremendous annotation cost limits the accessibility of multi-label datasets in real-world scenarios. Therefore, in this study we explore the practical setting called the single positive setting, where each data instance is annotated by only one positive label with no explicit negative labels. To generate multi-label data in the single positive setting, we propose a novel sampling approach called single-to-multi-label (S2M) sampling, based on the Markov chain Monte Carlo method. As a widely applicable "add-on" method, our proposed S2M sampling method enables existing unconditional and conditional GANs to draw high-quality multi-label data with a minimal annotation cost. Extensive experiments on real image datasets verify the effectiveness and correctness of our method, even when compared to a model trained with fully annotated datasets.
    Integrating Statistical and Machine Learning Approaches to Identify Receptive Field Structure in Neural Populations. (arXiv:2208.12025v2 [q-bio.NC] UPDATED)
    Neurons can code for multiple variables simultaneously and neuroscientists are often interested in classifying neurons based on their receptive field properties. Statistical models provide powerful tools for determining the factors influencing neural spiking activity and classifying individual neurons. However, as neural recording technologies have advanced to produce simultaneous spiking data from massive populations, classical statistical methods often lack the computational efficiency required to handle such data. Machine learning (ML) approaches are known for enabling efficient large scale data analyses; however, they typically require massive training sets with balanced data, along with accurate labels to fit well. Additionally, model assessment and interpretation are often more challenging for ML than for classical statistical methods. To address these challenges, we develop an integrated framework, combining statistical modeling and machine learning approaches to identify the coding properties of neurons from large populations. In order to demonstrate this framework, we apply these methods to data from a population of neurons recorded from rat hippocampus to characterize the distribution of spatial receptive fields in this region.
    Forecasting Graph Signals with Recursive MIMO Graph Filters. (arXiv:2210.15258v1 [eess.SP])
    Forecasting time series on graphs is a fundamental problem in graph signal processing. When each entity of the network carries a vector of values for each time stamp instead of a scalar one, existing approaches resort to the use of product graphs to combine this multidimensional information, at the expense of creating a larger graph. In this paper, we show the limitations of such approaches, and propose extensions to tackle them. Then, we propose a recursive multiple-input multiple-output graph filter which encompasses many existing models in the literature while being more flexible. Numerical simulations on a real-world data set show the effectiveness of the proposed models.
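    One illustrative form of such a filter (our notation; the paper's exact recursion may differ): graph-shift powers S^k mix information over the graph, matrices A_k and B_k mix the signal dimensions, and past outputs feed back, making the filter recursive (ARMA-like).
        import numpy as np

        def mimo_graph_filter_step(S, X_t, Y_prev, A, B):
            # S: (n, n) graph shift; X_t, Y_prev: (n, F) float arrays;
            # A, B: lists of (F, F) tap matrices, one pair per order k.
            Sx, Sy, Y_t = X_t, Y_prev, np.zeros_like(X_t)
            for A_k, B_k in zip(A, B):
                Y_t += Sx @ B_k + Sy @ A_k   # order-k input and feedback taps
                Sx, Sy = S @ Sx, S @ Sy      # next graph-shift power
            return Y_t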
    Opening the Black Box of wav2vec Feature Encoder. (arXiv:2210.15386v1 [cs.SD])
    Self-supervised models, namely wav2vec and its variants, have shown promising results in various downstream tasks in the speech domain. However, their inner workings are poorly understood, calling for in-depth analyses of what the model learns. In this paper, we concentrate on the convolutional feature encoder, whose latent space is often speculated to represent discrete acoustic units. To analyze the embedding space in a reductive manner, we feed in synthesized audio signals, each a summation of simple sine waves. Through extensive experiments, we conclude that various information is embedded inside the feature encoder representations: (1) fundamental frequency, (2) formants, and (3) amplitude, packed with (4) sufficient temporal detail. Further, the information incorporated inside the latent representations is analogous to spectrograms but with a fundamental difference: latent representations construct a metric space so that closer representations imply acoustic similarity.
    Provable Sim-to-real Transfer in Continuous Domain with Partial Observations. (arXiv:2210.15598v1 [cs.LG])
    Sim-to-real transfer trains RL agents in the simulated environments and then deploys them in the real world. Sim-to-real transfer has been widely used in practice because it is often cheaper, safer and much faster to collect samples in simulation than in the real world. Despite the empirical success of the sim-to-real transfer, its theoretical foundation is much less understood. In this paper, we study the sim-to-real transfer in continuous domain with partial observations, where the simulated environments and real-world environments are modeled by linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive to the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.
    KALMANBOT: KalmanNet-Aided Bollinger Bands for Pairs Trading. (arXiv:2210.15448v1 [q-fin.TR])
    Pairs trading is a family of trading policies based on monitoring the relationships between pairs of assets. A common pairs trading approach relies on state space (SS) modeling, from which financial indicators can be obtained with low complexity and latency using a Kalman filter (KF), and processed using classic policies such as Bollinger bands (BB). However, such SS models are inherently approximated and mismatched, often degrading the revenue. In this work we propose KalmanBOT, a data-aided policy that preserves the advantages of KF-aided BB policies while leveraging data to overcome the approximated nature of the SS model. We adopt the recent KalmanNet architecture, and approximate the BB policy with a differentiable mapping, converting the policy into a trainable model. We empirically demonstrate that KalmanBOT yields improved rewards compared with model-based and data-driven benchmarks.
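    For context, a purely model-based KF-plus-BB baseline, which KalmanBOT replaces with trainable components, might look like this sketch (the scalar state model, noise levels, and band parameters are illustrative):
        import numpy as np

        def kf_bollinger(spread, q=1e-4, r=1e-2, k=2.0, window=20):
            # spread: sequence of observed spreads between the two assets.
            m, p, means = 0.0, 1.0, []
            pos, positions = 0, []
            for z in spread:
                p += q                                # predict state variance
                g = p / (p + r)                       # Kalman gain
                m += g * (z - m)                      # update filtered spread
                p *= (1 - g)
                means.append(m)
                hist = np.array(means[-window:])
                mu, sd = hist.mean(), hist.std() + 1e-8
                if m > mu + k * sd:   pos = -1        # spread too high: short it
                elif m < mu - k * sd: pos = +1        # spread too low: long it
                elif abs(m - mu) < 0.5 * sd: pos = 0  # reverted to mean: close
                positions.append(pos)
            return np.array(positions)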
    Adaptive Estimation of $\text{MTP}_2$ Graphical Models. (arXiv:2210.15471v1 [stat.ML])
    We consider the problem of estimating (diagonally dominant) M-matrices as precision matrices in Gaussian graphical models. Such models have received increasing attention in recent years, and have shown interesting properties, e.g., the maximum likelihood estimator exists with as little as two observations regardless of the underlying dimension. In this paper, we propose an adaptive estimation method, which consists of multiple stages: In the first stage, we solve an $\ell_1$-regularized maximum likelihood estimation problem, which leads to an initial estimate; in the subsequent stages, we iteratively refine the initial estimate by solving a sequence of weighted $\ell_1$-regularized problems. We further establish the theoretical guarantees on the estimation error, which consists of optimization error and statistical error. The optimization error decays to zero at a linear rate, indicating that the estimate is refined iteratively in subsequent stages, and the statistical error characterizes the statistical rate. The proposed method outperforms state-of-the-art methods in estimating precision matrices and identifying graph edges, as evidenced by synthetic and financial time-series data sets.
    Efficient Learning of Decision-Making Models: A Penalty Block Coordinate Descent Algorithm for Data-Driven Inverse Optimization. (arXiv:2210.15393v1 [math.OC])
    Decision-making problems are commonly formulated as optimization problems, which are then solved to make optimal decisions. In this work, we consider the inverse problem where we use prior decision data to uncover the underlying decision-making process in the form of a mathematical optimization model. This statistical learning problem is referred to as data-driven inverse optimization. We focus on problems where the underlying decision-making process is modeled as a convex optimization problem whose parameters are unknown. We formulate the inverse optimization problem as a bilevel program and propose an efficient block coordinate descent-based algorithm to solve large problem instances. Numerical experiments on synthetic datasets demonstrate the computational advantage of our method compared to standard commercial solvers. Moreover, the real-world utility of the proposed approach is highlighted through two realistic case studies in which we consider estimating risk preferences and learning local constraint parameters of agents in a multiplayer Nash bargaining game.
    Post trade allocation: how much are bunched orders costing your performance? (arXiv:2210.15499v1 [q-fin.TR])
    Individual trade orders are often bunched into a block order for processing efficiency and then, post execution, allocated into individual accounts. Since regulators have not mandated any specific post-trade allocation practice or methodology, entities try to rigorously follow internal policies and procedures to meet the minimum regulatory ask of being procedurally fair and equitable. However, as many have found over the years, there is no simple solution for post-trade allocation between accounts that results in a uniform distribution of returns. Furthermore, in many instances, the divergences between returns do not dissipate with more transactions, and tend to increase in some cases. This paper is the first systematic treatment of trade allocation risk. We shed light on the reasons for return divergence among accounts, and we present a solution that supports uniform allocation of return irrespective of the number of accounts and trade sizes.
    HEiMDaL: Highly Efficient Method for Detection and Localization of wake-words. (arXiv:2210.15425v1 [eess.AS])
    Streaming keyword spotting is a widely used solution for activating voice assistants. Deep Neural Network with Hidden Markov Model (DNN-HMM) based methods have proven to be efficient and widely adopted in this space, primarily because of the ability to detect and identify the start and end of the wake-up word at low compute cost. However, such hybrid systems suffer from loss-metric mismatch when the DNN and HMM are trained independently. Sequence discriminative training cannot fully mitigate the loss-metric mismatch due to the inherently Markovian style of the operation. We propose a low-footprint CNN model, called HEiMDaL, to detect and localize keywords in streaming conditions. We introduce an alignment-based classification loss to detect the occurrence of the keyword along with an offset loss to predict the start of the keyword. HEiMDaL shows a 73% reduction in detection metrics along with equivalent localization accuracy and the same memory footprint as existing DNN-HMM style models for a given wake-word.
    Multi-layered Discriminative Restricted Boltzmann Machine with Untrained Probabilistic Layer. (arXiv:2210.15434v1 [cs.LG])
    An extreme learning machine (ELM) is a three-layered feed-forward neural network having untrained parameters, which are randomly determined before training. Inspired by the idea of ELM, a probabilistic untrained layer called a probabilistic-ELM (PELM) layer is proposed, and it is combined with a discriminative restricted Boltzmann machine (DRBM), a probabilistic three-layered neural network for solving classification problems. The proposed model is obtained by stacking a DRBM on the PELM layer, and the resultant multi-layered DRBM (MDRBM) forms a probabilistic four-layered neural network. In MDRBM, the parameters in the PELM layer can be determined using a Gaussian-Bernoulli restricted Boltzmann machine. Owing to the PELM layer, MDRBM acquires strong immunity to input noise, which is one of its most important advantages. Numerical experiments on benchmark datasets, MNIST, Fashion-MNIST, Urban Land Cover, and CIFAR-10, demonstrate that MDRBM is superior to other existing models, particularly in terms of noise robustness (in other words, generalization).
    Grokking phase transitions in learning local rules with gradient descent. (arXiv:2210.15435v1 [cond-mat.stat-mech])
    We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular automata learning task, numerically determine the critical exponent and the grokking time distributions and compare them with the prediction of the proposed grokking model. Finally, we numerically analyse the connection between structure formation and grokking.
    Predicting Non-Fungible Token (NFT) Collections: A Contextual Generative Approach. (arXiv:2210.15493v1 [q-fin.CP])
    Non-fungible tokens (NFTs) are digital assets stored on a blockchain that represent real-world objects such as art or collectibles. It is a multibillion-dollar market, in which the number of NFT collections grew by over 100% in 2022; there are currently more than 80K collections on the Ethereum blockchain. Each collection, containing numerous tokens of a particular theme, has its own unique characteristics. In this paper, we take a contextual generative approach that learns these diverse characteristics of NFT collections and generates potential market value predictions for newly minted ones. We model NFTs as a series of transactions. First, meaningful contexts capturing the characteristics of various collections are derived using unsupervised learning. Next, our generative approach leverages these contexts to learn better characterizations of established NFT collections with differing market capitalization values. Finally, given a new collection in an early stage, the approach generates future transaction series for this emerging collection. Comprehensive experiments demonstrate that our approach closely predicts the potential value of NFT collections.
    COFFEE: Counterfactual Fairness for Personalized Text Generation in Explainable Recommendation. (arXiv:2210.15500v1 [cs.CL])
    Personalized text generation has broad industrial applications, such as explanation generation for recommendations, conversational systems, etc. Personalized text generators are usually trained on user-written text, e.g., reviews collected on e-commerce platforms. However, due to historical, social, or behavioral reasons, there may exist bias that associates certain linguistic qualities of user-written text with the users' protected attributes such as gender and race. The generators can identify and inherit these correlations and generate text that discriminates with respect to the users' protected attributes. Without proper intervention, such bias can adversarially influence the users' trust and reliance on the system. From a broader perspective, bias in auto-generated content can, through interactions with users, reinforce social stereotypes about how online users write. In this work, we investigate the fairness of personalized text generation in the setting of explainable recommendation. We develop a general framework for achieving measure-specific counterfactual fairness on the linguistic quality of personalized explanations. We propose learning disentangled representations for counterfactual inference and develop a novel policy learning algorithm with carefully designed rewards for fairness optimization. The framework can be applied to achieve fairness on any given specification of linguistic quality measures and can be adapted to most existing models and real-world settings. Extensive experiments demonstrate the superior ability of our method to achieve fairness while maintaining high generation performance.
    A Neural Network Based Automated IFT-20 Sensory Neuron Classifier for Caenorhabditis elegans. (arXiv:2210.14961v1 [q-bio.NC])
    Determining neuronal identity in imaging data is an essential task in neuroscience, facilitating the comparison of neural activity across organisms. Cross-organism comparison, in turn, enables a wide variety of research including whole-brain analysis of functional networks and linking the activity of specific neurons to behavior or environmental stimuli. The recent development of three-dimensional, pan-neuronal imaging with single-cell resolution within Caenorhabditis elegans has brought neuron identification, tracking, and activity monitoring all within reach. The nematode C. elegans is often used as a model organism to study neuronal activity due to factors such as its transparency and well-understood nervous system. The principal barrier to high-accuracy neuron identification is that in adult C. elegans, the position of neuronal cell bodies is not stereotyped. Existing approaches to address this issue use genetically encoded markers as an additional identifying feature. For example, the NeuroPAL strain uses multicolored fluorescent reporters. However, this approach has limited use due to the negative effects of excessive genetic modification. In this study, I propose an alternative neuronal identification technique using only single-color fluorescent images. I designed a novel neural network based classifier that automatically labels sensory neurons using an iterative, landmark-based neuron identification process inspired by the manual annotation procedures that humans employ. This design labels sensory neurons in C. elegans with 91.61% accuracy.
    Implications of sparsity and high triangle density for graph representation learning. (arXiv:2210.15277v1 [stat.ML])
    Recent work has shown that sparse graphs containing many triangles cannot be reproduced using a finite-dimensional representation of the nodes, in which link probabilities are inner products. Here, we show that such graphs can be reproduced using an infinite-dimensional inner product model, where the node representations lie on a low-dimensional manifold. Recovering a global representation of the manifold is impossible in a sparse regime. However, we can zoom in on local neighbourhoods, where a lower-dimensional representation is possible. As our constructions allow the points to be uniformly distributed on the manifold, we find evidence against the common perception that triangles imply community structure.
    Stochastic Mirror Descent in Average Ensemble Models. (arXiv:2210.15323v1 [cs.LG])
    The stochastic mirror descent (SMD) algorithm is a general class of training algorithms, which includes the celebrated stochastic gradient descent (SGD), as a special case. It utilizes a mirror potential to influence the implicit bias of the training algorithm. In this paper we explore the performance of the SMD iterates on mean-field ensemble models. Our results generalize earlier ones obtained for SGD on such models. The evolution of the distribution of parameters is mapped to a continuous time process in the space of probability distributions. Our main result gives a nonlinear partial differential equation to which the continuous time process converges in the asymptotic regime of large networks. The impact of the mirror potential appears through a multiplicative term that is equal to the inverse of its Hessian and which can be interpreted as defining a gradient flow over an appropriately defined Riemannian manifold. We provide numerical simulations which allow us to study and characterize the effect of the mirror potential on the performance of networks trained with SMD for some binary classification problems.
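    The update itself is easy to state concretely. Below is a minimal sketch (our own illustration, not the paper's code) of the SMD step $w_{t+1} = (\nabla\psi)^{-1}(\nabla\psi(w_t) - \eta g_t)$: a quadratic potential makes the mirror map the identity and recovers plain SGD, while the negative-entropy potential on the simplex yields exponentiated gradient.

        import numpy as np

        def smd_step(w, grad, lr, grad_psi, grad_psi_inv):
            """One SMD update: w_next = (grad psi)^{-1}(grad psi(w) - lr * grad)."""
            return grad_psi_inv(grad_psi(w) - lr * grad)

        # Quadratic potential psi(w) = ||w||^2 / 2: mirror map is the identity,
        # so the update reduces to plain SGD.
        sgd = dict(grad_psi=lambda w: w, grad_psi_inv=lambda z: z)

        # Negative-entropy potential on the simplex: the update becomes
        # exponentiated gradient, w_next = w * exp(-lr * grad), renormalized.
        eg = dict(grad_psi=lambda w: np.log(w) + 1.0,
                  grad_psi_inv=lambda z: np.exp(z - 1.0) / np.exp(z - 1.0).sum())

        w = np.ones(4) / 4
        g = np.array([0.1, -0.2, 0.05, 0.0])
        print(smd_step(w, g, lr=0.5, **sgd))   # Euclidean step
        print(smd_step(w, g, lr=0.5, **eg))    # simplex step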
    Meta-Learning Initializations for Interactive Medical Image Registration. (arXiv:2210.15371v1 [eess.IV])
    We present a meta-learning framework for interactive medical image registration. Our proposed framework comprises three components: a learning-based medical image registration algorithm, a form of user interaction that refines registration at inference, and a meta-learning protocol that learns a rapidly adaptable network initialization. This paper describes a specific algorithm that implements the registration, interaction and meta-learning protocol for our exemplar clinical application: registration of magnetic resonance (MR) imaging to interactively acquired, sparsely-sampled transrectal ultrasound (TRUS) images. Our approach obtains comparable registration error (4.26 mm) to the best-performing non-interactive learning-based 3D-to-3D method (3.97 mm) while requiring only a fraction of the data and running in real time during acquisition. Applying sparsely sampled data to non-interactive methods yields higher registration errors (6.26 mm), demonstrating the effectiveness of interactive MR-TRUS registration, which may be applied intraoperatively given the real-time nature of the adaptation process.
    Efficient Use of Large Pre-Trained Models for Low Resource ASR. (arXiv:2210.15445v1 [eess.AS])
    Automatic speech recognition (ASR) has been established as a well-performing technique for many scenarios where large amounts of labeled data are available. Additionally, unsupervised representation learning has recently helped to tackle tasks with limited data. Following this, hardware limitations and applications give rise to the question of how to efficiently take advantage of large pretrained models and reduce their complexity for downstream tasks. In this work, we study a challenging low-resource conversational telephony speech corpus from the medical domain in Vietnamese and German. We show the benefits of using unsupervised techniques beyond simple fine-tuning of large pre-trained models, discuss how to adapt them to a practical telephony task including bandwidth transfer, and investigate different data conditions for pre-training and fine-tuning. We outperform the project baselines by 22% relative using pretraining techniques. Further gains of 29% can be achieved by refinements of architecture and training, and 6% by adding 0.8 h of in-domain adaptation data.
    Li3DeTr: A LiDAR based 3D Detection Transformer. (arXiv:2210.15365v1 [cs.CV])
    Inspired by recent advances in vision transformers for object detection, we propose Li3DeTr, an end-to-end LiDAR based 3D Detection Transformer for autonomous driving that takes LiDAR point clouds as input and regresses 3D bounding boxes. The LiDAR local and global features are encoded using sparse convolution and multi-scale deformable attention, respectively. In the decoder head, firstly, in the novel Li3DeTr cross-attention block, we link the LiDAR global features to 3D predictions leveraging the sparse set of object queries learnt from the data. Secondly, the object query interactions are formulated using multi-head self-attention. Finally, the decoder layer is repeated $L_{dec}$ times to refine the object queries. Inspired by DETR, we employ a set-to-set loss to train the Li3DeTr network. Without bells and whistles, the Li3DeTr network achieves 61.3% mAP and 67.6% NDS, surpassing state-of-the-art methods with non-maximum suppression (NMS) on the nuScenes dataset, and it also achieves competitive performance on the KITTI dataset. We also employ knowledge distillation (KD) using teacher and student models, which slightly improves the performance of our network.
    Structuring User-Generated Content on Social Media with Multimodal Aspect-Based Sentiment Analysis. (arXiv:2210.15377v1 [cs.IR])
    People post their opinions and experiences on social media, yielding rich databases of end users' sentiments. This paper shows to what extent machine learning can analyze and structure these databases. An automated data analysis pipeline is deployed to provide insights into user-generated content for researchers in other domains. First, the domain expert can select an image and a term of interest. Then, the pipeline uses image retrieval to find all images showing similar content and applies aspect-based sentiment analysis to outline users' opinions about the selected term. As part of an interdisciplinary project between architecture and computer science researchers, an empirical study of Hamburg's Elbphilharmonie was conducted on 300 thousand posts from the platform Flickr with the hashtag 'hamburg'. Image retrieval methods generated a subset of slightly more than 1.5 thousand images displaying the Elbphilharmonie. We found that these posts mainly convey a neutral or positive sentiment towards it. With this pipeline, we suggest a new big data analysis method that offers new insights into end-users' opinions, e.g., for architecture domain experts.
    Local Graph-homomorphic Processing for Privatized Distributed Systems. (arXiv:2210.15414v1 [cs.CR])
    We study the generation of dependent random numbers in a distributed fashion in order to enable privatized distributed learning by networked agents. We propose a method that we refer to as local graph-homomorphic processing; it relies on the construction of particular noises over the edges to ensure a certain level of differential privacy. We show that the added noise does not affect the performance of the learned model. This is a significant improvement to previous works on differential privacy for distributed algorithms, where the noise was added in a less structured manner without respecting the graph topology and has often led to performance deterioration. We illustrate the theoretical results by considering a linear regression problem over a network of agents.
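    One plausible way to realize graph-respecting, cancelling noise (a sketch of the cancellation idea only; the paper's actual construction and its differential-privacy calibration may differ) is to draw one noise value per edge and apply it antisymmetrically at its two endpoints, so the noises cancel exactly in any network-wide aggregate:

        import numpy as np

        rng = np.random.default_rng(1)

        def edge_cancelling_noise(edges, n_agents, scale=1.0):
            """Draw a noise per edge; agent i perturbs its message by +n_ij and
            agent j by -n_ij, so the noise sums to zero over the network."""
            noise = np.zeros(n_agents)
            for (i, j) in edges:
                n_ij = rng.normal(0.0, scale)
                noise[i] += n_ij
                noise[j] -= n_ij
            return noise

        edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
        models = np.array([1.0, 2.0, 3.0, 4.0])   # local parameters (scalars)
        noisy = models + edge_cancelling_noise(edges, 4)
        # Individual messages are privatized, but the global average is intact.
        print(noisy.mean(), "==", models.mean())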
    Sample-Specific Root Causal Inference with Latent Variables. (arXiv:2210.15340v1 [stat.ML])
    Root causal analysis seeks to identify the set of initial perturbations that induce an unwanted outcome. In prior work, we defined sample-specific root causes of disease using exogenous error terms that predict a diagnosis in a structural equation model. We rigorously quantified predictivity using Shapley values. However, the associated algorithms for inferring root causes assume no latent confounding. We relax this assumption by permitting confounding among the predictors. We then introduce a corresponding procedure called Extract Errors with Latents (EEL) for recovering the error terms up to contamination by vertices on certain paths under the linear non-Gaussian acyclic model. EEL also identifies the smallest sets of dependent errors for fast computation of the Shapley values. The algorithm bypasses the hard problem of estimating the underlying causal graph in both cases. Experiments highlight the superior accuracy and robustness of EEL relative to its predecessors.
    Dynamic Survival Transformers for Causal Inference with Electronic Health Records. (arXiv:2210.15417v1 [cs.LG])
    In medicine, researchers often seek to infer the effects of a given treatment on patients' outcomes. However, the standard methods for causal survival analysis make simplistic assumptions about the data-generating process and cannot capture complex interactions among patient covariates. We introduce the Dynamic Survival Transformer (DynST), a deep survival model that trains on electronic health records (EHRs). Unlike previous transformers used in survival analysis, DynST can make use of time-varying information to predict evolving survival probabilities. We derive a semi-synthetic EHR dataset from MIMIC-III to show that DynST can accurately estimate the causal effect of a treatment intervention on restricted mean survival time (RMST). We demonstrate that DynST achieves better predictive and causal estimation than two alternative models.
    Exploring Predictive Uncertainty and Calibration in NLP: A Study on the Impact of Method & Data Scarcity. (arXiv:2210.15452v1 [cs.CL])
    We investigate the problem of determining the predictive confidence (or, conversely, uncertainty) of a neural classifier through the lens of low-resource languages. By training models on sub-sampled datasets in three different languages, we assess the quality of estimates from a wide array of approaches and their dependence on the amount of available data. We find that while approaches based on pre-trained models and ensembles achieve the best results overall, the quality of uncertainty estimates can surprisingly suffer with more data. We also perform a qualitative analysis of uncertainties on sequences, discovering that a model's total uncertainty seems to be influenced to a large degree by its data uncertainty, not model uncertainty. All model implementations are open-sourced in a software package.
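    The data/model split referred to above is commonly computed from an ensemble via the standard entropy decomposition: total (predictive) uncertainty equals the expected member entropy (data uncertainty) plus the mutual information between prediction and model (model uncertainty). A minimal sketch, assuming an ensemble of softmax outputs for a single input:

        import numpy as np

        def entropy(p, axis=-1, eps=1e-12):
            return -(p * np.log(p + eps)).sum(axis=axis)

        def uncertainty_decomposition(member_probs):
            """member_probs: (n_members, n_classes) softmax outputs."""
            mean_p = member_probs.mean(axis=0)
            total = entropy(mean_p)              # predictive (total) uncertainty
            data = entropy(member_probs).mean()  # expected entropy: data part
            model = total - data                 # mutual information: model part
            return total, data, model

        probs = np.array([[0.7, 0.2, 0.1],
                          [0.6, 0.3, 0.1],
                          [0.8, 0.1, 0.1]])
        print(uncertainty_decomposition(probs))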
    DAGKT: Difficulty and Attempts Boosted Graph-based Knowledge Tracing. (arXiv:2210.15470v1 [cs.CY])
    In the field of intelligent education, knowledge tracing (KT) has attracted increasing attention; it estimates and traces students' mastery of knowledge concepts to provide high-quality education. In KT, there are natural graph structures among questions and knowledge concepts, so some studies have explored the application of graph neural networks (GNNs) to improve on KT models that do not use graph structure. However, most of them ignored both the questions' difficulties and students' attempts at questions. In reality, questions with the same knowledge concepts have different difficulties, and students' different attempts also reflect different levels of knowledge mastery. In this paper, we propose a difficulty and attempts boosted graph-based KT (DAGKT), using the rich information in students' records. Moreover, a novel method is designed to establish the question similarity relationship, inspired by the F1 score. Extensive experiments on three real-world datasets demonstrate the effectiveness of the proposed DAGKT.
    Low-Rank Modular Reinforcement Learning via Muscle Synergy. (arXiv:2210.15479v1 [cs.LG])
    Modular Reinforcement Learning (RL) decentralizes the control of multi-joint robots by learning policies for each actuator. Previous work on modular RL has proven its ability to control morphologically different agents with a shared actuator policy. However, with the increase in the Degree of Freedom (DoF) of robots, training a morphology-generalizable modular controller becomes exponentially difficult. Motivated by the way the human central nervous system controls numerous muscles, we propose a Synergy-Oriented LeARning (SOLAR) framework that exploits the redundant nature of DoF in robot control. Actuators are grouped into synergies by an unsupervised learning method, and a synergy action is learned to control multiple actuators in synchrony. In this way, we achieve a low-rank control at the synergy level. We extensively evaluate our method on a variety of robot morphologies, and the results show its superior efficiency and generalizability, especially on robots with a large DoF like Humanoids++ and UNIMALs.
    Spatio-Temporal Hybrid Fusion of CAE and SWIn Transformers for Lung Cancer Malignancy Prediction. (arXiv:2210.15297v1 [eess.IV])
    The paper proposes a novel hybrid discovery Radiomics framework that simultaneously integrates temporal and spatial features extracted from non-thin chest Computed Tomography (CT) slices to predict Lung Adenocarcinoma (LUAC) malignancy with minimum expert involvement. Lung cancer is the leading cause of mortality from cancer worldwide and has various histologic types, among which LUAC has recently been the most prevalent. LUACs are classified as pre-invasive, minimally invasive, and invasive adenocarcinomas. Timely and accurate knowledge of the malignancy of lung nodules leads to a proper treatment plan and reduces the risk of unnecessary or late surgeries. Currently, chest CT scan is the primary imaging modality to assess and predict the invasiveness of LUACs. However, the radiologists' analysis based on CT images is subjective and suffers from a low accuracy compared to the ground truth pathological reviews provided after surgical resections. The proposed hybrid framework, referred to as the CAET-SWin, consists of two parallel paths: (i) the Convolutional Auto-Encoder (CAE) Transformer path, which extracts and captures informative features related to inter-slice relations via a modified Transformer architecture, and (ii) the Shifted Window (SWin) Transformer path, a hierarchical vision transformer that extracts nodules' related spatial features from a volumetric CT scan. The extracted temporal (from the CAET path) and spatial (from the SWin path) features are then fused through a fusion path to classify LUACs. Experimental results on our in-house dataset of 114 pathologically proven Sub-Solid Nodules (SSNs) demonstrate that the CAET-SWin significantly improves the reliability of the invasiveness prediction task while achieving an accuracy of 82.65%, sensitivity of 83.66%, and specificity of 81.66% using 10-fold cross-validation.
    A few-shot learning approach with domain adaptation for personalized real-life stress detection in close relationships. (arXiv:2210.15247v1 [cs.LG])
    We design a metric learning approach that aims to address computational challenges arising from modeling human outcomes using ambulatory real-life data. The proposed metric learning is based on a Siamese neural network (SNN) that learns the relative difference between pairs of samples from a target user and non-target users, thus addressing the scarcity of labelled data from the target. The SNN further minimizes the Wasserstein distance between the learned embeddings of target and non-target users, thus mitigating the distribution mismatch between the two. Finally, given that the base rate of focal behaviors differs per user, the proposed method approximates the focal base rate using the labelled samples that lie closest to the target, based on which it further minimizes the Wasserstein distance. Our method is exemplified for the purpose of hourly stress classification using real-life multimodal data from 72 dating couples. Results in few-shot and one-shot learning experiments indicate that the proposed formulation benefits stress classification and can help mitigate the aforementioned challenges.
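    As a sketch of the pairwise part of this objective (PyTorch; the Wasserstein and base-rate terms are omitted, and the class and function names are our own), a shared encoder embeds both samples of a pair and a contrastive loss encodes the relative difference between target and non-target pairs:

        import torch
        import torch.nn as nn

        class SiameseEncoder(nn.Module):
            """Shared encoder applied to both elements of a pair."""
            def __init__(self, in_dim, emb_dim=16):
                super().__init__()
                self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                         nn.Linear(64, emb_dim))
            def forward(self, x):
                return self.net(x)

        def contrastive_loss(z1, z2, same_user, margin=1.0):
            """Pull same-(target-)user pairs together, push others apart."""
            d = torch.norm(z1 - z2, dim=1)
            pos = same_user * d.pow(2)
            neg = (1 - same_user) * torch.clamp(margin - d, min=0).pow(2)
            return (pos + neg).mean()

        enc = SiameseEncoder(in_dim=8)
        x1, x2 = torch.randn(32, 8), torch.randn(32, 8)
        same = torch.randint(0, 2, (32,)).float()
        loss = contrastive_loss(enc(x1), enc(x2), same)
        loss.backward()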
    On the Approximation and Complexity of Deep Neural Networks to Invariant Functions. (arXiv:2210.15279v1 [cs.LG])
    Recent years have witnessed a wave of deep neural networks in various domains; however, they are not yet well understood theoretically. A theoretical characterization of deep neural networks should point out their approximation ability and complexity, i.e., show which architecture and size are sufficient to handle the concerned tasks. This work takes one step in this direction by theoretically studying the approximation and complexity of deep neural networks for invariant functions. We first prove that invariant functions can be universally approximated by deep neural networks. Then we show that a broad range of invariant functions can be asymptotically approximated by various types of neural network models, including complex-valued neural networks, convolutional neural networks, and Bayesian neural networks, using a polynomial number of parameters or optimization iterations. We also provide a feasible application that connects the parameter estimation and forecasting of high-resolution signals with our theoretical conclusions. The empirical results obtained on simulation experiments demonstrate the effectiveness of our method.
    MSF3DDETR: Multi-Sensor Fusion 3D Detection Transformer for Autonomous Driving. (arXiv:2210.15316v1 [cs.CV])
    3D object detection is a significant task for autonomous driving. Recently, with the progress of vision transformers, the 2D object detection problem has been treated with a set-to-set loss. Inspired by these approaches to 2D object detection and by DETR3D, an approach to multi-view 3D object detection, we propose MSF3DDETR: a Multi-Sensor Fusion 3D Detection Transformer architecture that fuses image and LiDAR features to improve detection accuracy. Our end-to-end single-stage, anchor-free and NMS-free network takes in multi-view images and LiDAR point clouds and predicts 3D bounding boxes. Firstly, we link the object queries learnt from data to the image and LiDAR features using a novel MSF3DDETR cross-attention block. Secondly, the object queries interact with each other in a multi-head self-attention block. Finally, the MSF3DDETR block is repeated $L$ times to refine the object queries. The MSF3DDETR network is trained end-to-end on the nuScenes dataset using Hungarian-algorithm-based bipartite matching and the set-to-set loss inspired by DETR. We present both quantitative and qualitative results which are competitive with state-of-the-art approaches.
    How well can Text-to-Image Generative Models understand Ethical Natural Language Interventions?. (arXiv:2210.15230v1 [cs.CL])
    Text-to-image generative models have achieved unprecedented success in generating high-quality images based on natural language descriptions. However, it has been shown that these models tend to favor specific social groups when prompted with neutral text descriptions (e.g., 'a photo of a lawyer'). Following Zhao et al. (2021), we study the effect on the diversity of the generated images when adding an ethical intervention that supports equitable judgment (e.g., 'if all individuals can be a lawyer irrespective of their gender') to the input prompts. To this end, we introduce an Ethical NaTural Language Interventions in Text-to-Image GENeration (ENTIGEN) benchmark dataset to evaluate the change in image generations conditional on ethical interventions across three social axes -- gender, skin color, and culture. Through the ENTIGEN framework, we find that the generations from minDALL.E, DALL.E-mini and Stable Diffusion cover diverse social groups while preserving the image quality. Preliminary studies indicate that a large change in the model predictions is triggered by certain phrases such as 'irrespective of gender' in the context of gender bias in the ethical interventions. We release code and annotated data at https://github.com/Hritikbansal/entigen_emnlp.
    Modeling Inter-Dependence Between Time and Mark in Multivariate Temporal Point Processes. (arXiv:2210.15294v1 [cs.LG])
    Temporal Point Processes (TPP) are probabilistic generative frameworks. They model discrete event sequences localized in continuous time. Generally, real-life events reveal descriptive information, known as marks. Marked TPPs model the time and mark of an event together for practical relevance. Conditioned on past events, marked TPPs aim to learn the joint distribution of the time and the mark of the next event. For simplicity, conditionally independent TPP models assume time and marks are independent given the event history. They factorize the conditional joint distribution of time and mark into the product of individual conditional distributions. This structural limitation in the design of TPP models hurts the predictive performance on entangled time and mark interactions. In this work, we model the conditional inter-dependence of time and mark to overcome the limitations of conditionally independent models. We construct a multivariate TPP that conditions the time distribution on the current event mark in addition to past events. Besides the conventional intensity-based models for the conditional joint distribution, we also draw on flexible intensity-free TPP models from the literature. The proposed TPP models outperform conditionally independent and dependent models in standard prediction tasks. Our experimentation on various datasets with multiple evaluation metrics highlights the merit of the proposed approach.
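    In symbols, with event history $\mathcal{H}$, the contrast between the two designs (our own restatement of the factorizations described above) is:

        % conditionally independent factorization (the limitation):
        p(t, m \mid \mathcal{H}) = p(t \mid \mathcal{H}) \, p(m \mid \mathcal{H})
        % proposed inter-dependent factorization (time conditioned on the mark):
        p(t, m \mid \mathcal{H}) = p(m \mid \mathcal{H}) \, p(t \mid m, \mathcal{H})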
    GammaE: Gamma Embeddings for Logical Queries on Knowledge Graphs. (arXiv:2210.15578v1 [cs.LG])
    Embedding knowledge graphs (KGs) for multi-hop logical reasoning is a challenging problem due to the massive and complicated structures of many KGs. Recently, many promising works have projected entities and queries into a geometric space to efficiently find answers. However, it remains challenging to model the negation and union operators. The negation operator has no strict boundaries, which generates overlapped embeddings and leads to ambiguous answers. An additional limitation is that the union operator is not closed, which undermines the model's ability to handle a series of union operators. To address these problems, we propose a novel probabilistic embedding model, namely Gamma Embeddings (GammaE), for encoding entities and queries to answer different types of FOL queries on KGs. We utilize the linear property and strong boundary support of the Gamma distribution to capture more features of entities and queries, which dramatically reduces model uncertainty. Furthermore, GammaE implements the Gamma mixture method to design a closed union operator. The performance of GammaE is validated on three large logical query datasets. Experimental results show that GammaE significantly outperforms state-of-the-art models on public benchmarks.
    Differentially Private Generative Adversarial Networks with Model Inversion. (arXiv:2201.03139v2 [cs.LG] UPDATED)
    To protect sensitive data in training a Generative Adversarial Network (GAN), the standard approach is to use differentially private (DP) stochastic gradient descent method in which controlled noise is added to the gradients. The quality of the output synthetic samples can be adversely affected and the training of the network may not even converge in the presence of these noises. We propose Differentially Private Model Inversion (DPMI) method where the private data is first mapped to the latent space via a public generator, followed by a lower-dimensional DP-GAN with better convergent properties. Experimental results on standard datasets CIFAR10 and SVHN as well as on a facial landmark dataset for Autism screening show that our approach outperforms the standard DP-GAN method based on Inception Score, Fr\'echet Inception Distance, and classification accuracy under the same privacy guarantee.
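    For reference, the standard DP-SGD step that DPMI builds on (clip each per-sample gradient, average, add Gaussian noise) can be sketched as follows; this illustrates only the baseline mechanism, not DPMI's latent-space mapping, and the noise calibration shown is schematic rather than a formal privacy accounting:

        import numpy as np

        def dp_sgd_grad(per_sample_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
            """Clip each per-sample gradient to clip_norm, average, then add
            Gaussian noise scaled by noise_mult (schematic calibration)."""
            rng = rng or np.random.default_rng()
            g = np.asarray(per_sample_grads, dtype=float)
            norms = np.linalg.norm(g, axis=1, keepdims=True)
            clipped = g * np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
            mean = clipped.mean(axis=0)
            noise = rng.normal(0.0, noise_mult * clip_norm / len(g),
                               size=mean.shape)
            return mean + noise

        grads = np.random.randn(64, 10)   # one gradient row per example
        print(dp_sgd_grad(grads).shape)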
    Simulation-Based Inference with Waldo: Confidence Regions by Leveraging Prediction Algorithms or Posterior Estimators for Inverse Problems. (arXiv:2205.15680v2 [stat.ML] UPDATED)
    Predictive algorithms, such as deep neural networks (DNNs), are used in many domain sciences to directly estimate internal parameters of interest in simulator-based models, especially in settings where the observations include images or other complex high-dimensional data. In parallel, modern neural density estimators, such as normalizing flows, are becoming increasingly popular for uncertainty quantification, especially when both parameters and observations are high-dimensional. However, parameter inference is an inverse problem and not a prediction task; thus, an open challenge is to construct conditionally valid and precise confidence regions, with a guaranteed probability of covering the true parameters of the data-generating process, no matter what the (unknown) parameter values are, and without relying on large-sample theory. Many simulator-based inference (SBI) methods are indeed known to produce biased or overly confident parameter regions, yielding misleading uncertainty estimates. This paper presents WALDO, a novel method for constructing confidence regions with finite-sample conditional validity by leveraging prediction algorithms or posterior estimators that are currently widely adopted in SBI. WALDO reframes the well-known Wald test statistic, and uses a computationally efficient regression-based machinery for classical Neyman inversion of hypothesis tests. We apply our method to a recent high-energy physics problem, where prediction with DNNs has previously led to estimates with prediction bias. We also illustrate how our approach can correct overly confident posterior regions computed with normalizing flows.
    Embrace the Gap: VAEs Perform Independent Mechanism Analysis. (arXiv:2206.02416v2 [stat.ML] UPDATED)
    Variational autoencoders (VAEs) are a popular framework for modeling complex data distributions; they can be efficiently trained via variational inference by maximizing the evidence lower bound (ELBO), at the expense of a gap to the exact (log-)marginal likelihood. While VAEs are commonly used for representation learning, it is unclear why ELBO maximization would yield useful representations, since unregularized maximum likelihood estimation cannot invert the data-generating process. Yet, VAEs often succeed at this task. We seek to elucidate this apparent paradox by studying nonlinear VAEs in the limit of near-deterministic decoders. We first prove that, in this regime, the optimal encoder approximately inverts the decoder -- a commonly used but unproven conjecture -- which we refer to as {\em self-consistency}. Leveraging self-consistency, we show that the ELBO converges to a regularized log-likelihood. This allows VAEs to perform what has recently been termed independent mechanism analysis (IMA): it adds an inductive bias towards decoders with column-orthogonal Jacobians, which helps recovering the true latent factors. The gap between ELBO and log-likelihood is therefore welcome, since it bears unanticipated benefits for nonlinear representation learning. In experiments on synthetic and image data, we show that VAEs uncover the true latent factors when the data generating process satisfies the IMA assumption.
    Efficient and Effective Augmentation Strategy for Adversarial Training. (arXiv:2210.15318v1 [cs.LG])
    Adversarial training of Deep Neural Networks is known to be significantly more data-hungry when compared to standard training. Furthermore, complex data augmentations such as AutoAugment, which have led to substantial gains in standard training of image classifiers, have not been successful with Adversarial Training. We first explain this contrasting behavior by viewing augmentation during training as a problem of domain generalization, and further propose Diverse Augmentation-based Joint Adversarial Training (DAJAT) to use data augmentations effectively in adversarial training. We aim to handle the conflicting goals of enhancing the diversity of the training dataset and training with data that is close to the test distribution by using a combination of simple and complex augmentations with separate batch normalization layers during training. We further utilize the popular Jensen-Shannon divergence loss to encourage the joint learning of the diverse augmentations, thereby allowing simple augmentations to guide the learning of complex ones. Lastly, to improve the computational efficiency of the proposed method, we propose and utilize a two-step defense, Ascending Constraint Adversarial Training (ACAT), that uses an increasing epsilon schedule and weight-space smoothing to prevent gradient masking. The proposed method DAJAT achieves substantially better robustness-accuracy trade-off when compared to existing methods on the RobustBench Leaderboard on ResNet-18 and WideResNet-34-10. The code for implementing DAJAT is available here: https://github.com/val-iisc/DAJAT.
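    The Jensen-Shannon term mentioned above has a compact form: for the predictive distributions of the same image under several augmentations, JS = H(mean of p) - mean of H(p). A minimal PyTorch sketch of that term alone (our own illustration with hypothetical names, not the DAJAT training loop):

        import torch
        import torch.nn.functional as F

        def js_consistency_loss(logits_list):
            """Generalized Jensen-Shannon divergence among the predictions of
            one batch under different augmentations: H(mean p) - mean H(p)."""
            probs = [F.softmax(l, dim=1) for l in logits_list]
            mean_p = torch.stack(probs).mean(dim=0)
            def entropy(p):
                return -(p * torch.log(p.clamp_min(1e-12))).sum(dim=1)
            member_H = torch.stack([entropy(p) for p in probs]).mean(dim=0)
            return (entropy(mean_p) - member_H).mean()

        a, b, c = torch.randn(8, 10), torch.randn(8, 10), torch.randn(8, 10)
        print(js_consistency_loss([a, b, c]))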
    Arithmetic Sampling: Parallel Diverse Decoding for Large Language Models. (arXiv:2210.15458v1 [cs.CL])
    Decoding methods for large language models often trade-off between diversity of outputs and parallelism of computation. Methods such as beam search and Gumbel top-k sampling can guarantee a different output for each element of the beam, but are not easy to parallelize. Alternatively, methods such as temperature sampling and its modifications (top-k sampling, nucleus sampling, typical decoding, and others), are embarrassingly parallel, but have no guarantees about duplicate samples. We present a framework for sampling according to an arithmetic code book implicitly defined by a large language model, compatible with common sampling variations, with provable beam diversity under certain conditions, as well as being embarrassingly parallel and providing unbiased and consistent expectations from the original model. We demonstrate the effectiveness of our approach on WMT machine translation, showing substantially reduced variance when estimating expected BLEU score and up to 1 point increased BLEU in oracle experiments.
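    The core idea admits a compact sketch: fix a code point u in [0,1) and decode it through nested intervals sized by the model's next-token probabilities, so evenly spaced codes yield distinct samples that can be decoded in parallel. The fixed-distribution toy "model" below is our own illustration, not the paper's setup:

        import numpy as np

        def decode_from_code(u, next_probs, max_len=5, eos=0):
            """Map a code point u in [0,1) to a token sequence by walking
            nested intervals sized by next-token probabilities."""
            lo, hi, seq = 0.0, 1.0, []
            for _ in range(max_len):
                p = next_probs(seq)
                cum = np.cumsum(p)
                x = (u - lo) / (hi - lo)     # position within current interval
                tok = min(int(np.searchsorted(cum, x, side="right")), len(p) - 1)
                left = cum[tok - 1] if tok > 0 else 0.0
                lo, hi = lo + left * (hi - lo), lo + cum[tok] * (hi - lo)
                seq.append(tok)
                if tok == eos:
                    break
            return seq

        # Toy "model": a fixed distribution over a 4-token vocabulary (0 = EOS).
        probs = np.array([0.1, 0.4, 0.3, 0.2])
        model = lambda seq: probs
        # Evenly spaced codes give distinct, embarrassingly parallel samples.
        for u in np.linspace(0.05, 0.95, 4):
            print(round(float(u), 2), decode_from_code(u, model))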
    PopArt: Efficient Sparse Regression and Experimental Design for Optimal Sparse Linear Bandits. (arXiv:2210.15345v1 [stat.ML])
    In sparse linear bandits, a learning agent sequentially selects an action and receive reward feedback, and the reward function depends linearly on a few coordinates of the covariates of the actions. This has applications in many real-world sequential decision making problems. In this paper, we propose a simple and computationally efficient sparse linear estimation method called PopArt that enjoys a tighter $\ell_1$ recovery guarantee compared to Lasso (Tibshirani, 1996) in many problems. Our bound naturally motivates an experimental design criterion that is convex and thus computationally efficient to solve. Based on our novel estimator and design criterion, we derive sparse linear bandit algorithms that enjoy improved regret upper bounds upon the state of the art (Hao et al., 2020), especially w.r.t. the geometry of the given action set. Finally, we prove a matching lower bound for sparse linear bandits in the data-poor regime, which closes the gap between upper and lower bounds in prior work.
    Segmentation of Multiple Sclerosis Lesions across Hospitals: Learn Continually or Train from Scratch?. (arXiv:2210.15091v1 [cs.CV])
    Segmentation of Multiple Sclerosis (MS) lesions is a challenging problem. Several deep-learning-based methods have been proposed in recent years. However, most methods tend to be static, that is, a single model trained on a large, specialized dataset, which does not generalize well. Instead, the model should learn across datasets arriving sequentially from different hospitals by building upon the characteristics of lesions in a continual manner. In this regard, we explore experience replay, a well-known continual learning method, in the context of MS lesion segmentation across multi-contrast data from 8 different hospitals. Our experiments show that replay is able to achieve positive backward transfer and reduce catastrophic forgetting compared to sequential fine-tuning. Furthermore, replay outperforms the multi-domain training, thereby emerging as a promising solution for the segmentation of MS lesions. The code is available at this link: https://github.com/naga-karthik/continual-learning-ms
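    A minimal sketch of experience replay in this sequential-hospital setting (our own simplification; the sample tuples and buffer size are hypothetical stand-ins): keep a reservoir of past-site samples and mix a few into every current-site batch.

        import random

        class ReplayBuffer:
            """Reservoir buffer of past-hospital samples, mixed into batches."""
            def __init__(self, capacity=200):
                self.capacity, self.data, self.seen = capacity, [], 0

            def add(self, sample):
                self.seen += 1
                if len(self.data) < self.capacity:
                    self.data.append(sample)
                else:                  # reservoir sampling keeps a uniform subset
                    j = random.randrange(self.seen)
                    if j < self.capacity:
                        self.data[j] = sample

            def mix(self, batch, k):
                return batch + random.sample(self.data, min(k, len(self.data)))

        buf = ReplayBuffer()
        for hospital in range(3):                  # datasets arrive sequentially
            for step in range(100):
                batch = [(hospital, step, i) for i in range(4)]  # current site
                train_batch = buf.mix(batch, k=2)  # replay old sites alongside
                for s in batch:
                    buf.add(s)
        print(len(buf.data), buf.seen)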
    Detection and Prevention Against Poisoning Attacks in Federated Learning. (arXiv:2210.14944v1 [cs.CR])
    This paper proposes and investigates a new approach for detecting and preventing several different types of poisoning attacks from affecting a centralized Federated Learning model via average accuracy deviation detection (AADD). By comparing each client's accuracy to the average accuracy of all clients, AADD detects clients with an accuracy deviation. The implementation is further able to blacklist clients that are considered poisoned, securing the global model from being affected by the poisoned nodes. The proposed implementation shows promising results in detecting poisoned clients and preventing the global model's accuracy from deteriorating.
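    A minimal sketch of the deviation check (the exact rule and threshold used in the paper may differ; the 0.15 threshold and client accuracies here are illustrative assumptions):

        import numpy as np

        def aadd_filter(client_accuracies, threshold=0.15, blacklist=None):
            """Flag clients whose accuracy deviates from the mean of the
            non-blacklisted clients by more than `threshold`."""
            blacklist = set() if blacklist is None else blacklist
            accs = {c: a for c, a in client_accuracies.items()
                    if c not in blacklist}
            mean_acc = np.mean(list(accs.values()))
            for client, acc in accs.items():
                if abs(acc - mean_acc) > threshold:
                    blacklist.add(client)
            kept = [c for c in accs if c not in blacklist]
            return kept, blacklist

        accs = {"c1": 0.91, "c2": 0.89, "c3": 0.52, "c4": 0.90}  # c3 suspect
        kept, bad = aadd_filter(accs)
        print(kept, bad)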
    Tangent Bundle Filters and Neural Networks: from Manifolds to Cellular Sheaves and Back. (arXiv:2210.15058v1 [eess.SP])
    In this work we introduce a convolution operation over the tangent bundle of Riemannian manifolds exploiting the Connection Laplacian operator. We use the convolution to define tangent bundle filters and tangent bundle neural networks (TNNs), novel continuous architectures operating on tangent bundle signals, i.e. vector fields over manifolds. We discretize TNNs both in space and time domains, showing that their discrete counterpart is a principled variant of the recently introduced Sheaf Neural Networks. We formally prove that this discrete architecture converges to the underlying continuous TNN. We numerically evaluate the effectiveness of the proposed architecture on a denoising task of a tangent vector field over the unit 2-sphere.
    Coordination with Humans via Strategy Matching. (arXiv:2210.15099v1 [cs.RO])
    Human and robot partners increasingly need to work together to perform tasks as a team. Robots designed for such collaboration must reason about how their task-completion strategies interplay with the behavior and skills of their human team members as they coordinate on achieving joint goals. Our goal in this work is to develop a computational framework for robot adaptation to human partners in human-robot team collaborations. We first present an algorithm for autonomously recognizing available task-completion strategies by observing human-human teams performing a collaborative task. By transforming team actions into low dimensional representations using hidden Markov models, we can identify strategies without prior knowledge. Robot policies are learned on each of the identified strategies to construct a Mixture-of-Experts model that adapts to the task strategies of unseen human partners. We evaluate our model on a collaborative cooking task using an Overcooked simulator. Results of an online user study with 125 participants demonstrate that our framework improves the task performance and collaborative fluency of human-agent teams, as compared to state of the art reinforcement learning methods.
    UnfoldML: Cost-Aware and Uncertainty-Based Dynamic 2D Prediction for Multi-Stage Classification. (arXiv:2210.15056v1 [cs.LG])
    Machine Learning (ML) research has focused on maximizing the accuracy of predictive tasks. ML models, however, are increasingly more complex, resource intensive, and costlier to deploy in resource-constrained environments. These issues are exacerbated for prediction tasks with sequential classification on progressively transitioned stages with a ''happens-before'' relation between them. We argue that it is possible to ''unfold'' a monolithic single multi-class classifier, typically trained for all stages using all data, into a series of single-stage classifiers. Each single-stage classifier can be cascaded gradually from cheaper to more expensive binary classifiers that are trained using only the necessary data modalities or features required for that stage. UnfoldML is a cost-aware and uncertainty-based dynamic 2D prediction pipeline for multi-stage classification that enables (1) navigation of the accuracy/cost tradeoff space, (2) reducing the spatio-temporal cost of inference by orders of magnitude, and (3) early prediction of subsequent stages. UnfoldML achieves orders of magnitude better cost in clinical settings, while detecting multi-stage disease development in real time. It achieves within 0.1% accuracy of the highest-performing multi-class baseline, while saving close to 20X on the spatio-temporal cost of inference and predicting disease onset 3.5 hours earlier. We also show that UnfoldML generalizes to image classification, where it can predict different levels of labels (from coarse to fine) given different levels of abstraction of an image, saving close to 5X in cost with as little as 0.4% accuracy reduction.
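    The unfolding idea can be sketched as a confidence-gated cascade (our own schematic, not UnfoldML's implementation; the stage models and threshold are hypothetical stand-ins): run cheap stage detectors in ''happens-before'' order and only escalate to the next, costlier stage while the current one fires confidently.

        def cascaded_predict(x, stage_models, confidence_threshold=0.9):
            """Run cheap single-stage classifiers in order; escalate to the
            next, more expensive stage only while the current one is
            confident the stage has begun."""
            history = []
            for stage, model in enumerate(stage_models):
                prob = model(x)                   # P(stage has started | x)
                history.append((stage, prob))
                if prob < confidence_threshold:   # not confident: stop here
                    break                         # and skip costlier stages
            return history

        # Toy stage detectors standing in for progressively costlier models.
        stages = [lambda x: 0.97, lambda x: 0.95, lambda x: 0.40]
        print(cascaded_predict(x=None, stage_models=stages))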
    Unified Algorithms for RL with Decision-Estimation Coefficients: No-Regret, PAC, and Reward-Free Learning. (arXiv:2209.11745v2 [cs.LG] UPDATED)
    Finding unified complexity measures and algorithms for sample-efficient learning is a central topic of research in reinforcement learning (RL). The Decision-Estimation Coefficient (DEC) is recently proposed by Foster et al. (2021) as a necessary and sufficient complexity measure for sample-efficient no-regret RL. This paper makes progress towards a unified theory for RL with the DEC framework. First, we propose two new DEC-type complexity measures: Explorative DEC (EDEC), and Reward-Free DEC (RFDEC). We show that they are necessary and sufficient for sample-efficient PAC learning and reward-free learning, thereby extending the original DEC which only captures no-regret learning. Next, we design new unified sample-efficient algorithms for all three learning goals. Our algorithms instantiate variants of the Estimation-To-Decisions (E2D) meta-algorithm with a strong and general model estimation subroutine. Even in the no-regret setting, our algorithm E2D-TA improves upon the algorithms of Foster et al. (2021) which require either bounding a variant of the DEC which may be prohibitively large, or designing problem-specific estimation subroutines. As applications, we recover existing and obtain new sample-efficient learning results for a wide range of tractable RL problems using essentially a single algorithm. We also generalize the DEC to give sample-efficient algorithms for all-policy model estimation, with applications for learning equilibria in Markov Games. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling or Maximum Likelihood Estimation, showing that they enjoy similar regret bounds as E2D-TA under similar structural conditions as the DEC.
    Rethinking the Reverse-engineering of Trojan Triggers. (arXiv:2210.15127v1 [cs.CR])
    Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-engineering methods can reconstruct the trigger and thus identify affected models. Existing reverse-engineering methods only consider input space constraints, e.g., trigger size in the input space. Specifically, they assume the triggers are static patterns in the input space and fail to detect models with feature space triggers such as image style transformations. We observe that both input-space and feature-space Trojans are associated with feature space hyperplanes. Based on this observation, we design a novel reverse-engineering method that exploits the feature space constraint to reverse-engineer Trojan triggers. Results on four datasets and seven different attacks demonstrate that our solution effectively defends against both input-space and feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods and other types of defenses in both Trojaned model detection and mitigation tasks. On average, the detection accuracy of our method is 93\%. For Trojan mitigation, our method can reduce the ASR (attack success rate) to only 0.26\% with the BA (benign accuracy) remaining nearly unchanged. Our code can be found at https://github.com/RU-System-Software-and-Security/FeatureRE.
    Learning versus Refutation in Noninteractive Local Differential Privacy. (arXiv:2210.15439v1 [stat.ML])
    We study two basic statistical tasks in non-interactive local differential privacy (LDP): learning and refutation. Learning requires finding a concept that best fits an unknown target function (from labelled samples drawn from a distribution), whereas refutation requires distinguishing between data distributions that are well-correlated with some concept in the class, versus distributions where the labels are random. Our main result is a complete characterization of the sample complexity of agnostic PAC learning for non-interactive LDP protocols. We show that the optimal sample complexity for any concept class is captured by the approximate $\gamma_2$~norm of a natural matrix associated with the class. Combined with previous work [Edmonds, Nikolov and Ullman, 2019] this gives an equivalence between learning and refutation in the agnostic setting.
    Outlier-Aware Training for Improving Group Accuracy Disparities. (arXiv:2210.15183v1 [cs.CL])
    Methods addressing spurious correlations such as Just Train Twice (JTT, arXiv:2107.09044v2) involve reweighting a subset of the training set to maximize the worst-group accuracy. However, the reweighted set of examples may potentially contain unlearnable examples that hamper the model's learning. We propose mitigating this by detecting outliers to the training set and removing them before reweighting. Our experiments show that our method achieves competitive or better accuracy compared with JTT and can detect and remove annotation errors in the subset being reweighted in JTT.
    A knowledge-driven vowel-based approach of depression classification from speech using data augmentation. (arXiv:2210.15261v1 [cs.SD])
    We propose a novel explainable machine learning (ML) model that identifies depression from speech by modeling the temporal dependencies across utterances and utilizing the spectrotemporal information at the vowel level. Our method first models the variable-length utterances at the local level into a fixed-size vowel-based embedding using a convolutional neural network with a spatial pyramid pooling layer ("vowel CNN"). Following that, depression is classified at the global level from a group of vowel CNN embeddings that serve as the input of another 1D CNN ("depression CNN"). Different data augmentation methods are designed for the training of both the vowel CNN and the depression CNN. We investigate the performance of the proposed system at various temporal granularities when modeling short, medium, and long analysis windows, corresponding to 10, 21, and 42 utterances, respectively. The proposed method reaches performance comparable to previous state-of-the-art approaches and exhibits explainable properties with respect to the depression outcome. The findings from this work may benefit clinicians by providing additional intuitions during joint human-ML decision-making tasks.
    Resource Constrained Vehicular Edge Federated Learning with Highly Mobile Connected Vehicles. (arXiv:2210.15496v1 [eess.SY])
    This paper proposes a vehicular edge federated learning (VEFL) solution, where an edge server leverages highly mobile connected vehicles' (CVs') onboard central processing units (CPUs) and local datasets to train a global model. Convergence analysis reveals that the VEFL training loss depends on the successful receptions of the CVs' trained models over the intermittent vehicle-to-infrastructure (V2I) wireless links. Owing to high mobility, in the full device participation case (FDPC), the edge server aggregates client model parameters based on a weighted combination according to the CVs' dataset sizes and sojourn periods, while it selects a subset of CVs in the partial device participation case (PDPC). We then devise joint VEFL and radio access technology (RAT) parameters optimization problems under delay, energy and cost constraints to maximize the probability of successful reception of the locally trained models. Considering that the optimization problem is NP-hard, we decompose it into a VEFL parameter optimization sub-problem, given the estimated worst-case sojourn period, delay and energy expense, and an online RAT parameter optimization sub-problem. Finally, extensive simulations are conducted to validate the effectiveness of the proposed solutions with a practical 5G new radio (5G-NR) RAT under a realistic microscopic mobility model.
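    In the FDPC case, the aggregation described above can be sketched as a weighted FedAvg step; the product-form weighting below is our assumption about how dataset size and sojourn period might be combined, for illustration only:

        import numpy as np

        def vefl_aggregate(client_params, dataset_sizes, sojourn_periods):
            """Weighted FedAvg-style aggregation: each CV's update is weighted
            by its dataset size times its sojourn period (normalized)."""
            w = np.asarray(dataset_sizes, float) * np.asarray(sojourn_periods,
                                                              float)
            w = w / w.sum()
            return sum(wi * p for wi, p in zip(w, client_params))

        params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]),
                  np.array([0.5, 0.5])]
        print(vefl_aggregate(params, dataset_sizes=[100, 50, 200],
                             sojourn_periods=[30, 60, 10]))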
    Robust Domain Adaptation for Pre-trained Multilingual Neural Machine Translation Models. (arXiv:2210.14979v1 [cs.CL])
    Recent literature has demonstrated the potential of multilingual Neural Machine Translation (mNMT) models. However, the most efficient models are not well suited to specialized industries. In these cases, internal data is scarce and expensive to find in all language pairs, so fine-tuning an mNMT model on a specialized domain is hard. In this context, we decided to focus on a new task: domain adaptation of a pre-trained mNMT model on a single language pair while trying to maintain model quality on generic-domain data for all language pairs. The risk of loss on the generic domain and on other pairs is high. This task is key for mNMT model adoption in the industry and borders on many others. We propose a fine-tuning procedure for the generic mNMT that combines embedding freezing and an adversarial loss. Our experiments demonstrate that the procedure improves performance on specialized data with a minimal loss in initial performance on the generic domain for all language pairs, compared to a naive standard approach (+10.0 BLEU score on specialized data, -0.01 to -0.5 BLEU on WMT and Tatoeba datasets on the other pairs with M2M100).
    Masked Autoencoders Are Articulatory Learners. (arXiv:2210.15195v1 [eess.AS])
    Articulatory recordings track the positions and motion of different articulators along the vocal tract and are widely used to study speech production and to develop speech technologies such as articulatory based speech synthesizers and speech inversion systems. The University of Wisconsin X-Ray microbeam (XRMB) dataset is one of various datasets that provide articulatory recordings synced with audio recordings. The XRMB articulatory recordings employ pellets placed on a number of articulators which can be tracked by the microbeam. However, a significant portion of the articulatory recordings are mistracked and have so far been unusable. In this work, we present a deep learning based approach using Masked Autoencoders to accurately reconstruct the mistracked articulatory recordings for 41 out of 47 speakers of the XRMB dataset. Our model is able to reconstruct articulatory trajectories that closely match ground truth, even when three out of eight articulators are mistracked, and retrieve 3.28 out of 3.4 hours of previously unusable recordings.
    A Teacher-student Framework for Unsupervised Speech Enhancement Using Noise Remixing Training and Two-stage Inference. (arXiv:2210.15368v1 [cs.SD])
    The lack of clean speech is a practical challenge to the development of speech enhancement systems, which means that the training of neural network models must be done in an unsupervised manner, and there is an inevitable mismatch between their training criterion and evaluation metric. In response to this unfavorable situation, we propose a teacher-student training strategy that does not require any subjective/objective speech quality metric as a learning reference, by improving the previously proposed noisy-target training (NyTT). Because homogeneity between in-domain noise and extraneous noise is the key to the effectiveness of NyTT, we train various student models by remixing the teacher model's estimated speech and noise for clean-target training, or raw noisy speech and the teacher model's estimated noise for noisy-target training. We use the NyTT model as the initial teacher model. Experimental results show that our proposed method outperforms several baselines, especially with two-stage inference, where clean speech is derived successively through the bootstrap model and the final student model.  ( 2 min )
    MABEL: Attenuating Gender Bias using Textual Entailment Data. (arXiv:2210.14975v1 [cs.CL])
    Pre-trained language models encode undesirable social biases, which are further exacerbated in downstream use. To this end, we propose MABEL (a Method for Attenuating Gender Bias using Entailment Labels), an intermediate pre-training approach for mitigating gender bias in contextualized representations. Key to our approach is the use of a contrastive learning objective on counterfactually augmented, gender-balanced entailment pairs from natural language inference (NLI) datasets. We also introduce an alignment regularizer that pulls identical entailment pairs along opposite gender directions closer. We extensively evaluate our approach on intrinsic and extrinsic metrics, and show that MABEL outperforms previous task-agnostic debiasing approaches in terms of fairness. It also preserves task performance after fine-tuning on downstream tasks. Together, these findings demonstrate the suitability of NLI data as an effective means of bias mitigation, as opposed to only using unlabeled sentences in the literature. Finally, we identify that existing approaches often use evaluation settings that are insufficient or inconsistent. We make an effort to reproduce and compare previous methods, and call for unifying the evaluation settings across gender debiasing methods for better future comparison.
    FAS-UNet: A Novel FAS-driven Unet to Learn Variational Image Segmentation. (arXiv:2210.15164v1 [cs.CV])
    Solving variational image segmentation problems with hidden physics is often expensive and requires different algorithms and manually tuned model parameters. Deep learning methods based on the U-Net structure have achieved outstanding performance in many different medical image segmentation tasks, but designing such networks requires many parameters and much training data, which are not always available for practical problems. In this paper, inspired by the traditional multi-phase convexity Mumford-Shah variational model and the full approximation scheme (FAS) for solving nonlinear systems, we propose a novel variational-model-informed network (denoted as FAS-Unet) that exploits model and algorithm priors to extract multi-scale features. The proposed model-informed network integrates image data and mathematical models and implements them by learning a few convolution kernels. Based on variational theory and the FAS algorithm, we first design a feature extraction sub-network (FAS-Solution module) to solve the model-driven nonlinear systems, in which a skip-connection is employed to fuse the multi-scale features. Secondly, we design a convolution block to fuse the features extracted in the previous stage, yielding the final segmentation probability. Experimental results on three different medical image segmentation tasks show that the proposed FAS-Unet is very competitive with other state-of-the-art methods in qualitative, quantitative, and model-complexity evaluations. Moreover, it may also be possible to train specialized network architectures that automatically satisfy some of the mathematical and physical laws in other image problems, for better accuracy, faster training, and improved generalization.
    COCO-DR: Combating Distribution Shifts in Zero-Shot Dense Retrieval with Contrastive and Distributionally Robust Learning. (arXiv:2210.15212v1 [cs.CL])
    We present a new zero-shot dense retrieval (ZeroDR) method, COCO-DR, to improve the generalization ability of dense retrieval by combating the distribution shifts between source training tasks and target scenarios. To mitigate the impact of document differences, COCO-DR continues pretraining the language model on the target corpora to adapt the model to target distributions via COntinuous COntrastive learning. To prepare for unseen target queries, COCO-DR leverages implicit Distributionally Robust Optimization (iDRO) to reweight samples from different source query clusters, improving model robustness on rare queries during fine-tuning. COCO-DR achieves superior average performance on BEIR, the zero-shot retrieval benchmark. At BERT Base scale, COCO-DR Base outperforms other ZeroDR models that are 60x larger. At BERT Large scale, COCO-DR Large outperforms the giant GPT-3 embedding model, which has 500x more parameters. Our analysis shows the correlation between COCO-DR's effectiveness in combating distribution shifts and its improved zero-shot accuracy. Our code and model can be found at \url{https://github.com/OpenMatch/COCO-DR}.
    Learning Failure-Inducing Models for Testing Software-Defined Networks. (arXiv:2210.15469v1 [cs.SE])
    Software-defined networks (SDN) enable flexible and effective communication systems, e.g., data centers, that are managed by centralized software controllers. However, such a controller can undermine the underlying communication network of an SDN-based system and thus must be carefully tested. When an SDN-based system fails, engineers need to precisely understand the conditions under which the failure occurs in order to address it. In this paper, we introduce a machine learning-guided fuzzing method, named FuzzSDN, that aims at both (1) generating effective test data leading to failures in SDN-based systems and (2) learning accurate failure-inducing models that characterize the conditions under which such a system fails. The two objectives are pursued synergistically: the learned models guide test generation, and test generation in turn refines the models. To our knowledge, FuzzSDN is the first attempt to simultaneously address these two objectives for SDNs. We evaluate FuzzSDN by applying it to systems controlled by two open-source SDN controllers. Further, we compare FuzzSDN with two state-of-the-art methods for fuzzing SDNs and two baselines (i.e., simple extensions of these two existing methods) for learning failure-inducing models. Our results show that (1) compared to the state-of-the-art methods, FuzzSDN generates at least 12 times more failures, within the same time budget, with a controller that is fairly robust to fuzzing and (2) our failure-inducing models have, on average, a precision of 98% and a recall of 86%, significantly outperforming the baselines.
    Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder. (arXiv:2210.15533v1 [cs.SD])
    Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into a parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, its high-temporal-resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to its efficient upsampling-based generator architecture, its pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on well-estimated source excitation information. According to the experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on singing voice generation in terms of voice quality and of synthesis speed on a single CPU. Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily adopted and integrated into real-time applications and end-to-end systems.
    CasNet: Investigating Channel Robustness for Speech Separation. (arXiv:2210.15370v1 [cs.SD])
    Recording channel mismatch between training and testing conditions has been shown to be a serious problem for speech separation. Such a mismatch greatly reduces separation performance and falls short of the requirements of everyday use. In this study, continuing the use of our previously constructed TAT-2mix corpus, we address the channel mismatch problem by proposing a channel-aware audio separation network (CasNet), a deep learning framework for end-to-end time-domain speech separation. CasNet is implemented on top of TasNet. A channel embedding (characterizing the channel information in a mixture of multiple utterances) generated by a channel encoder is introduced into the separation module via the FiLM technique. Through two training strategies, we explore two roles that the channel embedding may play: 1) a real-life noise disturbance, making the model more robust, or 2) a guide, instructing the separation model to retain the desired channel information. Experimental results on TAT-2mix show that CasNet trained with either training strategy outperforms the TasNet baseline, which does not use channel embeddings.
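    For reference, FiLM conditioning amounts to predicting per-channel scales and shifts from the channel embedding and applying them to the separator's intermediate features; the sketch below is a generic FiLM layer, not CasNet's exact module.

        import torch
        import torch.nn as nn

        class FiLM(nn.Module):
            """Feature-wise Linear Modulation: scale and shift separator
            features with gains/biases predicted from a channel embedding."""
            def __init__(self, embed_dim, num_channels):
                super().__init__()
                self.to_gamma = nn.Linear(embed_dim, num_channels)
                self.to_beta = nn.Linear(embed_dim, num_channels)

            def forward(self, feats, channel_emb):
                # feats: (batch, channels, time); channel_emb: (batch, embed_dim)
                gamma = self.to_gamma(channel_emb).unsqueeze(-1)
                beta = self.to_beta(channel_emb).unsqueeze(-1)
                return gamma * feats + beta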
    Hypergraph Artificial Benchmark for Community Detection (h-ABCD). (arXiv:2210.15009v1 [cs.SI])
    The Artificial Benchmark for Community Detection (ABCD) graph is a recently introduced random graph model with community structure and power-law distributions for both degrees and community sizes. The model generates graphs with properties similar to those of the well-known LFR model, and its main parameter can be tuned to mimic its counterpart in the LFR model, the mixing parameter. In this paper, we introduce a hypergraph counterpart of the ABCD model, h-ABCD, which produces random hypergraphs whose ground-truth community sizes and degrees follow power-law distributions. As in the original ABCD, the new h-ABCD model can produce hypergraphs with various levels of noise. More importantly, the model is flexible and can mimic any desired level of homogeneity of hyperedges that fall into one community. As a result, it can be used as a suitable synthetic playground for analyzing and tuning hypergraph community detection algorithms.
    Learning Discrete Directed Acyclic Graphs via Backpropagation. (arXiv:2210.15353v1 [cs.LG])
    Recently, continuous relaxations have been proposed in order to learn Directed Acyclic Graphs (DAGs) from data by backpropagation, instead of using combinatorial optimization. However, a number of techniques for fully discrete backpropagation could instead be applied. In this paper, we explore that direction and propose DAG-DB, a framework for learning DAGs by Discrete Backpropagation. Based on the architecture of Implicit Maximum Likelihood Estimation [I-MLE, arXiv:2106.01798], DAG-DB adopts a probabilistic approach to the problem, sampling binary adjacency matrices from an implicit probability distribution. DAG-DB learns a parameter for the distribution from the loss incurred by each sample, performing competitively with either of two fully discrete backpropagation techniques, namely I-MLE and Straight-Through Estimation.
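    Of the two estimators named above, Straight-Through Estimation is easy to sketch: sample hard binary matrices in the forward pass and let gradients flow through the underlying probabilities in the backward pass. This is a generic STE sketch, not the exact DAG-DB sampler.

        import torch

        def sample_adjacency_ste(logits):
            # sample a binary adjacency matrix; the forward value is discrete,
            # while gradients flow through the sigmoid probabilities (STE)
            probs = torch.sigmoid(logits)
            hard = torch.bernoulli(probs)            # discrete sample, no gradient
            return hard + probs - probs.detach()     # value == hard, grad == probs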
    What Language Model to Train if You Have One Million GPU Hours?. (arXiv:2210.15424v1 [cs.CL])
    The crystallization of modeling methods around the Transformer architecture has been a boon for practitioners. Simple, well-motivated architectural variations can transfer across tasks and scales, increasing the impact of modeling research. However, with the emergence of state-of-the-art 100B+ parameter models, large language models are increasingly expensive to accurately design and train. Notably, it can be difficult to evaluate how modeling decisions may impact emergent capabilities, given that these capabilities arise mainly from sheer scale. In the process of building BLOOM--the BigScience Large Open-science Open-access Multilingual language model--our goal is to identify an architecture and training setup that makes the best use of our 1,000,000 A100-GPU-hour budget. Specifically, we perform an ablation study at the billion-parameter scale comparing different modeling practices and their impact on zero-shot generalization. In addition, we study the impact of various popular pre-training corpora on zero-shot generalization. We also study the performance of a multilingual model and how it compares to the English-only one. Finally, we consider the scaling behaviour of Transformers to choose the target model size, shape, and training setup. All our models and code are open-sourced at https://huggingface.co/bigscience .
    CoViT: Real-time phylogenetics for the SARS-CoV-2 pandemic using Vision Transformers. (arXiv:2208.05004v2 [cs.LG] UPDATED)
    Real-time viral genome detection, taxonomic classification and phylogenetic analysis are critical for efficient tracking and control of viral pandemics such as Covid-19. However, the unprecedented and still growing amount of viral genome data creates a computational bottleneck that effectively prevents real-time pandemic tracking. For genomic tracing to work effectively, each new viral genome sequence must be placed in its pangenomic context. Re-inferring the full phylogeny of SARS-CoV-2, with datasets containing millions of samples, is prohibitively slow even using powerful computational resources. We attempt to alleviate this computational bottleneck by modifying and applying the Vision Transformer, a recently developed neural network model for image recognition, to the taxonomic classification and placement of viral genomes such as SARS-CoV-2. Our solution, CoViT, places SARS-CoV-2 genome accessions onto the SARS-CoV-2 phylogenetic tree with an accuracy of 94.2%. Since CoViT is a classification neural network, it provides more than one likely placement. Specifically, one of the two most likely placements suggested by CoViT is correct with probability 97.9%. The probability of the correct placement being found among the five most likely placements generated by CoViT is 99.8%. The placement time is 0.055s per individual genome running on NVIDIA's GeForce RTX 2080 Ti GPU. We make CoViT available to the research community through GitHub: https://github.com/zuherJahshan/covit.
    Bayesian Hyperbolic Multidimensional Scaling. (arXiv:2210.15081v1 [stat.ME])
    Multidimensional scaling (MDS) is a widely used approach to representing high-dimensional, dependent data. MDS works by assigning each observation a location on a low-dimensional geometric manifold, with distance on the manifold representing similarity. We propose a Bayesian approach to multidimensional scaling when the low-dimensional manifold is hyperbolic. Using hyperbolic space facilitates representing tree-like structure common in many settings (e.g. text or genetic data with hierarchical structure). A Bayesian approach provides regularization that minimizes the impact of uncertainty or measurement error in the observed data. We also propose a case-control likelihood approximation that allows for efficient sampling from the posterior in larger data settings, reducing computational complexity from approximately $O(n^2)$ to $O(n)$. We evaluate the proposed method against state-of-the-art alternatives using simulations, canonical reference datasets, and human gene expression data.
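    For concreteness, one standard model of hyperbolic space is the Poincaré ball, whose geodesic distance is given below; whether the paper uses this model or another (e.g., the hyperboloid) is not stated in the abstract, so treat this as a generic sketch.

        import numpy as np

        def poincare_distance(u, v, eps=1e-9):
            # geodesic distance between two points inside the unit ball
            uu = np.sum(u * u)
            vv = np.sum(v * v)
            duv = np.sum((u - v) ** 2)
            return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv) + eps))

    Tree-like data embeds well under this metric because volume grows exponentially with radius, mirroring the exponential growth of nodes with tree depth.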
    Many-Objective Reinforcement Learning for Online Testing of DNN-Enabled Systems. (arXiv:2210.15432v1 [cs.LG])
    Deep Neural Networks (DNNs) have been widely used to perform real-world tasks in cyber-physical systems such as Autonomous Driving Systems (ADS). Ensuring the correct behavior of such DNN-Enabled Systems (DES) is a crucial topic. Online testing is one of the promising modes for testing such systems with their application environments (simulated or real) in a closed loop, taking into account the continuous interaction between the systems and their environments. However, the environmental variables (e.g., lighting conditions) that might change during the systems' operation in the real world, causing the DES to violate requirements (safety, functional), are often kept constant during the execution of an online test scenario due to two major challenges: (1) the space of all possible scenarios to explore would become even larger if they changed and (2) there are typically many requirements to test simultaneously. In this paper, we present MORLOT (Many-Objective Reinforcement Learning for Online Testing), a novel online testing approach that addresses these challenges by combining Reinforcement Learning (RL) and many-objective search. MORLOT leverages RL to incrementally generate sequences of environmental changes while relying on many-objective search to determine the changes so that they are more likely to achieve any of the uncovered objectives. We empirically evaluate MORLOT using CARLA, a high-fidelity simulator widely used for autonomous driving research, integrated with Transfuser, a DNN-enabled ADS for end-to-end driving. The evaluation results show that MORLOT is significantly more effective and efficient than the alternatives, with a large effect size. In other words, MORLOT is a good option for testing DES with dynamically changing environments while accounting for multiple safety requirements.
    Exploiting Features and Logits in Heterogeneous Federated Learning. (arXiv:2210.15527v1 [cs.LG])
    Due to the rapid growth of IoT and artificial intelligence, deploying neural networks on IoT devices is becoming increasingly crucial for edge intelligence. Federated learning (FL) facilitates the management of edge devices to collaboratively train a shared model while keeping training data local and private. However, a general assumption in FL is that all edge devices are trained on the same machine learning model, which may be impractical considering diverse device capabilities. For instance, less capable devices may slow down the updating process because they struggle to handle large models appropriate for ordinary devices. In this paper, we propose a novel data-free FL method that supports heterogeneous client models by managing features and logits, called Felo, and its extension with a conditional VAE deployed on the server, called Velo. Felo averages the mid-level features and logits from the clients at the server based on their class labels to provide the average features and logits, which are utilized for further training the client models. Unlike Felo, the server in Velo hosts a conditional VAE, which is trained on the mid-level features and generates synthetic features according to the labels. The clients optimize their models based on the synthetic features and the average logits. We conduct experiments on two datasets, and our methods show satisfactory performance compared with state-of-the-art methods.
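    The server-side aggregation attributed to Felo can be sketched as per-class averaging of the uploaded (feature, logit, label) triples; the names and data layout below are illustrative, not the authors' API.

        from collections import defaultdict
        import numpy as np

        def aggregate_by_label(client_payloads):
            # average mid-level features and logits per class label at the server
            feats, logits = defaultdict(list), defaultdict(list)
            for payload in client_payloads:          # one payload per client
                for f, z, y in payload:              # (feature, logit, label)
                    feats[y].append(f)
                    logits[y].append(z)
            avg_feats = {y: np.mean(fs, axis=0) for y, fs in feats.items()}
            avg_logits = {y: np.mean(zs, axis=0) for y, zs in logits.items()}
            return avg_feats, avg_logits             # broadcast back to clients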
    Anonymized Histograms in Intermediate Privacy Models. (arXiv:2210.15178v1 [cs.DS])
    We study the problem of privately computing the anonymized histogram (a.k.a. unattributed histogram), which is defined as the histogram without item labels. Previous works have provided algorithms with $\ell_1$- and $\ell_2^2$-errors of $O_\varepsilon(\sqrt{n})$ in the central model of differential privacy (DP). In this work, we provide an algorithm with a nearly matching error guarantee of $\tilde{O}_\varepsilon(\sqrt{n})$ in the shuffle DP and pan-private models. Our algorithm is very simple: it just post-processes the discrete Laplace-noised histogram! Using this algorithm as a subroutine, we show applications in privately estimating symmetric properties of distributions such as entropy, support coverage, and support size.
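    A minimal sketch of the mechanism as described: perturb each count with discrete Laplace noise (realizable as the difference of two geometric variables), then post-process by discarding labels; the sort-and-clamp step here is illustrative, not necessarily the authors' exact post-processing.

        import numpy as np

        def anonymized_histogram_dp(counts, epsilon):
            # noise each count with discrete Laplace noise, realized as the
            # difference of two i.i.d. geometric random variables
            counts = np.asarray(counts)
            p = 1.0 - np.exp(-epsilon)
            noise = (np.random.geometric(p, size=len(counts))
                     - np.random.geometric(p, size=len(counts)))
            noised = counts + noise
            # post-process: drop item labels by sorting, and clamp to >= 0
            return np.maximum(np.sort(noised)[::-1], 0)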
    Isometric 3D Adversarial Examples in the Physical World. (arXiv:2210.15291v1 [cs.CV])
    3D deep learning models are shown to be as vulnerable to adversarial examples as 2D models. However, existing attack methods are still far from stealthy and suffer from severe performance degradation in the physical world. Although 3D data is highly structured, it is difficult to bound the perturbations with simple metrics in the Euclidean space. In this paper, we propose a novel $\epsilon$-isometric ($\epsilon$-ISO) attack to generate natural and robust 3D adversarial examples in the physical world by considering the geometric properties of 3D objects and the invariance to physical transformations. For naturalness, we constrain the adversarial example to be $\epsilon$-isometric to the original one by adopting the Gaussian curvature as a surrogate metric, as guaranteed by a theoretical analysis. For invariance to physical transformations, we propose a maxima over transformation (MaxOT) method that actively searches for the most harmful transformations, rather than random ones, to make the generated adversarial example more robust in the physical world. Experiments on typical point cloud recognition models validate that our approach can significantly improve the attack success rate and the naturalness of the generated 3D adversarial examples compared with state-of-the-art attack methods.  ( 2 min )
    Watermarking for Out-of-distribution Detection. (arXiv:2210.15198v1 [cs.LG])
    Out-of-distribution (OOD) detection aims to identify OOD data based on representations extracted from well-trained deep models. However, existing methods largely ignore the reprogramming property of deep models and thus may not fully unleash their intrinsic strength: without modifying parameters of a well-trained deep model, we can reprogram this model for a new purpose via data-level manipulation (e.g., adding a specific feature perturbation to the data). This property motivates us to reprogram a classification model to excel at OOD detection (a new task), and thus we propose a general methodology named watermarking in this paper. Specifically, we learn a unified pattern that is superimposed onto features of original data, and the model's detection capability is largely boosted after watermarking. Extensive experiments verify the effectiveness of watermarking, demonstrating the significance of the reprogramming property of deep models in OOD detection.  ( 2 min )
    Text2Model: Model Induction for Zero-shot Generalization Using Task Descriptions. (arXiv:2210.15182v1 [cs.CV])
    We study the problem of generating a training-free task-dependent visual classifier from text descriptions without visual samples. This \textit{Text-to-Model} (T2M) problem is closely related to zero-shot learning, but unlike previous work, a T2M model infers a model tailored to a task, taking into account all classes in the task. We analyze the symmetries of T2M, and characterize the equivariance and invariance properties of corresponding models. In light of these properties, we design an architecture based on hypernetworks that, given a set of new class descriptions, predicts the weights for an object recognition model which classifies images from those zero-shot classes. We demonstrate the benefits of our approach compared to zero-shot learning from text descriptions in image and point-cloud classification, using various types of text descriptions, from single words to rich text.  ( 2 min )
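    Stripped of the equivariance analysis, the core mechanism can be sketched as a hypernetwork that maps class-description embeddings to the weights of a linear classifier; this bare-bones version is ours, not the paper's architecture.

        import torch
        import torch.nn as nn

        class ClassifierHypernet(nn.Module):
            """Map class-description embeddings to the weight matrix of a
            linear image classifier (a bare-bones hypernetwork sketch)."""
            def __init__(self, text_dim, image_dim):
                super().__init__()
                self.to_weight = nn.Linear(text_dim, image_dim)

            def forward(self, class_embeddings, image_features):
                # class_embeddings: (num_classes, text_dim)
                W = self.to_weight(class_embeddings)   # (num_classes, image_dim)
                return image_features @ W.t()          # logits over task classes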
    On Out-of-Distribution Detection for Audio with Deep Nearest Neighbors. (arXiv:2210.15283v1 [cs.SD])
    Out-of-distribution (OOD) detection is concerned with identifying data points that do not belong to the same distribution as the model's training data. For the safe deployment of predictive models in a real-world environment, it is critical to avoid making confident predictions on OOD inputs, as they can lead to potentially dangerous consequences. However, OOD detection largely remains an under-explored area in the audio (and speech) domain. This is despite the fact that audio is a central modality for many tasks, such as speaker diarization, automatic speech recognition, and sound event detection. To address this, we propose to leverage the feature space of the model with deep k-nearest neighbors to detect OOD samples. We show that this simple and flexible method effectively detects OOD inputs across a broad category of audio (and speech) datasets. Specifically, it improves the false positive rate (FPR@TPR95) by 17% and the AUROC score by 7% compared with prior techniques.  ( 2 min )
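    The scoring rule itself is simple enough to sketch: L2-normalize the model's features and use the distance to the k-th nearest training feature as the OOD score. This is a generic deep-kNN sketch; thresholds and the choice of feature layer are the paper's to specify.

        import numpy as np

        def knn_ood_score(test_feat, train_feats, k=10):
            # L2-normalize features, then score by distance to the k-th
            # nearest training feature; larger score => more likely OOD
            test_feat = test_feat / np.linalg.norm(test_feat)
            train_feats = train_feats / np.linalg.norm(train_feats, axis=1,
                                                       keepdims=True)
            dists = np.linalg.norm(train_feats - test_feat, axis=1)
            return np.sort(dists)[k - 1]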
    Trust and Believe -- Should We? Evaluating the Trustworthiness of Twitter Users. (arXiv:2210.15214v1 [cs.SI])
    Social networking and micro-blogging services, such as Twitter, play an important role in sharing digital information. Despite the popularity and usefulness of social media, they are regularly abused by corrupt users. One of these nefarious activities is so-called fake news -- a "virus" that has been spreading rapidly thanks to the hospitable environment provided by social media platforms. The extensive spread of fake news is now becoming a major problem with far-reaching negative repercussions on both individuals and society. Hence, the identification of fake news on social media is a problem of utmost importance that has attracted the interest not only of the research community but also of most of the big players on both sides, such as Facebook on the industry side and political parties on the societal one. In this work, we create a model through which we hope to offer a solution that will instill trust in social network communities. Our model analyses the behaviour of 50,000 politicians on Twitter and assigns an influence score to each evaluated user based on several collected and analysed features and attributes. Next, we classify political Twitter users as either trustworthy or untrustworthy using random forest and support vector machine classifiers. An active learning model has been used to classify any unlabeled ambiguous records from our dataset. Finally, to measure the performance of the proposed model, we used accuracy as the main evaluation metric.  ( 3 min )
    Multi-Viewpoint and Multi-Evaluation with Felicitous Inductive Bias Boost Machine Abstract Reasoning Ability. (arXiv:2210.14914v1 [cs.LG])
    Great endeavors have been made to study AI's ability in abstract reasoning, along with which different versions of RAVEN's progressive matrices (RPM) have been proposed as benchmarks. Previous works suggest that, without sophisticated design or extra meta-data containing semantic information, neural networks may remain indecisive on RPM problems even after relentless training. Evidenced by thorough experiments and ablation studies, we show that end-to-end neural networks embodied with a felicitous inductive bias, whether intentionally designed or serendipitously matched, can solve RPM problems elegantly, without the support of any extra meta-data or preference for any specific backbone. Our work also reveals that multi-viewpoint with multi-evaluation is a key learning strategy for successful reasoning. Finally, potential explanations for the failure of connectionist models in generalization are provided. We hope that these results will serve as probes of AI's ability beyond perception and toward abstract reasoning. Source code can be found at https://github.com/QinglaiWeiCASIA/RavenSolver.  ( 2 min )
    Federated Continual Learning to Detect Accounting Anomalies in Financial Auditing. (arXiv:2210.15051v1 [cs.LG])
    The International Standards on Auditing require auditors to collect reasonable assurance that financial statements are free of material misstatement. At the same time, a central objective of Continuous Assurance is the real-time assessment of digital accounting journal entries. Recently, driven by advances in artificial intelligence, Deep Learning techniques have emerged in financial auditing to examine vast quantities of accounting data. However, learning highly adaptive audit models in decentralised and dynamic settings remains challenging. It requires the study of data distribution shifts over multiple clients and time periods. In this work, we propose a Federated Continual Learning framework enabling auditors to continuously learn audit models from decentralised clients. We evaluate the framework's ability to detect accounting anomalies in common scenarios of organizational activity. Our empirical results, using real-world datasets and combined federated continual learning strategies, demonstrate the learned model's ability to detect anomalies in audit settings with data distribution shifts.  ( 2 min )
    The Inconvenient Truths of Ground Truth for Binary Analysis. (arXiv:2210.15079v1 [cs.CR])
    The effectiveness of binary analysis tools and techniques is often measured with respect to how well they map to a ground truth. We have found that not all ground truths are created equal. This paper challenges the binary analysis community to take a long look at the concept of ground truth, to ensure that we agree on the definition(s) of ground truth, so that we can be confident in the evaluation of tools and techniques. This becomes even more important as we move to trained machine learning models, which are only as useful as the validity of the ground truth used in training.  ( 2 min )
    A new Stack Autoencoder: Neighbouring Sample Envelope Embedded Stack Autoencoder Ensemble Model. (arXiv:2210.14956v1 [cs.LG])
    Stack autoencoder (SAE), as a representative deep network, has unique and excellent performance in feature learning and has received extensive attention from researchers. However, existing deep SAEs focus on original samples without considering the hierarchical structural information between samples. To address this limitation, this paper proposes a new SAE model, the neighbouring envelope embedded stack autoencoder ensemble (NE_ESAE). First, the neighbouring sample envelope learning mechanism (NSELM) is proposed to preprocess the SAE input. NSELM constructs sample pairs by combining neighbouring samples, and builds multilayer sample spaces by multilayer iterative mean clustering, which groups similar samples and generates layers of envelope samples with hierarchical structural information. Second, an embedded stack autoencoder (ESAE) is proposed and trained in each layer of the sample space to account for the original samples during training and in the network structure, thereby better finding the relationship between original feature samples and deep feature samples. Third, feature reduction and base classification are performed on each layer of envelope samples, producing per-layer classification results. Finally, the classification results across the layers of the envelope sample space are fused through an ensemble mechanism. In the experimental section, the proposed algorithm is validated on over ten representative public datasets. The results show that our method performs significantly better than existing traditional feature learning methods and representative deep autoencoders.  ( 3 min )
    Rigid-Body Sound Synthesis with Differentiable Modal Resonators. (arXiv:2210.15306v1 [cs.SD])
    Physical models of rigid bodies are used for sound synthesis in applications from virtual environments to music production. Traditional methods such as modal synthesis often rely on computationally expensive numerical solvers, while recent deep learning approaches are limited by post-processing of their results. In this work we present a novel end-to-end framework for training a deep neural network to generate modal resonators for a given 2D shape and material, using a bank of differentiable IIR filters. We demonstrate our method on a dataset of synthetic objects, but train our model using an audio-domain objective, paving the way for physically-informed synthesisers to be learned directly from recordings of real-world objects.  ( 2 min )
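    For background, a single mode in classical modal synthesis is a two-pole IIR resonator whose pole angle sets the frequency and whose pole radius sets the decay; the sketch below is this textbook construction, not the paper's differentiable filter bank.

        import numpy as np
        from scipy.signal import lfilter

        def modal_resonator(freq_hz, decay_s, amp, sr=44100, dur=1.0):
            # one mode as a two-pole IIR resonator excited by an impulse
            r = np.exp(-1.0 / (decay_s * sr))        # per-sample decay
            theta = 2.0 * np.pi * freq_hz / sr       # pole angle = frequency
            b = [amp]
            a = [1.0, -2.0 * r * np.cos(theta), r * r]
            impulse = np.zeros(int(sr * dur))
            impulse[0] = 1.0
            return lfilter(b, a, impulse)

        # a struck object is then a sum of such modes with different
        # frequencies, decays, and amplitudes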
    One Arrow, Two Kills: An Unified Framework for Achieving Optimal Regret Guarantees in Sleeping Bandits. (arXiv:2210.14998v1 [cs.LG])
    We address the problem of \emph{`Internal Regret'} in \emph{Sleeping Bandits} in the fully adversarial setup, draw connections between different existing notions of sleeping regret in the multiarmed bandits (MAB) literature, and analyze the implications: Our first contribution is to propose a new notion of \emph{Internal Regret} for sleeping MAB. We then propose an algorithm that yields sublinear regret in that measure, even for a completely adversarial sequence of losses and availabilities. We further show that a low sleeping internal regret always implies a low external regret, as well as a low policy regret for i.i.d. sequences of losses. The main contribution of this work lies precisely in unifying different notions of existing regret in sleeping bandits and understanding their implications for one another. Finally, we also extend our results to the setting of \emph{Dueling Bandits} (DB), a preference-feedback variant of MAB, and propose a reduction-to-MAB idea to design a low-regret algorithm for sleeping dueling bandits with stochastic preferences and adversarial availabilities. The efficacy of our algorithms is justified through empirical evaluations.  ( 2 min )
    Environment Design for Inverse Reinforcement Learning. (arXiv:2210.14972v1 [cs.LG])
    The task of learning a reward function from expert demonstrations suffers from high sample complexity as well as inherent limitations to what can be learned from demonstrations in a given environment. As the samples used for reward learning require human input, which is generally expensive, much effort has been dedicated towards designing more sample-efficient algorithms. Moreover, even with abundant data, current methods can still fail to learn insightful reward functions that are robust to minor changes in the environment dynamics. We approach these challenges differently than prior work by improving the sample-efficiency as well as the robustness of learned rewards through adaptively designing a sequence of demonstration environments for the expert to act in. We formalise a framework for this environment design process in which learner and expert repeatedly interact, and construct algorithms that actively seek information about the rewards by carefully curating environments for the human to demonstrate the task in.  ( 2 min )
    Generalized Laplacian Regularized Framelet GCNs. (arXiv:2210.15092v1 [cs.LG])
    This paper introduces a novel framelet graph approach based on the p-Laplacian GNN. The two proposed models, the p-Laplacian undecimated framelet graph convolution (pL-UFG) and the generalized p-Laplacian undecimated framelet graph convolution (pL-fUFG), inherit the nature of the p-Laplacian together with the expressive power of multi-resolution decomposition of graph signals. The empirical study highlights the excellent performance of pL-UFG and pL-fUFG on different graph learning tasks, including node classification and signal denoising.  ( 2 min )
    Learning on the Job: Self-Rewarding Offline-to-Online Finetuning for Industrial Insertion of Novel Connectors from Vision. (arXiv:2210.15206v1 [cs.RO])
    Learning-based methods in robotics hold the promise of generalization, but what can be done if a learned policy does not generalize to a new situation? In principle, if an agent can at least evaluate its own success (i.e., with a reward classifier that generalizes well even when the policy does not), it could actively practice the task and finetune the policy in this situation. We study this problem in the setting of industrial insertion tasks, such as inserting connectors into sockets and setting screws. Existing algorithms rely on precise localization of the connector or socket and carefully managed physical setups, such as assembly lines, to succeed at the task. But in unstructured environments such as homes or even some industrial settings, robots cannot rely on precise localization and may be tasked with previously unseen connectors. Offline reinforcement learning on a variety of connector insertion tasks is a potential solution, but what if the robot is tasked with inserting a previously unseen connector? In such a scenario, we will still need methods that can robustly solve such tasks with online practice. One of the main observations we make in this work is that, with a suitable representation learning and domain generalization approach, it can be significantly easier for the reward function to generalize to a new but structurally similar task (e.g., inserting a new type of connector) than for the policy. This means that a learned reward function can be used to facilitate the finetuning of the robot's policy in situations where the policy fails to generalize in zero shot, but the reward function generalizes successfully. We show that such an approach can be instantiated in the real world, pretrained on 50 different connectors, and successfully finetuned to new connectors via the learned reward function. Videos can be viewed at https://sites.google.com/view/learningonthejob  ( 3 min )
    Deep-MDS Framework for Recovering the 3D Shape of 2D Landmarks from a Single Image. (arXiv:2210.15200v1 [cs.CV])
    In this paper, a low-parameter deep learning framework utilizing the Non-metric Multi-Dimensional Scaling (NMDS) method is proposed to recover the 3D shape of 2D landmarks on a human face from a single input image. Hence, the NMDS approach is used for the first time to establish a mapping from a 2D landmark space to the corresponding 3D shape space. A deep neural network learns the pairwise dissimilarity among 2D landmarks, used by the NMDS approach, whose objective is to learn the pairwise 3D Euclidean distances of the corresponding 2D landmarks on the input image. This scheme results in a symmetric dissimilarity matrix, with rank larger than 2, leading the NMDS approach toward appropriately recovering the 3D shape of the corresponding 2D landmarks. For posed images and complex image formation processes like perspective projection, which cause occlusion in the input image, we consider an autoencoder component in the proposed framework, as an occlusion-removal part, which turns different input views of the human face into a profile view. The results of a performance evaluation using different synthetic and real-world human face datasets, including the Basel Face Model (BFM), CelebA, CoMA - FLAME, and CASIA-3D, indicate the comparable performance of the proposed framework, despite its small number of training parameters, with related state-of-the-art and powerful 3D reconstruction methods from the literature, in terms of efficiency and accuracy.  ( 3 min )
    Deep Learning is Provably Robust to Symmetric Label Noise. (arXiv:2210.15083v1 [stat.ML])
    Deep neural networks (DNNs) are capable of perfectly fitting the training data, including memorizing noisy data. It is commonly believed that memorization hurts generalization. Therefore, many recent works propose mitigation strategies to avoid noisy data or correct memorization. In this work, we step back and ask the question: Can deep learning be robust against massive label noise without any mitigation? We provide an affirmative answer for the case of symmetric label noise: We find that certain DNNs, including under-parameterized and over-parameterized models, can tolerate massive symmetric label noise up to the information-theoretic threshold. By appealing to classical statistical theory and universal consistency of DNNs, we prove that for multiclass classification, $L_1$-consistent DNN classifiers trained under symmetric label noise can achieve Bayes optimality asymptotically if the label noise probability is less than $\frac{K-1}{K}$, where $K \ge 2$ is the number of classes. Our results show that for symmetric label noise, no mitigation is necessary for $L_1$-consistent estimators. We conjecture that for general label noise, mitigation strategies that make use of the noisy data will outperform those that ignore the noisy data.  ( 2 min )
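    To make the noise model concrete, the sketch below corrupts each label with probability noise_prob, replacing it with a uniformly random other class; the theorem above concerns the regime noise_prob < (K - 1) / K.

        import numpy as np

        def add_symmetric_noise(labels, noise_prob, num_classes, rng=None):
            # flip each label with probability noise_prob to a uniformly
            # random *other* class (symmetric label noise)
            rng = rng or np.random.default_rng()
            labels = labels.copy()
            flip = rng.random(len(labels)) < noise_prob
            for i in np.where(flip)[0]:
                choices = [c for c in range(num_classes) if c != labels[i]]
                labels[i] = rng.choice(choices)
            return labels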
    ViT-CAT: Parallel Vision Transformers with Cross Attention Fusion for Popularity Prediction in MEC Networks. (arXiv:2210.15125v1 [cs.LG])
    Mobile Edge Caching (MEC) is a revolutionary technology for the Sixth Generation (6G) of wireless networks, promising to significantly reduce users' latency by offering storage capacities at the edge of the network. The efficiency of the MEC network, however, critically depends on its ability to dynamically predict/update the storage of caching nodes with the top-K popular contents. Conventional statistical caching schemes are not robust to the time-variant nature of the underlying pattern of content requests, resulting in a surge of interest in using Deep Neural Networks (DNNs) for time-series popularity prediction in MEC networks. However, existing DNN models within the context of MEC fail to simultaneously capture both the temporal correlations of historical request patterns and the dependencies between multiple contents. This necessitates an urgent quest to develop and design a new and innovative popularity prediction architecture to tackle this critical challenge. The paper addresses this gap by proposing a novel hybrid caching framework based on the attention mechanism. Referred to as parallel Vision Transformers with Cross Attention (ViT-CAT) Fusion, the proposed architecture consists of two parallel ViT networks, one for collecting temporal correlations, and the other for capturing dependencies between different contents. Followed by a Cross Attention (CA) module as the Fusion Center (FC), the proposed ViT-CAT is capable of learning the mutual information between temporal and spatial correlations, resulting in improved classification accuracy and a roughly 8-fold reduction in model complexity. Based on the simulation results, the proposed ViT-CAT architecture outperforms its counterparts in classification accuracy, complexity, and cache-hit ratio.  ( 3 min )
    Partially Oblivious Neural Network Inference. (arXiv:2210.15189v1 [cs.CR])
    Oblivious inference is the task of outsourcing an ML model, like a neural network, without disclosing critical and sensitive information, like the model's parameters. Some of the most prominent solutions for secure oblivious inference are based on powerful cryptographic tools, like Homomorphic Encryption (HE) and/or multi-party computation (MPC). Even though implementations of oblivious inference schemes have improved impressively over the last decade, there are still significant limitations on the ML models that they can practically implement, especially when both the ML model and the input data's confidentiality must be protected. In this paper, we introduce the notion of partially oblivious inference. We empirically show that for neural network models, like CNNs, some information leakage can be acceptable. We therefore propose a novel trade-off between security and efficiency. In our research, we investigate the impact of partial leakage of the CNN model's weights on security and inference runtime performance. We experimentally demonstrate that in a CIFAR-10 network we can leak up to $80\%$ of the model's weights with practically no security impact, while the necessary HE multiplications are performed four times faster.  ( 2 min )
    Fast and Efficient Scene Categorization for Autonomous Driving using VAEs. (arXiv:2210.14981v1 [cs.CV])
    Scene categorization is a useful precursor task that provides prior knowledge for many advanced computer vision tasks, with a broad range of applications in content-based image indexing and retrieval systems. Despite the success of data-driven approaches in the field of computer vision, such as object detection, semantic segmentation, etc., their application to learning high-level features for scene recognition has not achieved the same level of success. We propose to generate a fast and efficient intermediate interpretable generalized global descriptor that captures coarse features from the image, and to use a classification head to map the descriptors to 3 scene categories: Rural, Urban and Suburban. We train a Variational Autoencoder in an unsupervised manner, map images to a constrained multi-dimensional latent space, and use the latent vectors as compact embeddings that serve as global descriptors for images. The experimental results provide evidence that the VAE latent vectors capture coarse information from the image, supporting their usage as global descriptors. The proposed global descriptor is very compact, with an embedding length of 128, is significantly faster to compute, and is robust to seasonal and illumination changes, while capturing sufficient scene information required for scene categorization.  ( 2 min )
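    A minimal sketch of the idea, assuming a toy convolutional encoder (the authors' architecture is not specified in the abstract): the encoder's 128-d mean vector serves as the global descriptor fed to the classification head.

        import torch
        import torch.nn as nn

        class ConvEncoder(nn.Module):
            """Toy VAE-style encoder mapping an image to a 128-d latent used
            as a global scene descriptor (illustrative, not the paper's net)."""
            def __init__(self, latent_dim=128):
                super().__init__()
                self.backbone = nn.Sequential(
                    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                    nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                )
                self.to_mu = nn.Linear(64, latent_dim)
                self.to_logvar = nn.Linear(64, latent_dim)

            def forward(self, x):
                h = self.backbone(x)
                # use the mean vector as the compact global descriptor
                return self.to_mu(h), self.to_logvar(h)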
    MEET: Mobility-Enhanced Edge inTelligence for Smart and Green 6G Networks. (arXiv:2210.15111v1 [cs.NI])
    Edge intelligence is an emerging paradigm for real-time training and inference at the wireless edge, thus enabling mission-critical applications. Accordingly, base stations (BSs) and edge servers (ESs) need to be densely deployed, leading to huge deployment and operation costs, in particular the energy costs. In this article, we propose a new framework called Mobility-Enhanced Edge inTelligence (MEET), which exploits the sensing, communication, computing, and self-powering capabilities of intelligent connected vehicles for the smart and green 6G networks. Specifically, the operators can incorporate infrastructural vehicles as movable BSs or ESs, and schedule them in a more flexible way to align with the communication and computation traffic fluctuations. Meanwhile, the remaining compute resources of opportunistic vehicles are exploited for edge training and inference, where mobility can further enhance edge intelligence by bringing more compute resources, communication opportunities, and diverse data. In this way, the deployment and operation costs are spread over the vastly available vehicles, so that the edge intelligence is realized cost-effectively and sustainably. Furthermore, these vehicles can be either powered by renewable energy to reduce carbon emissions, or charged more flexibly during off-peak hours to cut electricity bills.  ( 2 min )
    Improved Projection Learning for Lower Dimensional Feature Maps. (arXiv:2210.15170v1 [cs.LG])
    The requirement to repeatedly move large feature maps off- and on-chip during inference with convolutional neural networks (CNNs) imposes high costs in terms of both energy and time. In this work we explore an improved method for compressing all feature maps of pre-trained CNNs to below a specified limit. This is done by means of learned projections trained via end-to-end finetuning, which can then be folded and fused into the pre-trained network. We also introduce a new `ceiling compression' framework in which we evaluate such techniques in view of the future goal of performing inference fully on-chip.  ( 2 min )
    Dictionary-Assisted Supervised Contrastive Learning. (arXiv:2210.15172v1 [cs.CL])
    Text analysis in the social sciences often involves using specialized dictionaries to reason with abstract concepts, such as perceptions about the economy or abuse on social media. These dictionaries allow researchers to impart domain knowledge and note subtle usages of words relating to a concept(s) of interest. We introduce the dictionary-assisted supervised contrastive learning (DASCL) objective, allowing researchers to leverage specialized dictionaries when fine-tuning pretrained language models. The text is first keyword simplified: a common, fixed token replaces any word in the corpus that appears in the dictionary(ies) relevant to the concept of interest. During fine-tuning, a supervised contrastive objective draws closer the embeddings of the original and keyword-simplified texts of the same class while pushing further apart the embeddings of different classes. The keyword-simplified texts of the same class are more textually similar than their original text counterparts, which additionally draws the embeddings of the same class closer together. Combining DASCL and cross-entropy improves classification performance metrics in few-shot learning settings and social science applications compared to using cross-entropy alone and alternative contrastive and data augmentation methods.  ( 2 min )
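    As a concrete illustration of the keyword-simplification step, the sketch below replaces every dictionary word with one fixed placeholder token; the token string and function name are ours, not part of DASCL.

        import re

        def keyword_simplify(text, dictionary, token="[KW]"):
            # replace any word that appears in the concept dictionary with a
            # single fixed token, keeping all other words untouched
            words = {w.lower() for w in dictionary}
            simplified = [token if re.sub(r"\W", "", w).lower() in words else w
                          for w in text.split()]
            return " ".join(simplified)

        # keyword_simplify("The economy is crashing!", {"economy", "crashing"})
        # -> "The [KW] is [KW]"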
    Addressing Heterogeneity in Federated Learning via Distributional Transformation. (arXiv:2210.15025v1 [cs.CV])
    Federated learning (FL) allows multiple clients to collaboratively train a deep learning model. One major challenge of FL arises when the data distribution is heterogeneous, i.e., differs from one client to another. Existing personalized FL algorithms are only applicable to narrow cases, e.g., one or two data classes per client, and therefore they do not satisfactorily address FL under varying levels of data heterogeneity. In this paper, we propose a novel framework, called DisTrans, to improve FL performance (i.e., model accuracy) via train- and test-time distributional transformations along with a double-input-channel model structure. DisTrans works by optimizing distributional offsets and models for each FL client to shift their data distribution, and aggregates these offsets at the FL server to further improve performance in the presence of distributional heterogeneity. Our evaluation on multiple benchmark datasets shows that DisTrans outperforms state-of-the-art FL methods and data augmentation methods under various settings and different degrees of client distributional heterogeneity.  ( 2 min )
    TILDE-Q: A Transformation Invariant Loss Function for Time-Series Forecasting. (arXiv:2210.15050v1 [cs.LG])
    Time-series forecasting has caught increasing attention in the AI research field due to its importance in solving real-world problems across different domains, such as energy, weather, traffic, and economy. As shown in various types of data, dealing with the drastic changes, temporal patterns, and shapes in sequential data that previous models predict poorly has become an unavoidable issue. One reason is that most approaches to time-series forecasting minimize $L_p$-norm distances as loss functions, such as mean absolute error (MAE) or mean square error (MSE). These loss functions fail both to model temporal dynamics and to capture the shape of signals. In addition, they often cause models to misbehave and return results uncorrelated with the original time-series. To be effective, a loss function has to be invariant to a set of distortions between two time-series rather than just comparing exact values. In this paper, we propose a novel loss function, called TILDE-Q (Transformation Invariant Loss function with Distance EQuilibrium), that not only considers distortions in amplitude and phase but also allows models to capture the shape of time-series sequences. In addition, TILDE-Q supports modeling periodic and non-periodic temporal dynamics at the same time. We evaluate the effectiveness of TILDE-Q by conducting extensive experiments under periodic and non-periodic conditions of data, from naive models to state-of-the-art models. The experimental results indicate that models trained with TILDE-Q outperform those trained with other metrics such as MSE, dynamic time warping (DTW), temporal distortion index (TDI), and longest common subsequence (LCSS).  ( 3 min )
    A Hierarchical Approach to Conditional Random Fields for System Anomaly Detection. (arXiv:2210.15030v1 [cs.LG])
    Anomaly detection that recognizes unusual events in large-scale systems in a time-sensitive manner is critical in many industries, e.g., bank fraud, enterprise systems, and medical alerts. Large-scale systems often grow in size and complexity over time, and anomaly detection algorithms need to adapt to changing structures. A hierarchical approach takes advantage of the implicit relationships in complex systems and of localized context. The features in complex systems may vary drastically in data distribution, capturing different aspects from multiple data sources, and when put together they provide a more complete view of the system. In this paper, two datasets are considered: the first comprising system metrics from machines running on a cloud service, and the second application metrics from a distributed software system with inherent hierarchies and interconnections amongst its system nodes. A comparison across the changepoint-based PELT algorithm, cognitive-learning-based Hierarchical Temporal Memory algorithms, Support Vector Machines, and Conditional Random Fields provides a basis for proposing a hierarchical global-local Conditional Random Field approach to accurately capture anomalies in complex systems across various features. Hierarchical algorithms can learn both the intricacies of lower-level or specific features and utilize these in a globally abstracted representation to detect anomalous patterns robustly across multi-source feature data and distributed systems. A graphical network analysis of complex systems can further fine-tune datasets to mine relationships based on available features, which can benefit hierarchical models. Furthermore, hierarchical solutions can adapt well to changes at a localized level, learning on new data and changing environments when parts of a system are overhauled, and can translate these learnings to a global view of the system over time.  ( 3 min )
    The Art NFTs and Their Marketplaces. (arXiv:2210.14942v1 [q-fin.ST])
    Non-Fungible Tokens (NFTs) are crypto assets with a unique digital identifier for ownership, powered by blockchain technology. Technically speaking, anything digital could be minted and sold as an NFT, which provides proof of ownership and authenticity of a digital file. For this reason, it helps us distinguish between originals and their copies, making it possible to trade them. This paper focuses on art NFTs, which change how artists can sell their products. The technology also changes how the art trade market works, since it cuts out the middleman. Recently, the utility of NFTs, i.e., the usefulness, profitability, and benefits they bring to their owners, has become an essential issue in the NFT ecosystem. Using recent datasets from major art NFT marketplaces, we summarize and interpret the current market trends and patterns in a way that brings insight into the future art market. Numerical examples are presented.  ( 2 min )
    Interstellar Object Accessibility and Mission Design. (arXiv:2210.14980v1 [astro-ph.EP])
    Interstellar objects (ISOs) are fascinating and under-explored celestial objects, providing physical laboratories to understand the formation of our solar system and probe the composition and properties of material formed in exoplanetary systems. This paper discusses the accessibility of, and mission design to, ISOs with varying characteristics, including a discussion of state covariance estimation over the course of a cruise, handoffs from traditional navigation approaches to novel autonomous navigation for fast flyby regimes, and overall recommendations about preparing for the future in situ exploration of these targets. The lessons learned also apply to fast flybys of other small bodies, including long-period comets and potentially hazardous asteroids, which also require a tactical response with similar characteristics.  ( 2 min )
    Disentangled Text Representation Learning with Information-Theoretic Perspective for Adversarial Robustness. (arXiv:2210.14957v1 [cs.CL])
    Adversarial vulnerability remains a major obstacle to constructing reliable NLP systems. When imperceptible perturbations are added to raw input text, the performance of a deep learning model may drop dramatically under attack. Recent work argues that the adversarial vulnerability of a model is caused by the non-robust features learned in supervised training. Thus, in this paper, we tackle the adversarial robustness challenge from the view of disentangled representation learning, which is able to explicitly disentangle robust and non-robust features in text. Specifically, inspired by the variation of information (VI) in information theory, we derive a disentangled learning objective composed of mutual information terms that represent both the semantic representativeness of the latent embeddings and the differentiation between robust and non-robust features. On this basis, we design a disentangled learning network to estimate these mutual information terms. Experiments on text classification and entailment tasks show that our method significantly outperforms representative methods under adversarial attacks, indicating that discarding non-robust features is critical for improving adversarial robustness.  ( 2 min )
    An Intelligent Decision Support Ensemble Voting Model for Coronary Artery Disease Prediction in Smart Healthcare Monitoring Environments. (arXiv:2210.14906v1 [cs.LG])
    Coronary artery disease (CAD) is one of the most common cardiac diseases worldwide and causes disability and economic burden. It is the world's leading and most serious cause of mortality, with approximately 80% of deaths reported in low- and middle-income countries. The preferred and most precise diagnostic tool for CAD is angiography, but it is invasive, expensive, and technically demanding. However, the research community is increasingly interested in the computer-aided diagnosis of CAD via the utilization of machine learning (ML) methods. The purpose of this work is to present an e-diagnosis tool based on ML algorithms that can be used in a smart healthcare monitoring system. We applied the most accurate machine learning methods that have shown superior results in the literature on various medical datasets: RandomForest, XGBoost, MLP, J48, AdaBoost, NaiveBayes, LogitBoost, and KNN. Since any single classifier may excel on a different dataset, an ensemble model using majority voting was designed to take advantage of the best-performing single classifiers. Ensemble learning combines the forecasts of multiple individual classifiers to achieve higher precision, specificity, sensitivity, and accuracy than any individual classifier. Furthermore, we benchmarked our proposed model against the most efficient and well-known ensemble models, such as Bagging and Stacking, based on a cross-validation technique. The experimental results confirm that the majority-voting ensemble based on the top 3 classifiers, MultilayerPerceptron, RandomForest, and AdaBoost, achieves the highest accuracy of 88.12% and outperforms all other classifiers. This study demonstrates that the proposed majority-voting ensemble approach is the most accurate machine learning classification approach for the prediction and detection of coronary artery disease.  ( 3 min )
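    For readers who want to reproduce the flavor of the final ensemble, a minimal scikit-learn analogue of the top-3 majority-voting model is sketched below; the hyperparameters are placeholders, not the paper's settings.

        from sklearn.ensemble import (RandomForestClassifier, AdaBoostClassifier,
                                      VotingClassifier)
        from sklearn.neural_network import MLPClassifier

        # majority ("hard") voting over the three best single classifiers
        ensemble = VotingClassifier(
            estimators=[
                ("mlp", MLPClassifier(max_iter=500)),
                ("rf", RandomForestClassifier(n_estimators=200)),
                ("ada", AdaBoostClassifier(n_estimators=100)),
            ],
            voting="hard",
        )
        # ensemble.fit(X_train, y_train); ensemble.score(X_test, y_test)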
    TPU-MLIR: A Compiler For TPU Using MLIR. (arXiv:2210.15016v1 [cs.PL])
    Multi-level intermediate representations (MLIR) show great promise for reducing the cost of building domain-specific compilers by providing a reusable and extensible compiler infrastructure. This work presents TPU-MLIR, an end-to-end compiler based on MLIR that deploys pre-trained neural network (NN) models to a custom ASIC called a Tensor Processing Unit (TPU). TPU-MLIR defines two new dialects to implement its functionality: 1. a Tensor operation (TOP) dialect that encodes the deep learning graph semantics independently of any deep learning framework, and 2. a TPU kernel dialect that provides standard kernel computations on the TPU. An NN model is translated to the TOP dialect and then lowered to the TPU dialect for different TPUs according to the chip's configuration. We demonstrate how to use the MLIR pass pipeline to organize and perform optimizations on the TPU to generate machine code. The paper also presents a verification procedure to ensure the correctness of each transformation stage.  ( 2 min )
    AltUB: Alternating Training Method to Update Base Distribution of Normalizing Flow for Anomaly Detection. (arXiv:2210.14913v1 [cs.LG])
    Unsupervised anomaly detection is coming into the spotlight these days in various practical domains due to the limited amount of anomaly data. One of the major approaches to it is the normalizing flow, which pursues an invertible transformation of a complex distribution, such as images, into an easy one, such as N(0, I). In fact, algorithms based on normalizing flows, like FastFlow and CFLOW-AD, establish state-of-the-art performance on unsupervised anomaly detection tasks. Nevertheless, we find that these algorithms convert normal images not into N(0, I), their nominal destination, but into an arbitrary normal distribution. Moreover, their performance is often unstable, which is highly critical for unsupervised tasks because validation data are not provided. To address these observations, we propose a simple solution, AltUB, which introduces alternating training to update the base distribution of the normalizing flow for anomaly detection. AltUB effectively improves the stability of the performance of normalizing flows. Furthermore, our method achieves a new state-of-the-art performance on the anomaly segmentation task of the MVTec AD dataset with 98.8% AUROC.  ( 2 min )
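    A rough sketch of the alternating schedule the abstract describes, assuming a `flow` object whose `log_prob` depends on learnable base-distribution parameters held by a separate optimizer; the actual AltUB update rule may differ.

        def alternating_step(flow, batch, opt_flow, opt_base):
            # phase 1: update the flow's transformation parameters
            loss = -flow.log_prob(batch).mean()      # standard NF objective
            opt_flow.zero_grad(); loss.backward(); opt_flow.step()

            # phase 2: update only the base distribution's parameters
            # (opt_base is assumed to hold the learnable mean/scale of the base)
            loss = -flow.log_prob(batch).mean()      # recompute with updated flow
            opt_base.zero_grad(); loss.backward(); opt_base.step()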
    Trade-off between reconstruction loss and feature alignment for domain generalization. (arXiv:2210.15000v1 [cs.LG])
    Domain generalization (DG) is a branch of transfer learning that aims to train the learning models on several seen domains and subsequently apply these pre-trained models to other unseen (unknown but related) domains. To deal with challenging settings in DG where both data and label of the unseen domain are not available at training time, the most common approach is to design the classifiers based on the domain-invariant representation features, i.e., the latent representations that are unchanged and transferable between domains. Contrary to popular belief, we show that designing classifiers based on invariant representation features alone is necessary but insufficient in DG. Our analysis indicates the necessity of imposing a constraint on the reconstruction loss induced by representation functions to preserve most of the relevant information about the label in the latent space. More importantly, we point out the trade-off between minimizing the reconstruction loss and achieving domain alignment in DG. Our theoretical results motivate a new DG framework that jointly optimizes the reconstruction loss and the domain discrepancy. Both theoretical and numerical results are provided to justify our approach.  ( 2 min )
    Characterizing Datapoints via Second-Split Forgetting. (arXiv:2210.15031v1 [cs.LG])
    Researchers investigating example hardness have increasingly focused on the dynamics by which neural networks learn and forget examples throughout training. Popular metrics derived from these dynamics include (i) the epoch at which examples are first correctly classified; (ii) the number of times their predictions flip during training; and (iii) whether their prediction flips if they are held out. However, these metrics do not distinguish among examples that are hard for distinct reasons, such as membership in a rare subpopulation, being mislabeled, or belonging to a complex subpopulation. In this paper, we propose second-split forgetting time (SSFT), a complementary metric that tracks the epoch (if any) after which an original training example is forgotten as the network is fine-tuned on a randomly held-out partition of the data. Across multiple benchmark datasets and modalities, we demonstrate that mislabeled examples are forgotten quickly, and seemingly rare examples are forgotten comparatively slowly. By contrast, metrics that consider only first-split learning dynamics struggle to differentiate the two. At large learning rates, SSFT tends to be robust across architectures, optimizers, and random seeds. From a practical standpoint, the SSFT can (i) help identify mislabeled samples, the removal of which improves generalization; and (ii) provide insights about failure modes. Through theoretical analysis of overparameterized linear models, we provide insights into how the observed phenomena may arise. Code for reproducing our experiments can be found here: https://github.com/pratyushmaini/ssft  ( 2 min )
    Automated Diagnosis of Cardiovascular Diseases from Cardiac Magnetic Resonance Imaging Using Deep Learning Models: A Review. (arXiv:2210.14909v1 [eess.IV])
    In recent years, cardiovascular diseases (CVDs) have become one of the leading causes of mortality globally. CVDs appear with minor symptoms and progressively worsen. The majority of people experience symptoms such as exhaustion, shortness of breath, ankle swelling, and fluid retention, among others, at the onset of CVD. Coronary artery disease (CAD), arrhythmia, cardiomyopathy, congenital heart defect (CHD), mitral regurgitation, and angina are the most common CVDs. Clinical methods such as blood tests, electrocardiography (ECG) signals, and medical imaging are the most effective methods used for the detection of CVDs. Among the diagnostic methods, cardiac magnetic resonance imaging (CMR) is increasingly used to diagnose and monitor disease, plan treatment, and predict CVDs. Despite all the advantages of CMR data, CVD diagnosis remains challenging for physicians due to the many slices of data, low contrast, etc. To address these issues, deep learning (DL) techniques have been employed in the diagnosis of CVDs using CMR data, and much research is currently being conducted in this field. This review provides an overview of the studies on CVD detection using CMR images and DL techniques. The introduction examines CVD types, diagnostic methods, and the most important medical imaging techniques. Subsequently, investigations into detecting CVDs using CMR images and the most significant DL methods are presented. Another section discusses the challenges in diagnosing CVDs from CMR data. Next, the discussion section reviews the results of this survey, and future work on CVD diagnosis from CMR images with DL techniques is outlined. The most important findings of this study are presented in the conclusion.  ( 3 min )
    RulE: Neural-Symbolic Knowledge Graph Reasoning with Rule Embedding. (arXiv:2210.14905v1 [cs.AI])
    Knowledge graph (KG) reasoning is an important problem for knowledge graphs: it predicts missing links by reasoning on existing facts. Knowledge graph embedding (KGE) is one of the most popular methods to address this problem. It embeds entities and relations into low-dimensional vectors and uses the learned entity/relation embeddings to predict missing facts. However, KGE only uses zeroth-order (propositional) logic to encode existing triplets (e.g., ``Alice is Bob's wife."); it is unable to leverage first-order (predicate) logic to represent generally applicable logical \textbf{rules} (e.g., ``$\forall x,y \colon x ~\text{is}~ y\text{'s wife} \rightarrow y ~\text{is}~ x\text{'s husband}$''). On the other hand, traditional rule-based KG reasoning methods usually rely on hard logical rule inference, making them brittle and hardly competitive with KGE. In this paper, we propose RulE, a novel and principled framework to represent and model logical rules and triplets. RulE jointly represents entities, relations and logical rules in a unified embedding space. By learning an embedding for each logical rule, RulE can perform logical rule inference in a soft way and give a confidence score to each grounded rule, similar to how KGE gives each triplet a confidence score. Compared to KGE alone, RulE allows injecting prior logical rule information into the embedding space, which improves the generalization of knowledge graph embedding. Besides, the learned confidence scores of rules improve the logical rule inference process by softly controlling the contribution of each rule, which alleviates the brittleness of logic. We evaluate our method with link prediction tasks. Experimental results on multiple benchmark KGs demonstrate the effectiveness of RulE.  ( 3 min )
    Private Isotonic Regression. (arXiv:2210.15175v1 [cs.LG])
    In this paper, we consider the problem of differentially private (DP) algorithms for isotonic regression. For the most general problem of isotonic regression over a partially ordered set (poset) $\mathcal{X}$ and for any Lipschitz loss function, we obtain a pure-DP algorithm that, given $n$ input points, has an expected excess empirical risk of roughly $\mathrm{width}(\mathcal{X}) \cdot \log|\mathcal{X}| / n$, where $\mathrm{width}(\mathcal{X})$ is the width of the poset. In contrast, we also obtain a near-matching lower bound of roughly $(\mathrm{width}(\mathcal{X}) + \log |\mathcal{X}|) / n$, that holds even for approximate-DP algorithms. Moreover, we show that the above bounds are essentially the best that can be obtained without utilizing any further structure of the poset. In the special case of a totally ordered set and for $\ell_1$ and $\ell_2^2$ losses, our algorithm can be implemented in near-linear running time; we also provide extensions of this algorithm to the problem of private isotonic regression with additional structural constraints on the output function.  ( 2 min )
    Predicting Visual Attention and Distraction During Visual Search Using Convolutional Neural Networks. (arXiv:2210.15093v1 [cs.CV])
    Most studies in computational modeling of visual attention encompass task-free observation of images, but free-viewing saliency covers only a limited set of daily-life scenarios: most visual activities are goal-oriented and demand a great amount of top-down attention control. Visual search, in particular, demands more top-down control of attention than free viewing. In this paper, we present two approaches to model the visual attention and distraction of observers during visual search. Our first approach adapts a lightweight free-viewing saliency model to predict eye-fixation density maps of human observers over pixels of search images, using a two-stream convolutional encoder-decoder network trained and evaluated on the COCO-Search18 dataset. This method predicts which locations are more distracting when searching for a particular target. Our network achieves good results on standard saliency metrics (AUC-Judd=0.95, AUC-Borji=0.85, sAUC=0.84, NSS=4.64, KLD=0.93, CC=0.72, SIM=0.54, and IG=2.59). Our second approach is object-based and predicts the distractor and target objects during visual search. Distractors are all objects except the target that observers fixate on during search. This method uses a Mask-RCNN segmentation network pre-trained on MS-COCO and fine-tuned on COCO-Search18. We release our segmentation annotations of targets and distractors in COCO-Search18 for three target categories: bottle, bowl, and car. The average scores over the three categories are: F1-score=0.64, MAP(iou:0.5)=0.57, MAR(iou:0.5)=0.73. Our implementation code in Tensorflow is publicly available at https://github.com/ManooshSamiei/Distraction-Visual-Search .  ( 3 min )
    Federated Graph Representation Learning using Self-Supervision. (arXiv:2210.15120v1 [cs.LG])
    Federated graph representation learning (FedGRL) brings the benefits of distributed training to graph-structured data while simultaneously addressing some privacy and compliance concerns related to data curation. However, several interesting real-world graph data characteristics, viz. label deficiency and downstream task heterogeneity, are not taken into consideration in current FedGRL setups. In this paper, we consider a realistic and novel problem setting, wherein cross-silo clients have access to vast amounts of unlabeled data with limited or no labeled data and additionally have diverse downstream class label domains. We then propose a novel FedGRL formulation based on model interpolation where we aim to learn a shared global model that is optimized collaboratively using a self-supervised objective and gets downstream task supervision through local client models. We provide a specific instantiation of our general formulation using BGRL, a SoTA self-supervised graph representation learning method, and we empirically verify its effectiveness through realistic cross-silo datasets: (1) we adapt the Twitch Gamer Network, which naturally simulates a cross-geo scenario, and show that our formulation provides consistent gains of 6.1% on avg. over traditional supervised federated learning objectives and of 1.7% on avg. over individual client-specific self-supervised training, and (2) we construct and introduce a new cross-silo dataset called Amazon Co-purchase Networks that has both characteristics of the motivating problem setting. Here we observe gains of 11.5% on avg. over traditional supervised federated learning and of 1.9% on avg. over individually trained self-supervised models. Both experimental results point to the effectiveness of our proposed formulation. Finally, both our novel problem setting and dataset contributions provide new avenues for research in FedGRL.  ( 3 min )
    Light-weighted CNN-Attention based architecture for Hand Gesture Recognition via ElectroMyography. (arXiv:2210.15119v1 [cs.LG])
    Advancements in Biological Signal Processing (BSP) and Machine Learning (ML) models have paved the path for the development of novel immersive Human-Machine Interfaces (HMI). In this context, there has been a surge of significant interest in Hand Gesture Recognition (HGR) utilizing Surface-Electromyogram (sEMG) signals. This is due to its unique potential for decoding wearable data to interpret human intent for immersion in Mixed Reality (MR) environments. To achieve the highest possible accuracy, complicated and heavy-weight Deep Neural Networks (DNNs) are typically developed, which restricts their practical application in low-power and resource-constrained wearable systems. In this work, we propose a lightweight hybrid architecture (HDCAM) based on a Convolutional Neural Network (CNN) and an attention mechanism to effectively extract local and global representations of the input. The proposed HDCAM model with 58,441 parameters reached new state-of-the-art (SOTA) performance with 82.91% and 81.28% accuracy on window sizes of 300 ms and 200 ms for classifying 17 hand gestures. The number of parameters to train the proposed HDCAM architecture is 18.87 times less than its previous SOTA counterpart.  ( 2 min )
    Contrastive Decoding: Open-ended Text Generation as Optimization. (arXiv:2210.15097v1 [cs.CL])
    Likelihood, although useful as a training loss, is a poor search objective for guiding open-ended generation from language models (LMs). Existing generation algorithms must avoid both unlikely strings, which are incoherent, and highly likely ones, which are short and repetitive. We propose contrastive decoding (CD), a more reliable search objective that returns the difference between likelihood under a large LM (called the expert, e.g. OPT-13b) and a small LM (called the amateur, e.g. OPT-125m). CD is inspired by the fact that the failures of larger LMs are even more prevalent in smaller LMs, and that this difference signals exactly which texts should be preferred. CD requires zero training, and produces higher quality text than decoding from the larger LM alone. It also generalizes across model types (OPT and GPT2) and significantly outperforms four strong decoding algorithms in automatic and human evaluations.  ( 2 min )
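    A simplified greedy sketch of the idea (the paper's full method also uses an adaptive plausibility constraint, approximated here by keeping only tokens whose expert probability is within a factor of 0.1 of the maximum); GPT-2 checkpoints, which the paper also evaluates, stand in for the expert/amateur pair:
        import math
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("gpt2")  # gpt2 and gpt2-xl share a tokenizer
        expert = AutoModelForCausalLM.from_pretrained("gpt2-xl")
        amateur = AutoModelForCausalLM.from_pretrained("gpt2")

        ids = tok("The meaning of life is", return_tensors="pt").input_ids
        with torch.no_grad():
            for _ in range(50):
                lp_e = expert(ids).logits[:, -1].log_softmax(-1)
                lp_a = amateur(ids).logits[:, -1].log_softmax(-1)
                # mask tokens the expert finds implausible, then score by the log-prob gap
                implausible = lp_e < lp_e.max(-1, keepdim=True).values + math.log(0.1)
                scores = (lp_e - lp_a).masked_fill(implausible, -float("inf"))
                ids = torch.cat([ids, scores.argmax(-1, keepdim=True)], dim=-1)
        print(tok.decode(ids[0]))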

  • Open

    [P] Computer Vision made easy with a new Open Source Framework. Quick and simple.
    Hi all, Ikomia API is an open source tool for building and deploying Computer Vision solutions without much effort. It's a Python library built upon a C++ core. You can mix your preferred frameworks such as OpenCV, Detectron2, OpenMMLab or YOLO with the best state-of-the-art algorithms from individual repos. At Ikomia, we deeply believe that sharing scientific knowledge is the key to success, that's why we try to make research-based algorithms ready-to-use for developers. Ikomia is developed by developers for developers :) More info on our GitHub. You can try our Google Colab on YOLOv7 here if you are interested. Support and feedback are welcome ;) Enjoy ! submitted by /u/gdemarcq [link] [comments]  ( 58 min )
    [D] Self-supervised/collaborative embedding?
    My search foo has failed me, so I'm wondering if anyone has heard of using a pair of NNs with identical architectures, but different seeds, to train a model for producing embedding vectors without defining a supervised learning task. The outputs of the NNs would be the embedding vector directly. The two NNs would be trained synchronously and use the output of the other as the label with the training continuing until the outputs of the NNs become sufficiently close. Has anyone heard of such a thing? submitted by /u/AmalgamDragon [link] [comments]  ( 60 min )
    [R] GPT for robotics: Perception-Action Causal Transformer (PACT)
    Blog post: https://www.microsoft.com/en-us/research/group/autonomous-systems-group-robotics/articles/perception-action-causal-transformer-for-autoregressive-robotics-pretraining/ Highlights:
    - Transformer model for autoregressive prediction of states and actions over time, to implicitly encode dynamics and behaviors for a particular robot
    - The learned representation can be fine-tuned to distinct tasks (navigation, mapping, localization) with minimal data
    - The base transformer model serves as a generative robotics model, similarly to GPT-3, and can be prompted to produce different robot behaviors
    Paper: https://arxiv.org/abs/2209.11133 Video: https://youtu.be/mNQvQu_atuw Code and data: https://github.com/microsoft/PACT submitted by /u/CheapBreakfast9 [link] [comments]  ( 58 min )
    [D] Do companies actually care about their model's training/inference speed?
    I serve the AI industry, primarily building, configuring and selling GPU-accelerated workstations/servers and cloud instances. Most people and companies buy and rent these things based on necessity. *You can't really dig holes effectively if you don't have a shovel, kind of thing.* I'm obviously not the only provider in the market. And I'm not one of the largest. Some choose me because I save them a lot of money and some choose me because I'm really really good at what I do (configuring and optimizing). (Yes, I'm confident enough to put that out there.) When I'm taking care of an upgrade situation, it's usually because of one of two things: the hardware is outdated and needs a refresh to be able to support modern processing tools, or the client's project is scaling and they need more compute power or VRAM (generally). My question: is there anyone (or any company) out there who actually cares to upgrade based on speed? Like, is anyone going through the upgrade process simply because they want to train their models faster (save time)? Or bring more value to their clients by having their models run inference faster? I'd like anyone's opinion on this, but if you fit the description of this type of client, I'd like to know the thought process behind upgrading, whether you've been through it in the past or are going through it now. submitted by /u/GPUaccelerated [link] [comments]  ( 61 min )
    [R] "Re3: Generating Longer Stories With Recursive Reprompting and Revision" - Generating stories of 2000+ words (or even much longer)
    arXiv: https://arxiv.org/abs/2210.06774 PDF: https://arxiv.org/pdf/2210.06774 (long samples around the middle of the PDF) GitHub: https://github.com/yangkevin2/emnlp22-re3-story-generation I'm not affiliated with the project; just working with ML. submitted by /u/0xWTC [link] [comments]  ( 58 min )
    [P] OpenAI Whisper - 3x CPU Inference Speedup
    Applying a simple post-training Dynamic Quantization process included with PyTorch to OpenAI Whisper provides great speedups for CPU-based deployment. This is of particular interest for people running OpenAI Whisper models on laptops which lack hardware acceleration. Anecdotal results show that accuracy for the smaller models is the same, if not slightly higher, after quantization, but is very slightly reduced for the larger (medium+) models. Below results are for transcribing 30 seconds of audio:
    Whisper Model | Pre-Quant (secs) | Post-Quant (secs) | Speedup
    tiny          | 2.3              | 3.1               | 0.74x slowdown
    base          | 5.2              | 3.2               | 1.62x speedup
    small         | 19.1             | 6.9               | 2.76x speedup
    medium        | 60.7             | 23.1              | 2.62x speedup
    Others have found even greater speedups for the large model, roughly 3.25x. openai-whisper-cpu (GitHub) submitted by /u/Ok-Alps-7918 [link] [comments]  ( 57 min )
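    A minimal sketch of the quantization step itself (assuming the openai-whisper package and a local audio file; the path is a placeholder):
        import torch
        import whisper  # pip install openai-whisper

        model = whisper.load_model("base")
        quantized = torch.quantization.quantize_dynamic(
            model, {torch.nn.Linear}, dtype=torch.qint8  # quantize Linear layers to int8
        )
        print(quantized.transcribe("audio.wav")["text"])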
    [D] [R] Large-scale clustering
    Is there any computationally efficient implementation of clustering methods, beyond vanilla k-means, for lots of data? Let's say 1M data points of 100s of dimensions. I know about scikit-learn and a couple of others, but I wonder if they will suit those data sizes. submitted by /u/jesusfbes [link] [comments]  ( 57 min )
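    At that scale, scikit-learn's MiniBatchKMeans usually holds up well (faiss is a common option for GPU or approximate variants); a quick sketch with random placeholder data:
        import numpy as np
        from sklearn.cluster import MiniBatchKMeans

        X = np.random.rand(1_000_000, 100).astype(np.float32)   # placeholder data, ~400 MB
        km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000)  # fits on mini-batches
        labels = km.fit_predict(X)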
    [P] Any open source LLMs with comparable ability to Gpt-3 Davinci-2 for natural language to JSON parsing? Bloom and GPT-NEO have been underwhelming.
    See title. I’ve tried both Bloom and GPT-NEO, but either I’m prompting wrong or they can’t compete with GPT-3 for converting text into JSON. My specific use case is parsing stories into relevant objects like characters, themes, settings, and events. I’ll post an example of Davinci’s parsing of Poe’s “The Tell-Tale Heart” in the comments; it’s the ideal level of quality I’d like to replicate on a self-hosted model. submitted by /u/laul_pogan [link] [comments]  ( 59 min )
    [P] How to combine multiple low-quality data into one giant high-quality data?
    What is this process called? I remember it being discussed on this sub a long time ago, but I can’t find anything on it. Let's say, for example, that I have multiple low-resolution cameras that gave me 10000 images of some area from different angles. How would I combine all those pictures to produce a single high-quality picture of this area? Any input is appreciated. submitted by /u/Ali00100 [link] [comments]  ( 57 min )
    [R] RTFormer : Real-Time Semantic Segmentation with Transformer
    Hi, I read a semantic segmentation paper and am sharing it with you. Hope you enjoy it. This paper proposes RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation, which achieves a better trade-off between performance and efficiency than CNN-based models. To achieve high inference efficiency on GPU-like devices, RTFormer leverages GPU-Friendly Attention with linear complexity and discards the multi-head mechanism. Besides, cross-resolution attention is more efficient at gathering global context information for the high-resolution branch by spreading the high-level knowledge learned from the low-resolution branch. Extensive experiments on mainstream benchmarks demonstrate the effectiveness of the proposed RTFormer: it achieves state-of-the-art on Cityscapes, CamVid and COCOStuff, and shows promising results on ADE20K. Official code is available at: https://github.com/PaddlePaddle/PaddleSeg Arxiv: https://arxiv.org/abs/2210.07124 submitted by /u/Effective_Tax_2096 [link] [comments]  ( 60 min )
    [R] WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models
    Our paper "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models" was accepted to NeurIPS 2022, Datasets and Benchmark. Paper: http://arxiv.org/abs/2207.12576 Website: http://winogavil.github.io Huggingface: https://huggingface.co/datasets/nlphuji/winogavil Colab: https://colab.research.google.com/drive/19qcPovniLj2PiLlP75oFgsK-uhTr6SSi Which images best fit the cue werewolf? Did you know V&L AI models only get ~50% on our challenging WinoGAViL association task but humans get 90%? https://preview.redd.it/upvig7xbrcw91.png?width=1658&format=png&auto=webp&s=8f733a993bb7a6bf2a1377d39ae830e46e9c8cab Introducing WinoGAViL, an online game you can play now against AI! WinoGAViL is a dynamic benchmark for evaluating V&L models. Inspired by the popular card game C…  ( 59 min )
    [D] Any models that will animate images like DeepNostalgia?
    Have been trying to find a github repo that offers similar functionality. For those that aren't familiar, it takes an image with a face and will nicely animate that photo (head movements, blinking). Unfortunately after 2 videos they want you to pay 200 dollars for a year of their service. While I wouldn't mind paying for this as a standalone feature for a cheaper price, most of the other features in the bundled package are of no interest or value to me. I have tried to achieve this effect manually with Adobe After Effects, learning about displacement maps and making one in photoshop, but I wasn't able to get great results unfortunately. Thank you so much! submitted by /u/DavidUpInHere [link] [comments]  ( 63 min )
    [D] Suggestions for state-of-the-art genetic algorithm implementations or alternatives for black box optimization?
    I build various hardware components that take in bias voltages to tune them to various states. No closed-form relationship exists to find these values, and the output is very finicky with respect to the inputs. Furthermore, they need to be calibrated over a wide range of desired outputs, and different fabrications of the hardware need different tuning voltages. I have anywhere from 4 to 8 tuning variables. Because manual tuning isn't feasible, I want to look into genetic algorithms or other machine learning methods for this. It's essentially a black-box problem, as I plug in voltage values and read output metrics with no data transformation in between, so nothing is differentiable. I've tried implementing a GAN to generate training data, and have had some success, but it's very slow and random. I would be grateful to hear of any suggestions for state-of-the-art methods for a black-box calibration problem such as this. I should add that I can take about 5 measurements per second, so it's not a particularly costly process. submitted by /u/Matternous [link] [comments]  ( 58 min )
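    One commonly recommended baseline for low-dimensional black-box problems like this is CMA-ES; a minimal sketch with the cma package (the cost function and constants are placeholders for a real hardware measurement):
        import cma  # pip install cma

        def cost(v):
            # placeholder objective; replace with reading the hardware's
            # output metric for bias voltages v
            return sum(x * x for x in v)

        es = cma.CMAEvolutionStrategy([0.0] * 8, 0.5)  # 8 voltages, initial step size 0.5
        while not es.stop():
            candidates = es.ask()                      # sample a population of voltage vectors
            es.tell(candidates, [cost(v) for v in candidates])
        print(es.result.xbest)                         # best voltages found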
    [R] Looking for a dataset similar to: Breast Cancer Wisconsin (Diagnostic) Data Set
    I am looking for something similar to this dataset but with about 1000 participants. If anyone could help with some leads I would really appreciate it. Here is the dataset that I am referring to: https://www.kaggle.com/datasets/uciml/breast-cancer-wisconsin-data submitted by /u/challenging-luck [link] [comments]  ( 58 min )
    [P] Model to Reference Image components into a larger image
    Hey there! I hope you're doing great! I'm writing here to get a piece of advice on a project I want to start on image mapping. The idea behind my project is as follows. Assume I have a photo of some object of interest (e.g. a person) and I have another image containing some part of it (e.g. another photo of his hand); I would like to create a model that, given these two images (the full body and the hand), is able to find the most likely coordinates from which the second image comes. Essentially, I want to create a model that tells me where in the first image the object in the second image is located. I have no idea if this exists (it probably does) or what it is called. Do you have any idea or suggestion on how to do this or how to look for this in the literature? submitted by /u/saikjuan [link] [comments]  ( 126 min )
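    If the part appears at roughly the same scale and viewpoint, plain template matching is a reasonable first baseline (for scale or viewpoint changes, look up local feature matching, e.g. SIFT plus a homography); a sketch with placeholder filenames:
        import cv2

        scene = cv2.imread("full_body.jpg", cv2.IMREAD_GRAYSCALE)
        part = cv2.imread("hand.jpg", cv2.IMREAD_GRAYSCALE)
        res = cv2.matchTemplate(scene, part, cv2.TM_CCOEFF_NORMED)  # normalized cross-correlation
        _, score, _, top_left = cv2.minMaxLoc(res)
        print(top_left, score)  # most likely location of the part, with a confidence score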
  • Open

    NVIDIA Jetson AGX Orin Dev Kit
    I ordered an Orin dev kit a few months ago for a project and ended up never using it. I opened the box to make sure it came unharmed and that is it. Never plugged it in. I am working on selling it now and thought people in this community might be interested. I am in the bay area on the off chance anyone interested is here. submitted by /u/here_to_create [link] [comments]  ( 48 min )
    I trained a deep RL agent to play a spaceship shooter game with PPO.
    submitted by /u/BiologyNerd100 [link] [comments]  ( 45 min )
    Why does evaluation reward plateau much higher than training reward?
    Is there a dominant reason that explains the deviation in scores? submitted by /u/XecutionStyle [link] [comments]  ( 52 min )
    Creating Simulations for Reinforcement Learning
    submitted by /u/No_Virus3178 [link] [comments]  ( 49 min )
  • Open

    PhD Defense Slides and Lessons Learned
    In July this year I finally defended my PhD which mainly focused on (adversarial) robustness and uncertainty estimation in deep learning. In my case, the defense consisted of a (public) 30 minute talk about my work, followed by questions from the thesis committee and audience. In this article, I want to share the slides and some lessons learned in preparing for my defense. The post PhD Defense Slides and Lessons Learned appeared first on David Stutz.  ( 5 min )
  • Open

    Anyone completed the Elements of AI course?
    If you have, please message me. submitted by /u/Old_Selection5784 [link] [comments]  ( 42 min )
    We're using GPT3 to generate a live stream with AI actors and create a hilariously random and absurd show! Check out this recap of one of our recent streams 🤖 We will be going live again on Twitch today at 4 pm EST; check out @TransformsTV for more!!
    submitted by /u/transformsai [link] [comments]  ( 41 min )
    Searching for a specific AI learning resource, hoping the community can help!
    Hey all, I am in need of help from the Reddit hive mind! A while back (probably over a year ago) I found a resource put together by an AI researcher/academic, that outlined a self-study curriculum for learning the foundations of AI. It was a fairly barebones/old school web page, that outlined a sequential curriculum you could use on your own to get up to speed. If I remember correctly it was quite in depth, in that it was laid out to match a typical 4-year undergraduate degree. I'm super interested in finding it again, but unfortunately my Googling skills seem to be lacking. If anyone knows what I'm babbling about and has any ideas, that would be of massive help! Thanks in advance. submitted by /u/Vegetable-Land-4589 [link] [comments]  ( 43 min )
    I trained a deep reinforcement learning agent to play a space shooter game. Here it is battling itself. It was trained purely with self-play using Proximal Policy Optimization.
    submitted by /u/BiologyNerd100 [link] [comments]  ( 44 min )
    Building a HydraNet for Self-driving car simulation
    Ever wondered how Tesla's Autopilot is able to make so many predictions in real time? It's because instead of designing multiple neural networks for different tasks, they design neural networks with a common backbone doing multiple tasks. These neural networks are called HydraNets. Having learned about them, I revived my old project on a self-driving car and designed a HydraNet for predicting both the steering angle and throttle in a single pass. To know more, you can visit this blog link: https://medium.com/geekculture/building-a-hydranet-for-self-driving-car-simulation-cd08543feffe There is also a YouTube link in the blog which shows the system working in real time. submitted by /u/VikasOjha666 [link] [comments]  ( 50 min )
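    A minimal toy sketch of the shared-backbone idea described above (not the blog's code): one encoder, two heads, one forward pass:
        import torch
        import torch.nn as nn
        from torchvision import models

        class HydraNet(nn.Module):
            def __init__(self):
                super().__init__()
                backbone = models.resnet18(weights=None)
                backbone.fc = nn.Identity()        # shared features feed both heads
                self.backbone = backbone
                self.steering = nn.Linear(512, 1)  # head 1: steering angle
                self.throttle = nn.Linear(512, 1)  # head 2: throttle

            def forward(self, x):
                feats = self.backbone(x)
                return self.steering(feats), self.throttle(feats)

        angle, throttle = HydraNet()(torch.randn(1, 3, 224, 224))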
    A gentle introduction to generative AI
    submitted by /u/magosaurus [link] [comments]  ( 54 min )
    Is there a mega-thread with a list of cool Ai tools?
    So far I've found some great websites for:
    - AI-generated tweets based off someone's account: https://tweethunter.io/generate-tweets
    - AI-generated logos: https://brandmark.io/
    - AI image generation, of course: https://playgroundai.com/
    - written documents: lex.page
    Is there anything for AI-generated deepfake influencers? 😝 What are some other cool AI tools you guys have found? (besides DALL·E image generators) submitted by /u/skittleteeth [link] [comments]  ( 43 min )
    Mira Murati interview - DALL·E 2 and the Power of AI | The Daily Show
    submitted by /u/giantyetifeet [link] [comments]  ( 40 min )
    AI Dream 93 - QUANTUMANIA (Unofficial Fan TEASER)
    submitted by /u/LordPewPew777 [link] [comments]  ( 40 min )
    AI story with binaural sound – Simultaneous project
    submitted by /u/ktoktokto [link] [comments]  ( 40 min )
    MetaHuman Creator: High-Fidelity Digital Humans Made Easy
    submitted by /u/Marketing_introverts [link] [comments]  ( 50 min )
    How to combine multiple low quality data into one giant high quality data?
    What is this process called? I remember it being discussed on this sub a long time ago, but I can’t find anything on it. Let's say, for example, that I have multiple low-resolution cameras that gave me 10000 images of some area from different angles. How would I go about combining all those pictures to produce a single high-quality picture of this area? Any input is appreciated. submitted by /u/Ali00100 [link] [comments]  ( 43 min )
    DeepMind, AI and Spatial Memory: an interview with scientist Talfan Evans
    submitted by /u/Ok-Craft-9908 [link] [comments]  ( 41 min )
    Is Reinforcement Learning Still Relevant?
    While there are various practical applications of reinforcement learning, the concept as a whole poses some limitations when used in developing autonomous machine intelligence https://analyticsindiamag.com/is-reinforcement-learning-still-relevant/ submitted by /u/analyticsindiam [link] [comments]  ( 45 min )
    Any AI that will animate images like DeepNostalgia?
    Have been trying to find a github repo that offers similar functionality. For those that aren't familiar, it takes an image with a face and will nicely animate that photo (head movements, blinking). Unfortunately after 2 videos they want you to pay 200 dollars for a year of their service. While I wouldn't mind paying for this feature alone at a cheaper price, unfortunately most of the other features in the package are of no interest or value to me. I have tried to achieve this effect manually with Adobe After Effects, learning about displacement maps and making one in Photoshop, but I wasn't able to get great results, and it was kind of hard to do for many images :( Thank you so much! submitted by /u/DavidUpInHere [link] [comments]  ( 42 min )
    This sweater developed by the University of Maryland is an invisibility cloak against AI. It uses "adversarial patterns" to stop AI from recognizing the person wearing it.
    submitted by /u/Good_Show_9 [link] [comments]  ( 45 min )
    Google’s New AI: Kind of Like Tesla's Robot! 🤖
    submitted by /u/Black_RL [link] [comments]  ( 44 min )
    Newb questions about creating rap music with AI
    I am a rapper (not one of any importance) and I am interested in someday being able to make AI-generated songs out of my existing pool of lyrics/recordings, and I’m hoping that someone here might be able to point me in the right direction where to start. Admittedly, I have zero understanding or experience with AI or code or anything, but I want to start organizing the data that I do have, like lyrics, acapellas, and instrumentals. The way I envision it, I would run all of my lyrics through a word-counter of some kind to come up with my full vocabulary and track that on a doc or spreadsheet, and I would have folders with wav files of all of my acapellas, which I would re-record all on the same microphone to have a uniform sound. Would it be in my best interest to have separate wav files for each individual word if I could snip them from the acapellas? Or would that be a waste of my time? I really am just trying to get my foot in the door here so I would love to hear any suggestions that anyone has for reading material, video tutorials or existing tools that I should research to get started. submitted by /u/JackSquattJr [link] [comments]  ( 45 min )
  • Open

    Data Engineering for ML: Optimize for Cost Efficiency
    Sponsored Post     Over the past few years, a lot has changed in the world of stream processing systems. This is especially true as companies manage larger amounts of data than ever before.  In fact, roughly 2.5 quintillion bytes worth of data are generated every day. Manually processing the sheer amount of data that […] The post Data Engineering for ML: Optimize for Cost Efficiency appeared first on Machine Learning Mastery.  ( 10 min )
  • Open

    It’s NeRF or Nothin’!
  • Open

    Neural NETA: Automaker Selects NVIDIA DRIVE Orin for AI-Powered Vehicles
    One of China’s popular battery-electric startups now has the brains to boot. NETA Auto, a Zhejiang-based electric automaker, this week announced it will build its future electric vehicles on the NVIDIA DRIVE Orin platform. These EVs will be software defined, with automated driving and intelligent features that will be continuously upgraded via over-the-air updates. The post Neural NETA: Automaker Selects NVIDIA DRIVE Orin for AI-Powered Vehicles appeared first on NVIDIA Blog.  ( 4 min )
    Microsoft Experience Centers Display Scalable, Real-Time Graphics With NVIDIA RTX and Mosaic Technology
    When customers walk into a Microsoft Experience Center in New York City, Sydney or London, they’re instantly met with stunning graphics displayed on multiple screens and high-definition video walls inside a multi-story building. Built to showcase the latest technologies, Microsoft Experience Centers surround customers with vibrant, immersive graphics as they explore new products, watch technical… The post Microsoft Experience Centers Display Scalable, Real-Time Graphics With NVIDIA RTX and Mosaic Technology appeared first on NVIDIA Blog.  ( 5 min )
    Make Gaming a Priority: Special Membership Discount Hits GeForce NOW for Limited Time
    This spook-tacular Halloween edition of GFN Thursday features a special treat: 40% off a six-month GeForce NOW Priority Membership — get it for just $29.99 for a limited time. Several sweet new games are also joining the GeForce NOW library. Creatures of the night can now stream vampire survival game V Rising from the cloud. The post Make Gaming a Priority: Special Membership Discount Hits GeForce NOW for Limited Time appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Improve price performance of your model training using Amazon SageMaker heterogeneous clusters
    This post is co-written with Chaim Rand from Mobileye. Certain machine learning (ML) workloads, such as training computer vision models or reinforcement learning, often involve combining the GPU- or accelerator-intensive task of neural network model training with the CPU-intensive task of data preprocessing, like image augmentation. When both types of tasks run on the same […]  ( 12 min )
    Reduce food waste to improve sustainability and financial results in retail with Amazon Forecast
    With environmental, social, and governance (ESG) initiatives becoming more important for companies, our customer, one of Greater China region’s top convenience store chains, has been seeking a solution to reduce food waste (currently over $3.5 million USD per year). Doing so will allow them to not only realize substantial operating savings, but also support corporate […]  ( 8 min )
  • Open

    Research Focus: Week of October 24, 2022
    Welcome to Research Focus, a new series of blog posts that highlights notable publications, events, code/datasets, new hires and other milestones from across the research community at Microsoft. Meet the 2022 recipients of the Microsoft Research Global PhD Fellowship Microsoft is thrilled to announce the 2022 Microsoft Research Global PhD Fellows from around the world. […] The post Research Focus: Week of October 24, 2022 appeared first on Microsoft Research.  ( 9 min )
  • Open

    Can AI assist in predicting chaos theory?
    So, in chaos theory, the interaction between two Lorenz attractors causes chaos, right? And over time the ripples from that chaos become more and more divergent and difficult to predict. I know this is why it's impossible to accurately predict the weather more than 2 weeks out: you basically get into the butterfly effect, where the measurements needed to predict macro changes require an ever-increasing level of precision at the micro scale... So even though chaos, which is inherently unpredictable (or, you know, CHAOTIC), doesn't really seem like something a neural network should be able to learn a pattern from, has anyone tried making a latent space from chaos theory? What was the outcome, and was it at all successful? And if so, does it help us increase the accuracy of our deterministic predictions past what would otherwise be possible with the data we have? submitted by /u/LasciviousApemantus [link] [comments]  ( 43 min )
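    A toy experiment in the spirit of the question: fit a small network to the one-step map of the Lorenz system. Short-horizon prediction works; long rollouts diverge, which is the butterfly effect in action:
        import numpy as np
        from scipy.integrate import solve_ivp
        from sklearn.neural_network import MLPRegressor

        def lorenz(t, s, sigma=10.0, rho=28.0, beta=8 / 3):
            x, y, z = s
            return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

        sol = solve_ivp(lorenz, (0, 100), [1.0, 1.0, 1.0],
                        t_eval=np.arange(0, 100, 0.01))
        states = sol.y.T                                    # sampled trajectory
        model = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=500)
        model.fit(states[:-1], states[1:])                  # learn s_t -> s_{t+1}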
  • Open

    A more convenient squircle equation
    A few years ago I wrote several posts about “squircles”, starting with this post. These are shapes satisfying |x|^p + |y|^p = 1, where typically p = 4 or 5. The advantage of a squircle over a square with rounded edges is that the curvature varies continuously around the figure rather than jumping from a constant positive value to zero. […] A more convenient squircle equation first appeared on John D. Cook.  ( 6 min )
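    Assuming the standard unit squircle above, the curve can be traced with signed fractional powers of cosine and sine; a quick plotting sketch:
        import numpy as np
        import matplotlib.pyplot as plt

        p = 4
        t = np.linspace(0, 2 * np.pi, 400)
        # |x|^p = |cos t|^2 and |y|^p = |sin t|^2, so |x|^p + |y|^p = 1
        x = np.sign(np.cos(t)) * np.abs(np.cos(t)) ** (2 / p)
        y = np.sign(np.sin(t)) * np.abs(np.sin(t)) ** (2 / p)
        plt.plot(x, y); plt.axis("equal"); plt.show()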
  • Open

    3 Questions: How AI image generators could help robots
    Yilun Du, a PhD student and MIT CSAIL affiliate, discusses the potential applications of generative art beyond the explosion of images that put the web into creative hysterics.  ( 9 min )
  • Open

    Learning to predict arbitrary quantum processes. (arXiv:2210.14894v1 [quant-ph])
    We present an efficient machine learning (ML) algorithm for predicting any unknown quantum process $\mathcal{E}$ over $n$ qubits. For a wide range of distributions $\mathcal{D}$ on arbitrary $n$-qubit states, we show that this ML algorithm can learn to predict any local property of the output from the unknown process $\mathcal{E}$, with a small average error over input states drawn from $\mathcal{D}$. The ML algorithm is computationally efficient even when the unknown process is a quantum circuit with exponentially many gates. Our algorithm combines efficient procedures for learning properties of an unknown state and for learning a low-degree approximation to an unknown observable. The analysis hinges on proving new norm inequalities, including a quantum analogue of the classical Bohnenblust-Hille inequality, which we derive by giving an improved algorithm for optimizing local Hamiltonians. Overall, our results highlight the potential for ML models to predict the output of complex quantum dynamics much faster than the time needed to run the process itself.  ( 2 min )
    Monotonic segmental attention for automatic speech recognition. (arXiv:2210.14742v1 [cs.CL])
    We introduce a novel segmental-attention model for automatic speech recognition. We restrict the decoder attention to segments to avoid quadratic runtime of global attention, better generalize to long sequences, and eventually enable streaming. We directly compare global-attention and different segmental-attention modeling variants. We develop and compare two separate time-synchronous decoders, one specifically taking the segmental nature into account, yielding further improvements. Using time-synchronous decoding for segmental models is novel and a step towards streaming applications. Our experiments show the importance of a length model to predict the segment boundaries. The final best segmental-attention model using segmental decoding performs better than global-attention, in contrast to other monotonic attention approaches in the literature. Further, we observe that the segmental model generalizes much better to long sequences of up to several minutes.  ( 2 min )
    Federated Learning with Nesterov Accelerated Gradient. (arXiv:2009.08716v2 [cs.LG] UPDATED)
    Federated learning (FL) is a fast-developing technique that allows multiple workers to train a global model based on a distributed dataset. Conventional FL (FedAvg) employs the gradient descent algorithm, which may not be efficient enough. Momentum is able to improve the situation by adding an additional momentum step to accelerate the convergence and has demonstrated its benefits in both centralized and FL environments. It is well known that Nesterov Accelerated Gradient (NAG) is a more advantageous form of momentum, but it is not clear how to quantify the benefits of NAG in FL so far. This motivates us to propose FedNAG, which employs NAG in each worker as well as NAG momentum and model aggregation in the aggregator. We provide a detailed convergence analysis of FedNAG and compare it with FedAvg. Extensive experiments based on real-world datasets and trace-driven simulation are conducted, demonstrating that FedNAG increases the learning accuracy by 3-24% and decreases the total training time by 11-70% compared with the benchmarks under a wide range of settings.  ( 2 min )
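    A simplified sketch of the idea as read from the abstract (only model weights are averaged here; the paper also aggregates momentum): each worker runs Nesterov-momentum SGD locally, then the server averages:
        import copy
        import torch
        import torch.nn.functional as F

        def local_update(model, data, lr=0.01, steps=10):
            opt = torch.optim.SGD(model.parameters(), lr=lr,
                                  momentum=0.9, nesterov=True)  # NAG on each worker
            x, y = data
            for _ in range(steps):
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
            return model

        def fed_round(global_model, worker_data):
            workers = [local_update(copy.deepcopy(global_model), d) for d in worker_data]
            with torch.no_grad():
                for p_g, *p_ws in zip(global_model.parameters(),
                                      *[w.parameters() for w in workers]):
                    p_g.copy_(torch.stack(p_ws).mean(0))  # FedAvg-style aggregation
            return global_model

        model = torch.nn.Linear(10, 2)
        data = [(torch.randn(32, 10), torch.randint(0, 2, (32,))) for _ in range(4)]
        for _ in range(5):
            model = fed_round(model, data)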
    Maximum Likelihood Learning of Energy-Based Models for Simulation-Based Inference. (arXiv:2210.14756v1 [cs.LG])
    We introduce two synthetic likelihood methods for Simulation-Based Inference (SBI), to conduct either amortized or targeted inference from experimental observations when a high-fidelity simulator is available. Both methods learn a conditional energy-based model (EBM) of the likelihood using synthetic data generated by the simulator, conditioned on parameters drawn from a proposal distribution. The learned likelihood can then be combined with any prior to obtain a posterior estimate, from which samples can be drawn using MCMC. Our methods uniquely combine a flexible Energy-Based Model and the minimization of a KL loss: this is in contrast to other synthetic likelihood methods, which either rely on normalizing flows, or minimize score-based objectives; choices that come with known pitfalls. Our first method, Amortized Unnormalized Neural Likelihood Estimation (AUNLE), introduces a tilting trick during training that allows to significantly lower the computational cost of inference by enabling the use of efficient MCMC techniques. Our second method, Sequential UNLE (SUNLE), employs a robust doubly intractable approach in order to re-use simulation data and improve posterior accuracy on a specific dataset. We demonstrate the properties of both methods on a range of synthetic datasets, and apply them to a neuroscience model of the pyloric network in the crab Cancer Borealis, matching the performance of other synthetic likelihood methods at a fraction of the simulation budget.  ( 2 min )
    Submodularity and pairwise independence. (arXiv:2209.08563v1 [math.OC] CROSS LISTED)
    In this paper, we provide a characterization of the expected value of submodular set functions with pairwise independent random input. The set of pairwise independent (uncorrelated) probability distributions contains the mutually independent distribution and is contained within the set of arbitrarily dependent (correlated) distributions. We study the ratio of the maximum expected value of a function with arbitrary dependence among the random input with given marginal probabilities to the maximum expected value of the function with pairwise independent random input with the same marginal probabilities. The definition of the ratio is inspired from the correlation gap ratio of Agrawal et al. (2012) and Calinescu et al. (2007). Our results show that for any monotone submodular set function defined on n variables, the ratio is bounded from above by 4/3 in the following cases: (a) for small n (specifically n = 2, 3) with general marginal probabilities, and (b) for general n with small marginal probabilities. The bound is tight in cases (a) and (b). This contrasts with the e/(e-1) bound on the correlation gap ratio for monotone submodular set functions with mutually independent random input which is known to be tight in case (b). Our results illustrate a fundamental difference in the behavior of submodular functions with weaker notions of independence. We discuss an application in distributionally robust optimization and end the paper with a conjecture.  ( 2 min )
    A case for disaggregation of ML data processing. (arXiv:2210.14826v1 [cs.LG])
    Machine Learning (ML) computation requires feeding input data for the models to ingest. Traditionally, input data processing happens on the same host as the ML computation. The input data processing can however become a bottleneck of the ML computation if there are insufficient resources to process data quickly enough. This slows down the ML computation and wastes valuable and scarce ML hardware (e.g. GPUs and TPUs) used by the ML computation. In this paper, we present tf.data service, a disaggregated input data processing service built on top of tf.data. Our work goes beyond describing the design and implementation of a new system which disaggregates preprocessing from ML computation and presents: (1) empirical evidence based on production workloads for the need of disaggregation, as well as quantitative evaluation of the impact disaggregation has on the performance and cost of production workloads, (2) benefits of disaggregation beyond horizontal scaling, (3) analysis of tf.data service's adoption at Google, the lessons learned during building and deploying the system and potential future lines of research opened up by our work. We demonstrate that horizontally scaling data processing using tf.data service helps remove input bottlenecks, achieving speedups of up to 110x and job cost reductions of up to 89x. We further show that tf.data service can support computation reuse through data sharing across ML jobs with identical data processing pipelines (e.g. hyperparameter tuning jobs), incurring no performance penalty and reducing overall resource cost. Finally, we show that tf.data service advanced features can benefit performance of non-input bound jobs; in particular, coordinated data reads through tf.data service can yield up to 2x speedups and job cost savings for NLP jobs.  ( 3 min )
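    A sketch of what offloading looks like from the client side (the paths, parse function, and dispatcher address are placeholders; a tf.data service dispatcher and workers must already be running):
        import tensorflow as tf

        def parse(record):
            return tf.io.parse_tensor(record, tf.float32)  # stand-in for real decoding/augmentation

        ds = tf.data.Dataset.list_files("/data/train/*.tfrecord")
        ds = ds.interleave(tf.data.TFRecordDataset, num_parallel_calls=tf.data.AUTOTUNE)
        ds = ds.map(parse, num_parallel_calls=tf.data.AUTOTUNE).batch(256)
        ds = ds.apply(tf.data.experimental.service.distribute(
            processing_mode="parallel_epochs",          # each consumer sees every element
            service="grpc://dispatcher-address:5000"))  # placeholder address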
    Optimal detection of the feature matching map in presence of noise and outliers. (arXiv:2106.07044v2 [math.ST] UPDATED)
    We consider the problem of finding the matching map between two sets of $d$ dimensional vectors from noisy observations, where the second set contains outliers. The matching map is then an injection, which can be consistently estimated only if the vectors of the second set are well separated. The main result shows that, in the high-dimensional setting, a detection region of unknown injection can be characterized by the sets of vectors for which the inlier-inlier distance is of order at least $d^{1/4}$ and the inlier-outlier distance is of order at least $d^{1/2}$. These rates are achieved using the estimated matching minimizing the sum of logarithms of distances between matched pairs of points. We also prove lower bounds establishing optimality of these rates. Finally, we report results of numerical experiments on both synthetic and real world data that illustrate our theoretical results and provide further insight into the properties of the estimators studied in this work.  ( 2 min )
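    A small numerical sketch of the estimator described above: build a log-distance cost matrix and solve the assignment problem, which yields the injection minimizing the sum of logarithms of matched distances:
        import numpy as np
        from scipy.optimize import linear_sum_assignment
        from scipy.spatial.distance import cdist

        rng = np.random.default_rng(0)
        X = rng.standard_normal((50, 100))                    # first set
        Y = np.vstack([X + 0.1 * rng.standard_normal((50, 100)),
                       rng.standard_normal((20, 100))])       # noisy matches + outliers
        cost = np.log(cdist(X, Y) + 1e-12)                    # sum of logs = log of product of distances
        rows, cols = linear_sum_assignment(cost)              # estimated matching map
        print((cols == np.arange(50)).mean())                 # fraction of correct matches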
    Nash Equilibria and Pitfalls of Adversarial Training in Adversarial Robustness Games. (arXiv:2210.12606v2 [cs.LG] UPDATED)
    Adversarial training is a standard technique for training adversarially robust models. In this paper, we study adversarial training as an alternating best-response strategy in a 2-player zero-sum game. We prove that even in a simple scenario of a linear classifier and a statistical model that abstracts robust vs. non-robust features, the alternating best response strategy of such game may not converge. On the other hand, a unique pure Nash equilibrium of the game exists and is provably robust. We support our theoretical results with experiments, showing the non-convergence of adversarial training and the robustness of Nash equilibrium.  ( 2 min )
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v3 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered one-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to three estimation problems: sparse covariance matrix estimation, sparse linear regression, and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, with the underlying distribution of heavy-tailed data assumed to possess bounded second or fourth moment. For each model we propose new estimators based on one-bit quantized data. In sub-Gaussian regime, our estimators achieve optimal minimax rates up to logarithmic factors, which indicates that our quantization scheme nearly introduces no additional cost. In heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in such one-bit quantized and heavy-tailed setting, or exhibit some advantages over existing comparable results. More specifically, our results for one-bit compressed sensing feature generality of sensing vector (sub-Gaussian or even heavy-tailed) and tractable convex programming. A novel setting where both measurement and covariate are quantized is also first proposed and studied. For one-bit matrix completion, our method is essentially different from the standard likelihood approach and can handle pre-quantization random noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.  ( 3 min )
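    A numpy sketch of the quantization scheme itself (truncate, add uniform dither, keep the sign; the constants are illustrative, not the paper's). The last line shows why dithering preserves information: E[sign(x + u)] = x/delta for |x| <= delta, so rescaling the mean of the one-bit outputs recovers x:
        import numpy as np

        def one_bit_quantize(x, trunc=3.0, delta=3.5, rng=np.random.default_rng(0)):
            x = np.clip(x, -trunc, trunc)                 # truncation step
            dither = rng.uniform(-delta, delta, x.shape)  # uniform dither
            return np.sign(x + dither)                    # one-bit measurements

        x = np.full(100_000, 1.2)
        print(one_bit_quantize(x).mean() * 3.5)           # ~1.2: multiply by delta to debias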
    Broken Neural Scaling Laws. (arXiv:2210.14891v1 [cs.LG])
    We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, diffusion generative modeling of images, arithmetic, and reinforcement learning. When compared to other functional forms for neural scaling behavior, this functional form yields extrapolations of scaling behavior that often are considerably more accurate (root mean squared log error of its extrapolations are 0.86 times that of previous state-of-the-art on average) on this set. Moreover, this functional form accurately models and extrapolates scaling behavior that other functional forms are incapable of expressing such as the non-monotonic transitions present in the scaling behavior of phenomena such as double descent and the delayed, sharp inflection points present in the scaling behavior of tasks such as arithmetic. Code is available at https://github.com/ethancaballero/broken_neural_scaling_laws
    data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language. (arXiv:2202.03555v3 [cs.LG] UPDATED)
    While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture. Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec predicts contextualized latent representations that contain information from the entire input. Experiments on the major benchmarks of speech recognition, image classification, and natural language understanding demonstrate a new state of the art or competitive performance to predominant approaches.
    Exact Manifold Gaussian Variational Bayes. (arXiv:2210.14598v1 [stat.ML])
    We propose an optimization algorithm for Variational Inference (VI) in complex models. Our approach relies on natural gradient updates where the variational space is a Riemann manifold. We develop an efficient algorithm for Gaussian Variational Inference that implicitly satisfies the positive definite constraint on the variational covariance matrix. Our Exact manifold Gaussian Variational Bayes (EMGVB) provides exact but simple update rules and is straightforward to implement. Due to its black-box nature, EMGVB stands as a ready-to-use solution for VI in complex models. Over five datasets, we empirically validate our feasible approach on different statistical, econometric, and deep learning models, discussing its performance with respect to baseline methods.
    NeuralSearchX: Serving a Multi-billion-parameter Reranker for Multilingual Metasearch at a Low Cost. (arXiv:2210.14837v1 [cs.IR])
    The widespread availability of search APIs (both free and commercial) brings the promise of increased coverage and quality of search results for metasearch engines, while decreasing the maintenance costs of the crawling and indexing infrastructures. However, merging strategies frequently comprise complex pipelines that require careful tuning, which is often overlooked in the literature. In this work, we describe NeuralSearchX, a metasearch engine based on a multi-purpose large reranking model to merge results and highlight sentences. Due to the homogeneity of our architecture, we could focus our optimization efforts on a single component. We compare our system with Microsoft's Biomedical Search and show that our design choices led to a much more cost-effective system with competitive QPS while having close to state-of-the-art results on a wide range of public benchmarks. Human evaluation on two domain-specific tasks shows that our retrieval system outperformed Google API by a large margin in terms of nDCG@10 scores. By describing our architecture and implementation in detail, we hope that the community will build on our design choices. The system is available at https://neuralsearchx.nsx.ai.
    Convergence of Finite Memory Q-Learning for POMDPs and Near Optimality of Learned Policies under Filter Stability. (arXiv:2103.12158v4 [cs.LG] UPDATED)
    In this paper, we establish the convergence of a Q-learning algorithm for POMDPs whose control policies use a finite history of past observations and control actions, and, consequently, we establish near optimality of the resulting limit Q functions under explicit filter stability conditions. We present explicit error bounds relating the approximation error to the length of the finite history window. We establish the convergence of such Q-learning iterates under mild ergodicity assumptions on the state process during the exploration phase. We further show that the limit fixed-point equation gives an optimal solution for an approximate belief-MDP. We then provide bounds on the performance of the policy obtained using the limit Q values compared to the performance of the optimal policy for the POMDP, where we also present explicit conditions using recent results on filter stability in controlled POMDPs. While there exist many experimental results, (i) the rigorous asymptotic convergence (to an approximate MDP value function) of such finite-memory Q-learning algorithms, and (ii) the near optimality with an explicit rate of convergence (in the memory size) are, to our knowledge, results new to the literature.
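    A tabular sketch of the finite-memory idea: treat the last few (observation, action) pairs as the state and run standard Q-learning on that window. The `env` interface (reset/step) used here is an assumption of the illustration, not part of the paper.

```python
import random
from collections import defaultdict, deque

def finite_memory_q_learning(env, actions, window=3, episodes=500,
                             alpha=0.1, gamma=0.95, eps=0.1):
    """Q-learning for a POMDP using the last `window` (observation, action)
    pairs as an approximate state. Assumes env.reset() -> obs and
    env.step(a) -> (obs, reward, done), with hashable observations."""
    Q = defaultdict(float)
    for _ in range(episodes):
        obs, done = env.reset(), False
        hist = deque(maxlen=window)            # finite history window
        hist.append((obs, None))
        while not done:
            s = tuple(hist)
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda a_: Q[(s, a_)]))
            obs, r, done = env.step(a)
            hist.append((obs, a))
            s_next = tuple(hist)
            best_next = 0.0 if done else max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```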
    Multi-level Data Representation For Training Deep Helmholtz Machines. (arXiv:2210.14855v1 [cs.LG])
    The vast majority of current research in machine learning is done using algorithms, such as backpropagation, with strong arguments pointing to their biological implausibility, deviating the field's focus from understanding its original organic inspiration toward a compulsive search for optimal performance. Yet, a few proposed models respect most of the biological constraints present in the human brain and are valid candidates for mimicking some of its properties and mechanisms. In this paper, we focus on guiding the learning of a biologically plausible generative model, the Helmholtz Machine, through complex search spaces using a heuristic based on the human image-perception mechanism. We hypothesize that this model's learning algorithm is ill-suited to deep networks due to its Hebbian-like local update rule, rendering it incapable of taking full advantage of the compositional properties that multi-layer networks provide. We propose to overcome this problem by providing the network's hidden layers with visual cues at different resolutions using a multi-level data representation. The results on several image datasets show that the model obtained not only better overall quality but also wider diversity in the generated images, corroborating our intuition that the proposed heuristic allows the model to take more advantage of the network's depth. More importantly, they show the unexplored possibilities underlying brain-inspired models and techniques.
    Balancing Value Underestimation and Overestimation with Realistic Actor-Critic. (arXiv:2110.09712v6 [cs.LG] UPDATED)
    Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world domains. This paper introduces a novel model-free algorithm, Realistic Actor-Critic (RAC), which can be incorporated into any off-policy RL algorithm to improve sample efficiency. RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn a policy family with the same neural network, each member with a different trade-off between underestimation and overestimation. To learn such policies, we introduce uncertainty-punished Q-learning, which uses the uncertainty from an ensemble of critics to build various confidence bounds of the Q-function. We evaluate RAC on the MuJoCo benchmark, achieving 10x sample efficiency and a 25% performance improvement on the most challenging Humanoid environment compared to SAC.
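    The uncertainty-punished target can be sketched in a few lines: an ensemble of critics provides a mean and a standard deviation, and a coefficient beta interpolates between pessimistic and optimistic value estimates (an illustrative sketch, not the authors' code).

```python
import torch

def uncertainty_punished_target(critic_ensemble, next_state, next_action,
                                reward, gamma, beta):
    """Bellman target built from an ensemble of critics.

    beta > 0 gives a pessimistic (lower-confidence-bound) target and
    beta < 0 an optimistic one; RAC trains a family of policies spanning
    such trade-offs. Each critic maps (state, action) -> (batch, 1) values."""
    with torch.no_grad():
        qs = torch.stack([q(next_state, next_action)
                          for q in critic_ensemble])   # (n_critics, batch, 1)
        mean, std = qs.mean(dim=0), qs.std(dim=0)      # std = epistemic uncertainty
        return reward + gamma * (mean - beta * std)
```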
    Hierarchical Message-Passing Graph Neural Networks. (arXiv:2009.03717v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become a prominent approach to machine learning with graphs and have been increasingly applied in a multitude of domains. Nevertheless, since most existing GNN models are based on flat message-passing mechanisms, two limitations need to be tackled: (i) they are costly in encoding long-range information spanning the graph structure; (ii) they fail to encode features in the high-order neighbourhood of the graphs, as they only perform information aggregation across the observed edges in the original graph. To deal with these two issues, we propose a novel Hierarchical Message-passing Graph Neural Networks framework. The key idea is to generate a hierarchical structure that re-organises all nodes in a flat graph into multi-level super graphs, along with innovative intra- and inter-level propagation manners. The derived hierarchy creates shortcuts connecting far-away nodes, so that informative long-range interactions can be efficiently accessed via message passing, and incorporates meso- and macro-level semantics into the learned node representations. We present the first model to implement this framework, termed Hierarchical Community-aware Graph Neural Network (HC-GNN), with the assistance of a hierarchical community detection algorithm. Theoretical analysis illustrates HC-GNN's remarkable capacity to capture long-range information without introducing heavy additional computational complexity. Empirical experiments conducted on 9 datasets under transductive, inductive, and few-shot settings show that HC-GNN can outperform state-of-the-art GNN models on network analysis tasks, including node classification, link prediction, and community detection. Moreover, model analysis further demonstrates HC-GNN's robustness to graph sparsity and its flexibility in incorporating different GNN encoders.
    BioNLI: Generating a Biomedical NLI Dataset Using Lexico-semantic Constraints for Adversarial Examples. (arXiv:2210.14814v1 [cs.CL])
    Natural language inference (NLI) is critical for complex decision-making in the biomedical domain. One key question, for example, is whether a given biomedical mechanism is supported by experimental evidence. This can be seen as an NLI problem, but there are no directly usable datasets to address it. The main challenge is that manually creating informative negative examples for this task is difficult and expensive. We introduce a novel semi-supervised procedure that bootstraps an NLI dataset from an existing biomedical dataset that pairs mechanisms with experimental evidence in abstracts. We generate a range of negative examples using nine strategies that manipulate the structure of the underlying mechanisms, both with rules, e.g., flipping the roles of the entities in the interaction, and, more importantly, as perturbations via logical constraints in a neuro-logical decoding system. We use this procedure to create a novel dataset for NLI in the biomedical domain, called BioNLI, and benchmark two state-of-the-art biomedical classifiers. The best result we obtain is in the mid-70s in F1, suggesting the difficulty of the task. Critically, the performance on the different classes of negative examples varies widely, from 97% F1 on the simple role-change negative examples to barely better than chance on the negative examples generated using neuro-logic decoding.
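    The simplest of the nine strategies, the role flip, can be sketched directly; the triple representation and field names here are illustrative, not the paper's data format.

```python
def flip_entity_roles(mechanism):
    """Rule-based negative example generation: swap the regulator and the
    regulated entity in a mechanism triple, so the stated mechanism no
    longer matches the supporting evidence."""
    regulator, relation, regulated = mechanism
    return (regulated, relation, regulator)

# ("TNF-alpha", "activates", "NF-kB") -> ("NF-kB", "activates", "TNF-alpha")
```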
    TuneUp: A Training Strategy for Improving Generalization of Graph Neural Networks. (arXiv:2210.14843v1 [stat.ML])
    Despite many advances in Graph Neural Networks (GNNs), their training strategies simply focus on minimizing a loss over the nodes in a graph. However, such simplistic training strategies may be sub-optimal, as they neglect that certain nodes are much harder to make accurate predictions on than others. Here we present TuneUp, a curriculum learning strategy for better training GNNs. Crucially, TuneUp trains a GNN in two stages. The first stage aims to produce a strong base GNN. Such base GNNs tend to perform well on head nodes (nodes with large degrees) but less so on tail nodes (nodes with small degrees). So, the second stage of TuneUp specifically focuses on improving prediction on tail nodes. Concretely, TuneUp synthesizes additional supervised data for tail nodes by dropping edges from head nodes and reusing the supervision on the original head nodes, then minimizes the loss over the synthetic tail nodes to finetune the base GNN. TuneUp is a general training strategy that can be used with any GNN architecture and any loss, making it applicable to a wide range of prediction tasks. Extensive evaluation of TuneUp on two GNN architectures, three types of prediction tasks, and both inductive and transductive settings shows that TuneUp significantly improves the performance of the base GNN on tail nodes, while often even improving the performance on head nodes, together leading to up to a 58.5% relative improvement in GNN predictive performance. Moreover, TuneUp significantly outperforms its variants without the two-stage curriculum, existing graph data augmentation techniques, and other specialized methods for tail nodes.
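    The edge-dropping step of the second stage can be sketched as follows; the degree threshold and drop probability are illustrative hyperparameters, not the paper's settings.

```python
import torch

def synthesize_tail_nodes(edge_index, num_nodes, keep_prob=0.5, degree_thresh=10):
    """Randomly drop edges incident to high-degree (head) nodes so they
    resemble tail nodes, while their original labels are reused as
    supervision (an illustrative sketch of the idea, not the authors' code).

    edge_index : LongTensor of shape (2, num_edges)
    """
    deg = torch.bincount(edge_index.reshape(-1), minlength=num_nodes)
    src, dst = edge_index
    is_head_edge = (deg[src] > degree_thresh) | (deg[dst] > degree_thresh)
    keep = (~is_head_edge) | (torch.rand(edge_index.size(1)) < keep_prob)
    return edge_index[:, keep]   # sparsified graph for the fine-tuning stage
```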
    Categorical SDEs with Simplex Diffusion. (arXiv:2210.14784v1 [cs.LG])
    Diffusion models typically operate in the standard framework of generative modelling by producing continuously-valued datapoints. To this end, they rely on a progressive Gaussian smoothing of the original data distribution, which admits an SDE interpretation involving increments of a standard Brownian motion. However, some applications such as text generation or reinforcement learning might naturally be better served by diffusing categorical-valued data, i.e., lifting the diffusion to a space of probability distributions. To this end, this short theoretical note proposes Simplex Diffusion, a means to directly diffuse datapoints located on an n-dimensional probability simplex. We show how this relates to the Dirichlet distribution on the simplex and how the analogous SDE is realized thanks to a multi-dimensional Cox-Ingersoll-Ross process (abbreviated as CIR), previously used in economics and mathematical finance. Finally, we make remarks as to the numerical implementation of trajectories of the CIR process, and discuss some limitations of our approach.
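    For intuition on the numerics discussed in the closing remarks, a one-dimensional CIR path can be simulated with a full-truncation Euler-Maruyama scheme (a generic sketch; the paper's diffusion is the multi-dimensional analogue on the simplex).

```python
import numpy as np

def simulate_cir(x0, a, b, sigma, T=1.0, n_steps=1000, rng=None):
    """Euler-Maruyama discretization of the Cox-Ingersoll-Ross process
    dX_t = a (b - X_t) dt + sigma sqrt(X_t) dW_t."""
    rng = rng or np.random.default_rng()
    dt = T / n_steps
    x = np.empty(n_steps + 1)
    x[0] = x0
    for i in range(n_steps):
        dw = rng.normal(scale=np.sqrt(dt))
        drift = a * (b - x[i]) * dt
        x[i + 1] = x[i] + drift + sigma * np.sqrt(max(x[i], 0.0)) * dw
        x[i + 1] = max(x[i + 1], 0.0)   # full truncation keeps the path nonnegative
    return x
```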
    PyCIL: A Python Toolbox for Class-Incremental Learning. (arXiv:2112.12533v2 [cs.LG] UPDATED)
    Traditional machine learning systems are deployed under the closed-world setting, which requires the entire training data to be available before the offline training process. However, real-world applications often face incoming new classes, and a model should incorporate them continually. This learning paradigm is called Class-Incremental Learning (CIL). We propose a Python toolbox that implements several key algorithms for class-incremental learning to ease the burden of researchers in the machine learning community. The toolbox contains implementations of a number of founding works of CIL, such as EWC and iCaRL, but also provides current state-of-the-art algorithms that can be used for conducting novel fundamental research. This toolbox, named PyCIL for Python Class-Incremental Learning, is available at https://github.com/G-U-N/PyCIL
    Distributionally Robust Batch Contextual Bandits. (arXiv:2006.05630v6 [cs.LG] UPDATED)
    Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting which offers, prices, or advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, the existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that generated the data -- an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy from incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under worst-case environment shift. We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that can learn a policy robust to adversarial perturbations and unknown covariate shifts, with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm on synthetic datasets and demonstrate that it provides the robustness that is missing from standard policy learning algorithms. We conclude the paper with a comprehensive application of our methods in the context of a real-world voting dataset.
    Enhanced Bilevel Optimization via Bregman Distance. (arXiv:2107.12301v3 [math.OC] UPDATED)
    Bilevel optimization has recently been used in many machine learning problems such as hyperparameter optimization, policy optimization, and meta-learning. Although many bilevel optimization methods have been proposed, they still suffer from high computational complexity and do not consider more general bilevel problems with nonsmooth regularization. In this paper, we therefore propose a class of enhanced bilevel optimization methods that use the Bregman distance to solve bilevel problems where the outer subproblem is nonconvex and possibly nonsmooth and the inner subproblem is strongly convex. Specifically, we propose a bilevel optimization method based on the Bregman distance (BiO-BreD) to solve deterministic bilevel problems, which achieves a lower computational complexity than the best known results. Meanwhile, we also propose a stochastic bilevel optimization method (SBiO-BreD) to solve stochastic bilevel problems based on stochastically approximated gradients and the Bregman distance. Moreover, we further propose an accelerated version of the SBiO-BreD method (ASBiO-BreD) using a variance-reduction technique, which achieves a lower computational complexity than the best known complexities with respect to the condition number $\kappa$ and target accuracy $\epsilon$ for finding an $\epsilon$-stationary point. We conduct a data hyper-cleaning task and a hyper-representation learning task to demonstrate that our new algorithms outperform related bilevel optimization approaches.
    Coresets for Vertical Federated Learning: Regularized Linear Regression and $K$-Means Clustering. (arXiv:2210.14664v1 [cs.LG])
    Vertical federated learning (VFL), where data features are stored distributively across multiple parties, is an important area in machine learning. However, the communication complexity of VFL is typically very high. In this paper, we propose a unified framework that constructs coresets in a distributed fashion for communication-efficient VFL. We study two important learning tasks in the VFL setting: regularized linear regression and $k$-means clustering, and apply our coreset framework to both problems. We theoretically show that using coresets can drastically alleviate the communication complexity while nearly maintaining the solution quality. Numerical experiments are conducted to corroborate our theoretical findings.
    Does Corpus Quality Really Matter for Low-Resource Languages? (arXiv:2203.08111v2 [cs.CL] UPDATED)
    The vast majority of non-English corpora are derived from automatically filtered versions of CommonCrawl. While prior work has identified major issues on the quality of these datasets (Kreutzer et al., 2021), it is not clear how this impacts downstream performance. Taking representation learning in Basque as a case study, we explore tailored crawling (manually identifying and scraping websites with high-quality content) as an alternative to filtering CommonCrawl. Our new corpus, called EusCrawl, is similar in size to the Basque portion of popular multilingual corpora like CC100 and mC4, yet it has a much higher quality according to native annotators. For instance, 66% of documents are rated as high-quality for EusCrawl, in contrast with <33% for both mC4 and CC100. Nevertheless, we obtain similar results on downstream NLU tasks regardless of the corpus used for pre-training. Our work suggests that NLU performance in low-resource languages is not primarily constrained by the quality of the data, and other factors like corpus size and domain coverage can play a more important role.
    Image Super-Resolution With Deep Variational Autoencoders. (arXiv:2203.09445v2 [cs.CV] UPDATED)
    Image super-resolution (SR) techniques are used to generate a high-resolution image from a low-resolution image. Until now, deep generative models such as autoregressive models and Generative Adversarial Networks (GANs) have proven to be effective at modelling high-resolution images. VAE-based models have often been criticised for their feeble generative performance, but with new advancements such as VDVAE, there is now strong evidence that deep VAEs have the potential to outperform current state-of-the-art models for high-resolution image generation. In this paper, we introduce VDVAE-SR, a new model that aims to exploit the most recent deep VAE methodologies to improve upon the results of similar models. VDVAE-SR tackles image super-resolution using transfer learning on pretrained VDVAEs. The presented model is competitive with other state-of-the-art models, having comparable results on image quality metrics.
    Distribution-Free Finite-Sample Guarantees and Split Conformal Prediction. (arXiv:2210.14735v1 [stat.ML])
    Modern black-box predictive models are often accompanied by weak performance guarantees that only hold asymptotically in the size of the dataset or require strong parametric assumptions. In response to this, split conformal prediction represents a promising avenue to obtain finite-sample guarantees under minimal distribution-free assumptions. Although prediction set validity most often concerns marginal coverage, we explore the related but different guarantee of tolerance regions, reformulating known results in the language of nested prediction sets and expanding on the duality between marginal coverage and tolerance regions. Furthermore, we highlight the connection between split conformal prediction and classical tolerance predictors developed in the 1940s, as well as recent developments in distribution-free risk control. One result that transfers from classical tolerance predictors is that the coverage of a prediction set based on order statistics, conditional on the calibration set, is a random variable stochastically dominating the Beta distribution. We demonstrate the empirical effectiveness of our findings on synthetic and real datasets using a popular split conformal prediction procedure called conformalized quantile regression (CQR).
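    For reference, the basic split conformal procedure with absolute-residual scores fits in a few lines; `model` being any fitted regressor with a `.predict` method is an assumption of this sketch, and CQR replaces the symmetric residual scores with quantile-based ones.

```python
import numpy as np

def split_conformal_interval(model, X_cal, y_cal, X_test, alpha=0.1):
    """Split conformal prediction intervals with absolute-residual scores."""
    scores = np.abs(y_cal - model.predict(X_cal))      # calibration scores
    n = len(scores)
    # Finite-sample-valid quantile level ceil((n + 1)(1 - alpha)) / n.
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(scores, level, method="higher")
    preds = model.predict(X_test)
    return preds - q, preds + q   # marginal (1 - alpha) coverage intervals
```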
    Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems -- Survey. (arXiv:2208.13592v2 [math.OC] UPDATED)
    This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.
    Graph Filter Transfer via Probability Density Ratio Weighting. (arXiv:2210.14633v1 [eess.SP])
    The problem of recovering graph signals is one of the main topics in graph signal processing. A representative approach to this problem is the graph Wiener filter, which utilizes statistical information about the target signal, computed from historical data, to construct an effective estimator. However, we often encounter situations where the current graph differs from that of the historical data due to topology changes, leading to performance degradation of the estimator. This paper proposes a graph filter transfer method that learns an effective estimator from historical data under topology changes. The proposed method leverages the probability density ratio of the current and historical observations and constructs an estimator that minimizes the reconstruction error in the current graph domain. Experiments on synthetic data demonstrate that the proposed method outperforms other methods.
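    The weighting idea can be sketched as a density-ratio-weighted least-squares fit of the filter coefficients; the regressor matrix and ridge term here are illustrative, not the paper's exact estimator.

```python
import numpy as np

def ratio_weighted_least_squares(Phi, y, density_ratio, reg=1e-6):
    """Weighted least-squares fit where each historical sample is weighted
    by the estimated density ratio r(x_i) = p_current(x_i) / p_hist(x_i).

    Phi           : (n, p) regressor matrix built from historical signals
    y             : (n,) targets
    density_ratio : (n,) estimated ratios
    """
    Wphi = Phi * density_ratio[:, None]               # row-wise weighting
    A = Phi.T @ Wphi + reg * np.eye(Phi.shape[1])     # ridge for stability
    return np.linalg.solve(A, Phi.T @ (density_ratio * y))
```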
    Quantum deep recurrent reinforcement learning. (arXiv:2210.14876v1 [quant-ph])
    Recent advances in quantum computing (QC) and machine learning (ML) have drawn significant attention to the development of quantum machine learning (QML). Reinforcement learning (RL) is one of the ML paradigms which can be used to solve complex sequential decision-making problems. Classical RL has been shown to be capable of solving various challenging tasks. However, RL algorithms in the quantum world are still in their infancy. One of the challenges yet to be solved is how to train quantum RL in partially observable environments. In this paper, we approach this challenge by building QRL agents with quantum recurrent neural networks (QRNNs). Specifically, we choose the quantum long short-term memory (QLSTM) to be the core of the QRL agent and train the whole model with deep $Q$-learning. We demonstrate via numerical simulations that the QLSTM-DRQN can solve standard benchmarks such as Cart-Pole with more stable and higher average scores than a classical DRQN with a similar architecture and number of model parameters.
    Exploring the Linear Subspace Hypothesis in Gender Bias Mitigation. (arXiv:2009.09435v3 [cs.LG] UPDATED)
    Bolukbasi et al. (2016) present one of the first gender bias mitigation techniques for word embeddings. Their method takes pre-trained word embeddings as input and attempts to isolate a linear subspace that captures most of the gender bias in the embeddings. As judged by an analogical evaluation task, their method virtually eliminates gender bias in the embeddings. However, an implicit and untested assumption of their method is that the bias subspace is actually linear. In this work, we generalize their method to a kernelized, non-linear version. We take inspiration from kernel principal component analysis and derive a non-linear bias isolation technique. We discuss and overcome some of the practical drawbacks of our method for non-linear gender bias mitigation in word embeddings, and analyze empirically whether the bias subspace is actually linear. Our analysis shows that gender bias is in fact well captured by a linear subspace, justifying the assumption of Bolukbasi et al. (2016).
    VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives. (arXiv:2206.11212v2 [cs.CV] UPDATED)
    Many past works aim to improve visual reasoning in models by supervising feature importance (estimated by model explanation techniques) with human annotations such as highlights of important image regions. However, recent work has shown that performance gains from feature importance (FI) supervision for Visual Question Answering (VQA) tasks persist even with random supervision, suggesting that these methods do not meaningfully align model FI with human FI. In this paper, we show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: (1) accurate predictions given limited but sufficient information (Sufficiency); (2) max-entropy predictions given no important information (Uncertainty); (3) invariance of predictions to changes in unimportant features (Invariance); and (4) alignment between model FI explanations and human FI explanations (Plausibility). Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets in terms of both in-distribution and out-of-distribution accuracy. While past work suggests that the mechanism for improved accuracy is through improved explanation plausibility, we show that this relationship depends crucially on explanation faithfulness (whether explanations truly represent the model's internal reasoning). Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful. Lastly, we show that, surprisingly, RRR metrics are not predictive of out-of-distribution model accuracy when controlling for a model's in-distribution accuracy, which calls into question the value of these metrics for evaluating model reasoning. All supporting code is available at https://github.com/zfying/visfis
    Automatic Diagnosis of Myocarditis Disease in Cardiac MRI Modality using Deep Transformers and Explainable Artificial Intelligence. (arXiv:2210.14611v1 [cs.CV])
    Myocarditis is among the most important cardiovascular diseases (CVDs), endangering the health of many individuals by damaging the myocardium. Microbes and viruses, such as HIV, play a vital role in the incidence of myocarditis disease (MCD). Failure to diagnose MCD in its early stages is associated with irreversible complications. Cardiac magnetic resonance imaging (CMRI) is highly popular among cardiologists for diagnosing CVDs. In this paper, a deep learning (DL) based computer-aided diagnosis system (CADS) is presented for the diagnosis of MCD using CMRI images. The proposed CADS comprises dataset, preprocessing, feature extraction, classification, and post-processing steps. First, the Z-Alizadeh dataset was selected for the experiments. The preprocessing step included noise removal, image resizing, and data augmentation (DA); CutMix and MixUp techniques were used for the DA. Then, the most recent pre-trained and transformer models were used for feature extraction and classification on the CMRI images. Our results show high performance for the detection of MCD using transformer models compared with the pre-trained architectures. Among the DL architectures, the Turbulence Neural Transformer (TNT) achieved an accuracy of 99.73% with a 10-fold cross-validation strategy. The explainability-based Grad-CAM method is used to visualize suspected MCD areas in CMRI images.
    Algorithmically-Consistent Deep Learning Frameworks for Structural Topology Optimization. (arXiv:2012.05359v2 [cs.LG] UPDATED)
    Topology optimization has emerged as a popular approach to refine a component's design and increase its performance. However, current state-of-the-art topology optimization frameworks are compute-intensive, mainly due to multiple finite element analysis iterations required to evaluate the component's performance during the optimization process. Recently, machine learning (ML)-based topology optimization methods have been explored by researchers to alleviate this issue. However, previous ML approaches have mainly been demonstrated on simple two-dimensional applications with low-resolution geometry. Further, current methods are based on a single ML model for end-to-end prediction, which requires a large dataset for training. These challenges make it non-trivial to extend current approaches to higher resolutions. In this paper, we develop deep learning-based frameworks consistent with traditional topology optimization algorithms for 3D topology optimization with a reasonably fine (high) resolution. We achieve this by training multiple networks, each learning a different step of the overall topology optimization methodology, making the framework more consistent with the topology optimization algorithm. We demonstrate the application of our framework on both 2D and 3D geometries. The results show that our approach predicts the final optimized design better (5.76x reduction in total compliance MSE in 2D; 2.03x reduction in total compliance MSE in 3D) than current ML-based topology optimization methods.
    4-bit Conformer with Native Quantization Aware Training for Speech Recognition. (arXiv:2203.15952v3 [eess.AS] UPDATED)
    Reducing latency and model size has always been a significant research problem for live Automatic Speech Recognition (ASR) application scenarios. Along this direction, model quantization has become an increasingly popular approach to compress neural networks and reduce computation cost. Most existing practical ASR systems apply post-training 8-bit quantization. To achieve a higher compression rate without introducing additional performance regression, in this study we propose to develop 4-bit ASR models with native quantization-aware training, which leverages native integer operations to effectively optimize both training and inference. We conducted two experiments on state-of-the-art Conformer-based ASR models to evaluate our proposed quantization technique. First, we explored the impact of different precisions for both weight and activation quantization on the LibriSpeech dataset, and obtained a lossless 4-bit Conformer model with 7.7x size reduction compared to the float32 model. Following this, we investigated, for the first time, the viability of 4-bit quantization on a practical ASR system trained with large-scale datasets, and produced a lossless Conformer ASR model with mixed 4-bit and 8-bit weights that has 5x size reduction compared to the float32 model.
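    The core mechanism behind quantization-aware training, fake quantization with a straight-through estimator, can be sketched as follows (a generic sketch, not the authors' native-integer implementation).

```python
import torch

def fake_quantize(w, num_bits=4):
    """Symmetric per-tensor fake quantization with a straight-through
    estimator: the forward pass uses quantized weights, while gradients
    flow through as if quantization were the identity."""
    qmax = 2 ** (num_bits - 1) - 1                         # e.g. 7 for 4-bit
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()                          # STE trick
```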
    On the Role of Bidirectionality in Language Model Pre-Training. (arXiv:2205.11726v2 [cs.CL] UPDATED)
    Prior work on language model pre-training has explored different architectures and learning objectives, but differences in data, hyperparameters and evaluation make a principled comparison difficult. In this work, we focus on bidirectionality as a key factor that differentiates existing approaches, and present a comprehensive study of its role in next token prediction, text infilling, zero-shot priming and fine-tuning. We propose a new framework that generalizes prior approaches, including fully unidirectional models like GPT, fully bidirectional models like BERT, and hybrid models like CM3 and prefix LM. Our framework distinguishes between two notions of bidirectionality (bidirectional context and bidirectional attention) and allows us to control each of them separately. We find that the optimal configuration is largely application-dependent (e.g., bidirectional attention is beneficial for fine-tuning and infilling, but harmful for next token prediction and zero-shot priming). We train models with up to 6.7B parameters, and find differences to remain consistent at scale. While prior work on scaling has focused on left-to-right autoregressive models, our results suggest that this approach comes with some trade-offs, and it might be worthwhile to develop very large bidirectional models.
    HyperEF: Spectral Hypergraph Coarsening by Effective-Resistance Clustering. (arXiv:2210.14813v1 [cs.LG])
    This paper introduces a scalable algorithmic framework (HyperEF) for spectral coarsening (decomposition) of large-scale hypergraphs by exploiting hyperedge effective resistances. Motivated by the latest theoretical framework for low-resistance-diameter decomposition of simple graphs, HyperEF aims at decomposing large hypergraphs into multiple node clusters with only a few inter-cluster hyperedges. The key component in HyperEF is a nearly-linear time algorithm for estimating hyperedge effective resistances, which allows incorporating the latest diffusion-based non-linear quadratic operators defined on hypergraphs. To achieve good runtime scalability, HyperEF searches within the Krylov subspace (or approximate eigensubspace) for identifying the nearly-optimal vectors for approximating the hyperedge effective resistances. In addition, a node weight propagation scheme for multilevel spectral hypergraph decomposition has been introduced for achieving even greater node coarsening ratios. When compared with state-of-the-art hypergraph partitioning (clustering) methods, extensive experiment results on real-world VLSI designs show that HyperEF can more effectively coarsen (decompose) hypergraphs without losing key structural (spectral) properties of the original hypergraphs, while achieving over $70\times$ runtime speedups over hMetis and $20\times$ speedups over HyperSF.
    Empowering parameter-efficient transfer learning by recognizing the kernel structure in self-attention. (arXiv:2205.03720v2 [cs.CL] UPDATED)
    The massive number of trainable parameters in pre-trained language models (PLMs) makes them hard to deploy to multiple downstream tasks. To address this issue, parameter-efficient transfer learning methods have been proposed that tune only a few parameters during fine-tuning while freezing the rest. This paper looks at existing methods along this line through the kernel lens. Motivated by the connection between self-attention in transformer-based PLMs and kernel learning, we propose kernel-wise adapters, namely Kernel-mix, that utilize the kernel structure in self-attention to guide the assignment of the tunable parameters. These adapters use guidelines found in classical kernel learning and enable separate parameter tuning for each attention head. Our empirical results, over a diverse set of natural language generation and understanding tasks, show that our proposed adapters can attain or improve the strong performance of existing baselines.
    Detecting Anomalous Cryptocurrency Transactions: an AML/CFT Application of Machine Learning-based Forensics. (arXiv:2206.04803v2 [cs.CR] UPDATED)
    In shaping the Internet of Money (IoM), the application of blockchain and distributed ledger technologies (DLTs) to the financial sector has triggered regulatory concerns. Notably, while the user anonymity enabled in this field may safeguard privacy, data protection, and other civil liberties, the lack of identifiability hinders accountability, investigation, and enforcement. This challenges the framework to combat money laundering and the financing of terrorism and proliferation (AML/CFT). As both law enforcement agencies and the private sector apply forensics to track currency across blockchain ecosystems, this paper focuses on the growing relevance of these techniques. In particular, this work offers IoM-specific insights into the application of machine learning methods and network and transaction graph analysis. Namely, it analyzes a real-world dataset of Bitcoin transactions represented as a directed graph network through various machine learning techniques. The modeling of blockchain transactions as a complex network suggests that data analysis methods in machine learning, specifically designed for graphs, can aid transaction classification and help identify illicit fund transfers. Indeed, this work shows that the neural network types known as Graph Convolutional Networks (GCNs) and Graph Attention Networks (GATs) are a promising solution for AML/CFT compliance. Notably, in this AML/CFT application scenario, GCNs outperform other classic machine learning approaches.
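    For orientation, a minimal dense GCN for node-level transaction classification (e.g., licit vs. illicit) looks as follows; it is a generic sketch of the model family, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

def normalize_adjacency(adj):
    """Symmetric normalization used by GCNs: A_hat = D^{-1/2} (A + I) D^{-1/2}."""
    a = adj + torch.eye(adj.size(0))
    d_inv_sqrt = a.sum(dim=1).pow(-0.5)
    return d_inv_sqrt.unsqueeze(1) * a * d_inv_sqrt.unsqueeze(0)

class TwoLayerGCN(nn.Module):
    """Minimal dense GCN producing per-node class logits."""
    def __init__(self, in_dim, hidden_dim, n_classes):
        super().__init__()
        self.lin1 = nn.Linear(in_dim, hidden_dim)
        self.lin2 = nn.Linear(hidden_dim, n_classes)

    def forward(self, x, adj_norm):
        h = torch.relu(adj_norm @ self.lin1(x))   # propagate + transform
        return adj_norm @ self.lin2(h)            # class logits per node
```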
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v2 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a simulator, we aim to develop the first model-free, simulator-free algorithm that achieves sublinear regret and sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping; hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithm. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy, which turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve even zero constraint violation while maintaining the same order with respect to $T$.
    Desiderata for next generation of ML model serving. (arXiv:2210.14665v1 [cs.LG])
    Inference is a significant part of ML software infrastructure. Despite the variety of inference frameworks available, the field as a whole can be considered in its early days. This paper puts forth a range of important qualities that next generation of inference platforms should be aiming for. We present our rationale for the importance of each quality, and discuss ways to achieve it in practice. An overarching design pattern is data-centricity, which enables smarter monitoring in ML system operation.
    Personalized Federated Learning via Heterogeneous Modular Networks. (arXiv:2210.14830v1 [cs.LG])
    Personalized Federated Learning (PFL), which collaboratively trains a federated model while considering local clients under privacy constraints, has attracted much attention. Despite its popularity, it has been observed that existing PFL approaches result in sub-optimal solutions when the joint distribution among local clients diverges. To address this issue, we present Federated Modular Network (FedMN), a novel PFL approach that adaptively selects sub-modules from a module pool to assemble heterogeneous neural architectures for different clients. FedMN adopts a lightweight routing hypernetwork to model the joint distribution on each client and produce a personalized selection of module blocks for each client. To reduce the communication burden in existing FL, we develop an efficient way for the clients and the server to interact. We conduct extensive experiments on real-world test beds, and the results show both the effectiveness and efficiency of the proposed FedMN over the baselines.
    "Calibeating": Beating Forecasters at Their Own Game. (arXiv:2209.04892v2 [econ.TH] UPDATED)
    In order to identify expertise, forecasters should not be tested by their calibration score, which can always be made arbitrarily small, but rather by their Brier score. The Brier score is the sum of the calibration score and the refinement score; the latter measures how good the sorting into bins with the same forecast is, and thus attests to "expertise." This raises the question of whether one can gain calibration without losing expertise, which we refer to as "calibeating." We provide an easy way to calibeat any forecast, by a deterministic online procedure. We moreover show that calibeating can be achieved by a stochastic procedure that is itself calibrated, and then extend the results to simultaneously calibeating multiple procedures, and to deterministic procedures that are continuously calibrated.
    Principal Component Classification. (arXiv:2210.12746v2 [cs.LG] UPDATED)
    We propose to directly compute classification estimates by learning features encoded with their class scores using PCA. Our resulting model has an encoder-decoder structure suitable for supervised learning; it is computationally efficient and performs well on classification across several datasets.
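    Since the abstract leaves the construction implicit, the following sketch shows one plausible reading: fit PCA on features concatenated with one-hot class scores, then reconstruct the score block for test points. Treat it as an interpretation, not the authors' exact method.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_classify(X_train, y_train, X_test, n_components=10, n_classes=None):
    """Illustrative PCA-based classification: the PCA acts as an
    encoder-decoder over [features | one-hot class scores]."""
    n_classes = n_classes or int(y_train.max()) + 1
    Y = np.eye(n_classes)[y_train]                       # one-hot class scores
    pca = PCA(n_components=n_components).fit(np.hstack([X_train, Y]))
    # Encode test points with a zeroed score block, then decode it back.
    Z = np.hstack([X_test, np.zeros((len(X_test), n_classes))])
    scores = pca.inverse_transform(pca.transform(Z))[:, -n_classes:]
    return scores.argmax(axis=1)
```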
    Tomayto, Tomahto. Beyond Token-level Answer Equivalence for Question Answering Evaluation. (arXiv:2202.07654v2 [cs.CL] UPDATED)
    The predictions of question answering (QA) systems are typically evaluated against manually annotated finite sets of one or more answers. This leads to a coverage limitation that results in underestimating the true performance of systems, and is typically addressed by extending over exact match (EM) with pre-defined rules or with the token-level F1 measure. In this paper, we present the first systematic conceptual and data-driven analysis to examine the shortcomings of token-level equivalence measures. To this end, we define the asymmetric notion of answer equivalence (AE), accepting answers that are equivalent to or improve over the reference, and publish over 23k human judgments for candidates produced by multiple QA systems on SQuAD. Through a careful analysis of this data, we reveal and quantify several concrete limitations of the F1 measure, such as a false impression of graduality, or missing dependence on the question. Since collecting AE annotations for each evaluated model is expensive, we learn a BERT matching (BEM) measure to approximate this task. Being a simpler task than QA, we find BEM to provide significantly better AE approximations than F1, and to more accurately reflect the performance of systems. Finally, we demonstrate the practical utility of AE and BEM on the concrete application of minimal accurate prediction sets, reducing the number of required answers by up to 2.6x.
    Is Multi-Task Learning an Upper Bound for Continual Learning? (arXiv:2210.14797v1 [cs.LG])
    Continual and multi-task learning are common machine learning approaches to learning from multiple tasks. Existing works in the literature often treat multi-task learning as a sensible performance upper bound for various continual learning algorithms. While this assumption is empirically verified on different continual learning benchmarks, it is not rigorously justified. Moreover, it is imaginable that, when learning from multiple tasks, a small subset of these tasks could behave as adversarial tasks, reducing overall learning performance in a multi-task setting. In contrast, continual learning approaches can avoid the performance drop caused by such adversarial tasks and preserve their performance on the remaining tasks, leading to better performance than a multi-task learner. This paper proposes a novel continual self-supervised learning setting, where each task corresponds to learning an invariant representation for a specific class of data augmentations. In this setting, we show that continual learning often beats multi-task learning on various benchmark datasets, including MNIST, CIFAR-10, and CIFAR-100.
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v2 [stat.ML] UPDATED)
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
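    The key ingredient of COPP is a weighted conformal quantile, with weights given by the ratio of target to behavior policy probabilities; the sketch below simplifies the handling of the test point's weight mass.

```python
import numpy as np

def weighted_conformal_quantile(scores, weights, alpha=0.1):
    """(1 - alpha) quantile of calibration scores under importance weights
    w_i = pi_target(a_i | x_i) / pi_behavior(a_i | x_i). Simplified sketch:
    the held-out test point's weight is approximated by one unit of mass."""
    order = np.argsort(scores)
    s, w = np.asarray(scores)[order], np.asarray(weights)[order]
    p = w / (w.sum() + 1.0)          # reserve mass for the test point
    cdf = np.cumsum(p)
    idx = np.searchsorted(cdf, 1.0 - alpha)
    return s[min(idx, len(s) - 1)]   # conservative fallback: largest score
```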
    On the infinite-depth limit of finite-width neural networks. (arXiv:2210.00688v2 [stat.ML] UPDATED)
    In this paper, we study the infinite-depth limit of finite-width residual neural networks with random Gaussian weights. With proper scaling, we show that by fixing the width and taking the depth to infinity, the pre-activations converge in distribution to a zero-drift diffusion process. Unlike the infinite-width limit, where the pre-activations converge weakly to a Gaussian random variable, we show that the infinite-depth limit yields different distributions depending on the choice of the activation function. We document two cases where these distributions have closed-form (different) expressions. We further show an intriguing change-of-regime phenomenon of the post-activation norms when the width increases from 3 to 4. Lastly, we study the sequential infinite-depth-then-infinite-width limit and compare it with the more commonly studied infinite-width-then-infinite-depth limit.
    Multi-Environment based Meta-Learning with CSI Fingerprints for Radio Based Positioning. (arXiv:2210.14510v1 [eess.SP])
    Radio-based positioning of a user equipment (UE) with deep learning (DL) methods using channel state information (CSI) fingerprints has shown promising results. DL models are able to capture complex properties embedded in the CSI about a particular environment and map a UE's CSI to the UE's position. However, the CSI fingerprints and the DL models trained on such fingerprints are highly dependent on a particular propagation environment, which generally limits the transfer of the DL models' knowledge from one environment to another. In this paper, we propose a DL model consisting of two parts: the first part aims to learn environment-independent features, while the second part combines those features depending on the particular environment. To improve transfer learning, we propose a meta-learning scheme for training the first part over multiple environments. We show that, for positioning in a new environment, initializing a DL model with the meta-learned environment-independent function achieves higher UE positioning accuracy compared to regular transfer learning from one environment to the new environment, or to training the DL model from scratch with only fingerprints from the new environment. Our proposed scheme is able to create an environment-independent function that embeds knowledge from multiple environments and learns more effectively from a new environment.
    Sim-to-Real via Sim-to-Seg: End-to-end Off-road Autonomous Driving Without Real Data. (arXiv:2210.14721v1 [cs.LG])
    Autonomous driving is complex, requiring sophisticated 3D scene understanding, localization, mapping, and control. Rather than explicitly modelling and fusing each of these components, we instead consider an end-to-end approach via reinforcement learning (RL). However, collecting exploration driving data in the real world is impractical and dangerous. While training in simulation and deploying visual sim-to-real techniques has worked well for robot manipulation, deploying beyond controlled workspace viewpoints remains a challenge. In this paper, we address this challenge by presenting Sim2Seg, a re-imagining of RCAN that crosses the visual reality gap for off-road autonomous driving without using any real-world data. This is done by learning to translate randomized simulation images into simulated segmentation and depth maps, subsequently enabling real-world images to be translated as well. This allows us to train an end-to-end RL policy in simulation and directly deploy it in the real world. Our approach, which can be trained in 48 hours on 1 GPU, performs as well as a classical perception and control stack that took thousands of engineering hours over several months to build. We hope this work motivates future end-to-end autonomous driving research.
    Fairness Transferability Subject to Bounded Distribution Shift. (arXiv:2206.00129v2 [cs.LG] UPDATED)
    Given an algorithmic predictor that is "fair" on some source distribution, will it still be fair on an unknown target distribution that differs from the source within some bound? In this paper, we study the transferability of statistical group fairness for machine learning predictors (i.e., classifiers or regressors) subject to bounded distribution shifts. Such shifts may be introduced by initial training data uncertainties, user adaptation to a deployed predictor, dynamic environments, or the use of pre-trained models in new settings. Herein, we develop a bound that characterizes such transferability, flagging potentially inappropriate deployments of machine learning for socially consequential tasks. We first develop a framework for bounding violations of statistical fairness subject to distribution shift, formulating a generic upper bound for transferred fairness violations as our primary result. We then develop bounds for specific worked examples, focusing on two commonly used fairness definitions (i.e., demographic parity and equalized odds) and two classes of distribution shift (i.e., covariate shift and label shift). Finally, we compare our theoretical bounds to deterministic models of distribution shift and against real-world data, finding that we are able to estimate fairness violation bounds in practice, even when simplifying assumptions are only approximately satisfied.
    A Conditional Gradient-based Method for Simple Bilevel Optimization with Convex Lower-level Problem. (arXiv:2206.08868v2 [math.OC] UPDATED)
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane, and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.
    Streaming PAC-Bayes Gaussian process regression with a performance guarantee for online decision making. (arXiv:2210.08486v2 [cs.LG] UPDATED)
    As a powerful Bayesian non-parametric algorithm, the Gaussian process (GP) has played a significant role in Bayesian optimization and signal processing. GPs have also advanced online decision-making systems because their posterior distribution has a closed-form solution. However, their training and inference process requires all historical data to be stored and the GP model to be trained from scratch. For those reasons, several online GP algorithms, such as O-SGPR and O-SVGP, have been specifically designed for streaming settings. In this paper, we present a new theoretical framework for online GPs based on the online probably approximately correct (PAC) Bayes theory. The framework offers both a guarantee of generalized performance and good accuracy. Instead of minimizing the marginal likelihood, our algorithm optimizes both the empirical risk function and a regularization term, which is proportional to the divergence between the prior and posterior distributions of the parameters. In addition to its theoretical appeal, the algorithm performs well empirically on several regression datasets. Compared to other online GP algorithms, ours yields a generalization guarantee and very competitive accuracy.
    Robust Contextual Linear Bandits. (arXiv:2210.14483v1 [cs.LG])
    Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in the robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for our setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their $n$-round Bayes regret. Our experiments show that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.
    History-Based, Bayesian, Closure for Stochastic Parameterization: Application to Lorenz '96. (arXiv:2210.14488v1 [stat.ML])
    Physical parameterizations are used as representations of unresolved subgrid processes within weather and global climate models or coarse-scale turbulent models, whose resolutions are too coarse to resolve small-scale processes. These parameterizations are typically grounded in physically based, yet empirical, representations of the underlying small-scale processes. Machine learning-based parameterizations have recently been proposed as an alternative and have shown great promise in reducing uncertainties associated with small-scale processes. Yet, those approaches still show some important mismatches that are often attributed to stochasticity in the considered process. This stochasticity can be due to noisy data, unresolved variables, or simply the inherent chaotic nature of the process. To address these issues, we develop a new type of parameterization (closure) based on a Bayesian formalism for neural networks, to account for uncertainty quantification, which includes memory, to account for the non-instantaneous response of the closure. To overcome the curse of dimensionality of Bayesian techniques in high-dimensional spaces, the Bayesian strategy is based on a Hamiltonian Monte Carlo sampling strategy that takes advantage of the gradients of the likelihood function and kinetic energy with respect to the parameters to accelerate sampling. We apply the proposed Bayesian history-based parameterization to the Lorenz '96 model in the presence of noisy and sparse data, similar to satellite observations, and show its capacity to predict skillful forecasts of the resolved variables while returning trustworthy uncertainty quantifications for different sources of error. This approach paves the way for the use of Bayesian approaches for closure problems.
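    The Lorenz '96 testbed itself is compact enough to reproduce in a few lines; the forcing value F = 8 and the RK4 integrator below are the conventional choices, not necessarily the paper's exact setup.

```python
import numpy as np

def lorenz96_rhs(x, forcing=8.0):
    """Right-hand side of the Lorenz '96 model,
    dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F (cyclic indices)."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + forcing

def integrate_rk4(x0, dt=0.01, n_steps=1000, forcing=8.0):
    """Fourth-order Runge-Kutta integration of the Lorenz '96 system."""
    xs = [np.asarray(x0, dtype=float)]
    for _ in range(n_steps):
        x = xs[-1]
        k1 = lorenz96_rhs(x, forcing)
        k2 = lorenz96_rhs(x + 0.5 * dt * k1, forcing)
        k3 = lorenz96_rhs(x + 0.5 * dt * k2, forcing)
        k4 = lorenz96_rhs(x + dt * k3, forcing)
        xs.append(x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4))
    return np.stack(xs)
```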
    Analyzing Deep Learning Representations of Point Clouds for Real-Time In-Vehicle LiDAR Perception. (arXiv:2210.14612v1 [cs.CV])
    LiDAR sensors are an integral part of modern autonomous vehicles as they provide an accurate, high-resolution 3D representation of the vehicle's surroundings. However, it is computationally difficult to make use of the ever-increasing amounts of data from multiple high-resolution LiDAR sensors. As frame-rates, point cloud sizes and sensor resolutions increase, real-time processing of these point clouds must still extract semantics from this increasingly precise picture of the vehicle's environment. One deciding factor of the run-time performance and accuracy of deep neural networks operating on these point clouds is the underlying data representation and the way it is computed. In this work, we examine the relationship between the computational representations used in neural networks and their performance characteristics. To this end, we propose a novel computational taxonomy of LiDAR point cloud representations used in modern deep neural networks for 3D point cloud processing. Using this taxonomy, we perform a structured analysis of different families of approaches. Thereby, we uncover common advantages and limitations in terms of computational efficiency, memory requirements, and representational capacity as measured by semantic segmentation performance. Finally, we provide some insights and guidance for future developments in neural point cloud processing methods.
    Rhino: Deep Causal Temporal Relationship Learning With History-dependent Noise. (arXiv:2210.14706v1 [cs.LG])
    Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance, and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and history-dependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for time-series data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v4 [cs.LG] UPDATED)
    Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.
    Uncertainty-based Meta-Reinforcement Learning for Robust Radar Tracking. (arXiv:2210.14532v1 [cs.LG])
    Nowadays, Deep Learning (DL) methods often overcome the limitations of traditional signal processing approaches. Nevertheless, DL methods are barely applied in real-life applications. This is mainly due to limited robustness and the distributional shift between training and test data. To address this, recent work has proposed uncertainty mechanisms to increase their reliability. Besides, meta-learning aims at improving the generalization capability of DL models. Taking advantage of both, this paper proposes an uncertainty-based Meta-Reinforcement Learning (Meta-RL) approach with Out-of-Distribution (OOD) detection. The presented method performs a given task in unseen environments and provides information about its complexity. This is done by determining first and second-order statistics on the estimated reward. Using information about its complexity, the proposed algorithm is able to point out when tracking is reliable. To evaluate the proposed method, we benchmark it on a radar-tracking dataset. There, we show that our method outperforms related Meta-RL approaches on unseen tracking scenarios in peak performance by 16% and the baseline by 35% while detecting OOD data with an F1-Score of 72%. This shows that our method is robust to environmental changes and reliably detects OOD scenarios.
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v2 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of overparameterized neural networks. With a suitable random initialization, an extremely large network is well approximated by a Gaussian process, both before and during training. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimizes the generalization bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
    Efficient Large Scale Language Modeling with Mixtures of Experts. (arXiv:2112.10684v2 [cs.CL] UPDATED)
    Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation. This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings: in- and out-of-domain language modeling, zero- and few-shot priming, and full-shot fine-tuning. With the exception of fine-tuning, we find MoEs to be substantially more compute efficient. At more modest training budgets, MoEs can match the performance of dense models using $\sim$4 times less compute. This gap narrows at scale, but our largest MoE model (1.1T parameters) consistently outperforms a compute-equivalent dense model (6.7B parameters). Overall, this performance gap varies greatly across tasks and domains, suggesting that MoE and dense models generalize differently in ways that are worthy of future study. We make our code and models publicly available for research use.
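    For readers unfamiliar with conditional computation, the following is a minimal sketch of a top-1 gated Mixture-of-Experts layer in PyTorch. It illustrates only the routing idea; the paper's models use far larger experts, load-balancing losses, and distributed routing, all omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top1MoE(nn.Module):
    """Toy top-1 gated MoE layer: each token is processed by one expert."""

    def __init__(self, d_model, d_ff, n_experts):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        probs = F.softmax(self.gate(x), dim=-1) # (tokens, n_experts)
        top_p, top_idx = probs.max(dim=-1)      # route each token to 1 expert
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_idx == e
            if mask.any():
                # Scale by the gate probability so the router gets gradients.
                out[mask] = top_p[mask, None] * expert(x[mask])
        return out

moe = Top1MoE(d_model=64, d_ff=256, n_experts=4)
y = moe(torch.randn(10, 64))                    # only 1 of 4 experts per token
```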
    Multimodal sensor data fusion for in-situ classification of animal behavior using accelerometry and GNSS data. (arXiv:2206.12078v2 [cs.LG] UPDATED)
    In this paper, we examine the use of data from multiple sensing modes, i.e., accelerometry and global navigation satellite system (GNSS), for classifying animal behavior. We extract three new features from the GNSS data, namely, distance from water point, median speed, and median estimated horizontal position error. We combine the information available from the accelerometry and GNSS data via two approaches. The first approach is based on concatenating the features extracted from both sensor data and feeding the concatenated feature vector into a multi-layer perceptron (MLP) classifier. The second approach is based on fusing the posterior probabilities predicted by two MLP classifiers. The input to each classifier is the features extracted from the data of one sensing mode. We evaluate the performance of the developed multimodal animal behavior classification algorithms using two real-world datasets collected via smart cattle collar tags and ear tags. The leave-one-animal-out cross-validation results show that both approaches improve the classification performance appreciably compared with using data of only one sensing mode. This is more notable for the infrequent but important behaviors of walking and drinking. The algorithms developed based on both approaches require little in the way of computational and memory resources and are hence suitable for implementation on the embedded systems of our collar tags and ear tags. However, the multimodal animal behavior classification algorithm based on posterior probability fusion is preferable to the one based on feature concatenation as it delivers better classification accuracy, has less computational and memory complexity, is more robust to sensor data failure, and enjoys better modularity.
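    The posterior-probability-fusion approach lends itself to a very short sketch: each modality's MLP outputs class posteriors, and a convex combination of the two yields the fused decision. The equal weighting below is an assumption for illustration; the paper's exact fusion rule may differ.

```python
import numpy as np

def fuse_posteriors(p_accel, p_gnss, w=0.5):
    """p_accel, p_gnss: (n_samples, n_classes) posterior probabilities."""
    p = w * p_accel + (1.0 - w) * p_gnss        # convex combination
    return p / p.sum(axis=1, keepdims=True)     # renormalize to valid posteriors

# One sample, three behavior classes; the two modalities disagree.
p_accel = np.array([[0.7, 0.2, 0.1]])
p_gnss  = np.array([[0.4, 0.5, 0.1]])
print(fuse_posteriors(p_accel, p_gnss).argmax(axis=1))   # fused class decision
```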
    Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement. (arXiv:2210.14509v1 [cs.SD])
    Deep learning algorithms are increasingly used for speech enhancement (SE). In supervised methods, global and local information is required for accurate spectral mapping, yet a key restriction is often poor capture of key contextual information. To leverage long-term information for target speakers and compensate for distortions of cleaned speech, this paper adopts a sequence-to-sequence (S2S) mapping structure and proposes a novel monaural speech enhancement system consisting of a Feature Extraction Block (FEB), a Compensation Enhancement Block (ComEB) and a Mask Block (MB). In the FEB, a U-net block is used to extract abstract features from complex-valued spectra, with one path suppressing the background noise in the magnitude domain using masking methods; the MB takes magnitude features from the FEB and compensates them with the complex-domain features produced by ComEB to restore the final cleaned speech. Experiments are conducted on the Librispeech dataset and results show that the proposed model obtains better performance than recent models in terms of ESTOI and PESQ scores.
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v2 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.
    Towards Practical Few-Shot Query Sets: Transductive Minimum Description Length Inference. (arXiv:2210.14545v1 [cs.LG])
    Standard few-shot benchmarks are often built upon simplifying assumptions on the query sets, which may not always hold in practice. In particular, for each task at testing time, the classes effectively present in the unlabeled query set are known a priori, and correspond exactly to the set of classes represented in the labeled support set. We relax these assumptions and extend current benchmarks, so that the query-set classes of a given task are unknown, but just belong to a much larger set of possible classes. Our setting could be viewed as an instance of the challenging yet practical problem of extremely imbalanced K-way classification, K being much larger than the values typically used in standard benchmarks, and with potentially irrelevant supervision from the support set. Expectedly, our setting incurs drops in the performances of state-of-the-art methods. Motivated by these observations, we introduce a PrimAl Dual Minimum Description LEngth (PADDLE) formulation, which balances data-fitting accuracy and model complexity for a given few-shot task, under supervision constraints from the support set. Our constrained MDL-like objective promotes competition among a large set of possible classes, preserving only the effective classes that better fit the data of a few-shot task. It is hyperparameter free and can be applied on top of any base-class training. Furthermore, we derive a fast block coordinate descent algorithm for optimizing our objective, with convergence guarantee, and a linear computational complexity at each iteration. Comprehensive experiments over the standard few-shot datasets and the more realistic and challenging i-Nat dataset show highly competitive performances of our method, more so when the numbers of possible classes in the tasks increase. Our code is publicly available at https://github.com/SegoleneMartin/PADDLE.
    Multi-lingual Evaluation of Code Generation Models. (arXiv:2210.14868v1 [cs.LG])
    We present MBXP, an execution-based code completion benchmark in 10+ programming languages. This collection of datasets is generated by our conversion framework that translates prompts and test cases from the original MBPP dataset to the corresponding data in a target language. Based on this benchmark, we are able to evaluate code generation models in a multi-lingual fashion, and in particular to discover the generalization ability of language models on out-of-domain languages, the advantages of large multi-lingual models over mono-lingual ones, the benefits of few-shot prompting, and zero-shot translation abilities. In addition, we use our code generation model to perform large-scale bootstrapping to obtain synthetic canonical solutions in several languages. These solutions can be used for other code-related evaluations such as insertion-based, summarization, or code translation tasks, for which we demonstrate results and which we release as part of our benchmark.
    A survey of Bayesian Network structure learning. (arXiv:2109.11415v2 [cs.LG] UPDATED)
    Bayesian Networks (BNs) have become increasingly popular over the last few decades as a tool for reasoning under uncertainty in fields as diverse as medicine, biology, epidemiology, economics and the social sciences. This is especially true in real-world areas where we seek to answer complex questions based on hypothetical evidence to determine actions for intervention. However, determining the graphical structure of a BN remains a major challenge, especially when modelling a problem under causal assumptions. Solutions to this problem include the automated discovery of BN graphs from data, constructing them based on expert knowledge, or a combination of the two. This paper provides a comprehensive review of combinatorial algorithms proposed for learning BN structure from data, describing 74 algorithms including prototypical, well-established and state-of-the-art approaches. The basic approach of each algorithm is described in consistent terms, and the similarities and differences between them are highlighted. Methods of evaluating algorithms and their comparative performance are discussed including the consistency of claims made in the literature. Approaches for dealing with data noise in real-world datasets and incorporating expert knowledge into the learning process are also covered.
    Leveraging Demonstrations with Latent Space Priors. (arXiv:2210.14685v1 [cs.LG])
    Demonstrations provide insight into relevant state or action space regions, bearing great potential to boost the efficiency and practicality of reinforcement learning agents. In this work, we propose to leverage demonstration datasets by combining skill learning and sequence modeling. Starting with a learned joint latent space, we separately train a generative model of demonstration sequences and an accompanying low-level policy. The sequence model forms a latent space prior over plausible demonstration behaviors to accelerate learning of high-level policies. We show how to acquire such priors from state-only motion capture demonstrations and explore several methods for integrating them into policy learning on transfer tasks. Our experimental results confirm that latent space priors provide significant gains in learning speed and final performance in a set of challenging sparse-reward environments with a complex, simulated humanoid. Videos, source code and pre-trained models are available at the corresponding project website at https://facebookresearch.github.io/latent-space-priors .
    Pretrained audio neural networks for Speech emotion recognition in Portuguese. (arXiv:2210.14716v1 [cs.SD])
    The goal of speech emotion recognition (SER) is to identify the emotional aspects of speech. The SER challenge for Brazilian Portuguese speech was proposed with short snippets of Portuguese which are classified as neutral, non-neutral female and non-neutral male according to paralinguistic elements (laughing, crying, etc). This dataset contains about $50$ minutes of Brazilian Portuguese speech. As the dataset is on the small side, we investigate whether a combination of transfer learning and data augmentation techniques can produce positive results. Thus, by combining a data augmentation technique called SpecAugment with the use of Pretrained Audio Neural Networks (PANNs) for transfer learning, we are able to obtain interesting results. The PANNs (CNN6, CNN10 and CNN14) are pretrained on a large dataset called AudioSet containing more than $5000$ hours of audio. They were finetuned on the SER dataset and the best performing model (CNN10) on the validation set was submitted to the challenge, achieving an $F1$ score of $0.73$, up from the challenge-provided baseline of $0.54$. Moreover, we also tested the use of a Transformer neural architecture, pretrained on about $600$ hours of Brazilian Portuguese audio data. Transformers, as well as more complex PANN models (CNN14), fail to generalize to the test set of the SER dataset and do not beat the baseline. Considering the limitation of the dataset sizes, currently the best approach for SER is using PANNs (specifically, CNN6 and CNN10).
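    As a rough illustration of the augmentation step, the sketch below applies SpecAugment-style frequency and time masking to a mel spectrogram. The mask sizes are illustrative assumptions, not the values used in the challenge submission.

```python
import numpy as np

def spec_augment(spec, max_f=8, max_t=16, rng=np.random.default_rng(0)):
    """Zero out one random frequency band and one random time band.

    spec: (mel_bins, frames) magnitude spectrogram.
    """
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = rng.integers(0, max_f + 1)              # frequency-mask width
    f0 = rng.integers(0, n_mels - f + 1)
    spec[f0:f0 + f, :] = 0.0                    # frequency mask
    t = rng.integers(0, max_t + 1)              # time-mask width
    t0 = rng.integers(0, n_frames - t + 1)
    spec[:, t0:t0 + t] = 0.0                    # time mask
    return spec

augmented = spec_augment(np.random.rand(64, 200))
```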
    RankGen: Improving Text Generation with Large Ranking Models. (arXiv:2205.09726v2 [cs.CL] UPDATED)
    Given an input sequence (or prefix), modern language models often assign high probabilities to output sequences that are repetitive, incoherent, or irrelevant to the prefix; as such, model-generated text also contains such artifacts. To address these issues we present RankGen, a 1.2B parameter encoder model for English that scores model generations given a prefix. RankGen can be flexibly incorporated as a scoring function in beam search and used to decode from any pretrained language model. We train RankGen using large-scale contrastive learning to map a prefix close to the ground-truth sequence that follows it and far away from two types of negatives: (1) random sequences from the same document as the prefix, and (2) sequences generated from a large language model conditioned on the prefix. Experiments across four different language models (345M-11B parameters) and two domains show that RankGen significantly outperforms decoding algorithms like nucleus, top-k, and typical sampling on both automatic metrics (85.0 vs 77.3 MAUVE) as well as human evaluations with English writers (74.5% human preference over nucleus sampling). Analysis reveals that RankGen outputs are more relevant to the prefix and improve continuity and coherence compared to baselines. We release our model checkpoints, code, and human preference data with explanations to facilitate future research.
    Adversarially Robust Medical Classification via Attentive Convolutional Neural Networks. (arXiv:2210.14405v1 [cs.CV])
    Convolutional neural network-based medical image classifiers have been shown to be especially susceptible to adversarial examples. Such instabilities are likely to be unacceptable in the future of automated diagnoses. Though statistical adversarial example detection methods have proven to be effective defense mechanisms, additional research is necessary that investigates the fundamental vulnerabilities of deep-learning-based systems and how best to build models that jointly maximize traditional and robust accuracy. This paper presents the inclusion of attention mechanisms in CNN-based medical image classifiers as a reliable and effective strategy for increasing robust accuracy without sacrifice. This method is able to increase robust accuracy by up to 16% in typical adversarial scenarios and up to 2700% in extreme cases.
    Limitations of Deep Learning for Inverse Problems on Digital Hardware. (arXiv:2202.13490v3 [cs.LG] UPDATED)
    Deep neural networks have seen tremendous success over the last few years. Since the training is performed on digital hardware, in this paper, we analyze what actually can be computed on current hardware platforms modeled as Turing machines, which would lead to inherent restrictions of deep learning. For this, we focus on the class of inverse problems, which, in particular, encompasses any task to reconstruct data from measurements. We prove that finite-dimensional inverse problems are not Banach-Mazur computable for small relaxation parameters. In fact, our result even holds for Borel-Turing computability, i.e., there does not exist an algorithm which performs the training of a neural network on digital hardware for any given accuracy. Even more, our results introduce a lower bound on the accuracy that can be obtained algorithmically. This establishes a conceptual barrier on the capabilities of neural networks for finite-dimensional inverse problems given that the computations are performed on digital hardware.
    Continual Learning, Fast and Slow. (arXiv:2209.02370v2 [cs.AI] UPDATED)
    According to the Complementary Learning Systems (CLS) theory \cite{mcclelland1995there} in neuroscience, humans do effective \emph{continual learning} through two complementary systems: a fast learning system centered on the hippocampus for rapid learning of the specifics, individual experiences; and a slow learning system located in the neocortex for the gradual acquisition of structured knowledge about the environment. Motivated by this theory, we propose \emph{DualNets} (for Dual Networks), a general continual learning framework comprising a fast learning system for supervised learning of pattern-separated representation from specific tasks and a slow learning system for representation learning of task-agnostic general representation via Self-Supervised Learning (SSL). DualNets can seamlessly incorporate both representation types into a holistic framework to facilitate better continual learning in deep neural networks. Via extensive experiments, we demonstrate the promising results of DualNets on a wide range of continual learning protocols, ranging from the standard offline, task-aware setting to the challenging online, task-free scenario. Notably, on the CTrL \cite{veniat2020efficient} benchmark that has unrelated tasks with vastly different visual images, DualNets can achieve competitive performance with existing state-of-the-art dynamic architecture strategies \cite{ostapenko2021continual}. Furthermore, we conduct comprehensive ablation studies to validate DualNets efficacy, robustness, and scalability. Code is publicly available at \url{https://github.com/phquang/DualNet}.
    Learning on Large-scale Text-attributed Graphs via Variational Inference. (arXiv:2210.14709v1 [cs.LG])
    This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description. An ideal solution for such a problem would be integrating both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity brought by large language models and training GNNs on big graphs. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternately update the two modules in the E-step and M-step. Such a procedure allows the two modules to be trained separately while still interacting and mutually enhancing each other. Extensive experiments on multiple data sets demonstrate the efficiency and effectiveness of the proposed approach.
    Animal behavior classification via deep learning on embedded systems. (arXiv:2111.12295v3 [cs.LG] UPDATED)
    We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.
    Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning. (arXiv:2210.14867v1 [cs.CL])
    In this paper, we elaborate upon recipes for building multilingual representation models that are not only competitive with existing state-of-the-art models but are also more parameter efficient, thereby promoting better adoption in resource-constrained scenarios and practical applications. We show that going beyond English-centric bitexts, coupled with a novel sampling strategy aimed at reducing under-utilization of training data, substantially boosts performance across model sizes for both Electra and MLM pre-training objectives. We introduce XY-LENT: X-Y bitext enhanced Language ENcodings using Transformers, which not only achieves state-of-the-art performance over 5 cross-lingual tasks within all model size bands, but is also competitive across bands. Our XY-LENT XL variant outperforms XLM-R XXL and exhibits competitive performance with mT5 XXL while being 5x and 6x smaller respectively. We then show that our proposed method helps ameliorate the curse of multilinguality, with the XY-LENT XL achieving 99.3% GLUE performance and 98.5% SQuAD 2.0 performance compared to a SoTA English only model in the same size band. We then analyze our model's performance on extremely low resource languages and posit that scaling alone may not be sufficient for improving the performance in this scenario.
    Promises and Challenges of Causality for Ethical Machine Learning. (arXiv:2201.10683v2 [cs.LG] UPDATED)
    In recent years, there has been increasing interest in causal reasoning for designing fair decision-making systems due to its compatibility with legal frameworks, interpretability for human stakeholders, and robustness to spurious correlations inherent in observational data, among other factors. The recent attention to causal fairness, however, has been accompanied with great skepticism due to practical and epistemological challenges with applying current causal fairness approaches in the literature. Motivated by the long-standing empirical work on causality in econometrics, social sciences, and biomedical sciences, in this paper we lay out the conditions for appropriate application of causal fairness under the "potential outcomes framework." We highlight key aspects of causal inference that are often ignored in the causal fairness literature. In particular, we discuss the importance of specifying the nature and timing of interventions on social categories such as race or gender. Precisely, instead of postulating an intervention on immutable attributes, we propose a shift in focus to their perceptions and discuss the implications for fairness evaluation. We argue that such conceptualization of the intervention is key in evaluating the validity of causal assumptions and conducting sound causal analysis including avoiding post-treatment bias. Subsequently, we illustrate how causality can address the limitations of existing fairness metrics, including those that depend upon statistical correlations. Specifically, we introduce causal variants of common statistical notions of fairness, and we make a novel observation that under the causal framework there is no fundamental disagreement between different notions of fairness. Finally, we conduct extensive experiments where we demonstrate our approach for evaluating and mitigating unfairness, specially when post-treatment variables are present.
    Discovering Design Concepts for CAD Sketches. (arXiv:2210.14451v1 [cs.LG])
    Sketch design concepts are recurring patterns found in parametric CAD sketches. Though rarely explicitly formalized by the CAD designers, these concepts are implicitly used in design for modularity and regularity. In this paper, we propose a learning based approach that discovers the modular concepts by induction over raw sketches. We propose the dual implicit-explicit representation of concept structures that allows implicit detection and explicit generation, and the separation of structure generation and parameter instantiation for parameterized concept generation, to learn modular concepts by end-to-end training. We demonstrate the design concept learning on a large scale CAD sketch dataset and show its applications for design intent interpretation and auto-completion.
    Hierarchical Federated Learning with Momentum Acceleration in Multi-Tier Networks. (arXiv:2210.14560v1 [cs.LG])
    In this paper, we propose Hierarchical Federated Learning with Momentum Acceleration (HierMo), a three-tier worker-edge-cloud federated learning algorithm that applies momentum for training acceleration. Momentum is calculated and aggregated in the three tiers. We provide convergence analysis for HierMo, showing a convergence rate of O(1/T). In the analysis, we develop a new approach to characterize model aggregation, momentum aggregation, and their interactions. Based on this result, we prove that HierMo achieves a tighter convergence upper bound compared with HierFAVG without momentum. We also propose HierOPT, which optimizes the aggregation periods (worker-edge and edge-cloud aggregation periods) to minimize the loss given a limited training time.
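    A toy sketch of the momentum-aggregation idea follows: workers run momentum SGD locally, and the upper tiers periodically average both the model weights and the momenta. The sketch compresses the three tiers into a single aggregation level and uses invented quadratic losses, so it conveys only the flavor of HierMo, not its exact update rules or aggregation periods.

```python
import numpy as np

def local_step(w, m, grad, lr=0.1, beta=0.9):
    """One worker-side momentum SGD step."""
    m = beta * m + grad
    return w - lr * m, m

def aggregate(ws, ms):
    """Edge/cloud aggregation: average weights and momenta alike."""
    return np.mean(ws, axis=0), np.mean(ms, axis=0)

# Two workers with toy quadratic losses f_i(w) = 0.5 * (w - c_i)^2.
cs = [1.0, -0.5]
w, m = np.zeros(1), np.zeros(1)
workers = [(w.copy(), m.copy()) for _ in cs]
for rnd in range(20):                      # aggregation rounds
    for _ in range(5):                     # local steps between aggregations
        workers = [local_step(wi, mi, wi - c)
                   for (wi, mi), c in zip(workers, cs)]
    w, m = aggregate([wi for wi, _ in workers], [mi for _, mi in workers])
    workers = [(w.copy(), m.copy()) for _ in cs]
print(w)   # approaches the mean of the two optima, 0.25
```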
    Deep Metric Learning with Adaptive Margin and Adaptive Scale for Acoustic Word Discrimination. (arXiv:2210.14564v1 [eess.AS])
    Many recent loss functions in deep metric learning are expressed with logarithmic and exponential forms, and they involve margin and scale as essential hyper-parameters. Since each data class has an intrinsic characteristic, several previous works have tried to learn embedding space close to the real distribution by introducing adaptive margins. However, there has been no work on adaptive scales at all. We argue that both margin and scale should be adaptively adjustable during the training. In this paper, we propose a method called Adaptive Margin and Scale (AdaMS), where the hyper-parameters of margin and scale are replaced with learnable parameters of adaptive margins and adaptive scales for each class. Our method is evaluated on the Wall Street Journal dataset, and it outperforms previous approaches on word discrimination tasks.
    Position tracking of a varying number of sound sources with sliding permutation invariant training. (arXiv:2210.14536v1 [eess.AS])
    Recent data- and learning-based sound source localization (SSL) methods have shown strong performance in challenging acoustic scenarios. However, little work has been done on adapting such methods to consistently track multiple sources that appear and disappear, as would occur in reality. In this paper, we present a new training strategy for deep learning SSL models with a straightforward implementation based on the mean squared error of the optimal association between estimated and reference positions in the preceding time frames. It optimizes the desired properties of a tracking system: handling a time-varying number of sources and ordering localization estimates according to their trajectories, thereby minimizing identity switches (IDSs). Evaluation on simulated data of multiple reverberant moving sources and on two model architectures proves its effectiveness in reducing identity switches without compromising frame-wise localization accuracy.
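    The core of the training strategy is an optimal association between estimated and reference positions. A single-frame sketch using the Hungarian algorithm is shown below; the paper's sliding variant additionally carries this association over the preceding frames, which is omitted here.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pit_mse(est, ref):
    """Permutation-invariant localization loss for one frame.

    est: (n_est, 3) estimated positions, ref: (n_ref, 3) reference positions.
    Returns the MSE under the cost-minimizing assignment and the assignment.
    """
    cost = ((est[:, None, :] - ref[None, :, :]) ** 2).sum(-1)  # pairwise SE
    rows, cols = linear_sum_assignment(cost)                   # optimal match
    return cost[rows, cols].mean(), list(zip(rows, cols))

est = np.array([[0.0, 1.0, 0.0], [2.0, 0.0, 0.0]])
ref = np.array([[2.1, 0.0, 0.0], [0.0, 0.9, 0.0]])
loss, assignment = pit_mse(est, ref)   # matches est[0]<->ref[1], est[1]<->ref[0]
```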
    Classification and Self-Supervised Regression of Arrhythmic ECG Signals Using Convolutional Neural Networks. (arXiv:2210.14253v1 [cs.LG])
    Interpretation of electrocardiography (ECG) signals is required for diagnosing cardiac arrhythmia. Recently, machine learning techniques have been applied for automated computer-aided diagnosis. Machine learning tasks can be divided into regression and classification. Regression can be used for noise and artifact removal as well as to resolve issues of missing data from low sampling frequency. The classification task concerns the prediction of output diagnostic classes according to expert-labeled input classes. In this work, we propose a deep neural network model capable of solving regression and classification tasks. Moreover, we combined the two approaches, using unlabeled and labeled data, to train the model. We tested the model on the MIT-BIH Arrhythmia database. Our method showed high effectiveness in detecting cardiac arrhythmia based on modified Lead II ECG records, as well as achieved high quality of ECG signal approximation. For the former, our method attained an overall accuracy of 87.33% and a balanced accuracy of 80.54%, on par with reference approaches. For the latter, the application of self-supervised learning allowed for training without the need for expert labels. The regression model yielded satisfactory performance with fairly accurate prediction of QRS complexes. Transferring knowledge from regression to the classification task, our method attained a higher overall accuracy of 87.78%.
    ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits. (arXiv:2210.14322v1 [cs.LG])
    We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem. The only two existing attempts in this line of work fall short across multiple dimensions, including pessimistic measures of non-stationary complexity and non-adaptive parameter tuning that requires knowledge of the number of preference changes. We develop an elimination-based rescheduling algorithm to overcome these shortcomings and show a near-optimal $\tilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound, where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds. This yields the first near-optimal dynamic regret algorithm for unknown $S^{\texttt{CW}}$. We further study other related notions of non-stationarity for which we also prove near-optimal dynamic regret guarantees under additional assumptions on the underlying preference model.
    On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition. (arXiv:2209.07474v3 [cs.CV] UPDATED)
    Recently vision transformers have been shown to be competitive with convolution-based methods (CNNs) broadly across multiple vision tasks. The less restrictive inductive bias of transformers endows greater representational capacity in comparison with CNNs. However, in the image classification setting this flexibility comes with a trade-off with respect to sample efficiency, where transformers require ImageNet-scale training. This notion has carried over to video where transformers have not yet been explored for video classification in the low-labeled or semi-supervised settings. Our work empirically explores the low data regime for video classification and discovers that, surprisingly, transformers perform extremely well in the low-labeled video setting compared to CNNs. We specifically evaluate video vision transformers across two contrasting video datasets (Kinetics-400 and SomethingSomething-V2) and perform thorough analysis and ablation studies to explain this observation using the predominant features of video transformer architectures. We even show that using just the labeled data, transformers significantly outperform complex semi-supervised CNN methods that leverage large-scale unlabeled data as well. Our experiments inform our recommendation that semi-supervised learning video work should consider the use of video transformers in the future.
    WaveBound: Dynamic Error Bounds for Stable Time Series Forecasting. (arXiv:2210.14303v1 [cs.LG])
    Time series forecasting has become a critical task due to its high practicality in real-world applications such as traffic, energy consumption, economics and finance, and disease analysis. Recent deep-learning-based approaches have shown remarkable success in time series forecasting. Nonetheless, due to the dynamics of time series data, deep networks still suffer from unstable training and overfitting. Inconsistent patterns appearing in real-world data lead the model to be biased to a particular pattern, thus limiting the generalization. In this work, we introduce dynamic error bounds on the training loss to address the overfitting issue in time series forecasting. Consequently, we propose a regularization method called WaveBound which estimates the adequate error bounds of training loss for each time step and feature at each iteration. By allowing the model to focus less on unpredictable data, WaveBound stabilizes the training process, thus significantly improving generalization. With extensive experiments, we show that WaveBound consistently improves upon the existing models by large margins, including the state-of-the-art model.
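    A hedged sketch of the bounding idea follows: per-step, per-feature errors below an estimated bound are clipped to zero so the model stops chasing unpredictable residuals. In WaveBound the bound is derived from a slowly updated copy of the model; the constant bound used below is purely illustrative.

```python
import torch

def bounded_mse(pred, target, bound):
    """MSE where error below the per-element bound contributes nothing."""
    err = (pred - target) ** 2                  # per-step, per-feature error
    return torch.clamp(err - bound, min=0.0).mean()

pred = torch.randn(4, 24, 7)                    # (batch, horizon, features)
target = torch.randn(4, 24, 7)
bound = 0.1 * torch.ones(4, 24, 7)              # illustrative constant bound
loss = bounded_mse(pred, target, bound)         # drop-in training loss
```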
    On Robust Incremental Learning over Many Multilingual Steps. (arXiv:2210.14307v1 [cs.CL])
    Recent work in incremental learning has introduced diverse approaches to tackle catastrophic forgetting from data augmentation to optimized training regimes. However, most of them focus on very few training steps. We propose a method for robust incremental learning over dozens of fine-tuning steps using data from a variety of languages. We show that a combination of data-augmentation and an optimized training regime allows us to continue improving the model even for as many as fifty training steps. Crucially, our augmentation strategy does not require retaining access to previous training data and is suitable in scenarios with privacy constraints.
    Adaptive deep density approximation for fractional Fokker-Planck equations. (arXiv:2210.14402v1 [cs.LG])
    In this work, we propose adaptive deep learning approaches based on normalizing flows for solving fractional Fokker-Planck equations (FPEs). The solution of a FPE is a probability density function (PDF). Traditional mesh-based methods are ineffective because of the unbounded computation domain, a large number of dimensions and the nonlocal fractional operator. To this end, we represent the solution with an explicit PDF model induced by a flow-based deep generative model, simplified KRnet, which constructs a transport map from a simple distribution to the target distribution. We consider two methods to approximate the fractional Laplacian. One method is the Monte Carlo approximation. The other method is to construct an auxiliary model with Gaussian radial basis functions (GRBFs) to approximate the solution such that we may take advantage of the fact that the fractional Laplacian of a Gaussian is known analytically. Based on these two different ways for the approximation of the fractional Laplacian, we propose two models, MCNF and GRBFNF, to approximate stationary FPEs and MCTNF to approximate time-dependent FPEs. To further improve the accuracy, we refine the training set and the approximate solution alternately. A variety of numerical examples is presented to demonstrate the effectiveness of our adaptive deep density approaches.
    RBP-DIP: High-Quality CT Reconstruction Using an Untrained Neural Network with Residual Back Projection and Deep Image Prior. (arXiv:2210.14416v1 [eess.IV])
    Neural network related methods, due to their unprecedented success in image processing, have emerged as a new set of tools in CT reconstruction with the potential to change the field. However, the lack of high-quality training data and theoretical guarantees, together with increasingly complicated network structures, make its implementation impractical. In this paper, we present a new framework (RBP-DIP) based on Deep Image Prior (DIP) and a special residual back projection (RBP) connection to tackle these challenges. Compared to other pre-trained neural-network-related algorithms, the proposed framework is closer to an iterative reconstruction (IR) algorithm as it requires no training data or training process. As a result, the proposed framework can be altered (e.g., different hyperparameters and constraints) on demand, adapting to different conditions (e.g., different imaged objects, imaging instruments, and noise levels) without retraining. Experiments show that the proposed framework has significant improvements over other state-of-the-art conventional methods, as well as pre-trained and untrained models with similar network structures, especially under sparse-view, limited-angle, and low-dose conditions.
    Arc travel time and path choice model estimation subsumed. (arXiv:2210.14351v1 [stat.ME])
    We propose a method for maximum likelihood estimation of path choice model parameters and arc travel time using data of different levels of granularity. Hitherto these two tasks have been tackled separately under strong assumptions. Using a small example, we illustrate that this can lead to biased results. Results on both real (New York yellow cab) and simulated data show strong performance of our method compared to existing baselines.
    Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach. (arXiv:2210.14420v1 [stat.ML])
    In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditioning on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus does not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis model to Bayesian neural network model. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example.
    D-Shape: Demonstration-Shaped Reinforcement Learning via Goal Conditioning. (arXiv:2210.14428v1 [cs.LG])
    While combining imitation learning (IL) and reinforcement learning (RL) is a promising way to address poor sample efficiency in autonomous behavior acquisition, methods that do so typically assume that the requisite behavior demonstrations are provided by an expert that behaves optimally with respect to a task reward. If, however, suboptimal demonstrations are provided, a fundamental challenge appears in that the demonstration-matching objective of IL conflicts with the return-maximization objective of RL. This paper introduces D-Shape, a new method for combining IL and RL that uses ideas from reward shaping and goal-conditioned RL to resolve the above conflict. D-Shape allows learning from suboptimal demonstrations while retaining the ability to find the optimal policy with respect to the task reward. We experimentally validate D-Shape in sparse-reward gridworld domains, showing that it both improves over RL in terms of sample efficiency and converges consistently to the optimal policy in the presence of suboptimal demonstrations.
    Which is the best model for my data?. (arXiv:2210.14687v1 [cs.LG])
    In this paper, we tackle the problem of selecting the optimal model for a given structured pattern classification dataset. In this context, a model can be understood as a classifier and a hyperparameter configuration. The proposed meta-learning approach purely relies on machine learning and involves four major steps. Firstly, we present a concise collection of 62 meta-features that address the problem of information cancellation when aggregating measure values that involve positive and negative measurements. Secondly, we describe two different approaches for synthetic data generation intending to enlarge the training data. Thirdly, we fit a set of pre-defined classification models for each classification problem while optimizing their hyperparameters using grid search. The goal is to create a meta-dataset such that each row denotes a multilabel instance describing a specific problem. The features of these meta-instances denote the statistical properties of the generated datasets, while the labels encode the grid search results as binary vectors such that best-performing models are positively labeled. Finally, we tackle the model selection problem with several multilabel classifiers, including a Convolutional Neural Network designed to handle tabular data. The simulation results show that our meta-learning approach can correctly predict an optimal model for 91% of the synthetic datasets and for 87% of the real-world datasets. Furthermore, we noticed that most meta-classifiers produced better results when using our meta-features. Overall, our proposal differs from other meta-learning approaches since it tackles the algorithm selection and hyperparameter tuning problems in a single step. Lastly, we perform a feature importance analysis to determine which statistical features drive the model selection mechanism.
    AD-DMKDE: Anomaly Detection through Density Matrices and Fourier Features. (arXiv:2210.14796v1 [cs.LG])
    This paper presents a novel density estimation method for anomaly detection using density matrices (a powerful mathematical formalism from quantum mechanics) and Fourier features. The method can be seen as an efficient approximation of Kernel Density Estimation (KDE). A systematic comparison of the proposed method with eleven state-of-the-art anomaly detection methods on various data sets is presented, showing competitive performance on different benchmark data sets. The method is trained efficiently and it uses optimization to find the parameters of data embedding. The prediction phase complexity of the proposed algorithm is constant relative to the training data size, and it performs well in data sets with different anomaly rates. Its architecture allows vectorization and can be implemented on GPU/TPU hardware.
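    The kernel-approximation core of such a method can be sketched with random Fourier features: a Gaussian kernel factorizes approximately as an inner product of feature maps, so a KDE-style density score reduces to a dot product with the mean training embedding. The paper's density-matrix formulation is richer than this; the dimensions and kernel width below are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 2, 512, 2.0      # data dim, #Fourier features, kernel width

# Random Fourier features for k(x, y) = exp(-gamma * ||x - y||^2):
# w ~ N(0, 2*gamma*I), b ~ Uniform(0, 2*pi), phi(x) = sqrt(2/D) cos(Wx + b).
W = rng.normal(scale=np.sqrt(2 * gamma), size=(D, d))
b = rng.uniform(0, 2 * np.pi, size=D)

def phi(X):
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

X_train = rng.normal(size=(1000, d))
mean_embedding = phi(X_train).mean(axis=0)   # summarizes the training density

def density_score(x):
    # Approximate KDE value at x; low scores flag anomalies.
    return float(phi(x[None, :]) @ mean_embedding)

print(density_score(np.zeros(d)))      # high: near the bulk of the data
print(density_score(np.full(d, 5.0)))  # low: far from the data, anomalous
```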
    IMU2CLIP: Multimodal Contrastive Learning for IMU Motion Sensors from Egocentric Videos and Text. (arXiv:2210.14395v1 [cs.CV])
    We present IMU2CLIP, a novel pre-training approach to align Inertial Measurement Unit (IMU) motion sensor recordings with video and text, by projecting them into the joint representation space of Contrastive Language-Image Pre-training (CLIP). The proposed approach allows IMU2CLIP to translate human motions (as measured by IMU sensors) into their corresponding textual descriptions and videos -- while preserving the transitivity across these modalities. We explore several new IMU-based applications that IMU2CLIP enables, such as motion-based media retrieval and natural language reasoning tasks with motion data. In addition, we show that IMU2CLIP can significantly improve the downstream performance when fine-tuned for each application (e.g. activity recognition), demonstrating the universal usage of IMU2CLIP as a new pre-trained resource. Our code will be made publicly available.
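    Alignment into the CLIP space is typically trained with a symmetric InfoNCE objective over paired batches; a minimal sketch is given below with the encoders omitted. The temperature and embedding sizes are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(imu_emb, clip_emb, temperature=0.07):
    """Symmetric InfoNCE: matching (IMU, video/text) pairs sit on the diagonal."""
    imu_emb = F.normalize(imu_emb, dim=-1)
    clip_emb = F.normalize(clip_emb, dim=-1)
    logits = imu_emb @ clip_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(len(imu_emb))            # i-th IMU matches i-th clip
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Embeddings would come from an IMU encoder and a frozen CLIP encoder.
loss = clip_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```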
    DiffuSeq: Sequence to Sequence Text Generation with Diffusion Models. (arXiv:2210.08933v2 [cs.CL] UPDATED)
    Recently, diffusion models have emerged as a new paradigm for generative models. Despite the success in domains using continuous signals such as vision and audio, adapting diffusion models to natural language is difficult due to the discrete nature of text. We tackle this challenge by proposing DiffuSeq: a diffusion model designed for sequence-to-sequence (Seq2Seq) text generation tasks. Upon extensive evaluation over a wide range of Seq2Seq tasks, we find DiffuSeq achieving comparable or even better performance than six established baselines, including a state-of-the-art model that is based on pre-trained language models. Apart from quality, an intriguing property of DiffuSeq is its high diversity during generation, which is desired in many Seq2Seq tasks. We further include a theoretical analysis revealing the connection between DiffuSeq and autoregressive/non-autoregressive models. Bringing together theoretical analysis and empirical evidence, we demonstrate the great potential of diffusion models in complex conditional language generation tasks.
    Wasserstein Archetypal Analysis. (arXiv:2210.14298v1 [stat.ML])
    Archetypal analysis is an unsupervised machine learning method that summarizes data using a convex polytope. In its original formulation, for fixed k, the method finds a convex polytope with k vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared Euclidean distance between the data and the polytope is minimal. In the present work, we consider an alternative formulation of archetypal analysis based on the Wasserstein metric, which we call Wasserstein archetypal analysis (WAA). In one dimension, there exists a unique solution of WAA and, in two dimensions, we prove existence of a solution, as long as the data distribution is absolutely continuous with respect to Lebesgue measure. We discuss obstacles to extending our result to higher dimensions and general data distributions. We then introduce an appropriate regularization of the problem, via a Renyi entropy, which allows us to obtain existence of solutions of the regularized problem for general data distributions, in arbitrary dimensions. We prove a consistency result for the regularized problem, ensuring that if the data are iid samples from a probability measure, then as the number of samples is increased, a subsequence of the archetype points converges to the archetype points for the limiting data distribution, almost surely. Finally, we develop and implement a gradient-based computational approach for the two-dimensional problem, based on the semi-discrete formulation of the Wasserstein metric. Our analysis is supported by detailed computational experiments.
    CaloFlow for CaloChallenge Dataset 1. (arXiv:2210.14245v1 [physics.ins-det])
    CaloFlow is a new and promising approach to fast calorimeter simulation based on normalizing flows. Applying CaloFlow to the photon and charged pion Geant4 showers of Dataset 1 of the Fast Calorimeter Simulation Challenge 2022, we show how it can produce high-fidelity samples with a sampling time that is several orders of magnitude faster than Geant4. We demonstrate the fidelity of the samples using calorimeter shower images, histograms of high level features, and aggregate metrics such as a classifier trained to distinguish CaloFlow from Geant4 samples.
    Learning Causal Graphs in Manufacturing Domains using Structural Equation Models. (arXiv:2210.14573v1 [stat.ML])
    Many production processes are characterized by numerous and complex cause-and-effect relationships. Since these are only partially known, they pose a challenge to effective process control. In this work we present how Structural Equation Models can be used to derive cause-and-effect relationships from the combination of prior knowledge and process data in the manufacturing domain. Compared to existing applications, we do not assume linear relationships, leading to more informative results.
    On the pragmatism of using binary classifiers over data intensive neural network classifiers for detection of COVID-19 from voice. (arXiv:2204.04802v2 [cs.SD] UPDATED)
    Lately, there has been a global effort by multiple research groups to detect COVID-19 from voice. Different researchers use different kinds of information from the voice signal to achieve this. Various types of phonated sounds and the sound of cough and breath have all been used with varying degrees of success in automated voice-based COVID-19 detection apps. In this paper, we show that detecting COVID-19 from voice does not require custom-made non-standard features or complicated neural network classifiers; rather, it can be successfully done with just standard features and simple binary classifiers. In fact, we show that the latter are not only more accurate and interpretable but also more computationally efficient in that they can be run locally on small devices. We demonstrate this on a human-curated dataset of over 1000 subjects, collected and calibrated in clinical settings.
    Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input. (arXiv:2210.14648v1 [eess.AS])
    The Masked Autoencoder is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting the representations of masked patches; however, we think using all patches to encode training signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only masked patches. In M2D, the online network encodes visible patches and predicts masked patch representations, and the target network, a momentum encoder, encodes masked patches. To better predict target representations, the online network should model the input well, while the target network should also model it well to agree with the online predictions. The learned representations should then better model the input. We validated M2D by learning general-purpose audio representations, and M2D set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2.
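    The target network described above is a momentum encoder, i.e., its weights track an exponential moving average (EMA) of the online network's weights. A minimal sketch of that update is shown below; the patch masking and predictor head are omitted, and the decay value is an assumption.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(online: nn.Module, target: nn.Module, decay=0.996):
    """Move each target parameter toward its online counterpart (EMA)."""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(decay).add_(p_o, alpha=1.0 - decay)

# Stand-in encoders; real ones would be transformer patch encoders.
online = nn.Linear(16, 16)
target = nn.Linear(16, 16)
target.load_state_dict(online.state_dict())   # initialize target = online
ema_update(online, target)                    # called after each optimizer step
```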
    Sparsity in Continuous-Depth Neural Networks. (arXiv:2210.14672v1 [cs.LG])
    Neural Ordinary Differential Equations (NODEs) have proven successful in learning dynamical systems in terms of accurately recovering the observed trajectories. While different types of sparsity have been proposed to improve robustness, the generalization properties of NODEs for dynamical systems beyond the observed data are underexplored. We systematically study the influence of weight and feature sparsity on forecasting as well as on identifying the underlying dynamical laws. Besides assessing existing methods, we propose a regularization technique to sparsify "input-output connections" and extract relevant features during training. Moreover, we curate real-world datasets consisting of human motion capture and human hematopoiesis single-cell RNA-seq data to realistically analyze different levels of out-of-distribution (OOD) generalization in forecasting and dynamics identification respectively. Our extensive empirical evaluation on these challenging benchmarks suggests that weight sparsity improves generalization in the presence of noise or irregular sampling. However, it does not prevent learning spurious feature dependencies in the inferred dynamics, rendering them impractical for predictions under interventions, or for inferring the true underlying dynamics. Instead, feature sparsity can indeed help with recovering sparse ground-truth dynamics compared to unregularized NODEs.
    Full-band General Audio Synthesis with Score-based Diffusion. (arXiv:2210.14661v1 [cs.SD])
    Recent works have shown the capability of deep generative models to tackle general audio synthesis from a single label, producing a variety of impulsive, tonal, and environmental sounds. Such models operate on band-limited signals and, as a result of an autoregressive approach, are typically composed of pre-trained latent encoders and/or several cascaded modules. In this work, we propose a diffusion-based generative model for general audio synthesis, named DAG, which deals with full-band signals end-to-end in the waveform domain. Results show the superiority of DAG over existing label-conditioned generators in terms of both quality and diversity. More specifically, when compared to the state of the art, the band-limited and full-band versions of DAG achieve relative improvements of up to 40% and 65%, respectively. We believe DAG is flexible enough to accommodate different conditioning schemas while providing good-quality synthesis.
    Don't Prompt, Search! Mining-based Zero-Shot Learning with Language Models. (arXiv:2210.14803v1 [cs.CL])
    Masked language models like BERT can perform text classification in a zero-shot fashion by reformulating downstream tasks as text infilling. However, this approach is highly sensitive to the template used to prompt the model, yet practitioners are blind when designing them in strict zero-shot settings. In this paper, we propose an alternative mining-based approach for zero-shot learning. Instead of prompting language models, we use regular expressions to mine labeled examples from unlabeled corpora, which can optionally be filtered through prompting, and used to finetune a pretrained model. Our method is more flexible and interpretable than prompting, and outperforms it on a wide range of tasks when using comparable templates. Our results suggest that the success of prompting can partly be explained by the model being exposed to similar examples during pretraining, which can be directly retrieved through regular expressions.
    On the uncertainty principle of neural networks. (arXiv:2205.01493v2 [cs.LG] UPDATED)
    Despite their successes in many fields, neural networks are found to be difficult to make both accurate and robust; that is, high-accuracy networks are often vulnerable. Various empirical and analytic studies have substantiated that there is a more or less pronounced trade-off between the accuracy and robustness of neural networks. If this property is inherent, applications based on neural networks risk producing untrustworthy predictions. To explore and understand this issue more deeply, in this study we show that the accuracy-robustness trade-off is an intrinsic property whose underlying mechanism is closely related to the uncertainty principle in quantum mechanics. By relating the loss function in neural networks to the wave function in quantum mechanics, we show that the inputs and their conjugates cannot be resolved by a neural network simultaneously. This work thus provides an insightful explanation for the inevitability of the accuracy-robustness dilemma in general deep networks from an entirely new perspective and, furthermore, suggests the possibility of studying various properties of neural networks with the mature mathematical tools of quantum physics.
    'A net for everyone': fully personalized and unsupervised neural networks trained with longitudinal data from a single patient. (arXiv:2210.14228v1 [cs.LG])
    With the rise in importance of personalized medicine, we trained personalized neural networks to detect tumor progression in longitudinal datasets. The model was evaluated on two datasets with a total of 64 scans from 32 patients diagnosed with glioblastoma multiforme (GBM). Contrast-enhanced T1w sequences of brain magnetic resonance imaging (MRI) were used in this study. For each patient, we trained their own neural network using just two images from different timepoints. Our approach uses a Wasserstein-GAN (generative adversarial network), an unsupervised network architecture, to map the differences between the two images; using this map, the change in tumor volume can be evaluated. Due to the combination of data augmentation and the network architecture, co-registration of the two images is not needed. Furthermore, we do not rely on any additional training data, (manual) annotations, or pre-trained neural networks. The model achieved an AUC of 0.87 for detecting tumor change. We also introduce a modified RANO criterion, for which an accuracy of 66% is achieved. We show that data from just one patient can suffice to train deep neural networks to monitor tumor change.
    ViNL: Visual Navigation and Locomotion Over Obstacles. (arXiv:2210.14791v1 [cs.RO])
    We present Visual Navigation and Locomotion over obstacles (ViNL), which enables a quadrupedal robot to navigate unseen apartments while stepping over small obstacles that lie in its path (e.g., shoes, toys, cables), similar to how humans and pets lift their feet over objects as they walk. ViNL consists of: (1) a visual navigation policy that outputs linear and angular velocity commands that guide the robot to a goal coordinate in unfamiliar indoor environments; and (2) a visual locomotion policy that controls the robot's joints to avoid stepping on obstacles while following provided velocity commands. Both policies are entirely "model-free", i.e., sensors-to-actions neural networks trained end-to-end. The two are trained independently in two entirely different simulators and then seamlessly co-deployed by feeding the velocity commands from the navigator to the locomotor, entirely "zero-shot" (without any co-training). While prior works have developed learning methods for visual navigation or visual locomotion, to the best of our knowledge, this is the first fully learned approach that leverages vision to accomplish both (1) intelligent navigation in new environments, and (2) intelligent visual locomotion that aims to traverse cluttered environments without disrupting obstacles. On the task of navigation to distant goals in unknown environments, ViNL using just egocentric vision significantly outperforms prior work on robust locomotion using privileged terrain maps (+32.8% success and -4.42 collisions per meter). Additionally, we ablate our locomotion policy to show that each aspect of our approach helps reduce obstacle collisions. Videos and code at this http URL
    Data reconstruction of turbulent flows with Gappy POD, Extended POD and Generative Adversarial Networks. (arXiv:2210.11921v1 [physics.flu-dyn] CROSS LISTED)
    Three methods are used to reconstruct two-dimensional instantaneous velocity fields in a turbulent flow under rotation. The first two, Gappy POD (GPOD) and Extended POD (EPOD), are based on the linear proper orthogonal decomposition (POD), while the third reconstructs the flow using a fully non-linear Convolutional Neural Network embedded in a Generative Adversarial Network (GAN). First, we show that for GPOD with dimension reduction there is always an optimal number of modes for a given gap; moreover, adopting a Lasso regularizer for GPOD provides comparable reconstruction results. In order to systematically compare the applicability of the three tools, we consider a square gap of varying size. Results show that, compared with the POD-based methods, GAN reconstruction not only has a smaller $L_2$ error, but also better turbulent statistics of both the velocity module and its gradient. This can be attributed to the network's capacity for expressing nonlinearity and to the presence of the adversarial loss during GAN training. We also investigate the effect of the adversarial ratio, which controls the trade-off between the $L_2$ error and the statistical properties. Finally, we assess the reconstructions under random gappiness. All methods perform well for small and medium-size gaps, while GAN works better when the gappiness is large.
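    As a pointer to how the GPOD baseline works, here is a minimal sketch on synthetic low-rank data: a POD basis is built from complete snapshots, and a gappy field is completed by least-squares fitting of the POD coefficients on the observed pixels only. Data, sizes, and the gap layout are all illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
n_pix, n_snap, r = 400, 200, 10            # pixels, snapshots, retained modes
# Synthetic low-rank "flow" snapshots standing in for real velocity fields.
modes_true = rng.standard_normal((n_pix, r))
X = modes_true @ rng.standard_normal((r, n_snap)) \
    + 0.01 * rng.standard_normal((n_pix, n_snap))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Phi = U[:, :r]                             # POD basis from complete snapshots

field = X[:, 0].copy()
mask = np.ones(n_pix, dtype=bool)
mask[100:160] = False                      # the "gap": 60 missing pixels

# Gappy POD: fit POD coefficients on observed pixels, then fill the gap.
a, *_ = np.linalg.lstsq(Phi[mask], field[mask], rcond=None)
recon = Phi @ a
err = np.linalg.norm(recon[~mask] - field[~mask]) / np.linalg.norm(field[~mask])
print(f"relative L2 error inside the gap: {err:.3e}")
```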
    FO-PINNs: A First-Order formulation for Physics Informed Neural Networks. (arXiv:2210.14320v1 [cs.LG])
    We present FO-PINNs, physics-informed neural networks that are trained using a first-order formulation of the Partial Differential Equation (PDE) losses. We show that FO-PINNs offer significantly higher accuracy in solving parameterized systems compared to traditional PINNs, and reduce time-per-iteration by removing the extra backpropagations needed to compute second- or higher-order derivatives. Additionally, unlike standard PINNs, FO-PINNs can be used with exact imposition of boundary conditions using approximate distance functions, and can be trained using Automatic Mixed Precision (AMP) to further speed up the training. Through two examples, Helmholtz and Navier-Stokes, we demonstrate the advantages of FO-PINNs over traditional PINNs in terms of accuracy and training speedup.
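    The first-order trick is easy to see on a toy problem. In the hedged sketch below (our own minimal example, not the authors' code), a network for the 1D Poisson equation u''(x) = f(x) outputs an auxiliary v ≈ u' alongside u, so both the PDE residual v' - f and the compatibility residual u' - v need only first-order autograd derivatives.
```python
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 2))  # outputs (u, v)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
f = lambda x: -torch.sin(x)        # source chosen so u(x) = sin(x) solves u'' = f

for step in range(200):
    x = (torch.rand(256, 1) * torch.pi).requires_grad_(True)
    u, v = net(x).split(1, dim=1)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    dv = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
    pde = ((dv - f(x)) ** 2).mean()            # first-order PDE residual: v' = f
    compat = ((du - v) ** 2).mean()            # compatibility residual: u' = v
    xb = torch.tensor([[0.0], [torch.pi]])
    bc = (net(xb)[:, :1] ** 2).mean()          # u(0) = u(pi) = 0
    loss = pde + compat + bc
    opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```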
    zPROBE: Zero Peek Robustness Checks for Federated Learning. (arXiv:2206.12100v2 [cs.LG] UPDATED)
    Privacy-preserving federated learning allows multiple users to jointly train a model with coordination of a central server. The server only learns the final aggregation result, thus the users' (private) training data is not leaked from the individual model updates. However, keeping the individual updates private allows malicious users to perform Byzantine attacks and degrade the accuracy without being detected. Best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., median, to find malicious updates. However, implementing privacy-preserving rank-based statistics is nontrivial and not scalable in the secure domain, as it requires sorting all individual updates. We establish the first private robustness check that uses high break point rank-based statistics on aggregated model updates. By exploiting randomized clustering, we significantly improve the scalability of our defense without compromising privacy. We leverage our statistical bounds in zero-knowledge proofs to detect and remove malicious updates without revealing the private user updates. Our novel framework, zPROBE, enables Byzantine resilient and secure federated learning. Empirical evaluations demonstrate that zPROBE provides a low overhead solution to defend against state-of-the-art Byzantine attacks while preserving privacy.
    Hybrid HMM Decoder For Convolutional Codes By Joint Trellis-Like Structure and Channel Prior. (arXiv:2210.14749v1 [cs.IT])
    The anti-interference capability of wireless links is a physical-layer problem for edge computing. Although convolutional codes have inherent error-correction potential due to the redundancy introduced in the data, their performance degrades drastically under multipath effects on the channel. In this paper, we propose the use of a Hidden Markov Model (HMM) for the reconstruction of convolutional codes, with decoding by the Viterbi algorithm. Furthermore, to implement soft-decision decoding, the observation model of the HMM is replaced by Gaussian mixture models (GMMs). Our method provides better error-correction potential than the standard method because the model parameters contain channel state information (CSI). We evaluated the performance of the method against standard Viterbi decoding by numerical simulation. In the multipath channel, the hybrid HMM decoder achieves performance gains of 4.7 dB and 2 dB when using hard-decision and soft-decision decoding, respectively. The HMM decoder also achieves significant performance gains for the RSC code, suggesting that the method could be extended to turbo codes.
    Can Transformer Attention Spread Give Insights Into Uncertainty of Detected and Tracked Objects?. (arXiv:2210.14391v1 [cs.CV])
    Transformers have recently been utilized to perform object detection and tracking in the context of autonomous driving. One unique characteristic of these models is that attention weights are computed in each forward pass, giving insights into the model's interior, in particular, which part of the input data it deemed interesting for the given task. Such an attention matrix with the input grid is available for each detected (or tracked) object in every transformer decoder layer. In this work, we investigate the distribution of these attention weights: How do they change through the decoder layers and through the lifetime of a track? Can they be used to infer additional information about an object, such as a detection uncertainty? Especially in unstructured environments, or environments that were not common during training, a reliable measure of detection uncertainty is crucial to decide whether the system can still be trusted or not.
    Modeling the Graphotactics of Low-Resource Languages Using Sequential GANs. (arXiv:2210.14409v1 [cs.CL])
    Generative Adversarial Networks (GANs) have been shown to aid in the creation of artificial data in situations where large amounts of real data are difficult to come by. This issue is especially salient in the computational linguistics space, where researchers are often tasked with modeling the complex morphological and grammatical processes of low-resource languages. This paper discusses the implementation and testing of a GAN that attempts to model and reproduce the graphotactics of a language using only 100 example strings. These artificial, yet graphotactically compliant, strings are meant to aid in modeling the morphological inflection of low-resource languages.
    Is Out-of-Distribution Detection Learnable?. (arXiv:2210.14707v1 [cs.LG])
    Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper we investigate the probably approximately correct (PAC) learning theory of OOD detection, which has been posed as an open problem by researchers. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some of their conditions may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.
    Visual Semantic Parsing: From Images to Abstract Meaning Representation. (arXiv:2210.14862v1 [cs.CV])
    The success of scene graphs for visual scene understanding has brought attention to the benefits of abstracting a visual input (e.g., image) into a structured representation, where entities (people and objects) are nodes connected by edges specifying their relations. Building these representations, however, requires expensive manual annotation in the form of images paired with their scene graphs or frames. These formalisms remain limited in the nature of entities and relations they can capture. In this paper, we propose to leverage a widely-used meaning representation in the field of natural language processing, the Abstract Meaning Representation (AMR), to address these shortcomings. Compared to scene graphs, which largely emphasize spatial relationships, our visual AMR graphs are more linguistically informed, with a focus on higher-level semantic concepts extrapolated from visual input. Moreover, they allow us to generate meta-AMR graphs to unify information contained in multiple image descriptions under one representation. Through extensive experimentation and analysis, we demonstrate that we can re-purpose an existing text-to-AMR parser to parse images into AMRs. Our findings point to important future research directions for improved scene understanding.
    Calibrated Predictive Distributions via Diagnostics for Conditional Coverage. (arXiv:2205.14568v2 [stat.ML] UPDATED)
    Uncertainty quantification is crucial for assessing the predictive ability of AI algorithms. A large body of work (including normalizing flows and Bayesian neural networks) has been devoted to describing the entire predictive distribution (PD) of a target variable Y given input features $\mathbf{X}$. However, off-the-shelf PDs are usually far from being conditionally calibrated; i.e., the probability of occurrence of an event given input $\mathbf{X}$ can be significantly different from the predicted probability. Most current research on predictive inference (such as conformal prediction) concerns constructing calibrated prediction sets only. It is often believed that the problem of obtaining and assessing entire conditionally calibrated PDs is too challenging. In this work, we show that recalibration, as well as diagnostics of entire PDs, are indeed attainable goals in practice. Our proposed method relies on the idea of regressing probability integral transform (PIT) scores against $\mathbf{X}$. This regression gives full diagnostics of conditional coverage across the entire feature space and can be used to recalibrate misspecified PDs. We benchmark our corrected prediction bands against oracle bands and state-of-the-art predictive inference algorithms for synthetic data, including settings with a distributional shift. Finally, we produce calibrated PDs for two applications: (i) probabilistic nowcasting based on sequences of satellite images, and (ii) estimation of galaxy distances based on imaging data (photometric redshifts).
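    The PIT-regression idea admits a compact illustration: if a predictive distribution is conditionally calibrated, then P(PIT <= gamma | X) = gamma for every gamma and X, so regressing the indicator 1{PIT <= gamma} on X exposes where calibration fails. The sketch below uses a deliberately misspecified Gaussian predictive distribution of our own choosing; the model and data are illustrative assumptions.
```python
import numpy as np
from scipy.stats import norm
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(5000, 1))
# Heteroskedastic truth: the noise scale grows with |x| ...
y = rng.normal(loc=X[:, 0] ** 2, scale=0.5 + 0.5 * np.abs(X[:, 0]))
# ... but the predictive distribution wrongly assumes a constant scale of 1.
pit = norm.cdf(y, loc=X[:, 0] ** 2, scale=1.0)

gamma = 0.1
# Regress 1{PIT <= gamma} on X; a calibrated PD gives ~gamma everywhere.
reg = GradientBoostingRegressor().fit(X, (pit <= gamma).astype(float))
probe = np.array([[-2.0], [0.0], [2.0]])
print("estimated P(PIT<=0.1 | x) at x=-2,0,2:", np.round(reg.predict(probe), 3),
      "| target if calibrated:", gamma)
```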
    Adaptive Test-Time Defense with the Manifold Hypothesis. (arXiv:2210.14404v1 [cs.LG])
    In this work, we formulate a novel framework for adversarial robustness using the manifold hypothesis. Our framework provides sufficient conditions for defending against adversarial examples. We develop a test-time defense method with our formulation and variational inference. The developed approach combines manifold learning with the Bayesian framework to provide adversarial robustness without the need for adversarial training. We show that our proposed approach can provide adversarial robustness even if attackers are aware of the existence of the test-time defense. In addition, our approach can also serve as a test-time defense mechanism for variational autoencoders.
    Comparison of neural closure models for discretised PDEs. (arXiv:2210.14675v1 [cs.LG])
    Neural closure models have recently been proposed as a method for efficiently approximating small scales in multiscale systems with neural networks. The choice of loss function and associated training procedure has a large effect on the accuracy and stability of the resulting neural closure model. In this work, we systematically compare three distinct procedures: "derivative fitting", "trajectory fitting" with discretise-then-optimise, and "trajectory fitting" with optimise-then-discretise. Derivative fitting is conceptually the simplest and computationally the most efficient approach and is found to perform reasonably well on one of the test problems (Kuramoto-Sivashinsky) but poorly on the other (Burgers). Trajectory fitting is computationally more expensive but is more robust and is therefore the preferred approach. Of the two trajectory fitting procedures, the discretise-then-optimise approach produces more accurate models than the optimise-then-discretise approach. While the optimise-then-discretise approach can still produce accurate models, care must be taken in choosing the length of the trajectories used for training, in order to train the models on long-term behaviour while still producing reasonably accurate gradients during training. Two existing theorems are interpreted in a novel way that gives insight into the long-term accuracy of a neural closure model based on how accurate it is in the short term.
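    The distinction between the two main training procedures can be written down in a few lines. In this hedged sketch, a toy 1D state and an explicit Euler step stand in for the paper's discretised PDEs and solvers: derivative fitting matches the modelled right-hand side to reference time derivatives pointwise, while discretise-then-optimise trajectory fitting backpropagates through the unrolled solver.
```python
import torch
import torch.nn as nn

closure = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
coarse_rhs = lambda u: -u                      # known resolved dynamics
rhs = lambda u: coarse_rhs(u) + closure(u)     # closed model

def derivative_fitting_loss(u_ref, dudt_ref):
    # Match the modelled RHS to reference time derivatives, pointwise.
    return ((rhs(u_ref) - dudt_ref) ** 2).mean()

def trajectory_fitting_loss(u0, u_ref, dt=0.01):
    # Discretise-then-optimise: unroll an explicit Euler solver and
    # backpropagate through every step.
    u, loss = u0, torch.zeros(())
    for n in range(u_ref.shape[0]):
        u = u + dt * rhs(u)
        loss = loss + ((u - u_ref[n]) ** 2).mean()
    return loss / u_ref.shape[0]

u0 = torch.randn(8, 1)
u_ref = torch.randn(5, 8, 1)                   # stand-in reference trajectory
print(derivative_fitting_loss(u_ref[0], -u_ref[0]).item(),
      trajectory_fitting_loss(u0, u_ref).item())
```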
    Fusing Modalities by Multiplexed Graph Neural Networks for Outcome Prediction in Tuberculosis. (arXiv:2210.14377v1 [cs.LG])
    In a complex disease such as tuberculosis, the evidence for the disease and its evolution may be present in multiple modalities such as clinical, genomic, or imaging data. Effective patient-tailored outcome prediction and therapeutic guidance will require fusing evidence from these modalities. Such multimodal fusion is difficult since the evidence for the disease may not be uniform across all modalities, not all modality features may be relevant, or not all modalities may be present for all patients. All these nuances make simple methods of early, late, or intermediate fusion of features inadequate for outcome prediction. In this paper, we present a novel fusion framework using multiplexed graphs and derive a new graph neural network for learning from such graphs. Specifically, the framework allows modalities to be represented through their targeted encodings, and models their relationship explicitly via multiplexed graphs derived from salient features in a combined latent space. We present results that show that our proposed method outperforms state-of-the-art methods of fusing modalities for multi-outcome prediction on a large Tuberculosis (TB) dataset.
    Federated Fuzzy Neural Network with Evolutionary Rule Learning. (arXiv:2210.14393v1 [cs.LG])
    Distributed fuzzy neural networks (DFNNs) have attracted increasing attention recently due to their learning abilities in handling data uncertainties in distributed scenarios. However, it is challenging for DFNNs to handle cases in which the local data are non-independent and identically distributed (non-IID). In this paper, we propose a federated fuzzy neural network (FedFNN) with evolutionary rule learning (ERL) to cope with non-IID issues as well as data uncertainties. The FedFNN maintains a global set of rules in a server and a personalized subset of these rules for each local client. ERL is inspired by the theory of biological evolution; it encourages rule variations while activating superior rules and deactivating inferior rules for local clients with non-IID data. Specifically, ERL consists of two stages in an iterative procedure: a rule cooperation stage that updates global rules by aggregating local rules based on their activation statuses and a rule evolution stage that evolves the global rules and updates the activation statuses of the local rules. This procedure improves both the generalization and personalization of the FedFNN for dealing with non-IID issues and data uncertainties. Extensive experiments conducted on a range of datasets demonstrate the superiority of the FedFNN over state-of-the-art methods.
    NAS-PRNet: Neural Architecture Search generated Phase Retrieval Net for Off-axis Quantitative Phase Imaging. (arXiv:2210.14231v1 [eess.IV])
    Single neural networks have achieved simultaneous phase retrieval with aberration compensation and phase unwrapping in off-axis Quantitative Phase Imaging (QPI). However, when designing the phase retrieval neural network architecture, the trade-off between computation latency and accuracy has been largely neglected. Here, we propose the Neural Architecture Search (NAS) generated Phase Retrieval Net (NAS-PRNet), an encoder-decoder style neural network automatically found within a large neural-architecture search space. The NAS scheme in NAS-PRNet is modified from SparseMask: the learning of skip connections between the encoder and the decoder is formulated as a differentiable NAS problem, and gradient descent is applied to efficiently search for the optimal skip connections. Using MobileNet-v2 as the encoder and a synthesized loss that incorporates phase reconstruction and network sparsity losses, NAS-PRNet achieves fast and accurate phase retrieval of biological cells. When tested on a cell dataset, NAS-PRNet achieved a Peak Signal-to-Noise Ratio (PSNR) of 36.1 dB, outperforming the widely used U-Net and the original SparseMask-generated neural network. Notably, the computation latency of NAS-PRNet is only 31 ms, 12 times lower than that of U-Net. Moreover, the connectivity scheme in NAS-PRNet, identified from one off-axis QPI system, can be fitted well to another with different fringe patterns.
    Bilingual Lexicon Induction for Low-Resource Languages using Graph Matching via Optimal Transport. (arXiv:2210.14378v1 [cs.CL])
    Bilingual lexicons form a critical component of various natural language processing applications, including unsupervised and semisupervised machine translation and crosslingual information retrieval. We improve bilingual lexicon induction performance across 40 language pairs with a graph-matching method based on optimal transport. The method is especially strong with low amounts of supervision.
    Accelerating Certified Robustness Training via Knowledge Transfer. (arXiv:2210.14283v1 [cs.LG])
    Training deep neural network classifiers that are certifiably robust against adversarial attacks is critical to ensuring the security and reliability of AI-controlled systems. Although numerous state-of-the-art certified training methods have been developed, they are computationally expensive and scale poorly with respect to both dataset and network complexity. Widespread usage of certified training is further hindered by the fact that periodic retraining is necessary to incorporate new data and network improvements. In this paper, we propose Certified Robustness Transfer (CRT), a general-purpose framework for reducing the computational overhead of any certifiably robust training method through knowledge transfer. Given a robust teacher, our framework uses a novel training loss to transfer the teacher's robustness to the student. We provide theoretical and empirical validation of CRT. Our experiments on CIFAR-10 show that CRT speeds up certified robustness training by $8 \times$ on average across three different architecture generations while achieving comparable robustness to state-of-the-art methods. We also show that CRT can scale to large-scale datasets like ImageNet.
    Combined Data and Deep Learning Model Uncertainties: An Application to the Measurement of Solid Fuel Regression Rate. (arXiv:2210.14287v1 [cs.LG])
    In complex physical process characterization, such as the measurement of the regression rate for solid hybrid rocket fuels, where both the observation data and the model used have uncertainties originating from multiple sources, combining these in a systematic way for quantities of interest (QoI) remains a challenge. In this paper, we present a forward-propagation uncertainty quantification (UQ) process to produce a probabilistic distribution for the observed regression rate $\dot{r}$. We characterize two input data uncertainty sources from the experiment (the distortion from the camera $U_c$ and the non-zero-angle fuel placement $U_\gamma$), the prediction and model-form uncertainty from the deep neural network ($U_m$), as well as the variability from the manually segmented images used for training it ($U_s$). We conducted seven case studies on combinations of these uncertainty sources with the model-form uncertainty. The main contribution of this paper is the investigation and inclusion of the experimental image data uncertainties involved, and how to include them in a workflow when the QoI is the result of multiple sequential processes.
    Certified Robustness in Federated Learning. (arXiv:2206.02535v2 [cs.LG] UPDATED)
    Federated learning has recently gained significant attention and popularity due to its effectiveness in training machine learning models on distributed data privately. However, as in the single-node supervised learning setup, models trained in federated learning suffer from vulnerability to imperceptible input transformations known as adversarial attacks, questioning their deployment in security-related applications. In this work, we study the interplay between federated training, personalization, and certified robustness. In particular, we deploy randomized smoothing, a widely-used and scalable certification method, to certify deep networks trained on a federated setup against input perturbations and transformations. We find that the simple federated averaging technique is effective in building not only more accurate, but also more certifiably-robust models, compared to training solely on local data. We further analyze the effect of personalization, a popular technique in federated training that increases the model's bias towards local data, on robustness. We show several advantages of personalization over both alternatives (that is, training only on local data and plain federated training) in building more robust models with faster training. Finally, we explore the robustness of mixtures of global and local (i.e., personalized) models, and find that the robustness of local models degrades as they diverge from the global model.
    Estimating and Controlling for Fairness via Sensitive Attribute Predictors. (arXiv:2207.12497v2 [cs.LG] UPDATED)
    The responsible use of machine learning tools in real world high-stakes decision making demands that we audit and control for potential biases against underrepresented groups. This process naturally requires access to the sensitive attribute one desires to control, such as demographics, gender, or other potentially sensitive features. Unfortunately, this information is often unavailable. In this work we demonstrate that one can still reliably estimate, and ultimately control, for fairness by using proxy sensitive attributes derived from a sensitive attribute predictor. Specifically, we first show that with just a little knowledge of the complete data distribution, one may use a sensitive attribute predictor to obtain bounds of the classifier's true fairness metric. Second, we demonstrate how one can provably control a classifier's worst-case fairness violation with respect to the true sensitive attribute by controlling for fairness with respect to the proxy sensitive attribute. Our results hold under assumptions that are significantly milder than previous works, and we illustrate these results with experiments on synthetic and real datasets.
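    A toy numerical sketch conveys the core idea: when the sensitive attribute is unobserved, a reasonably accurate attribute predictor yields a demographic-parity estimate close to the true one (and, under the paper's conditions, provable bounds on it). All numbers below, including the 90% proxy accuracy, are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
a_true = rng.integers(0, 2, n)                     # unobserved sensitive attribute
keep = rng.random(n) < 0.9                         # 90%-accurate attribute predictor
a_proxy = np.where(keep, a_true, 1 - a_true)
y_hat = (rng.random(n) < 0.3 + 0.2 * a_true).astype(int)   # biased classifier

def dp_gap(y_hat, a):
    """Demographic parity gap |P(y_hat=1 | a=1) - P(y_hat=1 | a=0)|."""
    return abs(y_hat[a == 1].mean() - y_hat[a == 0].mean())

print("true DP gap :", round(dp_gap(y_hat, a_true), 3))
print("proxy DP gap:", round(dp_gap(y_hat, a_proxy), 3))
```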
    Flexible Android Malware Detection Model based on Generative Adversarial Networks with Code Tensor. (arXiv:2210.14225v1 [cs.CR])
    Malware threats are steadily increasing, heightening the need for malware detection. However, existing malware detection methods target only known malicious samples; their ability to detect fresh malicious code and variants of malicious code is limited. In this paper, we propose a novel scheme that detects malware and its variants efficiently. Based on the idea of generative adversarial networks (GANs), we obtain the 'true' sample distribution that satisfies the characteristics of real malware and use it to deceive the discriminator, thus achieving a defense against malicious code attacks and improving malware detection. Firstly, a new Android malware APK-to-image texture feature extraction and segmentation method is proposed, called the segment self-growing texture segmentation algorithm. Secondly, tensor singular value decomposition (tSVD) based on the low-tubal rank transforms malicious features of different sizes into a fixed third-order tensor, which is fed into the neural network for training and learning. Finally, a flexible Android malware detection model based on GANs with code tensor (MTFD-GANs) is proposed. Experiments show that the proposed model generally surpasses the traditional malware detection model, with a maximum improvement of 41.6%. At the same time, the newly generated samples from the GAN generator greatly enrich sample diversity, and retraining the malware detector on them effectively improves the detection efficiency and robustness of traditional models.
    Multi-Domain Long-Tailed Learning by Augmenting Disentangled Representations. (arXiv:2210.14358v1 [cs.LG])
    There is an inescapable long-tailed class-imbalance issue in many real-world classification problems. Existing long-tailed classification methods focus on the single-domain setting, where all examples are drawn from the same distribution. However, real-world scenarios often involve multiple domains with distinct imbalanced class distributions. We study this multi-domain long-tailed learning problem and aim to produce a model that generalizes well across all classes and domains. Towards that goal, we introduce TALLY, which produces invariant predictors via balanced augmentation of hidden representations over domains and classes. Built upon a proposed selective balanced sampling strategy, TALLY achieves this by mixing the semantic representation of one example with the domain-associated nuisances of another, producing a new representation for use as data augmentation. To improve the disentanglement of semantic representations, TALLY further utilizes a domain-invariant class prototype that averages out domain-specific effects. We evaluate TALLY on four long-tailed variants of classical domain generalization benchmarks and two real-world imbalanced multi-domain datasets. The results indicate that TALLY consistently outperforms other state-of-the-art methods in both subpopulation shift and domain shift.
    Zero-Shot Learning of a Conditional Generative Adversarial Network for Data-Free Network Quantization. (arXiv:2210.14392v1 [cs.CV])
    We propose a novel method for training a conditional generative adversarial network (CGAN) without the use of training data, called zero-shot learning of a CGAN (ZS-CGAN). Zero-shot learning of a conditional generator only needs a pre-trained discriminative (classification) model and does not need any training data. In particular, the conditional generator is trained to produce labeled synthetic samples whose characteristics mimic the original training data by using the statistics stored in the batch normalization layers of the pre-trained model. We show the usefulness of ZS-CGAN in data-free quantization of deep neural networks. We achieved the state-of-the-art data-free network quantization of the ResNet and MobileNet classification models trained on the ImageNet dataset. Data-free quantization using ZS-CGAN showed a minimal loss in accuracy compared to that obtained by conventional data-dependent quantization.
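    The batch-normalization statistics mentioned above give a concrete, data-free training signal, familiar from data-free distillation work: synthetic inputs are optimized so that their feature statistics at each BN layer match the running statistics stored in the pre-trained classifier. Below is a hedged sketch of that signal using forward hooks; the randomly initialized resnet18 and the raw input tensor stand in for a pre-trained model and a conditional generator.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Random weights here; a real run would load the pre-trained classifier.
model = resnet18(weights=None).eval()
for p in model.parameters():
    p.requires_grad = False

bn_losses = []
def bn_hook(module, inputs, output):
    x = inputs[0]                              # pre-normalization activations
    mu = x.mean(dim=(0, 2, 3))
    var = x.var(dim=(0, 2, 3), unbiased=False)
    bn_losses.append(((mu - module.running_mean) ** 2).sum()
                     + ((var - module.running_var) ** 2).sum())

for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.register_forward_hook(bn_hook)

# Stand-in for generator output; in ZS-CGAN this would be G(z, class label).
fake = torch.randn(8, 3, 224, 224, requires_grad=True)
bn_losses.clear()
_ = model(fake)
loss = torch.stack(bn_losses).sum()
loss.backward()                                # gradient flows back to the generator
print(float(loss))
```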
    SCP-GAN: Self-Correcting Discriminator Optimization for Training Consistency Preserving Metric GAN on Speech Enhancement Tasks. (arXiv:2210.14474v1 [cs.SD])
    In recent years, Generative Adversarial Networks (GANs) have produced significantly improved results in speech enhancement (SE) tasks. They are difficult to train, however. In this work, we introduce several improvements to the GAN training schemes, which can be applied to most GAN-based SE models. We propose using consistency loss functions, which target the inconsistency in time and time-frequency domains caused by Fourier and Inverse Fourier Transforms. We also present self-correcting optimization for training a GAN discriminator on SE tasks, which helps avoid "harmful" training directions for parts of the discriminator loss function. We have tested our proposed methods on several state-of-the-art GAN-based SE models and obtained consistent improvements, including new state-of-the-art results for the Voice Bank+DEMAND dataset.
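    The time/time-frequency inconsistency targeted by these loss functions stems from the fact that an arbitrary complex matrix is generally not a valid STFT. A common way to express a consistency penalty, sketched below under our own assumptions about its exact form, is to compare a spectrogram against the STFT of its own inverse STFT.
```python
import torch

n_fft, hop = 512, 128
win = torch.hann_window(n_fft)

def consistency_loss(spec):
    """spec: complex spectrogram [n_fft//2+1, frames], e.g. an enhanced output."""
    wav = torch.istft(spec, n_fft, hop_length=hop, window=win)
    spec2 = torch.stft(wav, n_fft, hop_length=hop, window=win, return_complex=True)
    T = min(spec.shape[-1], spec2.shape[-1])
    return (spec[..., :T] - spec2[..., :T]).abs().pow(2).mean()

# A random complex matrix is maximally "inconsistent"; a true STFT gives ~0 loss.
spec = torch.randn(n_fft // 2 + 1, 100, dtype=torch.complex64)
print(float(consistency_loss(spec)))
```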
    Auxiliary task discovery through generate-and-test. (arXiv:2210.14361v1 [cs.LG])
    In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary task usefulness based on how useful the features induced by the tasks are for the main task. Our discovery algorithm significantly outperforms random tasks, hand-designed tasks, and learning without auxiliary tasks across a suite of environments.
    Short Paper: Static and Microarchitectural ML-Based Approaches For Detecting Spectre Vulnerabilities and Attacks. (arXiv:2210.14452v1 [cs.CR])
    Spectre intrusions exploit speculative-execution design vulnerabilities in modern processors. The attacks violate the principles of isolation in programs to gain unauthorized private user information. Current state-of-the-art detection techniques utilize micro-architectural features or vulnerable speculative code to detect these threats. However, these techniques are insufficient, as Spectre attacks have proven to be more stealthy, with recently discovered variants that bypass current mitigation mechanisms. Side-channels generate distinct patterns in the processor cache, and sensitive information is leaked through source code vulnerable to Spectre attacks, in which an adversary exploits speculative features such as branch prediction to cause a data breach. Previous studies predominantly approach the detection of Spectre attacks through microarchitectural analysis, a reactive approach. Hence, in this paper, we present the first comprehensive evaluation of static and microarchitectural analysis-assisted machine learning approaches to detect Spectre-vulnerable code snippets (preventive) and Spectre attacks (reactive). We evaluate the performance trade-offs in employing classifiers for detecting Spectre vulnerabilities and attacks.
    Frank-Wolfe-based Algorithms for Approximating Tyler's M-estimator. (arXiv:2206.09370v2 [math.OC] UPDATED)
    Tyler's M-estimator is a well-known procedure for robust and heavy-tailed covariance estimation. Tyler himself suggested an iterative fixed-point algorithm for computing his estimator; however, it requires super-linear (in the size of the data) runtime per iteration, which may be prohibitive at large scale. In this work we propose, to the best of our knowledge, the first Frank-Wolfe-based algorithms for computing Tyler's estimator. One variant uses standard Frank-Wolfe steps, the second also considers away-steps (AFW), and the third is a geodesic version of AFW (GAFW). AFW provably requires, up to a log factor, only linear time per iteration, while GAFW runs in linear time (up to a log factor) in a large-$n$ (number of data points) regime. All three variants are shown to provably converge to the optimal solution at a sublinear rate, under standard assumptions, despite the fact that the underlying optimization problem is neither convex nor smooth. Under an additional fairly mild assumption, which holds with probability 1 when the (normalized) data points are i.i.d. samples from a continuous distribution supported on the entire unit sphere, AFW and GAFW are proved to converge at linear rates. Importantly, all three variants are parameter-free and use adaptive step-sizes.
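    For intuition, a plain (standard-step) Frank-Wolfe sketch is given below, using the usual optimization formulation of Tyler's estimator: minimize $F(S) = (d/n)\sum_i \log(x_i^\top S^{-1} x_i) + \log\det(S)$ over PSD matrices of trace $d$, where the linear minimization oracle returns a rank-one matrix built from the bottom eigenvector of the gradient. The data, iteration count, and step-size schedule are illustrative assumptions, and this plain variant omits the paper's away-steps and geodesic steps.
```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 5, 2000
X = rng.standard_normal((n, d)) @ np.diag([3.0, 2.0, 1.0, 1.0, 1.0])

S = np.eye(d)                                   # iterate with trace(S) = d
for t in range(300):
    Sinv = np.linalg.inv(S)
    q = np.einsum('ni,ij,nj->n', X, Sinv, X)    # x_i' S^{-1} x_i
    W = (X / q[:, None]).T @ X                  # sum_i x_i x_i' / q_i
    grad = Sinv - (d / n) * (Sinv @ W @ Sinv)   # gradient of F at S
    vals, vecs = np.linalg.eigh(grad)
    u = vecs[:, 0]                              # bottom eigenvector of the gradient
    V = d * np.outer(u, u)                      # LMO solution over {PSD, trace = d}
    gamma = 2.0 / (t + 2.0)                     # standard FW step size
    S = (1 - gamma) * S + gamma * V
print(np.round(S, 2))                           # shape matrix, ~proportional to cov
```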
    Planning with Occluded Traffic Agents using Bi-Level Variational Occlusion Models. (arXiv:2210.14584v1 [cs.LG])
    Reasoning with occluded traffic agents is a significant open challenge for planning for autonomous vehicles. Recent deep learning models have shown impressive results for predicting occluded agents based on the behaviour of nearby visible agents; however, as we show in experiments, these models are difficult to integrate into downstream planning. To this end, we propose Bi-level Variational Occlusion Models (BiVO), a two-step generative model that first predicts likely locations of occluded agents, and then generates likely trajectories for the occluded agents. In contrast to existing methods, BiVO outputs a trajectory distribution which can then be sampled from and integrated into standard downstream planning. We evaluate the method in closed-loop replay simulation using the real-world nuScenes dataset. Our results suggest that BiVO can successfully learn to predict occluded agent trajectories, and these predictions lead to better subsequent motion plans in critical scenarios.
    Streaming Submodular Maximization with Differential Privacy. (arXiv:2210.14315v1 [cs.LG])
    In this work, we study the problem of privately maximizing a submodular function in the streaming setting. Extensive work has been done on privately maximizing submodular functions in the general case when the function depends upon the private data of individuals. However, when the size of the data stream drawn from the domain of the objective function is large or arrives very fast, one must privately optimize the objective within the constraints of the streaming setting. We establish fundamental differentially private baselines for this problem and then derive better trade-offs between privacy and utility for the special case of decomposable submodular functions. A submodular function is decomposable when it can be written as a sum of submodular functions; this structure arises naturally when each summand function models the utility of an individual and the goal is to study the total utility of the whole population as in the well-known Combinatorial Public Projects Problem. Finally, we complement our theoretical analysis with experimental corroboration.
    LaundroGraph: Self-Supervised Graph Representation Learning for Anti-Money Laundering. (arXiv:2210.14360v1 [cs.LG])
    Anti-money laundering (AML) regulations mandate financial institutions to deploy AML systems based on a set of rules that, when triggered, form the basis of a suspicious alert to be assessed by human analysts. Reviewing these cases is a cumbersome and complex task that requires analysts to navigate a large network of financial interactions to validate suspicious movements. Furthermore, these systems have very high false positive rates (estimated to be over 95%). The scarcity of labels hinders the use of alternative systems based on supervised learning, reducing their applicability in real-world applications. In this work we present LaundroGraph, a novel self-supervised graph representation learning approach to encode banking customers and financial transactions into meaningful representations. These representations are used to provide insights to assist the AML reviewing process, such as identifying anomalous movements for a given customer. LaundroGraph represents the underlying network of financial interactions as a customer-transaction bipartite graph and trains a graph neural network on a fully self-supervised link prediction task. We empirically demonstrate that our approach outperforms other strong baselines on self-supervised link prediction using a real-world dataset, improving the best non-graph baseline by 12 p.p. of AUC. The goal is to increase the efficiency of the reviewing process by supplying these AI-powered insights to the analysts upon review. To the best of our knowledge, this is the first fully self-supervised system within the context of AML detection.
    Provable Safe Reinforcement Learning with Binary Feedback. (arXiv:2210.14492v1 [cs.LG])
    Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state-action pairs. We provide a novel meta-algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both theoretical and empirical frameworks.
    FedClassAvg: Local Representation Learning for Personalized Federated Learning on Heterogeneous Neural Networks. (arXiv:2210.14226v1 [cs.LG])
    Personalized federated learning aims to allow numerous clients to train personalized models while participating in collaborative training in a communication-efficient manner, without exchanging private data. However, many personalized federated learning algorithms assume that clients have the same neural network architecture, and those for heterogeneous models remain understudied. In this study, we propose a novel personalized federated learning method called federated classifier averaging (FedClassAvg). Deep neural networks for supervised learning tasks consist of feature extractor and classifier layers. FedClassAvg aggregates classifier weights as an agreement on decision boundaries in feature space, so that clients with non-independently and identically distributed (non-iid) data can learn about scarce labels. In addition, local feature representation learning is applied to stabilize the decision boundaries and improve the local feature extraction capabilities of clients. Whereas existing methods require the collection of auxiliary data or model weights to generate a counterpart, FedClassAvg only requires clients to communicate a couple of fully connected layers, which is highly communication-efficient. Moreover, FedClassAvg does not require extra optimization problems such as knowledge transfer, which incurs intensive computation overhead. We evaluated FedClassAvg through extensive experiments and demonstrated that it outperforms the current state-of-the-art algorithms on heterogeneous personalized federated learning tasks.
    OTSeq2Set: An Optimal Transport Enhanced Sequence-to-Set Model for Extreme Multi-label Text Classification. (arXiv:2210.14523v1 [cs.CL])
    Extreme multi-label text classification (XMTC) is the task of finding the most relevant subset of labels from an extremely large-scale label collection. Recently, some deep learning models have achieved state-of-the-art results on XMTC tasks. These models commonly predict scores for all labels with a fully connected layer as the last layer of the model. However, such models cannot predict a relatively complete, variable-length label subset for each document, because they select positive labels relevant to the document by a fixed threshold or take the top k labels in descending order of scores. A less popular type of deep learning model, sequence-to-sequence (Seq2Seq), focuses on predicting variable-length positive labels in sequence style. However, the labels in XMTC tasks are essentially an unordered set rather than an ordered sequence, and the default label order restrains Seq2Seq models during training. To address this limitation of Seq2Seq, we propose an autoregressive sequence-to-set model for XMTC tasks named OTSeq2Set. Our model generates predictions in a student-forcing scheme and is trained with a loss function based on bipartite matching, which enables permutation-invariance. Meanwhile, we use the optimal transport distance as a measurement to force the model to focus on the closest labels in the semantic label space. Experiments show that OTSeq2Set outperforms other competitive baselines on 4 benchmark datasets. Especially, on the Wikipedia dataset with 31k labels, it outperforms the state-of-the-art Seq2Seq method by 16.34% in micro-F1 score. The code is available at https://github.com/caojie54/OTSeq2Set.
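    The permutation-invariant loss based on bipartite matching, described above, can be sketched in a few lines with the Hungarian algorithm: predicted slots are matched to the unordered gold labels before the negative log-likelihood is computed, so no canonical label order is imposed. Sizes and labels below are illustrative, and the optimal-transport term is omitted.
```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n_slots, n_labels = 4, 100
pred_logits = rng.standard_normal((n_slots, n_labels))  # one label dist per slot
targets = np.array([7, 42, 3])                          # unordered gold label set

log_probs = pred_logits - np.log(np.exp(pred_logits).sum(-1, keepdims=True))
cost = -log_probs[:, targets]                # cost[i, j] = -log p_i(target_j)
rows, cols = linear_sum_assignment(cost)     # Hungarian slot-to-label matching
loss = cost[rows, cols].mean()               # permutation-invariant NLL
print("slots", rows, "-> targets", targets[cols], "| loss:", round(float(loss), 3))
```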
    Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. (arXiv:2210.14306v1 [cs.SE])
    AI code-recommendation systems (CodeRec), such as Copilot, can assist programmers inside an IDE by suggesting and autocompleting arbitrary code, potentially improving their productivity. To understand how these systems help programmers in a coding session, we need to understand how they affect programmers' behavior. To make progress, we studied GitHub Copilot and developed CUPS, a taxonomy of 12 programmer activities common to AI code-completion systems. We then conducted a study with 21 programmers who completed coding tasks and used our labeling tool to retrospectively label their sessions with CUPS. We analyze over 3000 label instances and visualize the results with timelines and state machines to profile programmer-CodeRec interaction. This reveals novel insights into the distribution and patterns of programmer behavior, as well as inefficiencies and time costs. Finally, we use these insights to inform future interventions to improve AI-assisted programming and human-AI interaction.
    Adaptive Experimental Design and Counterfactual Inference. (arXiv:2210.14369v1 [cs.LG])
    Adaptive experimental design methods are increasingly being used in industry as a tool to boost testing throughput or reduce experimentation cost relative to traditional A/B/N testing methods. This paper shares lessons learned regarding the challenges and pitfalls of naively using adaptive experimentation systems in industrial settings where non-stationarity is prevalent, while also providing perspectives on the proper objectives and system specifications in these settings. We developed an adaptive experimental design framework for counterfactual inference based on these experiences, and tested it in a commercial environment.
    SVD Perspectives for Augmenting DeepONet Flexibility and Interpretability. (arXiv:2204.12670v2 [cs.LG] UPDATED)
    Deep operator networks (DeepONets) are powerful architectures for fast and accurate emulation of complex dynamics. As their remarkable generalization capabilities are primarily enabled by their projection-based attribute, we investigate connections with low-rank techniques derived from the singular value decomposition (SVD). We demonstrate that some of the concepts behind proper orthogonal decomposition (POD)-neural networks can improve DeepONet's design and training phases. These ideas lead us to a methodology extension that we name SVD-DeepONet. Moreover, through multiple SVD analyses, we find that DeepONet inherits from its projection-based attribute strong inefficiencies in representing dynamics characterized by symmetries. Inspired by the work on shifted-POD, we develop flexDeepONet, an architecture enhancement that relies on a pre-transformation network for generating a moving reference frame and isolating the rigid components of the dynamics. In this way, the physics can be represented on a latent space free from rotations, translations, and stretches, and an accurate projection can be performed to a low-dimensional basis. In addition to flexibility and interpretability, the proposed perspectives increase DeepONet's generalization capabilities and computational efficiencies. For instance, we show flexDeepONet can accurately surrogate the dynamics of 19 variables in a combustion chemistry application with 95% fewer trainable parameters than the vanilla architecture. We argue that DeepONet and SVD-based methods can reciprocally benefit from each other. In particular, the flexibility of the former in leveraging multiple data sources and multifidelity knowledge in the form of both unstructured data and physics-informed constraints has the potential to greatly extend the applicability of methodologies such as POD and PCA.
    Exploring the Whole Rashomon Set of Sparse Decision Trees. (arXiv:2209.08040v2 [cs.LG] UPDATED)
    In any given machine learning problem, there may be many models that explain the data almost equally well. However, most learning algorithms return only one of these models, leaving practitioners with no practical way to explore alternative models that might have desirable properties beyond what can be expressed within a loss function. The Rashomon set is the set of all such almost-optimal models. Rashomon sets can be extremely complicated, particularly for highly nonlinear function classes that allow complex interaction terms, such as decision trees. We provide the first technique for completely enumerating the Rashomon set for sparse decision trees; in fact, our work provides the first complete enumeration of any Rashomon set for a non-trivial problem with a highly nonlinear discrete function class. This allows the user an unprecedented level of control over model choice among all models that are approximately equally good. We represent the Rashomon set in a specialized data structure that supports efficient querying and sampling. We show three applications of the Rashomon set: 1) it can be used to study variable importance for the set of almost-optimal trees (as opposed to a single tree), 2) the Rashomon set for accuracy enables enumeration of the Rashomon sets for balanced accuracy and F1-score, and 3) the Rashomon set for a full dataset can be used to produce Rashomon sets constructed with only subsets of the data set. Thus, we are able to examine Rashomon sets across problems with a new lens, enabling users to choose models rather than be at the mercy of an algorithm that produces only a single model.
    Synthetic Tumors Make AI Segment Tumors Better. (arXiv:2210.14845v1 [eess.IV])
    We develop a novel strategy to generate synthetic tumors. Unlike existing works, the tumors generated by our strategy have two intriguing advantages: (1) realistic in shape and texture, which even medical professionals can confuse with real tumors; (2) effective for AI model training, which can perform liver tumor segmentation similarly to a model trained on real tumors - this result is unprecedented because no existing work, using synthetic tumors only, has thus far reached a similar or even close performance to the model trained on real tumors. This result also implies that manual efforts for developing per-voxel annotation of tumors (which took years to create) can be considerably reduced for training AI models in the future. Moreover, our synthetic tumors have the potential to improve the success rate of small tumor detection by automatically generating enormous examples of small (or tiny) synthetic tumors.
    A PAC-Bayes bound for deterministic classifiers. (arXiv:2209.02525v2 [stat.ML] UPDATED)
    We establish a disintegrated PAC-Bayesian bound for classifiers that are trained via continuous-time (non-stochastic) gradient descent. Contrary to what is standard in the PAC-Bayesian setting, our result applies to a training algorithm that is deterministic, conditioned on a random initialisation, without requiring any de-randomisation step. We provide a broad discussion of the main features of our proposed bound and study analytically and empirically its behaviour on linear models, finding promising results.
    DiffusionDB: A Large-scale Prompt Gallery Dataset for Text-to-Image Generative Models. (arXiv:2210.14896v1 [cs.CV])
    With recent advancements in diffusion models, users can generate high-quality images by writing text prompts in natural language. However, generating images with desired details requires proper prompts, and it is often unclear how a model reacts to different prompts and what the best prompts are. To help researchers tackle these critical challenges, we introduce DiffusionDB, the first large-scale text-to-image prompt dataset. DiffusionDB contains 2 million images generated by Stable Diffusion using prompts and hyperparameters specified by real users. We analyze prompts in the dataset and discuss key properties of these prompts. The unprecedented scale and diversity of this human-actuated dataset provide exciting research opportunities in understanding the interplay between prompts and generative models, detecting deepfakes, and designing human-AI interaction tools to help users more easily use these models. DiffusionDB is publicly available at: https://poloclub.github.io/diffusiondb.
    Sharp threshold for alignment of graph databases with Gaussian weights. (arXiv:2010.16295v3 [stat.ML] UPDATED)
    We study the fundamental limits for reconstruction in weighted graph (or matrix) database alignment. We consider a model of two graphs where $\pi^*$ is a planted uniform permutation and all pairs of edge weights $(A_{i,j}, B_{\pi^*(i),\pi^*(j)})_{1 \leq i < j \leq n}$ are i.i.d. pairs of correlated Gaussian variables with zero mean, unit variance and correlation parameter $\rho \in [0,1]$. We prove that there is a sharp threshold for exact recovery of $\pi^*$: if $n \rho^2 \geq (4+\epsilon) \log n$ for some $\epsilon>0$, there is an estimator $\hat{\pi}$ -- namely the MAP estimator -- based on the observation of databases $A,B$ that achieves exact reconstruction with high probability. Conversely, if $n \rho^2 \leq 4 \log n - \log \log n - \omega(1)$, then any estimator $\hat{\pi}$ verifies $\hat{\pi}=\pi^*$ with probability $o(1)$. This result shows that the information-theoretic threshold for exact recovery is the same as the one obtained for detection in a recent work by Wu et al. (2020): in other words, for Gaussian weighted graph alignment, the problem of reconstruction is not more difficult than that of detection. Though the reconstruction task was already well understood for vector-shaped database alignment (that is, taking signals of the form $(u_i, v_{\pi^*(i)})_{1 \leq i\leq n}$ where $(u_i, v_{\pi^*(i)})$ are i.i.d. pairs in $\mathbb{R}^{d_u} \times \mathbb{R}^{d_v}$), its formulation for graph (or matrix) databases brings a drastically different problem for which the hard phase is conjectured to be wide. The proofs build upon the analysis of the MAP estimator and the second moment method, together with the study of the correlation structure of energies of permutations.
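    A minimal simulation of this correlated Gaussian model, with brute-force MAP alignment (maximizing $\sum_{i,j} A_{i,j} B_{\pi(i),\pi(j)}$) for tiny $n$; such small $n$ is far from the asymptotic regime of the theorem, and the brute force is only to make the MAP estimator concrete.

    ```python
    # Simulate the planted-permutation model and run brute-force MAP alignment.
    import itertools
    import numpy as np

    def sample_pair(n, rho, rng):
        A = rng.normal(size=(n, n)); A = np.triu(A, 1); A = A + A.T
        Z = rng.normal(size=(n, n)); Z = np.triu(Z, 1); Z = Z + Z.T
        pi = rng.permutation(n)
        B = np.empty_like(A)
        B[np.ix_(pi, pi)] = rho * A + np.sqrt(1 - rho**2) * Z  # B_{pi(i)pi(j)}
        return A, B, pi

    def map_align(A, B):
        n, best, best_pi = A.shape[0], -np.inf, None
        for perm in itertools.permutations(range(n)):  # feasible only for tiny n
            p = np.array(perm)
            score = np.sum(A * B[np.ix_(p, p)])
            if score > best:
                best, best_pi = score, p
        return best_pi

    rng = np.random.default_rng(0)
    A, B, pi = sample_pair(7, rho=0.95, rng=rng)
    print("overlap with planted permutation:", np.mean(map_align(A, B) == pi))
    ```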
    Deep Anomaly Detection and Search via Reinforcement Learning. (arXiv:2208.14834v2 [cs.LG] UPDATED)
    Semi-supervised Anomaly Detection (AD) is a data mining task that aims to learn features from partially-labeled datasets to help detect outliers. In this paper, we classify existing semi-supervised AD methods into two categories: unsupervised-based and supervised-based, and point out that most of them suffer from insufficient exploitation of labeled data and under-exploration of unlabeled data. To tackle these problems, we propose Deep Anomaly Detection and Search (DADS), which applies Reinforcement Learning (RL) to balance exploitation and exploration. During training, the agent searches for possible anomalies in hierarchically-structured datasets and uses the searched anomalies to enhance performance, which in essence draws on the idea of ensemble learning. Experimentally, we compare DADS with several state-of-the-art methods in the setting of leveraging labeled known anomalies to detect both other known anomalies and unknown anomalies. Results show that DADS can efficiently and precisely search anomalies from unlabeled data and learn from them, thus achieving good performance.
    A Note On k-Means Probabilistic Poverty. (arXiv:1910.00413v2 [cs.LG] UPDATED)
    It is proven, by example, that the version of $k$-means with random initialization does not have the property of probabilistic $k$-richness.
    Local Metric Learning for Off-Policy Evaluation in Contextual Bandits with Continuous Actions. (arXiv:2210.13373v2 [cs.LG] UPDATED)
    We consider local kernel metric learning for off-policy evaluation (OPE) of deterministic policies in contextual bandits with continuous action spaces. Our work is motivated by practical scenarios where the target policy needs to be deterministic due to domain requirements, such as prescription of treatment dosage and duration in medicine. Although importance sampling (IS) provides a basic principle for OPE, it is ill-posed for the deterministic target policy with continuous actions. Our main idea is to relax the target policy and pose the problem as kernel-based estimation, where we learn the kernel metric in order to minimize the overall mean squared error (MSE). We present an analytic solution for the optimal metric, based on the analysis of bias and variance. Whereas prior work has been limited to scalar action spaces or kernel bandwidth selection, our work takes a step further being capable of vector action spaces and metric optimization. We show that our estimator is consistent, and significantly reduces the MSE compared to baseline OPE methods through experiments on various domains.
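    A minimal sketch of the kernel-relaxed importance-sampling estimator this line of work builds on: the deterministic target policy's Dirac delta is replaced by a Gaussian kernel with a hand-picked bandwidth (a scalar special case of the metric the paper learns); the toy logging policy and reward are assumptions, and the estimate carries an $O(h^2)$ smoothing bias.

    ```python
    # Kernel-relaxed importance sampling for a deterministic policy pi(x).
    import numpy as np

    def kernel_ope(contexts, actions, rewards, propensities, pi, h=0.2):
        # Average of K_h(pi(x_i) - a_i) * r_i / mu(a_i | x_i) over logged data.
        diffs = pi(contexts) - actions
        kernel = np.exp(-0.5 * (diffs / h) ** 2) / (h * np.sqrt(2 * np.pi))
        return np.mean(kernel * rewards / propensities)

    rng = np.random.default_rng(1)
    x = rng.uniform(-1, 1, size=5000)
    a = rng.normal(size=5000)                       # behavior policy: N(0, 1)
    mu = np.exp(-0.5 * a**2) / np.sqrt(2 * np.pi)   # its density at each action
    r = -(a - x) ** 2                               # reward peaks at a = x
    print(kernel_ope(x, a, r, mu, pi=lambda c: c))  # true value is 0, up to O(h^2)
    ```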
    FedX: Federated Learning for Compositional Pairwise Risk Optimization. (arXiv:2210.14396v1 [cs.LG])
    In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\mathbb E_{\mathbf z\sim \mathcal S_1} f(\mathbb E_{\mathbf z'\sim\mathcal S_2} \ell(\mathbf w, \mathbf z, \mathbf z'))$, where two sets of data $\mathcal S_1, \mathcal S_2$ are distributed over multiple machines, $\ell(\cdot; \cdot,\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(\mathbf z, \mathbf z')$, and $f(\cdot)$ is possibly a non-linear non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges for designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and nonlinear $f$, respectively. To address the challenges, we decouple the gradient's components into two types, namely active parts and lazy parts, where the active parts depend on local data that are computed with the local model and the lazy parts depend on other machines that are communicated/computed based on historical models and samples. We develop a novel theoretical analysis to combat the latency of the lazy parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using the historical samples and models for computing the lazy parts does not degrade the complexities. We conduct empirical studies of FedX for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.
    Parameter-free Regret in High Probability with Heavy Tails. (arXiv:2210.14355v1 [stat.ML])
    We present new algorithms for online convex optimization over unbounded domains that obtain parameter-free regret in high probability given access only to potentially heavy-tailed subgradient estimates. Previous work in unbounded domains considers only in-expectation results for sub-exponential subgradients. Unlike in the bounded domain case, we cannot rely on straightforward martingale concentration due to exponentially large iterates produced by the algorithm. We develop new regularization techniques to overcome these problems. Overall, with probability at least $1-\delta$, for all comparators $\mathbf{u}$ our algorithm achieves regret $\tilde{O}(\| \mathbf{u} \| T^{1/\mathfrak{p}} \log (1/\delta))$ for subgradients with bounded $\mathfrak{p}^{th}$ moments for some $\mathfrak{p} \in (1, 2]$.
    Towards a machine learning pipeline in reduced order modelling for inverse problems: neural networks for boundary parametrization, dimensionality reduction and solution manifold approximation. (arXiv:2210.14764v1 [math.NA])
    In this work, we propose a model order reduction framework to deal with inverse problems in a non-intrusive setting. Inverse problems, especially in a partial differential equation context, require a huge computational load due to the iterative optimization process. To accelerate such a procedure, we apply a numerical pipeline that involves artificial neural networks to parametrize the boundary conditions of the problem at hand, compress the dimensionality of the (full-order) snapshots, and approximate the parametric solution manifold. This yields a general framework capable of providing an ad hoc parametrization of the inlet boundary and of converging quickly to the optimal solution thanks to model order reduction. In this contribution we present the results obtained by applying such methods to two different CFD test cases.
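    A minimal sketch of the snapshot-compression stage of such a pipeline via POD (truncated SVD); the parametric snapshot family below is an assumption for illustration, and in the paper's pipeline a neural network would then map boundary parameters to the reduced coefficients.

    ```python
    # POD compression of full-order snapshots via truncated SVD.
    import numpy as np

    def pod_basis(snapshots, rank):
        # snapshots: (n_dof, n_snapshots) matrix of full-order solutions.
        U, s, _ = np.linalg.svd(snapshots, full_matrices=False)
        return U[:, :rank], s

    x = np.linspace(0, 1, 2000)
    mus = np.linspace(1.0, 5.0, 50)                 # assumed parameter sweep
    snapshots = np.array([np.sin(m * np.pi * x) * np.exp(-m * x) for m in mus]).T

    basis, s = pod_basis(snapshots, rank=10)
    coeffs = basis.T @ snapshots                    # reduced coefficients (10 x 50)
    reconstruction = basis @ coeffs                 # back to full order
    print(np.linalg.norm(snapshots - reconstruction) / np.linalg.norm(snapshots))
    ```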
    Meta-node: A Concise Approach to Effectively Learn Complex Relationships in Heterogeneous Graphs. (arXiv:2210.14480v1 [cs.LG])
    Existing message passing neural networks for heterogeneous graphs rely on the concepts of meta-paths or meta-graphs due to the intrinsic nature of heterogeneous graphs. However, the meta-paths and meta-graphs need to be pre-configured before learning and are highly dependent on expert knowledge to construct them. To tackle this challenge, we propose a novel concept of meta-node for message passing that can learn enriched relational knowledge from complex heterogeneous graphs without any meta-paths and meta-graphs by explicitly modeling the relations among the same type of nodes. Unlike meta-paths and meta-graphs, meta-nodes do not require any pre-processing steps that demand expert knowledge. Going one step further, we propose a meta-node message passing scheme and apply our method to a contrastive learning model. In the experiments on node clustering and classification tasks, the proposed meta-node message passing method outperforms state-of-the-art methods that depend on meta-paths. Our results demonstrate that effective heterogeneous graph learning is possible without the need for meta-paths that are frequently used in this field.
    An Attention-based Long Short-Term Memory Framework for Detection of Bitcoin Scams. (arXiv:2210.14408v1 [cs.CR])
    Bitcoin is the most common cryptocurrency involved in cyber scams. Cybercriminals often exploit the pseudonymity and privacy-protection mechanisms associated with Bitcoin transactions to make their scams virtually untraceable. Among Bitcoin-related fraudulent activities, the Ponzi scheme has attracted particularly significant attention. This paper considers a multi-class classification problem to determine whether a transaction is involved in Ponzi schemes or other cyber scams, or is a non-scam transaction. We design a dedicated crawler to collect data and propose a novel Attention-based Long Short-Term Memory (A-LSTM) method for the classification problem. The experimental results show that the proposed model has better efficiency and accuracy than existing approaches, including Random Forest, Extra Trees, Gradient Boosting, and classical LSTM. With correctly identified scam features, our proposed A-LSTM achieves an F1-score over 82% on the original data and outperforms the existing approaches.
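    A minimal PyTorch sketch of an attention-pooled LSTM classifier in the spirit of A-LSTM; the layer sizes and the three-class head (Ponzi / other scam / non-scam) are assumptions, not the authors' exact architecture.

    ```python
    # Attention-pooled LSTM classifier: attention weights over time steps
    # produce a context vector that feeds a small classification head.
    import torch
    import torch.nn as nn

    class ALSTM(nn.Module):
        def __init__(self, n_features, hidden=64, n_classes=3):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.attn = nn.Linear(hidden, 1)
            self.head = nn.Linear(hidden, n_classes)

        def forward(self, x):                       # x: (batch, time, features)
            h, _ = self.lstm(x)                     # (batch, time, hidden)
            w = torch.softmax(self.attn(h), dim=1)  # attention over time steps
            ctx = (w * h).sum(dim=1)                # weighted sum -> (batch, hidden)
            return self.head(ctx)

    model = ALSTM(n_features=8)
    logits = model(torch.randn(4, 20, 8))
    print(logits.shape)  # torch.Size([4, 3])
    ```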
    Improving Adversarial Robustness via Joint Classification and Multiple Explicit Detection Classes. (arXiv:2210.14410v1 [cs.CV])
    This work concerns the development of deep networks that are certifiably robust to adversarial attacks. Joint robust classification-detection was recently introduced as a certified defense mechanism, where adversarial examples are either correctly classified or assigned to the "abstain" class. In this work, we show that such a provable framework can benefit from an extension to networks with multiple explicit abstain classes, to which adversarial examples are adaptively assigned. We show that naively adding multiple abstain classes can lead to "model degeneracy"; we then propose a regularization approach and a training method to counter this degeneracy by promoting full use of the multiple abstain classes. Our experiments demonstrate that the proposed approach consistently achieves favorable standard vs. robust verified accuracy tradeoffs, outperforming state-of-the-art algorithms for various choices of the number of abstain classes.
    Reachable Polyhedral Marching (RPM): An Exact Analysis Tool for Deep-Learned Control Systems. (arXiv:2210.08339v2 [cs.LG] UPDATED)
    We present a tool for computing exact forward and backward reachable sets of deep neural networks with rectified linear unit (ReLU) activation. We then develop algorithms using this tool to compute invariant sets and regions of attraction (ROAs) for control systems with neural networks in the feedback loop. Our algorithm is unique in that it builds the reachable sets by incrementally enumerating polyhedral regions in the input space, rather than iterating layer-by-layer through the network as in other methods. When performing safety verification, if an unsafe region is found, our algorithm can return this result without completing the full reachability computation, thus giving an anytime property that accelerates safety verification. Furthermore, we introduce a method to accelerate the computation of ROAs in the case that deep learned components are homeomorphisms, which we find is surprisingly common in practice. We demonstrate our tool in several test cases. We compute a ROA for a learned van der Pol oscillator model. We find a control invariant set for a learned torque-controlled pendulum model. We also verify specific safety properties for multiple deep networks related to the ACAS Xu aircraft collision advisory system. Finally, we apply our algorithm to find ROAs for an image-based aircraft runway taxi problem. Algorithm source code: https://github.com/StanfordMSL/Neural-Network-Reach .
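    A minimal sketch of the polyhedral-region view underlying RPM: for a ReLU network, the activation pattern at a given input fixes both an affine input-output map and the polyhedron of inputs sharing that pattern (one hidden layer shown here; the tool itself enumerates such regions exhaustively).

    ```python
    # The local linear region of a one-hidden-layer ReLU network at a point x.
    import numpy as np

    def local_region(W1, b1, W2, b2, x):
        pre = W1 @ x + b1
        pattern = (pre > 0).astype(float)             # which ReLUs fire at x
        D = np.diag(pattern)
        A_eff = W2 @ D @ W1                           # affine map on this region
        b_eff = W2 @ D @ b1 + b2
        # Region constraints: sign(W1 z + b1) must match the pattern, i.e.
        # (2 * pattern - 1) * (W1 z + b1) >= 0 componentwise.
        S = np.diag(2 * pattern - 1)
        return A_eff, b_eff, S @ W1, -S @ b1          # region: C z >= d

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 2)), rng.normal(size=16)
    W2, b2 = rng.normal(size=(1, 16)), rng.normal(size=1)
    A_eff, b_eff, C, d = local_region(W1, b1, W2, b2, np.array([0.3, -0.7]))
    print(A_eff, b_eff)
    ```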
    Personalized incentives as feedback design in generalized Nash equilibrium problems. (arXiv:2203.12948v2 [math.OC] UPDATED)
    We investigate both stationary and time-varying, nonmonotone generalized Nash equilibrium problems that exhibit symmetric interactions among the agents, which are known to be potential. As may happen in practical cases, however, we envision a scenario in which the formal expression of the underlying potential function is not available, and we design a semi-decentralized Nash equilibrium seeking algorithm. In the proposed two-layer scheme, a coordinator iteratively integrates the (possibly noisy and sporadic) agents' feedback to learn the pseudo-gradients of the agents, and then designs personalized incentives for them. On their side, the agents receive those personalized incentives, compute a solution to an extended game, and then return feedback measurements to the coordinator. In the stationary setting, our algorithm returns a Nash equilibrium in case the coordinator is endowed with standard learning policies, while it returns a Nash equilibrium up to a constant, yet adjustable, error in the time-varying case. As a motivating application, we consider the ridehailing service provided by several companies with mobility-as-a-service orchestration, necessary to both handle competition among firms and avoid traffic congestion, which is also adopted to run numerical experiments verifying our results.
    A Bibliometric Analysis and Review on Reinforcement Learning for Transportation Applications. (arXiv:2210.14524v1 [cs.LG])
    Transportation is the backbone of the economy and urban development. Improving the efficiency, sustainability, resilience, and intelligence of transportation systems is critical and also challenging. The constantly changing traffic conditions, the uncertain influence of external factors (e.g., weather, accidents), and the interactions among multiple travel modes and multi-type flows result in the dynamic and stochastic nature of transportation systems. The planning, operation, and control of transportation systems require flexible and adaptable strategies in order to deal with uncertainty, non-linearity, variability, and high complexity. In this context, Reinforcement Learning (RL), which enables autonomous decision-makers to interact with the complex environment, learn from the experiences, and select optimal actions, has been rapidly emerging as one of the most useful approaches for smart transportation. This paper conducts a bibliometric analysis to identify the development of RL-based methods for transportation applications, typical journals/conferences, and leading topics in the field of intelligent transportation over the past ten years. Then, this paper presents a comprehensive literature review on applications of RL in transportation by categorizing different methods with respect to the specific application domains. The potential future research directions of RL applications and developments are also discussed.
    Inducer-tuning: Connecting Prefix-tuning and Adapter-tuning. (arXiv:2210.14469v1 [cs.CL])
    Prefix-tuning, or more generally continuous prompt tuning, has become an essential paradigm of parameter-efficient transfer learning. Using a large pre-trained language model (PLM), prefix-tuning can obtain strong performance by training only a small portion of parameters. In this paper, we propose to understand and further develop prefix-tuning through the kernel lens. Specifically, we make an analogy between \textit{prefixes} and \textit{inducing variables} in kernel methods and hypothesize that \textit{prefixes} serving as \textit{inducing variables} would improve their overall mechanism. From the kernel estimator perspective, we suggest a new variant of prefix-tuning -- \textit{inducer-tuning} -- which shares the exact mechanism of prefix-tuning while leveraging the residual form found in adapter-tuning. This mitigates the initialization issue in prefix-tuning. Through comprehensive empirical experiments on natural language understanding and generation tasks, we demonstrate that inducer-tuning can close the performance gap between prefix-tuning and fine-tuning.
    Interpolating Discriminant Functions in High-Dimensional Gaussian Latent Mixtures. (arXiv:2210.14347v1 [stat.ML])
    This paper considers binary classification of high-dimensional features under a postulated model with a low-dimensional latent Gaussian mixture structure and non-vanishing noise. A generalized least squares estimator is used to estimate the direction of the optimal separating hyperplane. The estimated hyperplane is shown to interpolate on the training data. While the direction vector can be consistently estimated as could be expected from recent results in linear regression, a naive plug-in estimate fails to consistently estimate the intercept. A simple correction, that requires an independent hold-out sample, renders the procedure minimax optimal in many scenarios. The interpolation property of the latter procedure can be retained, but surprisingly depends on the way the labels are encoded.
    Deep Subspace Encoders for Nonlinear System Identification. (arXiv:2210.14816v1 [eess.SY])
    Using Artificial Neural Networks (ANN) for nonlinear system identification has proven to be a promising approach, but despite all recent research efforts, many practical and theoretical problems still remain open. Specifically, noise handling and noise models, and issues of consistency and reliable estimation under minimisation of the prediction error, are the most severe problems. The latter comes with numerous practical challenges such as the explosion of the computational cost in terms of the number of data samples and the occurrence of instabilities during optimization. In this paper, we aim to overcome these issues by proposing a method which uses a truncated prediction loss and a subspace encoder for state estimation. The truncated prediction loss is computed by selecting multiple truncated subsections from the time series and computing the average prediction loss. To obtain a computationally efficient estimation method that minimizes the truncated prediction loss, a subspace encoder represented by an artificial neural network is introduced. This encoder aims to approximate the state reconstructability map of the estimated model to provide an initial state for each truncated subsection given past inputs and outputs. By theoretical analysis, we show that, under mild conditions, the proposed method is locally consistent, increases optimization stability, and achieves increased data efficiency by allowing for overlap between the subsections. Lastly, we provide practical insights and user guidelines employing a numerical example and state-of-the-art benchmark results.
    Imputation of missing values in multi-view data. (arXiv:2210.14484v1 [stat.ML])
    When missing values occur in multi-view data, all features in a view are likely to be missing simultaneously. This leads to very large quantities of missing data which, especially when combined with high-dimensionality, makes the application of conditional imputation methods computationally infeasible. We introduce a new meta-learning imputation method based on stacked penalized logistic regression (StaPLR), which performs imputation in a dimension-reduced space. We evaluate the new imputation method with several imputation algorithms using simulations. The results show that meta-level imputation of missing values leads to good results at a much lower computational cost, and makes the use of advanced imputation algorithms such as missForest and predictive mean matching possible in settings where they would otherwise be computationally infeasible.
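    A minimal sketch of meta-level imputation in the StaPLR spirit: one logistic model per view produces cross-validated meta-features, and missingness is then handled in that low-dimensional space; the simulated data, the simulated meta-level missingness, and the unpenalized meta-learner are simplifying assumptions.

    ```python
    # Meta-level imputation: per-view predictions -> impute in reduced space.
    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    rng = np.random.default_rng(0)
    views = [rng.normal(size=(300, 50)) for _ in range(4)]      # 4 feature views
    y = (views[0][:, 0] + views[1][:, 0] > 0).astype(int)

    meta = np.column_stack([
        cross_val_predict(LogisticRegression(max_iter=1000), V, y,
                          method="predict_proba")[:, 1]
        for V in views
    ])
    meta[rng.random(meta.shape) < 0.3] = np.nan     # whole-view missingness proxy
    meta_imputed = IterativeImputer(random_state=0).fit_transform(meta)
    print(LogisticRegression().fit(meta_imputed, y).score(meta_imputed, y))
    ```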
    Scaling Laws Beyond Backpropagation. (arXiv:2210.14593v1 [cs.LG])
    Alternatives to backpropagation have long been studied to better understand how biological brains may learn. Recently, they have also garnered interest as a way to train neural networks more efficiently. By relaxing constraints inherent to backpropagation (e.g., symmetric feedforward and feedback weights, sequential updates), these methods enable promising prospects, such as local learning. However, the tradeoffs between different methods in terms of final task performance, convergence speed, and ultimately compute and data requirements are rarely outlined. In this work, we use scaling laws to study the ability of Direct Feedback Alignment~(DFA) to train causal decoder-only Transformers efficiently. Scaling laws provide an overview of the tradeoffs implied by a modeling decision, up to extrapolating how it might transfer to increasingly large models. We find that DFA fails to offer more efficient scaling than backpropagation: there is never a regime for which the degradation in loss incurred by using DFA is worth the potential reduction in compute budget. Our finding is at variance with previous beliefs in the alternative training methods community, and highlights the need for holistic empirical approaches to better understand modeling decisions.
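    A minimal numpy sketch of a Direct Feedback Alignment update for a two-layer network: the output error reaches the hidden layer through a fixed random matrix instead of the transpose of the forward weights (toy data and dimensions are assumptions; the paper studies Transformers at scale).

    ```python
    # DFA on a two-layer ReLU network with a mean-squared-error loss.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, lr = 20, 64, 2, 0.1
    W1 = rng.normal(size=(n_hid, n_in)) * 0.1
    W2 = rng.normal(size=(n_out, n_hid)) * 0.1
    B = rng.normal(size=(n_hid, n_out)) * 0.1       # fixed random feedback

    X = rng.normal(size=(128, n_in))
    Y = np.eye(n_out)[(X[:, 0] > 0).astype(int)]    # one-hot toy labels

    for _ in range(200):
        H = np.maximum(W1 @ X.T, 0)                 # hidden ReLU activations
        P = W2 @ H                                  # outputs
        E = P - Y.T                                 # output error (MSE gradient)
        dH = (B @ E) * (H > 0)                      # DFA: random projection of error
        W2 -= lr * E @ H.T / len(X)
        W1 -= lr * dH @ X / len(X)
    print("final MSE:", np.mean(E**2))
    ```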
    Federated Learning Using Variance Reduced Stochastic Gradient for Probabilistically Activated Agents. (arXiv:2210.14362v1 [cs.LG])
    This paper proposes an algorithm for Federated Learning (FL) with a two-layer structure that achieves both variance reduction and a faster convergence rate to an optimal solution in the setting where each agent has an arbitrary probability of selection in each iteration. In distributed machine learning, when privacy matters, FL is a functional tool. When FL is placed in an environment with irregular connections among agents (devices), reaching a trained model both economically and quickly can be demanding. The first layer of our algorithm corresponds to the model parameter propagation across agents done by the server. In the second layer, each agent performs its local update with a stochastic and variance-reduced technique called Stochastic Variance Reduced Gradient (SVRG). We leverage the concept of variance reduction from stochastic optimization in the agents' local update step to reduce the variance caused by stochastic gradient descent (SGD). We provide a convergence bound for our algorithm which improves the rate from $O(\frac{1}{\sqrt{K}})$ to $O(\frac{1}{K})$ by using a constant step size. We demonstrate the performance of our algorithm using numerical examples.
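    A minimal sketch of the SVRG-style local update each agent performs in the second layer: a stochastic gradient corrected by the difference between the stale per-sample gradient and the full gradient at a reference point (a least-squares objective stands in for the local loss).

    ```python
    # SVRG inner loop: variance-reduced stochastic steps around a snapshot.
    import numpy as np

    def svrg(w, X, y, lr=0.1, epochs=5, inner=50, rng=None):
        rng = rng or np.random.default_rng(0)
        grad = lambda w, Xb, yb: 2 * Xb.T @ (Xb @ w - yb) / len(yb)  # least squares
        for _ in range(epochs):
            w_ref, full_ref = w.copy(), grad(w, X, y)   # snapshot + full gradient
            for _ in range(inner):
                i = rng.integers(len(y), size=8)        # mini-batch of indices
                g = grad(w, X[i], y[i]) - grad(w_ref, X[i], y[i]) + full_ref
                w = w - lr * g                          # variance-reduced step
        return w

    rng = np.random.default_rng(1)
    X = rng.normal(size=(500, 10))
    w_true = rng.normal(size=10)
    y = X @ w_true + 0.01 * rng.normal(size=500)
    print(np.linalg.norm(svrg(np.zeros(10), X, y, rng=rng) - w_true))
    ```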
    Learning in Multi-Player Stochastic Games. (arXiv:2210.14280v1 [cs.GT])
    We consider the problem of simultaneous learning in stochastic games with many players in the finite-horizon setting. While the typical target solution for a stochastic game is a Nash equilibrium, this is intractable with many players. We instead focus on variants of {\it correlated equilibria}, such as those studied for extensive-form games. We begin with a hardness result for the adversarial MDP problem: even for a horizon of 3, obtaining sublinear regret against the best non-stationary policy is \textsf{NP}-hard when both rewards and transitions are adversarial. This implies that convergence to even the weakest natural solution concept -- normal-form coarse correlated equilibrium -- is not possible via black-box reduction to a no-regret algorithm even in stochastic games with constant horizon (unless $\textsf{NP}\subseteq\textsf{BPP}$). Instead, we turn to a different target: algorithms which {\it generate} an equilibrium when they are used by all players. Our main result is an algorithm which generates an {\it extensive-form} correlated equilibrium, whose runtime is exponential in the horizon but polynomial in all other parameters. We give a similar algorithm which is polynomial in all parameters for "fast-mixing" stochastic games. We also show a method for efficiently reaching normal-form coarse correlated equilibria in "single-controller" stochastic games which follows the traditional no-regret approach. When shared randomness is available, the two generative algorithms can be extended to give simultaneous regret bounds and converge in the traditional sense.
    Tensor-on-Tensor Regression: Riemannian Optimization, Over-parameterization, Statistical-computational Gap, and Their Interplay. (arXiv:2206.08756v2 [math.ST] UPDATED)
    We study the tensor-on-tensor regression, where the goal is to connect tensor responses to tensor covariates with a low Tucker rank parameter tensor/matrix without the prior knowledge of its intrinsic rank. We propose the Riemannian gradient descent (RGD) and Riemannian Gauss-Newton (RGN) methods and cope with the challenge of unknown rank by studying the effect of rank over-parameterization. We provide the first convergence guarantee for the general tensor-on-tensor regression by showing that RGD and RGN respectively converge linearly and quadratically to a statistically optimal estimate in both rank correctly-parameterized and over-parameterized settings. Our theory reveals an intriguing phenomenon: Riemannian optimization methods naturally adapt to over-parameterization without modifications to their implementation. We also prove the statistical-computational gap in scalar-on-tensor regression by a direct low-degree polynomial argument. Our theory demonstrates a "blessing of statistical-computational gap" phenomenon: in a wide range of scenarios in tensor-on-tensor regression for tensors of order three or higher, the computationally required sample size matches what is needed by moderate rank over-parameterization when considering computationally feasible estimators, while there are no such benefits in the matrix settings. This shows moderate rank over-parameterization is essentially "cost-free" in terms of sample size in tensor-on-tensor regression of order three or higher. Finally, we conduct simulation studies to show the advantages of our proposed methods and to corroborate our theoretical findings.
    Bayesian Learning via Neural Schr\"odinger-F\"ollmer Flows. (arXiv:2111.10510v9 [stat.ML] UPDATED)
    In this work we explore a new framework for approximate Bayesian inference in large datasets based on stochastic control (i.e. Schr\"odinger bridges). We advocate stochastic control as a finite time and low variance alternative to popular steady-state methods such as stochastic gradient Langevin dynamics (SGLD). Furthermore, we discuss and adapt the existing theoretical guarantees of this framework and establish connections to already existing VI routines in SDE-based models.  ( 2 min )
    Local Linear Convergence of Gradient Methods for Subspace Optimization via Strict Complementarity. (arXiv:2202.04020v2 [math.OC] UPDATED)
    We consider optimization problems in which the goal is to find a $k$-dimensional subspace of $\mathbb{R}^n$, $k \ll n$, which minimizes a convex and smooth loss. Such problems generalize the fundamental task of principal component analysis (PCA) to include robust and sparse counterparts, and logistic PCA for binary data, among others. This problem could be approached either via nonconvex gradient methods with highly efficient iterations, for which arguing about fast convergence to a global minimizer is difficult, or via a convex relaxation for which arguing about convergence to a global minimizer is straightforward, but whose corresponding methods are often inefficient in high dimensions. In this work we bridge these two approaches under a strict complementarity assumption, which in particular implies that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem. Our main result is a proof that a natural nonconvex gradient method which is \textit{SVD-free} and requires only a single QR-factorization of an $n\times k$ matrix per iteration, converges locally with a linear rate. We also establish linear convergence results for the nonconvex projected gradient method, and the Frank-Wolfe method when applied to the convex relaxation.
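    A minimal sketch of the SVD-free scheme for the plain PCA loss $f(X) = -\mathrm{tr}(X^\top S X)$: a gradient step followed by a single QR factorization per iteration as the retraction (the step size and the toy matrix are assumptions).

    ```python
    # Gradient step + QR retraction on the set of n x k orthonormal matrices.
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 50, 3
    M = rng.normal(size=(n, n))
    S = M @ M.T                                    # covariance-like matrix
    X, _ = np.linalg.qr(rng.normal(size=(n, k)))   # random orthonormal start

    for _ in range(300):
        G = -2 * S @ X                             # Euclidean gradient of -tr(X'SX)
        X, _ = np.linalg.qr(X - 0.01 * G)          # gradient step + QR retraction

    top = np.linalg.eigh(S)[1][:, -k:]             # true top-k eigenspace
    # Cosines of the principal angles; values near 1 mean the subspaces match.
    print(np.linalg.svd(top.T @ X, compute_uv=False))
    ```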
    Causal Information Bottleneck Boosts Adversarial Robustness of Deep Neural Network. (arXiv:2210.14229v1 [cs.LG])
    The information bottleneck (IB) method is a feasible defense solution against adversarial attacks in deep learning. However, this method suffers from spurious correlation, which limits further improvement of its adversarial robustness. In this paper, we incorporate causal inference into the IB framework to alleviate this problem. Specifically, we divide the features obtained by the IB method into robust features (content information) and non-robust features (style information) via instrumental variables to estimate the causal effects. With such a framework, the influence of non-robust features can be mitigated to strengthen adversarial robustness. We analyze the effectiveness of our proposed method. Extensive experiments on MNIST, FashionMNIST, and CIFAR-10 show that our method exhibits considerable robustness against multiple adversarial attacks. Our code will be released.
  • Open

    Hierarchical Message-Passing Graph Neural Networks. (arXiv:2009.03717v3 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) have become a prominent approach to machine learning with graphs and have been increasingly applied in a multitude of domains. Nevertheless, since most existing GNN models are based on flat message-passing mechanisms, two limitations need to be tackled: (i) they are costly in encoding long-range information spanning the graph structure; (ii) they fail to encode features in the high-order neighbourhood in the graphs as they only perform information aggregation across the observed edges in the original graph. To deal with these two issues, we propose a novel Hierarchical Message-passing Graph Neural Networks framework. The key idea is to generate a hierarchical structure that re-organises all nodes in a flat graph into multi-level super graphs, along with innovative intra- and inter-level propagation manners. The derived hierarchy creates shortcuts connecting far-away nodes so that informative long-range interactions can be efficiently accessed via message passing and incorporates meso- and macro-level semantics into the learned node representations. We present the first model to implement this framework, termed Hierarchical Community-aware Graph Neural Network (HC-GNN), with the assistance of a hierarchical community detection algorithm. The theoretical analysis illustrates HC-GNN's remarkable capacity in capturing long-range information without introducing heavy additional computation complexity. Empirical experiments conducted on 9 datasets under transductive, inductive, and few-shot settings exhibit that HC-GNN can outperform state-of-the-art GNN models in network analysis tasks, including node classification, link prediction, and community detection. Moreover, the model analysis further demonstrates HC-GNN's robustness facing graph sparsity and the flexibility in incorporating different GNN encoders.  ( 3 min )
    Robust Contextual Linear Bandits. (arXiv:2210.14483v1 [cs.LG])
    Model misspecification is a major consideration in applications of statistical methods and machine learning. However, it is often neglected in contextual bandits. This paper studies a common form of misspecification, an inter-arm heterogeneity that is not captured by context. To address this issue, we assume that the heterogeneity arises due to arm-specific random variables, which can be learned. We call this setting a robust contextual bandit. The arm-specific variables explain the unknown inter-arm heterogeneity, and we incorporate them in the robust contextual estimator of the mean reward and its uncertainty. We develop two efficient bandit algorithms for our setting: a UCB algorithm called RoLinUCB and a posterior-sampling algorithm called RoLinTS. We analyze both algorithms and bound their $n$-round Bayes regret. Our experiments show that RoLinTS is comparably statistically efficient to the classic methods when the misspecification is low, more robust when the misspecification is high, and significantly more computationally efficient than its naive implementation.  ( 2 min )
    Provable Safe Reinforcement Learning with Binary Feedback. (arXiv:2210.14492v1 [cs.LG])
    Safety is a crucial necessity in many applications of reinforcement learning (RL), whether robotic, automotive, or medical. Many existing approaches to safe RL rely on receiving numeric safety feedback, but in many cases this feedback can only take binary values; that is, whether an action in a given state is safe or unsafe. This is particularly true when feedback comes from human experts. We therefore consider the problem of provably safe RL when given access to an offline oracle providing binary feedback on the safety of state, action pairs. We provide a novel meta algorithm, SABRE, which can be applied to any MDP setting given access to a blackbox PAC RL algorithm for that setting. SABRE applies concepts from active learning to reinforcement learning to provably control the number of queries to the safety oracle. SABRE works by iteratively exploring the state space to find regions where the agent is currently uncertain about safety. Our main theoretical result shows that, under appropriate technical assumptions, SABRE never takes unsafe actions during training, and is guaranteed to return a near-optimal safe policy with high probability. We provide a discussion of how our meta-algorithm may be applied to various settings studied in both theoretical and empirical frameworks.
    A Conditional Gradient-based Method for Simple Bilevel Optimization with Convex Lower-level Problem. (arXiv:2206.08868v2 [math.OC] UPDATED)
    In this paper, we study a class of bilevel optimization problems, also known as simple bilevel optimization, where we minimize a smooth objective function over the optimal solution set of another convex constrained optimization problem. Several iterative methods have been developed for tackling this class of problems. Alas, their convergence guarantees are either asymptotic for the upper-level objective, or the convergence rates are slow and sub-optimal. To address this issue, in this paper, we introduce a novel bilevel optimization method that locally approximates the solution set of the lower-level problem via a cutting plane, and then runs a conditional gradient update to decrease the upper-level objective. When the upper-level objective is convex, we show that our method requires ${\mathcal{O}}(\max\{1/\epsilon_f,1/\epsilon_g\})$ iterations to find a solution that is $\epsilon_f$-optimal for the upper-level objective and $\epsilon_g$-optimal for the lower-level objective. Moreover, when the upper-level objective is non-convex, our method requires ${\mathcal{O}}(\max\{1/\epsilon_f^2,1/(\epsilon_f\epsilon_g)\})$ iterations to find an $(\epsilon_f,\epsilon_g)$-optimal solution. We also prove stronger convergence guarantees under the H\"olderian error bound assumption on the lower-level problem. To the best of our knowledge, our method achieves the best-known iteration complexity for the considered class of bilevel problems.
    Conformal Off-Policy Prediction in Contextual Bandits. (arXiv:2206.04405v2 [stat.ML] UPDATED)
    Most off-policy evaluation methods for contextual bandits have focused on the expected outcome of a policy, which is estimated via methods that at best provide only asymptotic guarantees. However, in many applications, the expectation may not be the best measure of performance as it does not capture the variability of the outcome. In addition, particularly in safety-critical settings, stronger guarantees than asymptotic correctness may be required. To address these limitations, we consider a novel application of conformal prediction to contextual bandits. Given data collected under a behavioral policy, we propose \emph{conformal off-policy prediction} (COPP), which can output reliable predictive intervals for the outcome under a new target policy. We provide theoretical finite-sample guarantees without making any additional assumptions beyond the standard contextual bandit setup, and empirically demonstrate the utility of COPP compared with existing methods on synthetic and real-world data.
    Federated Learning with Nesterov Accelerated Gradient. (arXiv:2009.08716v2 [cs.LG] UPDATED)
    Federated learning (FL) is a fast-developing technique that allows multiple workers to train a global model based on a distributed dataset. Conventional FL (FedAvg) employs the gradient descent algorithm, which may not be efficient enough. Momentum is able to improve the situation by adding an additional momentum step to accelerate the convergence and has demonstrated its benefits in both centralized and FL environments. It is well-known that Nesterov Accelerated Gradient (NAG) is a more advantageous form of momentum, but so far it has not been clear how to quantify the benefits of NAG in FL. This motivates us to propose FedNAG, which employs NAG in each worker as well as NAG momentum and model aggregation in the aggregator. We provide a detailed convergence analysis of FedNAG and compare it with FedAvg. Extensive experiments based on real-world datasets and trace-driven simulation are conducted, demonstrating that FedNAG increases the learning accuracy by 3-24% and decreases the total training time by 11-70% compared with the benchmarks under a wide range of settings.
    Optimizing Pessimism in Dynamic Treatment Regimes: A Bayesian Learning Approach. (arXiv:2210.14420v1 [stat.ML])
    In this article, we propose a novel pessimism-based Bayesian learning method for optimal dynamic treatment regimes in the offline setting. When the coverage condition does not hold, which is common for offline data, the existing solutions would produce sub-optimal policies. The pessimism principle addresses this issue by discouraging recommendation of actions that are less explored conditional on the state. However, nearly all pessimism-based methods rely on a key hyper-parameter that quantifies the degree of pessimism, and the performance of the methods can be highly sensitive to the choice of this parameter. We propose to integrate the pessimism principle with Thompson sampling and Bayesian machine learning for optimizing the degree of pessimism. We derive a credible set whose boundary uniformly lower bounds the optimal Q-function, and thus does not require additional tuning of the degree of pessimism. We develop a general Bayesian learning method that works with a range of models, from Bayesian linear basis models to Bayesian neural network models. We develop the computational algorithm based on variational inference, which is highly efficient and scalable. We establish the theoretical guarantees of the proposed method, and show empirically that it outperforms the existing state-of-the-art solutions through both simulations and a real data example.
    Compressed Sensing MRI Reconstruction Regularized by VAEs with Structured Image Covariance. (arXiv:2210.14586v1 [eess.IV])
    Learned regularization for MRI reconstruction can provide complex data-driven priors to inverse problems while still retaining the control and insight of a variational regularization method. Moreover, unsupervised learning, without paired training data, allows the learned regularizer to remain flexible to changes in the forward problem such as noise level, sampling pattern or coil sensitivities. One such approach uses generative models, trained on ground-truth images, as priors for inverse problems, penalizing reconstructions far from images the generator can produce. In this work, we utilize variational autoencoders (VAEs) that generate not only an image but also a covariance uncertainty matrix for each image. The covariance can model changing uncertainty dependencies caused by structure in the image, such as edges or objects, and provides a new distance metric from the manifold of learned images. We demonstrate these novel generative regularizers on radially sub-sampled MRI knee measurements from the fastMRI dataset and compare them to other unlearned, unsupervised and supervised methods. Our results show that the proposed method is competitive with other state-of-the-art methods and behaves consistently with changing sampling patterns and noise levels.
    Arc travel time and path choice model estimation subsumed. (arXiv:2210.14351v1 [stat.ME])
    We propose a method for maximum likelihood estimation of path choice model parameters and arc travel time using data of different levels of granularity. Hitherto these two tasks have been tackled separately under strong assumptions. Using a small example, we illustrate that this can lead to biased results. Results on both real (New York yellow cab) and simulated data show strong performance of our method compared to existing baselines.
    Provably Efficient Model-Free Constrained RL with Linear Function Approximation. (arXiv:2206.11889v2 [cs.LG] UPDATED)
    We study the constrained reinforcement learning problem, in which an agent aims to maximize the expected cumulative reward subject to a constraint on the expected total value of a utility function. In contrast to existing model-based approaches or model-free methods accompanied by a `simulator', we aim to develop the first model-free, simulator-free algorithm that achieves a sublinear regret and a sublinear constraint violation even in large-scale systems. To this end, we consider the episodic constrained Markov decision processes with linear function approximation, where the transition dynamics and the reward function can be represented as a linear function of some known feature mapping. We show that $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret and $\tilde{\mathcal{O}}(\sqrt{d^3H^3T})$ constraint violation bounds can be achieved, where $d$ is the dimension of the feature mapping, $H$ is the length of the episode, and $T$ is the total number of steps. Our bounds are attained without explicitly estimating the unknown transition model or requiring a simulator, and they depend on the state space only through the dimension of the feature mapping. Hence our bounds hold even when the number of states goes to infinity. Our main results are achieved via novel adaptations of the standard LSVI-UCB algorithms. In particular, we first introduce primal-dual optimization into the LSVI-UCB algorithm to balance between regret and constraint violation. More importantly, we replace the standard greedy selection with respect to the state-action function in LSVI-UCB with a soft-max policy. This turns out to be key in establishing uniform concentration for the constrained case via its approximation-smoothness trade-off. We also show that one can achieve an even zero constraint violation while still maintaining the same order with respect to $T$.
    Predicting the State of Synchronization of Financial Time Series using Cross Recurrence Plots. (arXiv:2210.14605v1 [q-fin.ST])
    Cross-correlation analysis is a powerful tool for understanding the mutual dynamics of time series. This study introduces a new method for predicting the future state of synchronization of the dynamics of two financial time series. To this end, we use the cross-recurrence plot analysis as a nonlinear method for quantifying the multidimensional coupling in the time domain of two time series and for determining their state of synchronization. We adopt a deep learning framework for methodologically addressing the prediction of the synchronization state based on features extracted from dynamically sub-sampled cross-recurrence plots. We provide extensive experiments on several stocks, major constituents of the S\&P100 index, to empirically validate our approach. We find that the task of predicting the state of synchronization of two time series is in general rather difficult, but for certain pairs of stocks attainable with very satisfactory performance.
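    A minimal numpy sketch of a cross-recurrence plot: both series are delay-embedded and the pairwise distance matrix is thresholded; the embedding dimension, delay, and threshold are hand-picked assumptions rather than tuned values.

    ```python
    # Cross-recurrence plot: threshold distances between delay embeddings.
    import numpy as np

    def delay_embed(x, dim=3, tau=1):
        n = len(x) - (dim - 1) * tau
        return np.column_stack([x[i * tau: i * tau + n] for i in range(dim)])

    def cross_recurrence_plot(x, y, eps=0.5, dim=3, tau=1):
        X, Y = delay_embed(x, dim, tau), delay_embed(y, dim, tau)
        dists = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
        return (dists < eps).astype(int)           # 1 = recurrent pair

    t = np.linspace(0, 20, 400)
    crp = cross_recurrence_plot(np.sin(t), np.sin(t + 0.5))
    print(crp.shape, crp.mean())                   # recurrence rate
    ```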
    Moment Estimation for Nonparametric Mixture Models Through Implicit Tensor Decomposition. (arXiv:2210.14386v1 [math.NA])
    We present an alternating least squares type numerical optimization scheme to estimate conditionally-independent mixture models in $\mathbb{R}^n$, with minimal additional distributional assumptions. Following the method of moments, we tackle a coupled system of low-rank tensor decomposition problems. The steep costs associated with high-dimensional tensors are avoided, through the development of specialized tensor-free operations. Numerical experiments illustrate the performance of the algorithm and its applicability to various models and applications. In many cases the results exhibit improved reliability over the expectation-maximization algorithm, with similar time and storage costs. We also provide some supporting theory, establishing identifiability and local linear convergence.
    "Calibeating": Beating Forecasters at Their Own Game. (arXiv:2209.04892v2 [econ.TH] UPDATED)
    In order to identify expertise, forecasters should not be tested by their calibration score, which can always be made arbitrarily small, but rather by their Brier score. The Brier score is the sum of the calibration score and the refinement score; the latter measures how good the sorting into bins with the same forecast is, and thus attests to "expertise." This raises the question of whether one can gain calibration without losing expertise, which we refer to as "calibeating." We provide an easy way to calibeat any forecast, by a deterministic online procedure. We moreover show that calibeating can be achieved by a stochastic procedure that is itself calibrated, and then extend the results to simultaneously calibeating multiple procedures, and to deterministic procedures that are continuously calibrated.
    Smooth Monotone Stochastic Variational Inequalities and Saddle Point Problems -- Survey. (arXiv:2208.13592v2 [math.OC] UPDATED)
    This paper is a survey of methods for solving smooth (strongly) monotone stochastic variational inequalities. To begin with, we give the deterministic foundation from which the stochastic methods eventually evolved. Then we review methods for the general stochastic formulation, and look at the finite sum setup. The last parts of the paper are devoted to various recent (not necessarily stochastic) advances in algorithms for variational inequalities.
    High Dimensional Statistical Estimation under Uniformly Dithered One-bit Quantization. (arXiv:2202.13157v3 [stat.ML] UPDATED)
    In this paper, we propose a uniformly dithered one-bit quantization scheme for high-dimensional statistical estimation. The scheme contains truncation, dithering, and quantization as typical steps. As canonical examples, the quantization scheme is applied to three estimation problems: sparse covariance matrix estimation, sparse linear regression, and matrix completion. We study both sub-Gaussian and heavy-tailed regimes, with the underlying distribution of heavy-tailed data assumed to possess bounded second or fourth moment. For each model we propose new estimators based on one-bit quantized data. In the sub-Gaussian regime, our estimators achieve optimal minimax rates up to logarithmic factors, which indicates that our quantization scheme introduces nearly no additional cost. In the heavy-tailed regime, while the rates of our estimators become essentially slower, these results are either the first ones in such a one-bit quantized and heavy-tailed setting, or exhibit some advantages over existing comparable results. More specifically, our results for one-bit compressed sensing feature generality of the sensing vector (sub-Gaussian or even heavy-tailed) and tractable convex programming. A novel setting where both measurement and covariate are quantized is also first proposed and studied. For one-bit matrix completion, our method is essentially different from the standard likelihood approach and can handle pre-quantization random noise with unknown distribution. Experimental results on synthetic data are presented to support our theoretical analysis.
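    A minimal sketch of the three steps (truncation, dithering, one-bit quantization) for scalar mean estimation: for $|x| \leq T$ and $u \sim \mathrm{Unif}(-T, T)$ one has $\mathbb{E}[\mathrm{sign}(x+u)] = x/T$, so $T$ times the average sign recovers the truncated mean; the heavy-tailed toy distribution and the level $T$ are assumptions.

    ```python
    # Uniformly dithered one-bit quantization for mean estimation.
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_t(df=3, size=100_000) + 0.7   # heavy-tailed, mean 0.7
    T = 8.0                                        # truncation level
    x_trunc = np.clip(x, -T, T)                    # truncation step
    dither = rng.uniform(-T, T, size=x.shape)      # dithering step
    bits = np.sign(x_trunc + dither)               # one-bit quantization
    print("one-bit estimate:", T * bits.mean(), "sample mean:", x.mean())
    ```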
    Distribution-Free Finite-Sample Guarantees and Split Conformal Prediction. (arXiv:2210.14735v1 [stat.ML])
    Modern black-box predictive models are often accompanied by weak performance guarantees that only hold asymptotically in the size of the dataset or require strong parametric assumptions. In response to this, split conformal prediction represents a promising avenue to obtain finite-sample guarantees under minimal distribution-free assumptions. Although prediction set validity most often concerns marginal coverage, we explore the related but different guarantee of tolerance regions, reformulating known results in the language of nested prediction sets and extending on the duality between marginal coverage and tolerance regions. Furthermore, we highlight the connection between split conformal prediction and classical tolerance predictors developed in the 1940s, as well as recent developments in distribution-free risk control. One result that transfers from classical tolerance predictors is that the coverage of a prediction set based on order statistics, conditional on the calibration set, is a random variable stochastically dominating the Beta distribution. We demonstrate the empirical effectiveness of our findings on synthetic and real datasets using a popular split conformal prediction procedure called conformalized quantile regression (CQR).
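    A minimal sketch of split conformal prediction with absolute-residual scores (plain regression intervals rather than the CQR variant used in the paper's experiments); the model, the data, and the 90% level are assumptions.

    ```python
    # Split conformal prediction: calibrate a residual quantile on held-out data.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(-3, 3, size=(2000, 1))
    y = np.sin(X[:, 0]) + 0.2 * rng.normal(size=2000)
    train, cal, test = np.split(rng.permutation(2000), [1000, 1500])

    model = RandomForestRegressor(random_state=0).fit(X[train], y[train])
    scores = np.abs(y[cal] - model.predict(X[cal]))          # conformity scores
    n = len(cal)
    q = np.quantile(scores, np.ceil(0.9 * (n + 1)) / n,
                    method="higher")                         # finite-sample quantile
    pred = model.predict(X[test])
    covered = np.abs(y[test] - pred) <= q                    # interval: pred +/- q
    print("empirical coverage:", covered.mean())             # ~0.9 on average
    ```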
    Coordinate Descent for SLOPE. (arXiv:2210.14780v1 [math.OC])
The lasso is the most famous sparse regression and feature selection method. One reason for its popularity is the speed at which the underlying optimization problem can be solved. Sorted L-One Penalized Estimation (SLOPE) is a generalization of the lasso with appealing statistical properties. In spite of this, the method has not yet attracted widespread interest. A major reason for this is that current software packages that fit SLOPE rely on algorithms that perform poorly in high dimensions. To tackle this issue, we propose a new fast algorithm to solve the SLOPE optimization problem, which combines proximal gradient descent and proximal coordinate descent steps. We provide new results on the directional derivative of the SLOPE penalty and its related SLOPE thresholding operator, and we provide convergence guarantees for our proposed solver. In extensive benchmarks on simulated and real data, we show that our method outperforms a long list of competing algorithms.
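    Since the solver is built around the SLOPE penalty's proximal operator, a minimal NumPy sketch of the sorted-L1 prox may help fix ideas. This follows the stack-based pool-adjacent-violators scheme of Bogdan et al. (2015) and is an illustration, not the paper's hybrid solver:

```python
import numpy as np

def prox_sorted_l1(y, lam):
    """Prox of the SLOPE (sorted-L1) penalty J(x) = sum_i lam_i |x|_(i),
    with lam non-negative and non-increasing. Pool-adjacent-violators
    sketch in the spirit of Bogdan et al. (2015); illustrative only."""
    y = np.asarray(y, dtype=float)
    lam = np.asarray(lam, dtype=float)
    sign = np.sign(y)
    order = np.argsort(-np.abs(y))     # sort |y| in decreasing order
    d = np.abs(y)[order] - lam         # solution is the non-increasing,
    blocks = []                        # non-negative regression of these gaps
    for i, di in enumerate(d):
        blocks.append([i, i, di])      # block = [start, end, sum]
        # merge while block averages violate the non-increasing constraint
        while len(blocks) > 1 and (
            blocks[-1][2] / (blocks[-1][1] - blocks[-1][0] + 1)
            > blocks[-2][2] / (blocks[-2][1] - blocks[-2][0] + 1)
        ):
            _, end, total = blocks.pop()
            blocks[-1][1] = end
            blocks[-1][2] += total
    x_sorted = np.empty_like(d)
    for s, e, tot in blocks:
        x_sorted[s:e + 1] = max(tot / (e - s + 1), 0.0)  # clip at zero
    x = np.empty_like(x_sorted)
    x[order] = x_sorted                # undo the sort, restore signs
    return sign * x
```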
    Frank-Wolfe-based Algorithms for Approximating Tyler's M-estimator. (arXiv:2206.09370v2 [math.OC] UPDATED)
Tyler's M-estimator is a well-known procedure for robust and heavy-tailed covariance estimation. Tyler himself suggested an iterative fixed-point algorithm for computing his estimator; however, it requires super-linear (in the size of the data) runtime per iteration, which may be prohibitive at large scale. In this work we propose, to the best of our knowledge, the first Frank-Wolfe-based algorithms for computing Tyler's estimator. One variant uses standard Frank-Wolfe steps, the second also considers \textit{away-steps} (AFW), and the third is a \textit{geodesic} version of AFW (GAFW). AFW provably requires, up to a log factor, only linear time per iteration, while GAFW runs in linear time (up to a log factor) in a large-$n$ (number of data points) regime. All three variants are shown to provably converge to the optimal solution at a sublinear rate, under standard assumptions, despite the fact that the underlying optimization problem is neither convex nor smooth. Under an additional fairly mild assumption, which holds with probability 1 when the (normalized) data points are i.i.d. samples from a continuous distribution supported on the entire unit sphere, AFW and GAFW are proved to converge at linear rates. Importantly, all three variants are parameter-free and use adaptive step-sizes.  ( 2 min )
    TuneUp: A Training Strategy for Improving Generalization of Graph Neural Networks. (arXiv:2210.14843v1 [stat.ML])
Despite many advances in Graph Neural Networks (GNNs), their training strategies simply focus on minimizing a loss over nodes in a graph. However, such simplistic training strategies may be sub-optimal as they neglect that certain nodes are much harder to make accurate predictions on than others. Here we present TuneUp, a curriculum learning strategy for better training GNNs. Crucially, TuneUp trains a GNN in two stages. The first stage aims to produce a strong base GNN. Such base GNNs tend to perform well on head nodes (nodes with large degrees) but less so on tail nodes (nodes with small degrees). So, the second stage of TuneUp specifically focuses on improving prediction on tail nodes. Concretely, TuneUp synthesizes many additional supervised tail node data by dropping edges from head nodes and reusing the supervision on the original head nodes. TuneUp then minimizes the loss over the synthetic tail nodes to finetune the base GNN. TuneUp is a general training strategy that can be used with any GNN architecture and any loss, making TuneUp applicable to a wide range of prediction tasks. Extensive evaluation of TuneUp on two GNN architectures, three types of prediction tasks, and both inductive and transductive settings shows that TuneUp significantly improves the performance of the base GNN on tail nodes, while often even improving the performance on head nodes, which together yields up to a 58.5% relative improvement in GNN predictive performance. Moreover, TuneUp significantly outperforms its variants without the two-stage curriculum learning, existing graph data augmentation techniques, as well as other specialized methods for tail nodes.  ( 3 min )
    Rhino: Deep Causal Temporal Relationship Learning With History-dependent Noise. (arXiv:2210.14706v1 [cs.LG])
    Discovering causal relationships between different variables from time series data has been a long-standing challenge for many domains such as climate science, finance, and healthcare. Given the complexity of real-world relationships and the nature of observations in discrete time, causal discovery methods need to consider non-linear relations between variables, instantaneous effects and history-dependent noise (the change of noise distribution due to past actions). However, previous works do not offer a solution addressing all these problems together. In this paper, we propose a novel causal relationship learning framework for time-series data, called Rhino, which combines vector auto-regression, deep learning and variational inference to model non-linear relationships with instantaneous effects while allowing the noise distribution to be modulated by historical observations. Theoretically, we prove the structural identifiability of Rhino. Our empirical results from extensive synthetic experiments and two real-world benchmarks demonstrate better discovery performance compared to relevant baselines, with ablation studies revealing its robustness under model misspecification.  ( 2 min )
    Streaming Submodular Maximization with Differential Privacy. (arXiv:2210.14315v1 [cs.LG])
    In this work, we study the problem of privately maximizing a submodular function in the streaming setting. Extensive work has been done on privately maximizing submodular functions in the general case when the function depends upon the private data of individuals. However, when the size of the data stream drawn from the domain of the objective function is large or arrives very fast, one must privately optimize the objective within the constraints of the streaming setting. We establish fundamental differentially private baselines for this problem and then derive better trade-offs between privacy and utility for the special case of decomposable submodular functions. A submodular function is decomposable when it can be written as a sum of submodular functions; this structure arises naturally when each summand function models the utility of an individual and the goal is to study the total utility of the whole population as in the well-known Combinatorial Public Projects Problem. Finally, we complement our theoretical analysis with experimental corroboration.  ( 2 min )
    Weisfeiler and Leman Go Walking: Random Walk Kernels Revisited. (arXiv:2205.10914v2 [cs.LG] UPDATED)
    Random walk kernels have been introduced in seminal work on graph learning and were later largely superseded by kernels based on the Weisfeiler-Leman test for graph isomorphism. We give a unified view on both classes of graph kernels. We study walk-based node refinement methods and formally relate them to several widely-used techniques, including Morgan's algorithm for molecule canonization and the Weisfeiler-Leman test. We define corresponding walk-based kernels on nodes that allow fine-grained parameterized neighborhood comparison, reach Weisfeiler-Leman expressiveness, and are computed using the kernel trick. From this we show that classical random walk kernels with only minor modifications regarding definition and computation are as expressive as the widely-used Weisfeiler-Leman subtree kernel but support non-strict neighborhood comparison. We verify experimentally that walk-based kernels reach or even surpass the accuracy of Weisfeiler-Leman kernels in real-world classification tasks.  ( 2 min )
    Is Out-of-Distribution Detection Learnable?. (arXiv:2210.14707v1 [cs.LG])
Supervised learning aims to train a classifier under the assumption that training and test data are from the same distribution. To ease the above assumption, researchers have studied a more realistic setting: out-of-distribution (OOD) detection, where test data may come from classes that are unknown during training (i.e., OOD data). Due to the unavailability and diversity of OOD data, good generalization ability is crucial for effective OOD detection algorithms. To study the generalization of OOD detection, in this paper, we investigate the probably approximately correct (PAC) learning theory of OOD detection, which has been posed by researchers as an open problem. First, we find a necessary condition for the learnability of OOD detection. Then, using this condition, we prove several impossibility theorems for the learnability of OOD detection under some scenarios. Although the impossibility theorems are frustrating, we find that some conditions of these impossibility theorems may not hold in some practical scenarios. Based on this observation, we next give several necessary and sufficient conditions to characterize the learnability of OOD detection in some practical scenarios. Lastly, we also offer theoretical support for several representative OOD detection works based on our OOD theory.  ( 2 min )
    When Does Differentially Private Learning Not Suffer in High Dimensions?. (arXiv:2207.00160v4 [cs.LG] UPDATED)
Large pretrained models can be privately fine-tuned to achieve performance approaching that of non-private models. A common theme in these results is the surprising observation that high-dimensional models can achieve favorable privacy-utility trade-offs. This seemingly contradicts known results on the model-size dependence of differentially private convex learning and raises the following research question: When does the performance of differentially private learning not degrade with increasing model size? We identify that the magnitudes of gradients projected onto subspaces are a key factor that determines performance. To precisely characterize this for private convex learning, we introduce a condition on the objective that we term \emph{restricted Lipschitz continuity} and derive improved bounds for the excess empirical and population risks that are dimension-independent under additional conditions. We empirically show that in private fine-tuning of large language models, gradients obtained during fine-tuning are mostly controlled by a few principal components. This behavior is similar to conditions under which we obtain dimension-independent bounds in convex settings. Our theoretical and empirical results together provide a possible explanation for recent successes in large-scale private fine-tuning. Code to reproduce our results can be found at \url{https://github.com/lxuechen/private-transformers/tree/main/examples/classification/spectral_analysis}.  ( 3 min )
    Local Linear Convergence of Gradient Methods for Subspace Optimization via Strict Complementarity. (arXiv:2202.04020v2 [math.OC] UPDATED)
We consider optimization problems in which the goal is to find a $k$-dimensional subspace of $\mathbb{R}^n$, $k \ll n$, which minimizes a convex and smooth loss. Such problems generalize the fundamental task of principal component analysis (PCA) to include robust and sparse counterparts, and logistic PCA for binary data, among others. This problem can be approached either via nonconvex gradient methods, which have highly efficient iterations but for which fast convergence to a global minimizer is difficult to establish, or via a convex relaxation, for which convergence to a global minimizer is straightforward to argue but whose methods are often inefficient in high dimensions. In this work we bridge these two approaches under a strict complementarity assumption, which in particular implies that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem. Our main result is a proof that a natural nonconvex gradient method, which is \textit{SVD-free} and requires only a single QR-factorization of an $n\times k$ matrix per iteration, converges locally at a linear rate. We also establish linear convergence results for the nonconvex projected gradient method, and for the Frank-Wolfe method when applied to the convex relaxation.  ( 2 min )
    ANACONDA: An Improved Dynamic Regret Algorithm for Adaptive Non-Stationary Dueling Bandits. (arXiv:2210.14322v1 [cs.LG])
    We study the problem of non-stationary dueling bandits and provide the first adaptive dynamic regret algorithm for this problem. The only two existing attempts in this line of work fall short across multiple dimensions, including pessimistic measures of non-stationary complexity and non-adaptive parameter tuning that requires knowledge of the number of preference changes. We develop an elimination-based rescheduling algorithm to overcome these shortcomings and show a near-optimal $\tilde{O}(\sqrt{S^{\texttt{CW}} T})$ dynamic regret bound, where $S^{\texttt{CW}}$ is the number of times the Condorcet winner changes in $T$ rounds. This yields the first near-optimal dynamic regret algorithm for unknown $S^{\texttt{CW}}$. We further study other related notions of non-stationarity for which we also prove near-optimal dynamic regret guarantees under additional assumptions on the underlying preference model.  ( 2 min )
    Calibrated Predictive Distributions via Diagnostics for Conditional Coverage. (arXiv:2205.14568v2 [stat.ML] UPDATED)
    Uncertainty quantification is crucial for assessing the predictive ability of AI algorithms. A large body of work (including normalizing flows and Bayesian neural networks) has been devoted to describing the entire predictive distribution (PD) of a target variable Y given input features $\mathbf{X}$. However, off-the-shelf PDs are usually far from being conditionally calibrated; i.e., the probability of occurrence of an event given input $\mathbf{X}$ can be significantly different from the predicted probability. Most current research on predictive inference (such as conformal prediction) concerns constructing calibrated prediction sets only. It is often believed that the problem of obtaining and assessing entire conditionally calibrated PDs is too challenging. In this work, we show that recalibration, as well as diagnostics of entire PDs, are indeed attainable goals in practice. Our proposed method relies on the idea of regressing probability integral transform (PIT) scores against $\mathbf{X}$. This regression gives full diagnostics of conditional coverage across the entire feature space and can be used to recalibrate misspecified PDs. We benchmark our corrected prediction bands against oracle bands and state-of-the-art predictive inference algorithms for synthetic data, including settings with a distributional shift. Finally, we produce calibrated PDs for two applications: (i) probabilistic nowcasting based on sequences of satellite images, and (ii) estimation of galaxy distances based on imaging data (photometric redshifts).  ( 3 min )
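    The core diagnostic idea, regressing PIT scores against $\mathbf{X}$, is easy to prototype. Below is a minimal sketch; regressing the indicator of a single coverage level with a logistic model is a simplification of the paper's approach, and all names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pit_coverage_diagnostic(pit, X, alpha=0.5):
    """Estimate P(PIT <= alpha | X) by regressing the indicator on X.
    If the predictive distribution is conditionally calibrated, this
    probability should equal alpha everywhere; systematic deviations
    localize miscalibration in feature space."""
    z = (pit <= alpha).astype(int)
    clf = LogisticRegression(max_iter=1000).fit(X, z)
    return clf.predict_proba(X)[:, 1]

# usage sketch: pit[i] = F_hat(y_i | x_i) computed on a held-out set
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
pit = rng.uniform(size=500)            # perfectly calibrated toy PITs
print(pit_coverage_diagnostic(pit, X, alpha=0.5).mean())  # close to 0.5
```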
    Distributionally Robust Batch Contextual Bandits. (arXiv:2006.05630v6 [cs.LG] UPDATED)
    Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data -- an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data. We first present a policy evaluation procedure that allows us to assess how well the policy does under the worst-case environment shift. We then establish a central limit theorem type guarantee for this proposed policy evaluation scheme. Leveraging this evaluation scheme, we further propose a novel learning algorithm that is able to learn a policy that is robust to adversarial perturbations and unknown covariate shifts with a performance guarantee based on the theory of uniform convergence. Finally, we empirically test the effectiveness of our proposed algorithm in synthetic datasets and demonstrate that it provides the robustness that is missing using standard policy learning algorithms. We conclude the paper by providing a comprehensive application of our methods in the context of a real-world voting dataset.  ( 3 min )
    Learning Causal Graphs in Manufacturing Domains using Structural Equation Models. (arXiv:2210.14573v1 [stat.ML])
Many production processes are characterized by numerous and complex cause-and-effect relationships. Since these are only partially known, they pose a challenge to effective process control. In this work we show how Structural Equation Models can be used to derive cause-and-effect relationships from the combination of prior knowledge and process data in the manufacturing domain. Compared to existing applications, we do not assume linear relationships, leading to more informative results.  ( 2 min )
    History-Based, Bayesian, Closure for Stochastic Parameterization: Application to Lorenz '96. (arXiv:2210.14488v1 [stat.ML])
Physical parameterizations are used as representations of unresolved subgrid processes within weather and global climate models or coarse-scale turbulent models, whose resolutions are too coarse to resolve small-scale processes. These parameterizations are typically grounded on physically-based, yet empirical, representations of the underlying small-scale processes. Machine learning-based parameterizations have recently been proposed as an alternative and have shown great promise in reducing uncertainties associated with small-scale processes. Yet, those approaches still show some important mismatches that are often attributed to stochasticity in the considered process. This stochasticity can be due to noisy data, unresolved variables or simply to the inherent chaotic nature of the process. To address these issues, we develop a new type of parameterization (closure) which is based on a Bayesian formalism for neural networks, to account for uncertainty quantification, and includes memory, to account for the non-instantaneous response of the closure. To overcome the curse of dimensionality of Bayesian techniques in high-dimensional spaces, the Bayesian strategy is based on a Hamiltonian Monte Carlo sampling strategy that takes advantage of the likelihood function and kinetic energy's gradients with respect to the parameters to accelerate the sampling process. We apply the proposed Bayesian history-based parameterization to the Lorenz '96 model in the presence of noisy and sparse data, similar to satellite observations, and show its capacity to produce skillful forecasts of the resolved variables while returning trustworthy uncertainty quantifications for different sources of error. This approach paves the way for the use of Bayesian approaches for closure problems.  ( 3 min )
    Wasserstein Archetypal Analysis. (arXiv:2210.14298v1 [stat.ML])
    Archetypal analysis is an unsupervised machine learning method that summarizes data using a convex polytope. In its original formulation, for fixed k, the method finds a convex polytope with k vertices, called archetype points, such that the polytope is contained in the convex hull of the data and the mean squared Euclidean distance between the data and the polytope is minimal. In the present work, we consider an alternative formulation of archetypal analysis based on the Wasserstein metric, which we call Wasserstein archetypal analysis (WAA). In one dimension, there exists a unique solution of WAA and, in two dimensions, we prove existence of a solution, as long as the data distribution is absolutely continuous with respect to Lebesgue measure. We discuss obstacles to extending our result to higher dimensions and general data distributions. We then introduce an appropriate regularization of the problem, via a Renyi entropy, which allows us to obtain existence of solutions of the regularized problem for general data distributions, in arbitrary dimensions. We prove a consistency result for the regularized problem, ensuring that if the data are iid samples from a probability measure, then as the number of samples is increased, a subsequence of the archetype points converges to the archetype points for the limiting data distribution, almost surely. Finally, we develop and implement a gradient-based computational approach for the two-dimensional problem, based on the semi-discrete formulation of the Wasserstein metric. Our analysis is supported by detailed computational experiments.  ( 3 min )
    Maximum Likelihood Learning of Energy-Based Models for Simulation-Based Inference. (arXiv:2210.14756v1 [cs.LG])
We introduce two synthetic likelihood methods for Simulation-Based Inference (SBI), to conduct either amortized or targeted inference from experimental observations when a high-fidelity simulator is available. Both methods learn a conditional energy-based model (EBM) of the likelihood using synthetic data generated by the simulator, conditioned on parameters drawn from a proposal distribution. The learned likelihood can then be combined with any prior to obtain a posterior estimate, from which samples can be drawn using MCMC. Our methods uniquely combine a flexible Energy-Based Model and the minimization of a KL loss: this is in contrast to other synthetic likelihood methods, which either rely on normalizing flows, or minimize score-based objectives; choices that come with known pitfalls. Our first method, Amortized Unnormalized Neural Likelihood Estimation (AUNLE), introduces a tilting trick during training that allows us to significantly lower the computational cost of inference by enabling the use of efficient MCMC techniques. Our second method, Sequential UNLE (SUNLE), employs a robust doubly intractable approach in order to re-use simulation data and improve posterior accuracy on a specific dataset. We demonstrate the properties of both methods on a range of synthetic datasets, and apply them to a neuroscience model of the pyloric network in the crab Cancer borealis, matching the performance of other synthetic likelihood methods at a fraction of the simulation budget.  ( 2 min )
    Exact Manifold Gaussian Variational Bayes. (arXiv:2210.14598v1 [stat.ML])
    We propose an optimization algorithm for Variational Inference (VI) in complex models. Our approach relies on natural gradient updates where the variational space is a Riemann manifold. We develop an efficient algorithm for Gaussian Variational Inference that implicitly satisfies the positive definite constraint on the variational covariance matrix. Our Exact manifold Gaussian Variational Bayes (EMGVB) provides exact but simple update rules and is straightforward to implement. Due to its black-box nature, EMGVB stands as a ready-to-use solution for VI in complex models. Over five datasets, we empirically validate our feasible approach on different statistical, econometric, and deep learning models, discussing its performance with respect to baseline methods.  ( 2 min )
    Sharp threshold for alignment of graph databases with Gaussian weights. (arXiv:2010.16295v3 [stat.ML] UPDATED)
We study the fundamental limits for reconstruction in weighted graph (or matrix) database alignment. We consider a model of two graphs where $\pi^*$ is a planted uniform permutation and all pairs of edge weights $(A_{i,j}, B_{\pi^*(i),\pi^*(j)})_{1 \leq i<j \leq n}$ are i.i.d. pairs of Gaussian variables with zero mean, unit variance and correlation parameter $\rho \in [0,1]$. We prove that there is a sharp threshold for exact recovery of $\pi^*$: if $n \rho^2 \geq (4+\epsilon) \log n$ for some $\epsilon>0$, there is an estimator $\hat{\pi}$ -- namely the MAP estimator -- based on the observation of databases $A,B$ that achieves exact reconstruction with high probability. Conversely, if $n \rho^2 \leq 4 \log n - \log \log n - \omega(1)$, then any estimator $\hat{\pi}$ verifies $\hat{\pi}=\pi^*$ with probability $o(1)$. This result shows that the information-theoretic threshold for exact recovery is the same as the one obtained for detection in a recent work by Wu et al. (2020): in other words, for Gaussian weighted graph alignment, the problem of reconstruction is not more difficult than that of detection. Though the reconstruction task was already well understood for vector-shaped database alignment (that is, taking a signal of the form $(u_i, v_{\pi^*(i)})_{1 \leq i\leq n}$ where $(u_i, v_{\pi^*(i)})$ are i.i.d. pairs in $\mathbb{R}^{d_u} \times \mathbb{R}^{d_v}$), its formulation for graph (or matrix) databases brings a drastically different problem for which the hard phase is conjectured to be wide. The proofs build upon the analysis of the MAP estimator and the second moment method, together with the study of the correlation structure of energies of permutations.  ( 3 min )
    FedX: Federated Learning for Compositional Pairwise Risk Optimization. (arXiv:2210.14396v1 [cs.LG])
In this paper, we tackle a novel federated learning (FL) problem for optimizing a family of compositional pairwise risks, to which no existing FL algorithms are applicable. In particular, the objective has the form of $\mathbb E_{\mathbf z\sim \mathcal S_1} f(\mathbb E_{\mathbf z'\sim\mathcal S_2} \ell(\mathbf w, \mathbf z, \mathbf z'))$, where two sets of data $\mathcal S_1, \mathcal S_2$ are distributed over multiple machines, $\ell(\cdot; \cdot,\cdot)$ is a pairwise loss that only depends on the prediction outputs of the input data pairs $(\mathbf z, \mathbf z')$, and $f(\cdot)$ is possibly a non-linear non-convex function. This problem has important applications in machine learning, e.g., AUROC maximization with a pairwise loss, and partial AUROC maximization with a compositional loss. The challenges for designing an FL algorithm lie in the non-decomposability of the objective over multiple machines and the interdependency between different machines. We propose two provable FL algorithms (FedX) for handling linear and nonlinear $f$, respectively. To address the challenges, we decouple the gradient's components into two types, namely active parts and lazy parts, where the active parts depend on local data and are computed with the local model, and the lazy parts depend on other machines and are communicated/computed based on historical models and samples. We develop a novel theoretical analysis to combat the latency of the lazy parts and the interdependency between the local model parameters and the involved data for computing local gradient estimators. We establish both iteration and communication complexities and show that using the historical samples and models for computing the lazy parts does not degrade the complexities. We conduct empirical studies of FedX for deep AUROC and partial AUROC maximization, and demonstrate their performance compared with several baselines.  ( 3 min )
    SPQR: An R Package for Semi-Parametric Density and Quantile Regression. (arXiv:2210.14482v1 [stat.ME])
    We develop an R package SPQR that implements the semi-parametric quantile regression (SPQR) method in Xu and Reich (2021). The method begins by fitting a flexible density regression model using monotonic splines whose weights are modeled as data-dependent functions using artificial neural networks. Subsequently, estimates of conditional density and quantile process can all be obtained. Unlike many approaches to quantile regression that assume a linear model, SPQR allows for virtually any relationship between the covariates and the response distribution including non-linear effects and different effects on different quantile levels. To increase the interpretability and transparency of SPQR, model-agnostic statistics developed by Apley and Zhu (2020) are used to estimate and visualize the covariate effects and their relative importance on the quantile function. In this article, we detail how this framework is implemented in SPQR and illustrate how this package should be used in practice through simulated and real data examples.  ( 2 min )
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v2 [stat.ML] UPDATED)
    The limit of infinite width allows for substantial simplifications in the analytical study of overparameterized neural networks. With a suitable random initialization, an extremely large network is well approximated by a Gaussian process, both before and during training. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimizes the generalization bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.  ( 2 min )
    Animal behavior classification via deep learning on embedded systems. (arXiv:2111.12295v3 [cs.LG] UPDATED)
We develop an end-to-end deep-neural-network-based algorithm for classifying animal behavior using accelerometry data on the embedded system of an artificial intelligence of things (AIoT) device installed in a wearable collar tag. The proposed algorithm jointly performs feature extraction and classification utilizing a set of infinite-impulse-response (IIR) and finite-impulse-response (FIR) filters together with a multilayer perceptron. The utilized IIR and FIR filters can be viewed as specific types of recurrent and convolutional neural network layers, respectively. We evaluate the performance of the proposed algorithm via two real-world datasets collected from a total of eighteen grazing beef cattle using collar tags. The results show that the proposed algorithm offers good intra- and inter-dataset classification accuracy and outperforms its closest contenders including two state-of-the-art convolutional-neural-network-based time-series classification algorithms, which are significantly more complex. We implement the proposed algorithm on the embedded system of the utilized collar tags' AIoT device to perform in-situ classification of animal behavior. We achieve real-time in-situ behavior inference from accelerometry data without imposing any strain on the available computational, memory, or energy resources of the embedded system.  ( 2 min )
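    As a rough illustration of the filtering front end the abstract describes (the coefficients, shapes, and filter choices below are made up for the sketch; in the paper the filter parameters are learned jointly with the classifier):

```python
import numpy as np
from scipy.signal import lfilter

# Toy tri-axial accelerometry stream of 1000 samples.
x = np.random.randn(1000, 3)

fir_b = np.ones(25) / 25.0            # FIR: 25-tap moving average
iir_b, iir_a = [0.1], [1.0, -0.9]     # IIR: simple one-pole low-pass

feat_fir = lfilter(fir_b, [1.0], x, axis=0)   # convolutional-layer analogue
feat_iir = lfilter(iir_b, iir_a, x, axis=0)   # recurrent-layer analogue

features = np.concatenate([feat_fir, feat_iir], axis=1)  # input to an MLP
print(features.shape)  # (1000, 6)
```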

  • Open

    [D] What platform environment would you use for young Python ML learners?
    I'm considering helping out a nonprofit with a summer camp that introduces some early Python ML ideas to high school students. Given that Google Colab now essentially doesn't have a "free" version, what would you fine folks use? In educational settings like this, one of the key things is reduced complexity. Colab was great for that in that many libraries were already installed and you're pulling up a notebook immediately. Whereas one could theoretically set up a number of computers for this, that's not always viable. And like many things in education, cost is typically prohibitive. Even harder in the educational setting is students having their own accounts, which I'll leave off as a worry for now (but I mention it on the off-chance). I should mention, in my case at least, I don't need any GPU hours at all. Just small 100Kish or so datasets and standard models. Maybe Colab somehow still works for that if not using GPU, but their model is unclear to me there. submitted by /u/jshkk [link] [comments]  ( 125 min )
    [R]Cool book I came across on Algebra, Topology, Differential Calculus and ML
    submitted by /u/obsoletelearner [link] [comments]  ( 125 min )
    [D] Generic Python code refinement corpus?
Hey guys, I’m doing a project involving some code refinement stuff, and I was trying to find a convenient Python code corpus for code-to-code refinement that is relatively quick to train on, to pretrain my models on before fine-tuning them on other tasks. Anyone have any suggestions? submitted by /u/Excellent-Hope-1514 [link] [comments]  ( 123 min )
    [R] In-context Reinforcement Learning with Algorithm Distillation (AD) - DeepMind 2022
The transformer collects its own data and maximizes reward on new tasks. No prompting or finetuning necessary. The transformer explores, exploits and maximizes return in context while the weights are frozen! Expert Distillation (most similar to Gato), on the other hand, cannot explore and fails to maximize return! Paper: https://arxiv.org/abs/2210.14215 Abstract: We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data. submitted by /u/Singularian2501 [link] [comments]  ( 127 min )
    [D] NeurIPS 2022 Poster Design Tips
I am trying to design a poster for my accepted paper at NeurIPS. I found that the pre-print service seems to love Adobe products like Photoshop and Illustrator, which I have barely used. Moreover, the guidelines say that the poster format should be in CMYK... I'm wondering how others are preparing their posters for the preprint service, and I'd welcome any tips for doing it smoothly. Can we simply make a ppt slide and paste it into Photoshop?? It's very annoying for me to use unfamiliar Adobe tools... Thanks for any advice in advance. submitted by /u/RepresentativeDue559 [link] [comments]  ( 124 min )
    [D] Poisson Flow Generative Models - new physics inspired generative model
    Hey everyone! Over the past couple years, we've all seen the incredible performance of Diffusion Models in Stable Diffusion, DALL-E 2, etc. A team at MIT recently published a new model that is inspired by physics much like diffusion models. I wrote an introduction to Poisson Flow Generative Models as an explainer for the models, which use concepts from electrodynamics to generate data. ​ Some highlights: 10-20x faster than diffusion models on image generation with comparable performance Invertible mapping akin to normalizing flows allows for explicit likelihood estimate and a well-structured latent space Operates on a deterministic ODE and has no stochastic element unlike diffusion models and score-based SDE approaches Here are some example CelebA images being generated with PFGMs: https://i.redd.it/9f826fz726w91.gif Note that the highest resolution images explored in the paper are LSUN bedroom 256 x 256 images Looking forward to answering any questions! submitted by /u/SleekEagle [link] [comments]  ( 126 min )
    [R] In-context Reinforcement Learning with Algorithm Distillation
    submitted by /u/hardmaru [link] [comments]  ( 123 min )
    [D] Semantic segmentation (weak labels)
Hello, I work with a dataset of microscope images where the goal is to segment whether there is "cell" or not - classical semantic segmentation using supervised learning. However, the hand-labels are not internally consistent. The following problems occur: Varying borders. Sometimes, it is hard to tell where the cell border begins (blurry microscope effects). Therefore, the hand-labels are sometimes narrower, sometimes looser/wider. This ambiguity is bad for convergence: the model does not know whether to include or exclude the blurry border. Unfortunately, one cannot simply say everything that is crisp = cell (i.e., by thresholding), because that would omit cells that are "out of focus", but clearly identifiable as a cell. Lazy labeling. Also, the annotators took "shortcuts" by collectively labeling a group of cells instead of each cell separately. This results in the intermediate zone (between cells) being wrongly annotated as belonging to the cell class. (Please see attached image in case this is unclear) Due to these issues, the model often produces bad annotations. How would one tackle these kinds of problems (ideally, without pseudolabelling)? Do you have any literature, ideas, ... Thank you very much! Kind regards submitted by /u/laggdown [link] [comments]  ( 127 min )
    [D]Cheating in AAAI 2023 rebuttal
I am based in the US and I am a PC member (reviewer) for AAAI 2023. This is what happened: an author of a paper I reviewed approached me during the rebuttal period, hoping that I could raise the score. I noticed that this year AAAI is different regarding the anonymity policy: during the rebuttal period, reviewers can see the names of other reviewers; reviewers can see the name of the meta-reviewer; the meta-reviewer can see the names of reviewers. This surely increases the chance for an author to identify the target reviewers. I am very sorry for such a design. The chairs should take responsibility. submitted by /u/EnvironmentalBar338 [link] [comments]  ( 107 min )
    [D] What's the best open source model for GPT3-like text-to-text generation on local hardware?
Playing around with GPT3 was quite fun, but now my credits have expired. I used it primarily just for fun to see what the model knows and understands, and occasionally for brainstorming and writing. I've tried T5-FLAN XL, which runs very fast on my 3060ti and in theory should be quite good, but the output was quite lacklustre compared to GPT3. Even for examples from the paper (e.g. the sarcasm detection) the answer was typically correct, but the chain of thought leading up to it was only convincing 10% of the time. I experimented with a few different sampling strategies (temp, top_p, top_k, beam search etc.) but still got quite disappointing results. Any advice? Would the XXL model or one of Meta's LLMs be much better? Are there magic settings for sampling? Are there any existing notebooks/GUI tools that I can conveniently run locally? Cheers! submitted by /u/AuspiciousApple [link] [comments]  ( 125 min )
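    For reference, the sampling knobs mentioned above map onto Hugging Face's generate API roughly as follows (the checkpoint name, prompt, and values are illustrative, not tuned recommendations):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-xl")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-xl")

inputs = tok("Is the following review sarcastic? ...", return_tensors="pt")
out = model.generate(
    **inputs,
    do_sample=True,       # sample instead of greedy/beam decoding
    temperature=0.7,      # <1 sharpens, >1 flattens the distribution
    top_k=50,             # keep only the 50 most likely tokens...
    top_p=0.9,            # ...then the smallest nucleus with 90% mass
    max_new_tokens=128,
)
print(tok.decode(out[0], skip_special_tokens=True))
```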
    [D] Python function that changed your life in regard to Machine Learning
Hello, Do you have any Python function that was a game-changer for your ML work? If so, share it with us :D submitted by /u/popcornn1 [link] [comments]  ( 125 min )
    [N] Proceedings of the First International Symposium on the Tsetlin Machine available in IEEE Xplore
    ​ Proceedings - First International Symposium on the Tsetlin Machine Finally it is here! The proceedings of The First International Symposium on the Tsetlin Machine. Fourteen exciting papers advancing the Tsetlin machine research. Many thanks to the authors, my co-organizers, and all the supporters of the Tsetlin machine - we build this together. #democraticAI #greenAI #logicbasedAI https://ieeexplore.ieee.org/xpl/conhome/9923753/proceeding submitted by /u/olegranmo [link] [comments]  ( 126 min )
    [P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels
We are releasing Kernl under the Apache 2 license, a library to make PyTorch model inference significantly faster. With 1 line of code we applied the optimizations and made Bert up to 12X faster than the Hugging Face baseline. T5 is also covered in this first release (>6X speed-up in generation, and we are still only halfway through the optimizations!). This has been possible because we wrote custom GPU kernels with the new OpenAI programming language Triton and leveraged TorchDynamo. Project link: https://github.com/ELS-RD/kernl/ E2E demo notebooks: XNLI classification, T5 generation Benchmarks ran on a 3090 RTX GPU, 12-core Intel CPU, more info below On long sequence length inputs, Kernl is most of the time the fastest inference engine, and close to Nvidia TensorRT on the shortest ones. Keep in mind that B…  ( 129 min )
    [P] Awesome Machine Unlearning
    https://github.com/tamlhp/awesome-machine-unlearning We compile and sort the machine unlearning frameworks into different categories. If you are interested, we have a survey paper to introduce this field. The repo is updated frequently. submitted by /u/adasken [link] [comments]  ( 131 min )
    Learn From Industry & Research Experts at Speech AI Summit ( [R], [N])
Hi all, thought this would be helpful for people in this group doing research and development in speech AI. There is a free digital Speech AI Summit coming up where industry and research experts from Google, Meta, and NVIDIA come together to talk about the latest work in research, production, and open source. Register here (https://www.nvidia.com/en-us/events/speech-ai-summit/) if you are interested. If you have any questions, please leave a comment and I will do my best to respond as soon as possible. submitted by /u/Designer-Comb-7144 [link] [comments]  ( 118 min )
  • Open

    [R] [2210.13435] Dichotomy of Control: Separating What You Can Control from What You Cannot
    submitted by /u/chimp73 [link] [comments]  ( 115 min )
    Why is my REINFORCE algorithm not learning?
    I posted the question on StackOverflow submitted by /u/Academic-Rent7800 [link] [comments]  ( 115 min )
    Need help understanding RL literature!
I'm fairly new to the field and currently reading Sutton's RL book, where I came across the equations below in the Dynamic Programming chapter. I get the derivation of (4.4) from (3.9), but the appearance of (4.3) got me confused. Does it imply $G_{t+1} = v_{\pi}(s')$, or are there some mathematical manipulations in between? Please explain, redditors, thank you in advance! submitted by /u/Confident_Pumpkin_99 [link] [comments]  ( 116 min )
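    For what it's worth, the step in question is the tower rule of conditional expectation combined with the Markov property; a sketch in Sutton and Barto's notation:

```latex
v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]
         = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s].
% Conditioning on the next state S_{t+1} = s' and using the Markov
% property gives \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s'),
% which is what (4.3) asserts; substituting then yields (4.4):
v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',\,r} p(s', r \mid s, a)
           \bigl[ r + \gamma\, v_\pi(s') \bigr].
```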
    Multi Agent Reinforcement Learning
    I was reading a paper which states “since a centralized critic with access to the global state and the global action is required for the MARL.” Is this even true? Don’t some MARL applications use a completely decentralized architecture? So in that case what would the centralized critic even refer to? submitted by /u/tmt22459 [link] [comments]  ( 116 min )
    I integrated Open AI gym into my game
    submitted by /u/aeonax [link] [comments]  ( 116 min )
    Can current algorithms with enough compute produce AGI?
    From Transformers to Diffusion, from Latent to Sampling methods, we're witnessing evolution. Is this trend sufficient to produce AGI? What if paired with quantum computing? The real question I feel is 'what is inevitable'. A matter of forecasting where possibilities explode and time moves faster than we can predict. If given $5 Million dollars for an AI start-up, there's no method to employ. I'm not suggesting we need a method to implement a business to produce AGI, but it's the same line of thought of the future. submitted by /u/XecutionStyle [link] [comments]  ( 118 min )
    [reference request] methods of agent team evaluation in multi-agent RL
Hello community! I've been working on research in multi-agent RL for collaborative-competitive scenarios. I've come across a naive but important question: how to benchmark and evaluate a pool of checkpoints trained using different algorithms. The most straightforward way of comparing the team reward (win-rate against the opponent) may be too 'crude': it is not perfect for asymmetric games, and stochastic effects produce variance among tests, especially for similarly skilled teams. Do you know of any other methods for multi-agent team evaluation, and can you provide some references? Best, Jacek submitted by /u/dzako1 [link] [comments]  ( 114 min )
    Problem with feeding a batch through network
Hello, I'm trying to train a Deep Q-network that outputs N binary random action variables. This is because each action is a set of binary variables, e.g., [1, 0, 0, 1, 0]. Now I collect states, actions and rewards together in a batch that I try to feed through the network as seen in the code below. However, I get an error on line 119 because the batch_index and action_batch shapes are not the same. This is because the batch_index array is just a 1d array of the indexes in the batch, but the action_batch has the shape (Batch_size, N_outputs), since each action is not just a single random action variable. I'm kind of stranded here since my coding background is not very strong (I'm a management engineering grad student), so any advice on how to feed the batch successfully through the network will be hu…  ( 121 min )
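    One common fix for this kind of shape mismatch (all names below are hypothetical, not the poster's actual code) is to have the network output one Q-value per option per binary variable and select the chosen ones with gather, which avoids 1-D batch_index arithmetic altogether:

```python
import torch

batch_size, n_vars = 32, 5
# network output: a Q-value for each (binary variable, choice) pair
q_values = torch.randn(batch_size, n_vars, 2)
# sampled 0/1 decisions, one per binary variable
action_batch = torch.randint(0, 2, (batch_size, n_vars))

# pick, for every binary variable, the Q-value of the option actually taken
chosen_q = q_values.gather(2, action_batch.unsqueeze(-1)).squeeze(-1)
print(chosen_q.shape)  # torch.Size([32, 5]): one value per variable per sample
```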
    Name of paper: showing that offline RL algorithm can improve very fast if it trains online briefly
    Hey, I once stumbled on a paper that showed that an algorithm trained in an offline setting and then afterwards briefly trained in an online setting improved dramatically compared to just the offline trained version. What is the name of this paper? Thanks submitted by /u/Dragonrooster [link] [comments]  ( 114 min )
    Deep Learning with TensorFlow and Keras, 3rd Edition
    submitted by /u/Futureisnotsecure [link] [comments]  ( 115 min )
For RL newbies (RL labs for DeepMind's RL course)
    Hello RL learners, You might find this repository helpful to apply what you have been learning in DeepMind's RL course. https://github.com/mhd-medfa/IU-Reinforcement-Learning-22-lab submitted by /u/MhdMedfa1 [link] [comments]  ( 115 min )
  • Open

    Should I take my Uni’s course or study on my own?
Hello everyone, first time posting with some beginner questions. I am in my 3rd year of my CS major, and was investigating topics I might like to do for my thesis. My main interest so far has been software dev, but I am liking my machine learning class a lot, so I've started to look into AI. I was planning on taking the intro to AI course my uni offers, but sadly it is taught by a professor I seriously dislike. I've had this prof before, and he is infamous in the faculty for being the kind of prof that ruins otherwise interesting courses. So my question is this: should I take the gamble with this prof for the sake of university legitimacy, or should I just study and learn on my own? submitted by /u/PatchFact [link] [comments]  ( 106 min )
    pandorabox.ai
Hello guys, In the past months, I got fascinated by the many programs that use AI for developing images, texts, and videos. I ended up testing most of them, not only for fun but also to help me in some business. I'm very excited about this AI trend, and I truly think it will disrupt many sectors. Recently, a friend and I got the idea of developing a hub to concentrate all these phenomenal tools: pandorasbox.ai We would appreciate your feedback on this first draft. Any suggestions of other amazing tools to include? What do you think of other kinds of content (like stocks companies/crypto) related to AI? What kinds of problems did you run into when trying to work with/test/find AI software? Thank you all for your attention! submitted by /u/_-_agenda_-_ [link] [comments]  ( 106 min )
    Has anyone attempted a cost/benefit analysis for advanced AI? By cost I mean risk to malevolent AI and existential threats
    It is very concerning to me that it’s all systems go without much thought of stopping and asking if this is all worth it. submitted by /u/TheLastSamurai [link] [comments]  ( 106 min )
    Shutterstock will start selling AI-generated stock imagery with help from OpenAI
    submitted by /u/TallAssociation0 [link] [comments]  ( 105 min )
    ML interpretability
    Hi guys, how do you approach the interpretability of black-box models? submitted by /u/MariiaKozlova [link] [comments]  ( 105 min )
    Looking for a paper about methods to detect a ball in a soccer game
Hi, I'm doing a school project about soccer analytics and I want to talk about how an algorithm can detect the ball in a soccer game video to later do analysis on that. Any paper or research you can recommend? Thanks in advance! submitted by /u/iamtdb [link] [comments]  ( 106 min )
    How AI Transformers Mimic Parts of the Brain | Quanta Magazine
    submitted by /u/estasfuera [link] [comments]  ( 106 min )
    An AI App Can Predict and Prevent Money Laundering
    submitted by /u/SamuelSmith1416 [link] [comments]  ( 106 min )
How are the art market and the world of international collectors experiencing the changes of our technological age? Olga Simonova, founder of ArtLever and an expert in luxury objects and artworks throughout Europe and the United States, tells us about it
    submitted by /u/Artlever [link] [comments]  ( 107 min )
    Using AI to compress audio files for quick and easy sharing
    submitted by /u/magenta_placenta [link] [comments]  ( 111 min )
    I need someone to hold my Hand...
Trying to make this short; went to university (physics) because I just didn't know what else to do (had no one I could ask). Graduated with a master's degree. Worked for 2 years in the semiconductor industry. Quit because of an annoying boss. Thinking about changing my direction into AI/Data Science. Working in the industry showed me I enjoy coding, data analysis etc., quite a bit more than I expected. That got me thinking that I might use this pause between jobs to transition into this field. I have enough money saved up (frugal lifestyle) to finance myself for a couple of years. (I'm 29 y/o) I looked around and I could apply to a master's program "Artificial Intelligence" thanks to my physics degree. But this scares me, as I have minimal coding skills which I only half-assed on the job, just to fulfill my tasks. What I'm asking here is if anyone who can relate in any way could have a look at my rough timeline and give me feedback on it. So my rough idea currently is something like this: First year: Since I'm starting directly into a master's with minimal coding knowledge, I will probably take fewer courses to make sure I can handle the workload and, if necessary, go back and re-study topics I might lack. Second year: Depending on how well the first year went, I might take extra courses or keep the workload intentionally low so I can deepen my knowledge in other fields I might deem important (statistics, linear algebra, etc.). Third year: Look for a part-time coding job to get hands-on experience during my studies. Fourth year: Look towards ending the studies and starting the thesis. Fifth year: Finish the thesis, start to work somewhere. submitted by /u/sta6 [link] [comments]  ( 110 min )
    Trillion: Virtual Try-on for jewelry 💍
    Hey everyone! We are excited to announce that Trillion is live on Product Hunt 😻 Trillion is the first app unlocking a virtual try-on feature for jewelry brands and jewelry lovers. It took a year and a half to create our technology for realistic try-on of jewelry in AR and prepare the app for public use. Here we go 🚀 ⚠️ The Trillion app is: - 360 degrees view of jewelry pieces - Virtual try-on of jewelry in AR - Anatomically accurate jewelry placement and realistic display even while moving - Purchasing feature - Available on App Store and Google Play Please support our launch and share your feedback on the Product Hunt page 👉 https://www.producthunt.com/posts/trillion submitted by /u/ApatheticSparrow [link] [comments]  ( 106 min )
    Artificial intelligence is used for predictive policing in the US and UK—South Africa should embrace it, too
    In the 2002 movie Minority Report (based on a short story by Philip K Dick), director Steven Spielberg imagined a future in which three psychics can "see" murders before they happen. Their clairvoyance allows Tom Cruise and his "Precrime" police force to avert nearly all potential homicides. Twenty years on, in the real world, scientists and law enforcement agencies are using data mining and machine learning to mimic those psychics. Such "predictive policing", as it is called, is based on the fact that many crimes—and criminals—have detectable patterns. Predictive policing has enjoyed some successes. In a case study in the US, one police department was able to reduce gun incidents by 47% over the typically gun-happy New Year's Eve. Manchester police in the UK were similarly able to predi…  ( 116 min )
    Join Us For Transformers AI Online Hackathon!
    If you're looking for an exciting and challenging event that will push your skills to the limit, look no further than the Transformers AI Online Hackathon🧑‍💻 In this 3-day event, you will use transformers architecture to build innovative solutions🔥 With a friendly and supportive atmosphere, this is the perfect event for anyone looking to take their AI skills to the next level. So what are you waiting for? Register now and let's get started!🚀 (it's free to attend!) Register here ​ Transformers AI Hackathon submitted by /u/lablabai [link] [comments]  ( 106 min )
    All About Alexa’s New Language Understanding Model
About AlexaTM 20B The new language model from Amazon is a multilingual large-scale model pre-trained on a set of denoising and Causal Language Modeling (CLM) tasks. As per the company, this strategy helps the AlexaTM model be more efficient at few-shot learning than decoder-only language models. The AlexaTM 20B model achieves state-of-the-art performance on 1-shot summarization tasks and outperforms the larger PaLM decoder model with 540 billion parameters. Amazon's model works particularly well for the low-resource language pairs that it supports – Arabic, French, English, German, Hindi, Italian, Japanese, Portuguese, Spanish, Marathi, Tamil, and Telugu on the Flores-101 dataset. Read More: https://analyticsindiamag.com/all-about-alexas-new-language-understanding-model/ submitted by /u/analyticsindiam [link] [comments]  ( 106 min )
    HaikNews.com - Uses AI to write a haiku out of daily news headlines and generates a corresponding image w/ the Stable Diffusion AI model
    This is a rather simple project written in PHP, JavaScript, and some backend database code. Essentially, it uses the fantastic NewsAPI to automatically gather news headlines on a daily basis from about 50k different publishers, which include basically all the sources used by Google News. These headlines are then analyzed for syllable count and the site attempts to write haikus out of them. Only the most popular 5-7-5 syllable form of haikus are used, and unfortunately because the syllable counter is not perfect, the haikus generated do not always conform exactly to the syllable rules (but most of the time they do). The haikus themselves are generated on demand, so each time you refresh the page, you should see a new haiku. However, the data sources (news articles) only update once per day. The haiku is then fed into DeepAI's Stable Diffusion API to generate an image corresponding to the haiku. To keep things interesting, this image is either returned with no filter, a comic book filter, impressionist painting filter, or caricature filter. Note that the image takes around 20 seconds to load so it may take a moment to display. The image can essentially be thought of as a visual display of the generated haiku. http://www.haiknews.com submitted by /u/southside915 [link] [comments]  ( 108 min )
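    For readers curious what the 5-7-5 check might look like, here is a rough Python sketch using a heuristic vowel-group syllable counter, which is exactly the kind of counter that, as the author notes, sometimes miscounts:

```python
import re

def count_syllables(word):
    """Heuristic syllable count: number of vowel groups, minus a
    typical silent final 'e'. Approximate by design."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def is_575(lines):
    """Return True if the three lines follow the 5-7-5 haiku form."""
    counts = [sum(count_syllables(w) for w in line.split()) for line in lines]
    return len(lines) == 3 and tuple(counts) == (5, 7, 5)

print(is_575(["An old silent pond",
              "A frog jumps into the pond",
              "Splash! Silence again"]))  # True
```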
    What’s the best ai for creating a southern woman’s voice?
    submitted by /u/Freakonaleash-mp3 [link] [comments]  ( 106 min )
    So I used InferKit to generate a story, and this is what happened.
    "You're going down, Spongebob..." says Pickle Rick. "Fuck you" says Spongebob. Suddenly the two charge at eachother with their guns raised. Spongebob swings his fist, but Pickle Rick catches his fist and throws him across the room. Spongebob falls to the floor, and Pickle Rick stands over him. Spongebob picks himself up off the floor and looks at Pickle Rick. "You're a real douche, Pickle Rick" he says, as he fires a blast of water at him. Pickle Rick ducks and the water splashes against the wall. "I don't want to fight you, Spongebob" says Pickle Rick. "Yeah, you don't want to fight me" says Spongebob. "No, I don't want to fight you" says Pickle Rick. "No, I don't want to fight you" says Spongebob. "No, I don't want to fight you" says Pickle Rick. "No, I don't want to fight you" says Spon…  ( 115 min )
  • Open

    Determining a conic by points and tangents
The first post in this series said that a conic section has five degrees of freedom, and that any theorem that claims to determine a conic by fewer than five numbers is using some additional implicit information. The second post looked at Gibbs’ method which uses three observations, and a variation on the method uses just […] Determining a conic by points and tangents first appeared on John D. Cook.  ( 4 min )
    Four views of multisets
This post will define multisets and basic operations on multisets. We’ll view union, intersection, inclusion and sum each from four perspectives: examples with words, an example with prime factorization, Python’s multiset module, and multisets as functions. Definition and examples: A multiset is like a set, except each element may appear more than once. We say each […] Four views of multisets first appeared on John D. Cook.  ( 8 min )
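The excerpt cuts off before the examples, but all four operations are easy to demonstrate. A small sketch using the standard-library collections.Counter as a stand-in for the third-party multiset module the post mentions (multiset inclusion via <= requires Python 3.10+):

from collections import Counter

a = Counter("mississippi")    # multiset of letters
b = Counter("missouri")

union = a | b                 # max of multiplicities
intersection = a & b          # min of multiplicities
total = a + b                 # sum of multiplicities
included = b <= a             # multiset inclusion (Python 3.10+)

# Prime-factorization view: union gives the lcm, intersection the gcd.
twelve = Counter({2: 2, 3: 1})     # 12 = 2^2 * 3
eighteen = Counter({2: 1, 3: 2})   # 18 = 2 * 3^2
assert (twelve | eighteen) == Counter({2: 2, 3: 2})   # lcm = 36
assert (twelve & eighteen) == Counter({2: 1, 3: 1})   # gcd = 6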
    Lambert’s theorem
    At the start of this series we pointed out that a conic section has five degrees of freedom, and so claims to be able to determine an orbit from less than five numbers must be hiding information somewhere. That is the case with Lambert’s theorem which reportedly determines an orbit from two numbers. Lambert’s theorem […] Lambert’s theorem first appeared on John D. Cook.  ( 5 min )
  • Open

The Number of AI Research Publications is Accelerating Faster in China than in the U.S.
    Note: this post was completed as part of a written work trial for Epoch.  ( 8 min )
  • Open

    Amazon SageMaker Automatic Model Tuning now supports grid search
Today Amazon SageMaker announced the support of Grid search for automatic model tuning, providing users with an additional strategy to find the best hyperparameter configuration for their models. Amazon SageMaker automatic model tuning finds the best version of a model by running many training jobs on your dataset using a range of hyperparameters that you […]  ( 5 min )
    Introducing the Amazon SageMaker Serverless Inference Benchmarking Toolkit
    Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale machine learning (ML) models. It provides a pay-per-use model, which is ideal for services where endpoint invocations are infrequent and unpredictable. Unlike a real-time hosting endpoint, which is backed by a long-running instance, compute resources for […]  ( 9 min )
    Deploy a machine learning inference data capture solution on AWS Lambda
    Monitoring machine learning (ML) predictions can help improve the quality of deployed models. Capturing the data from inferences made in production can enable you to monitor your deployed models and detect deviations in model quality. Early and proactive detection of these deviations enables you to take corrective actions, such as retraining models, auditing upstream systems, […]  ( 8 min )
    AWS Celebrates 5 Years of Innovation with Amazon SageMaker
    In just 5 years, tens of thousands of customers have tapped Amazon SageMaker to create millions of models, train models with billions of parameters, and generate hundreds of billions of monthly predictions. The seeds of a machine learning (ML) paradigm shift were there for decades, but with the ready availability of virtually infinite compute capacity, […]  ( 13 min )
  • Open

    Natural Language Assessment: A New Framework to Promote Education
    Posted by Kedem Snir, Software Engineer, and Gal Elidan, Senior Staff Research Scientist, Google Research Whether it's a professional honing their skills or a child learning to read, coaches and educators play a key role in assessing the learner's answer to a question in a given context and guiding them towards a goal. These interactions have unique characteristics that set them apart from other forms of dialogue, yet are not available when learners practice alone at home. In the field of natural language processing, this type of capability has not received much attention and is technologically challenging. We set out to explore how we can use machine learning to assess answers in a way that facilitates learning. In this blog, we introduce an important natural language understanding (NL…  ( 93 min )
  • Open

    Jetson-Driven Grub Getter: Cartken Rolls Out Robots-as-a-Service for Deliveries
There’s a new sidewalk-savvy robot, and it’s delivering coffee, grub and a taste of fun. The bot is garnering interest for Oakland, Calif., startup Cartken. The company, founded in 2019, has rapidly deployed robots for a handful of customer applications, including for Starbucks and Grubhub deliveries. Cartken CEO Chris Bersch said that he and co-founders […] The post Jetson-Driven Grub Getter: Cartken Rolls Out Robots-as-a-Service for Deliveries appeared first on NVIDIA Blog.  ( 5 min )
  • Open

    Deep Learning with TensorFlow and Keras, 3rd Edition
    submitted by /u/Futureisnotsecure [link] [comments]  ( 105 min )
    Does anyone know of any spiking neural network implementations?
I’ve been trying to wrap my head around spiking networks. While I understand how to implement them, I’m totally lost on STDP training for them. If anyone knows of any existing implementations that I could look at, it would be a great help! submitted by /u/statsing [link] [comments]  ( 104 min )
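For reference, simulator libraries such as Brian2 and the PyTorch-based BindsNET include STDP implementations worth studying. The core pair-based rule itself is short; a minimal NumPy sketch with illustrative constants (not drawn from any particular implementation):

import numpy as np

# Pair-based STDP: potentiate when a presynaptic spike precedes a
# postsynaptic one, depress when it follows.
A_PLUS, A_MINUS = 0.01, 0.012      # learning rates
TAU_PLUS, TAU_MINUS = 20.0, 20.0   # time constants (ms)

def stdp_dw(t_pre: float, t_post: float) -> float:
    # Weight change contributed by one pre/post spike pair (times in ms).
    dt = t_post - t_pre
    if dt > 0:   # pre fired first -> potentiation
        return A_PLUS * np.exp(-dt / TAU_PLUS)
    return -A_MINUS * np.exp(dt / TAU_MINUS)  # post fired first -> depression

# Accumulate over all spike pairs of one synapse, keeping w in [0, 1].
pre_spikes, post_spikes = [10.0, 45.0, 80.0], [12.0, 40.0, 95.0]
w = 0.5
for tp in pre_spikes:
    for tq in post_spikes:
        w += stdp_dw(tp, tq)
w = float(np.clip(w, 0.0, 1.0))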
  • Open

    One Student Knows All Experts Know: From Sparse to Dense. (arXiv:2201.10890v4 [cs.LG] UPDATED)
The human education system trains one student with multiple experts. Mixture-of-experts (MoE) is a powerful sparse architecture comprising multiple experts. However, sparse MoE models are prone to overfitting, hard to deploy, and not hardware-friendly for practitioners. In this work, inspired by the human education model, we propose a novel task, knowledge integration, to obtain a dense student model (OneS) as knowledgeable as one sparse MoE. We investigate this task by proposing a general training framework including knowledge gathering and knowledge distillation. Specifically, to gather key knowledge from different pre-trained experts, we first investigate four different possible knowledge gathering methods, i.e., summation, averaging, Top-K Knowledge Gathering (Top-KG), and Singular Value Decomposition Knowledge Gathering (SVD-KG) proposed in this paper. We then refine the dense student model by knowledge distillation to offset the noise from gathering. On ImageNet, our OneS preserves $61.7\%$ of the benefits from MoE and achieves $78.4\%$ top-1 accuracy on ImageNet with only $15$M parameters. On four natural language processing datasets, OneS obtains $88.2\%$ of the MoE benefits and outperforms the best baseline by $51.7\%$ using the same architecture and training data. In addition, compared with the MoE counterpart, OneS achieves a $3.7\times$ inference speedup due to less computation and a more hardware-friendly architecture.  ( 3 min )
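The abstract doesn't spell out SVD-KG; one plausible reading is to denoise each expert's weight matrix with a truncated SVD before merging into the dense student. A hedged PyTorch sketch of that reading (the rank choice and the final averaging are my assumptions, not the paper's stated procedure):

import torch

def svd_gather(expert_weights: list[torch.Tensor], rank: int) -> torch.Tensor:
    # Keep each expert's top-`rank` singular components, then average.
    merged = torch.zeros_like(expert_weights[0])
    for w in expert_weights:
        u, s, vh = torch.linalg.svd(w, full_matrices=False)
        merged += (u[:, :rank] * s[:rank]) @ vh[:rank, :]
    return merged / len(expert_weights)

experts = [torch.randn(512, 512) for _ in range(4)]
dense_w = svd_gather(experts, rank=64)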
    Graph Neural Networks for Nomination and Representation Learning of Web Elements. (arXiv:2111.02168v3 [cs.LG] UPDATED)
    This paper tackles the under-explored problem of DOM element nomination and representation learning with three important contributions. First, we present a large-scale and realistic dataset of webpages, far richer and more diverse than other datasets proposed for element representation learning, classification and nomination on the web. The dataset contains $51,701$ manually labeled product pages from $8,175$ real e-commerce websites. Second, we adapt several Graph Neural Network (GNN) architectures to website DOM trees and benchmark their performance on a diverse set of element nomination tasks using our proposed dataset. In element nomination, a single element on a page is selected for a given class. We show that on our challenging dataset a simple Convolutional GNN outperforms state-of-the-art methods on web element nomination. Finally, we propose a new training method that further boosts the element nomination accuracy. In nomination for the web, classification (assigning a class to a given element) is usually used as a surrogate objective for nomination during training. Our novel training methodology steers the classification objective towards the more complex and useful nomination objective.  ( 3 min )
    LQGNet: Hybrid Model-Based and Data-Driven Linear Quadratic Stochastic Control. (arXiv:2210.12803v2 [eess.SY] UPDATED)
Stochastic control deals with finding an optimal control signal for a dynamical system in a setting with uncertainty, playing a key role in numerous applications. The linear quadratic Gaussian (LQG) is a widely-used setting, where the system dynamics are represented as a linear Gaussian state-space (SS) model, and the objective function is quadratic. For this setting, the optimal controller is obtained in closed form by the separation principle. However, in practice, the underlying system dynamics often cannot be faithfully captured by a fully known linear Gaussian SS model, limiting performance. Here, we present LQGNet, a stochastic controller that leverages data to operate under partially known dynamics. LQGNet augments the state tracking module of separation-based control with a dedicated trainable algorithm. The resulting system preserves the operation of classic LQG control while learning to cope with partially known SS models without having to fully identify the dynamics. We empirically show that LQGNet outperforms classic stochastic control by overcoming mismatched SS models.  ( 2 min )
    On Dual-Based PI Controllers for Online Allocation Problems. (arXiv:2202.06152v2 [math.OC] UPDATED)
    Dual-based proportional-integral (PI) controllers are often employed in practice to solve online allocation problems with global constraints, such as budget pacing in online advertising. However, controllers are used in a heuristic fashion and come with no provable guarantees on their performance. This paper provides the first regret bounds on the performance of dual-based PI controllers for online allocation problems. We do so by first establishing a fundamental connection between dual-based PI controllers and a new first-order algorithm for online convex optimization, which, in a special case, recovers online mirror descent with momentum. We prove the proposed first-order algorithm attains low regret for general online convex optimization problems with adversarial inputs. We leverage this new result to give the first regret bound for dual-based PI controllers for online allocation problems. As a byproduct of our proofs, we provide the first regret bound for online mirror descent for non-smooth convex optimization, which might be of independent interest.  ( 2 min )
    A Continuous Convolutional Trainable Filter for Modelling Unstructured Data. (arXiv:2210.13416v2 [cs.LG] UPDATED)
The Convolutional Neural Network (CNN) is one of the most important architectures in deep learning. The fundamental building block of a CNN is a trainable filter, represented as a discrete grid, used to perform convolution on discrete input data. In this work, we propose a continuous version of a trainable convolutional filter that can also operate on unstructured data. This new framework allows exploring CNNs beyond discrete domains, extending the use of this important learning technique to many more complex problems. Our experiments show that the continuous filter can achieve a level of accuracy comparable to the state-of-the-art discrete filter, and that it can be used as a building block in current deep learning architectures to solve problems on unstructured domains as well.  ( 2 min )
    Learned Lifted Linearization Applied to Unstable Dynamic Systems Enabled by Koopman Direct Encoding. (arXiv:2210.13602v1 [cs.LG])
This paper presents a Koopman lifting linearization method that is applicable to nonlinear dynamical systems having both stable and unstable regions. It is known that DMD and other standard data-driven methods face a fundamental difficulty in constructing a Koopman model when applied to unstable systems. Here we solve the problem by combining knowledge of the nonlinear state equation with a learning method for finding an effective set of observables. In the lifted space, stable and unstable regions are separated into independent subspaces. Based on this property, we propose to find effective observables through neural net training where the training data are separated into stable and unstable trajectories. The resulting learned observables are used to construct a linear state transition matrix using a method known as Direct Encoding, which transforms the nonlinear state equation into a state transition matrix through inner product computations with the observables. The proposed method shows a dramatic improvement over existing DMD and data-driven methods.  ( 2 min )
    SpacePhish: The Evasion-space of Adversarial Attacks against Phishing Website Detectors using Machine Learning. (arXiv:2210.13660v1 [cs.CR])
Existing literature on adversarial Machine Learning (ML) focuses either on showing attacks that break every ML model, or defenses that withstand most attacks. Unfortunately, little consideration is given to the actual cost of the attack or the defense. Moreover, adversarial samples are often crafted in the "feature-space", making the corresponding evaluations of questionable value. Simply put, the current situation does not allow one to estimate the actual threat posed by adversarial attacks, leading to a lack of secure ML systems. We aim to clarify such confusion in this paper. By considering the application of ML for Phishing Website Detection (PWD), we formalize the "evasion-space" in which an adversarial perturbation can be introduced to fool a ML-PWD -- demonstrating that even perturbations in the "feature-space" are useful. Then, we propose a realistic threat model describing evasion attacks against ML-PWD that are cheap to stage, and hence intrinsically more attractive for real phishers. Finally, we perform the first statistically validated assessment of state-of-the-art ML-PWD against 12 evasion attacks. Our evaluation shows (i) the true efficacy of evasion attempts that are more likely to occur; and (ii) the impact of perturbations crafted in different evasion-spaces. Our realistic evasion attempts induce a statistically significant degradation (3-10% at $p < 0.05$), and their cheap cost makes them a subtle threat. Notably, however, some ML-PWD are immune to our most realistic attacks ($p = 0.22$). Our contribution paves the way for a much needed re-assessment of adversarial attacks against ML systems for cybersecurity.  ( 3 min )
    Training Language Models with Memory Augmentation. (arXiv:2205.12674v2 [cs.CL] UPDATED)
Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories (local, long-term, and external) at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103, by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.  ( 2 min )
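The authors have released TRIME's code; the sketch below is only a toy rendering of the in-batch-memory idea, in which the target token's probability mass pools the usual output-embedding logit with similarities to other in-batch contexts sharing that target. Shapes and names are illustrative:

import torch

def in_batch_memory_loss(hidden, targets, token_emb):
    # hidden: (N, d) context vectors; targets: (N,) next-token ids;
    # token_emb: (V, d) output embedding matrix.
    vocab_logits = hidden @ token_emb.T              # standard LM scores (N, V)
    mem_logits = hidden @ hidden.T                   # in-batch memory scores (N, N)
    mem_logits.fill_diagonal_(float("-inf"))         # a position is not its own memory
    positive = targets[None, :] == targets[:, None]  # memories sharing our target
    neg_inf = torch.full_like(mem_logits, float("-inf"))
    pos_logits = torch.where(positive, mem_logits, neg_inf)
    tgt_logit = vocab_logits.gather(1, targets[:, None])         # (N, 1)
    num = torch.logsumexp(torch.cat([tgt_logit, pos_logits], 1), 1)
    den = torch.logsumexp(torch.cat([vocab_logits, mem_logits], 1), 1)
    return (den - num).mean()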
    Adversarial Pretraining of Self-Supervised Deep Networks: Past, Present and Future. (arXiv:2210.13463v1 [cs.LG])
    In this paper, we review adversarial pretraining of self-supervised deep networks including both convolutional neural networks and vision transformers. Unlike the adversarial training with access to labeled examples, adversarial pretraining is complicated as it only has access to unlabeled examples. To incorporate adversaries into pretraining models on either input or feature level, we find that existing approaches are largely categorized into two groups: memory-free instance-wise attacks imposing worst-case perturbations on individual examples, and memory-based adversaries shared across examples over iterations. In particular, we review several representative adversarial pretraining models based on Contrastive Learning (CL) and Masked Image Modeling (MIM), respectively, two popular self-supervised pretraining methods in literature. We also review miscellaneous issues about computing overheads, input-/feature-level adversaries, as well as other adversarial pretraining approaches beyond the above two groups. Finally, we discuss emerging trends and future directions about the relations between adversarial and cooperative pretraining, unifying adversarial CL and MIM pretraining, and the trade-off between accuracy and robustness in adversarial pretraining.  ( 2 min )
    Audio MFCC-gram Transformers for respiratory insufficiency detection in COVID-19. (arXiv:2210.14085v1 [cs.SD])
This work explores speech as a biomarker and investigates the detection of respiratory insufficiency (RI) by analyzing speech samples. Previous work [spira2021] constructed a dataset of respiratory insufficiency COVID-19 patient utterances and analyzed it by means of a convolutional neural network, achieving an accuracy of $87.04\%$ and validating the hypothesis that one can detect RI through speech. Here, we study how Transformer neural network architectures can improve the performance on RI detection. This approach enables construction of an acoustic model. By choosing the correct pretraining technique, we generate a self-supervised acoustic model, leading to improved performance ($96.53\%$) of Transformers for RI detection.  ( 2 min )
    Confident Adaptive Language Modeling. (arXiv:2207.07061v2 [cs.CL] UPDATED)
Recent advances in Transformer-based large language models (LLMs) have led to significant performance improvements across many tasks. These gains come with a drastic increase in the models' size, potentially leading to slow and costly use at inference time. In practice, however, the series of generations made by LLMs is composed of varying levels of difficulty. While certain predictions truly benefit from the models' full capacity, other continuations are more trivial and can be solved with reduced compute. In this work, we introduce Confident Adaptive Language Modeling (CALM), a framework for dynamically allocating different amounts of compute per input and generation timestep. Early exit decoding involves several challenges that we address here, such as: (1) what confidence measure to use; (2) connecting sequence-level constraints to local per-token exit decisions; and (3) attending back to missing hidden representations due to early exits in previous tokens. Through theoretical analysis and empirical experiments on three diverse text generation tasks, we demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $3\times$ -- while provably maintaining high performance.  ( 2 min )
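Of the challenges listed, the confidence measure is the easiest to picture. A toy sketch of softmax-response early exiting at a single decoding position; the paper also studies hidden-state saturation and a learned classifier, and handles the missing-hidden-state problem, none of which appears here:

import torch
import torch.nn.functional as F

@torch.no_grad()
def early_exit_logits(layer_states, lm_head, threshold=0.9):
    # layer_states: per-layer hidden states for one position, each (d_model,);
    # lm_head: the shared output projection. Exit at the first layer whose
    # top-token softmax probability clears the threshold.
    for depth, h in enumerate(layer_states):
        logits = lm_head(h)
        if F.softmax(logits, dim=-1).max() >= threshold:
            return logits, depth          # confident enough: exit early
    return logits, len(layer_states) - 1  # fall through to the final layer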
    Self-supervised Co-learning of Uncurated Images and Reports Enables Oversight AI in Radiology. (arXiv:2208.05140v3 [eess.IV] UPDATED)
    Oversight AI is an emerging concept in radiology where the AI forms a symbiosis with radiologists by continuously supporting radiologists in their decision-making. Recent advances in vision-language pre-training sheds a light on the long-standing problems of the oversight AI by the understanding of both visual and textual concepts and their semantic correspondences. However, there have been limited successes in the application of vision-language pre-training in the medical domain, as the current vision-language models and learning strategies for photographic images and captions are not optimal to process the medical data that are usually insufficient in the amount and the diversity. To address this, here we present medical X-VL, a self-supervised model tailored for efficient vision-language pre-training that exploits cross attention in the radiological images and reports' common feature space in a symmetric manner. We experimentally demonstrate that the pre-trained medical X-VL model outperforms the current state-of-the-art models in various vision-language tasks in medical domains. We finally demonstrate practical clinical usages of our oversight AI for monitoring human errors and in the diagnosis of newly emerging diseases, which suggests the potential of an oversight AI model for widespread applicability in different medical applications.  ( 3 min )
    Learning Causal Discovery. (arXiv:2209.05598v2 [cs.LG] UPDATED)
Causal discovery (CD) from time-varying data is important in neuroscience, medicine, and machine learning. Techniques for CD include randomized experiments, which are generally unbiased but expensive, as well as algorithms like regression, matching, and Granger causality, which are only correct under strong assumptions made by human designers. However, as we have found in other areas of machine learning, humans are usually not quite right and human expertise is usually outperformed by data-driven approaches. Here we test whether we can improve causal discovery in a data-driven way. We take a perturbable system with a large number of causal components (transistors), the MOS 6502 processor, acquire the causal ground truth, and learn the causal discovery procedure represented as a neural network. We find that this procedure far outperforms human-designed causal discovery procedures, such as Mutual Information, LiNGAM, and Granger Causality, both on the MOS 6502 processor and on the NetSim dataset, which simulates functional magnetic resonance imaging (fMRI) results. We argue that the causality field should consider, where possible, a supervised approach, where CD procedures are learned from large datasets with known causal relations instead of being designed by a human specialist. Our findings promise a new approach toward improving CD in neural and medical data and for the broader machine learning community.  ( 2 min )
    Monitored Distillation for Positive Congruent Depth Completion. (arXiv:2203.16034v2 [cs.CV] UPDATED)
We propose a method to infer a dense depth map from a single image, its calibration, and the associated sparse point cloud. In order to leverage existing models (teachers) that produce putative depth maps, we propose an adaptive knowledge distillation approach that yields a positive congruent training process, wherein a student model avoids learning the error modes of the teachers. In the absence of ground truth for model selection and training, our method, termed Monitored Distillation, allows a student to exploit a blind ensemble of teachers by selectively learning from predictions that best minimize the reconstruction error for a given image. Monitored Distillation yields a distilled depth map and a confidence map, or "monitor", for how well a prediction from a particular teacher fits the observed image. The monitor adaptively weights the distilled depth where if all of the teachers exhibit high residuals, the standard unsupervised image reconstruction loss takes over as the supervisory signal. On indoor scenes (VOID), we outperform blind ensembling baselines by 17.53% and unsupervised methods by 24.25%; we boast a 79% model size reduction while maintaining comparable performance to the best supervised method. For outdoors (KITTI), we tie for 5th overall on the benchmark despite not using ground truth. Code available at: https://github.com/alexklwong/mondi-python.  ( 2 min )
    Drastically Reducing the Number of Trainable Parameters in Deep CNNs by Inter-layer Kernel-sharing. (arXiv:2210.14151v1 [cs.CV])
Deep convolutional neural networks (DCNNs) have become the state-of-the-art (SOTA) approach for many computer vision tasks: image classification, object detection, semantic segmentation, etc. However, most SOTA networks are too large for edge computing. Here, we suggest a simple way to reduce the number of trainable parameters and thus the memory footprint: sharing kernels between multiple convolutional layers. Kernel-sharing is only possible between "isomorphic" layers, i.e. layers having the same kernel size, input and output channels. This is typically the case inside each stage of a DCNN. Our experiments on CIFAR-10 and CIFAR-100, using the ConvMixer and SE-ResNet architectures, show that the number of parameters of these models can be drastically reduced with minimal cost on accuracy. The resulting networks are appealing for certain edge computing applications that are subject to severe memory constraints, and even more interesting if leveraging "frozen weights" hardware accelerators. Kernel-sharing is also an efficient regularization method, which can reduce overfitting. The codes are publicly available at https://github.com/AlirezaAzadbakht/kernel-sharing.  ( 2 min )
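The mechanism is simple enough to show directly. A hedged PyTorch sketch of one stage whose conv layers reuse a single kernel tensor (whether biases and norms are also shared is a detail the abstract doesn't state; here they stay per-layer), not the authors' released code:

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedKernelStage(nn.Module):
    # A stack of isomorphic conv layers reusing one kernel tensor.
    def __init__(self, channels: int, depth: int, k: int = 3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, k, k))
        nn.init.kaiming_normal_(self.weight)
        self.pad = k // 2
        # Biases and batch norms stay per-layer in this sketch.
        self.biases = nn.ParameterList(
            nn.Parameter(torch.zeros(channels)) for _ in range(depth))
        self.norms = nn.ModuleList(
            nn.BatchNorm2d(channels) for _ in range(depth))

    def forward(self, x):
        for bias, norm in zip(self.biases, self.norms):
            x = F.relu(norm(F.conv2d(x, self.weight, bias, padding=self.pad)))
        return x

stage = SharedKernelStage(channels=64, depth=4)  # conv weights of 1 layer, not 4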
    Multilingual Relation Classification via Efficient and Effective Prompting. (arXiv:2210.13838v1 [cs.CL])
    Prompting pre-trained language models has achieved impressive performance on various NLP tasks, especially in low data regimes. Despite the success of prompting in monolingual settings, applying prompt-based methods in multilingual scenarios has been limited to a narrow set of tasks, due to the high cost of handcrafting multilingual prompts. In this paper, we present the first work on prompt-based multilingual relation classification (RC), by introducing an efficient and effective method that constructs prompts from relation triples and involves only minimal translation for the class labels. We evaluate its performance in fully supervised, few-shot and zero-shot scenarios, and analyze its effectiveness across 14 languages, prompt variants, and English-task training in cross-lingual settings. We find that in both fully supervised and few-shot scenarios, our prompt method beats competitive baselines: fine-tuning XLM-R_EM and null prompts. It also outperforms the random baseline by a large margin in zero-shot experiments. Our method requires little in-language knowledge and can be used as a strong baseline for similar multilingual classification tasks.  ( 2 min )
    Characterizing information loss in a chaotic double pendulum with the Information Bottleneck. (arXiv:2210.14220v1 [cs.LG])
    A hallmark of chaotic dynamics is the loss of information with time. Although information loss is often expressed through a connection to Lyapunov exponents -- valid in the limit of high information about the system state -- this picture misses the rich spectrum of information decay across different levels of granularity. Here we show how machine learning presents new opportunities for the study of information loss in chaotic dynamics, with a double pendulum serving as a model system. We use the Information Bottleneck as a training objective for a neural network to extract information from the state of the system that is optimally predictive of the future state after a prescribed time horizon. We then decompose the optimally predictive information by distributing a bottleneck to each state variable, recovering the relative importance of the variables in determining future evolution. The framework we develop is broadly applicable to chaotic systems and pragmatic to apply, leveraging data and machine learning to monitor the limits of predictability and map out the loss of information.  ( 2 min )
    Sequential Decision Making on Unmatched Data using Bayesian Kernel Embeddings. (arXiv:2210.13692v1 [stat.ML])
The problem of sequentially maximizing the expectation of a function seeks to maximize the expected value of a function of interest without having direct control over its features. Instead, the distribution of such features depends on a given context and an action taken by an agent. In contrast to Bayesian optimization, the arguments of the function are not under the agent's control, but are indirectly determined by the agent's action based on a given context. If information about the features is to be included in the maximization problem, the full conditional distribution of such features, rather than its expectation only, needs to be accounted for. Furthermore, the function itself is unknown, with only noisy observations of it available, and potentially requiring the use of unmatched data sets. We propose a novel algorithm for the aforementioned problem which takes into consideration the uncertainty derived from the estimation of both the conditional distribution of the features and the unknown function, by modeling the former as a Bayesian conditional mean embedding and the latter as a Gaussian process. Our algorithm empirically outperforms the current state-of-the-art algorithm in the experiments conducted.  ( 2 min )
    Line Graph Contrastive Learning for Link Prediction. (arXiv:2210.13795v1 [cs.LG])
The link prediction task aims to predict whether two nodes in a network are connected. Existing works mainly predict links via node-pair similarity measurements. However, if the local structure does not satisfy the assumptions of such measurements, the algorithms' performance deteriorates rapidly. To overcome these limitations, we propose a Line Graph Contrastive Learning (LGCL) method to obtain multiview information. Our framework obtains a subgraph view by h-hop subgraph sampling centered on the target node pair. After transforming the sampled subgraph into a line graph, the edge embedding information is directly accessible, and the link prediction task is converted into a node classification task. Then, different graph convolution operators learn representations from two perspectives. Finally, contrastive learning is adopted to balance the subgraph representations of these perspectives by maximizing mutual information. In experiments on six public datasets, LGCL outperforms current benchmarks on link prediction tasks and shows better generalization performance and robustness.
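The line-graph move at the heart of LGCL is easy to see with networkx (illustrative only; LGCL operates on sampled h-hop subgraphs rather than the whole graph):

import networkx as nx

# Each edge of G becomes a node of L(G); two L(G) nodes are adjacent
# iff the corresponding edges of G share an endpoint. Edge features of
# G thus become node features of L(G), turning link prediction into
# node classification.
G = nx.karate_club_graph()
L = nx.line_graph(G)
assert L.number_of_nodes() == G.number_of_edges()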
    SWIFT: Rapid Decentralized Federated Learning via Wait-Free Model Communication. (arXiv:2210.14026v1 [cs.DC])
    The decentralized Federated Learning (FL) setting avoids the role of a potentially unreliable or untrustworthy central host by utilizing groups of clients to collaboratively train a model via localized training and model/gradient sharing. Most existing decentralized FL algorithms require synchronization of client models where the speed of synchronization depends upon the slowest client. In this work, we propose SWIFT: a novel wait-free decentralized FL algorithm that allows clients to conduct training at their own speed. Theoretically, we prove that SWIFT matches the gold-standard iteration convergence rate $\mathcal{O}(1/\sqrt{T})$ of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations $T$). Furthermore, we provide theoretical results for IID and non-IID settings without any bounded-delay assumption for slow clients which is required by other asynchronous decentralized FL algorithms. Although SWIFT achieves the same iteration convergence rate with respect to $T$ as other state-of-the-art (SOTA) parallel stochastic algorithms, it converges faster with respect to run-time due to its wait-free structure. Our experimental results demonstrate that SWIFT's run-time is reduced due to a large reduction in communication time per epoch, which falls by an order of magnitude compared to synchronous counterparts. Furthermore, SWIFT produces loss levels for image classification, over IID and non-IID data settings, upwards of 50% faster than existing SOTA algorithms.
    $\texttt{Mangrove}$: Learning Galaxy Properties from Merger Trees. (arXiv:2210.13473v1 [astro-ph.GA])
    Efficiently mapping baryonic properties onto dark matter is a major challenge in astrophysics. Although semi-analytic models (SAMs) and hydrodynamical simulations have made impressive advances in reproducing galaxy observables across cosmologically significant volumes, these methods still require significant computation times, representing a barrier to many applications. Graph Neural Networks (GNNs) have recently proven to be the natural choice for learning physical relations. Among the most inherently graph-like structures found in astrophysics are the dark matter merger trees that encode the evolution of dark matter halos. In this paper we introduce a new, graph-based emulator framework, $\texttt{Mangrove}$, and show that it emulates the galactic stellar mass, cold gas mass and metallicity, instantaneous and time-averaged star formation rate, and black hole mass -- as predicted by a SAM -- with root mean squared error up to two times lower than other methods across a $(75 Mpc/h)^3$ simulation box in 40 seconds, 4 orders of magnitude faster than the SAM. We show that $\texttt{Mangrove}$ allows for quantification of the dependence of galaxy properties on merger history. We compare our results to the current state of the art in the field and show significant improvements for all target properties. $\texttt{Mangrove}$ is publicly available.  ( 2 min )
    Noisy Low-rank Matrix Optimization: Geometry of Local Minima and Convergence Rate. (arXiv:2203.03899v2 [math.OC] UPDATED)
    This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sensing has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results can handle the problem in the case with a general objective function subject to noisy data only when the RIP constant is close to 0. In this paper, we develop a new mathematical framework to solve the above-mentioned problem with a far less restrictive RIP constant. We prove that as long as the RIP constant of the noiseless objective is less than $1/3$, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than $1/3$. Compared to the existing results in the literature, this paper offers the strongest RIP bound and provides a complete theoretical analysis on the global and local optimization landscapes of general low-rank optimization problems under random corruptions from any finite-variance family.
    Structure-based Drug Design with Equivariant Diffusion Models. (arXiv:2210.13695v1 [q-bio.BM])
Structure-based drug design (SBDD) aims to design small-molecule ligands that bind with high affinity and specificity to pre-determined protein targets. Traditional SBDD pipelines start with large-scale docking of compound libraries from public databases, thus limiting the exploration of chemical space to existing, previously studied regions. Recent machine learning methods approached this problem using an atom-by-atom generation approach, which is computationally expensive. In this paper, we formulate SBDD as a 3D-conditional generation problem and present DiffSBDD, an E(3)-equivariant 3D-conditional diffusion model that generates novel ligands conditioned on protein pockets. Furthermore, we curate a new dataset of experimentally determined binding complex data from Binding MOAD to provide a realistic binding scenario that complements the synthetic CrossDocked dataset. Comprehensive in silico experiments demonstrate the efficiency of DiffSBDD in generating novel and diverse drug-like ligands that engage protein pockets with high binding energies as predicted by in silico docking.  ( 2 min )
    Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup. (arXiv:2210.13512v1 [cs.LG])
    Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.  ( 2 min )
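For context, standard Mixup is a one-liner over batches; a minimal PyTorch training-step sketch (the midpoint variant studied in the paper would fix lam = 0.5 instead of sampling it):

import torch
import torch.nn.functional as F

def mixup_loss(model, x, y, num_classes, alpha=1.0):
    # Train on convex combinations of shuffled example/label pairs.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_oh = F.one_hot(y, num_classes).float()
    y_mix = lam * y_oh + (1 - lam) * y_oh[perm]
    logits = model(x_mix)
    return -(y_mix * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()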
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v2 [cs.LG] UPDATED)
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.  ( 2 min )
    Deformation Theory of Boltzmann Distributions. (arXiv:2210.13772v1 [hep-lat])
Consider a one-parameter family of Boltzmann distributions $p_t(x) = \tfrac{1}{Z_t}e^{-S_t(x)}$. In this paper we study the problem of sampling from $p_{t_0}$ by first sampling from $p_{t_1}$ and then applying a transformation $\Psi_{t_1}^{t_0}$ to the samples so that they follow $p_{t_0}$. We derive an equation relating $\Psi$ and the corresponding family of unnormalized log-likelihoods $S_t$. We demonstrate the utility of this idea on the $\phi^4$ lattice field theory by extending its defining action $S_0$ to a family of actions $S_t$ and finding a $\tau$ such that normalizing flows perform better at learning the Boltzmann distribution $p_\tau$ than at learning $p_0$.  ( 2 min )
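The abstract stops short of quoting the equation relating $\Psi$ and $S_t$; the generic change-of-variables identity behind any such construction is worth recording (a standard derivation, not necessarily the paper's exact equation). If $\Psi = \Psi_{t_1}^{t_0}$ pushes samples of $p_{t_1}$ forward to $p_{t_0}$, then
\[
  p_{t_1}(x) = p_{t_0}\!\left(\Psi(x)\right)\,\left|\det J_\Psi(x)\right|,
\]
and substituting $p_t = e^{-S_t}/Z_t$ and taking negative logarithms gives
\[
  S_{t_0}\!\left(\Psi(x)\right) = S_{t_1}(x) + \log\left|\det J_\Psi(x)\right| + \log\frac{Z_{t_1}}{Z_{t_0}}.
\]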
    Speeding Up Question Answering Task of Language Models via Inverted Index. (arXiv:2210.13578v1 [cs.CL])
Natural language processing applications, such as conversational agents and their question-answering capabilities, are widely used in the real world. Despite the wide popularity of large language models (LLMs), few real-world conversational agents take advantage of them. The extensive resources LLMs consume prevent developers from integrating them into end-user applications. In this study, we leverage an inverted indexing mechanism combined with LLMs to improve the efficiency of question-answering models for closed-domain questions. Our experiments show that using the index improves the average response time by 97.44%. In addition, due to the reduced search scope, the average BLEU score improved by 0.23 while using the inverted index.  ( 2 min )
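The paper's full pipeline isn't described in the abstract, but the core data structure is standard. A minimal sketch of how an inverted index shrinks the answer-search scope before any model is invoked (names are illustrative):

from collections import defaultdict

class InvertedIndex:
    # Map each term to the documents containing it, so a query is only
    # scored against candidates sharing at least one term.
    def __init__(self, docs: dict[str, str]):
        self.index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                self.index[term].add(doc_id)

    def candidates(self, query: str) -> set[str]:
        ids = set()
        for term in query.lower().split():
            ids |= self.index.get(term, set())
        return ids

faq = {"q1": "how do i reset my password",
       "q2": "what payment methods are accepted"}
print(InvertedIndex(faq).candidates("reset password"))  # {'q1'}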
    ConnectedUNets++: Mass Segmentation from Whole Mammographic Images. (arXiv:2210.13668v1 [eess.IV])
    Deep learning has made a breakthrough in medical image segmentation in recent years due to its ability to extract high-level features without the need for prior knowledge. In this context, U-Net is one of the most advanced medical image segmentation models, with promising results in mammography. Despite its excellent overall performance in segmenting multimodal medical images, the traditional U-Net structure appears to be inadequate in various ways. There are certain U-Net design modifications, such as MultiResUNet, Connected-UNets, and AU-Net, that have improved overall performance in areas where the conventional U-Net architecture appears to be deficient. Following the success of UNet and its variants, we have presented two enhanced versions of the Connected-UNets architecture: ConnectedUNets+ and ConnectedUNets++. In ConnectedUNets+, we have replaced the simple skip connections of Connected-UNets architecture with residual skip connections, while in ConnectedUNets++, we have modified the encoder-decoder structure along with employing residual skip connections. We have evaluated our proposed architectures on two publicly available datasets, the Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) and INbreast.  ( 2 min )
    CoLoC: Conditioned Localizer and Classifier for Sound Event Localization and Detection. (arXiv:2210.13932v1 [cs.SD])
In this article, we describe the Conditioned Localizer and Classifier (CoLoC), a novel solution for Sound Event Localization and Detection (SELD). The solution consists of two stages: localization is done first and is followed by classification conditioned on the output of the localizer. To resolve the problem of the unknown number of sources, we incorporate an idea borrowed from Sequential Set Generation (SSG). Models in both stages are SELDnet-like CRNNs, but with single outputs. Our analysis shows that two such single-output models are fit for the SELD task. We show that our solution improves on the baseline system in most metrics on the STARSS22 Dataset.  ( 2 min )
    Embodied, Situated, and Grounded Intelligence: Implications for AI. (arXiv:2210.13589v1 [cs.AI])
    In April of 2022, the Santa Fe Institute hosted a workshop on embodied, situated, and grounded intelligence as part of the Institute's Foundations of Intelligence project. The workshop brought together computer scientists, psychologists, philosophers, social scientists, and others to discuss the science of embodiment and related issues in human intelligence, and its implications for building robust, human-level AI. In this report, we summarize each of the talks and the subsequent discussions. We also draw out a number of key themes and identify important frontiers for future research.  ( 2 min )
    A Temporally Consistent Image-based Sun Tracking Algorithm for Solar Energy Forecasting Applications. (arXiv:2012.01059v3 [cs.CV] UPDATED)
    Improving irradiance forecasting is critical to further increase the share of solar in the energy mix. On a short time scale, fish-eye cameras on the ground are used to capture cloud displacements causing the local variability of the electricity production. As most of the solar radiation comes directly from the Sun, current forecasting approaches use its position in the image as a reference to interpret the cloud cover dynamics. However, existing Sun tracking methods rely on external data and a calibration of the camera, which requires access to the device. To address these limitations, this study introduces an image-based Sun tracking algorithm to localise the Sun in the image when it is visible and interpolate its daily trajectory from past observations. We validate the method on a set of sky images collected over a year at SIRTA's lab. Experimental results show that the proposed method provides robust smooth Sun trajectories with a mean absolute error below 1% of the image size.
    QTI Submission to DCASE 2021: residual normalization for device-imbalanced acoustic scene classification with efficient design. (arXiv:2206.13909v2 [cs.SD] UPDATED)
This technical report describes the details of our TASK1A submission for the DCASE2021 challenge. The goal of the task is to design an audio scene classification system for device-imbalanced datasets under constraints on model complexity. This report introduces four methods to achieve the goal. First, we propose Residual Normalization, a novel feature normalization method that uses instance normalization with a shortcut path to discard unnecessary device-specific information without losing useful information for classification. Second, we design an efficient architecture, BC-ResNet-Mod, a modified version of the baseline architecture with a limited receptive field. Third, we exploit spectrogram-to-spectrogram translation from one to multiple devices to augment the training data. Finally, we utilize three model compression schemes: pruning, quantization, and knowledge distillation to reduce model complexity. The proposed system achieves an average test accuracy of 76.3% on the TAU Urban Acoustic Scenes 2020 Mobile development dataset with 315k parameters, and an average test accuracy of 75.3% after compression to 61.0KB of non-zero parameters. We extend this work in [1].
    Conformal Inference for Online Prediction with Arbitrary Distribution Shifts. (arXiv:2208.08401v2 [stat.ME] UPDATED)
Conformal inference is a powerful tool for quantifying the uncertainty around predictions made by black-box models (e.g. neural nets, random forests). Formally, this methodology guarantees that if the training and test data are exchangeable (e.g. i.i.d.) then we can construct a prediction set $C$ for the target $Y$ such that $P(Y \in C) \geq 1-\alpha$ for any target level $\alpha$. In this article, we extend this methodology to an online prediction setting where the distribution generating the data is allowed to vary over time. To account for the non-exchangeability, we develop a protective layer that lies on top of conformal inference and gradually re-calibrates its predictions to adapt to the observed changes in the environment. Our methods are highly flexible and can be used in combination with any predictive algorithm that produces estimates of the target or its conditional distribution and without any assumptions on the size or type of the distribution shift. We test our techniques on two real-world datasets aimed at predicting stock market volatility and COVID-19 case counts and find that they are robust and adaptive to real-world distribution shifts.
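The abstract doesn't give the update rule; in the same spirit, here is a hedged sketch of online recalibration in which the working level is nudged after every round so empirical coverage tracks $1-\alpha$ under shift (a generic adaptive-conformal-style loop, not the paper's exact protocol):

import numpy as np

def online_conformal(residuals, alpha=0.1, gamma=0.005, window=200):
    # residuals[t] = |Y_t - prediction_t|; intervals are +/- q around
    # the prediction. Miscoverage pushes alpha_t down (wider sets),
    # coverage pushes it back up.
    alpha_t, history, covered = alpha, [], []
    for r in residuals:
        q = np.quantile(history[-window:], 1 - alpha_t) if history else np.inf
        err = float(r > q)                    # 1 if Y_t fell outside the set
        alpha_t = float(np.clip(alpha_t + gamma * (alpha - err), 1e-4, 1 - 1e-4))
        history.append(r)
        covered.append(1 - err)
    return np.mean(covered)                   # empirical coverage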
    The Curious Case of Benign Memorization. (arXiv:2210.14019v1 [cs.LG])
Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete overfitting of the data even if the labels are randomized, i.e. networks can completely memorize all given patterns. While such a memorization capacity seems worrisome, in this work we show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers. As a result, only the very last layers are used for memorization, while preceding layers encode performant features which remain largely unaffected by the label noise. We explore the intricate role of the augmentations used for training and identify a memorization-generalization trade-off in terms of their diversity, marking a clear distinction from all previous works. Finally, we give a first explanation for the emergence of benign memorization by showing that malign memorization under data augmentation is infeasible due to the insufficient capacity of the model for the increased sample size. As a consequence, the network is forced to leverage the correlated nature of the augmentations and as a result learns meaningful features. To complete the picture, a better theory of feature learning in deep neural networks is required to fully understand the origins of this phenomenon.
    DIMES: A Differentiable Meta Solver for Combinatorial Optimization Problems. (arXiv:2210.04123v2 [cs.LG] UPDATED)
Recently, deep reinforcement learning (DRL) models have shown promising results in solving NP-hard Combinatorial Optimization (CO) problems. However, most DRL solvers can only scale to a few hundred nodes for combinatorial optimization problems on graphs, such as the Traveling Salesman Problem (TSP). This paper addresses the scalability challenge in large-scale combinatorial optimization by proposing a novel approach, namely, DIMES. Unlike previous DRL methods which suffer from costly autoregressive decoding or iterative refinements of discrete solutions, DIMES introduces a compact continuous space for parameterizing the underlying distribution of candidate solutions. Such a continuous space allows stable REINFORCE-based training and fine-tuning via massively parallel sampling. We further propose a meta-learning framework to enable the effective initialization of model parameters in the fine-tuning stage. Extensive experiments show that DIMES outperforms recent DRL-based methods on large benchmark datasets for Traveling Salesman Problems and Maximal Independent Set problems.
    Towards an Error-free Deep Occupancy Detector for Smart Camera Parking System. (arXiv:2208.08220v2 [cs.CV] UPDATED)
Although the smart camera parking system concept has existed for decades, few approaches have fully addressed the system's scalability and reliability. As the cornerstone of a smart parking system is the ability to detect occupancy, traditional methods use a classification backbone to predict spots from a manually labeled grid. This is time-consuming and limits the system's scalability. Additionally, most approaches use deep learning models, making them not error-free and not reliable at scale. Thus, we propose an end-to-end smart camera parking system that provides autonomous occupancy detection via an object detector called OcpDet. Our detector also provides meaningful information from contrastive modules: training and spatial knowledge, which avert false detections during inference. We benchmark OcpDet on the existing PKLot dataset and reach competitive results compared to traditional classification solutions. We also introduce an additional SNU-SPS dataset, in which we estimate the system performance from various views and conduct system evaluation on parking assignment tasks. The results from our dataset show that our system is promising for real-world applications.
    Kurdish Handwritten Character Recognition using Deep Learning Techniques. (arXiv:2210.13734v1 [cs.CV])
Handwriting recognition is one of the active and challenging areas of research in the field of image processing and pattern recognition. It has many applications, including reading aids for the visually impaired, automated reading and processing of bank checks, making any handwritten document searchable, and converting documents into structured text form. Moreover, high accuracy rates have been recorded by handwriting recognition systems for English, Chinese, Arabic, Persian, and many other languages. Yet no such system is available for offline Kurdish handwriting recognition. In this paper, an attempt is made to design and develop a model that can recognize handwritten characters of the Kurdish alphabet using deep learning techniques. Kurdish (Sorani) contains 34 characters and mainly employs an Arabic/Persian-based script with modified alphabets. In this work, a Deep Convolutional Neural Network model is employed, as such models have shown exemplary performance in handwriting recognition systems. A comprehensive dataset of more than 40 thousand images of handwritten Kurdish characters was then created and used to train the Deep Convolutional Neural Network model for classification and recognition tasks. The experimental results show an acceptable recognition level: testing reported a 96% accuracy rate, and training a 97% accuracy rate. From these results, it is clear that the proposed deep learning model performs well and is comparable to similar models for other languages' handwriting recognition systems.
    Learning Explicit Object-Centric Representations with Vision Transformers. (arXiv:2210.14139v1 [cs.CV])
    With the recent successful adaptation of transformers to the vision domain, particularly when trained in a self-supervised fashion, it has been shown that vision transformers can learn impressive object-reasoning-like behaviour and features expressive for the task of object segmentation in images. In this paper, we build on the self-supervision task of masked autoencoding and explore its effectiveness for explicitly learning object-centric representations with transformers. To this end, we design an object-centric autoencoder using transformers only and train it end-to-end to reconstruct full images from unmasked patches. We show that the model efficiently learns to decompose simple scenes as measured by segmentation metrics on several multi-object benchmarks.
    What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0. (arXiv:2206.14348v2 [cs.CL] UPDATED)
Performance in natural language processing, and specifically for the question-answer task, is typically measured by comparing a model's most confident (primary) prediction to golden answers (the ground truth). We are making the case that it is also useful to quantify how close a model came to predicting a correct answer even for examples that failed. We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth, and show why such a match always exists. For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank. We refer to secondary predictions as those ranking above 0 in descending confidence probability order. We demonstrate how the GR can be used to classify questions and visualize their spectrum of difficulty, from persistent near successes to persistent extreme failures. We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model. To develop some intuition and explore the applicability of these metrics we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular transformer models from the Hugging Face hub. We first demonstrate that the GRIM is not directly correlated with the F1 and exact match (EM) scores. We then calculate and visualize these scores for various transformer architectures, probe their applicability in error analysis by clustering failed predictions, and compare how they relate to other training diagnostics such as the EM and F1 scores. We finally suggest various research goals, such as broadening data collection for these metrics and their possible use in adversarial training.
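The GR definition translates directly into code; a small sketch (assuming predictions are already sorted by descending confidence, as in the paper's setup):

def golden_rank(predictions: list[str], golden_answers: set[str]) -> int:
    # Rank (0-based) of the most confident prediction that exactly
    # matches a ground-truth answer; `predictions` must be sorted by
    # descending confidence. The paper argues such a match always exists.
    for rank, pred in enumerate(predictions):
        if pred in golden_answers:
            return rank
    raise ValueError("no prediction matched a golden answer")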
    Assembly Planning from Observations under Physical Constraints. (arXiv:2204.09616v2 [cs.RO] UPDATED)
    This paper addresses the problem of copying an unknown assembly of primitives with known shape and appearance using information extracted from a single photograph by an off-the-shelf procedure for object detection and pose estimation. The proposed algorithm uses a simple combination of physical stability constraints, convex optimization and Monte Carlo tree search to plan assemblies as sequences of pick-and-place operations represented by STRIPS operators. It is efficient and, most importantly, robust to the errors in object detection and pose estimation unavoidable in any real robotic system. The proposed approach is demonstrated with thorough experiments on a UR5 manipulator.
    TabMixer: Excavating Label Distribution Learning with Small-scale Features. (arXiv:2210.13852v1 [cs.LG])
    Label distribution learning (LDL) differs from multi-label learning in that it represents the polysemy of instances by transforming single-label values into descriptive degrees. Unfortunately, the feature space of a label distribution dataset is affected by human factors, and the inductive bias of the feature extractor causes uncertainty in the feature space. In particular, for datasets with small-scale feature spaces (the feature space dimension $\approx$ the label space dimension), the existing LDL algorithms do not perform well. To address this issue, we model the uncertainty augmentation of the feature space to alleviate the problem in LDL tasks. Specifically, we start by augmenting each feature value in the feature vector of a sample into a vector, sampled from a Gaussian distribution whose variance parameter is learned by a sub-network and whose mean parameter is set to the feature value itself. Each feature vector is thereby augmented to a matrix, which is fed into a mixer with local attention (\textit{TabMixer}) to extract the latent feature. Finally, the latent feature is squeezed to yield an accurate label distribution via a squeezed network. Extensive experiments verify that our proposed algorithm is competitive with other LDL algorithms on several benchmarks.
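    A minimal sketch of the Gaussian uncertainty augmentation described above, assuming PyTorch. The class name, the choice of sub-network for the variance, and the number of samples per feature are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class GaussianFeatureAugment(nn.Module):
    """Augment each scalar feature into k samples drawn from a Gaussian
    with mean = the feature value and a learned per-feature variance."""
    def __init__(self, n_features, k_samples=8):
        super().__init__()
        self.k = k_samples
        # Sub-network predicting a per-feature log-variance from the input.
        self.var_net = nn.Sequential(
            nn.Linear(n_features, n_features), nn.Tanh(),
            nn.Linear(n_features, n_features))

    def forward(self, x):                       # x: (batch, n_features)
        std = torch.exp(0.5 * self.var_net(x))  # (batch, n_features)
        eps = torch.randn(x.shape[0], self.k, x.shape[1], device=x.device)
        # Reparameterized sampling keeps the variance sub-network
        # trainable end-to-end; the mean is the feature value itself.
        return x.unsqueeze(1) + std.unsqueeze(1) * eps  # (batch, k, n_features)

aug = GaussianFeatureAugment(n_features=4)
print(aug(torch.randn(2, 4)).shape)  # torch.Size([2, 8, 4])
```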
    One-shot, Offline and Production-Scalable PID Optimisation with Deep Reinforcement Learning. (arXiv:2210.13906v1 [eess.SY])
    Proportional-integral-derivative (PID) control underlies more than $97\%$ of automated industrial processes. Controlling these processes effectively with respect to some specified set of performance goals requires finding an optimal set of PID parameters to moderate the PID loop. Tuning these parameters is a lengthy and laborious process. A method (patent pending) based on deep reinforcement learning is presented that learns a relationship between generic system properties (e.g. resonance frequency), a multi-objective performance goal and optimal PID parameter values. Performance is demonstrated in the context of a real optical switching product of the foremost manufacturer of such devices globally. Switching is handled by piezoelectric actuators where switching time and optical loss are derived from the speed and stability of actuator-control processes respectively. The method achieves a $5\times$ improvement in the number of actuators that fall within the most challenging target switching speed, $\geq 20\%$ improvement in mean switching speed at the same optical loss and $\geq 75\%$ reduction in performance inconsistency when temperature varies between 5 and 73 degrees Celsius. Furthermore, once trained (which takes $\mathcal{O}(hours)$), the model generates actuator-unique PID parameters in a one-shot inference process that takes $\mathcal{O}(ms)$ in comparison to up to $\mathcal{O}(week)$ required for conventional tuning methods, therefore accomplishing these performance improvements whilst achieving up to a $10^6\times$ speed-up. After training, the method can be applied entirely offline, incurring effectively zero optimisation-overhead in production.
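    For reference, a minimal discrete PID loop of the kind whose gains ($k_p$, $k_i$, $k_d$) such tuning methods optimise. The toy first-order plant and all gain values below are illustrative assumptions, unrelated to the paper's piezoelectric actuators.

```python
def pid_step(error, state, kp, ki, kd, dt):
    """One update of a discrete PID controller; `state` carries the
    integral term and the previous error between calls."""
    integral, prev_error = state
    integral += error * dt
    derivative = (error - prev_error) / dt
    output = kp * error + ki * integral + kd * derivative
    return output, (integral, error)

# Toy usage: drive a first-order plant y' = u - y towards a setpoint of 1.0.
kp, ki, kd, dt = 2.0, 0.5, 0.05, 0.01
y, state = 0.0, (0.0, 0.0)
for _ in range(2000):
    u, state = pid_step(1.0 - y, state, kp, ki, kd, dt)
    y += dt * (u - y)
print(round(y, 3))  # ~0.99, converging to the setpoint
```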
    Deep invariant networks with differentiable augmentation layers. (arXiv:2202.02142v6 [cs.LG] UPDATED)
    Designing learning systems which are invariant to certain data transformations is critical in machine learning. Practitioners can typically enforce a desired invariance on the trained model through the choice of a network architecture, e.g. using convolutions for translations, or using data augmentation. Yet, enforcing true invariance in the network can be difficult, and data invariances are not always known a priori. State-of-the-art methods for learning data augmentation policies require held-out data and are based on bilevel optimization problems, which are complex to solve and often computationally demanding. In this work we investigate new ways of learning invariances only from the training data. Using learnable augmentation layers built directly in the network, we demonstrate that our method is very versatile. It can incorporate any type of differentiable augmentation and be applied to a broad class of learning problems beyond computer vision. We provide empirical evidence showing that our approach is easier and faster to train than modern automatic data augmentation techniques based on bilevel optimization, while achieving comparable results. Experiments show that while the invariances transferred to a model through automatic data augmentation are limited by the model expressivity, the invariance yielded by our approach is insensitive to it by design.
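    A toy sketch of the core idea, assuming PyTorch: an augmentation layer whose parameters are trained jointly with the network, so the appropriate degree of invariance is learned from the training data. The Gaussian-noise augmentation and all names here are illustrative assumptions; the paper covers general differentiable augmentations.

```python
import torch
import torch.nn as nn

class LearnableNoiseAugment(nn.Module):
    """Differentiable augmentation layer: the noise magnitude is a
    trainable parameter, so how much invariance to this perturbation
    the model acquires is itself learned from the training data."""
    def __init__(self):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.tensor(-3.0))

    def forward(self, x):
        if self.training:  # augment only during training
            x = x + torch.exp(self.log_sigma) * torch.randn_like(x)
        return x

model = nn.Sequential(LearnableNoiseAugment(), nn.Linear(10, 2))
out = model(torch.randn(4, 10))
out.sum().backward()
print(model[0].log_sigma.grad is not None)  # True: gradient reaches the layer
```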
    Deeper-GXX: Deepening Arbitrary GNNs. (arXiv:2110.13798v3 [cs.LG] UPDATED)
    Recently, motivated by real applications, a major research direction in graph neural networks (GNNs) is to explore deeper structures. For instance, the graph connectivity is not always consistent with the label distribution (e.g., the closest neighbors of some nodes are not from the same category). In this case, GNNs need to stack more layers in order to find the same categorical neighbors along a longer path and capture the class-discriminative information. However, two major problems hinder deeper GNNs from obtaining satisfactory performance, i.e., vanishing gradients and over-smoothing. On one hand, stacking layers makes the neural network hard to train, as the gradients of the first few layers vanish. Moreover, when simply addressing the vanishing gradient in GNNs, we discover the shading neighbors effect (i.e., inappropriately stacking layers distorts the non-IID information of graphs and degrades the performance of GNNs). On the other hand, deeper GNNs aggregate much more information from common neighbors, such that individual node representations share more overlapping features, which makes the final output representations not discriminative (i.e., overly smoothed). In this paper, for the first time, we address both problems to enable deeper GNNs, and propose Deeper-GXX, which consists of the Weight-Decaying Graph Residual Connection module (WDG-ResNet) and Topology-Guided Graph Contrastive Loss (TGCL). Extensive experiments on real-world data sets demonstrate that Deeper-GXX outperforms state-of-the-art deeper baselines.
    DAGformer: Directed Acyclic Graph Transformer. (arXiv:2210.13148v2 [cs.LG] UPDATED)
    In many fields, such as natural language processing and computer vision, the Transformer architecture has become the standard. Recently, the Transformer architecture has also attracted a growing amount of interest in graph representation learning, since it naturally overcomes several restrictions of graph neural networks (GNNs). In this work, we focus on a special yet widely used class of graphs: directed acyclic graphs (DAGs). We propose the directed acyclic graph Transformer, DAGformer, a Transformer architecture that processes information according to the reachability relation defined by the partial order. DAGformer is simple and flexible, allowing it to be used with various transformer-based models. We show that our architecture achieves state-of-the-art performance on representative DAG datasets, outperforming all previous approaches.
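    A small sketch of the reachability relation such an architecture attends over. The mask construction below (transitive closure of the adjacency matrix) is standard; the convention that a node attends to its predecessors is an assumption for illustration.

```python
import numpy as np

def reachability_mask(adj):
    """Boolean attention mask for a DAG: entry (i, j) is True iff node i
    reaches node j (or i == j) under the partial order induced by the
    edges, where adj[i, j] = 1 denotes an edge i -> j."""
    n = adj.shape[0]
    reach = np.eye(n, dtype=bool) | adj.astype(bool)
    for k in range(n):  # Floyd-Warshall-style transitive closure
        reach |= reach[:, k:k+1] & reach[k:k+1, :]
    return reach

# Toy DAG: 0 -> 1 -> 2, plus a direct edge 0 -> 2.
adj = np.array([[0, 1, 1],
                [0, 0, 1],
                [0, 0, 0]])
print(reachability_mask(adj).astype(int))
# Masked attention would then set scores to -inf wherever the mask is
# False before the softmax, restricting each node to its predecessors.
```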
    Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task. (arXiv:2210.13382v2 [cs.LG] UPDATED)
    Language models show a surprising range of capabilities, but the source of their apparent competence is unclear. Do these networks just memorize a collection of surface statistics, or do they rely on internal representations of the process that generates the sequences they see? We investigate this question by applying a variant of the GPT model to the task of predicting legal moves in a simple board game, Othello. Although the network has no a priori knowledge of the game or its rules, we uncover evidence of an emergent nonlinear internal representation of the board state. Interventional experiments indicate this representation can be used to control the output of the network and create "latent saliency maps" that can help explain predictions in human terms.
    Aboveground carbon biomass estimate with Physics-informed deep network. (arXiv:2210.13752v1 [cs.LG])
    The global carbon cycle is a key process to understand how our climate is changing. However, monitoring the dynamics is difficult because a high-resolution robust measurement of key state parameters, including the aboveground carbon biomass (AGB), is required. Here, we use a deep neural network to generate a wall-to-wall map of AGB within the Continental USA (CONUS) with 30-meter spatial resolution for the year 2021. We combine radar and optical hyperspectral imagery with a physical climate parameter of SIF-based GPP. Validation results show that a masked variation of UNet has the lowest validation RMSE of 37.93 $\pm$ 1.36 Mg C/ha, as compared to 52.30 $\pm$ 0.03 Mg C/ha for a random forest algorithm. Furthermore, models that learn from SIF-based GPP in addition to radar and optical imagery reduce validation RMSE by almost 10% and the standard deviation by 40%. Finally, we apply our model to measure losses in AGB from the recent 2021 Caldor wildfire in California, and validate our analysis with a Sentinel-based burn index.
    Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding. (arXiv:2210.14169v1 [cs.CL])
    Dialogue understanding tasks often necessitate abundant annotated data to achieve good performance and that presents challenges in low-resource settings. To alleviate this barrier, we explore few-shot data augmentation for dialogue understanding by prompting large pre-trained language models and present a novel approach that iterates on augmentation quality by applying weakly-supervised filters. We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue. Models fine-tuned on our augmented data mixed with few-shot ground truth data are able to approach or surpass existing state-of-the-art performance on both datasets. For DailyDialog specifically, using 10% of the ground truth data we outperform the current state-of-the-art model which uses 100% of the data.
    Transformer-Empowered 6G Intelligent Networks: From Massive MIMO Processing to Semantic Communication. (arXiv:2205.03770v3 [cs.IT] UPDATED)
    6G wireless networks are foreseen to speed up the convergence of the physical and cyber worlds and to enable a paradigm-shift in the way we deploy and exploit communication networks. Machine learning, in particular deep learning (DL), is going to be one of the key technological enablers of 6G by offering a new paradigm for the design and optimization of networks with a high level of intelligence. In this article, we introduce an emerging DL architecture, known as the transformer, and discuss its potential impact on 6G network design. We first discuss the differences between the transformer and classical DL architectures, and emphasize the transformer's self-attention mechanism and strong representation capabilities, which make it particularly appealing in tackling various challenges in wireless network design. Specifically, we propose transformer-based solutions for massive multiple-input multiple-output (MIMO) systems and various semantic communication problems in 6G networks. Finally, we discuss key challenges and open issues in transformer-based solutions, and identify future research directions for their deployment in intelligent 6G networks.
    Speech Emotion Recognition using Supervised Deep Recurrent System for Mental Health Monitoring. (arXiv:2208.12812v2 [eess.AS] UPDATED)
    Understanding human behavior and monitoring mental health are essential to maintaining the safety of communities and society. As mental health problems have increased during the COVID-19 pandemic, often going unmanaged, early detection of mental health issues is crucial. Nowadays, the usage of Intelligent Virtual Personal Assistants (IVA) has increased worldwide. Individuals use their voices to control these devices to fulfill requests and acquire different services. This paper proposes a novel deep learning model based on gated recurrent neural networks and convolutional neural networks to recognize human emotion from speech, in order to improve IVA services and monitor users' mental health.
    An Optimal Stochastic Algorithm for Decentralized Nonconvex Finite-sum Optimization. (arXiv:2210.13931v1 [math.OC])
    This paper studies the synchronized decentralized nonconvex optimization problem of the form $\min_{x\in{\mathbb R}^d} f(x)\triangleq \frac{1}{m}\sum_{i=1}^m f_i(x)$, where $f_i(x)\triangleq \frac{1}{n}\sum_{j=1}^n f_{i,j}(x)$ is the local function on the $i$-th agent of the connected network. We propose a novel stochastic algorithm called DEcentralized probAbilistic Recursive gradiEnt deScenT (DEAREST), which integrates the techniques of variance reduction, gradient tracking and multi-consensus. We construct a Lyapunov function that simultaneously characterizes the function value, the gradient estimation error and the consensus error for the convergence analysis. Based on this measure, we provide a concise proof to show that DEAREST requires at most ${\mathcal O}(mn+\sqrt{mn}L\varepsilon^{-2})$ incremental first-order oracle (IFO) calls and ${\mathcal O}(L\varepsilon^{-2}/\sqrt{1-\lambda_2(W)}\,)$ communication rounds to find an $\varepsilon$-stationary point in expectation, where $L$ is the smoothness parameter and $\lambda_2(W)$ is the second-largest eigenvalue of the gossip matrix $W$. Both the IFO complexity and the communication complexity match the corresponding lower bounds. To the best of our knowledge, DEAREST is the first optimal algorithm for decentralized nonconvex finite-sum optimization.
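    To make two of these ingredients concrete, here is a minimal gradient-tracking loop with a gossip (consensus) step in plain NumPy. This is a generic sketch, not DEAREST itself: the paper additionally applies probabilistic recursive variance reduction over the local finite sums and multiple consensus rounds.

```python
import numpy as np

def gradient_tracking(grad_fns, W, x0, lr=0.1, iters=200):
    """Each agent i keeps a local iterate x_i and a tracker y_i estimating
    the global average gradient; a gossip step with mixing matrix W
    averages both quantities with the agent's neighbours."""
    m = len(grad_fns)
    x = np.tile(x0, (m, 1))
    y = np.stack([g(x0) for g in grad_fns])      # initialise the trackers
    g_prev = y.copy()
    for _ in range(iters):
        x = W @ x - lr * y                       # gossip + local descent
        g_new = np.stack([grad_fns[i](x[i]) for i in range(m)])
        y = W @ y + g_new - g_prev               # track the average gradient
        g_prev = g_new
    return x.mean(axis=0)

# Toy problem: agent i holds f_i(x) = 0.5 * ||x - c_i||^2, so the global
# minimizer is the mean of the centers c_i.
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([-1.0, 1.0])]
grads = [lambda x, c=c: x - c for c in centers]
W = np.full((3, 3), 1 / 3)                       # fully connected gossip
print(gradient_tracking(grads, W, np.zeros(2)))  # ~ [0.0, 0.667]
```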
    Hindering Adversarial Attacks with Implicit Neural Representations. (arXiv:2210.13982v1 [cs.LG])
    We introduce the Lossy Implicit Network Activation Coding (LINAC) defence, an input transformation which successfully hinders several common adversarial attacks on CIFAR-$10$ classifiers for perturbations up to $\epsilon = 8/255$ in $L_\infty$ norm and $\epsilon = 0.5$ in $L_2$ norm. Implicit neural representations are used to approximately encode pixel colour intensities in $2\text{D}$ images such that classifiers trained on transformed data appear to have robustness to small perturbations without adversarial training or large drops in performance. The seed of the random number generator used to initialise and train the implicit neural representation turns out to be necessary information for stronger generic attacks, suggesting its role as a private key. We devise a Parametric Bypass Approximation (PBA) attack strategy for key-based defences, which successfully invalidates an existing method in this category. Interestingly, our LINAC defence also hinders some transfer and adaptive attacks, including our novel PBA strategy. Our results emphasise the importance of a broad range of customised attacks despite apparent robustness according to standard evaluations. The LINAC source code and the parameters of the defended classifiers evaluated throughout this submission are available at: https://github.com/deepmind/linac
    Covariance matrix preparation for quantum principal component analysis. (arXiv:2204.03495v2 [quant-ph] UPDATED)
    Principal component analysis (PCA) is a dimensionality reduction method in data analysis that involves diagonalizing the covariance matrix of the dataset. Recently, quantum algorithms have been formulated for PCA based on diagonalizing a density matrix. These algorithms assume that the covariance matrix can be encoded in a density matrix, but a concrete protocol for this encoding has been lacking. Our work aims to address this gap. Assuming amplitude encoding of the data, with the data given by the ensemble $\{p_i,| \psi_i \rangle\}$, then one can easily prepare the ensemble average density matrix $\overline{\rho} = \sum_i p_i |\psi_i\rangle \langle \psi_i |$. We first show that $\overline{\rho}$ is precisely the covariance matrix whenever the dataset is centered. For quantum datasets, we exploit global phase symmetry to argue that there always exists a centered dataset consistent with $\overline{\rho}$, and hence $\overline{\rho}$ can always be interpreted as a covariance matrix. This provides a simple means for preparing the covariance matrix for arbitrary quantum datasets or centered classical datasets. For uncentered classical datasets, our method is so-called "PCA without centering", which we interpret as PCA on a symmetrized dataset. We argue that this closely corresponds to standard PCA, and we derive equations and inequalities that bound the deviation of the spectrum obtained with our method from that of standard PCA. We numerically illustrate our method for the MNIST handwritten digit dataset. We also argue that PCA on quantum datasets is natural and meaningful, and we numerically implement our method for molecular ground-state datasets.
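    A quick NumPy check of the centred-data claim above, under one common amplitude-encoding convention (states $|\psi_i\rangle = x_i/\lVert x_i\rVert$ with ensemble probabilities $p_i \propto \lVert x_i\rVert^2$; this convention is an assumption here): the ensemble density matrix then equals the covariance matrix up to its unit-trace normalisation.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
X -= X.mean(axis=0)                  # centre the dataset

# Amplitude encoding with p_i proportional to the squared norms.
norms2 = (X ** 2).sum(axis=1)
p = norms2 / norms2.sum()
rho = sum(p_i * np.outer(x, x) / n2  # sum_i p_i |psi_i><psi_i|
          for p_i, x, n2 in zip(p, X, norms2))

# For centred data, rho matches the covariance matrix after rescaling
# the latter to unit trace (density matrices have trace one).
cov = X.T @ X / len(X)
print(np.allclose(rho, cov / np.trace(cov)))  # True
```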
    From Points to Functions: Infinite-dimensional Representations in Diffusion Models. (arXiv:2210.13774v1 [cs.LG])
    Diffusion-based generative models learn to iteratively transfer unstructured noise to a complex target distribution, as opposed to Generative Adversarial Networks (GANs) or the decoder of Variational Autoencoders (VAEs), which produce samples from the target distribution in a single step. Thus, in diffusion models every sample is naturally connected to a random trajectory which is a solution to a learned stochastic differential equation (SDE). Generative models are only concerned with the final state of this trajectory that delivers samples from the desired distribution. Abstreiter et al. showed that these stochastic trajectories can be seen as continuous filters that wash out information along the way. Consequently, it is reasonable to ask if there is an intermediate time step at which the preserved information is optimal for a given downstream task. In this work, we show that a combination of information content from different time steps gives a strictly better representation for the downstream task. We introduce attention- and recurrence-based modules that ``learn to mix'' information content of various time steps such that the resultant representation leads to superior performance in downstream tasks.
    A Database of Ultrastable MOFs Reassembled from Stable Fragments with Machine Learning Models. (arXiv:2210.14191v1 [cond-mat.mtrl-sci])
    High-throughput screening of large hypothetical databases of metal-organic frameworks (MOFs) can uncover new materials, but their stability in real-world applications is often unknown. We leverage community knowledge and machine learning (ML) models to identify MOFs that are thermally stable and stable upon activation. We separate these MOFs into their building blocks and recombine them to make a new hypothetical MOF database of over 50,000 structures that samples orders of magnitude more connectivity nets and inorganic building blocks than prior databases. This database shows an order of magnitude enrichment of ultrastable MOF structures that are stable upon activation and more than one standard deviation more thermally stable than the average experimentally characterized MOF. For the nearly 10,000 ultrastable MOFs, we compute bulk elastic moduli to confirm these materials have good mechanical stability, and we report methane deliverable capacities. Our work identifies privileged metal nodes in ultrastable MOFs that optimize gas storage and mechanical stability simultaneously.
    Mixed-Effect Thompson Sampling. (arXiv:2205.15124v2 [cs.LG] UPDATED)
    A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. We propose Mixed-Effect Thompson Sampling (meTS) that uses this structure to explore efficiently and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. The terms reflect the structure of our model and the quality of priors. Our theoretical findings are validated empirically using both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform extremely well empirically and show the generality of the proposed framework.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v2 [cs.LG] UPDATED)
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    UNIFY: a Unified Policy Designing Framework for Solving Constrained Optimization Problems with Machine Learning. (arXiv:2210.14030v1 [cs.LG])
    The interplay between Machine Learning (ML) and Constrained Optimization (CO) has recently been the subject of increasing interest, leading to a new and prolific research area covering, e.g., Decision Focused Learning and Constrained Reinforcement Learning. Such approaches strive to tackle complex decision problems under uncertainty over multiple stages, involving both explicit knowledge (cost function, constraints) and implicit knowledge (from data), and possibly subject to execution time restrictions. While a good degree of success has been achieved, the existing methods still have limitations in terms of both applicability and effectiveness. For problems in this class, we propose UNIFY, a unified framework to design a solution policy for complex decision-making problems. Our approach relies on a clever decomposition of the policy into two stages, namely an unconstrained ML model and a CO problem, to take advantage of the strengths of each approach while compensating for its weaknesses. With a little design effort, UNIFY can generalize several existing approaches, thus extending their applicability. We demonstrate the method's effectiveness on two practical problems, namely an Energy Management System and the Set Multi-cover problem with stochastic coverage requirements. Finally, we highlight some current challenges of our method and future research directions that can benefit from the cross-fertilization of the two fields.
    Lottery Aware Sparsity Hunting: Enabling Federated Learning on Resource-Limited Edge. (arXiv:2208.13092v2 [cs.LG] UPDATED)
    Limited computation and communication capabilities of clients pose significant challenges in federated learning (FL) over resource-limited edge nodes. A potential solution to this problem is to deploy off-the-shelf sparse learning algorithms that train a binary sparse mask on each client, with the expectation of training a consistent sparse server mask yielding sparse weight tensors. However, as we investigate in this paper, such naive deployments result in a significant drop in accuracy compared to FL with dense models, especially for clients with limited resource budgets. In particular, our investigations reveal a serious lack of consensus among the trained sparsity masks on clients, which prevents convergence of the server mask and potentially leads to a substantial drop in model performance. Based on these key observations, we propose \textit{federated lottery aware sparsity hunting} (FLASH), a unified sparse learning framework to make the server win a lottery in terms of yielding a sparse sub-model able to maintain classification performance under highly resource-limited client settings. Moreover, to support FL on different devices requiring different parameter densities, we leverage our findings to present \textit{hetero-FLASH}, where clients can have different target sparsity budgets based on their device resource limits. Experimental evaluations with multiple models on various datasets (both IID and non-IID) show the superiority of our approach in closing the gap with the unpruned baseline, while yielding up to $\mathord{\sim}10.1\%$ improved accuracy with $\mathord{\sim}10.26\times$ fewer communication costs, compared to existing alternatives, at similar hyperparameter settings.
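    To illustrate the mask-consensus issue, here is a toy server-side aggregation rule that keeps the most frequently voted parameters across client masks, assuming PyTorch. This majority-style vote is an illustrative assumption, not FLASH's actual aggregation algorithm.

```python
import torch

def server_mask_vote(client_masks, density):
    """Keep the fraction `density` of parameters that received the most
    votes across the clients' binary masks."""
    votes = torch.stack(client_masks).sum(dim=0).flatten().float()
    k = int(density * votes.numel())
    mask = torch.zeros_like(votes)
    mask[torch.topk(votes, k).indices] = 1.0
    return mask.view(client_masks[0].shape)

torch.manual_seed(0)
masks = [(torch.rand(10) > 0.5).float() for _ in range(5)]
print(server_mask_vote(masks, density=0.5))  # exactly 5 of 10 weights kept
```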
    Slicing Aided Hyper Inference and Fine-tuning for Small Object Detection. (arXiv:2202.06934v5 [cs.CV] UPDATED)
    Detection of small objects and objects far away in the scene is a major challenge in surveillance applications. Such objects are represented by a small number of pixels in the image and lack sufficient detail, making them difficult to detect using conventional detectors. In this work, an open-source framework called Slicing Aided Hyper Inference (SAHI) is proposed that provides a generic slicing aided inference and fine-tuning pipeline for small object detection. The proposed technique is generic in the sense that it can be applied on top of any available object detector without any fine-tuning. Experimental evaluations, using object detection baselines on the Visdrone and xView aerial object detection datasets, show that the proposed inference method can increase object detection AP by 6.8%, 5.1% and 5.3% for FCOS, VFNet and TOOD detectors, respectively. Moreover, the detection accuracy can be further increased with slicing aided fine-tuning, resulting in a cumulative increase of 12.7%, 13.4% and 14.5% AP in the same order. The proposed technique has been integrated with Detectron2, MMDetection and YOLOv5 models, and it is publicly available at https://github.com/obss/sahi.git .
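    A minimal sketch of the slicing geometry behind such a pipeline (not the SAHI API itself): enumerate overlapping windows, run the detector on each crop, shift detections by the window offset, and merge the results, e.g. with non-maximum suppression. The window size and overlap values below are illustrative.

```python
def slice_windows(height, width, slice_size=640, overlap=0.2):
    """Enumerate overlapping slice windows (x1, y1, x2, y2) covering an
    image; assumes the image is at least slice_size in each dimension."""
    step = int(slice_size * (1 - overlap))
    xs = list(range(0, max(width - slice_size, 0) + 1, step))
    ys = list(range(0, max(height - slice_size, 0) + 1, step))
    if xs[-1] + slice_size < width:   # cover the right border
        xs.append(width - slice_size)
    if ys[-1] + slice_size < height:  # cover the bottom border
        ys.append(height - slice_size)
    return [(x, y, x + slice_size, y + slice_size) for y in ys for x in xs]

for window in slice_windows(1080, 1920)[:3]:
    print(window)  # (0, 0, 640, 640), (512, 0, 1152, 640), (1024, 0, 1664, 640)
```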
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v2 [cs.LG] UPDATED)
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field -- arising from approximating the Boltzmann distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
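    A schematic sketch of a denoising pre-training objective of this kind, assuming PyTorch: perturb equilibrium coordinates with Gaussian noise and regress the noise, which, under the Gaussian-mixture view above, amounts to learning a force field. The stand-in per-atom regressor ignores interactions between atoms and is purely illustrative; the paper uses networks that operate on full 3D molecular structures.

```python
import torch
import torch.nn as nn

def denoising_pretrain_loss(model, coords, sigma=0.1):
    """Perturb atomic coordinates with Gaussian noise and train the
    network to predict the added noise; `model` maps (n_atoms, 3) ->
    (n_atoms, 3)."""
    noise = sigma * torch.randn_like(coords)
    pred = model(coords + noise)
    return ((pred - noise) ** 2).mean()

# Toy usage with a stand-in per-atom regressor.
model = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 3))
loss = denoising_pretrain_loss(model, torch.randn(20, 3))
loss.backward()
print(float(loss))
```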
    DELATOR: Money Laundering Detection via Multi-Task Learning on Large Transaction Graphs. (arXiv:2205.10293v2 [cs.LG] UPDATED)
    Money laundering has become one of the most relevant criminal activities in modern societies, as it causes massive financial losses for governments, banks and other institutions. Detecting such activities is among the top priorities when it comes to financial analysis, but current approaches are often costly and labor intensive partly due to the sheer amount of data to be analyzed. Hence, there is a growing need for automatic anti-money laundering systems to assist experts. In this work, we propose DELATOR, a novel framework for detecting money laundering activities based on graph neural networks that learn from large-scale temporal graphs. DELATOR provides an effective and efficient method for learning from heavily imbalanced graph data, by adapting concepts from the GraphSMOTE framework and incorporating elements of multi-task learning to obtain rich node embeddings for node classification. DELATOR outperforms all considered baselines, including an off-the-shelf solution from Amazon AWS by 23% with respect to AUC-ROC. We also conducted real experiments that led to the discovery of 7 new suspicious cases among the 50 analyzed ones, which have been reported to the authorities.
    Photoacoustic image synthesis with generative adversarial networks. (arXiv:2103.15510v3 [eess.IV] UPDATED)
    Photoacoustic tomography (PAT) has the potential to recover morphological and functional tissue properties with high spatial resolution. However, previous attempts to solve the optical inverse problem with supervised machine learning were hampered by the absence of labeled reference data. While this bottleneck has been tackled by simulating training data, the domain gap between real and simulated images remains an unsolved challenge. We propose a novel approach to PAT image synthesis that involves subdividing the challenge of generating plausible simulations into two disjoint problems: (1) Probabilistic generation of realistic tissue morphology, and (2) pixel-wise assignment of corresponding optical and acoustic properties. The former is achieved with Generative Adversarial Networks (GANs) trained on semantically annotated medical imaging data. According to a validation study on a downstream task our approach yields more realistic synthetic images than the traditional model-based approach and could therefore become a fundamental step for deep learning-based quantitative PAT (qPAT).
    Sharpness-aware Minimization for Worst Case Optimization. (arXiv:2210.13533v1 [cs.LG])
    Improving worst-group performance and generalization performance is a core problem of current machine learning. There are diverse efforts to increase performance, such as weight norm penalties and data augmentation, but the improvements are limited. Recently, there have been two promising approaches to increase the worst-group performance and the generalization performance, respectively. Distributionally robust optimization (DRO) focuses on the worst or hardest group to improve worst-group performance. Sharpness-aware minimization (SAM) finds flat minima to increase the generalization ability on an unseen dataset. They show significant performance improvements on the worst-group dataset and the unseen dataset, respectively. However, DRO does not guarantee flatness, and SAM does not guarantee worst-group performance improvement. In other words, DRO and SAM may fail to increase the worst-group performance when a shift between the training and test datasets occurs. In this study, we propose a new approach, sharpness-aware group distributionally robust optimization (SGDRO), which finds flat minima that generalize well on the worst-group dataset. Unlike DRO and SAM, SGDRO improves the generalization ability even when distribution shift occurs. We validate that SGDRO exhibits a smaller maximum eigenvalue and improved performance on the worst group.
    Learning Individual Treatment Effects under Heterogeneous Interference in Networks. (arXiv:2210.14080v1 [cs.LG])
    Estimating individual treatment effects from networked observational data has been attracting increasing attention. One major challenge in network scenarios is the violation of the stable unit treatment value assumption (SUTVA), which posits that the treatment assignment of a unit does not influence others' outcomes. In network data, due to interference, the outcome of a unit is influenced not only by its own treatment (i.e., direct effects) but also by others' treatments (i.e., spillover effects). Furthermore, the influences from other units are always heterogeneous (e.g., friends with similar interests affect a person differently than friends with different interests). In this paper, we focus on the problem of estimating individual treatment effects (both direct and spillover effects) under heterogeneous interference. To address this issue, we propose a novel Dual Weighting Regression (DWR) algorithm that simultaneously learns attention weights capturing the heterogeneous interference and sample weights eliminating the complex confounding bias in networks. We formulate the entire learning process as a bi-level optimization problem. In theory, we present generalization error bounds for individual treatment effect estimation. Extensive experiments on four benchmark datasets demonstrate that the proposed DWR algorithm outperforms state-of-the-art methods for estimating individual treatment effects under heterogeneous interference.
    Food Ingredients Recognition through Multi-label Learning. (arXiv:2210.14147v1 [cs.CV])
    The ability to recognize various food items in a generic food plate is a key determinant for an automated diet assessment system. This study motivates the need for automated diet assessment and proposes a framework to achieve this. Within this framework, we focus on one of the core functionalities: visually recognizing various ingredients. To this end, we employ a deep multi-label learning approach and evaluate several state-of-the-art neural networks for their ability to detect an arbitrary number of ingredients in a dish image. The models evaluated in this work follow a definite meta-structure, consisting of an encoder and a decoder component. Two distinct decoding schemes, one based on global average pooling and the other on an attention mechanism, are evaluated and benchmarked. For encoding, several well-known architectures, including DenseNet, EfficientNet, MobileNet, Inception and Xception, were employed. We present promising preliminary results for deep learning-based ingredient detection, using a challenging dataset, Nutrition5K, and establish a strong baseline for future explorations.
    Is an encoder within reach?. (arXiv:2206.01552v2 [cs.LG] UPDATED)
    The encoder network of an autoencoder is an approximation of the nearest point projection onto the manifold spanned by the decoder. A concern with this approximation is that, while the output of the encoder is always unique, the projection can possibly have infinitely many values. This implies that the latent representations learned by the autoencoder can be misleading. Borrowing from geometric measure theory, we introduce the idea of using the reach of the manifold spanned by the decoder to determine if an optimal encoder exists for a given dataset and decoder. We develop a local generalization of this reach and propose a numerical estimator thereof. We demonstrate that this allows us to determine which observations can be expected to have a unique, and thereby trustworthy, latent representation. As our local reach estimator is differentiable, we investigate its usage as a regularizer and show that this leads to learned manifolds for which projections are more often unique than without regularization.
    Graded-Q Reinforcement Learning with Information-Enhanced State Encoder for Hierarchical Collaborative Multi-Vehicle Pursuit. (arXiv:2210.13470v1 [cs.LG])
    The multi-vehicle pursuit (MVP) problem, abstracted from various real-world scenarios, is becoming a hot research topic in Intelligent Transportation Systems (ITS). The combination of Artificial Intelligence (AI) and connected vehicles has greatly promoted research on MVP. However, existing works on MVP pay little attention to the importance of information exchange and cooperation among pursuing vehicles in complex urban traffic environments. This paper proposes a graded-Q reinforcement learning with information-enhanced state encoder (GQRL-IESE) framework to address this hierarchical collaborative multi-vehicle pursuit (HCMVP) problem. In the GQRL-IESE, a cooperative graded Q scheme is proposed to facilitate the decision-making of pursuing vehicles and improve pursuing efficiency. Each pursuing vehicle further uses a deep Q network (DQN) to make decisions based on its encoded state. A coordinated Q optimizing network adjusts the individual decisions based on the current environment traffic information to obtain the global optimal action set. In addition, an information-enhanced state encoder is designed to extract critical information from multiple perspectives and uses an attention mechanism to assist each pursuing vehicle in effectively determining the target. Extensive experimental results based on SUMO indicate that the proposed GQRL-IESE requires 47.64% fewer total timesteps than other methods on average, demonstrating its excellent pursuing efficiency. Code is available at https://github.com/ANT-ITS/GQRL-IESE.
    Learning Partial Equivariances from Data. (arXiv:2110.10211v2 [cs.CV] UPDATED)
    Group Convolutional Neural Networks (G-CNNs) constrain learned features to respect the symmetries in the selected group, and lead to better generalization when these symmetries appear in the data. If this is not the case, however, equivariance leads to overly constrained models and worse performance. Frequently, transformations occurring in data can be better represented by a subset of a group than by a group as a whole, e.g., rotations in $[-90^{\circ}, 90^{\circ}]$. In such cases, a model that respects equivariance $\textit{partially}$ is better suited to represent the data. In addition, relevant transformations may differ for low and high-level features. For instance, full rotation equivariance is useful to describe edge orientations in a face, but partial rotation equivariance is better suited to describe face poses relative to the camera. In other words, the optimal level of equivariance may differ per layer. In this work, we introduce $\textit{Partial G-CNNs}$: G-CNNs able to learn layer-wise levels of partial and full equivariance to discrete, continuous groups and combinations thereof as part of training. Partial G-CNNs retain full equivariance when beneficial, e.g., for rotated MNIST, but adjust it whenever it becomes harmful, e.g., for classification of 6 / 9 digits or natural images. We empirically show that partial G-CNNs match G-CNNs when full equivariance is advantageous, and outperform them otherwise.
    Off-Policy Correction for Actor-Critic Methods without Importance Sampling. (arXiv:2208.00755v2 [cs.LG] UPDATED)
    Compared to on-policy policy gradient techniques, off-policy model-free deep reinforcement learning (RL) that uses previously gathered data can improve sampling efficiency. However, off-policy learning becomes challenging when the discrepancy between the distributions of the policy of interest and the policies that collected the data increases. Although the well-studied importance sampling and off-policy policy gradient techniques were proposed to compensate for this discrepancy, they usually require a collection of long trajectories, which increases the computational complexity and induces additional problems such as vanishing/exploding gradients or the discarding of many useful experiences. Moreover, their generalization to continuous action domains is strictly limited, as they require action probabilities, which is unsuitable for deterministic policies. To overcome these limitations, we introduce a novel policy similarity measure to mitigate the effects of such discrepancy. Our method offers an adequate single-step off-policy correction without any probability estimates, and theoretical results show that it can achieve a contraction mapping with a fixed unique point, which allows "safe" off-policy learning. An extensive set of empirical results indicates that our algorithm substantially improves the state of the art and attains higher returns in fewer steps than the competing methods by efficiently scheduling the learning rate in Q-learning and policy optimization.
    LiteLSTM Architecture for Deep Recurrent Neural Networks. (arXiv:2201.11624v2 [cs.LG] UPDATED)
    Long short-term memory (LSTM) is a robust recurrent neural network architecture for learning spatiotemporal sequential data. However, it requires significant computational power for learning and implementation, from both software and hardware perspectives. This paper proposes a novel LiteLSTM architecture that reduces the computational components of the LSTM using the weight-sharing concept, lowering the overall architecture cost while maintaining its performance. The proposed LiteLSTM can be significant for learning big data in applications where time consumption is crucial, such as securing IoT devices and processing medical data. Moreover, it helps to reduce the CO2 footprint. The proposed model was evaluated and tested empirically on two different datasets from the computer vision and cybersecurity domains.
    BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping. (arXiv:2206.12038v4 [cs.SD] UPDATED)
    Methods for extracting audio and speech features have been studied since pioneering work on spectrum analysis decades ago. Recent efforts are guided by the ambition to develop general-purpose audio representations. For example, deep neural networks can extract optimal embeddings if they are trained on large audio datasets. This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets. Lastly, we present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features. All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks. Our results indicate that the hybrid model with a convolutional transformer as the encoder yields superior performance in most HEAR challenge tasks.
    A Statistically-Based Approach to Feedforward Neural Network Model Selection. (arXiv:2207.04248v2 [stat.ME] UPDATED)
    Feedforward neural networks (FNNs) can be viewed as non-linear regression models, where covariates enter the model through a combination of weighted summations and non-linear functions. Although these models have some similarities to the models typically used in statistical modelling, the majority of neural network research has been conducted outside of the field of statistics. This has resulted in a lack of statistically-based methodology, and, in particular, there has been little emphasis on model parsimony. Determining the input layer structure is analogous to variable selection, while the structure for the hidden layer relates to model complexity. In practice, neural network model selection is often carried out by comparing models using out-of-sample performance. However, in contrast, the construction of an associated likelihood function opens the door to information-criteria-based variable and architecture selection. A novel model selection method, which performs both input- and hidden-node selection, is proposed using the Bayesian information criterion (BIC) for FNNs. The choice of BIC over out-of-sample performance as the model selection objective function leads to an increased probability of recovering the true model, while parsimoniously achieving favourable out-of-sample performance. Simulation studies are used to evaluate and justify the proposed method, and applications on real data are investigated.
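    A minimal sketch of the criterion itself, lower being better; the Gaussian log-likelihood helper and the toy parameter counts are illustrative assumptions, standing in for the FNN likelihood the paper constructs.

```python
import numpy as np

def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k * ln(n) - 2 * ln(L_hat)."""
    return n_params * np.log(n_obs) - 2.0 * log_likelihood

def gaussian_loglik(residuals):
    """Maximized Gaussian log-likelihood of a regression fit, with the
    noise variance set to its maximum-likelihood estimate."""
    n = len(residuals)
    sigma2 = np.mean(residuals ** 2)
    return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

# Toy comparison: a slightly better fit does not justify many more params.
rng = np.random.default_rng(1)
resid_small = rng.normal(scale=1.0, size=200)    # parsimonious model
resid_big = rng.normal(scale=0.98, size=200)     # better fit, 200 params
print(bic(gaussian_loglik(resid_small), n_params=10, n_obs=200))
print(bic(gaussian_loglik(resid_big), n_params=200, n_obs=200))  # penalised
```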
    A deep learning approach for brain tumor detection using magnetic resonance imaging. (arXiv:2210.13882v1 [eess.IV])
    The growth of abnormal cells in the brain's tissue causes brain tumors. Brain tumors are considered one of the most dangerous disorders in children and adults. They develop quickly, and the patient's survival prospects are slim if not appropriately treated. Proper treatment planning and precise diagnoses are essential to improving a patient's life expectancy. Brain tumors are mainly diagnosed using magnetic resonance imaging (MRI). We propose a convolutional neural network (CNN) architecture containing five convolution layers, five max-pooling layers, a flatten layer, and two dense layers for detecting brain tumors from MRI images. The proposed model includes an automatic feature extractor, a modified hidden layer architecture, and an activation function. Several test cases were performed, and the proposed model achieved 98.6% accuracy and a 97.8% precision score with a low cross-entropy rate. Compared with other approaches such as the adjacent feature propagation network (AFPNet), mask region-based CNN (mask RCNN), YOLOv5, and Fourier CNN (FCNN), the proposed model performs better at detecting brain tumors.
    Reconstruction on Trees and Low-Degree Polynomials. (arXiv:2109.06915v3 [math.PR] UPDATED)
    The study of Markov processes and broadcasting on trees has deep connections to a variety of areas including statistical physics, graphical models, phylogenetic reconstruction, Markov Chain Monte Carlo, and community detection in random graphs. Notably, the celebrated Belief Propagation (BP) algorithm achieves Bayes-optimal performance for the reconstruction problem of predicting the value of the Markov process at the root of the tree from its values at the leaves. Recently, the analysis of low-degree polynomials has emerged as a valuable tool for predicting computational-to-statistical gaps. In this work, we investigate the performance of low-degree polynomials for the reconstruction problem on trees. Perhaps surprisingly, we show that there are simple tree models with $N$ leaves and bounded arity where (1) nontrivial reconstruction of the root value is possible with a simple polynomial time algorithm and with robustness to noise, but not with any polynomial of degree $N^{c}$ for $c > 0$ a constant depending only on the arity, and (2) when the tree is unknown and given multiple samples with correlated root assignments, nontrivial reconstruction of the root value is possible with a simple Statistical Query algorithm but not with any polynomial of degree $N^c$. These results clarify some of the limitations of low-degree polynomials vs. polynomial time algorithms for Bayesian estimation problems. They also complement recent work of Moitra, Mossel, and Sandon who studied the circuit complexity of Belief Propagation. As a consequence of our main result, we show that for some $c' > 0$ depending only on the arity, $\exp(N^{c'})$ many samples are needed for RBF kernel regression to obtain nontrivial correlation with the true regression function (BP). We pose related open questions about low-degree polynomials and the Kesten-Stigum threshold.
    Temporal Latent Bottleneck: Synthesis of Fast and Slow Processing Mechanisms in Sequence Learning. (arXiv:2205.14794v2 [cs.LG] UPDATED)
    Recurrent neural networks have a strong inductive bias towards learning temporally compressed representations, as the entire history of a sequence is represented by a single vector. By contrast, Transformers have little inductive bias towards learning temporally compressed representations, as they allow for attention over all previously computed elements in a sequence. Having a more compressed representation of a sequence may be beneficial for generalization, as a high-level representation may be more easily re-used and re-purposed and will contain fewer irrelevant details. At the same time, excessive compression of representations comes at the cost of expressiveness. We propose a solution which divides computation into two streams. A slow stream that is recurrent in nature aims to learn a specialized and compressed representation, by forcing chunks of $K$ time steps into a single representation which is divided into multiple vectors. At the same time, a fast stream is parameterized as a Transformer to process chunks consisting of $K$ time-steps conditioned on the information in the slow-stream. In the proposed approach we hope to gain the expressiveness of the Transformer, while encouraging better compression and structuring of representations in the slow stream. We show the benefits of the proposed method in terms of improved sample efficiency and generalization performance as compared to various competitive baselines for visual perception and sequential decision making tasks.
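    A schematic PyTorch sketch of the two-stream idea: a Transformer layer (fast stream) processes each chunk of K steps together with the slow state, and a recurrent cell (slow stream) compresses the chunk into the next slow state. All architectural details here are illustrative assumptions; in particular, the paper divides the slow representation into multiple vectors, while this sketch keeps a single one for brevity.

```python
import torch
import torch.nn as nn

class FastSlow(nn.Module):
    """Fast stream: attention within a chunk, conditioned on the slow
    state. Slow stream: recurrent update from a chunk summary."""
    def __init__(self, d=64, k=8):
        super().__init__()
        self.k = k
        self.fast = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.slow_cell = nn.GRUCell(d, d)
        self.init_slow = nn.Parameter(torch.zeros(1, d))

    def forward(self, x):                        # x: (batch, T, d), T % k == 0
        b, t, d = x.shape
        slow = self.init_slow.expand(b, -1)      # (batch, d)
        outs = []
        for i in range(0, t, self.k):
            chunk = torch.cat([slow.unsqueeze(1), x[:, i:i + self.k]], dim=1)
            h = self.fast(chunk)                 # fast stream over the chunk
            outs.append(h[:, 1:])                # per-step outputs
            slow = self.slow_cell(h[:, 1:].mean(dim=1), slow)  # slow update
        return torch.cat(outs, dim=1)

print(FastSlow()(torch.randn(2, 32, 64)).shape)  # torch.Size([2, 32, 64])
```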
    A Spectral Method for Assessing and Combining Multiple Data Visualizations. (arXiv:2210.13711v1 [stat.ML])
    Dimension reduction and data visualization aim to project a high-dimensional dataset to a low-dimensional space while capturing the intrinsic structures in the data. It is an indispensable part of modern data science, and many dimensional reduction and visualization algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it critically important to evaluate their relative performance for a given dataset, and to leverage and combine their individual strengths. In this paper, we propose an efficient spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure -- the visualization eigenscore -- of the relative performance of the visualizations for preserving the structure around each data point. Then it leverages the eigenscores to obtain a consensus visualization, which has much improved quality over the individual visualizations in capturing the underlying true data structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple simulated and real-world datasets from diverse applications to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the superiority of the proposed consensus visualization. Furthermore, we establish rigorous theoretical justification of our method based on a general statistical framework, yielding fundamental principles behind the empirical success of consensus visualization along with practical guidance.
    Some Simulation and Empirical Results for Semi-Supervised Learning of the Bayes Rule of Allocation. (arXiv:2210.13785v1 [stat.ML])
    There has been increasing attention in machine learning to semi-supervised learning (SSL) approaches for forming a classifier in situations where the training data consist of some feature vectors that have their class labels missing. In this study, we consider the generative model approach proposed by Ahfock & McLachlan (2020), who introduced a framework with a missingness mechanism for the missing labels of the unclassified features. In the case of two multivariate normal classes with a common covariance matrix, they showed that the estimated Bayes' rule formed by this SSL approach can actually have a lower error rate than the one formed from a completely classified sample. In this study, we consider this rather surprising result in cases where there may be more than two normal classes with not necessarily common covariance matrices.
    Modelling Residential Supply Tasks Based on Digital Orthophotography Using Machine Learning. (arXiv:2210.14013v1 [eess.SY])
    In order to achieve the climate targets, electrification of individual mobility is essential. However, grid integration of electric vehicles poses challenges for the electrical distribution network due to high charging power and simultaneity. To investigate these challenges in research studies, the network-referenced supply task needs to be modeled. Previous research work utilizes data that is not always complete or sufficiently granular in space. This is why this paper presents a methodology which allows a holistic determination of residential supply tasks based on orthophotos. To do this, buildings are first identified from orthophotos, then residential building types are classified, and finally the electricity demand of each building is determined. In an exemplary case study, we validate the presented methodology and compare the results with another supply task methodology. The results show that the electricity demand deviates from the results of a reference method by 9% on average. Deviations result mainly from the parameterization of the selected residential building types. Thus, the presented methodology is able to model supply tasks similarly to other methods, but at a finer spatial granularity.
    Analyzing Privacy Leakage in Machine Learning via Multiple Hypothesis Testing: A Lesson From Fano. (arXiv:2210.13662v1 [cs.LG])
    Differential privacy (DP) is by far the most widely accepted framework for mitigating privacy risks in machine learning. However, exactly how small the privacy parameter $\epsilon$ needs to be to protect against certain privacy risks in practice is still not well-understood. In this work, we study data reconstruction attacks for discrete data and analyze them under the framework of multiple hypothesis testing. We utilize different variants of the celebrated Fano's inequality to derive upper bounds on the inferential power of a data reconstruction adversary when the model is trained differentially privately. Importantly, we show that if the underlying private data take values from a set of size $M$, then the target privacy parameter $\epsilon$ can be $O(\log M)$ before the adversary gains significant inferential power. Our analysis offers theoretical evidence for the empirical effectiveness of DP against data reconstruction attacks even at relatively large values of $\epsilon$.
    DeXtreme: Transfer of Agile In-hand Manipulation from Simulation to Reality. (arXiv:2210.13702v1 [cs.RO])
    Recent work has demonstrated the ability of deep reinforcement learning (RL) algorithms to learn complex robotic behaviours in simulation, including in the domain of multi-fingered manipulation. However, such models can be challenging to transfer to the real world due to the gap between simulation and reality. In this paper, we present our techniques to train a) a policy that can perform robust dexterous manipulation on an anthropomorphic robot hand and b) a robust pose estimator suitable for providing reliable real-time information on the state of the object being manipulated. Our policies are trained to adapt to a wide range of conditions in simulation. Consequently, our vision-based policies significantly outperform the best vision policies in the literature on the same reorientation task and are competitive with policies that are given privileged state information via motion capture systems. Our work reaffirms the possibilities of sim-to-real transfer for dexterous manipulation in diverse kinds of hardware and simulator setups, and in our case, with the Allegro Hand and Isaac Gym GPU-based simulation. Furthermore, it opens up possibilities for researchers to achieve such results with commonly-available, affordable robot hands and cameras. Videos of the resulting policy and supplementary information, including experiments and demos, can be found at \url{https://dextreme.org/}
    Predicting Survival Outcomes in the Presence of Unlabeled Data. (arXiv:2210.13891v1 [cs.LG])
    Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently observed drop-out, there are often also organizational and financial challenges, which can lead to reduced data collection and, in turn, can complicate subsequent analyses. In contrast, there is often plenty of baseline data available from patients with similar characteristics and background information, e.g., from patients that fall outside the study time window. In this article, we investigate whether we can benefit from the inclusion of such unlabeled data instances to predict accurate survival times. In other words, we introduce a third level of supervision in the context of survival analysis: apart from fully observed and censored instances, we also include unlabeled instances. We propose three approaches to deal with this novel setting and provide an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results demonstrate that all approaches are able to increase the predictive performance over independent test data. We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper approach generally provides the best results, often achieving high improvements, compared to not using unlabeled data.
    Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds. (arXiv:2210.14051v1 [cs.LG])
    We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one. We prove that both of them attain a $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|H}H\sqrt{HS^2AT})$ regret upper bound, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon and $T$ the total number of time steps. It matches RSVI2 proposed in \cite{fei2021exponential} with a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, which bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
    PolyHope: Dataset Creation for a Two-Level Hope Speech Detection Task from Tweets. (arXiv:2210.14136v1 [cs.CL])
    Hope is characterized as openness of spirit toward the future, a desire, expectation, and wish for something to happen or to be true that remarkably affects a person's state of mind, emotions, behaviors, and decisions. Hope is usually associated with concepts of desired expectations and possibility/probability concerning the future. Despite its importance, hope has rarely been studied as a social media analysis task. This paper presents a hope speech dataset that classifies each tweet first into "Hope" and "Not Hope", then into three fine-grained hope categories: "Generalized Hope", "Realistic Hope", and "Unrealistic Hope" (along with "Not Hope"). English tweets from the first half of 2022 were collected to build this dataset. Furthermore, we describe our annotation process and guidelines in detail and discuss the challenges of classifying hope and the limitations of existing hope speech detection corpora. In addition, we report several baselines based on different learning approaches, such as traditional machine learning, deep learning, and transformers, to benchmark our dataset. We evaluated our baselines using weighted-averaged and macro-averaged F1-scores. Observations show that a strict process for annotator selection and detailed annotation guidelines enhanced the dataset's quality. This strict annotation process resulted in promising performance for simple machine learning classifiers with only bi-grams; however, binary and multiclass hope speech detection results reveal that contextual embedding models have higher performance on this dataset.
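    As a concrete reference for the transformer baselines mentioned above, here is a minimal binary hope-speech classification sketch; the checkpoint name, label mapping, and omission of fine-tuning code are illustrative assumptions, not the authors' exact configuration.

        import torch
        from transformers import AutoModelForSequenceClassification, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForSequenceClassification.from_pretrained(
            "bert-base-uncased", num_labels=2)  # binary task: "Not Hope" vs. "Hope"

        batch = tokenizer(["Things will surely get better next year."],
                          return_tensors="pt", truncation=True, padding=True)
        with torch.no_grad():
            logits = model(**batch).logits
        pred = logits.argmax(dim=-1)  # hypothetical mapping: 0 = Not Hope, 1 = Hope

    In practice the model would first be fine-tuned on the annotated tweets and then evaluated with weighted- and macro-averaged F1, as in the paper.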
    Online Cross-Layer Knowledge Distillation on Graph Neural Networks with Deep Supervision. (arXiv:2210.13743v1 [cs.LG])
    Graph neural networks (GNNs) have become one of the most popular research topics in both academia and industry for their strong ability to handle irregular graph data. However, large-scale datasets pose great challenges for deploying GNNs on edge devices with limited resources, and model compression techniques have drawn considerable research attention. Existing model compression techniques such as knowledge distillation (KD) mainly focus on convolutional neural networks (CNNs). Only limited attempts have been made recently for distilling knowledge from GNNs in an offline manner. As the performance of the teacher model does not necessarily improve as the number of layers increases in GNNs, selecting an appropriate teacher model requires substantial effort. To address these challenges, we propose a novel online knowledge distillation framework called Alignahead++ in this paper. Alignahead++ transfers structure and feature information in a student layer to the previous layer of another simultaneously trained student model in an alternating training procedure. Meanwhile, to avoid the over-smoothing problem in GNNs, deep supervision is employed in Alignahead++ by adding an auxiliary classifier in each intermediate layer to prevent the collapse of the node feature embeddings. Experimental results on four datasets including PPI, Cora, PubMed and CiteSeer demonstrate that the student performance is consistently boosted in our collaborative training framework without the supervision of a pre-trained teacher model, and its effectiveness can generally be improved by increasing the number of students.
    I Prefer not to Say: Operationalizing Fair and User-guided Data Minimization. (arXiv:2210.13954v1 [cs.LG])
    To grant users greater authority over their personal data, policymakers have suggested tighter data protection regulations (e.g., GDPR, CCPA). One key principle within these regulations is data minimization, which urges companies and institutions to only collect data that is relevant and adequate for the purpose of the data analysis. In this work, we take a user-centric perspective on this regulation, and let individual users decide which data they deem adequate and relevant to be processed by a machine-learned model. We require that users who decide to provide optional information should appropriately benefit from sharing their data, while users who rely on the mandate to leave their data undisclosed should not be penalized for doing so. This gives rise to the overlooked problem of fair treatment between individuals providing additional information and those choosing not to. While the classical fairness literature focuses on fair treatment between advantaged and disadvantaged groups, an initial look at this problem through the lens of classical fairness notions reveals that they are incompatible with these desiderata. We offer a solution to this problem by proposing the notion of Optional Feature Fairness (OFF) that follows from our requirements. To operationalize OFF, we derive a multi-model strategy and a tractable logistic regression model. We analyze the effect and the cost of applying OFF on several real-world data sets.
    Deep Neural Networks as the Semi-classical Limit of Topological Quantum Neural Networks: The problem of generalisation. (arXiv:2210.13741v1 [quant-ph])
    Deep Neural Networks lack a principled model of their operation. A novel framework for supervised learning based on Topological Quantum Field Theory, which looks particularly well suited for implementation on quantum processors, has recently been explored. We propose the use of this framework for understanding the problem of generalization in Deep Neural Networks. More specifically, in this approach Deep Neural Networks are viewed as the semi-classical limit of Topological Quantum Neural Networks. A framework of this kind readily explains the overfitting behavior of Deep Neural Networks during training and the corresponding generalization capabilities.
    Influence Functions for Sequence Tagging Models. (arXiv:2210.14177v1 [cs.CL])
    Many language tasks (e.g., Named Entity Recognition, Part-of-Speech tagging, and Semantic Role Labeling) are naturally framed as sequence tagging problems. However, there has been comparatively little work on interpretability methods for sequence tagging models. In this paper, we extend influence functions - which aim to trace predictions back to the training points that informed them - to sequence tagging tasks. We define the influence of a training instance segment as the effect that perturbing the labels within this segment has on a test segment level prediction. We provide an efficient approximation to compute this, and show that it tracks with the true segment influence, measured empirically. We show the practical utility of segment influence by using the method to identify systematic annotation errors in two named entity recognition corpora. Code to reproduce our results is available at https://github.com/successar/Segment_Influence_Functions.
    Multi-Fidelity Bayesian Optimization with Unreliable Information Sources. (arXiv:2210.13937v1 [cs.LG])
    Bayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, therefore defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog with high, controllable probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods in the presence of unreliable sources. We expect rMFBO to be particularly useful for reliably including human experts with varying knowledge within BO processes.
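    For readers unfamiliar with the underlying machinery, the sketch below performs one single-fidelity BO step with a Gaussian process surrogate and expected improvement under a minimization convention; the toy objective and hyperparameters are illustrative, and rMFBO's robust multi-fidelity logic is not reproduced here.

        import numpy as np
        from scipy.stats import norm
        from sklearn.gaussian_process import GaussianProcessRegressor

        def expected_improvement(X_cand, gp, f_best):
            mu, sigma = gp.predict(X_cand, return_std=True)
            sigma = np.maximum(sigma, 1e-12)
            z = (f_best - mu) / sigma  # minimization convention
            return (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

        rng = np.random.default_rng(0)
        f = lambda x: np.sin(3 * x[:, 0]) + 0.1 * rng.standard_normal(len(x))
        X = rng.uniform(0, 2, (5, 1)); y = f(X)            # initial design
        gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
        X_cand = np.linspace(0, 2, 200)[:, None]
        x_next = X_cand[np.argmax(expected_improvement(X_cand, gp, y.min()))]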
    Entity Divider with Language Grounding in Multi-Agent Reinforcement Learning. (arXiv:2210.13942v1 [cs.LG])
    We investigate the use of natural language to drive the generalization of policies in multi-agent settings. Unlike single-agent settings, the generalization of policies should also consider the influence of other agents. Besides, with the increasing number of entities in multi-agent settings, more agent-entity interactions are needed for language grounding, and the enormous search space could impede the learning process. Moreover, given a simple general instruction, e.g., beating all enemies, agents are required to decompose it into multiple subgoals and figure out the right one to focus on. Inspired by previous work, we try to address these issues at the entity level and propose a novel framework for language grounding in multi-agent reinforcement learning, entity divider (EnDi). EnDi enables agents to independently learn subgoal division at the entity level and act in the environment based on the associated entities. The subgoal division is regularized by opponent modeling to avoid subgoal conflicts and promote coordinated strategies. Empirically, EnDi demonstrates strong generalization to unseen games with new dynamics and outperforms existing methods.
    InForecaster: Forecasting Influenza Hemagglutinin Mutations Through the Lens of Anomaly Detection. (arXiv:2210.13709v1 [cs.LG])
    The influenza virus hemagglutinin is an important part of the virus attachment to the host cells. The hemagglutinin proteins are one of the genetic regions of the virus with a high potential for mutations. Due to the importance of predicting mutations in producing effective and low-cost vaccines, solutions that attempt to approach this problem have recently gained significant attention. A historical record of mutations has been used to train predictive models in such solutions. However, the imbalance between mutations and preserved proteins is a major challenge for the development of such models that needs to be addressed. Here, we propose to tackle this challenge through anomaly detection (AD). AD is a well-established field in Machine Learning (ML) that tries to distinguish unseen anomalies from normal patterns using only normal training samples. By considering mutations as the anomalous behavior, we can benefit from the rich set of solutions in this field that have emerged recently. Such methods also fit the problem setup of extreme imbalance between the number of unmutated vs. mutated training samples. Motivated by this formulation, our method tries to find a compact representation for unmutated samples while forcing anomalies to be separated from the normal ones. This helps the model learn a shared unique representation of normal training samples as much as possible, which improves the discernibility and detectability of mutated samples from the unmutated ones at test time. We conduct a large number of experiments on four publicly available datasets, consisting of 3 different hemagglutinin protein datasets and one SARS-CoV-2 dataset, and show the effectiveness of our method through different standard criteria.
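    The compact-representation idea can be sketched in a few lines of PyTorch, in the spirit of Deep SVDD-style one-class objectives; the encoder architecture, input dimension, and fixed center are illustrative assumptions rather than the paper's model.

        import torch
        import torch.nn as nn

        encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))
        center = torch.zeros(32)  # fixed center of the normal-sample hypersphere

        def compactness_loss(x_unmutated):
            # Pull embeddings of unmutated (normal) samples toward the center.
            z = encoder(x_unmutated)
            return ((z - center) ** 2).sum(dim=1).mean()

        def anomaly_score(x):
            # Larger distance from the center suggests a likely mutation.
            with torch.no_grad():
                return ((encoder(x) - center) ** 2).sum(dim=1)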
    Revisiting Softmax for Uncertainty Approximation in Text Classification. (arXiv:2210.14037v1 [cs.LG])
    Uncertainty approximation in text classification is an important area with applications in domain adaptation and interpretability. The most widely used uncertainty approximation method is Monte Carlo Dropout, which is computationally expensive as it requires multiple forward passes through the model. A cheaper alternative is to simply use a softmax to estimate model uncertainty. However, prior work has indicated that the softmax can generate overconfident uncertainty estimates and can thus be tricked into producing incorrect predictions. In this paper, we perform a thorough empirical analysis of both methods on five datasets with two base neural architectures in order to reveal insight into the trade-offs between the two. We compare the methods' uncertainty approximations and downstream text classification performance, while weighing their performance against their computational complexity as a cost-benefit analysis, by measuring runtime (cost) and the downstream performance (benefit). We find that, while Monte Carlo Dropout produces the best uncertainty approximations, using a simple softmax leads to competitive uncertainty estimation for text classification at a much lower computational cost, suggesting that the softmax can in fact be a sufficient uncertainty estimate when computational resources are a concern.
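    A minimal sketch of the two estimators being compared, assuming a placeholder classifier: a single deterministic softmax pass versus T stochastic passes with dropout kept active (Monte Carlo Dropout).

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        model = nn.Sequential(nn.Linear(300, 128), nn.ReLU(),
                              nn.Dropout(0.5), nn.Linear(128, 4))
        x = torch.randn(8, 300)  # a batch of (placeholder) text encodings

        model.eval()
        softmax_conf = F.softmax(model(x), dim=-1).max(dim=-1).values  # one pass

        model.train()  # keep dropout active at inference time
        T = 20
        probs = torch.stack([F.softmax(model(x), dim=-1) for _ in range(T)])
        mc_conf = probs.mean(dim=0).max(dim=-1).values  # MC Dropout confidence

    The cost asymmetry is visible directly: the softmax estimate needs one forward pass where MC Dropout needs T.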
    SeismicNet: Physics-informed neural networks for seismic wave modeling in semi-infinite domain. (arXiv:2210.14044v1 [physics.geo-ph])
    There has been increasing interest in integrating physics knowledge and machine learning for modeling dynamical systems. However, very limited studies have been conducted on seismic wave modeling tasks. A critical challenge is that these geophysical problems are typically defined in large domains (i.e., semi-infinite), which leads to high computational cost. In this paper, we present a novel physics-informed neural network (PINN) model for seismic wave modeling in a semi-infinite domain without the need for labeled data. Specifically, the absorbing boundary condition is introduced into the network as a soft regularizer for handling truncated boundaries. In terms of computational efficiency, we consider a sequential training strategy via temporal domain decomposition to improve the scalability of the network and solution accuracy. Moreover, we design a novel surrogate modeling strategy for parametric loading, which estimates the wave propagation in a semi-infinite domain given the seismic loading at different locations. Various numerical experiments have been implemented to evaluate the performance of the proposed PINN model in the context of forward modeling of seismic wave propagation. In particular, we define diverse material distributions to test the versatility of this approach. The results demonstrate excellent solution accuracy under distinctive scenarios.
    COEP: Cascade Optimization for Inverse Problems with Entropy-Preserving Hyperparameter Tuning. (arXiv:2210.13983v1 [cs.LG])
    We propose COEP, an automated and principled framework to solve inverse problems with deep generative models. COEP consists of two components, a cascade algorithm for optimization and an entropy-preserving criterion for hyperparameter tuning. Through COEP, the two components build an efficient, end-to-end solver for inverse problems that requires no human evaluation. We establish theoretical guarantees for the proposed methods. We also empirically validate the strength of COEP on denoising and noisy compressed sensing, which are two fundamental tasks in inverse problems.
    Towards Robust Recommender Systems via Triple Cooperative Defense. (arXiv:2210.13762v1 [cs.LG])
    Recommender systems are often susceptible to well-crafted fake profiles, leading to biased recommendations. The wide application of recommender systems makes studying defenses against such attacks necessary. Among existing defense methods, data-processing-based methods inevitably exclude normal samples, while model-based methods struggle to enjoy both generalization and robustness. Considering the above limitations, we suggest integrating data processing and a robust model, and propose a general framework, Triple Cooperative Defense (TCD), which improves model robustness through the co-training of three models. Specifically, in each round of training, we sequentially use the high-confidence prediction ratings (consistent ratings) of any two models as auxiliary training data for the remaining model, and the three models cooperatively improve recommendation robustness. Notably, TCD adds pseudo-label data instead of deleting abnormal data, which avoids the cleaning of normal data, and the cooperative training of the three models is also beneficial to model generalization. Through extensive experiments with five poisoning attacks on three real-world datasets, the results show that the robustness improvement of TCD significantly outperforms baselines. It is worth mentioning that TCD is also beneficial for model generalization.
    Bit Error and Block Error Rate Training for ML-Assisted Communication. (arXiv:2210.14103v1 [cs.IT])
    Even though machine learning (ML) techniques are being widely used in communications, the question of how to train communication systems has received surprisingly little attention. In this paper, we show that the commonly used binary cross-entropy (BCE) loss is a sensible choice in uncoded systems, e.g., for training ML-assisted data detectors, but may not be optimal in coded systems. We propose new loss functions targeted at minimizing the block error rate, as well as SNR de-weighting, a novel method that trains communication systems for optimal performance over a range of signal-to-noise ratios. The utility of the proposed loss functions as well as of SNR de-weighting is shown through simulations in NVIDIA Sionna.
    Conditionally Risk-Averse Contextual Bandits. (arXiv:2210.13573v1 [stat.ML])
    We desire to apply contextual bandits to scenarios where average-case statistical guarantees are inadequate. Happily, we discover that the composition of reduction to online regression and expectile loss is analytically tractable, computationally convenient, and empirically effective. The result is the first risk-averse contextual bandit algorithm with an online regret guarantee. We state our precise regret guarantee and conduct experiments from diverse scenarios in dynamic pricing, inventory management, and self-tuning software, including results from a production exascale cloud data processing system.
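    For reference, a minimal expectile loss, the asymmetrically weighted squared error at the heart of the reduction; the value tau = 0.9 is an illustrative risk-aversion level, not a recommendation from the paper.

        import torch

        def expectile_loss(pred, target, tau=0.9):
            # Weight positive and negative errors asymmetrically, so the
            # learner focuses on one tail of the outcome distribution.
            err = target - pred
            weight = torch.where(err >= 0,
                                 torch.full_like(err, tau),
                                 torch.full_like(err, 1.0 - tau))
            return (weight * err ** 2).mean()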
    Meta-learning Pathologies from Radiology Reports using Variance Aware Prototypical Networks. (arXiv:2210.13979v1 [cs.LG])
    Large pretrained Transformer-based language models like BERT and GPT have changed the landscape of Natural Language Processing (NLP). However, fine-tuning such models still requires a large number of training examples for each target task, so annotating multiple datasets and training these models on various downstream tasks becomes time-consuming and expensive. In this work, we propose a simple extension of the Prototypical Networks for few-shot text classification. Our main idea is to replace the class prototypes with Gaussians and introduce a regularization term that encourages the examples to be clustered near the appropriate class centroids. Experimental results show that our method outperforms various strong baselines on 13 public and 4 internal datasets. Furthermore, we use the class distributions as a tool for detecting potential out-of-distribution (OOD) data points during deployment.
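    The Gaussian-prototype idea reduces to a small amount of code: per-class means and variances from the support set, with a variance-scaled distance for classification. The shapes and the diagonal-covariance choice below are illustrative assumptions, not the paper's exact formulation.

        import torch

        def gaussian_prototypes(support_emb, support_labels, num_classes, eps=1e-4):
            protos, variances = [], []
            for c in range(num_classes):
                emb_c = support_emb[support_labels == c]
                protos.append(emb_c.mean(dim=0))
                variances.append(emb_c.var(dim=0, unbiased=False) + eps)
            return torch.stack(protos), torch.stack(variances)

        def classify(query_emb, protos, variances):
            # Squared distance, scaled per class and per dimension by variance.
            d = ((query_emb[:, None, :] - protos[None]) ** 2 / variances[None]).sum(-1)
            return (-d).softmax(dim=-1)  # closer classes get higher probability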
    PlanT: Explainable Planning Transformers via Object-Level Representations. (arXiv:2210.14222v1 [cs.RO])
    Planning an optimal route in a complex environment requires efficient reasoning about the surrounding scene. While human drivers prioritize important objects and ignore details not relevant to the decision, learning-based planners typically extract features from dense, high-dimensional grid representations containing all vehicle and road context information. In this paper, we propose PlanT, a novel approach for planning in the context of self-driving that uses a standard transformer architecture. PlanT is based on imitation learning with a compact object-level input representation. On the Longest6 benchmark for CARLA, PlanT outperforms all prior methods (matching the driving score of the expert) while being 5.3x faster than equivalent pixel-based planning baselines during inference. Combining PlanT with an off-the-shelf perception module provides a sensor-based driving system that is more than 10 points better in terms of driving score than the existing state of the art. Furthermore, we propose an evaluation protocol to quantify the ability of planners to identify relevant objects, providing insights regarding their decision-making. Our results indicate that PlanT can focus on the most relevant object in the scene, even when this object is geometrically distant.
    Multi-modal Dynamic Graph Network: Coupling Structural and Functional Connectome for Disease Diagnosis and Classification. (arXiv:2210.13721v1 [eess.IV])
    Multi-modal neuroimaging technology has greatly improved diagnostic efficiency and accuracy, providing complementary information for discovering objective disease biomarkers. Conventional deep learning methods, e.g. convolutional neural networks, overlook relationships between nodes and fail to capture topological properties in graphs. Graph neural networks have been proven to be of great importance in modeling brain connectome networks and relating disease-specific patterns. However, most existing graph methods explicitly require known graph structures, which are not available in the sophisticated brain system. Especially in heterogeneous multi-modal brain networks, modeling interactions among brain regions while accounting for inter-modal dependencies remains a great challenge. In this study, we propose a Multi-modal Dynamic Graph Convolution Network (MDGCN) for structural and functional brain network learning. Our method benefits from modeling inter-modal representations and relating attentive multi-modal associations into dynamic graphs with a compositional correspondence matrix. Moreover, a bilateral graph convolution layer is proposed to aggregate multi-modal representations in terms of multi-modal associations. Extensive experiments on three datasets demonstrate the superiority of our proposed method in terms of disease classification, with accuracies of 90.4%, 85.9% and 98.3% in predicting Mild Cognitive Impairment (MCI), Parkinson's disease (PD), and schizophrenia (SCHZ), respectively. Furthermore, our statistical evaluations of the correspondence matrix exhibit a high correspondence with previous evidence of biomarkers.
    Pruning's Effect on Generalization Through the Lens of Training and Regularization. (arXiv:2210.13738v1 [cs.LG])
    Practitioners frequently observe that pruning improves model generalization. A long-standing hypothesis based on bias-variance trade-off attributes this generalization improvement to model size reduction. However, recent studies on over-parameterization characterize a new model size regime, in which larger models achieve better generalization. Pruning models in this over-parameterized regime leads to a contradiction -- while theory predicts that reducing model size harms generalization, pruning to a range of sparsities nonetheless improves it. Motivated by this contradiction, we re-examine pruning's effect on generalization empirically. We show that size reduction cannot fully account for the generalization-improving effect of standard pruning algorithms. Instead, we find that pruning leads to better training at specific sparsities, improving the training loss over the dense model. We find that pruning also leads to additional regularization at other sparsities, reducing the accuracy degradation due to noisy examples over the dense model. Pruning extends model training time and reduces model size. These two factors improve training and add regularization respectively. We empirically demonstrate that both factors are essential to fully explaining pruning's impact on generalization.
    Magnetic Resonance Spectroscopy Deep Learning Denoising Using Few In Vivo Data. (arXiv:2101.11442v4 [physics.med-ph] UPDATED)
    Magnetic Resonance Spectroscopy (MRS) is a noninvasive tool to reveal metabolic information. One challenge of 1H-MRS is the low Signal-to-Noise Ratio (SNR). To improve the SNR, a typical approach is to perform Signal Averaging (SA) with M repeated samples. The data acquisition time, however, is increased by M times accordingly, and a complete clinical MRS scan takes approximately 10 minutes at a common setting of M=128. Recently, deep learning has been introduced to improve the SNR, but most approaches use simulated data as the training set. This may hinder MRS applications since potential differences, such as acquisition system imperfections and physiological and psychological conditions, may exist between the simulated and in vivo data. Here, we propose a new scheme that purely uses repeated samples of realistic data. A deep learning model, Refusion Long Short-Term Memory (ReLSTM), was designed to learn the mapping from the low-SNR time-domain data (24 SA) to the high-SNR one (128 SA). Experiments on the in vivo brain spectra of 7 healthy subjects, 2 brain tumor patients and 1 cerebral infarction patient showed that, using only 20% of the repeated samples, the spectra denoised by ReLSTM could provide estimated metabolite concentrations comparable to 128 SA. Compared with the state-of-the-art low-rank denoising method, ReLSTM achieved lower relative errors and Cram\'er-Rao lower bounds in quantifying some important biomarkers. In summary, ReLSTM can perform high-fidelity denoising of spectra under fast acquisition (24 SA), which would be valuable to MRS clinical studies.
    Energy-Based Contrastive Learning of Visual Representations. (arXiv:2202.04933v2 [cs.LG] UPDATED)
    Contrastive learning is a method of learning visual representations by training Deep Neural Networks (DNNs) to increase the similarity between representations of positive pairs (transformations of the same image) and reduce the similarity between representations of negative pairs (transformations of different images). Here we explore Energy-Based Contrastive Learning (EBCLR) that leverages the power of generative learning by combining contrastive learning with Energy-Based Models (EBMs). EBCLR can be theoretically interpreted as learning the joint distribution of positive pairs, and it shows promising results on small and medium-scale datasets such as MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100. Specifically, we find EBCLR demonstrates a 4x to 20x acceleration compared to SimCLR and MoCo v2 in terms of training epochs. Furthermore, in contrast to SimCLR, we observe EBCLR achieves nearly the same performance with 254 negative pairs (batch size 128) and 30 negative pairs (batch size 16) per positive pair, demonstrating the robustness of EBCLR to small numbers of negative pairs. Hence, EBCLR provides a novel avenue for improving contrastive learning methods that usually require large datasets with a significant number of negative pairs per iteration to achieve reasonable performance on downstream tasks. Code: https://github.com/1202kbs/EBCLR
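    As a reference point for the contrastive half of the objective, here is a minimal one-sided InfoNCE loss in the SimCLR family; the energy-based generative component that distinguishes EBCLR is not reproduced here.

        import torch
        import torch.nn.functional as F

        def info_nce(z1, z2, temperature=0.5):
            # z1, z2: (N, d) embeddings of two augmented views of the same images.
            z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
            logits = z1 @ z2.t() / temperature  # (N, N) similarity matrix
            labels = torch.arange(len(z1))      # positives lie on the diagonal
            return F.cross_entropy(logits, labels)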
    Gradient-based Weight Density Balancing for Robust Dynamic Sparse Training. (arXiv:2210.14012v1 [cs.LG])
    Training a sparse neural network from scratch requires optimizing connections at the same time as the weights themselves. Typically, the weights are redistributed after a predefined number of weight updates, removing a fraction of the parameters of each layer and inserting them at different locations in the same layers. The density of each layer is determined using heuristics, often purely based on the size of the parameter tensor. While the connections per layer are optimized multiple times during training, the density of each layer typically remains constant. This leaves great unrealized potential, especially in scenarios with a high sparsity of 90% and more. We propose Global Gradient-based Redistribution, a technique which distributes weights across all layers - adding more weights to the layers that need them most. Our evaluation shows that our approach is less prone to unbalanced weight distribution at initialization than previous work and that it is able to find better performing sparse subnetworks at very high sparsity levels.
    GlobalFlowNet: Video Stabilization using Deep Distilled Global Motion Estimates. (arXiv:2210.13769v1 [cs.CV])
    Videos shot by laymen using hand-held cameras contain undesirable shaky motion. Estimating the global motion between successive frames, in a manner not influenced by moving objects, is central to many video stabilization techniques, but poses significant challenges. A large body of work uses 2D affine transformations or homography for the global motion. However, in this work, we introduce a more general representation scheme, which adapts any existing optical flow network to ignore the moving objects and obtain a spatially smooth approximation of the global motion between video frames. We achieve this by a knowledge distillation approach, where we first introduce a low pass filter module into the optical flow network to constrain the predicted optical flow to be spatially smooth. This becomes our student network, named as \textsc{GlobalFlowNet}. Then, using the original optical flow network as the teacher network, we train the student network using a robust loss function. Given a trained \textsc{GlobalFlowNet}, we stabilize videos using a two stage process. In the first stage, we correct the instability in affine parameters using a quadratic programming approach constrained by a user-specified cropping limit to control loss of field of view. In the second stage, we stabilize the video further by smoothing global motion parameters, expressed using a small number of discrete cosine transform coefficients. In extensive experiments on a variety of different videos, our technique outperforms state of the art techniques in terms of subjective quality and different quantitative measures of video stability. The source code is publicly available at \href{https://github.com/GlobalFlowNet/GlobalFlowNet}{https://github.com/GlobalFlowNet/GlobalFlowNet}
    Deep NURBS -- Admissible Neural Networks. (arXiv:2210.13900v1 [math.NA])
    In this study, we propose a new numerical scheme for physics-informed neural networks (PINNs) that enables precise and inexpensive solutions of partial differential equations (PDEs) for arbitrary geometries while strictly enforcing Dirichlet boundary conditions. The proposed approach combines admissible NURBS parametrizations required to define the physical domain and the Dirichlet boundary conditions with a PINN solver. The fundamental boundary conditions are automatically satisfied in this novel Deep NURBS framework. We verified our new approach on two-dimensional elliptic PDEs with arbitrary geometries, including non-Lipschitz domains. Compared to the classical PINN solver, the Deep NURBS estimator has a remarkably high convergence rate for all the studied problems. Moreover, desirable accuracy was achieved for most of the studied PDEs using only one hidden layer of neural networks. This novel approach could pave the way for more effective solutions of high-dimensional problems by allowing for more realistic physics-informed statistical learning to solve PDE-based variational problems.
    SleepMore: Sleep Prediction at Scale via Multi-Device WiFi Sensing. (arXiv:2210.14152v1 [eess.SP])
    The availability of commercial wearable trackers equipped with features to monitor sleep duration and quality has enabled more useful sleep health monitoring applications and analyses. However, much research has reported the challenge of long-term user retention in sleep monitoring through these modalities. Since modern Internet users own multiple mobile devices, our work explores the possibility of employing ubiquitous mobile devices and passive WiFi sensing techniques to predict sleep duration as the fundamental measure for complementing long-term sleep monitoring initiatives. In this paper, we propose SleepMore, an accurate and easy-to-deploy sleep-tracking approach based on machine learning over the user's WiFi network activity. It first employs a semi-personalized random forest model with an infinitesimal jackknife variance estimation method to classify a user's network activity behavior into sleep and awake states at per-minute granularity. Through a moving average technique, the system uses these state sequences to estimate the user's nocturnal sleep period and its uncertainty rate. Uncertainty quantification enables SleepMore to overcome the impact of noisy WiFi data that can yield large prediction errors. We validate SleepMore using data from a month-long user study involving 46 college students and draw comparisons with the Oura Ring wearable. Beyond the college campus, we evaluate SleepMore on non-student users of different housing profiles. Our results demonstrate that SleepMore produces statistically indistinguishable sleep statistics from the Oura ring baseline for predictions made within a 5% uncertainty rate. These errors range between 15-28 minutes for determining sleep time and 7-29 minutes for determining wake time, representing statistically significant improvements over prior work. Our in-depth analysis explains the sources of errors.
    Barycentric-alignment and reconstruction loss minimization for domain generalization. (arXiv:2109.01902v5 [cs.LG] UPDATED)
    Domain generalization theory and methods are important for the success of Open World Pattern Recognition. The paper advances the current state of the art in this context by proposing a novel theoretical analysis and a practical algorithm. In particular, we revisit the Domain Generalization (DG) problem, where the hypotheses are composed of a common representation mapping followed by a labeling function. Popular DG methods optimize a well-known upper bound of the risk in the unseen domain to learn both the optimal representation and labeling functions. However, the widely used bound contains a term that is not optimized due to its dual dependence on the representation mapping and the unknown optimal labeling function in the unseen domain. To fill this gap, we derive a new upper bound free of terms having such dual dependence. Our derivation leverages old and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Compared to previous bounds, our bound introduces two new terms: (i) the Wasserstein-2 barycenter term for the distribution alignment between domains and (ii) the reconstruction loss term for measuring how well the data can be reconstructed from its representation. Based on the new upper bound, we propose a novel DG algorithm that simultaneously minimizes the classification loss, the barycenter loss, and the reconstruction loss. Experiments on several datasets demonstrate superior performance of the proposed method compared to the state-of-the-art DG algorithms.
    Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data. (arXiv:2210.13958v1 [cs.LG])
    Several approaches have been developed to mitigate algorithmic bias stemming from health data poverty, where minority groups are underrepresented in training datasets. Augmenting the minority class using resampling (such as SMOTE) is a widely used approach due to the simplicity of the algorithms. However, these algorithms decrease data variability and may introduce correlations between samples, giving rise to the use of generative approaches based on GANs. Generation of high-dimensional, time-series, authentic data that provides wide distribution coverage of the real data remains a challenging task for both resampling and GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses some of the shortcomings of the current approaches, and we provide a detailed comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series, real dataset of 3343 hypotensive Caucasian and Black patients. We show that our approach is better at both generating authentic data of the minority class and remaining within the original distribution of the real data.
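    For context, the resampling baseline is a one-liner with imbalanced-learn; the feature matrix and labels below stand in for a flattened time-series dataset and are purely illustrative.

        import numpy as np
        from imblearn.over_sampling import SMOTE

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))
        y = np.r_[np.zeros(950, dtype=int), np.ones(50, dtype=int)]  # 5% minority

        X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
        print(np.bincount(y_res))  # classes balanced via interpolated samples

    As the abstract notes, such interpolation reduces variability and can introduce correlations between samples, which motivates the generative alternative.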
    Whitening Convergence Rate of Coupling-based Normalizing Flows. (arXiv:2210.14032v1 [cs.LG])
    Coupling-based normalizing flows (e.g. RealNVP) are a popular family of normalizing flow architectures that work surprisingly well in practice. This calls for theoretical understanding. Existing work shows that such flows weakly converge to arbitrary data distributions. However, they make no statement about the stricter convergence criterion used in practice, the maximum likelihood loss. For the first time, we make a quantitative statement about this kind of convergence: We prove that all coupling-based normalizing flows perform whitening of the data distribution (i.e. diagonalize the covariance matrix) and derive corresponding convergence bounds that show a linear convergence rate in the depth of the flow. Numerical experiments demonstrate the implications of our theory and point at open questions.
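    To make the object of the theorem concrete, here is a minimal affine coupling layer of the RealNVP kind together with its log-determinant contribution; the toy conditioner MLP and tanh-bounded scales are illustrative choices.

        import torch
        import torch.nn as nn

        class AffineCoupling(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.half = dim // 2
                self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                         nn.Linear(64, 2 * (dim - self.half)))

            def forward(self, x):
                x1, x2 = x[:, :self.half], x[:, self.half:]
                s, t = self.net(x1).chunk(2, dim=1)
                s = torch.tanh(s)           # keep scales well-conditioned
                y2 = x2 * torch.exp(s) + t  # affine transform of one half
                log_det = s.sum(dim=1)      # enters the maximum likelihood loss
                return torch.cat([x1, y2], dim=1), log_det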
    Deep Bayesian Active Learning for Accelerating Stochastic Simulation. (arXiv:2106.02770v6 [cs.LG] UPDATED)
    Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. While deep surrogate models can speed up the simulations, doing so for stochastic simulations and with active learning approaches is an underexplored area. We propose Interactive Neural Process (INP), a deep Bayesian active learning framework for learning deep surrogate models to accelerate stochastic simulations. INP consists of two components: a spatiotemporal surrogate model built upon the Neural Process (NP) family and an acquisition function for active learning. For surrogate modeling, we develop Spatiotemporal Neural Process (STNP) to mimic the simulator dynamics. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. We perform a theoretical analysis and demonstrate that LIG reduces sample complexity compared with random sampling in high dimensions. We also conduct empirical studies on two complex spatiotemporal simulators for reaction diffusion and infectious disease. The results demonstrate that STNP outperforms the baselines in the offline learning setting and LIG achieves the state-of-the-art for Bayesian active learning.
    Networked Signal and Information Processing. (arXiv:2210.13767v1 [eess.SP])
    The article reviews significant advances in networked signal and information processing, which have enabled in the last 25 years extending decision making and inference, optimization, control, and learning to the increasingly ubiquitous environments of distributed agents. As these interacting agents cooperate, new collective behaviors emerge from local decisions and actions. Moreover, and significantly, theory and applications show that networked agents, through cooperation and sharing, are able to match the performance of cloud or federated solutions, while preserving privacy, increasing resilience, and saving resources.
    Eigen Memory Tree. (arXiv:2210.14077v1 [cs.LG])
    This work introduces the Eigen Memory Tree (EMT), a novel online memory model for sequential learning scenarios. EMTs store data at the leaves of a binary tree and route new samples through the structure using the principal components of previous experiences, facilitating efficient (logarithmic) access to relevant memories. We demonstrate that EMT outperforms existing online memory approaches, and provide a hybridized EMT-parametric algorithm that enjoys drastically improved performance over purely parametric methods with nearly no downsides. Our findings are validated using 206 datasets from the OpenML repository in both bounded and infinite memory budget situations.
    Goal-Driven Context-Aware Next Service Recommendation for Mashup Composition. (arXiv:2210.14127v1 [cs.SE])
    As service-oriented architecture becomes one of the most prevalent techniques to rapidly deliver functionalities to customers, increasingly more reusable software components have been published online in the form of web services. To create a mashup, it becomes not only time-consuming but also error-prone for developers to find suitable services from such a sea of services. Service discovery and recommendation has thus attracted significant attention in both academia and industry. This paper proposes a novel incremental recommend-as-you-go approach to recommending the next potential service based on the context of a mashup under construction, considering services that have been selected up to the current step as well as the mashup goal. The core technique is an algorithm for learning service embeddings, which captures services' past goal-driven, context-aware decision-making behaviors in addition to their semantic descriptions and co-occurrence history. A goal-exclusionary negative sampling mechanism tailored for mashup development is also developed to improve training performance. Extensive experiments on a real-world dataset demonstrate the effectiveness of our approach.
    Boosting the Cycle Counting Power of Graph Neural Networks with I$^2$-GNNs. (arXiv:2210.13978v1 [cs.LG])
    Message Passing Neural Networks (MPNNs) are a widely used class of Graph Neural Networks (GNNs). The limited representational power of MPNNs inspires the study of provably powerful GNN architectures. However, knowing one model is more powerful than another gives little insight about what functions they can or cannot express. It is still unclear whether these models are able to approximate specific functions such as counting certain graph substructures, which is essential for applications in biology, chemistry and social network analysis. Motivated by this, we propose to study the counting power of Subgraph MPNNs, a recent and popular class of powerful GNN models that extract rooted subgraphs for each node, assign the root node a unique identifier and encode the root node's representation within its rooted subgraph. Specifically, we prove that Subgraph MPNNs fail to count more-than-4-cycles at node level, implying that node representations cannot correctly encode the surrounding substructures like ring systems with more than four atoms. To overcome this limitation, we propose I$^2$-GNNs to extend Subgraph MPNNs by assigning different identifiers for the root node and its neighbors in each subgraph. I$^2$-GNNs' discriminative power is shown to be strictly stronger than Subgraph MPNNs and partially stronger than the 3-WL test. More importantly, I$^2$-GNNs are proven capable of counting all 3, 4, 5 and 6-cycles, covering common substructures like benzene rings in organic chemistry, while still keeping linear complexity. To the best of our knowledge, it is the first linear-time GNN model that can count 6-cycles with theoretical guarantees. We validate its counting power in cycle counting tasks and demonstrate its competitive performance in molecular prediction benchmarks.
    Unsupervised Anomaly Detection for Auditing Data and Impact of Categorical Encodings. (arXiv:2210.14056v1 [cs.LG])
    In this paper, we introduce the Vehicle Claims dataset, consisting of fraudulent insurance claims for automotive repairs. The data belongs to the broader category of auditing data, which also includes journal and network intrusion data. Insurance claim data are distinctively different from other auditing data (such as network intrusion data) in their high number of categorical attributes. We tackle the common problem of missing benchmark datasets for anomaly detection: datasets are mostly confidential, and the public tabular datasets do not contain relevant and sufficient categorical attributes. Therefore, a large-sized dataset is created for this purpose and referred to as the Vehicle Claims (VC) dataset. The dataset is evaluated with shallow and deep learning methods. Due to the introduction of categorical attributes, we encounter the challenge of encoding them for the large dataset. As one-hot encoding of a high-cardinality dataset invokes the "curse of dimensionality", we experiment with GEL encoding and an embedding layer for representing categorical attributes. Our work compares competitive learning, reconstruction-error, density estimation and contrastive learning approaches with Label, One Hot and GEL encodings and an embedding layer to handle categorical values.
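    A minimal sketch of the embedding-layer alternative to one-hot encoding for a single high-cardinality attribute; the cardinality and embedding size are illustrative.

        import torch
        import torch.nn as nn

        num_categories, emb_dim = 50_000, 16  # e.g., a vehicle-model attribute
        embedding = nn.Embedding(num_categories, emb_dim)

        codes = torch.tensor([3, 41_872, 17])  # integer-encoded category ids
        dense = embedding(codes)               # shape (3, 16) instead of (3, 50_000)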
    The Debate Over Understanding in AI's Large Language Models. (arXiv:2210.13966v1 [cs.LG])
    We survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to "understand" language -- and the physical and social situations language encodes -- in any important sense. We describe arguments that have been made for and against such understanding, and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that a new science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.
    Second Order Ensemble Langevin Method for Sampling and Inverse Problems. (arXiv:2208.04506v2 [math.DS] UPDATED)
    We propose a sampling method based on an ensemble approximation of second order Langevin dynamics. The log target density is appended with a quadratic term in an auxiliary momentum variable, and damped-driven Hamiltonian dynamics are introduced; the resulting stochastic differential equation leaves the Gibbs measure invariant, with the marginal on the position coordinates given by the target. A preconditioner based on covariance under the law of the dynamics does not change this invariance property, and is introduced to accelerate convergence to the Gibbs measure. The resulting mean-field dynamics may be approximated by an ensemble method; this results in a gradient-free and affine-invariant stochastic dynamical system. Numerical results demonstrate its potential as the basis for a numerical sampler in Bayesian inverse problems.
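    A minimal single-chain Euler-Maruyama discretization of the second-order dynamics, without the ensemble approximation or the covariance preconditioner; the double-well potential is a toy target, and the step size is illustrative.

        import numpy as np

        def grad_U(q):
            return 4 * q * (q ** 2 - 1)  # gradient of the double well (q^2 - 1)^2

        def underdamped_langevin(steps=50_000, dt=1e-2, gamma=1.0, seed=0):
            rng = np.random.default_rng(seed)
            q, p, samples = 0.0, 0.0, []
            for _ in range(steps):
                p += (-grad_U(q) - gamma * p) * dt \
                     + np.sqrt(2 * gamma * dt) * rng.standard_normal()
                q += p * dt
                samples.append(q)
            return np.array(samples)  # marginal over q targets exp(-U(q))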
    Removing Radio Frequency Interference from Auroral Kilometric Radiation with Stacked Autoencoders. (arXiv:2210.12931v2 [astro-ph.IM] UPDATED)
    Radio frequency data in astronomy enable scientists to analyze astrophysical phenomena. However, these data can be corrupted by a host of radio frequency interference (RFI) sources that limit the ability to observe underlying natural processes. In this study, we extended recent work in image processing to remove RFI from time-frequency spectrograms containing auroral kilometric radiation (AKR), a coherent radio emission originating from the Earth's auroral zones that is used to study astrophysical plasmas. We present a Denoising Autoencoder for Auroral Radio Emissions (DAARE) trained with synthetic spectrograms to denoise AKR spectrograms collected at the South Pole Station. DAARE achieved a peak signal-to-noise ratio (PSNR) of 42.2 and a structural similarity (SSIM) of 0.981 on synthesized AKR observations, improving PSNR by 3.9 and SSIM by 0.064 compared to state-of-the-art filtering and denoising networks. Qualitative comparisons demonstrate DAARE's capability to effectively remove RFI from real AKR observations, despite being trained completely on a dataset of simulated AKR. The framework for simulating AKR, training DAARE, and employing DAARE can be accessed at https://github.com/Cylumn/daare.
    LAB: Learnable Activation Binarizer for Binary Neural Networks. (arXiv:2210.13858v1 [cs.LG])
    Binary Neural Networks (BNNs) are receiving an upsurge of attention for bringing power-hungry deep learning towards edge devices. The traditional wisdom in this space is to employ sign() for binarizing feature maps. We argue and illustrate that sign() is a uniqueness bottleneck, limiting information propagation throughout the network. To alleviate this, we propose to dispense with sign(), replacing it with a learnable activation binarizer (LAB), allowing the network to learn a fine-grained binarization kernel per layer - as opposed to global thresholding. LAB is a novel universal module that can seamlessly be integrated into existing architectures. To confirm this, we plug it into four seminal BNNs and show a considerable performance boost at the cost of a tolerable increase in delay and complexity. Finally, we build an end-to-end BNN (coined LAB-BNN) around LAB, and demonstrate that it achieves competitive performance on par with the state-of-the-art on ImageNet.
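    For context, the conventional baseline that LAB replaces is sign() trained with a straight-through estimator; the clipping window in the backward pass follows common BNN practice, and LAB's learnable kernel itself is not reproduced here.

        import torch

        class SignSTE(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x):
                ctx.save_for_backward(x)
                return torch.sign(x)

            @staticmethod
            def backward(ctx, grad_out):
                (x,) = ctx.saved_tensors
                # Straight-through: pass gradients only where |x| <= 1.
                return grad_out * (x.abs() <= 1).float()

        binary_featuremap = SignSTE.apply(torch.randn(4, 8))  # example usage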
    A White-Box Adversarial Attack Against a Digital Twin. (arXiv:2210.14018v1 [cs.CR])
    Recent research has shown that Machine Learning/Deep Learning (ML/DL) models are particularly vulnerable to adversarial perturbations, which are small changes made to the input data in order to fool a machine learning classifier. The Digital Twin, which is typically described as consisting of a physical entity, a virtual counterpart, and the data connections in between, is increasingly being investigated as a means of improving the performance of physical entities by leveraging computational techniques, which are enabled by the virtual counterpart. This paper explores the susceptibility of Digital Twin (DT), a virtual model designed to accurately reflect a physical object using ML/DL classifiers that operate as Cyber Physical Systems (CPS), to adversarial attacks. As a proof of concept, we first formulate a DT of a vehicular system using a deep neural network architecture and then utilize it to launch an adversarial attack. We attack the DT model by perturbing the input to the trained model and show how easily the model can be broken with white-box attacks.
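    The simplest white-box attack of the kind described above is the fast gradient sign method (FGSM); the sketch below assumes a generic differentiable classifier and an illustrative perturbation budget, and does not reproduce the paper's specific attack on the vehicular DT.

        import torch
        import torch.nn as nn

        def fgsm(model, x, y, eps=0.03):
            x = x.clone().detach().requires_grad_(True)
            loss = nn.functional.cross_entropy(model(x), y)
            loss.backward()
            # Step in the direction that maximally increases the loss.
            return (x + eps * x.grad.sign()).detach()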
    Useful Confidence Measures: Beyond the Max Score. (arXiv:2210.14070v1 [cs.LG])
    An important component in deploying machine learning (ML) in safety-critical applications is having a reliable measure of confidence in the ML model's predictions. For a classifier $f$ producing a probability vector $f(x)$ over the candidate classes, the confidence is typically taken to be $\max_i f(x)_i$. This approach is potentially limited, as it disregards the rest of the probability vector. In this work, we derive several confidence measures that depend on information beyond the maximum score, such as margin-based and entropy-based measures, and empirically evaluate their usefulness, focusing on NLP tasks with distribution shifts and Transformer-based models. We show that when models are evaluated on the out-of-distribution data ``out of the box'', using only the maximum score to inform the confidence measure is highly suboptimal. In the post-processing regime (where the scores of $f$ can be improved using additional in-distribution held-out data), this remains true, albeit less significant. Overall, our results suggest that entropy-based confidence is a surprisingly useful measure.
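    The measures in question are easy to state in code. Below, each is mapped to a scalar where higher means more confident, with entropy normalized by $\log K$; this is a generic sketch rather than the paper's exact definitions.

        import numpy as np

        def confidences(p):
            p = np.asarray(p, dtype=float)
            top2 = np.sort(p)[-2:]                  # two largest scores
            max_score = top2[1]
            margin = top2[1] - top2[0]
            entropy = -(p * np.log(p + 1e-12)).sum()
            return max_score, margin, 1.0 - entropy / np.log(len(p))

        print(confidences([0.4, 0.35, 0.15, 0.1]))  # clear max, yet a tiny margin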
    Model-Free Prediction of Adversarial Drop Points in 3D Point Clouds. (arXiv:2210.14164v1 [cs.CV])
    Adversarial attacks pose serious challenges for deep neural network (DNN)-based analysis of various input signals. In the case of 3D point clouds, methods have been developed to identify points that play a key role in the network decision, and these become crucial in generating existing adversarial attacks. For example, a saliency map approach is a popular method for identifying adversarial drop points, whose removal would significantly impact the network decision. Generally, methods for identifying adversarial points rely on the deep model itself in order to determine which points are critically important for the model's decision. This paper aims to provide a novel viewpoint on this problem, in which adversarial points can be predicted independently of the model. To this end, we define 14 point cloud features and use multiple linear regression to examine whether these features can be used for model-free adversarial point prediction, and which combination of features is best suited for this purpose. Experiments show that a suitable combination of features is able to predict adversarial points of three different networks -- PointNet, PointNet++, and DGCNN -- significantly better than a random guess. The results also provide further insight into DNNs for point cloud analysis, by showing which features play key roles in their decision-making process.
    Atlas flow: compatible local structures on the manifold. (arXiv:2210.14149v1 [cs.CV])
    In this paper, we focus on the intersections of a manifold's local structures to analyze the global structure of a manifold. We obtain local regions on data manifolds, such as the latent space of StyleGAN2, using Mapper, a tool from topological data analysis. We impose gluing compatibility conditions on overlapping local regions, which guarantee that the local structures can be glued together into the global structure of a manifold. We propose a novel generative flow model called Atlas flow that uses compatibility to reattach the local regions. Our experiments show that the generative process performs well on noisy synthetic samples of well-known manifolds. Furthermore, we investigate the style vector manifold of StyleGAN2 using our model.
    MOFormer: Self-Supervised Transformer model for Metal-Organic Framework Property Prediction. (arXiv:2210.14188v1 [cs.LG])
    Metal-Organic Frameworks (MOFs) are materials with a high degree of porosity that can be used for applications in energy storage, water desalination, gas storage, and gas separation. However, the chemical space of MOFs is close to infinite in size due to the large variety of possible combinations of building blocks and topology. Discovering the optimal MOFs for specific applications requires an efficient and accurate search over an enormous number of potential candidates. Previous high-throughput screening methods using computational simulations like DFT can be time-consuming. Such methods also require optimizing the 3D atomic structure of MOFs, which adds an extra step when evaluating hypothetical MOFs. In this work, we propose a structure-agnostic deep learning method based on the Transformer model, named MOFormer, for property prediction of MOFs. The MOFormer takes a text string representation of a MOF (MOFid) as input, thus circumventing the need to obtain the 3D structure of a hypothetical MOF and accelerating the screening process. Furthermore, we introduce a self-supervised learning framework that pretrains the MOFormer via maximizing the cross-correlation between its structure-agnostic representations and the structure-based representations of a crystal graph convolutional neural network (CGCNN) on >400k publicly available MOF data. Using self-supervised learning allows the MOFormer to intrinsically learn 3D structural information even though it is not included in the input. Experiments show that pretraining improved the prediction accuracy of both models on various downstream prediction tasks. Furthermore, we reveal that MOFormer can be more data-efficient on quantum-chemical property prediction than the structure-based CGCNN when training data is limited. Overall, MOFormer provides a novel perspective on efficient MOF design using deep learning.
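    The cross-correlation pretraining objective can be sketched in the style of Barlow Twins; the embedding normalization and off-diagonal weight below are illustrative assumptions, not necessarily the paper's exact formulation.

        import torch

        def cross_correlation_loss(z_a, z_b, lam=5e-3):
            # z_a, z_b: (N, d) embeddings of the same MOFs from the two encoders
            # (e.g., MOFormer on MOFid strings and CGCNN on crystal graphs).
            z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
            z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
            c = (z_a.t() @ z_b) / z_a.shape[0]  # (d, d) cross-correlation matrix
            on_diag = ((torch.diagonal(c) - 1) ** 2).sum()
            off_diag = (c ** 2).sum() - (torch.diagonal(c) ** 2).sum()
            return on_diag + lam * off_diag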
    Learn-Morph-Infer: a new way of solving the inverse problem for brain tumor modeling. (arXiv:2111.04090v3 [physics.med-ph] UPDATED)
    Current treatment planning for patients diagnosed with a brain tumor, such as glioma, could significantly benefit from access to the spatial distribution of tumor cell concentration. Existing diagnostic modalities, e.g. magnetic resonance imaging (MRI), delineate areas of high cell density sufficiently well. In gliomas, however, they do not portray areas of low cell concentration, which can often serve as a source for the secondary appearance of the tumor after treatment. To estimate tumor cell densities beyond the visible boundaries of the lesion, numerical simulations of tumor growth could complement imaging information by providing estimates of the full spatial distribution of tumor cells. Over recent years, a corpus of literature on medical image-based tumor modeling has been published. It includes different mathematical formalisms describing the forward tumor growth model. Alongside, various parametric inference schemes have been developed to perform efficient tumor model personalization, i.e. to solve the inverse problem. However, the unifying drawback of all existing approaches is the time complexity of model personalization, which prohibits a potential integration of the modeling into clinical settings. In this work, we introduce a deep learning based methodology for inferring the patient-specific spatial distribution of brain tumors from T1Gd and FLAIR MRI scans. Coined Learn-Morph-Infer, the method achieves real-time performance on the order of minutes on widely available hardware, and the compute time is stable across tumor models of different complexity, such as reaction-diffusion and reaction-advection-diffusion models. We believe the proposed inverse solution approach not only paves the way for clinical translation of brain tumor personalization but can also be adapted to other scientific and engineering domains.
    CalFAT: Calibrated Federated Adversarial Training with Label Skewness. (arXiv:2205.14926v2 [cs.LG] UPDATED)
    Recent studies have shown that, like traditional machine learning, federated learning (FL) is also vulnerable to adversarial attacks. To improve the adversarial robustness of FL, federated adversarial training (FAT) methods have been proposed to apply adversarial training locally before global aggregation. Although these methods demonstrate promising results on independent identically distributed (IID) data, they suffer from training instability on non-IID data with label skewness, resulting in degraded natural accuracy. This tends to hinder the application of FAT in real-world settings, where the label distribution across clients is often skewed. In this paper, we study the problem of FAT under label skewness, and reveal one root cause of the training instability and natural accuracy degradation issues: skewed labels lead to non-identical class probabilities and heterogeneous local models. We then propose a Calibrated FAT (CalFAT) approach that tackles the instability issue by calibrating the logits adaptively to balance the classes. We show both theoretically and empirically that the optimization of CalFAT leads to homogeneous local models across the clients and better convergence points.
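    One common way to calibrate logits against skewed class distributions (a plausible reading of the idea for illustration, not necessarily CalFAT's exact formula) is logit adjustment by log class priors:

        import torch
        import torch.nn.functional as F

        def calibrated_cross_entropy(logits, targets, class_counts, tau=1.0):
            # Shift logits by the log of local class priors so that a skewed label
            # distribution on a client does not bias training toward head classes.
            prior = class_counts.float() / class_counts.sum()
            return F.cross_entropy(logits + tau * torch.log(prior + 1e-12), targets)

        logits = torch.randn(8, 10)
        targets = torch.randint(0, 10, (8,))
        counts = torch.tensor([100, 5, 3, 80, 2, 1, 40, 7, 9, 6])  # skewed client labels
        loss = calibrated_cross_entropy(logits, targets, counts)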
    A Dynamical System View of Langevin-Based Non-Convex Sampling. (arXiv:2210.13867v1 [cs.LG])
    Non-convex sampling is a key challenge in machine learning, central to non-convex optimization in deep learning as well as to approximate probabilistic inference. Despite its significance, theoretically there remain many important challenges: Existing guarantees (1) typically only hold for the averaged iterates rather than the more desirable last iterates, (2) lack convergence metrics that capture the scales of the variables such as Wasserstein distances, and (3) mainly apply to elementary schemes such as stochastic gradient Langevin dynamics. In this paper, we develop a new framework that lifts the above issues by harnessing several tools from the theory of dynamical systems. Our key result is that, for a large class of state-of-the-art sampling schemes, their last-iterate convergence in Wasserstein distances can be reduced to the study of their continuous-time counterparts, which is much better understood. Coupled with standard assumptions of MCMC sampling, our theory immediately yields the last-iterate Wasserstein convergence of many advanced sampling schemes such as proximal, randomized mid-point, and Runge-Kutta integrators. Beyond existing methods, our framework also motivates more efficient schemes that enjoy the same rigorous guarantees.
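    For reference, the elementary scheme named above, stochastic gradient Langevin dynamics, amounts to a gradient step plus injected Gaussian noise; a minimal sketch on a standard Gaussian target:

        import numpy as np

        def sgld_step(theta, grad_log_p, step, rng):
            # theta <- theta + (step / 2) * grad log p(theta) + sqrt(step) * N(0, I)
            return theta + 0.5 * step * grad_log_p(theta) \
                + np.sqrt(step) * rng.standard_normal(theta.shape)

        # Sample a standard Gaussian target, for which grad log p(x) = -x.
        rng = np.random.default_rng(0)
        theta, samples = np.zeros(2), []
        for _ in range(5000):
            theta = sgld_step(theta, lambda x: -x, step=1e-2, rng=rng)
            samples.append(theta.copy())
        print(np.cov(np.array(samples[1000:]).T))   # approaches the identity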
    Differentially Private Language Models for Secure Data Sharing. (arXiv:2210.13918v1 [cs.LG])
    To protect the privacy of individuals whose data is being shared, it is of high importance to develop methods allowing researchers and companies to release textual data while providing formal privacy guarantees to its originators. In the field of NLP, substantial efforts have been directed at building mechanisms following the framework of local differential privacy, thereby anonymizing individual text samples before releasing them. In practice, these approaches are often dissatisfying in terms of the quality of their output language due to the strong noise required for local differential privacy. In this paper, we approach the problem at hand using global differential privacy, particularly by training a generative language model in a differentially private manner and consequently sampling data from it. Using natural language prompts and a new prompt-mismatch loss, we are able to create highly accurate and fluent textual datasets taking on specific desired attributes such as sentiment or topic and resembling statistical properties of the training data. We perform thorough experiments indicating that our synthetic datasets do not leak information from our original data and are of high language quality and highly suitable for training models for further analysis on real-world data. Notably, we also demonstrate that training classifiers on private synthetic data outperforms directly training classifiers on real data with DP-SGD.
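    The core DP-SGD recipe referenced above (per-example gradient clipping plus calibrated Gaussian noise) can be sketched as follows; this is a schematic microbatch implementation for intuition, not the paper's training setup:

        import torch

        def dp_sgd_step(model, loss_fn, batch, lr=0.1, clip=1.0, sigma=1.0):
            # Per-example gradient clipping plus Gaussian noise: the core of DP-SGD.
            params = [p for p in model.parameters() if p.requires_grad]
            accum = [torch.zeros_like(p) for p in params]
            for x, y in batch:                        # microbatch loop, one example each
                model.zero_grad()
                loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
                norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
                scale = min(1.0, clip / float(norm + 1e-12))
                for a, p in zip(accum, params):
                    a.add_(p.grad, alpha=scale)       # accumulate clipped gradient
            with torch.no_grad():
                for a, p in zip(accum, params):
                    a.add_(sigma * clip * torch.randn_like(a))  # calibrated noise
                    p.sub_(lr * a / len(batch))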
    Are All Spurious Features in Natural Language Alike? An Analysis through a Causal Lens. (arXiv:2210.14011v1 [cs.CL])
    The term 'spurious correlations' has been used in NLP to informally denote any undesirable feature-label correlations. However, a correlation can be undesirable because (i) the feature is irrelevant to the label (e.g. punctuation in a review), or (ii) the feature's effect on the label depends on the context (e.g. negation words in a review), which is ubiquitous in language tasks. In case (i), we want the model to be invariant to the feature, which is neither necessary nor sufficient for prediction. But in case (ii), even an ideal model (e.g. humans) must rely on the feature, since it is necessary (but not sufficient) for prediction. Therefore, a more fine-grained treatment of spurious features is needed to specify the desired model behavior. We formalize this distinction using a causal model and probabilities of necessity and sufficiency, which delineates the causal relations between a feature and a label. We then show that this distinction helps explain results of existing debiasing methods on different spurious features, and demystifies surprising results such as the encoding of spurious features in model representations after debiasing.
    Online model error correction with neural networks in the incremental 4D-Var framework. (arXiv:2210.13817v1 [stat.ML])
    Recent studies have demonstrated that it is possible to combine machine learning with data assimilation to reconstruct the dynamics of a physical model partially and imperfectly observed. Data assimilation is used to estimate the system state from the observations, while machine learning computes a surrogate model of the dynamical system based on those estimated states. The surrogate model can be defined as a hybrid combination in which a physical model based on prior knowledge is enhanced with a statistical model estimated by a neural network. The training of the neural network is typically done offline, once a large enough dataset of model state estimates is available. By contrast, with online approaches the surrogate model is improved each time a new system state estimate is computed. Online approaches naturally fit the sequential framework encountered in geosciences where new observations become available over time. In a recent methodology paper, we developed a new weak-constraint 4D-Var formulation which can be used to train a neural network for online model error correction. In the present article, we develop a simplified version of that method, in the incremental 4D-Var framework adopted by most operational weather centres. The simplified method is implemented in the ECMWF Object-Oriented Prediction System, with the help of a newly developed Fortran neural network library, and tested with a two-layer two-dimensional quasi-geostrophic model. The results confirm that online learning is effective and yields a more accurate model error correction than offline learning. Finally, the simplified method is compatible with future applications to state-of-the-art models such as the ECMWF Integrated Forecasting System.
    Adaptive Behavior Cloning Regularization for Stable Offline-to-Online Reinforcement Learning. (arXiv:2210.13846v1 [cs.LG])
    Offline reinforcement learning, by learning from a fixed dataset, makes it possible to learn agent behaviors without interacting with the environment. However, depending on the quality of the offline dataset, such pre-trained agents may have limited performance and would further need to be fine-tuned online by interacting with the environment. During online fine-tuning, the performance of the pre-trained agent may collapse quickly due to the sudden distribution shift from offline to online data. While constraints enforced by offline RL methods such as a behaviour cloning loss prevent this to an extent, these constraints also significantly slow down online fine-tuning by forcing the agent to stay close to the behavior policy. We propose to adaptively weigh the behavior cloning loss during online fine-tuning based on the agent's performance and training stability. Moreover, we use a randomized ensemble of Q functions to further increase the sample efficiency of online fine-tuning by performing a large number of learning updates. Experiments show that the proposed method yields state-of-the-art offline-to-online reinforcement learning performance on the popular D4RL benchmark. Code is available at: https://github.com/zhaoyi11/adaptive_bc.
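    A minimal sketch of the adaptive weighting idea, with a hypothetical update rule and stability threshold (the paper's exact schedule may differ):

        import numpy as np

        def update_bc_weight(alpha, recent_returns, td_error_std, lr=0.05):
            # Loosen the behaviour-cloning constraint when training looks stable and
            # returns improve; tighten it again when learning destabilizes.
            improving = len(recent_returns) > 1 and \
                recent_returns[-1] >= np.mean(recent_returns[:-1])
            stable = td_error_std < 1.0              # hypothetical stability threshold
            if improving and stable:
                alpha -= lr * alpha                  # anneal the constraint
            else:
                alpha += lr * (1.0 - alpha)          # re-tighten toward 1
            return float(np.clip(alpha, 0.0, 1.0))

        # The actor loss would then be, e.g.: -Q(s, pi(s)).mean() + alpha * bc_loss.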
    FocusedCleaner: Sanitizing Poisoned Graphs for Robust GNN-based Node Classification. (arXiv:2210.13815v1 [cs.LG])
    Recently, a lot of research attention has been devoted to exploring Web security; one of the most representative topics is the adversarial robustness of graph mining algorithms. In particular, a widely studied attack formulation is graph manipulation, which modifies the relational data to mislead the predictions of Graph Neural Networks (GNNs). Naturally, an intrinsic question one would ask is whether we can accurately identify the manipulations over graphs - we term this problem poisoned graph sanitation. In this paper, we present FocusedCleaner, a poisoned graph sanitation framework consisting of two modules: bi-level structural learning and victim node detection. In particular, the structural learning module reverses the attack process to steadily sanitize the graph, while the detection module provides the "focus" - a narrowed and more accurate search region - for structural learning. These two modules operate in iterations and reinforce each other to sanitize a poisoned graph step by step. Extensive experiments demonstrate that FocusedCleaner outperforms state-of-the-art baselines both in poisoned graph sanitation and in improving robustness.
    What Has Been Enhanced in my Knowledge-Enhanced Language Model?. (arXiv:2202.00964v5 [cs.CL] UPDATED)
    We show that existing model interpretation methods, such as linear probes and prompts, have key limitations in answering questions about what knowledge has actually been integrated into knowledge-enhanced language models (LMs). We revisit knowledge integration (KI) from an information-theoretic view and propose a new, theoretically sound probe called Graph Convolution Simulator (GCS) for KI interpretation. GCS uses graph attention on the corresponding knowledge graph for interpretation. In our experiments, we verify that GCS can provide reasonable interpretation results for two well-known knowledge-enhanced LMs: ERNIE and K-Adapter. We also find that only a marginal amount of knowledge is successfully integrated into these models, and that simply increasing the size of the KI corpus may not lead to better knowledge-enhanced LMs.
    Weight Fixing Networks. (arXiv:2210.13554v1 [cs.LG])
    Modern deep learning models contain millions (or even billions) of unique parameters, each represented by a b-bit number. Popular attempts at compressing neural networks (such as pruning and quantisation) have shown that many of the parameters are superfluous and can be removed (pruning) or expressed with fewer than b bits (quantisation) without hindering performance. Here we look to go much further in minimising the information content of networks. Rather than a channel- or layer-wise encoding, we look to lossless whole-network quantisation to minimise the entropy and number of unique parameters in a network. We propose a new method, Weight Fixing Networks (WFN), designed to realise four model outcome objectives: i) very few unique weights, ii) low-entropy weight encodings, iii) unique weight values amenable to energy-saving versions of hardware multiplication, and iv) lossless task-performance. Some of these goals conflict. To best balance them, we combine a few novel (and some well-trodden) tricks: a novel regularisation term (i, ii); a view of clustering cost as relative distance change (i, ii, iv); and a focus on whole-network re-use of weights (i, iii). Our ImageNet experiments demonstrate lossless compression using 56x fewer unique weights and a 1.9x lower weight-space entropy than SOTA quantisation approaches.
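    A rough sketch of whole-network weight sharing via clustering, shown for intuition only (the paper's regularisation-based procedure is more involved than snapping weights to k-means centres):

        import numpy as np
        from sklearn.cluster import KMeans

        def fix_weights(weight_tensors, n_unique=16):
            # Cluster all weights of the network jointly, then snap every weight to
            # its cluster centre so the whole model shares very few unique values.
            flat = np.concatenate([w.ravel() for w in weight_tensors]).reshape(-1, 1)
            km = KMeans(n_clusters=n_unique, n_init=4, random_state=0).fit(flat)
            centres = km.cluster_centers_.ravel()
            return [centres[km.predict(w.reshape(-1, 1))].reshape(w.shape)
                    for w in weight_tensors]

        layers = [np.random.randn(64, 32), np.random.randn(32, 10)]
        fixed = fix_weights(layers)
        print(len(np.unique(np.concatenate([w.ravel() for w in fixed]))))  # <= 16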
    Learning to forecast vegetation greenness at fine resolution over Africa with ConvLSTMs. (arXiv:2210.13648v1 [cs.LG])
    Forecasting the state of vegetation in response to climate and weather events is a major challenge. Its implementation will prove crucial in predicting crop yield, forest damage, and more generally the impact on ecosystem services relevant for socio-economic functioning, whose absence can lead to humanitarian disasters. Vegetation status depends on weather and environmental conditions that modulate complex ecological processes taking place at several timescales. Interactions between vegetation and different environmental drivers express responses instantaneously but also with time-lagged effects, often showing an emerging spatial context at landscape and regional scales. We formulate the land surface forecasting task as a strongly guided video prediction task, where the objective is to forecast vegetation development at very fine resolution using topography and weather variables to guide the prediction. We use a Convolutional LSTM (ConvLSTM) architecture to address this task and predict changes in the vegetation state in Africa using Sentinel-2 satellite NDVI, with ERA5 weather reanalysis, SMAP satellite measurements, and topography (DEM of SRTMv4.1) as variables to guide the prediction. Our results highlight how ConvLSTM models can forecast not only the seasonal evolution of NDVI at high resolution, but also the differential impacts of weather anomalies over the baselines. The model is able to predict different vegetation types, even those with very high NDVI variability during the target period, which is promising for supporting anticipatory action in the context of drought-related disasters.
    Learning Low Dimensional State Spaces with Overparameterized Recurrent Neural Network. (arXiv:2210.14064v1 [cs.LG])
    Overparameterization in deep learning typically refers to settings where a trained Neural Network (NN) has representational capacity to fit the training data in many ways, some of which generalize well, while others do not. In the case of Recurrent Neural Networks (RNNs), there exists an additional layer of overparameterization, in the sense that a model may exhibit many solutions that generalize well for sequence lengths seen in training, some of which extrapolate to longer sequences, while others do not. Numerous works have studied the tendency of Gradient Descent (GD) to fit overparameterized NNs with solutions that generalize well. On the other hand, its tendency to fit overparameterized RNNs with solutions that extrapolate has been discovered only lately, and is far less understood. In this paper, we analyze the extrapolation properties of GD when applied to overparameterized linear RNNs. In contrast to recent arguments suggesting an implicit bias towards short-term memory, we provide theoretical evidence for learning low dimensional state spaces, which can also model long-term memory. Our result relies on a dynamical characterization which shows that GD (with small step size and near-zero initialization) strives to maintain a certain form of balancedness, as well as on tools developed in the context of the moment problem from statistics (recovery of a probability distribution from its moments). Experiments corroborate our theory, demonstrating extrapolation via learning low dimensional state spaces with both linear and non-linear RNNs.
    Deploying a Steered Query Optimizer in Production at Microsoft. (arXiv:2210.13625v1 [cs.DB])
    Modern analytical workloads are highly heterogeneous and massively complex, making generic query optimizers untenable for many customers and scenarios. As a result, it is important to specialize these optimizers to instances of the workloads. In this paper, we continue a recent line of work in steering a query optimizer towards better plans for a given workload, and make major strides in pushing previous research ideas to production deployment. Along the way we solve several operational challenges, including making steering actions more manageable, keeping the costs of steering within budget, and avoiding unexpected performance regressions in production. Our resulting system, QQ-advisor, essentially externalizes the query planner to a massive offline pipeline for better exploration and specialization. We discuss various aspects of our design and show detailed results over production SCOPE workloads at Microsoft, where the system is currently enabled by default.
    Toward domain generalized pruning by scoring out-of-distribution importance. (arXiv:2210.13810v1 [cs.LG])
    Filter pruning has been widely used for compressing convolutional neural networks to reduce computation costs during the deployment stage. Recent studies have shown that filter pruning techniques can achieve lossless compression of deep neural networks, reducing redundant filters (kernels) without sacrificing accuracy. However, the evaluation is done when the training and testing data come from similar environmental conditions (independent and identically distributed), and how filter pruning affects cross-domain generalization (out-of-distribution) performance is largely ignored. We conduct extensive empirical experiments and reveal that although intra-domain performance can be maintained after filter pruning, cross-domain performance decays to a large extent. As scoring a filter's importance is one of the central problems in pruning, we design an importance score estimator that uses the variance of domain-level risks to account for the pruning risk on unseen distributions. As such, we retain more domain-generalized filters. The experiments show that, under the same pruning ratio, our method achieves significantly better cross-domain generalization performance than the baseline filter pruning method. As a first attempt, our work sheds light on the joint problem of domain generalization and filter pruning research.
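    One plausible form of such a domain-variance-aware score is sketched below; the sign and weighting of the variance term, and the risk estimates themselves, are assumptions for illustration:

        import numpy as np

        def importance_scores(risk_increase, lam=1.0):
            # risk_increase[f, d]: estimated risk increase on training domain d when
            # filter f is pruned. A risk-averse score favours keeping filters whose
            # removal could hurt badly in some (possibly unseen) domain.
            return risk_increase.mean(axis=1) + lam * risk_increase.std(axis=1)

        scores = importance_scores(np.abs(np.random.randn(128, 4)))   # 128 filters, 4 domains
        to_prune = np.argsort(scores)[: int(0.3 * len(scores))]       # drop the lowest 30%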
    I see what you hear: a vision-inspired method to localize words. (arXiv:2210.13567v1 [cs.CV])
    This paper explores the possibility of using visual object detection techniques for word localization in speech data. Object detection has been thoroughly studied in the contemporary literature for visual data. Noting that an audio signal can be interpreted as a 1-dimensional image, object localization techniques can be fundamentally useful for word localization. Building upon this idea, we propose a lightweight solution for word detection and localization. We use bounding box regression for word localization, which enables our model to detect the occurrence, offset, and duration of keywords in a given audio stream. We experiment with LibriSpeech and train a model to localize 1000 words. Compared to existing work, our method reduces model size by 94% and improves the F1 score by 6.5%.
    Does Medical Imaging learn different Convolution Filters?. (arXiv:2210.13799v1 [eess.IV])
    Recent work has investigated the distributions of learned convolution filters through a large-scale study containing hundreds of heterogeneous image models. Surprisingly, on average, the distributions only show minor drifts in comparisons of various studied dimensions including the learned task, image domain, or dataset. However, among the studied image domains, medical imaging models appeared to show significant outliers through "spikey" distributions, and, therefore, learn clusters of highly specific filters different from other domains. Following this observation, we study the collected medical imaging models in more detail. We show that instead of fundamental differences, the outliers are due to specific processing in some architectures. Quite the contrary, for standardized architectures, we find that models trained on medical data do not significantly differ in their filter distributions from similar architectures trained on data from other domains. Our conclusions reinforce previous hypotheses stating that pre-training of imaging models can be done with any kind of diverse image data.
    Reinforcement Learning and Bandits for Speech and Language Processing: Tutorial, Review and Outlook. (arXiv:2210.13623v1 [cs.AI])
    In recent years, reinforcement learning and bandits have transformed a wide range of real-world applications, including healthcare, finance, recommendation systems, robotics, and, last but not least, speech and natural language processing. While most speech and language applications of reinforcement learning algorithms center on improving the training of deep neural networks with its flexible optimization properties, there is still much ground to explore in utilizing the benefits of reinforcement learning, such as its reward-driven adaptability, state representations, temporal structures and generalizability. In this survey, we present an overview of recent advancements in reinforcement learning and bandits, and discuss how they can be effectively employed to solve speech and natural language processing problems with models that are adaptive, interactive and scalable.
    Graph Reinforcement Learning-based CNN Inference Offloading in Dynamic Edge Computing. (arXiv:2210.13464v1 [cs.LG])
    This paper studies the computational offloading of CNN inference in dynamic multi-access edge computing (MEC) networks. To address the uncertainties in communication time and edge servers' available capacity, we use an early-exit mechanism to terminate the computation early and meet the deadlines of inference tasks. We design a reward function to trade off the communication, computation and inference accuracy, and formulate the offloading problem of CNN inference as a maximization problem with the goal of maximizing the average inference accuracy and throughput in the long term. To solve the maximization problem, we propose a graph reinforcement learning-based early-exit mechanism (GRLE), which outperforms the state-of-the-art work, deep reinforcement learning-based online offloading (DROO) and its enhanced method, DROO with early-exit mechanism (DROOE), under different dynamic scenarios. The experimental results show that GRLE achieves average accuracy up to 3.41x higher than graph reinforcement learning (GRL) and 1.45x higher than DROOE, which demonstrates the advantages of GRLE for offloading decision-making in dynamic MEC.
    Bridging the Training-Inference Gap for Dense Phrase Retrieval. (arXiv:2210.13678v1 [cs.CL])
    Building dense retrievers requires a series of standard procedures, including training and validating neural models and creating indexes for efficient search. However, these procedures are often misaligned in that training objectives do not exactly reflect the retrieval scenario at inference time. In this paper, we explore how the gap between training and inference in dense retrieval can be reduced, focusing on dense phrase retrieval (Lee et al., 2021) where billions of representations are indexed at inference. Since validating every dense retriever with a large-scale index is practically infeasible, we propose an efficient way of validating dense retrievers using a small subset of the entire corpus. This allows us to validate various training strategies including unifying contrastive loss terms and using hard negatives for phrase retrieval, which largely reduces the training-inference discrepancy. As a result, we improve top-1 phrase retrieval accuracy by 2~3 points and top-20 passage retrieval accuracy by 2~4 points for open-domain question answering. Our work urges modeling dense retrievers with careful consideration of training and inference via efficient validation while advancing phrase retrieval as a general solution for dense retrieval.
    Does Joint Training Really Help Cascaded Speech Translation?. (arXiv:2210.13700v1 [eess.AS])
    Currently, in speech translation, the straightforward approach - cascading a recognition system with a translation system - delivers state-of-the-art results. However, fundamental challenges such as error propagation from the automatic speech recognition system still remain. To mitigate these problems, researchers have recently turned their attention to direct data and proposed various joint training methods. In this work, we seek to answer the question of whether joint training really helps cascaded speech translation. We review recent papers on the topic and also investigate a joint training criterion by marginalizing the transcription posterior probabilities. Our findings show that a strong cascaded baseline can diminish any improvements obtained by joint training, and we suggest alternatives to joint training. We hope this work can serve as a refresher on the current speech translation landscape, and motivate research into finding more efficient and creative ways to utilize direct data for speech translation.
    The Robustness Limits of SoTA Vision Models to Natural Variation. (arXiv:2210.13604v1 [cs.CV])
    Recent state-of-the-art vision models introduced new architectures, learning paradigms, and larger pretraining data, leading to impressive performance on tasks such as classification. While previous generations of vision models were shown to lack robustness to factors such as pose, it is unclear to what extent this next generation of models is more robust. To study this question, we develop a dataset of more than 7 million images with controlled changes in pose, position, background, lighting, and size. We study not only how robust recent state-of-the-art models are, but also the extent to which models can generalize to variation in these factors when it is present during training. We consider a catalog of recent vision models, including vision transformers (ViT), self-supervised models such as masked autoencoders (MAE), and models trained on larger datasets such as CLIP. We find that, out of the box, even today's best models are not robust to common changes in pose, size, and background. When some samples varied during training, we found models required a significant portion of diversity to generalize -- though eventually robustness did improve. When diversity is seen only for some classes, however, we found models did not generalize to other classes, unless the classes were very similar to those seen varying during training. We hope our work will shed further light on the blind spots of SoTA models and spur the development of more robust vision models.
    Teal: Learning-Accelerated Optimization of Traffic Engineering. (arXiv:2210.13763v1 [cs.NI])
    In the last decade, global cloud wide-area networks (WANs) have grown 10$\times$ in size due to the deployment of new network sites and datacenters, making it challenging for commercial optimization engines to solve the network traffic engineering (TE) problem within the temporal budget of a few minutes. In this work, we show that carefully designed deep learning models are key to accelerating the running time of intra-WAN TE systems for large deployments since deep learning is both massively parallel and it benefits from the wealth of historical traffic allocation data from production WANs. However, off-the-shelf deep learning methods fail to perform well on the TE task since they ignore the effects of network connectivity on flow allocations. They are also faced with a tractability challenge posed by the large problem scale of TE optimization. Moreover, neural networks do not have mechanisms to readily enforce hard constraints on model outputs (e.g., link capacity constraints). We tackle these challenges by designing a deep learning-based TE system -- Teal. First, Teal leverages graph neural networks (GNN) to faithfully capture connectivity and model network flows. Second, Teal devises a multi-agent reinforcement learning (RL) algorithm to process individual demands independently in parallel to lower the problem scale. Finally, Teal reduces link capacity violations and improves solution quality using the alternating direction method of multipliers (ADMM). We evaluate Teal on traffic matrices of a global commercial cloud provider and find that Teal computes near-optimal traffic allocations with a 59$\times$ speedup over state-of-the-art TE systems on a WAN topology of over 1,500 nodes.
    Using Deep Learning to Find the Next Unicorn: A Practical Synthesis. (arXiv:2210.14195v1 [q-fin.CP])
    Startups often represent newly established business models associated with disruptive innovation and high scalability. They are commonly regarded as powerful engines for economic and social development. Meanwhile, startups are heavily constrained by many factors such as limited financial funding and human resources. Therefore, the chance for a startup to eventually succeed is as rare as "spotting a unicorn in the wild". Venture Capital (VC) strives to identify and invest in unicorn startups during their early stages, hoping to gain a high return. To avoid relying entirely on human domain expertise and intuition, investors usually employ data-driven approaches to forecast the success probability of startups. Over the past two decades, the industry has gone through a paradigm shift, moving from conventional statistical approaches towards becoming machine-learning (ML) based. Notably, the rapid growth of data volume and variety is quickly ushering in deep learning (DL), a subset of ML, as a potentially superior approach in terms of capacity and expressivity. In this work, we carry out a literature review and synthesis of DL-based approaches, covering the entire DL life cycle. The objective is a) to obtain a thorough and in-depth understanding of the methodologies for startup evaluation using DL, and b) to distil valuable and actionable lessons for practitioners. To the best of our knowledge, our work is the first of its kind.
    Asynchronous Distributed Reinforcement Learning for LQR Control via Zeroth-Order Block Coordinate Descent. (arXiv:2107.12416v3 [eess.SY] UPDATED)
    Recently introduced distributed zeroth-order optimization (ZOO) algorithms have shown their utility in distributed reinforcement learning (RL). Unfortunately, in the gradient estimation process, almost all of them require random samples with the same dimension as the global variable and/or require evaluation of the global cost function, which may induce high estimation variance for large-scale networks. In this paper, we propose a novel distributed zeroth-order algorithm by leveraging the network structure inherent in the optimization objective, which allows each agent to estimate its local gradient by local cost evaluation independently, without use of any consensus protocol. The proposed algorithm exhibits an asynchronous update scheme, and is designed for stochastic non-convex optimization with a possibly non-convex feasible domain based on the block coordinate descent method. The algorithm is later employed as a distributed model-free RL algorithm for distributed linear quadratic regulator design, where a learning graph is designed to describe the required interaction relationship among agents in distributed learning. We provide an empirical validation of the proposed algorithm to benchmark its performance on convergence rate and variance against a centralized ZOO algorithm.
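    For reference, the basic two-point zeroth-order gradient estimator underlying such methods looks as follows; this is the generic estimator, not the paper's distributed, asynchronous variant:

        import numpy as np

        def zo_gradient(f, x, n_samples=20, delta=1e-2):
            # Two-point zeroth-order estimate: E[(f(x + d*u) - f(x)) / d * u] is
            # approximately grad f(x), using only cost evaluations, no gradients.
            g = np.zeros_like(x)
            for _ in range(n_samples):
                u = np.random.randn(*x.shape)
                g += (f(x + delta * u) - f(x)) / delta * u
            return g / n_samples

        f = lambda x: (x ** 2).sum()
        x = np.array([1.0, -2.0])
        print(zo_gradient(f, x), "vs analytic", 2 * x)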
    Highly Efficient Real-Time Streaming and Fully On-Device Speaker Diarization with Multi-Stage Clustering. (arXiv:2210.13690v1 [eess.AS])
    While recent research advances in speaker diarization mostly focus on improving the quality of diarization results, there is also an increasing interest in improving the efficiency of diarization systems. In this paper, we propose a multi-stage clustering strategy that uses different clustering algorithms for inputs of different lengths. Specifically, a fallback clusterer is used to handle short-form inputs; a main clusterer is used to handle medium-length inputs; and a pre-clusterer is used to compress long-form inputs before they are processed by the main clusterer. Both the main clusterer and the pre-clusterer can be configured with an upper bound on computational complexity to adapt to devices with different constraints. This multi-stage clustering strategy is critical for streaming on-device speaker diarization systems, where the budgets of CPU, memory and battery are tight.
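    A minimal sketch of the length-based dispatch; the clusterer choices, thresholds, and single-speaker fallback are hypothetical placeholders rather than the paper's configuration:

        import numpy as np
        from sklearn.cluster import AgglomerativeClustering, KMeans

        def diarize(frame_embeddings, short=20, medium=200, n_compressed=200):
            n = len(frame_embeddings)
            if n <= short:                      # fallback clusterer: short-form input
                return np.zeros(n, dtype=int)   # e.g., treat it as a single speaker
            x = frame_embeddings
            if n > medium:                      # pre-clusterer: compress long-form input
                pre = KMeans(n_clusters=n_compressed, n_init=2, random_state=0).fit(x)
                x = pre.cluster_centers_
            labels = AgglomerativeClustering(   # main clusterer
                n_clusters=None, distance_threshold=1.0).fit_predict(x)
            return labels[pre.labels_] if n > medium else labels

        speaker_labels = diarize(np.random.randn(1000, 32))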
    SAS: A Simple, Accurate and Scalable Node Classification Algorithm. (arXiv:2104.09120v2 [cs.LG] UPDATED)
    Graph neural networks have achieved state-of-the-art accuracy for graph node classification. However, GNNs are difficult to scale to large graphs, for example frequently encountering out-of-memory errors even on moderate-sized graphs. Recent works have sought to address this problem using a two-stage approach, which first aggregates data along graph edges, then trains a classifier without using additional graph information. These methods can run on much larger graphs and are orders of magnitude faster than GNNs, but achieve lower classification accuracy. We propose a novel two-stage algorithm based on a simple but effective observation: we should first train a classifier and then aggregate, rather than the other way around. We show our algorithm is faster and can handle larger graphs than existing two-stage algorithms, while achieving comparable or higher accuracy than popular GNNs. We also present a theoretical basis to explain our algorithm's improved accuracy: we give a synthetic nonlinear dataset in which performing aggregation before classification actually decreases accuracy compared to doing classification alone, while our classify-then-aggregate approach substantially improves accuracy compared to classification alone.
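    The classify-then-aggregate order can be sketched as follows, assuming a dense adjacency matrix and row-mean smoothing of predictions (the normalization and number of hops are illustrative choices):

        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def classify_then_aggregate(X, y_train, train_idx, adj, hops=2):
            # Stage 1: train a plain classifier on node features only (no graph).
            clf = LogisticRegression(max_iter=1000).fit(X[train_idx], y_train)
            probs = clf.predict_proba(X)
            # Stage 2: smooth the predictions (not the features) along graph edges.
            deg = adj.sum(axis=1, keepdims=True).clip(min=1)
            for _ in range(hops):
                probs = adj @ probs / deg
            return probs.argmax(axis=1)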
    In-context Reinforcement Learning with Algorithm Distillation. (arXiv:2210.14215v1 [cs.LG])
    We propose Algorithm Distillation (AD), a method for distilling reinforcement learning (RL) algorithms into neural networks by modeling their training histories with a causal sequence model. Algorithm Distillation treats learning to reinforcement learn as an across-episode sequential prediction problem. A dataset of learning histories is generated by a source RL algorithm, and then a causal transformer is trained by autoregressively predicting actions given their preceding learning histories as context. Unlike sequential policy prediction architectures that distill post-learning or expert sequences, AD is able to improve its policy entirely in-context without updating its network parameters. We demonstrate that AD can reinforcement learn in-context in a variety of environments with sparse rewards, combinatorial task structure, and pixel-based observations, and find that AD learns a more data-efficient RL algorithm than the one that generated the source data.
    Adaptive Label Smoothing with Self-Knowledge in Natural Language Generation. (arXiv:2210.13459v1 [cs.LG])
    Overconfidence has been shown to impair generalization and calibration of a neural network. Previous studies remedy this issue by adding a regularization term to a loss function, preventing a model from making a peaked distribution. Label smoothing smoothes target labels with a pre-defined prior label distribution; as a result, a model is learned to maximize the likelihood of predicting the soft label. Nonetheless, the amount of smoothing is the same in all samples and remains fixed in training. In other words, label smoothing does not reflect the change in probability distribution mapped by a model over the course of training. To address this issue, we propose a regularization scheme that brings dynamic nature into the smoothing parameter by taking model probability distribution into account, thereby varying the parameter per instance. A model in training self-regulates the extent of smoothing on the fly during forward propagation. Furthermore, inspired by recent work in bridging label smoothing and knowledge distillation, our work utilizes self-knowledge as a prior label distribution in softening target labels, and presents theoretical support for the regularization effect by knowledge distillation and the dynamic smoothing parameter. Our regularizer is validated comprehensively, and the result illustrates marked improvements in model generalization and calibration, enhancing robustness and trustworthiness of a model.
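    A minimal sketch of per-instance smoothing driven by the model's own confidence, with the model's detached distribution as the soft prior; the schedule eps = max_eps * confidence is a hypothetical choice, not necessarily the paper's:

        import torch
        import torch.nn.functional as F

        def adaptive_smoothing_loss(logits, targets, max_eps=0.2):
            # Smooth more where the model is already confident, using the model's
            # own (detached) distribution as the prior instead of a uniform one.
            probs = F.softmax(logits, dim=-1).detach()
            eps = max_eps * probs.max(dim=-1).values          # per-instance smoothing
            one_hot = F.one_hot(targets, logits.size(-1)).float()
            soft = (1 - eps)[:, None] * one_hot + eps[:, None] * probs
            return -(soft * F.log_softmax(logits, dim=-1)).sum(-1).mean()

        loss = adaptive_smoothing_loss(torch.randn(4, 5), torch.tensor([0, 1, 2, 3]))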
    Active Learning for Single Neuron Models with Lipschitz Non-Linearities. (arXiv:2210.13601v1 [cs.LG])
    We consider the problem of active learning for single neuron models, also sometimes called "ridge functions", in the agnostic setting (under adversarial label noise). Such models have been shown to be broadly effective in modeling physical phenomena, and for constructing surrogate data-driven models for partial differential equations. Surprisingly, we show that for a single neuron model with any Lipschitz non-linearity (such as the ReLU, sigmoid, absolute value, or low-degree polynomials, among others), strong provable approximation guarantees can be obtained using a well-known active learning strategy for fitting linear functions in the agnostic setting, i.e. for the case when there is no non-linearity. Namely, we can collect samples via statistical leverage score sampling, which has been shown to be near-optimal in other active learning scenarios. We support our theoretical results with empirical simulations showing that our proposed active learning strategy based on leverage score sampling outperforms (ordinary) uniform sampling when fitting single neuron models.
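    Leverage score sampling itself is easy to sketch: the scores are the squared row norms of U from a thin SVD of the design matrix, and labels are queried at rows sampled proportionally:

        import numpy as np

        def leverage_scores(A):
            # Leverage score of row i = squared norm of row i of U in the thin SVD.
            U, _, _ = np.linalg.svd(A, full_matrices=False)
            return (U ** 2).sum(axis=1)

        A = np.random.randn(1000, 10)                # candidate query points as rows
        p = leverage_scores(A)
        p /= p.sum()
        queried = np.random.choice(len(A), size=50, replace=True, p=p)  # rows to label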
    Budget-Constrained Bounds for Mini-Batch Estimation of Optimal Transport. (arXiv:2210.13630v1 [cs.LG])
    Optimal Transport (OT) is a fundamental tool for comparing probability distributions, but its exact computation remains prohibitive for large datasets. In this work, we introduce novel families of upper and lower bounds for the OT problem constructed by aggregating solutions of mini-batch OT problems. The upper bound family contains traditional mini-batch averaging at one extreme and a tight bound found by optimal coupling of mini-batches at the other. In between these extremes, we propose various methods to construct bounds based on a fixed computational budget. Through various experiments, we explore the trade-off between computational budget and bound tightness and show the usefulness of these bounds in computer vision applications.
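    The mini-batch averaging extreme of the upper-bound family can be sketched with exact OT on equal-size, uniformly weighted batches via the assignment problem (batch sizes and the squared-Euclidean ground cost are illustrative choices):

        import numpy as np
        from scipy.optimize import linear_sum_assignment
        from scipy.spatial.distance import cdist

        def minibatch_ot_upper_bound(X, Y, batch=64, n_pairs=20, rng=None):
            # Average exact OT costs over random mini-batch pairs; averaging
            # mini-batch solutions upper-bounds the full OT cost.
            rng = rng if rng is not None else np.random.default_rng(0)
            costs = []
            for _ in range(n_pairs):
                xb = X[rng.choice(len(X), batch, replace=False)]
                yb = Y[rng.choice(len(Y), batch, replace=False)]
                C = cdist(xb, yb, metric="sqeuclidean")
                r, c = linear_sum_assignment(C)   # exact OT for uniform weights
                costs.append(C[r, c].mean())
            return float(np.mean(costs))

        print(minibatch_ot_upper_bound(np.random.randn(2000, 2),
                                       np.random.randn(2000, 2) + 1.0))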
    Gaussian Mean Testing Made Simple. (arXiv:2210.13706v1 [math.ST])
    We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \mathbb{R}^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.
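    One natural statistic for this problem, shown for intuition and not necessarily the paper's algorithm, is the inner product of two split-sample means: it concentrates near 0 under the null and near $\|\mu\|_2^2$ under the alternative:

        import numpy as np

        def split_mean_statistic(samples):
            # Inner product of two split-sample means: near 0 under N(0, I_d),
            # and concentrating near ||mu||^2 when the mean is mu.
            half = len(samples) // 2
            return float(samples[:half].mean(0) @ samples[half:].mean(0))

        rng = np.random.default_rng(1)
        print(split_mean_statistic(rng.normal(size=(4000, 100))))          # H0: ~0
        print(split_mean_statistic(rng.normal(size=(4000, 100)) + 0.05))   # H1: ~0.25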
    Feng-Shui Compass: A Modern Exploration of Traditional Chinese Environmental Analysis. (arXiv:2210.13672v1 [cs.LG])
    The technological advancement in data analysis and sensor technology has contributed to a growing knowledge of surrounding environments. Feng Shui, the Chinese philosophy of evaluating an environment and how it influences human well-being, has for thousands of years been assessed only by self-proclaimed specialists. We developed a device as well as a procedure to evaluate the ambient environment of a room, and performed a study that attempts to use sensor data to predict the well-being score of a person in that environment, thereby evaluating the primary aspect of Feng Shui. Our study revealed preliminary results showing great potential for further research with larger experiments.
    MARLlib: Extending RLlib for Multi-agent Reinforcement Learning. (arXiv:2210.13708v1 [cs.LG])
    Despite the fast development of multi-agent reinforcement learning (MARL) methods, there is a lack of commonly-acknowledged baseline implementations and evaluation platforms. As a result, an urgent need for MARL researchers is an integrated library suite, similar to the role of RLlib in single-agent RL, that delivers reliable MARL implementations and replicable evaluation in various benchmarks. To fill this research gap, in this paper, we propose Multi-Agent RLlib (MARLlib), a comprehensive MARL algorithm library that extends RLlib for solving multi-agent problems. With a novel design of agent-level distributed dataflow, MARLlib manages to unify tens of algorithms, including different types of independent learning, centralized critic, and value decomposition methods; this leads to a highly composable integration of MARL algorithms that were not possible to unify before. Furthermore, MARLlib goes beyond current work by integrating diverse environment interfaces and providing flexible parameter sharing strategies; this allows end users to create versatile solutions to cooperative, competitive, and mixed tasks with minimal code modifications. A plethora of experiments are conducted to substantiate the correctness of our implementation, based on which we further derive new insights on the relationship between performance and the design of algorithmic components. With MARLlib, we expect researchers to be able to tackle broader real-world multi-agent problems with trustworthy solutions. Our code (https://github.com/Replicable-MARL/MARLlib) and documentation (https://marllib.readthedocs.io/) are released for reference.
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v4 [cs.LG] UPDATED)
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
    Opportunistic Episodic Reinforcement Learning. (arXiv:2210.13504v1 [cs.LG])
    In this paper, we propose and study opportunistic reinforcement learning - a new variant of reinforcement learning problems where the regret of selecting a suboptimal action varies under an external environmental condition known as the variation factor. When the variation factor is low, so is the regret of selecting a suboptimal action, and vice versa. Our intuition is to exploit more when the variation factor is high, and explore more when the variation factor is low. We demonstrate the benefit of this novel framework for finite-horizon episodic MDPs by designing and evaluating the OppUCRL2 and OppPSRL algorithms. Our algorithms dynamically balance the exploration-exploitation trade-off for reinforcement learning by introducing variation factor-dependent optimism to guide exploration. We establish an $\tilde{O}(HS \sqrt{AT})$ regret bound for the OppUCRL2 algorithm and show through simulations that both OppUCRL2 and OppPSRL outperform their original counterparts.
    MEET: A Monte Carlo Exploration-Exploitation Trade-off for Buffer Sampling. (arXiv:2210.13545v1 [cs.LG])
    Data selection is essential for any data-based optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-value estimation. Consequently, they cannot adapt the sampling strategies, including exploration and exploitation of transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. This is enabled by the uncertainty estimation of the Q-value function, which guides the sampling to explore more significant transitions and, thus, learn a more efficient policy. Experiments on classical control tasks demonstrate stable results across various environments and show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards in terms of convergence and peak performance, by 26% on average.
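    A minimal sketch of uncertainty-guided buffer sampling, with priorities derived from the disagreement of a Q-ensemble (the softmax temperature is a hypothetical choice):

        import numpy as np

        def sample_transitions(q_ensemble, batch=32, beta=1.0):
            # q_ensemble: (n_models, n_transitions) Q-estimates of stored transitions.
            # Sample proportionally to ensemble disagreement so that uncertain,
            # exploration-worthy transitions are replayed more often.
            p = np.exp(beta * q_ensemble.std(axis=0))
            p /= p.sum()
            return np.random.choice(q_ensemble.shape[1], size=batch, p=p)

        idx = sample_transitions(np.random.randn(5, 10000))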
    Understanding the Evolution of Linear Regions in Deep Reinforcement Learning. (arXiv:2210.13611v1 [cs.LG])
    Policies produced by deep reinforcement learning are typically characterised by their learning curves, but they remain poorly understood in many other respects. ReLU-based policies result in a partitioning of the input space into piecewise linear regions. We seek to understand how observed region counts and their densities evolve during deep reinforcement learning using empirical results that span a range of continuous control tasks and policy network dimensions. Intuitively, we may expect that during training, the region density increases in the areas that are frequently visited by the policy, thereby affording fine-grained control. We use recent theoretical and empirical results for the linear regions induced by neural networks in supervised learning settings for grounding and comparison of our results. Empirically, we find that the region density increases only moderately throughout training, as measured along fixed trajectories coming from the final policy. However, the trajectories themselves also increase in length during training, and thus the region densities decrease as seen from the perspective of the current trajectory. Our findings suggest that the complexity of deep reinforcement learning policies does not principally emerge from a significant growth in the complexity of functions observed on-and-around trajectories of the policy.
    On the Robustness of Dataset Inference. (arXiv:2210.13631v1 [cs.LG])
    Machine learning (ML) models are costly to train as they can require a significant amount of data, computational resources and technical expertise. Thus, they constitute valuable intellectual property that needs protection from adversaries wanting to steal them. Ownership verification techniques allow the victims of model stealing attacks to demonstrate that a suspect model was in fact stolen from theirs. Although a number of ownership verification techniques based on watermarking or fingerprinting have been proposed, most of them fall short either in terms of security guarantees (well-equipped adversaries can evade verification) or computational cost. A fingerprinting technique introduced at ICLR '21, Dataset Inference (DI), has been shown to offer better robustness and efficiency than prior methods. The authors of DI provided a correctness proof for linear (suspect) models. However, in the same setting, we prove that DI suffers from high false positives (FPs) -- it can incorrectly identify an independent model trained with non-overlapping data from the same distribution as stolen. We further prove that DI also triggers FPs in realistic, non-linear suspect models. We then confirm empirically that DI leads to FPs, with high confidence. Second, we show that DI also suffers from false negatives (FNs) -- an adversary can fool DI by regularising a stolen model's decision boundaries using adversarial training, thereby leading to an FN. To this end, we demonstrate that DI fails to identify a model adversarially trained from a stolen dataset -- the setting where DI is the hardest to evade. Finally, we discuss the implications of our findings, the viability of fingerprinting-based ownership verification in general, and suggest directions for future work.
    FedGRec: Federated Graph Recommender System with Lazy Update of Latent Embeddings. (arXiv:2210.13686v1 [cs.LG])
    Recommender systems are widely used in industry to improve user experience. Despite great success, they have recently been criticized for collecting private user data. Federated Learning (FL) is a new paradigm for learning on distributed data without direct data sharing. Federated Recommender (FedRec) systems have therefore been proposed to mitigate the privacy concerns of non-distributed recommender systems. However, FedRec systems still exhibit a performance gap relative to their non-distributed counterparts. The main reason is that local clients hold an incomplete user-item interaction graph, so FedRec systems cannot fully utilize indirect user-item interactions. In this paper, we propose the Federated Graph Recommender System (FedGRec) to mitigate this gap. Our FedGRec system can effectively exploit indirect user-item interactions. More precisely, in our system, users and the server explicitly store latent embeddings for users and items, where the latent embeddings summarize different orders of indirect user-item interactions and are used as a proxy for the missing interaction graph during local training. We perform extensive empirical evaluations to verify the efficacy of using latent embeddings as a proxy for the missing interaction graph; the experimental results show superior performance of our system compared to various baselines. A short version of the paper was presented at the FL-NeurIPS'22 workshop (https://federated-learning.org/fl-neurips-2022/).
    LANS: Large-scale Arabic News Summarization Corpus. (arXiv:2210.13600v1 [cs.CL])
    Text summarization has been intensively studied in many languages, and some languages have reached advanced stages. Yet, Arabic Text Summarization (ATS) is still in its developing stages. Existing ATS datasets are either small or lack diversity. We build LANS, a large-scale and diverse dataset for the Arabic Text Summarization task. LANS offers 8.4 million articles and their summaries, extracted from newspaper websites' metadata between 1999 and 2019. The high-quality and diverse summaries are written by journalists from 22 major Arab newspapers and include an eclectic mix of more than 7 topics per source. We conduct an intrinsic evaluation of LANS via both automatic and human evaluation. Human evaluation of 1000 random samples reports 95.4% accuracy for our collected summaries, and automatic evaluation quantifies the diversity and abstractness of the summaries. The dataset is publicly available upon request.
    Bayesian Methods in Automated Vehicle's Car-following Uncertainties: Enabling Strategic Decision Making. (arXiv:2210.13683v1 [eess.SY])
    This paper proposes a methodology to estimate uncertainty in automated vehicle (AV) dynamics in real time via Bayesian inference. Based on the estimated uncertainty, the method aims to continuously monitor the car-following (CF) performance of the AV to support strategic actions that maintain a desired performance. Our methodology consists of three sequential components: (i) Stochastic Gradient Langevin Dynamics (SGLD) is adopted to estimate parameter uncertainty relative to vehicular dynamics in real time, (ii) car-following stability (local and string-wise) is monitored dynamically, and (iii) strategic actions adjust the control if an anomaly is detected. The proposed methodology provides a means to gauge AV car-following performance in real time and to preserve desired performance against real-time uncertainties that are unaccounted for in the vehicle control algorithm.
    Evaluating and Optimizing Hearing-Aid Self-Fitting Methods using Population Coverage. (arXiv:2210.13732v1 [cs.LG])
    Adults with mild-to-moderate hearing loss can use over-the-counter hearing aids to treat their hearing loss at a fraction of traditional hearing care costs. These products incorporate self-fitting methods that allow end-users to configure their hearing aids without the help of an audiologist. A self-fitting method helps users configure the gain-frequency responses that control the amplification for each frequency band of the incoming sound. This paper considers how to design effective self-fitting methods and whether we may evaluate certain aspects of their design without resorting to expensive user studies. Most existing fitting methods provide various user interfaces to allow users to select a configuration from a predetermined set of presets. We propose a novel metric for evaluating the performance of preset-based approaches by computing their population coverage. The population coverage estimates the fraction of users for which it is possible to find a configuration they prefer. A unique aspect of our approach is a probabilistic model that captures how a user's unique preferences differ from other users with similar hearing loss. Next, we develop methods for determining presets to maximize population coverage. Exploratory results demonstrate that the proposed algorithms can effectively select a small number of presets that provide higher population coverage than clustering-based approaches. Moreover, we may use our algorithms to configure the number of increments for slider-based methods.
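    Given samples from such a preference model, population coverage can be estimated by Monte Carlo; the tolerance, gain ranges, and preference model below are hypothetical placeholders:

        import numpy as np

        def population_coverage(user_prefs, presets, tol=3.0):
            # user_prefs: (n_users, n_bands) preferred gain-frequency responses drawn
            # from a preference model; a user is covered if some preset is within
            # `tol` dB of their preference in every band.
            diffs = np.abs(user_prefs[:, None, :] - presets[None, :, :])
            return float((diffs.max(axis=2) <= tol).any(axis=1).mean())

        prefs = np.random.normal(30, 5, size=(1000, 6))   # hypothetical population
        presets = np.random.normal(30, 5, size=(8, 6))    # 8 candidate presets
        print(f"coverage: {population_coverage(prefs, presets):.1%}")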
    Planning with Uncertainty: Deep Exploration in Model-Based Reinforcement Learning. (arXiv:2210.13455v1 [cs.LG])
    Deep model-based Reinforcement Learning (RL) has shown super-human performance in many challenging domains. Low sample efficiency and limited exploration, however, remain leading obstacles in the field. In this paper, we demonstrate deep exploration in model-based RL by incorporating epistemic uncertainty into planning trees, circumventing the standard approach of propagating uncertainty through value learning. We evaluate this approach with the state-of-the-art model-based RL algorithm MuZero, and extend its training process to stabilize learning from explicitly-exploratory trajectories. In our experiments, planning with uncertainty is able to demonstrate effective deep exploration with standard uncertainty estimation mechanisms, and with it significant gains in sample efficiency.
    Scaling up and Stabilizing Differentiable Planning with Implicit Differentiation. (arXiv:2210.13542v1 [cs.LG])
    Differentiable planning promises end-to-end differentiability and adaptivity. However, an issue prevents these methods from scaling up to larger problems: they need to differentiate through the forward iteration layers to compute gradients, which couples the forward computation with backpropagation and forces a trade-off between forward planner performance and the computational cost of the backward pass. To alleviate this issue, we propose to differentiate through the Bellman fixed-point equation to decouple the forward and backward passes for the Value Iteration Network and its variants, which enables a constant backward cost (in the planning horizon) and a flexible forward budget, and helps scale up to large tasks. We study the convergence stability, scalability, and efficiency of the proposed implicit version of VIN and its variants, and demonstrate their superiority on a range of planning tasks: 2D navigation, visual navigation, and 2-DOF manipulation in configuration space and workspace.
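    The decoupling idea can be seen on a toy tabular problem: at the fixed point of value iteration, the implicit function theorem gives the gradient through a single linear solve, independent of how many forward iterations were run. The sketch below treats the greedy policy as locally constant and differentiates the value function with respect to the rewards only; it illustrates implicit differentiation through the Bellman fixed point, not the VIN architecture itself.

        import numpy as np

        def value_iteration(R, P, gamma, iters=500):
            # Forward pass: R is (S, A), P is (A, S, S) row-stochastic.
            V = np.zeros(R.shape[0])
            for _ in range(iters):
                Q = R + gamma * np.tensordot(P, V, axes=([2], [0])).T  # (S, A)
                V = Q.max(axis=1)
            return V, Q.argmax(axis=1)

        def implicit_grad(P, pi, gamma):
            # Backward pass via the fixed point V* = R_pi + gamma * P_pi V*:
            # dV*/dR_pi = (I - gamma * P_pi)^{-1}, one solve, no unrolling.
            S = P.shape[1]
            P_pi = P[pi, np.arange(S), :]
            return np.linalg.inv(np.eye(S) - gamma * P_pi)

        rng = np.random.default_rng(2)
        S, A, gamma = 6, 3, 0.9
        P = rng.dirichlet(np.ones(S), size=(A, S))    # random toy MDP
        R = rng.normal(size=(S, A))
        V, pi = value_iteration(R, P, gamma)
        dV_dR = implicit_grad(P, pi, gamma)           # (S, S) sensitivity matrix
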
    Toward an Intelligent Tutoring System for Argument Mining in Legal Texts. (arXiv:2210.13635v1 [cs.CL])
    We propose an adaptive environment (CABINET) to support caselaw analysis (identifying key argument elements) based on a novel cognitive computing framework that carefully matches various machine learning (ML) capabilities to the proficiency of a user. CABINET supports law students in their learning as well as professionals in their work. The results of our experiments, which focused on the feasibility of the proposed framework, are promising. We show that the system is capable of identifying a potential error in the analysis with a very low false-positive rate (2.0-3.5%), as well as of predicting the key argument element type (e.g., an issue or a holding) with a reasonably high F1-score (0.74).
    Learning Ability of Interpolating Convolutional Neural Networks. (arXiv:2210.14184v1 [stat.ML])
    It is frequently observed that overparameterized neural networks generalize well. Regarding such phenomena, existing theoretical work is mainly devoted to linear settings or fully connected neural networks. This paper studies the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs), in both the underparameterized and overparameterized settings. We establish the best learning rates for underparameterized DCNNs without the parameter restrictions imposed in the literature. We also show that, by adding well-defined layers to an underparameterized DCNN, we can obtain interpolating DCNNs that maintain the good learning rates of the underparameterized DCNN. This result is achieved via a novel network deepening scheme designed for DCNNs. Our work provides theoretical verification of how overfitted DCNNs generalize well.
    Mixed Precision Quantization to Tackle Gradient Leakage Attacks in Federated Learning. (arXiv:2210.13457v1 [cs.LG])
    Federated Learning (FL) enables collaborative model building among a large number of participants without the need for explicit data sharing. However, this approach is vulnerable to privacy inference attacks. In particular, gradient leakage attacks, which reconstruct sensitive data from model gradients with a high success rate, place FL models at particular risk because communication is inherent to their architecture. Most alarmingly, a gradient leakage attack can be performed so covertly that it does not hamper training performance while the attackers backtrack from the gradients to obtain information about the raw data. The two most common countermeasures, homomorphic encryption and noise addition with differential privacy parameters, suffer from major drawbacks: the key generation process becomes tedious as the number of clients grows, and noise-based differential privacy causes a significant drop in global model accuracy. As an alternative, we propose a mixed-precision quantized FL scheme and empirically show that it resolves both issues. In addition, our approach ensures greater robustness, as different layers of the deep model are quantized with different precisions and quantization modes. We empirically validate our method on three benchmark datasets and find only a minimal accuracy drop in the global model after applying quantization.
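    A minimal sketch of the underlying mechanism, quantizing each layer's gradient with a different precision before it would be communicated, might look as follows; the bit assignment and the model are placeholders, not the paper's scheme.

        import torch

        def quantize_uniform(t, bits):
            # Symmetric uniform quantizer: round to 2^bits levels, then dequantize.
            if bits >= 32:
                return t
            qmax = 2 ** (bits - 1) - 1
            scale = t.abs().max().clamp(min=1e-12) / qmax
            return torch.round(t / scale).clamp(-qmax, qmax) * scale

        def quantize_gradients(model, bits_per_param):
            for (name, p), bits in zip(model.named_parameters(), bits_per_param):
                if p.grad is not None:
                    p.grad.data = quantize_uniform(p.grad.data, bits)

        model = torch.nn.Sequential(
            torch.nn.Linear(784, 128), torch.nn.ReLU(), torch.nn.Linear(128, 10))
        x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
        torch.nn.functional.cross_entropy(model(x), y).backward()
        # Hypothetical policy: coarser precision for early layers, which sit
        # closer to the raw inputs that leakage attacks try to reconstruct.
        quantize_gradients(model, bits_per_param=[4, 4, 8, 8])
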
    Active Predictive Coding: A Unified Neural Framework for Learning Hierarchical World Models for Perception and Planning. (arXiv:2210.13461v1 [cs.LG])
    Predictive coding has emerged as a prominent model of how the brain learns through predictions, anticipating the importance accorded to predictive learning in recent AI architectures such as transformers. Here we propose a new framework for predictive coding called active predictive coding which can learn hierarchical world models and solve two radically different open problems in AI: (1) how do we learn compositional representations, e.g., part-whole hierarchies, for equivariant vision? and (2) how do we solve large-scale planning problems, which are hard for traditional reinforcement learning, by composing complex action sequences from primitive policies? Our approach exploits hypernetworks, self-supervised learning and reinforcement learning to learn hierarchical world models that combine task-invariant state transition networks and task-dependent policy networks at multiple abstraction levels. We demonstrate the viability of our approach on a variety of vision datasets (MNIST, FashionMNIST, Omniglot) as well as on a scalable hierarchical planning problem. Our results represent, to our knowledge, the first demonstration of a unified solution to the part-whole learning problem posed by Hinton, the nested reference frames problem posed by Hawkins, and the integrated state-action hierarchy learning problem in reinforcement learning.
    Energy Pricing in P2P Energy Systems Using Reinforcement Learning. (arXiv:2210.13555v1 [cs.LG])
    The increase in renewable energy on the consumer side gives rise to new dynamics in energy grids. Participants in a microgrid can produce energy and trade it with their peers (peer-to-peer) with the permission of the energy provider. In such a scenario, the stochastic nature of distributed renewable energy generators and of energy consumption increases the complexity of defining fair prices for buying and selling energy. In this study, we introduce a reinforcement learning framework to help solve this issue by training an agent to set prices that maximize the profit of all components in the microgrid, aiming to facilitate the implementation of P2P grids in real-life scenarios. The microgrid considers consumers, prosumers, the service provider, and a community battery. Experimental results on the \textit{Pymgrid} dataset show a successful approach to price optimization for all components in the microgrid. The proposed framework offers the flexibility to account for the interests of these components, as well as for the ratio of consumers to prosumers in the microgrid. The results also examine the effect of changing the capacity of the community battery on the profit of the system. The implementation code is available \href{https://github.com/Artifitialleap-MBZUAI/rl-p2p-price-prediction}{here}.
    OpenAUC: Towards AUC-Oriented Open-Set Recognition. (arXiv:2210.13458v1 [cs.LG])
    Traditional machine learning follows a close-set assumption: the training and test sets share the same label space. However, in many practical scenarios it is inevitable that some test samples belong to unknown classes (open-set). To address this issue, Open-Set Recognition (OSR), whose goal is to make correct predictions on both close-set and open-set samples, has attracted rising attention. In this direction, the vast majority of the literature focuses on the pattern of open-set samples; how to evaluate model performance in this challenging task, however, remains unsolved. In this paper, a systematic analysis reveals that most existing metrics are essentially inconsistent with the aforementioned goal of OSR: (1) for metrics extended from close-set classification, such as the open-set F-score, Youden's index, and Normalized Accuracy, a poor open-set prediction can escape a low performance score by virtue of a superior close-set prediction; (2) the novelty-detection AUC, which measures the ranking performance between close-set and open-set samples, ignores close-set performance. To fix these issues, we propose a novel metric named OpenAUC. Compared with existing metrics, OpenAUC enjoys a concise pairwise formulation that evaluates open-set and close-set performance in a coupled manner. Further analysis shows that OpenAUC is free from the aforementioned inconsistencies. Finally, an end-to-end learning method is proposed to minimize the OpenAUC risk, and experimental results on popular benchmark datasets speak to its effectiveness.
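    As we read the pairwise formulation, a (close-set, open-set) sample pair counts as a success when the close-set sample is classified correctly and receives a lower open-set score than the open-set sample; the sketch below implements that reading on synthetic scores and should not be taken as the paper's exact definition.

        import numpy as np

        def open_auc(close_scores, close_correct, open_scores):
            # close_scores / open_scores: open-set scores (higher = 'more open');
            # close_correct: whether the close-set classifier was right.
            ranked = close_scores[:, None] < open_scores[None, :]
            return (close_correct[:, None] & ranked).mean()

        rng = np.random.default_rng(3)
        close_s = rng.normal(0.0, 1.0, 1000)
        open_s = rng.normal(1.0, 1.0, 1000)    # open-set samples scored higher
        correct = rng.random(1000) < 0.9       # 90% close-set accuracy
        print(f"OpenAUC (sketch): {open_auc(close_s, correct, open_s):.3f}")
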
    Towards Formal Approximated Minimal Explanations of Neural Networks. (arXiv:2210.13915v1 [cs.LG])
    With the rapid growth of machine learning, deep neural networks (DNNs) are now being used in numerous domains. Unfortunately, DNNs are "black-boxes", and cannot be interpreted by humans, which is a substantial concern in safety-critical systems. To mitigate this issue, researchers have begun working on explainable AI (XAI) methods, which can identify a subset of input features that are the cause of a DNN's decision for a given input. Most existing techniques are heuristic, and cannot guarantee the correctness of the explanation provided. In contrast, recent and exciting attempts have shown that formal methods can be used to generate provably correct explanations. Although these methods are sound, the computational complexity of the underlying verification problem limits their scalability; and the explanations they produce might sometimes be overly complex. Here, we propose a novel approach to tackle these limitations. We (1) suggest an efficient, verification-based method for finding minimal explanations, which constitute a provable approximation of the global, minimum explanation; (2) show how DNN verification can assist in calculating lower and upper bounds on the optimal explanation; (3) propose heuristics that significantly improve the scalability of the verification process; and (4) suggest the use of bundles, which allows us to arrive at more succinct and interpretable explanations. Our evaluation shows that our approach significantly outperforms state-of-the-art techniques, and produces explanations that are more useful to humans. We thus regard this work as a step toward leveraging verification technology in producing DNNs that are more reliable and comprehensible.
    Comparing neural network training performance between Elixir and Python. (arXiv:2210.13945v1 [cs.LG])
    With a wide range of libraries focused on the machine learning market, such as TensorFlow, NumPy, Pandas, Keras, and others, Python has made a name for itself as one of the main programming languages. In February 2021, Jos\'e Valim and Sean Moriarity published the first version of the Numerical Elixir (Nx) library, a library for tensor operations written in Elixir. Nx aims to make the language a good choice for GPU-intensive operations. This work compares the results of Python and Elixir on training convolutional neural networks (CNNs) using the MNIST and CIFAR-10 datasets, concluding that Python achieved overall better results, and that Elixir is already a viable alternative.
    A Differentiable Relaxation of Graph Segmentation and Alignment for AMR Parsing. (arXiv:2010.12676v2 [cs.CL] UPDATED)
    Abstract Meaning Representation (AMR) is a broad-coverage semantic formalism which represents sentence meaning as a directed acyclic graph. To train most AMR parsers, one needs to segment the graph into subgraphs and align each such subgraph to a word in a sentence; this is normally done at preprocessing, relying on hand-crafted rules. In contrast, we treat both alignment and segmentation as latent variables in our model and induce them as part of end-to-end training. As marginalizing over the structured latent variables is infeasible, we use the variational autoencoding framework. To ensure end-to-end differentiable optimization, we introduce a differentiable relaxation of the segmentation and alignment problems. We observe that inducing segmentation yields substantial gains over using a `greedy' segmentation heuristic. The performance of our method also approaches that of a model that relies on the segmentation rules of \citet{lyu-titov-2018-amr}, which were hand-crafted to handle individual AMR constructions.
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v1 [cs.LG])
    Recent mean field interpretations of learning dynamics in over-parameterized neural networks offer theoretical insights on the empirical success of first order optimization algorithms in finding global minima of the nonconvex risk landscape. In this paper, we explore applying mean field learning dynamics as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow from the learning dynamics in the mean field regime over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.
    Machine learning-based approach for online fault Diagnosis of Discrete Event System. (arXiv:2210.13466v1 [cs.LG])
    The problem considered in this paper is the online diagnosis of Automated Production Systems whose sensors and actuators deliver discrete binary signals, and which can therefore be modeled as Discrete Event Systems. Even though there are numerous diagnosis methods, none of them meets all the criteria of an efficient diagnosis system (an intelligent solution, average effort, reasonable cost, online diagnosis, few false alarms, etc.). In addition, these techniques require either a correct, robust, and representative model of the system, or relevant data, or experts' knowledge requiring continuous updates. In this paper, we propose a Machine Learning-based approach to such a diagnostic system. It is formulated as a multi-class classifier that predicts the plant state (normal or faulty) and, in the case of faulty behavior, which fault has arisen.
    Machine and Deep Learning for IoT Security and Privacy: Applications, Challenges, and Future Directions. (arXiv:2210.13547v1 [cs.CR])
    The Internet of Things (IoT) connects a number of intelligent devices that can interact with one another with minimal human interference. IoT is rapidly emerging as an area of computer science. However, the cross-cutting, multidisciplinary design of IoT systems poses new security problems: the implementations of security protocols, i.e., authentication, encryption, application security, and network access, are often ineffective for IoT systems and carry essential security weaknesses. Current security approaches can also be improved to protect the IoT environment effectively. In recent years, deep learning (DL) and machine learning (ML) have progressed significantly in various critical applications. DL/ML methods are therefore essential for turning IoT system protection from merely enabling safe communication between IoT systems into intelligent security systems. This review provides an extensive analysis of ML systems and state-of-the-art developments in DL methods for improving IoT device protection, and illustrates how new insights in machine and deep learning for IoT security could guide future research. IoT protection risks relating to emerging and essential threats are identified, along with future IoT device attacks and the possible threats associated with each attack surface. We then carefully analyze DL and ML approaches to IoT protection and present each approach's benefits, opportunities, and weaknesses. This review also discusses a number of potential challenges and limitations, and includes future work, recommendations, and suggestions for DL/ML in IoT security.
    pmuBAGE: The Benchmarking Assortment of Generated PMU Data for Power System Events. (arXiv:2210.14204v1 [eess.SY])
    This paper introduces pmuGE (phasor measurement unit Generator of Events), one of the first data-driven generative models for power system event data. We have trained this model on thousands of actual events and created a dataset denoted pmuBAGE (the Benchmarking Assortment of Generated PMU Events). The dataset consists of almost 1000 instances of labeled event data to encourage benchmark evaluations on phasor measurement unit (PMU) data analytics. PMU data are challenging to obtain, especially data covering event periods. Nevertheless, power system problems have recently seen phenomenal advancements via data-driven machine learning solutions, and a highly accessible standard benchmarking dataset would drastically accelerate the development of successful machine learning techniques in this field. We propose a novel learning method based on the Event Participation Decomposition of power system events, which makes it possible to learn a generative model of PMU data during system anomalies. The model can create highly realistic event data without compromising the differential privacy of the PMUs used to train it. The dataset is available online for any researcher or practitioner to use at the pmuBAGE GitHub repository: https://github.com/NanpengYu/pmuBAGE.
    A multi-category inverse design neural network and its application to diblock copolymers. (arXiv:2210.13453v1 [cond-mat.soft])
    In this work, we design a multi-category inverse design neural network that maps ordered periodic structures to physical parameters. The neural network model consists of two parts, a classifier and Structure-Parameter-Mapping (SPM) subnets. The classifier is used to identify structures, and the SPM subnets are used to predict physical parameters for desired structures. We also present an extensible reciprocal-space data augmentation method that guarantees the rotation and translation invariance of periodic structures. We apply the proposed network model and data augmentation method to two-dimensional diblock copolymers based on the Landau-Brazovskii model. Results show that the multi-category inverse design neural network is highly accurate in predicting physical parameters for desired structures. Moreover, the idea of multi-categorization can also be extended to other inverse design problems.
    Computational Inference in Cognitive Science: Operational, Societal and Ethical Considerations. (arXiv:2210.13526v1 [q-bio.NC])
    Emerging research frontiers and computational advances have gradually transformed cognitive science into a multidisciplinary and data-driven field. As a result, there is a proliferation of cognitive theories investigated and interpreted through different academic lenses and at different levels of abstraction. We formulate the applied aspect of this challenge as computational cognitive inference and describe the major routes of computational approaches. To balance the potential optimism with the speed and scale of the data-driven era of cognitive science, we propose to inspect this trend in more empirical terms by identifying the operational challenges, societal impacts, and ethical guidelines in conducting research and interpreting results from computational inference in cognitive science.
    Video based Object 6D Pose Estimation using Transformers. (arXiv:2210.13540v1 [cs.CV])
    We introduce VideoPose, a Transformer-based 6D object pose estimation framework comprising an end-to-end attention-based modelling architecture that attends to previous frames in order to estimate accurate 6D object poses in videos. Our approach leverages the temporal information from a video sequence for pose refinement, while being computationally efficient and robust. Compared to existing methods, our architecture is able to capture and reason about long-range dependencies efficiently, thus iteratively refining over video sequences. Experimental evaluation on the YCB-Video dataset shows that our approach is on par with state-of-the-art Transformer methods and performs significantly better than CNN-based approaches. Further, with a speed of 33 fps, it is also more efficient and therefore applicable to a variety of applications that require real-time object pose estimation. Training code and pretrained models are available at https://github.com/ApoorvaBeedu/VideoPose
    Exploring the impact of weather on Metro demand forecasting using machine learning method. (arXiv:2210.13965v1 [cs.LG])
    Urban rail transit provides significant comprehensive benefits, such as large traffic volume and high speed, serving as one of the most important components of urban traffic construction management and congestion mitigation. Using real passenger flow data of an Asian subway system from April to June of 2018, this work analyzes the space-time distribution of passenger flow using short-term traffic flow prediction. Stations are divided into four types for passenger flow forecasting, and meteorological records are collected for the same period. Machine learning methods with different inputs are then applied, and multivariate regression is performed to evaluate how each weather element improves hourly passenger flow forecasting at representative metro stations. Our results show that adding weather variables enhances prediction precision on weekends, while performance on weekdays improves only marginally; the contributions of the different weather elements also differ, and different categories of stations are affected differently by weather. This study provides a possible method to further improve other prediction models, and attests to the promise of data-driven analytics for optimizing short-term scheduling in transit management.
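    The evaluation idea, comparing a forecaster with and without weather inputs, can be sketched on synthetic data as follows; the features, the synthetic flow model, and the regressor choice are illustrative assumptions only.

        import numpy as np
        from sklearn.ensemble import GradientBoostingRegressor
        from sklearn.metrics import mean_absolute_error

        rng = np.random.default_rng(5)
        n = 2000
        hour = rng.integers(0, 24, n)
        weekend = rng.integers(0, 2, n)
        rain = rng.exponential(1.0, n)                    # placeholder weather
        temp = rng.normal(20, 8, n)
        flow = (500 + 80 * np.sin(hour / 24 * 2 * np.pi)  # synthetic hourly flow
                - 60 * weekend * rain + rng.normal(0, 30, n))

        base = np.column_stack([hour, weekend])
        full = np.column_stack([hour, weekend, rain, temp])
        tr, te = slice(0, 1500), slice(1500, None)
        for name, X in [("no weather", base), ("with weather", full)]:
            m = GradientBoostingRegressor().fit(X[tr], flow[tr])
            print(name, mean_absolute_error(flow[te], m.predict(X[te])))
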
    Parametric PDF for Goodness of Fit. (arXiv:2210.14005v1 [cs.LG])
    Goodness-of-fit methods for classification problems traditionally rely on confusion matrices. This paper aims to enrich these methods with risk-evaluation and stability-analysis tools. For this purpose, we present a parametric PDF framework.
    Referee: Reference-Free Sentence Summarization with Sharper Controllability through Symbolic Knowledge Distillation. (arXiv:2210.13800v1 [cs.CL])
    We present Referee, a novel framework for sentence summarization that can be trained reference-free (i.e., requiring no gold summaries for supervision), while allowing direct control for compression ratio. Our work is the first to demonstrate that reference-free, controlled sentence summarization is feasible via the conceptual framework of Symbolic Knowledge Distillation (West et al., 2022), where latent knowledge in pre-trained language models is distilled via explicit examples sampled from the teacher models, further purified with three types of filters: length, fidelity, and Information Bottleneck. Moreover, we uniquely propose iterative distillation of knowledge, where student models from the previous iteration of distillation serve as teacher models in the next iteration. Starting off from a relatively modest set of GPT3-generated summaries, we demonstrate how iterative knowledge distillation can lead to considerably smaller, but better summarizers with sharper controllability. A useful by-product of this iterative distillation process is a high-quality dataset of sentence-summary pairs with varying degrees of compression ratios. Empirical results demonstrate that the final student models vastly outperform the much larger GPT3-Instruct model in terms of the controllability of compression ratios, without compromising the quality of resulting summarization.
    Learning Latent Structural Causal Models. (arXiv:2210.13583v1 [cs.LG])
    Causal learning has long concerned itself with the accurate recovery of underlying causal mechanisms. Such causal modelling enables better explanations of out-of-distribution data. Prior works on causal learning assume that the high-level causal variables are given. However, in machine learning tasks, one often operates on low-level data like image pixels or high-dimensional vectors. In such settings, the entire Structural Causal Model (SCM) -- structure, parameters, \textit{and} high-level causal variables -- is unobserved and needs to be learnt from low-level data. We treat this problem as Bayesian inference of the latent SCM, given low-level data. For linear Gaussian additive noise SCMs, we present a tractable approximate inference method which performs joint inference over the causal variables, structure and parameters of the latent SCM from random, known interventions. Experiments are performed on synthetic datasets and a causally generated image dataset to demonstrate the efficacy of our approach. We also perform image generation from unseen interventions, thereby verifying out-of-distribution generalization for the proposed causal model.
    Artificial Intelligence-Based Methods for Fusion of Electronic Health Records and Imaging Data. (arXiv:2210.13462v1 [cs.LG])
    Healthcare data are inherently multimodal, including electronic health records (EHR), medical images, and multi-omics data. Combining these multimodal data sources contributes to a better understanding of human health and provides optimal personalized healthcare. Advances in artificial intelligence (AI) technologies, particularly machine learning (ML), enable the fusion of these different data modalities to provide multimodal insights. To this end, in this scoping review, we focus on synthesizing and analyzing the literature that uses AI techniques to fuse multimodal medical data for different clinical applications. More specifically, we focus on studies that fused EHR with medical imaging data to develop various AI methods for clinical applications. We present a comprehensive analysis of the various fusion strategies, the diseases and clinical outcomes for which multimodal fusion was used, the ML algorithms used to perform multimodal fusion for each clinical application, and the available multimodal medical datasets. We followed the PRISMA-ScR guidelines, searched Embase, PubMed, Scopus, and Google Scholar to retrieve relevant studies, and extracted data from the 34 studies that fulfilled the inclusion criteria. In our analysis, a typical workflow was observed: feeding raw data, fusing the different data modalities by applying conventional machine learning (ML) or deep learning (DL) algorithms, and finally evaluating the multimodal fusion through clinical outcome predictions. Early fusion was the most commonly used technique across applications (22 out of 34 studies). We found that multimodal fusion models outperformed traditional single-modality models for the same task. In terms of clinical outcomes, disease diagnosis and prediction were the most common (reported in 20 and 10 studies, respectively).
    Causal Explanation for Reinforcement Learning: Quantifying State and Temporal Importance. (arXiv:2210.13507v1 [cs.AI])
    Explainability plays an increasingly important role in machine learning. Because reinforcement learning (RL) involves interactions between states and actions over time, explaining an RL policy is more challenging than explaining a supervised learning model. Furthermore, humans view the world through a causal lens and thus prefer causal explanations over associational ones. Therefore, in this paper, we develop a causal explanation mechanism that quantifies the causal importance of states on actions and the evolution of such importance over time. Moreover, via a series of simulation studies including crop irrigation, Blackjack, collision avoidance, and lunar lander, we demonstrate the advantages of our mechanism over state-of-the-art associational methods in terms of RL policy explanation.
    Adaptive Top-K in SGD for Communication-Efficient Distributed Learning. (arXiv:2210.13532v1 [cs.LG])
    Distributed stochastic gradient descent (SGD) with gradient compression has emerged as a communication-efficient solution to accelerate distributed learning. Top-K sparsification is one of the most popular gradient compression methods; it sparsifies the gradient to a fixed degree during model training. However, there has been no approach for adaptively adjusting the degree of sparsification to maximize model performance or training speed. This paper addresses the issue by proposing a novel adaptive Top-K SGD framework that enables an adaptive degree of sparsification for each gradient descent step, maximizing convergence performance by exploring the trade-off between communication cost and convergence error. First, we derive an upper bound on the convergence error for the adaptive sparsification scheme and the loss function. Second, we design the algorithm by minimizing the convergence error under communication cost constraints. Finally, numerical results show that the proposed adaptive Top-K SGD achieves a significantly better convergence rate than state-of-the-art methods.
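    A minimal sketch of Top-K sparsification with a step-dependent K is shown below; the schedule for K is a simple placeholder, whereas the paper chooses it by minimizing the derived convergence error bound.

        import torch

        def top_k_sparsify(grad, k):
            # Keep the k largest-magnitude entries of the gradient, zero the rest.
            flat = grad.flatten()
            if k >= flat.numel():
                return grad
            idx = flat.abs().topk(k).indices
            out = torch.zeros_like(flat)
            out[idx] = flat[idx]
            return out.view_as(grad)

        def adaptive_k(step, total_steps, n, k_min=0.01, k_max=0.25):
            # Placeholder schedule: denser gradients early, sparser late.
            frac = k_max + (k_min - k_max) * step / total_steps
            return max(1, int(frac * n))

        g = torch.randn(10_000)
        for step in range(0, 1000, 250):
            k = adaptive_k(step, 1000, g.numel())
            print(step, k, top_k_sparsify(g, k).count_nonzero().item())
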
    Do-calculus enables estimation of causal effects in partially observed biomolecular pathways. (arXiv:2102.06626v2 [cs.LG] UPDATED)
    Estimating causal queries, such as changes in protein abundance in response to a perturbation, is a fundamental task in the analysis of biomolecular pathways. The estimation requires experimental measurements on the pathway components. However, in practice many pathway components are left unobserved (latent) because they are either unknown, or difficult to measure. Latent variable models (LVMs) are well-suited for such estimation. Unfortunately, LVM-based estimation of causal queries can be inaccurate when parameters of the latent variables are not uniquely identified, or when the number of latent variables is misspecified. This has limited the use of LVMs for causal inference in biomolecular pathways. In this manuscript, we propose a general and practical approach for LVM-based estimation of causal queries. We prove that, despite the challenges above, LVM-based estimators of causal queries are accurate if the queries are identifiable according to Pearl's do-calculus, and describe an algorithm for its estimation. We illustrate the breadth and the practical utility of this approach for estimating causal queries in four synthetic and two experimental case studies, where structures of biomolecular pathways challenge the existing methods for causal query estimation. The code and the data documenting all the case studies are available at \url{https://github.com/srtaheri/LVMwithDoCalculus}
    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v1 [cs.LG])
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.
    Topical Segmentation of Spoken Narratives: A Test Case on Holocaust Survivor Testimonies. (arXiv:2210.13783v1 [cs.CL])
    The task of topical segmentation is well studied, but previous work has mostly addressed it in the context of structured, well-defined segments, such as segmentation into paragraphs or chapters, or the segmentation of text that originated from multiple sources. We tackle the task of segmenting running (spoken) narratives, which poses hitherto unaddressed challenges. As a test case, we address Holocaust survivor testimonies given in English. Beyond the importance of studying these testimonies for Holocaust research, we argue that they provide an interesting test case for topical segmentation, due to their unstructured surface level, relative abundance (tens of thousands of such testimonies have been collected), and the relatively confined domain they cover. We hypothesize that boundary points between segments correspond to low mutual information between the sentences preceding and following the boundary. Based on this hypothesis, we explore a range of algorithmic approaches to the task, building on previous segmentation work that uses generative Bayesian modeling and state-of-the-art neural machinery. Compared to manually annotated references, we find that the developed approaches show considerable improvements over previous work.
    Motif-Backdoor: Rethinking the Backdoor Attack on Graph Neural Networks via Motifs. (arXiv:2210.13710v1 [cs.LG])
    Graph neural networks (GNNs), with their powerful representation capability, have been widely applied to various areas, such as biological gene prediction and social recommendation. Recent works have exposed that GNNs are vulnerable to backdoor attacks, i.e., models trained with maliciously crafted training samples are easily fooled by patched samples. Most of the proposed studies launch the backdoor attack using a trigger that is either a randomly generated subgraph (e.g., an erd\H{o}s-r\'enyi backdoor), for a lower computational burden, or a gradient-based generative subgraph (e.g., a graph trojaning attack), for a more effective attack. However, how the trigger structure relates to the effectiveness of the backdoor attack has been overlooked in the current literature. Motifs, recurrent and statistically significant subgraphs, contain rich structural information. In this paper, we rethink the trigger from the perspective of motifs and propose a motif-based backdoor attack, denoted Motif-Backdoor. It contributes in three aspects. (i) Interpretation: it provides an in-depth explanation of backdoor effectiveness in terms of the validity of the trigger structure relative to the graph's motifs, leading to novel insights, e.g., using subgraphs that appear less frequently in the graph as the trigger can achieve better attack performance. (ii) Effectiveness: Motif-Backdoor reaches state-of-the-art (SOTA) attack performance in both black-box and defensive scenarios. (iii) Efficiency: based on the graph motif distribution, Motif-Backdoor can quickly obtain an effective trigger structure without target model feedback or subgraph model generation. Extensive experimental results show that Motif-Backdoor realizes SOTA performance on three popular models and four public datasets compared with five baselines.
    Subspace Recovery from Heterogeneous Data with Non-isotropic Noise. (arXiv:2210.13497v1 [cs.LG])
    Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.
    Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models. (arXiv:2210.14199v1 [cs.LG])
    Language modeling on large-scale datasets leads to impressive performance gains on various downstream language tasks. The validation pre-training loss (or perplexity in autoregressive language modeling) is often used as the evaluation metric when developing language models since the pre-training loss tends to be well-correlated with downstream performance (which is itself difficult to evaluate comprehensively). Contrary to this conventional wisdom, this paper shows that 1) pre-training loss cannot fully explain downstream performance and 2) flatness of the model is well-correlated with downstream performance where pre-training loss is not. On simplified datasets, we identify three ways to produce models with the same (statistically optimal) pre-training loss but different downstream performance: continuing pre-training after convergence, increasing the model size, and changing the training algorithm. These experiments demonstrate the existence of implicit bias of pre-training algorithms/optimizers -- among models with the same minimal pre-training loss, they implicitly prefer more transferable ones. Toward understanding this implicit bias, we prove that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and empirically observe a strong correlation between flatness and downstream performance among models with the same minimal pre-training loss. We also prove in a synthetic language setting that among the models with the minimal pre-training loss, the flattest model transfers to downstream tasks.
    Stable deep MRI reconstruction using Generative Priors. (arXiv:2210.13834v1 [eess.IV])
    Data-driven approaches recently achieved remarkable success in medical image reconstruction, but integration into clinical routine remains challenging due to a lack of generalizability and interpretability. Existing approaches usually require high-quality data-image pairs for training, but such data are not easily available for every imaging protocol, and reconstruction quality can quickly degrade even if only minor changes are made to the protocol. In addition, data-driven methods may create artificial features that can influence the clinician's decision-making. This is unacceptable if the clinician is unaware of the uncertainty associated with the reconstruction. In this paper, we address these challenges in a unified framework based on generative image priors. We propose a novel deep neural network based regularizer which is trained in an unsupervised setting on reference images without requiring any data-image pairs. After training, the regularizer can be used as part of a classical variational approach in combination with any acquisition protocol, and shows stable behavior even if the test data deviates significantly from the training data. Furthermore, our probabilistic interpretation provides a distribution of reconstructions and hence allows uncertainty quantification. We demonstrate our approach on parallel magnetic resonance imaging, where results show competitive performance with state-of-the-art end-to-end deep learning methods, while preserving the flexibility of the acquisition protocol and allowing for uncertainty quantification.
    IELM: An Open Information Extraction Benchmark for Pre-Trained Language Models. (arXiv:2210.14128v1 [cs.CL])
    We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LM). Recent studies have demonstrated that pre-trained LMs, such as BERT and GPT, may store linguistic and relational knowledge. In particular, LMs are able to answer ``fill-in-the-blank'' questions when given a pre-defined relation category. Instead of focusing on pre-defined relations, we create an OIE benchmark aiming to fully examine the open relational information present in the pre-trained LMs. We accomplish this by turning pre-trained LMs into zero-shot OIE systems. Surprisingly, pre-trained LMs are able to obtain competitive performance on both standard OIE datasets (CaRB and Re-OIE2016) and two new large-scale factual OIE datasets (TAC KBP-OIE and Wikidata-OIE) that we establish via distant supervision. For instance, the zero-shot pre-trained LMs outperform the F1 score of the state-of-the-art supervised OIE methods on our factual OIE datasets without needing to use any training sets. Our code and datasets are available at https://github.com/cgraywang/IELM
    Faster Projection-Free Augmented Lagrangian Methods via Weak Proximal Oracle. (arXiv:2210.13968v1 [math.OC])
    This paper considers a convex composite optimization problem with affine constraints, which includes problems that take the form of minimizing a smooth convex objective function over the intersection of (simple) convex sets, or regularized with multiple (simple) functions. Motivated by high-dimensional applications in which exact projection/proximal computations are not tractable, we propose a \textit{projection-free} augmented Lagrangian-based method, in which primal updates are carried out using a \textit{weak proximal oracle} (WPO). In an earlier work, WPO was shown to be more powerful than the standard \textit{linear minimization oracle} (LMO) that underlies conditional gradient-based methods (aka Frank-Wolfe methods). Moreover, WPO is computationally tractable for many high-dimensional problems of interest, including those motivated by recovery of low-rank matrices and tensors, and optimization over polytopes which admit efficient LMOs. The main result of this paper shows that under a certain curvature assumption (which is weaker than strong convexity), our WPO-based algorithm achieves an ergodic rate of convergence of $O(1/T)$ for both the objective residual and feasibility gap. This result, to the best of our knowledge, improves upon the $O(1/\sqrt{T})$ rate for existing LMO-based projection-free methods for this class of problems. Empirical experiments on a low-rank and sparse covariance matrix estimation task and the Max Cut semidefinite relaxation demonstrate the superiority of our method over state-of-the-art LMO-based Lagrangian-based methods.
    Privacy Vulnerability of Split Computing to Data-Free Model Inversion Attacks. (arXiv:2107.06304v2 [cs.LG] UPDATED)
    Mobile edge devices see increased demands in deep neural network (DNN) inference while suffering from stringent constraints in computing resources. Split computing (SC) emerges as a popular approach to this issue by executing only the initial layers on devices and offloading the remaining ones to the cloud. Prior works usually assume that SC offers privacy benefits, as only intermediate features, instead of private data, are shared from devices to the cloud. In this work, we debunk this SC-induced privacy protection by (i) presenting a novel data-free model inversion method and (ii) demonstrating sample inversion, where private data from devices can still be leaked with high fidelity from the shared feature even after tens of neural network layers. We propose Divide-and-Conquer Inversion (DCI), which partitions the given deep network into multiple shallow blocks and inverts each block with an inversion method. Additionally, a cycle-consistency technique is introduced by re-directing the inverted results back to the model under attack in order to better supervise the training of the inversion modules. In contrast to prior art based on generative priors and computation-intensive optimization in deriving inverted samples, DCI removes the need for real device data and generative priors, and completes inversion with a single quick forward pass over the inversion modules. For the first time, we scale data-free and sample-specific inversion to deep architectures and large datasets for both discriminative and generative networks. We perform model inversion attacks on ResNet and RepVGG models on ImageNet and on SNGAN on CelebA, and recover the original input from intermediate features more than 40 layers deep into the network.
    Understanding the Generalization Benefit of Normalization Layers: Sharpness Reduction. (arXiv:2206.07085v2 [cs.LG] UPDATED)
    Normalization layers (e.g., Batch Normalization, Layer Normalization) were introduced to help with optimization difficulties in very deep nets, but they clearly also help generalization, even in not-so-deep nets. Motivated by the long-held belief that flatter minima lead to better generalization, this paper gives mathematical analysis and supporting experiments suggesting that normalization (together with accompanying weight-decay) encourages GD to reduce the sharpness of loss surface. Here "sharpness" is carefully defined given that the loss is scale-invariant, a known consequence of normalization. Specifically, for a fairly broad class of neural nets with normalization, our theory explains how GD with a finite learning rate enters the so-called Edge of Stability (EoS) regime, and characterizes the trajectory of GD in this regime via a continuous sharpness-reduction flow.
    Noise Injection as a Probe of Deep Learning Dynamics. (arXiv:2210.13599v1 [cs.LG])
    We propose a new method to probe the learning mechanism of Deep Neural Networks (DNN) by perturbing the system using Noise Injection Nodes (NINs). These nodes inject uncorrelated noise via additional optimizable weights to existing feed-forward network architectures, without changing the optimization algorithm. We find that the system displays distinct phases during training, dictated by the scale of injected noise. We first derive expressions for the dynamics of the network and utilize a simple linear model as a test case. We find that in some cases, the evolution of the noise nodes is similar to that of the unperturbed loss, thus indicating the possibility of using NINs to learn more about the full system in the future.
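    A minimal sketch of such a node follows: fresh noise is resampled on every forward pass and enters through an optimizable weight, so the optimizer treats the injection scale like any other parameter. The placement and dimensions here are illustrative assumptions.

        import torch
        import torch.nn as nn

        class NoiseInjectionNode(nn.Module):
            # Adds uncorrelated noise through a learnable per-unit weight.
            def __init__(self, dim):
                super().__init__()
                self.weight = nn.Parameter(torch.zeros(dim))
            def forward(self, x):
                return x + self.weight * torch.randn_like(x)

        net = nn.Sequential(
            nn.Linear(20, 64), nn.ReLU(),
            NoiseInjectionNode(64),          # probe inserted between layers
            nn.Linear(64, 1))
        x = torch.randn(8, 20)
        net(x).pow(2).mean().backward()      # gradients flow into the NIN weights
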
    Temporally Disentangled Representation Learning. (arXiv:2210.13647v1 [cs.LG])
    Recently in the field of unsupervised representation learning, strong identifiability results for disentanglement of causally-related latent variables have been established by exploiting certain side information, such as class labels, in addition to independence. However, most existing work is constrained by functional form assumptions, such as independent sources or linear transitions, and distribution assumptions, such as stationarity or exponential family distributions. It is unknown whether the underlying latent variables and their causal relations are identifiable if they have arbitrary, nonparametric causal influences in between. In this work, we establish identifiability theories of nonparametric latent causal processes from their nonlinear mixtures under fixed temporal causal influences and analyze how distribution changes can further benefit the disentanglement. We propose \textbf{\texttt{TDRL}}, a principled framework to recover time-delayed latent causal variables and identify their relations from measured sequential data under stationary environments and under different distribution shifts. Specifically, the framework can factorize unknown distribution shifts into transition distribution changes under fixed and time-varying latent causal relations, and changes in observation. Through experiments, we show that time-delayed latent causal influences are reliably identified and that our approach considerably outperforms existing baselines that do not correctly exploit this modular representation of changes. Our code is available at: \url{https://github.com/weirayao/tdrl}.
    Classification of Misinformation in News Articles using Natural Language Processing and a Recurrent Neural Network. (arXiv:2210.13534v1 [cs.CL])
    This paper addresses the classification of misinformation in news articles using a Long Short-Term Memory (LSTM) recurrent neural network. Articles were taken from 2018, a year filled with reporting on President Donald Trump, Special Counsel Robert Mueller, the FIFA World Cup, and Russia. The model presented successfully classifies these articles with an accuracy score of 0.779944. We consider this successful because the model was trained on articles that included languages other than English, as well as incomplete or fragmented articles.
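    A generic LSTM article classifier of the kind described might be sketched as follows; the vocabulary size, dimensions, and the random batch are placeholders, not the paper's configuration.

        import torch
        import torch.nn as nn

        class MisinfoLSTM(nn.Module):
            # Embed token ids, run an LSTM, classify from the final hidden state.
            def __init__(self, vocab=20_000, embed=128, hidden=64):
                super().__init__()
                self.embed = nn.Embedding(vocab, embed, padding_idx=0)
                self.lstm = nn.LSTM(embed, hidden, batch_first=True)
                self.head = nn.Linear(hidden, 1)
            def forward(self, token_ids):               # (batch, seq_len)
                _, (h_n, _) = self.lstm(self.embed(token_ids))
                return self.head(h_n[-1]).squeeze(-1)   # one logit per article

        model = MisinfoLSTM()
        tokens = torch.randint(1, 20_000, (4, 256))     # placeholder batch
        labels = torch.tensor([0., 1., 1., 0.])         # 1 = misinformation
        loss = nn.functional.binary_cross_entropy_with_logits(model(tokens), labels)
        loss.backward()
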
    Sequential Recommendation with Auxiliary Item Relationships via Multi-Relational Transformer. (arXiv:2210.13572v1 [cs.IR])
    Sequential Recommendation (SR) models user dynamics and predicts the next preferred items based on the user history. Existing SR methods model the 'was interacted before' item-item transitions observed in sequences, which can be viewed as one item relationship. However, multiple auxiliary item relationships exist in real-world scenarios, e.g., items from similar brands or with similar contents. Auxiliary item relationships describe item-item affinities in multiple different semantics and alleviate the long-lasting cold start problem in recommendation, yet modeling them in SR remains a significant challenge. To simultaneously model high-order item-item transitions in sequences and auxiliary item relationships, we propose a Multi-relational Transformer capable of modeling auxiliary item relationships for SR (MT4SR). Specifically, we first propose a novel self-attention module which incorporates arbitrary item relationships and weights them accordingly. Second, we regularize intra-sequence item relationships with a novel regularization module to supervise attention computations. Third, for inter-sequence item relationship pairs, we introduce a novel inter-sequence related items modeling module. Finally, we conduct experiments on four benchmark datasets and demonstrate the effectiveness of MT4SR over state-of-the-art methods, as well as its improvements on the cold start problem. The code is available at https://github.com/zfan20/MT4SR.
    SepLL: Separating Latent Class Labels from Weak Supervision Noise. (arXiv:2210.13898v1 [cs.LG])
    In the weakly supervised learning paradigm, labeling functions automatically assign heuristic, often noisy, labels to data samples. In this work, we provide a method for learning from weak labels by separating two types of complementary information associated with the labeling functions: information related to the target label and information specific to one labeling function only. Both types of information are reflected to different degrees by all labeled instances. In contrast to previous works that aimed at correcting or removing wrongly labeled instances, we learn a branched deep model that uses all data as-is, but splits the labeling function information in the latent space. Specifically, we propose the end-to-end model SepLL, which extends a transformer classifier by introducing a latent space for labeling-function-specific and task-specific information. The learning signal is given only by the labeling function matches; no pre-processing or label model is required for our method. Notably, the task prediction is made from the latent layer without any direct task signal. Experiments on Wrench text classification tasks show that our model is competitive with the state-of-the-art and yields a new best average performance.
    Anisotropic, Sparse and Interpretable Physics-Informed Neural Networks for PDEs. (arXiv:2207.00377v2 [cs.LG] UPDATED)
    There has been growing interest in the use of Deep Neural Networks (DNNs) to solve Partial Differential Equations (PDEs). Despite the promise that such approaches hold, there are various aspects where they could be improved. Two such shortcomings are (i) their computational inefficiency relative to classical numerical methods, and (ii) the non-interpretability of a trained DNN model. In this work we present ASPINN, an anisotropic extension of our earlier work called SPINN--Sparse, Physics-informed, and Interpretable Neural Networks--to solve PDEs that addresses both these issues. ASPINNs generalize radial basis function networks. We demonstrate, using a variety of examples involving elliptic and hyperbolic PDEs, that the special architecture we propose is more efficient than generic DNNs while at the same time being directly interpretable. Further, ASPINNs improve upon the SPINN models we proposed earlier in that fewer nodes are required to capture the solution, thanks to the anisotropy of the local zones of influence of each node. The interpretability of ASPINNs translates to a ready visualization of their weights and biases, thereby yielding more insight into the nature of the trained model. This in turn provides a systematic procedure to improve the architecture based on the quality of the computed solution. ASPINNs thus serve as an effective bridge between classical numerical algorithms and modern DNN-based methods to solve PDEs. In the process, we also streamline the training of ASPINNs into a form that is closer to that of supervised learning algorithms.
    Fast and Low-Memory Deep Neural Networks Using Binary Matrix Factorization. (arXiv:2210.13468v1 [cs.LG])
    Despite the outstanding performance of deep neural networks in different applications, they remain computationally intensive and require large amounts of memory. This motivates research on reducing the resources required for implementing such networks. An efficient approach for this purpose is matrix factorization, which has been shown to be effective on different networks. In this paper, we utilize binary matrix factorization and show its great efficiency in reducing the required number of resources in deep neural networks. In effect, this technique can lead to the practical implementation of such networks.
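    To make the idea concrete, the sketch below approximates a real weight matrix by two binary factors and a single real scale, using a crude SVD-sign heuristic; this only illustrates the memory saving (one bit per factor entry) and is not the paper's factorization algorithm.

        import numpy as np

        def binary_factorize(W, k):
            # W (m x n) ~ a * sign(U_k) @ sign(V_k^T), with one real scale a.
            U, s, Vt = np.linalg.svd(W, full_matrices=False)
            B1 = np.sign(U[:, :k]);  B1[B1 == 0] = 1.0
            B2 = np.sign(Vt[:k, :]); B2[B2 == 0] = 1.0
            prod = B1 @ B2
            a = (W * prod).sum() / (prod * prod).sum()   # least-squares scale
            return a, B1, B2

        rng = np.random.default_rng(4)
        W = rng.normal(size=(256, 256))
        a, B1, B2 = binary_factorize(W, k=32)
        err = np.linalg.norm(W - a * B1 @ B2) / np.linalg.norm(W)
        print(f"relative error: {err:.2f}")  # storage: 2 * 256 * 32 bits + one float
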
    Boosting Kidney Stone Identification in Endoscopic Images Using Two-Step Transfer Learning. (arXiv:2210.13654v1 [cs.CV])
    Knowing the cause of kidney stone formation is crucial to establish treatments that prevent recurrence. There are currently different approaches for determining the kidney stone type. However, the reference ex-vivo identification procedure can take up to several weeks, while in-vivo visual recognition requires highly trained specialists. Machine learning models have been developed to provide urologists with an automated classification of kidney stones during an ureteroscopy; however, there is a general lack of quality in the training data and methods. In this work, a two-step transfer learning approach is used to train the kidney stone classifier. The proposed approach transfers knowledge learned on a set of images of kidney stones acquired with a CCD camera (ex-vivo dataset) to a final model that classifies endoscopic images (in-vivo dataset). The results show that learning features from different domains with similar information helps to improve the performance of a model that performs classification in real conditions (for instance, uncontrolled lighting conditions and blur). Finally, in comparison to models that are trained from scratch or initialized with ImageNet weights, the obtained results suggest that the two-step approach extracts features that improve the identification of kidney stones in endoscopic images.
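    The generic two-step recipe, fine-tuning an ImageNet-initialized backbone first on the CCD images and then on the endoscopic images, might look like the sketch below; the backbone, class count, and the commented-out data loaders are assumptions, not the authors' setup.

        import torch
        import torch.nn as nn
        from torchvision import models

        def fine_tune(model, loader, epochs=5, lr=1e-4):
            opt = torch.optim.Adam(model.parameters(), lr=lr)
            for _ in range(epochs):
                for images, labels in loader:
                    opt.zero_grad()
                    nn.functional.cross_entropy(model(images), labels).backward()
                    opt.step()
            return model

        num_classes = 4                                  # placeholder stone types
        model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        model.fc = nn.Linear(model.fc.in_features, num_classes)

        # ccd_loader / endo_loader stand for hypothetical DataLoaders over the
        # CCD (first-step) and endoscopic (second-step) images, respectively:
        # model = fine_tune(model, ccd_loader)    # step 1
        # model = fine_tune(model, endo_loader)   # step 2
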
  • Open

    Online model error correction with neural networks in the incremental 4D-Var framework. (arXiv:2210.13817v1 [stat.ML])
    Recent studies have demonstrated that it is possible to combine machine learning with data assimilation to reconstruct the dynamics of a physical model partially and imperfectly observed. Data assimilation is used to estimate the system state from the observations, while machine learning computes a surrogate model of the dynamical system based on those estimated states. The surrogate model can be defined as a hybrid combination where a physical model based on prior knowledge is enhanced with a statistical model estimated by a neural network. The training of the neural network is typically done offline, once a large enough dataset of model state estimates is available. By contrast, with online approaches the surrogate model is improved each time a new system state estimate is computed. Online approaches naturally fit the sequential framework encountered in geosciences where new observations become available with time. In a recent methodology paper, we have developed a new weak-constraint 4D-Var formulation which can be used to train a neural network for online model error correction. In the present article, we develop a simplified version of that method, in the incremental 4D-Var framework adopted by most operational weather centres. The simplified method is implemented in the ECMWF Object-Oriented Prediction System, with the help of a newly developed Fortran neural network library, and tested with a two-layer two-dimensional quasi-geostrophic model. The results confirm that online learning is effective and yields a more accurate model error correction than offline learning. Finally, the simplified method is compatible with future applications to state-of-the-art models such as the ECMWF Integrated Forecasting System.
    An approach to the Gaussian RBF kernels via Fock spaces. (arXiv:2210.14167v1 [math-ph])
    We use methods from the Fock space and Segal-Bargmann theories to prove several results on the Gaussian RBF kernel in complex analysis. The latter is one of the most used kernels in modern machine learning kernel methods, and in support vector machine (SVM) classification algorithms. Complex analysis techniques allow us to consider several notions linked to the RBF kernels, like the feature space and the feature map, using the so-called Segal-Bargmann transform. We also show how the RBF kernels can be related to some of the most used operators in quantum mechanics and time-frequency analysis; specifically, we prove the connections of such kernels with creation, annihilation, Fourier, translation, modulation and Weyl operators. For the Weyl operators, we also study a semigroup property.
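    As a concrete reference point, the Gaussian RBF kernel and the classical series expansion behind its monomial feature map can be checked numerically in a few lines (an illustration of the textbook identity in one dimension, not of the paper's Fock-space machinery):

        import math
        import numpy as np

        def rbf_kernel(x, y, sigma=1.0):
            # k(x, y) = exp(-(x - y)^2 / (2 sigma^2)), here in one dimension
            return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

        def rbf_via_feature_map(x, y, sigma=1.0, order=40):
            # Truncation of k(x, y) = e^{-(x^2 + y^2)/(2 sigma^2)} * sum_n (x y / sigma^2)^n / n!,
            # whose feature map is a weighted sequence of monomials (a Fock-like space).
            pref = math.exp(-(x ** 2 + y ** 2) / (2 * sigma ** 2))
            return pref * sum((x * y / sigma ** 2) ** n / math.factorial(n) for n in range(order))

        print(rbf_kernel(0.3, -0.7), rbf_via_feature_map(0.3, -0.7))  # agree to high precision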
    Sharpness-aware Minimization for Worst Case Optimization. (arXiv:2210.13533v1 [cs.LG])
    Improving worst-group performance and generalization performance is a core problem in current machine learning. There are diverse efforts to increase performance, such as weight norm penalties and data augmentation, but the improvements are limited. Recently, there have been two promising approaches for increasing the worst-group performance and the generalization performance, respectively. Distributionally robust optimization (DRO) focuses on the worst or hardest group to improve the worst-group performance. Besides, sharpness-aware minimization (SAM) finds flat minima to increase the generalization ability on an unseen dataset. They show significant performance improvements on the worst-group dataset and the unseen dataset, respectively. However, DRO does not guarantee flatness, and SAM does not guarantee worst-group performance improvement. In other words, DRO and SAM may fail to increase the worst-group performance when a shift between the training and test datasets occurs. In this study, we propose a new approach, sharpness-aware group distributionally robust optimization (SGDRO). SGDRO finds flat minima that generalize well on the worst-group dataset. Different from DRO and SAM, SGDRO contributes to improving the generalization ability even when distribution shift occurs. We validate that SGDRO attains a smaller maximum eigenvalue and improved performance on the worst group.
    Covariance matrix preparation for quantum principal component analysis. (arXiv:2204.03495v2 [quant-ph] UPDATED)
    Principal component analysis (PCA) is a dimensionality reduction method in data analysis that involves diagonalizing the covariance matrix of the dataset. Recently, quantum algorithms have been formulated for PCA based on diagonalizing a density matrix. These algorithms assume that the covariance matrix can be encoded in a density matrix, but a concrete protocol for this encoding has been lacking. Our work aims to address this gap. Assuming amplitude encoding of the data, with the data given by the ensemble $\{p_i,| \psi_i \rangle\}$, then one can easily prepare the ensemble average density matrix $\overline{\rho} = \sum_i p_i |\psi_i\rangle \langle \psi_i |$. We first show that $\overline{\rho}$ is precisely the covariance matrix whenever the dataset is centered. For quantum datasets, we exploit global phase symmetry to argue that there always exists a centered dataset consistent with $\overline{\rho}$, and hence $\overline{\rho}$ can always be interpreted as a covariance matrix. This provides a simple means for preparing the covariance matrix for arbitrary quantum datasets or centered classical datasets. For uncentered classical datasets, our method is so-called "PCA without centering", which we interpret as PCA on a symmetrized dataset. We argue that this closely corresponds to standard PCA, and we derive equations and inequalities that bound the deviation of the spectrum obtained with our method from that of standard PCA. We numerically illustrate our method for the MNIST handwritten digit dataset. We also argue that PCA on quantum datasets is natural and meaningful, and we numerically implement our method for molecular ground-state datasets.
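    The centered-data claim is easy to sanity-check numerically; the sketch below assumes ensemble weights p_i proportional to the squared norms of the data points, under which the average density matrix equals the covariance up to its unit-trace normalization:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(500, 4))
        X = X - X.mean(axis=0)                     # center the dataset
        cov = X.T @ X / len(X)                     # empirical covariance

        norms2 = np.sum(X ** 2, axis=1)
        p = norms2 / norms2.sum()                  # assumed ensemble weights
        psi = X / np.sqrt(norms2)[:, None]         # amplitude-encoded states
        rho = (p[:, None, None] * np.einsum('ni,nj->nij', psi, psi)).sum(axis=0)

        # rho is the covariance up to the unit-trace normalization of density matrices.
        assert np.allclose(rho, cov / np.trace(cov))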
    A New Central Limit Theorem for the Augmented IPW Estimator: Variance Inflation, Cross-Fit Covariance and Beyond. (arXiv:2205.10198v2 [math.ST] UPDATED)
    Estimation of the average treatment effect (ATE) is a central problem in causal inference. In recent times, inference for the ATE in the presence of high-dimensional covariates has been extensively studied. Among the diverse approaches that have been proposed, augmented inverse probability weighting (AIPW) with cross-fitting has emerged as a popular choice in practice. In this work, we study this cross-fit AIPW estimator under well-specified outcome regression and propensity score models in a high-dimensional regime where the number of features and samples are both large and comparable. Under assumptions on the covariate distribution, we establish a new central limit theorem for the suitably scaled cross-fit AIPW that applies without any sparsity assumptions on the underlying high-dimensional parameters. Our CLT uncovers two crucial phenomena among others: (i) the AIPW exhibits a substantial variance inflation that can be precisely quantified in terms of the signal-to-noise ratio and other problem parameters, (ii) the asymptotic covariance between the pre-cross-fit estimators is non-negligible even on the root-n scale. These findings are strikingly different from their classical counterparts. On the technical front, our work utilizes a novel interplay between three distinct tools--approximate message passing theory, the theory of deterministic equivalents, and the leave-one-out approach. We believe our proof techniques should be useful for analyzing other two-stage estimators in this high-dimensional regime. Finally, we complement our theoretical results with simulations that demonstrate both the finite sample efficacy of our CLT and its robustness to our assumptions.
    Proximal Mean Field Learning in Shallow Neural Networks. (arXiv:2210.13879v1 [cs.LG])
    Recent mean field interpretations of learning dynamics in over-parameterized neural networks offer theoretical insights on the empirical success of first order optimization algorithms in finding global minima of the nonconvex risk landscape. In this paper, we explore applying mean field learning dynamics as a computational algorithm, rather than as an analytical tool. Specifically, we design a Sinkhorn regularized proximal algorithm to approximate the distributional flow from the learning dynamics in the mean field regime over weighted point clouds. In this setting, a contractive fixed point recursion computes the time-varying weights, numerically realizing the interacting Wasserstein gradient flow of the parameter distribution supported over the neuronal ensemble. An appealing aspect of the proposed algorithm is that the measure-valued recursions allow meshless computation. We demonstrate the proposed computational framework of interacting weighted particle evolution on binary and multi-class classification. Our algorithm performs gradient descent of the free energy associated with the risk functional.
    Some Simulation and Empirical Results for Semi-Supervised Learning of the Bayes Rule of Allocation. (arXiv:2210.13785v1 [stat.ML])
    There has been increasing attention to semi-supervised learning (SSL) approaches in machine learning for forming a classifier in situations where the training data consists of some feature vectors that have their class labels missing. In this study, we consider the generative model approach proposed by Ahfock & McLachlan (2020), who introduced a framework with a missingness mechanism for the missing labels of the unclassified features. In the case of two multivariate normal classes with a common covariance matrix, they showed that the estimated Bayes' rule formed by this SSL approach can actually have a lower error rate than the one that could be formed from a completely classified sample. In this study we consider this rather surprising result in cases where there may be more than two normal classes with not necessarily common covariance matrices.
    Cost-Effective Online Contextual Model Selection. (arXiv:2207.06030v2 [cs.LG] UPDATED)
    How can we collect the most useful labels to learn a model selection policy, when presented with arbitrary heterogeneous data streams? In this paper, we formulate this task as an online contextual active model selection problem, where at each round the learner receives an unlabeled data point along with a context. The goal is to output the best model for any given context without obtaining an excessive amount of labels. In particular, we focus on the task of selecting pre-trained classifiers, and propose a contextual active model selection algorithm (CAMS), which relies on a novel uncertainty sampling query criterion defined on a given policy class for adaptive model selection. In comparison to prior art, our algorithm does not assume a globally optimal model. We provide rigorous theoretical analysis for the regret and query complexity under both adversarial and stochastic settings. Our experiments on several benchmark classification datasets demonstrate the algorithm's effectiveness in terms of both regret and query complexity. Notably, to achieve the same accuracy, CAMS incurs less than 10% of the label cost when compared to the best online model selection baselines on CIFAR10.
    Whitening Convergence Rate of Coupling-based Normalizing Flows. (arXiv:2210.14032v1 [cs.LG])
    Coupling-based normalizing flows (e.g. RealNVP) are a popular family of normalizing flow architectures that work surprisingly well in practice. This calls for theoretical understanding. Existing work shows that such flows weakly converge to arbitrary data distributions. However, they make no statement about the stricter convergence criterion used in practice, the maximum likelihood loss. For the first time, we make a quantitative statement about this kind of convergence: We prove that all coupling-based normalizing flows perform whitening of the data distribution (i.e. diagonalize the covariance matrix) and derive corresponding convergence bounds that show a linear convergence rate in the depth of the flow. Numerical experiments demonstrate the implications of our theory and point at open questions.
    Pre-training via Denoising for Molecular Property Prediction. (arXiv:2206.00133v2 [cs.LG] UPDATED)
    Many important problems involving molecular property prediction from 3D structures have limited data, posing a generalization challenge for neural networks. In this paper, we describe a pre-training technique based on denoising that achieves a new state-of-the-art in molecular property prediction by utilizing large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks. Relying on the well-known link between denoising autoencoders and score-matching, we show that the denoising objective corresponds to learning a molecular force field -- arising from approximating the Boltzmann distribution with a mixture of Gaussians -- directly from equilibrium structures. Our experiments demonstrate that using this pre-training objective significantly improves performance on multiple benchmarks, achieving a new state-of-the-art on the majority of targets in the widely used QM9 dataset. Our analysis then provides practical insights into the effects of different factors -- dataset sizes, model size and architecture, and the choice of upstream and downstream datasets -- on pre-training.
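    A minimal sketch of the denoising objective (a plain MLP and random tensors stand in for the equivariant network and the equilibrium structures; both are placeholders):

        import torch
        import torch.nn as nn

        n_atoms = 8
        net = nn.Sequential(nn.Linear(3 * n_atoms, 128), nn.SiLU(),
                            nn.Linear(128, 3 * n_atoms))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)

        for step in range(1000):
            coords = torch.randn(32, 3 * n_atoms)      # placeholder equilibrium structures
            noise = 0.1 * torch.randn_like(coords)     # the noise scale sets the force field
            pred = net(coords + noise)                 # predict the added noise, i.e. a
            loss = ((pred - noise) ** 2).mean()        # score / force direction
            opt.zero_grad(); loss.backward(); opt.step()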
    Learning Multi-Objective Curricula for Robotic Policy Learning. (arXiv:2110.03032v3 [cs.LG] CROSS LISTED)
    Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL). They are designed to control how a DRL agent collects data, which is inspired by how humans gradually adapt their learning processes to their capabilities. For example, ACL can be used for subgoal generation, reward shaping, environment generation, or initial state generation. However, prior work only considers curriculum learning following one of the aforementioned predefined paradigms. It is unclear which of these paradigms are complementary, and how the combination of them can be learned from interactions with the environment. Therefore, in this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula that are generated by a set of parametric curriculum modules. Each curriculum module is instantiated as a neural network and is responsible for generating a particular curriculum. In order to coordinate those potentially conflicting modules in unified parameter space, we propose a multi-task hyper-net learning framework that uses a single hyper-net to parameterize all those curriculum modules. In addition to existing hand-designed curricula paradigms, we further design a flexible memory mechanism to learn an abstract curriculum, which may otherwise be difficult to design manually. We evaluate our method on a series of robotic manipulation tasks and demonstrate its superiority over other state-of-the-art ACL methods in terms of sample efficiency and final performance.
    Reconstructing Training Data from Trained Neural Networks. (arXiv:2206.07758v2 [cs.LG] UPDATED)
    Understanding to what extent neural networks memorize training data is an intriguing question with practical and theoretical implications. In this paper we show that in some cases a significant fraction of the training data can in fact be reconstructed from the parameters of a trained neural network classifier. We propose a novel reconstruction scheme that stems from recent theoretical results about the implicit bias in training neural networks with gradient-based methods. To the best of our knowledge, our results are the first to show that reconstructing a large portion of the actual training samples from a trained neural network classifier is generally possible. This has negative implications on privacy, as it can be used as an attack for revealing sensitive training data. We demonstrate our method for binary MLP classifiers on a few standard computer vision datasets.
    Learning Optimal Flows for Non-Equilibrium Importance Sampling. (arXiv:2206.09908v2 [math.ST] UPDATED)
    Many applications in computational sciences and statistical inference require the computation of expectations with respect to complex high-dimensional distributions with unknown normalization constants, as well as the estimation of these constants. Here we develop a method to perform these calculations based on generating samples from a simple base distribution, transporting them by the flow generated by a velocity field, and performing averages along these flowlines. This non-equilibrium importance sampling (NEIS) strategy is straightforward to implement and can be used for calculations with arbitrary target distributions. On the theory side, we discuss how to tailor the velocity field to the target and establish general conditions under which the proposed estimator is a perfect estimator with zero variance. We also draw connections between NEIS and approaches based on mapping a base distribution onto a target via a transport map. On the computational side, we show how to use deep learning to represent the velocity field by a neural network and train it towards the zero variance optimum. These results are illustrated numerically on benchmark examples (with dimension up to $10$), where after training the velocity field, the variance of the NEIS estimator is reduced by up to $6$ orders of magnitude relative to that of a vanilla estimator. We also compare the performances of NEIS with those of Neal's annealed importance sampling (AIS).
    Beyond neural scaling laws: beating power law scaling via data pruning. (arXiv:2206.14486v4 [cs.LG] UPDATED)
    Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how in theory we can break beyond power law scaling and potentially even reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this improved scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling in practice on ResNets trained on CIFAR-10, SVHN, and ImageNet. Next, given the importance of finding high-quality pruning metrics, we perform the first large-scale benchmarking study of ten different data pruning metrics on ImageNet. We find most existing high performing metrics scale poorly to ImageNet, while the best are computationally intensive and require labels for every image. We therefore developed a new simple, cheap and scalable self-supervised pruning metric that demonstrates comparable performance to the best supervised metrics. Overall, our work suggests that the discovery of good data-pruning metrics may provide a viable path forward to substantially improved neural scaling laws, thereby reducing the resource costs of modern deep learning.
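    In the spirit of the paper's prototype-based idea, a simple self-supervised pruning score can be sketched as the distance of an embedding to its nearest k-means centroid, discarding the easiest (closest) examples; the details below are illustrative, not the authors' exact metric.

        import numpy as np
        from sklearn.cluster import KMeans

        def prune_by_prototype_distance(embeddings, keep_frac=0.8, k=100, seed=0):
            # Hardness = distance to the nearest k-means centroid of the embeddings.
            km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(embeddings)
            dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)
            order = np.argsort(dists)                  # easy (small distance) first
            n_keep = int(keep_frac * len(embeddings))
            return order[-n_keep:]                     # indices of retained (hard) examples

        # keep = prune_by_prototype_distance(ssl_embeddings, keep_frac=0.7)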
    Barycentric-alignment and reconstruction loss minimization for domain generalization. (arXiv:2109.01902v5 [cs.LG] UPDATED)
    Domain generalization theory and methods are important for the success of Open World Pattern Recognition. The paper advances the current state of the art in this context by proposing a novel theoretical analysis and a practical algorithm. In particular, we revisit the Domain Generalization (DG) problem, where the hypotheses are composed of a common representation mapping followed by a labeling function. Popular DG methods optimize a well-known upper bound of the risk in the unseen domain to learn both the optimal representation and labeling functions. However, the widely used bound contains a term that is not optimized due to its dual dependence on the representation mapping and the unknown optimal labeling function in the unseen domain. To fill this gap, we derive a new upper bound free of terms having such dual dependence. Our derivation leverages old and recent transport inequalities that link optimal transport metrics with information-theoretic measures. Compared to previous bounds, our bound introduces two new terms: (i) the Wasserstein-2 barycenter term for the distribution alignment between domains and (ii) the reconstruction loss term for measuring how well the data can be reconstructed from its representation. Based on the new upper bound, we propose a novel DG algorithm that simultaneously minimizes the classification loss, the barycenter loss, and the reconstruction loss. Experiments on several datasets demonstrate superior performance of the proposed method compared to the state-of-the-art DG algorithms.
    Mitigating Gradient Bias in Multi-objective Learning: A Provably Convergent Stochastic Approach. (arXiv:2210.12624v1 [cs.LG] CROSS LISTED)
    Machine learning problems with multiple objective functions appear either in learning with multiple criteria where learning has to make a trade-off between multiple performance metrics such as fairness, safety and accuracy; or, in multi-task learning where multiple tasks are optimized jointly, sharing inductive bias between them. These problems are often tackled by the multi-objective optimization framework. However, existing stochastic multi-objective gradient methods and their variants (e.g., MGDA, PCGrad, CAGrad, etc.) all adopt a biased noisy gradient direction, which leads to degraded empirical performance. To this end, we develop a stochastic Multi-objective gradient Correction (MoCo) method for multi-objective optimization. The unique feature of our method is that it can guarantee convergence without increasing the batch size even in the non-convex setting. Simulations on multi-task supervised and reinforcement learning demonstrate the effectiveness of our method relative to state-of-the-art methods.
    Bridging Distributional and Risk-sensitive Reinforcement Learning with Provable Regret Bounds. (arXiv:2210.14051v1 [cs.LG])
    We study the regret guarantee for risk-sensitive reinforcement learning (RSRL) via distributional reinforcement learning (DRL) methods. In particular, we consider finite episodic Markov decision processes whose objective is the entropic risk measure (EntRM) of return. We identify a key property of the EntRM, the monotonicity-preserving property, which enables the risk-sensitive distributional dynamic programming framework. We then propose two novel DRL algorithms that implement optimism through two different schemes, including a model-free one and a model-based one. We prove that both of them attain an $\tilde{\mathcal{O}}(\frac{\exp(|\beta| H)-1}{|\beta|H}H\sqrt{HS^2AT})$ regret upper bound, where $S$ is the number of states, $A$ the number of actions, $H$ the time horizon and $T$ the total number of time steps. It matches RSVI2 proposed in \cite{fei2021exponential} with a much simpler regret analysis. To the best of our knowledge, this is the first regret analysis of DRL, which bridges DRL and RSRL in terms of sample complexity. Finally, we improve the existing lower bound by proving a tighter bound of $\Omega(\frac{\exp(\beta H/6)-1}{\beta H}H\sqrt{SAT})$ for the $\beta>0$ case, which recovers the tight lower bound $\Omega(H\sqrt{SAT})$ in the risk-neutral setting.
    Mixed-Effect Thompson Sampling. (arXiv:2205.15124v2 [cs.LG] UPDATED)
    A contextual bandit is a popular framework for online learning to act under uncertainty. In practice, the number of actions is huge and their expected rewards are correlated. In this work, we introduce a general framework for capturing such correlations through a mixed-effect model where actions are related through multiple shared effect parameters. We propose Mixed-Effect Thompson Sampling (meTS) that uses this structure to explore efficiently and bound its Bayes regret. The regret bound has two terms, one for learning the action parameters and the other for learning the shared effect parameters. The terms reflect the structure of our model and the quality of priors. Our theoretical findings are validated empirically using both synthetic and real-world problems. We also propose numerous extensions of practical interest. While they do not come with guarantees, they perform extremely well empirically and show the generality of the proposed framework.
    A low-rank ensemble Kalman filter for elliptic observations. (arXiv:2203.05120v3 [physics.flu-dyn] UPDATED)
    We propose a regularization method for ensemble Kalman filtering (EnKF) with elliptic observation operators. Commonly used EnKF regularization methods suppress state correlations at long distances. For observations described by elliptic partial differential equations, such as the pressure Poisson equation (PPE) in incompressible fluid flows, distance localization cannot be applied, as we cannot disentangle slowly decaying physical interactions from spurious long-range correlations. This is particularly true for the PPE, in which distant vortex elements couple nonlinearly to induce pressure. Instead, these inverse problems have a low effective dimension: low-dimensional projections of the observations strongly inform a low-dimensional subspace of the state space. We derive a low-rank factorization of the Kalman gain based on the spectrum of the Jacobian of the observation operator. The identified eigenvectors generalize the source and target modes of the multipole expansion, independently of the underlying spatial distribution of the problem. Given rapid spectral decay, inference can be performed in the low-dimensional subspace spanned by the dominant eigenvectors. This low-rank EnKF is assessed on dynamical systems with Poisson observation operators, where we seek to estimate the positions and strengths of point singularities over time from potential or pressure observations. We also comment on the broader applicability of this approach to elliptic inverse problems outside the context of filtering.
    Influence Functions for Sequence Tagging Models. (arXiv:2210.14177v1 [cs.CL])
    Many language tasks (e.g., Named Entity Recognition, Part-of-Speech tagging, and Semantic Role Labeling) are naturally framed as sequence tagging problems. However, there has been comparatively little work on interpretability methods for sequence tagging models. In this paper, we extend influence functions - which aim to trace predictions back to the training points that informed them - to sequence tagging tasks. We define the influence of a training instance segment as the effect that perturbing the labels within this segment has on a test segment level prediction. We provide an efficient approximation to compute this, and show that it tracks with the true segment influence, measured empirically. We show the practical utility of segment influence by using the method to identify systematic annotation errors in two named entity recognition corpora. Code to reproduce our results is available at https://github.com/successar/Segment_Influence_Functions.
    Deep Bayesian Active Learning for Accelerating Stochastic Simulation. (arXiv:2106.02770v6 [cs.LG] UPDATED)
    Stochastic simulations such as large-scale, spatiotemporal, age-structured epidemic models are computationally expensive at fine-grained resolution. While deep surrogate models can speed up the simulations, doing so for stochastic simulations and with active learning approaches is an underexplored area. We propose Interactive Neural Process (INP), a deep Bayesian active learning framework for learning deep surrogate models to accelerate stochastic simulations. INP consists of two components, a spatiotemporal surrogate model built upon the Neural Process (NP) family and an acquisition function for active learning. For surrogate modeling, we develop the Spatiotemporal Neural Process (STNP) to mimic the simulator dynamics. For active learning, we propose a novel acquisition function, Latent Information Gain (LIG), calculated in the latent space of NP-based models. We perform a theoretical analysis and demonstrate that LIG reduces sample complexity compared with random sampling in high dimensions. We also conduct empirical studies on two complex spatiotemporal simulators for reaction diffusion and infectious disease. The results demonstrate that STNP outperforms the baselines in the offline learning setting and LIG achieves the state-of-the-art for Bayesian active learning.  ( 2 min )
    Predicting Survival Outcomes in the Presence of Unlabeled Data. (arXiv:2210.13891v1 [cs.LG])
    Many clinical studies require the follow-up of patients over time. This is challenging: apart from frequently observed drop-out, there are often also organizational and financial challenges, which can lead to reduced data collection and, in turn, can complicate subsequent analyses. In contrast, there is often plenty of baseline data available from patients with similar characteristics and background information, e.g., from patients that fall outside the study time window. In this article, we investigate whether we can benefit from the inclusion of such unlabeled data instances to predict accurate survival times. In other words, we introduce a third level of supervision in the context of survival analysis: apart from fully observed and censored instances, we also include unlabeled instances. We propose three approaches to deal with this novel setting and provide an empirical comparison over fifteen real-life clinical and gene expression survival datasets. Our results demonstrate that all approaches are able to increase the predictive performance over independent test data. We also show that integrating the partial supervision provided by censored data in a semi-supervised wrapper approach generally provides the best results, often achieving high improvements, compared to not using unlabeled data.  ( 2 min )
    Gaussian Mean Testing Made Simple. (arXiv:2210.13706v1 [math.ST])
    We study the following fundamental hypothesis testing problem, which we term Gaussian mean testing. Given i.i.d. samples from a distribution $p$ on $\mathbb{R}^d$, the task is to distinguish, with high probability, between the following cases: (i) $p$ is the standard Gaussian distribution, $\mathcal{N}(0,I_d)$, and (ii) $p$ is a Gaussian $\mathcal{N}(\mu,\Sigma)$ for some unknown covariance $\Sigma$ and mean $\mu \in \mathbb{R}^d$ satisfying $\|\mu\|_2 \geq \epsilon$. Recent work gave an algorithm for this testing problem with the optimal sample complexity of $\Theta(\sqrt{d}/\epsilon^2)$. Both the previous algorithm and its analysis are quite complicated. Here we give an extremely simple algorithm for Gaussian mean testing with a one-page analysis. Our algorithm is sample optimal and runs in sample linear time.  ( 2 min )
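    As an illustration of how simple such a tester can be, consider the folklore pairwise statistic below (not necessarily the paper's algorithm): the average inner product over distinct sample pairs has expectation $\|\mu\|_2^2$, so it concentrates near 0 under $\mathcal{N}(0,I_d)$ and is at least $\epsilon^2$ in expectation under the alternative.

        import numpy as np

        def mean_test_statistic(X):
            # sum_{i != j} <x_i, x_j> = ||sum_i x_i||^2 - sum_i ||x_i||^2
            n = len(X)
            s = X.sum(axis=0)
            return (s @ s - np.einsum('ij,ij->', X, X)) / (n * (n - 1))

        X = np.random.default_rng(0).normal(size=(2000, 50))
        print(mean_test_statistic(X))        # ~ 0 under the null
        print(mean_test_statistic(X + 0.2))  # ~ 50 * 0.2^2 = 2 under a shifted mean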
    Noisy Low-rank Matrix Optimization: Geometry of Local Minima and Convergence Rate. (arXiv:2203.03899v2 [math.OC] UPDATED)
    This paper is concerned with low-rank matrix optimization, which has found a wide range of applications in machine learning. This problem in the special case of matrix sensing has been studied extensively through the notion of Restricted Isometry Property (RIP), leading to a wealth of results on the geometric landscape of the problem and the convergence rate of common algorithms. However, the existing results can handle the problem in the case with a general objective function subject to noisy data only when the RIP constant is close to 0. In this paper, we develop a new mathematical framework to solve the above-mentioned problem with a far less restrictive RIP constant. We prove that as long as the RIP constant of the noiseless objective is less than $1/3$, any spurious local solution of the noisy optimization problem must be close to the ground truth solution. By working through the strict saddle property, we also show that an approximate solution can be found in polynomial time. We characterize the geometry of the spurious local minima of the problem in a local region around the ground truth in the case when the RIP constant is greater than $1/3$. Compared to the existing results in the literature, this paper offers the strongest RIP bound and provides a complete theoretical analysis on the global and local optimization landscapes of general low-rank optimization problems under random corruptions from any finite-variance family.  ( 3 min )
    Pruning's Effect on Generalization Through the Lens of Training and Regularization. (arXiv:2210.13738v1 [cs.LG])
    Practitioners frequently observe that pruning improves model generalization. A long-standing hypothesis based on bias-variance trade-off attributes this generalization improvement to model size reduction. However, recent studies on over-parameterization characterize a new model size regime, in which larger models achieve better generalization. Pruning models in this over-parameterized regime leads to a contradiction -- while theory predicts that reducing model size harms generalization, pruning to a range of sparsities nonetheless improves it. Motivated by this contradiction, we re-examine pruning's effect on generalization empirically. We show that size reduction cannot fully account for the generalization-improving effect of standard pruning algorithms. Instead, we find that pruning leads to better training at specific sparsities, improving the training loss over the dense model. We find that pruning also leads to additional regularization at other sparsities, reducing the accuracy degradation due to noisy examples over the dense model. Pruning extends model training time and reduces model size. These two factors improve training and add regularization respectively. We empirically demonstrate that both factors are essential to fully explaining pruning's impact on generalization.  ( 2 min )
    I Prefer not to Say: Operationalizing Fair and User-guided Data Minimization. (arXiv:2210.13954v1 [cs.LG])
    To grant users greater authority over their personal data, policymakers have suggested tighter data protection regulations (e.g., GDPR, CCPA). One key principle within these regulations is data minimization, which urges companies and institutions to only collect data that is relevant and adequate for the purpose of the data analysis. In this work, we take a user-centric perspective on this regulation, and let individual users decide which data they deem adequate and relevant to be processed by a machine-learned model. We require that users who decide to provide optional information should appropriately benefit from sharing their data, while users who rely on the mandate to leave their data undisclosed should not be penalized for doing so. This gives rise to the overlooked problem of fair treatment between individuals providing additional information and those choosing not to. While the classical fairness literature focuses on fair treatment between advantaged and disadvantaged groups, an initial look at this problem through the lens of classical fairness notions reveals that they are incompatible with these desiderata. We offer a solution to this problem by proposing the notion of Optional Feature Fairness (OFF) that follows from our requirements. To operationalize OFF, we derive a multi-model strategy and a tractable logistic regression model. We analyze the effect and the cost of applying OFF on several real-world data sets.  ( 3 min )
    New Lower Bounds for Private Estimation and a Generalized Fingerprinting Lemma. (arXiv:2205.08532v3 [cs.DS] UPDATED)
    We prove new lower bounds for statistical estimation tasks under the constraint of $(\varepsilon, \delta)$-differential privacy. First, we provide tight lower bounds for private covariance estimation of Gaussian distributions. We show that estimating the covariance matrix in Frobenius norm requires $\Omega(d^2)$ samples, and in spectral norm requires $\Omega(d^{3/2})$ samples, both matching upper bounds up to logarithmic factors. We prove these bounds via our main technical contribution, a broad generalization of the fingerprinting method to exponential families. Additionally, using the private Assouad method of Acharya, Sun, and Zhang, we show a tight $\Omega(d/(\alpha^2 \varepsilon))$ lower bound for estimating the mean of a distribution with bounded covariance to $\alpha$-error in $\ell_2$-distance. Prior known lower bounds for all these problems were either polynomially weaker or held under the stricter condition of $(\varepsilon,0)$-differential privacy.  ( 2 min )
    Learning Ability of Interpolating Convolutional Neural Networks. (arXiv:2210.14184v1 [stat.ML])
    It is frequently observed that overparameterized neural networks generalize well. Regarding such phenomena, existing theoretical work is mainly devoted to linear settings or fully connected neural networks. This paper studies the learning ability of an important family of deep neural networks, deep convolutional neural networks (DCNNs), under underparameterized and overparameterized settings. We establish the best learning rates of underparameterized DCNNs without the parameter restrictions presented in the literature. We also show that, by adding well-defined layers to an underparameterized DCNN, we can obtain some interpolating DCNNs that maintain the good learning rates of the underparameterized DCNN. This result is achieved by a novel network deepening scheme designed for DCNNs. Our work provides theoretical verification of how overfitted DCNNs generalize well.  ( 2 min )
    Deep nurbs -- admissible neural networks. (arXiv:2210.13900v1 [math.NA])
    In this study, we propose a new numerical scheme for physics-informed neural networks (PINNs) that enables precise and inexpensive solutions of partial differential equations (PDEs) for arbitrary geometries while strictly enforcing Dirichlet boundary conditions. The proposed approach combines admissible NURBS parametrizations, required to define the physical domain and the Dirichlet boundary conditions, with a PINN solver. The fundamental boundary conditions are automatically satisfied in this novel Deep NURBS framework. We verified our new approach using two-dimensional elliptic PDEs on arbitrary geometries, including non-Lipschitz domains. Compared to the classical PINN solver, the Deep NURBS estimator has a remarkably high convergence rate for all the studied problems. Moreover, a desirable accuracy was realized for most of the studied PDEs using only one hidden layer of neural networks. This novel approach is expected to pave the way for more effective solutions of high-dimensional problems by allowing for more realistic physics-informed statistical learning to solve PDE-based variational problems.  ( 2 min )
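    The general mechanism for exact Dirichlet enforcement can be sketched on the unit square: write the ansatz u(x) = g(x) + phi(x) * N(x) with phi vanishing on the boundary. (Deep NURBS builds the domain and such functions from NURBS parametrizations, which this generic toy sketch does not reproduce.)

        import torch
        import torch.nn as nn

        net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 1))

        def u(x):
            # phi vanishes on the boundary of [0,1]^2 and g carries the Dirichlet
            # data (homogeneous here), so u satisfies the BCs by construction.
            phi = x[:, :1] * (1 - x[:, :1]) * x[:, 1:] * (1 - x[:, 1:])
            g = torch.zeros_like(phi)
            return g + phi * net(x)

        x = torch.rand(128, 2, requires_grad=True)  # collocation points for a PDE residual
        print(u(x).shape)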
    Mitigating Health Data Poverty: Generative Approaches versus Resampling for Time-series Clinical Data. (arXiv:2210.13958v1 [cs.LG])
    Several approaches have been developed to mitigate algorithmic bias stemming from health data poverty, where minority groups are underrepresented in training datasets. Augmenting the minority class using resampling (such as SMOTE) is a widely used approach due to the simplicity of the algorithms. However, these algorithms decrease data variability and may introduce correlations between samples, giving rise to the use of generative approaches based on GANs. Generation of high-dimensional, time-series, authentic data that provides a wide distribution coverage of the real data remains a challenging task for both resampling and GAN-based approaches. In this work we propose the CA-GAN architecture, which addresses some of the shortcomings of the current approaches, and we provide a detailed comparison with both SMOTE and WGAN-GP*, using a high-dimensional, time-series, real dataset of 3343 hypotensive Caucasian and Black patients. We show that our approach is better at both generating authentic data of the minority class and remaining within the original distribution of the real data.  ( 2 min )
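    For reference, the SMOTE baseline the paper compares against is a one-liner with imbalanced-learn (on synthetic example data; the CA-GAN itself is not reproduced here):

        from collections import Counter
        from sklearn.datasets import make_classification
        from imblearn.over_sampling import SMOTE

        X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
        X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
        print(Counter(y), '->', Counter(y_res))
        # SMOTE interpolates between minority-class neighbours, which is exactly
        # the source of the reduced variability discussed above.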
    Improving Group Lasso for high-dimensional categorical data. (arXiv:2210.14021v1 [stat.ME])
    Sparse modelling or model selection with categorical data is challenging even for a moderate number of variables, because roughly one parameter is needed to encode one category or level. The Group Lasso is a well-known efficient algorithm for selecting continuous or categorical variables, but all estimates related to a selected factor usually differ. Therefore, a fitted model may not be sparse, which makes the model interpretation difficult. To obtain a sparse solution of the Group Lasso we propose the following two-step procedure: first, we reduce data dimensionality using the Group Lasso; then, to choose the final model, we use an information criterion on a small family of models prepared by clustering the levels of individual factors. We investigate the selection correctness of the algorithm in a sparse high-dimensional scenario. We also test our method on synthetic as well as real datasets and show that it performs better than the state-of-the-art algorithms with respect to prediction accuracy or model dimension.  ( 2 min )
    Multi-Fidelity Bayesian Optimization with Unreliable Information Sources. (arXiv:2210.13937v1 [cs.LG])
    Bayesian optimization (BO) is a powerful framework for optimizing black-box, expensive-to-evaluate functions. Over the past decade, many algorithms have been proposed to integrate cheaper, lower-fidelity approximations of the objective function into the optimization process, with the goal of converging towards the global optimum at a reduced cost. This task is generally referred to as multi-fidelity Bayesian optimization (MFBO). However, MFBO algorithms can lead to higher optimization costs than their vanilla BO counterparts, especially when the low-fidelity sources are poor approximations of the objective function, therefore defeating their purpose. To address this issue, we propose rMFBO (robust MFBO), a methodology to make any GP-based MFBO scheme robust to the addition of unreliable information sources. rMFBO comes with a theoretical guarantee that its performance can be bounded relative to its vanilla BO analog, with controllably high probability. We demonstrate the effectiveness of the proposed methodology on a number of numerical benchmarks, outperforming earlier MFBO methods on unreliable sources. We expect rMFBO to be particularly useful to reliably include human experts with varying knowledge within BO processes.  ( 2 min )
    Sequential Decision Making on Unmatched Data using Bayesian Kernel Embeddings. (arXiv:2210.13692v1 [stat.ML])
    The problem of sequentially maximizing the expectation of a function seeks to maximize the expected value of a function of interest without having direct control over its features. Instead, the distribution of such features depends on a given context and an action taken by an agent. In contrast to Bayesian optimization, the arguments of the function are not under the agent's control, but are indirectly determined by the agent's action based on a given context. If the information of the features is to be included in the maximization problem, the full conditional distribution of such features, rather than its expectation only, needs to be accounted for. Furthermore, the function itself is unknown, with only noisy observations of it available, and its estimation may require the use of unmatched data sets. We propose a novel algorithm for the aforementioned problem which takes into consideration the uncertainty derived from the estimation of both the conditional distribution of the features and the unknown function, by modeling the former as a Bayesian conditional mean embedding and the latter as a Gaussian process. Our algorithm empirically outperforms the current state-of-the-art algorithm in the experiments conducted.  ( 2 min )
    A Spectral Method for Assessing and Combining Multiple Data Visualizations. (arXiv:2210.13711v1 [stat.ML])
    Dimension reduction and data visualization aim to project a high-dimensional dataset to a low-dimensional space while capturing the intrinsic structures in the data. It is an indispensable part of modern data science, and many dimensional reduction and visualization algorithms have been developed. However, different algorithms have their own strengths and weaknesses, making it critically important to evaluate their relative performance for a given dataset, and to leverage and combine their individual strengths. In this paper, we propose an efficient spectral method for assessing and combining multiple visualizations of a given dataset produced by diverse algorithms. The proposed method provides a quantitative measure -- the visualization eigenscore -- of the relative performance of the visualizations for preserving the structure around each data point. Then it leverages the eigenscores to obtain a consensus visualization, which has much improved quality over the individual visualizations in capturing the underlying true data structure. Our approach is flexible and works as a wrapper around any visualizations. We analyze multiple simulated and real-world datasets from diverse applications to demonstrate the effectiveness of the eigenscores for evaluating visualizations and the superiority of the proposed consensus visualization. Furthermore, we establish rigorous theoretical justification of our method based on a general statistical framework, yielding fundamental principles behind the empirical success of consensus visualization along with practical guidance.  ( 2 min )
    Temporally Disentangled Representation Learning. (arXiv:2210.13647v1 [cs.LG])
    Recently in the field of unsupervised representation learning, strong identifiability results for disentanglement of causally-related latent variables have been established by exploiting certain side information, such as class labels, in addition to independence. However, most existing work is constrained by functional form assumptions such as independent sources or further with linear transitions, and distribution assumptions such as stationary, exponential family distribution. It is unknown whether the underlying latent variables and their causal relations are identifiable if they have arbitrary, nonparametric causal influences in between. In this work, we establish the identifiability theories of nonparametric latent causal processes from their nonlinear mixtures under fixed temporal causal influences and analyze how distribution changes can further benefit the disentanglement. We propose \textbf{\texttt{TDRL}}, a principled framework to recover time-delayed latent causal variables and identify their relations from measured sequential data under stationary environments and under different distribution shifts. Specifically, the framework can factorize unknown distribution shifts into transition distribution changes under fixed and time-varying latent causal relations, and changes in observation. Through experiments, we show that time-delayed latent causal influences are reliably identified and that our approach considerably outperforms existing baselines that do not correctly exploit this modular representation of changes. Our code is available at: \url{https://github.com/weirayao/tdrl}.  ( 2 min )
    Conditionally Risk-Averse Contextual Bandits. (arXiv:2210.13573v1 [stat.ML])
    We desire to apply contextual bandits to scenarios where average-case statistical guarantees are inadequate. Happily, we discover the composition of reduction to online regression and expectile loss is analytically tractable, computationally convenient, and empirically effective. The result is the first risk-averse contextual bandit algorithm with an online regret guarantee. We state our precise regret guarantee and conduct experiments from diverse scenarios in dynamic pricing, inventory management, and self-tuning software; including results from a production exascale cloud data processing system.  ( 2 min )
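    For reference, the expectile loss in the reduction is just an asymmetric squared loss (a sketch; which tail tau should emphasize depends on whether costs or rewards are regressed):

        import torch

        def expectile_loss(pred, target, tau=0.9):
            # Weight tau on under-predictions and (1 - tau) on over-predictions;
            # tau = 0.5 recovers the squared loss and hence the mean.
            u = target - pred
            w = torch.where(u > 0, torch.full_like(u, tau), torch.full_like(u, 1 - tau))
            return (w * u ** 2).mean()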
    Private Online Prediction from Experts: Separations and Faster Rates. (arXiv:2210.13537v1 [cs.LG])
    Online prediction from experts is a fundamental problem in machine learning and several works have studied this problem under privacy constraints. We propose and analyze new algorithms for this problem that improve over the regret bounds of the best existing algorithms for non-adaptive adversaries. For approximate differential privacy, our algorithms achieve regret bounds of $\tilde{O}(\sqrt{T \log d} + \log d/\varepsilon)$ for the stochastic setting and $\tilde O(\sqrt{T \log d} + T^{1/3} \log d/\varepsilon)$ for oblivious adversaries (where $d$ is the number of experts). For pure DP, our algorithms are the first to obtain sub-linear regret for oblivious adversaries in the high-dimensional regime $d \ge T$. Moreover, we prove new lower bounds for adaptive adversaries. Our results imply that unlike the non-private setting, there is a strong separation between the optimal regret for adaptive and non-adaptive adversaries for this problem. Our lower bounds also show a separation between pure and approximate differential privacy for adaptive adversaries where the latter is necessary to achieve the non-private $O(\sqrt{T})$ regret.  ( 2 min )
    Subspace Recovery from Heterogeneous Data with Non-isotropic Noise. (arXiv:2210.13497v1 [cs.LG])
    Recovering linear subspaces from data is a fundamental and important task in statistics and machine learning. Motivated by heterogeneity in Federated Learning settings, we study a basic formulation of this problem: the principal component analysis (PCA), with a focus on dealing with irregular noise. Our data come from $n$ users with user $i$ contributing data samples from a $d$-dimensional distribution with mean $\mu_i$. Our goal is to recover the linear subspace shared by $\mu_1,\ldots,\mu_n$ using the data points from all users, where every data point from user $i$ is formed by adding an independent mean-zero noise vector to $\mu_i$. If we only have one data point from every user, subspace recovery is information-theoretically impossible when the covariance matrices of the noise vectors can be non-spherical, necessitating additional restrictive assumptions in previous work. We avoid these assumptions by leveraging at least two data points from each user, which allows us to design an efficiently-computable estimator under non-spherical and user-dependent noise. We prove an upper bound for the estimation error of our estimator in general scenarios where the number of data points and amount of noise can vary across users, and prove an information-theoretic error lower bound that not only matches the upper bound up to a constant factor, but also holds even for spherical Gaussian noise. This implies that our estimator does not introduce additional estimation error (up to a constant factor) due to irregularity in the noise. We show additional results for a linear regression problem in a similar setup.  ( 3 min )
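    The leverage gained from two points per user can be seen in a small numpy experiment: the cross-moment of the two points is an unbiased estimate of mu_i mu_i^T regardless of the non-spherical noise, so the top eigenvectors of the user average recover the shared subspace (an illustration of the idea, not the paper's full estimator):

        import numpy as np

        rng = np.random.default_rng(1)
        d, n, r = 20, 2000, 3
        U = np.linalg.qr(rng.normal(size=(d, r)))[0]   # shared r-dimensional subspace
        mus = rng.normal(size=(n, r)) @ U.T            # user means inside the subspace

        scales = rng.uniform(0.5, 2.0, size=(n, d))    # non-spherical, user-dependent noise
        x1 = mus + scales * rng.normal(size=(n, d))
        x2 = mus + scales * rng.normal(size=(n, d))

        # E[x1 x2^T] = mu mu^T since the two noise vectors are independent.
        M = (x1.T @ x2 + x2.T @ x1) / (2 * n)
        U_hat = np.linalg.eigh(M)[1][:, -r:]           # top-r eigenvectors
        print(np.linalg.norm(U_hat.T @ U))             # ~ sqrt(r) if recovered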
    Noise Injection as a Probe of Deep Learning Dynamics. (arXiv:2210.13599v1 [cs.LG])
    We propose a new method to probe the learning mechanism of Deep Neural Networks (DNN) by perturbing the system using Noise Injection Nodes (NINs). These nodes inject uncorrelated noise via additional optimizable weights to existing feed-forward network architectures, without changing the optimization algorithm. We find that the system displays distinct phases during training, dictated by the scale of injected noise. We first derive expressions for the dynamics of the network and utilize a simple linear model as a test case. We find that in some cases, the evolution of the noise nodes is similar to that of the unperturbed loss, thus indicating the possibility of using NINs to learn more about the full system in the future.  ( 2 min )
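    A minimal NIN in PyTorch, following the general description (placement and scaling here are assumptions):

        import torch
        import torch.nn as nn

        class NoiseInjectionNode(nn.Module):
            # Adds w * eps to its input, with eps fresh standard normal noise on
            # every forward pass and w an optimizable weight (initialized at 0).
            def __init__(self, noise_scale=1.0):
                super().__init__()
                self.w = nn.Parameter(torch.zeros(1))
                self.noise_scale = noise_scale

            def forward(self, x):
                if self.training:
                    return x + self.w * self.noise_scale * torch.randn_like(x)
                return x

        layer = nn.Sequential(nn.Linear(10, 64), NoiseInjectionNode(), nn.ReLU())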
    Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup. (arXiv:2210.13512v1 [cs.LG])
    Mixup is a data augmentation technique that relies on training using random convex combinations of data points and their labels. In recent years, Mixup has become a standard primitive used in the training of state-of-the-art image classification models due to its demonstrated benefits over empirical risk minimization with regards to generalization and robustness. In this work, we try to explain some of this success from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class. We also show empirically that these theoretical insights extend to the practical settings of image benchmarks modified to have additional synthetic features.  ( 2 min )
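    Midpoint Mixup fixes the mixing coefficient at 1/2 instead of sampling it from a Beta distribution; a minimal batch construction (an illustrative sketch) looks like this:

        import torch

        def midpoint_mixup(x, y, num_classes):
            # Mix each example with a random partner using lambda = 0.5.
            perm = torch.randperm(x.size(0))
            x_mix = 0.5 * x + 0.5 * x[perm]
            y_onehot = torch.nn.functional.one_hot(y, num_classes).float()
            y_mix = 0.5 * y_onehot + 0.5 * y_onehot[perm]
            return x_mix, y_mix   # train with cross-entropy against the soft targets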

  • Open

    RNN policy trained for the Fetch Brax environment, using the new version 0.3.0 of EvoTorch (evotorch.ai): https://github.com/nnaisense/evotorch/releases/tag/v0.3.0
    submitted by /u/NaturalGradient [link] [comments]  ( 118 min )
    We’ve released EvoTorch 0.3.0, with VecGymNE, memory usage improvements, Colab support and more! VecGymNE enables evolutionary RL with vectorized environments and policies, especially massively parallel simulators like Brax!
    New Features:
    • Vectorized gym support: Added a new problem class, evotorch.neuroevolution.VecGymNE, to solve vectorized gym environments. This new problem class can work with brax environments and can exploit GPU acceleration.
    • PicklingLogger: Added a new logger, evotorch.logging.PicklingLogger, which periodically pickles and saves the current solution to the disk.
    • Python 3.7 support: The Python dependency was lowered from 3.8 to 3.7. Therefore, EvoTorch can now be imported from within a Google Colab notebook.
    Fixes:
    • Fixed a performance issue caused by the undesired cloning of the entire storages of tensor slices.
    • Fixed the signature and the docstrings of the overridable method _do_cross_over(...) of the class evotorch.operators.CrossOver.
    Check out the release and try it now: https://github.com/nnaisense/evotorch/releases/tag/v0.3.0
    Below you can see some videos of agents trained for Brax environments using EvoTorch and vectorized neuroevolution (Humanoid and Fetch environments). submitted by /u/NaturalGradient [link] [comments]  ( 118 min )
    Why did performance max out suddenly? (A2C - Cartpole v1)
    I am learning A2C using "CartPole-v1". While testing, I saw performance max out suddenly. The score seems to fluctuate and improve a little, but then suddenly jumps to 1000. What may cause this? Am I making a logical error somewhere? For reference, I am using TensorFlow's actor-critic implementation, and I don't think I made any major changes. https://preview.redd.it/g7rpct8e61w91.png?width=1093&format=png&auto=webp&s=3a0cc0b508b243f98fdd5056973557b5d53cba64 submitted by /u/notyourjeff [link] [comments]  ( 86 min )
    CORL: Offline Reinforcement Learning Library
    Happy to announce CORL — a library that provides high-quality single-file implementations of Deep Offline Reinforcement Learning algorithms and uses Weights and Biases to track experiments.
    • SOTA algorithms (Decision Transformer, AWAC, BC, CQL, IQL, TD3+BC, SAC-N, EDAC)
    • Benchmarked on widely used D4RL datasets (results match the performances reported in the original papers, sometimes even exceeding them)
    • Configs with hyperparameters for better reproduction
    • Weights&Biases logs for all of the experiments (so that you don’t have to rely solely on final performances from papers)
    github: https://github.com/tinkoff-ai/CORL
    paper: https://arxiv.org/abs/2210.07105 (accepted at NeurIPS, 3rd Offline RL Workshop)
    P.S. Apologies for cross-posting from ML; just in case someone's not following that big subreddit submitted by /u/vkurenkov [link] [comments]  ( 120 min )
    Announcing The Farama Foundation, a new nonprofit maintaining and standardizing open source reinforcement learning environments for the long term (and the new maintainer of Gym, now Gymnasium)
    submitted by /u/jkterry1 [link] [comments]  ( 125 min )
    Agent picking same action every time step
    I have started playing around with custom SB3 environments and managed to create something that appears to work. My environment consists of simulated time-series data, and my agent's job is to predict the next value. The agent starts with a random guess and then observes the actual value. If the agent guesses correctly, it receives a large reward; if it is within a certain threshold, it receives a smaller (but still positive) reward; otherwise it is punished with a negative reward. My code is shown below:
        import gym
        import numpy as np
        import matplotlib.pyplot as plt
        from random import seed
        from random import random
        from gym import spaces
        import os
        import time
        from stable_baselines3 import PPO
        from stable_baselines3.common.env_checker import check_env
        #generate fak…  ( 128 min )
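    Since the post's code is cut off, here is a self-contained sketch of this kind of environment; the class name, wave shape, reward magnitudes, and thresholds are all made up for illustration, and it assumes the older gym API (reset returns only the observation) that SB3 used at the time:
        import numpy as np
        import gym
        from gym import spaces
        from stable_baselines3 import PPO
        from stable_baselines3.common.env_checker import check_env

        class NextValueEnv(gym.Env):
            """Toy environment: the agent predicts the next value of a noisy sine wave."""

            def __init__(self, episode_len=200, threshold=0.1):
                super().__init__()
                self.episode_len = episode_len
                self.threshold = threshold
                self.action_space = spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)
                self.observation_space = spaces.Box(low=-2.0, high=2.0, shape=(1,), dtype=np.float32)

            def _value(self, t):
                # simulated time series: sine wave plus a little noise
                return (np.array([np.sin(0.1 * t)], dtype=np.float32)
                        + np.random.normal(0.0, 0.05, size=1).astype(np.float32))

            def reset(self):
                self.t = 0
                return self._value(self.t)

            def step(self, action):
                self.t += 1
                actual = self._value(self.t)
                error = abs(float(action[0]) - float(actual[0]))
                if error < 0.01:
                    reward = 10.0   # (nearly) exact guess: large reward
                elif error < self.threshold:
                    reward = 1.0    # within the threshold: smaller positive reward
                else:
                    reward = -1.0   # otherwise: punished
                done = self.t >= self.episode_len
                return actual, reward, done, {}

        check_env(NextValueEnv())                                   # sanity-check the env
        model = PPO("MlpPolicy", NextValueEnv(), verbose=0).learn(10_000)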
    Topics in Real Analysis and Functional Analysis for RL (kinda lazy to read all the chapters)?
    submitted by /u/Professional_Card176 [link] [comments]  ( 116 min )
    PPO Episodic Reward Normalised 0 to 1
    Hello community, I'm new to the world of RL, so pardon any mistakes in advance. I have a custom RL environment that defines a continuous finite-horizon problem, meaning there is a fixed episode length. I have a continuous reward (i.e., a per-simulation-step reward) which encourages the agent to perform a task. I'm using the SB3 implementation of PPO. I notice that people suggest reward normalisation. To my understanding, this means that when updating the policy network using a batch of experiences, you use their mean and std to normalise them. However, I believe this is done internally in SB3 (correct me if I'm wrong). From some experiments, I've noticed that normalising the episodic reward to lie within the range 0 to 1 helps the algorithm converge. Is there any evidence that something like this might help with convergence? P.S. To explain a bit better: say the maximum number of actions I want the agent to perform within an episode is 500. My maximum reward per step is then 1/500, so an ideal agent would collect a total of 1. Thanks in advance. submitted by /u/nuki96 [link] [comments]  ( 123 min )
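    For concreteness, the scheme described boils down to scaling the per-step reward so that a perfect episode sums to 1 (numbers hypothetical). As for SB3 internals: PPO normalizes advantages within each minibatch by default, while reward normalization itself is done by the optional VecNormalize wrapper rather than inside PPO.
        MAX_STEPS = 500                    # fixed episode length
        MAX_STEP_REWARD = 1.0 / MAX_STEPS  # a perfect agent totals exactly 1.0

        def scaled_step_reward(raw_reward, raw_reward_max):
            # map the raw per-step reward into [0, 1/MAX_STEPS]
            return MAX_STEP_REWARD * (raw_reward / raw_reward_max)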
  • Open

    [D] Cover Letters are Dead? - Social Impact Tuesday #2
    One of the many things that contemporary language models do well is generating cover letters. A brief google search shows a small mountain of fledgling companies, each of which is built around calling the GPT-3 API to generate cover letters. From “opencoverletter.com”, through “kickresume.com”, all the way to “rezi.ai”, if you want a cover letter, they’ve got you – well – covered. The usefulness of human-written cover letters has been debated for a while in business media. With cover letters being such low-hanging fruit for generative language models, do cover letters still have a place in future company recruitment or scholarship applications? How ethical is using a generative model to help you write a cover letter? If this is the beginning of the end for cover letters, what will replace them? – This is part 2 of a recurring series discussing the social impacts of new machine learning technologies. The current discussions center around a recent survey paper on language models, but feel free to suggest topics for future weeks. submitted by /u/TiredOldCrow [link] [comments]  ( 119 min )
    [D] Tensorflow learning differently than Pytorch
    I have been playing around with both TF and PyTorch for a while, and I noticed that PyTorch in general gives me better results than TensorFlow on a simple binary classification task. Baffled by this, I tried to investigate a little further with a simple comparison: I made two Colab notebooks trying to solve the very same binary classification problem (cats vs. dogs) with both frameworks. I tried to keep the model architectures as similar as possible, as far as I can tell, relying on pre-trained VGG16 weights and allowing training on all the layers. The plots (training = blue line, validation = orange line) show that PyTorch immediately reaches top performance in just one epoch, while TF is not even close after 10 epochs. The learning rate and optimizer are the same. The VGG16 architecture seems a little different in the two frameworks, with a different number of parameters. Am I missing something obvious? TF colab link: https://colab.research.google.com/drive/1YeOlEGNJXXWJ2bkY1kk2iMJ4JCJwpsKm?usp=sharing Pytorch colab link: https://colab.research.google.com/drive/1nSAuyd9x7WAfA3FkwD6NTyBOzuBvGiRQ#scrollTo=Yf22Fq7CyB3F submitted by /u/aleguida [link] [comments]  ( 121 min )
    [P] Object detection model learns backgrounds and not objects
    I'm training a machine learning model using YOLOv5 from Ultralytics (arch: YOLOv5s6). The task is to detect and identify laundry symbols. For that, I've scraped and labeled 600 images from Google. Using this dataset, I get a result with an mAP around 0.6. But 600 images is a tiny dataset, and there are multiple laundry symbols for which I have only 1-4 training images, while others have 100 or more. So I started writing a Python script which generates more images of laundry symbols. The script takes a background image and adds 1-10 randomly positioned laundry symbols in different colors and rotations. No background is used twice. With that script, I generated around 6,000 entirely different images, such that every laundry symbol appears at least 800 times in the dataset. I combined the scraped and generated datasets and retrained the model with the same configuration. The result is really bad: the mAP dropped to 0.15 and the model overfits, and the confusion matrix told me why. Why is the model learning the background instead of the objects? First I thought my annotations might be wrong, but the training script from Ultralytics saves a few examples of training-batch images, and there the boxes are drawn perfectly around the generated symbols. (Examples of the generated data, the confusion matrix, and further training analytics were attached in the original post.) submitted by /u/Waterfront_xD [link] [comments]  ( 123 min )
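    A condensed sketch of the kind of generator described, pasting transparent symbol PNGs onto a background and writing YOLO-format labels (paths, sizes, and counts are hypothetical):
        import random
        from pathlib import Path
        from PIL import Image

        def generate(background_path, symbol_paths, out_stem, size=640):
            bg = Image.open(background_path).convert("RGB").resize((size, size))
            labels = []
            for _ in range(random.randint(1, 10)):
                cls = random.randrange(len(symbol_paths))
                sym = Image.open(symbol_paths[cls]).convert("RGBA")
                scale = random.randint(40, 120)
                sym = sym.resize((scale, scale)).rotate(random.uniform(0, 360), expand=True)
                w, h = sym.size
                x, y = random.randint(0, size - w), random.randint(0, size - h)
                bg.paste(sym, (x, y), sym)  # use the alpha channel as the paste mask
                # YOLO label: class x_center y_center width height, relative to image size
                labels.append(f"{cls} {(x + w / 2) / size:.6f} {(y + h / 2) / size:.6f} "
                              f"{w / size:.6f} {h / size:.6f}")
            bg.save(f"{out_stem}.jpg")
            Path(f"{out_stem}.txt").write_text("\n".join(labels))
    One known pitfall with this style of copy-paste augmentation is the domain gap: the axis-aligned box of a rotated symbol is looser than the symbol itself, and synthetic composites differ statistically from real photos, which may encourage the model to key on context rather than the objects.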
    Kubeflow, Jupyter notebook online. Question to community [D]
    Hi, I'm doing research on how useful a Jupyter notebook in the cloud could be, similar to Google Colab but with per-hour pricing, or full Kubeflow as a service. I don't see too many solutions like that. My target group is individual researchers, students, or people who don't have many resources, similar to how people rent a cheap VPS to have some fun and practice Linux. I know a company that works on this model and earns money. What do you think? What would it need to offer for you to use it instead of Google Colab? Would you be interested in using Jupyter online? Maybe you need a configured Kubeflow with per-hour pricing? The only Kubeflow in the cloud I see is https://www.arrikto.com/kubeflow-as-a-service/ submitted by /u/asokopo [link] [comments]  ( 119 min )
    [N] OpenAI Gym and a bunch of the most used open source RL environments have been consolidated into a single new nonprofit (The Farama Foundation)
    You can read the full announcement post here: https://farama.org/Announcing-The-Farama-Foundation submitted by /u/jkterry1 [link] [comments]  ( 122 min )
    [D] Question: State of the art in automated phonetic speech transcription
    Dear all, my wife is a linguist who does speech transcription by hand. Does anybody know of good literature on machine learning, or even open-source models, for automated speech-to-phonetic transcription? submitted by /u/farnk1 [link] [comments]  ( 116 min )
    Combining image and text embedding [P]
    I am currently working on a database retrieval framework that takes an image and categorical text data, creates an embedding of these, and calculates the distance of this combined embedding to other known datapoints. However, my results seem to be off. So I was wondering: what would be an appropriate way of combining these embeddings? Details about the embedding:
    - image features are embedded with a pretrained VGG19 model
    - categorical text features are embedded as one-hot vectors
    - both embeddings are combined by concatenating the vectors
    So in the end, I get a vector that looks like this: [image embedding (1, 8192) + text embedding (1, 137)]. Use of the embeddings: the embeddings are then used to find the NearestNeighbors by calculating the cosine distance. Question/Issue: would that be an appropriate way of combining features of a sample in n-dimensional space? Are there any other/preferred ways? submitted by /u/External_Oven_6379 [link] [comments]  ( 121 min )
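    One thing worth checking, sketched below with stand-in arrays: with plain concatenation, the 8192-dimensional image block dominates the cosine distance over the 137-dimensional one-hot block, because cosine similarity is driven by whichever block carries most of the vector's norm. Normalizing each block before concatenating gives both modalities a controllable influence (the alpha weight here is a hypothetical knob):
        import numpy as np
        from sklearn.neighbors import NearestNeighbors

        rng = np.random.default_rng(0)
        img_emb = rng.random((1000, 8192)).astype(np.float32)                # stand-in VGG19 features
        cat_emb = np.eye(137, dtype=np.float32)[rng.integers(0, 137, 1000)]  # stand-in one-hot labels

        def combine(img, cat, alpha=0.5):
            # L2-normalize each block, then weight: alpha trades image vs. text influence
            img = img / np.linalg.norm(img, axis=1, keepdims=True)
            cat = cat / np.clip(np.linalg.norm(cat, axis=1, keepdims=True), 1e-9, None)
            return np.hstack([alpha * img, (1.0 - alpha) * cat])

        X = combine(img_emb, cat_emb)
        index = NearestNeighbors(n_neighbors=5, metric="cosine").fit(X)
        distances, neighbours = index.kneighbors(X[:1])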
    [P] AutoPrognosis - A system for automating the design of predictive modeling pipelines tailored for clinical prognosis.
    AutoPrognosis is an AutoML library for tabular data focused on survival analysis or classification tasks. The library can select optimal prediction pipelines for a given dataset and a task type and can handle data missingness using HyperImpute. On top of that, the library provides interpretability and uncertainty information, and can be used to build end-to-end apps on top of Streamlit. Code: https://github.com/vanderschaarlab/autoprognosis submitted by /u/ManagementBig2995 [link] [comments]  ( 117 min )
    [R] How competitive are workshop submissions at ICML?
    To clarify: I mean organizing a workshop, i.e., how competitive workshop proposals are. Thanks :) submitted by /u/ArmandDerech [link] [comments]  ( 130 min )
  • Open

    Learn more about AI here!
    Want to learn more about the wild world of AI? Give us a follow on LinkedIn to stay up to date with the latest innovations, awards, and more from 125+ AI companies! https://www.linkedin.com/company/aipartnershipscorp/ submitted by /u/AIPartnerships [link] [comments]  ( 112 min )
    AI Dream 69 - Interstellar Trippy Exploration
    submitted by /u/LordPewPew777 [link] [comments]  ( 112 min )
    The New AI Hype
    submitted by /u/nunoheart [link] [comments]  ( 111 min )
    Facial Recognition AI for Uploaded Videos
    Hello, a client is giving us a ridiculous amount of footage, and they're asking us to find certain people in long clips, which takes forever to do manually. It would be fantastic if we could use some kind of tool with AI/facial recognition and automatic tagging to run this footage through (as long as it's faster than going through it manually). Do you know of anything like that? Thanks! submitted by /u/Tenkoh [link] [comments]  ( 112 min )
    Looking for an AI Chatbot to interview for Documentary
    I'm creating a documentary about AI and its future. I'm looking for a good AI that'll give me responses with enough depth, and hopefully more than a single sentence. I'll be asking it about its existence, so I want it to answer questions like "What do you see?" and "Who are you?" with enough detail to be interesting to a viewer. Hopefully there's one that can do all this and has a face and voice, but that's not too important. If anyone has any ideas, let me know. submitted by /u/murmuur_ [link] [comments]  ( 112 min )
    History of Humankind (AI generated video)
    This video was generated with AI from 200 carefully engineered prompts, resulting in 7361 images. Let me know what you think. I am happy to answer any questions, and if you want to collaborate, let me know! https://www.youtube.com/watch?v=n-qjJWc21d8 submitted by /u/iamrmc [link] [comments]  ( 112 min )
    Any AIs that I can input a bunch of poems and then have it output new poems similar to the inputs?
    Title says it all. I want to input a bunch of simple poems and have an AI generate new poems inspired by the inputs. Does anything like this exist? If not, where should I begin in order to create and train a model like this? submitted by /u/yea_okay_dude [link] [comments]  ( 115 min )
    Apparently both Character.AI and Novel.AI can't compare shopping prices on sites
    I know I'm an idiot for asking (I've even asked the devs about this); please understand I'm not tech savvy and have severe dyscalculia. Can anyone suggest an AI that works for comparing shopping prices for items and allows you to figure out the item you want for the price you put in? submitted by /u/Realistic_Monk_4703 [link] [comments]  ( 112 min )
    AI images for the masses: Shutterstock and OpenAI partner up
    submitted by /u/much_successes [link] [comments]  ( 114 min )
    Stable Diffusion Infinity Easy Windows Install! Basically Free Dall-E 2 Outpainting....
    submitted by /u/PuppetHere [link] [comments]  ( 114 min )
    New AI Driven Robotics Beat Human Professional Soccer Skills | Breakthrough Google AI Edits Images With Text | New Deep Learning Tech Uses Light Waves
    submitted by /u/kenickh [link] [comments]  ( 112 min )
    I need ideas for a competition I am entering
    Hello, I am entering a competition with a team of friends. It requires us to come up with an idea for a real-world tech application of AI. Our team thought about a lot of ideas, and they either already exist or are too complicated to attempt. Resources to help us learn more about AI would also help. Thanks in advance! submitted by /u/ToonerAnonymous [link] [comments]  ( 117 min )
    How do you turn videos into ai art like this?
    submitted by /u/DIGITALJOSH333 [link] [comments]  ( 112 min )
    Midjourney Creates Logos!
    submitted by /u/MoeHefin [link] [comments]  ( 112 min )
    Is there some kind of 'AI tool' which groups certain similar words?
    Imagine this: I have a list of 400 bug reports. Bug report 1: "The search bar is not working." Bug report 2: "I generally like the website but searching through the search bar is not working." Bug report 3: "My account is locked." Bug report 4: "Searching doesn't work." And so on. Now, as you can see, there are a lot of similarities! For instance, bug reports about the search bar or searching occur quite often (3 out of 4 in this case). But it would be an insane pain to read through all 400 reports and mark down the similarities and occurrences, especially if the list were hundreds of inputs long. Therefore: is there some kind of AI tool where you can input a list and it exports groupings of similar items? So it would tell me: "The word 'search' came up in 3/4 bug reports and is the most reported." submitted by /u/nadir7379 [link] [comments]  ( 118 min )
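    One simple baseline for this, sketched below, is TF-IDF vectors plus hierarchical clustering: reports sharing vocabulary like "search bar" land in the same group. (The distance threshold is a hypothetical knob to tune, and the metric argument assumes scikit-learn 1.2 or newer.)
        from sklearn.cluster import AgglomerativeClustering
        from sklearn.feature_extraction.text import TfidfVectorizer

        reports = [
            "The search bar is not working",
            "I generally like the website but searching through the search bar is not working",
            "My account is locked",
            "Searching doesn't work",
        ]

        # bag-of-words similarity: reports sharing terms like "search" land close together
        X = TfidfVectorizer(stop_words="english").fit_transform(reports).toarray()
        clusterer = AgglomerativeClustering(
            n_clusters=None, distance_threshold=0.8, metric="cosine", linkage="average"
        )
        labels = clusterer.fit_predict(X)

        for cluster in sorted(set(labels)):
            print(cluster, [r for r, lab in zip(reports, labels) if lab == cluster])
    Sentence-embedding models handle paraphrases ("search is broken" vs. "can't find anything") better than plain TF-IDF, at the cost of a heavier dependency.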
    Cyberpunk Vikings
    submitted by /u/nalr00n [link] [comments]  ( 112 min )
  • Open

    Open Images V7 — Now Featuring Point Labels
    Posted by Rodrigo Benenson, Research Scientist, Google Research Open Images is a computer vision dataset covering ~9 million images with labels spanning thousands of object categories. Researchers around the world use Open Images to train and evaluate computer vision models. Since the initial release of Open Images in 2016, which included image-level labels covering 6k categories, we have provided multiple updates to enrich annotations and expand the potential use cases of the dataset. Through several releases, we have added image-level labels for over 20k categories on all images and bounding box annotations, visual relations, instance segmentations, and localized narratives (synchronized voice, mouse trace, and text caption) on a subset of 1.9M images. Today, we are happy to announ…  ( 91 min )
  • Open

    Run inference at scale for OpenFold, a PyTorch-based protein folding ML model, using Amazon EKS
    This post was co-written with Sachin Kadyan, a leading developer of OpenFold. In drug discovery, understanding the 3D structure of proteins is key to assessing the ability of a drug to bind to it, directly impacting its efficacy. Predicting the 3D protein form, however, is very complex, challenging, expensive, and time consuming, and can take […]  ( 13 min )
    Configure DTMF slots and ordered retry prompts with Amazon Lex
    This post walks you through a few new features that make it simple to design a conversational flow entirely within Amazon Lex that adheres to best practices for IVR design related to retry prompting. We also cover how to configure a DTMF-only prompt as well as other attributes like timeouts and barge-in. When designing an […]  ( 6 min )
    Run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints
    As AI adoption is accelerating across the industry, customers are building sophisticated models that take advantage of new scientific breakthroughs in deep learning. These next-generation models allow you to achieve state-of-the-art, human-like performance in the fields of natural language processing (NLP), computer vision, speech recognition, medical research, cybersecurity, protein structure prediction, and many others. For […]  ( 12 min )
  • Open

    State of Data Science and Machine Learning: Kaggle 2022 Survey
    In September, Kaggle released their annual survey on the state of data science and machine learning. Here are some top-level findings I found interesting:
    - An increasing number of data scientists are living and working in India and Japan
    - Python and SQL remain the two most common programming skills for data scientists
    - VSCode is now… Read More »State of Data Science and Machine Learning: Kaggle 2022 Survey The post State of Data Science and Machine Learning: Kaggle 2022 Survey appeared first on Data Science Central.  ( 19 min )
    Snowflake Users and Their Data: A Report on Snowflake Users and How They Optimize Their Data
    Snowflake is one of the most popular cloud data warehouses today. In just one decade, the company has grown to more than 6,800 enterprise customers and $1.2 billion in annual revenue, more than double its prior year.  Like other cloud data warehouses, including Databricks, Amazon RedShift, and Google BigQuery, Snowflake boasts an attractive combination of… Read More »Snowflake Users and Their Data: A Report on Snowflake Users and How They Optimize Their Data The post Snowflake Users and Their Data: A Report on Snowflake Users and How They Optimize Their Data appeared first on Data Science Central.  ( 20 min )
    Data Subassemblies and Data Products Part 3 – Data Product Dev Canvas
    I received outstanding feedback from many folks in updating the Data Product Development canvas, including Jon Cooke, Harsha Srivatsa, Craig (Doc) Savage, Melissa Perry, Raj Yakkali, Erik Burd, and Mark Stouse, among many others. And as I like to say, "All ideas are worthy of consideration," and I did consider all your input as I… Read More »Data Subassemblies and Data Products Part 3 – Data Product Dev Canvas The post Data Subassemblies and Data Products Part 3 – Data Product Dev Canvas appeared first on Data Science Central.  ( 20 min )
    10 Tips to Protect Your Organization Against Ransomware Attacks in 2022
    Are you wondering how to protect your organization’s data against ransomware attacks? You have come to the right place. Avoiding ransomware isn’t a Herculean task, but before we share the specifics of how to prevent ransomware attacks, we would like to talk about what exactly ransomware attacks are.  What Are Ransomware Attacks? There are different… Read More »10 Tips to Protect Your Organization Against Ransomware Attacks in 2022 The post 10 Tips to Protect Your Organization Against Ransomware Attacks in 2022 appeared first on Data Science Central.  ( 21 min )
    What is Enterprise Agility?
    I have done a lot of writing recently about these many referents and the Agile 2 Academy, with whom I am currently working.  I’ve stressed the importance of attaining what I have been calling Business Agility and observed the many ways in which Agile and agile as they are currently practiced don’t necessarily contribute to it.  The post What is Enterprise Agility? appeared first on Data Science Central.  ( 22 min )
  • Open

    Gibbs’ method of determining an orbit
    Josiah Willard Gibbs (1839–1903) was a prominent American scientist at a time when America had yet to produce many prominent scientists. I first heard of him via the Gibbs phenomenon and later by attending one of the Gibbs lectures at an AMS meeting. Gibbs came up with a method of determining an orbit from three observations. As […] Gibbs’ method of determining an orbit first appeared on John D. Cook.  ( 6 min )
    ASQ/ANSI Z1.4 sampling procedures
    I mentioned the other day that the US military standard MIL-STD-105 for statistical sampling procedures lives on in the ASQ/ANSI standard Z1.4. The Department of Defense cancelled their own standard in 1995 in favor of adopting civilian standards, in particular ASQ/ANSI Z1.4. There are two main differences between the military standard and its replacement. First, the […] ASQ/ANSI Z1.4 sampling procedures first appeared on John D. Cook.  ( 5 min )
    Five points determine a conic section
    This post is the first in a series looking at determining an orbit. Lambert’s theorem is often summarized by saying you can determine an orbit from two observations. This statement isn’t true without further assumptions, assumptions I plan to make explicit. A solution to the two-body problem is an orbit given by a conic section, […] Five points determine a conic section first appeared on John D. Cook.  ( 5 min )
  • Open

    7 Underrated Artificial Intelligence Developments That Will Change the World
    The future is here, and it's coming for your job. That's the claim of many pundits, who argue that artificial intelligence (AI) will…  ( 13 min )
    Best AI Writers for Students - a Comprehensive List
    It’s a good time to be alive. Not only do we have the privilege of witnessing history in the making, but we also have access to some of the…  ( 23 min )
    What Are the Best Ways to Use AI Story Generators for Your Business?
    As businesses strive to create more engaging content, many are turning to AI story generators as a way to produce high-quality content…  ( 13 min )
    Designing great AI products — AI roles, jobs, and tasks
    When incorporating AI into your workflows, you can also think of AI as a part of the organization and assign it specific roles. You can…  ( 12 min )
    IPFS: The Web3 Data Storage Revolution
    A quick and comprehensive guide to storage advantages in a Web3 Ecosystem.  ( 8 min )
    How to Measure the Capabilities of Mobile Games Through the Use of Automation Testing?
    Testing is essential to guaranteeing success in the very cutthroat environment of the mobile app and game industry. Even if you don’t give…  ( 9 min )
  • Open

    3D Artist SouthernShotty Creates Wholesome Characters This Week ‘In the NVIDIA Studio’
    This week 'In the NVIDIA Studio,' we’re highlighting 3D and motion graphics artist SouthernShotty — and scenes from his soon-to-be released short film, Watermelon Girl.  The post 3D Artist SouthernShotty Creates Wholesome Characters This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Oracle Teacher: Leveraging Target Information for Better Knowledge Distillation of CTC Models. (arXiv:2111.03664v3 [cs.LG] UPDATED)
    Knowledge distillation (KD), best known as an effective method for model compression, aims at transferring the knowledge of a bigger network (teacher) to a much smaller network (student). Conventional KD methods usually employ a teacher model trained in a supervised manner, where output labels are treated only as targets. Extending this supervised scheme further, we introduce a new type of teacher model for connectionist temporal classification (CTC)-based sequence models, namely Oracle Teacher, that leverages both the source inputs and the output labels as the teacher model's input. Since the Oracle Teacher learns a more accurate CTC alignment by referring to the target information, it can provide the student with more optimal guidance. One potential risk of the proposed approach is a trivial solution in which the model's output directly copies the target input. Based on a many-to-one mapping property of the CTC algorithm, we present a training strategy that effectively prevents this trivial solution and thus enables utilizing both source and target inputs for model training. Extensive experiments are conducted on two sequence learning tasks: speech recognition and scene text recognition. The experimental results empirically show that the proposed model improves the students across these tasks while achieving a considerable speed-up in the teacher model's training time.  ( 3 min )
    On Elimination Strategies for Bandit Fixed-Confidence Identification. (arXiv:2205.10936v2 [cs.LG] UPDATED)
    Elimination algorithms for bandit identification, which prune the plausible correct answers sequentially until only one remains, are computationally convenient since they reduce the problem size over time. However, existing elimination strategies are often not fully adaptive (they update their sampling rule infrequently) and are not easy to extend to combinatorial settings, where the set of answers is exponentially large in the problem dimension. On the other hand, most existing fully-adaptive strategies to tackle general identification problems are computationally demanding since they repeatedly test the correctness of every answer, without ever reducing the problem size. We show that adaptive methods can be modified to use elimination in both their stopping and sampling rules, hence obtaining the best of these two worlds: the algorithms (1) remain fully adaptive, (2) suffer a sample complexity that is never worse than that of their non-elimination counterparts, and (3) provably eliminate certain wrong answers early. We confirm these benefits experimentally, where elimination significantly improves the computational complexity of adaptive methods on common tasks like best-arm identification in linear bandits.  ( 2 min )
    Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models. (arXiv:2205.12694v2 [cs.CL] UPDATED)
    Model compression by way of parameter pruning, quantization, or distillation has recently gained popularity as an approach for reducing the computational requirements of modern deep neural network models for NLP. Inspired by prior works suggesting a connection between simpler, more generalizable models and those that lie within wider loss basins, we hypothesize that optimizing for flat minima should lead to simpler parameterizations and thus more compressible models. We propose to combine sharpness-aware minimization (SAM) with various task-specific model compression methods, including iterative magnitude pruning (IMP), structured pruning with a distillation objective, and post-training dynamic quantization. Empirically, we show that optimizing for flatter minima consistently leads to greater compressibility of parameters compared to vanilla Adam when fine-tuning BERT models, with little to no loss in accuracy on the GLUE text classification and SQuAD question answering benchmarks. Moreover, SAM finds superior winning tickets during IMP that 1) are amenable to vanilla Adam optimization, and 2) transfer more effectively across tasks.  ( 2 min )
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v3 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.  ( 2 min )
    Training and Inference on Any-Order Autoregressive Models the Right Way. (arXiv:2205.13554v2 [cs.LG] UPDATED)
    Conditional inference on arbitrary subsets of variables is a core problem in probabilistic inference with important applications such as masked language modeling and image inpainting. In recent years, the family of Any-Order Autoregressive Models (AO-ARMs) -- closely related to popular models such as BERT and XLNet -- has shown breakthrough performance in arbitrary conditional tasks across a sweeping range of domains. But, in spite of their success, in this paper we identify significant improvements to be made to previous formulations of AO-ARMs. First, we show that AO-ARMs suffer from redundancy in their probabilistic model, i.e., they define the same distribution in multiple different ways. We alleviate this redundancy by training on a smaller set of univariate conditionals that still maintains support for efficient arbitrary conditional inference. Second, we upweight the training loss for univariate conditionals that are evaluated more frequently during inference. Our method leads to improved performance with no compromises on tractability, giving state-of-the-art likelihoods in arbitrary conditional modeling on text (Text8), image (CIFAR10, ImageNet32), and continuous tabular data domains.  ( 2 min )
    Less is More: Summary of Long Instructions is Better for Program Synthesis. (arXiv:2203.08597v2 [cs.CL] UPDATED)
    Despite the success of large pre-trained language models (LMs) such as Codex, they show below-par performance on larger and more complicated programming-related questions. We show that LMs benefit from summarized versions of complicated questions. Our findings show that superfluous information often present in a problem description, such as human characters, background stories, and names (which are included to help humans understand a task), does not help models understand the task. To this end, we create a meta-dataset from the frequently used APPS dataset and the newly created CodeContests dataset for the program synthesis task. Our meta-dataset consists of human and synthesized summaries of long and complicated programming questions. Experimental results on Codex show that our proposed approach outperforms the baseline by 8.13% on the APPS dataset and 11.88% on the CodeContests dataset on average in terms of strict accuracy. Our analysis shows that summaries significantly improve performance for introductory (9.86%) and interview (11.48%) programming questions. However, the improvement is small (~2%) for competitive programming questions, implying scope for future research in this direction.  ( 2 min )
    Laplacian-based Cluster-Contractive t-SNE for High Dimensional Data Visualization. (arXiv:2207.12214v3 [cs.LG] UPDATED)
    Dimensionality reduction techniques aim at representing high-dimensional data in low-dimensional spaces to extract hidden and useful information or facilitate visual understanding and interpretation of the data. However, few of them take into consideration the potential cluster information contained implicitly in the high-dimensional data. In this paper, we propose LaptSNE, a new graph-layout nonlinear dimensionality reduction method based on t-SNE, one of the best techniques for visualizing high-dimensional data as 2D scatter plots. Specifically, LaptSNE leverages the eigenvalue information of the graph Laplacian to shrink the potential clusters in the low-dimensional embedding when learning to preserve the local and global structure from high-dimensional space to low-dimensional space. It is nontrivial to solve the proposed model because the eigenvalues of normalized symmetric Laplacian are functions of the decision variable. We provide a majorization-minimization algorithm with convergence guarantee to solve the optimization problem of LaptSNE and show how to calculate the gradient analytically, which may be of broad interest when considering optimization with Laplacian-composited objective. We evaluate our method by a formal comparison with state-of-the-art methods on seven benchmark datasets, both visually and via established quantitative measurements. The results demonstrate the superiority of our method over baselines such as t-SNE and UMAP. We also provide out-of-sample extension, large-scale extension and mini-batch extension for our LaptSNE to facilitate dimensionality reduction in various scenarios.  ( 2 min )
    CLCNet: Rethinking of Ensemble Modeling with Classification Confidence Network. (arXiv:2205.09612v5 [cs.LG] UPDATED)
    In this paper, we propose a Classification Confidence Network (CLCNet) that can determine whether a classification model classifies input samples correctly. It takes a classification result in the form of a vector of any dimension and returns a confidence score as output, which represents the probability of an instance being classified correctly. We can utilize CLCNet in a simple cascade-structured system consisting of several SOTA (state-of-the-art) classification models, and our experiments show that the system can achieve the following advantages: 1. The system can customize the average computation requirement (FLOPs) per image during inference. 2. Under the same computation requirement, the performance of the system can exceed any model that has an identical structure to a model in the system but differs in size. In fact, this is a new type of ensemble modeling. Like general ensemble modeling, it can achieve higher performance than a single classification model, yet our system requires much less computation than general ensemble modeling. We have uploaded our code to a github repository: https://github.com/yaoching0/CLCNet-Rethinking-of-Ensemble-Modeling.  ( 2 min )
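    The cascade described above reduces to a simple control flow. A schematic sketch with stub models follows (hypothetical names and threshold, not the paper's implementation):
        import numpy as np

        def cascade_predict(x, small_model, large_model, clcnet, threshold=0.9):
            """Run the cheap classifier first and escalate only low-confidence inputs.

            small_model / large_model map an input to a class-probability vector;
            clcnet maps that vector to the probability the prediction is correct.
            Raising `threshold` trades average FLOPs for accuracy."""
            probs = small_model(x)
            if clcnet(probs) >= threshold:
                return int(np.argmax(probs))        # accept the cheap prediction
            return int(np.argmax(large_model(x)))   # fall back to the big model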
    Maximum Class Separation as Inductive Bias in One Matrix. (arXiv:2206.08704v2 [cs.LG] UPDATED)
    Maximizing the separation between classes constitutes a well-known inductive bias in machine learning and a pillar of many traditional algorithms. By default, deep networks are not equipped with this inductive bias and therefore many alternative solutions have been proposed through differential optimization. Current approaches tend to optimize classification and separation jointly: aligning inputs with class vectors and separating class vectors angularly. This paper proposes a simple alternative: encoding maximum separation as an inductive bias in the network by adding one fixed matrix multiplication before computing the softmax activations. The main observation behind our approach is that separation does not require optimization but can be solved in closed-form prior to training and plugged into a network. We outline a recursive approach to obtain the matrix consisting of maximally separable vectors for any number of classes, which can be added with negligible engineering effort and computational overhead. Despite its simple nature, this one matrix multiplication provides real impact. We show that our proposal directly boosts classification, long-tailed recognition, out-of-distribution detection, and open-set recognition, from CIFAR to ImageNet. We find empirically that maximum separation works best as a fixed bias; making the matrix learnable adds nothing to the performance. The closed-form implementation and code to reproduce the experiments are available on github.  ( 2 min )
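    The fixed matrix here is the set of regular-simplex vertices. A minimal NumPy sketch of the standard recursive construction follows (conventions may differ in detail from the paper's released code): for k classes it produces k unit vectors in R^(k-1) whose pairwise dot products all equal -1/(k-1), the maximum possible separation.
        import numpy as np

        def simplex_vertices(k):
            """Rows are k unit vectors in R^(k-1) with pairwise dot product -1/(k-1),
            i.e. the maximally separated class vectors of a regular simplex."""
            if k == 2:
                return np.array([[1.0], [-1.0]])
            lower = simplex_vertices(k - 1)              # (k-1, k-2) vertices one level down
            first = np.eye(1, k - 1)                     # e1 as a row vector, shape (1, k-1)
            rest = np.hstack([np.full((k - 1, 1), -1.0 / (k - 1)),
                              np.sqrt(1.0 - 1.0 / (k - 1) ** 2) * lower])
            return np.vstack([first, rest])              # shape (k, k-1)

        P = simplex_vertices(10)
        print(np.allclose(P @ P.T, (1 + 1 / 9) * np.eye(10) - 1 / 9))  # True
    Training then amounts to producing (k-1)-dimensional features and computing logits as features @ simplex_vertices(k).T, with the matrix kept fixed rather than learned.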
    Scalable and Low-Latency Federated Learning with Cooperative Mobile Edge Networking. (arXiv:2205.13054v2 [cs.DC] UPDATED)
    Federated learning (FL) enables collaborative model training without centralizing data. However, the traditional FL framework is cloud-based and suffers from high communication latency. On the other hand, the edge-based FL framework that relies on an edge server co-located with mobile base station for model aggregation has low communication latency but suffers from degraded model accuracy due to the limited coverage of edge server. In light of high accuracy but high-latency cloud-based FL and low-latency but low-accuracy edge-based FL, this paper proposes a new FL framework based on cooperative mobile edge networking called cooperative federated edge learning (CFEL) to enable both high-accuracy and low-latency distributed intelligence at mobile edge networks. Considering the unique two-tier network architecture of CFEL, a novel federated optimization method dubbed cooperative edge-based federated averaging (CE-FedAvg) is further developed, wherein each edge server both coordinates collaborative model training among the devices within its own coverage and cooperates with other edge servers to learn a shared global model through decentralized consensus. Experimental results based on benchmark datasets show that CFEL can largely reduce the training time to achieve a target model accuracy compared with prior FL frameworks.  ( 2 min )
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v3 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    Ground-Truth Labels Matter: A Deeper Look into Input-Label Demonstrations. (arXiv:2205.12685v2 [cs.CL] UPDATED)
    Despite recent explosion of interests in in-context learning, the underlying mechanism and the precise impact of the quality of demonstrations remain elusive. Intuitively, ground-truth labels should have as much impact in in-context learning (ICL) as supervised learning, but recent work reported that the input-label correspondence is significantly less important than previously thought. Intrigued by this counter-intuitive observation, we re-examine the importance of ground-truth labels in in-context learning. With the introduction of two novel metrics, namely Label-Correctness Sensitivity and Ground-truth Label Effect Ratio (GLER), we were able to conduct quantifiable analysis on the impact of ground-truth label demonstrations. Through extensive analyses, we find that the correct input-label mappings can have varying impacts on the downstream in-context learning performances, depending on the experimental configuration. Through additional studies, we identify key components, such as the verbosity of prompt templates and the language model size, as the controlling factor to achieve more noise-resilient ICL.
    Classical and learned MR to pseudo-CT mappings for accurate transcranial ultrasound simulation. (arXiv:2206.15441v2 [physics.med-ph] UPDATED)
    Model-based treatment planning for transcranial ultrasound therapy typically involves mapping the acoustic properties of the skull from an x-ray computed tomography (CT) image of the head. Here, three methods for generating pseudo-CT images from magnetic resonance (MR) images were compared as an alternative to CT. A convolutional neural network (U-Net) was trained on paired MR-CT images to generate pseudo-CT images from either T1-weighted or zero-echo time (ZTE) MR images (denoted tCT and zCT, respectively). A direct mapping from ZTE to pseudo-CT was also implemented (denoted cCT). When comparing the pseudo-CT and ground truth CT images for the test set, the mean absolute error was 133, 83, and 145 Hounsfield units (HU) across the whole head, and 398, 222, and 336 HU within the skull for the tCT, zCT, and cCT images, respectively. Ultrasound simulations were also performed using the generated pseudo-CT images and compared to simulations based on CT. An annular array transducer was used targeting the visual or motor cortex. The mean differences in the simulated focal pressure, focal position, and focal volume were 9.9%, 1.5 mm, and 15.1% for simulations based on the tCT images, 5.7%, 0.6 mm, and 5.7% for the zCT, and 6.7%, 0.9 mm, and 12.1% for the cCT. The improved results for images mapped from ZTE highlight the advantage of using imaging sequences which improve contrast of the skull bone. Overall, these results demonstrate that acoustic simulations based on MR images can give comparable accuracy to those based on CT.
    Evaluating and Crafting Datasets Effective for Deep Learning With Data Maps. (arXiv:2208.10033v2 [cs.LG] UPDATED)
    Rapid development in deep learning model construction has prompted an increased need for appropriate training data. The popularity of large datasets - sometimes known as "big data" - has diverted attention from assessing their quality. Training on large datasets often requires excessive system resources and an infeasible amount of time. Furthermore, the supervised machine learning process has yet to be fully automated: for supervised learning, large datasets require more time for manually labeling samples. We propose a method of curating smaller datasets that achieve comparable out-of-distribution model accuracy: after an initial training session, samples are selected according to an appropriate distribution over how difficult they are for the model to learn.
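    For reference, in the data-maps framework this builds on (Swayamdipta et al., 2020), each training example is scored by the mean ("confidence") and standard deviation ("variability") of the model's probability on its gold label across training epochs. A minimal sketch, assuming those per-epoch probabilities have been logged:
        import numpy as np

        # gold_probs[e, i]: model probability of example i's true label at epoch e
        gold_probs = np.random.rand(10, 50_000)        # stand-in for logged dynamics

        confidence = gold_probs.mean(axis=0)           # high: easy to learn
        variability = gold_probs.std(axis=0)           # high: ambiguous
        correctness = (gold_probs > 0.5).mean(axis=0)  # fraction of epochs predicted right

        # curate a smaller set, e.g. biased toward ambiguous examples
        keep = np.argsort(variability)[-25_000:]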
    A Survey on Computationally Efficient Neural Architecture Search. (arXiv:2206.01520v3 [cs.LG] UPDATED)
    Neural architecture search (NAS) has become increasingly popular in the deep learning community recently, mainly because it can provide an opportunity to allow interested users without rich expertise to benefit from the success of deep neural networks (DNNs). However, NAS is still laborious and time-consuming because a large number of performance estimations are required during the search process, and training DNNs is computationally intensive. To overcome this major limitation of NAS, improving its computational efficiency is essential. However, a systematic overview of computationally efficient NAS (CE-NAS) methods is still lacking. To fill this gap, we provide a comprehensive survey of the state of the art in CE-NAS by categorizing the existing work into proxy-based and surrogate-assisted NAS methods, together with a thorough discussion of their design principles and a quantitative comparison of their performances and computational complexities. The remaining challenges and open research questions are also discussed, and promising research topics in this emerging field are suggested.
    Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios. (arXiv:2206.01900v2 [cs.AI] UPDATED)
    Evaluation of intervention in a multi-agent system, e.g., when humans should intervene in autonomous driving systems and when a player should pass to teammates for a good shot, is challenging in various engineering and scientific fields. Estimating the individual treatment effect (ITE) using counterfactual long-term prediction is practical to evaluate such interventions. However, most conventional frameworks do not consider the time-varying complex structure of multi-agent relationships and covariate counterfactual prediction. This may sometimes lead to erroneous assessments of ITE and interpretation problems. Here we propose an interpretable, counterfactual recurrent network in multi-agent systems to estimate the effect of the intervention. Our model leverages graph variational recurrent neural networks and theory-based computation with domain knowledge for the ITE estimation framework based on long-term prediction of multi-agent covariates and outcomes, which can confirm the circumstances under which the intervention is effective. On simulated models of an automated vehicle and biological agents with time-varying confounders, we show that our methods achieved lower estimation errors in counterfactual covariates and identified the most effective treatment timing better than the baselines. Furthermore, using real basketball data, our methods performed realistic counterfactual predictions and evaluated the counterfactual passes in shot scenarios.  ( 3 min )
    Unsupervised Learning for Combinatorial Optimization with Principled Objective Relaxation. (arXiv:2207.05984v3 [cs.LG] UPDATED)
    Using machine learning to solve combinatorial optimization (CO) problems is challenging, especially when the data is unlabeled. This work proposes an unsupervised learning framework for CO problems. Our framework follows a standard relaxation-plus-rounding approach and adopts neural networks to parameterize the relaxed solutions so that simple back-propagation can train the model end-to-end. Our key contribution is the observation that if the relaxed objective satisfies entry-wise concavity, a low optimization loss guarantees the quality of the final integral solutions. This observation significantly broadens the applicability of the previous framework inspired by Erdos' probabilistic method. In particular, this observation can guide the design of objective models in applications where the objective is not given explicitly and must be modeled beforehand. We evaluate our framework by solving a synthetic graph optimization problem, and two real-world applications including resource allocation in circuit design and approximate computing. Our framework largely outperforms the baselines based on naïve relaxation, reinforcement learning, and Gumbel-softmax tricks.
    RACE: Retrieval-Augmented Commit Message Generation. (arXiv:2203.02700v3 [cs.SE] UPDATED)
    Commit messages are important for software development and maintenance. Many neural network-based approaches have been proposed and shown promising results on automatic commit message generation. However, the generated commit messages could be repetitive or redundant. In this paper, we propose RACE, a new retrieval-augmented neural commit message generation method, which treats the retrieved similar commit as an exemplar and leverages it to generate an accurate commit message. As the retrieved commit message may not always accurately describe the content/intent of the current code diff, we also propose an exemplar guider, which learns the semantic similarity between the retrieved and current code diff and then guides the generation of commit message based on the similarity. We conduct extensive experiments on a large public dataset with five programming languages. Experimental results show that RACE can outperform all baselines. Furthermore, RACE can boost the performance of existing Seq2Seq models in commit message generation.  ( 2 min )
    Infinite-Fidelity Coregionalization for Physical Simulation. (arXiv:2207.00678v2 [cs.LG] UPDATED)
    Multi-fidelity modeling and learning are important in physical simulation-related applications. They can leverage both low-fidelity and high-fidelity examples for training so as to reduce the cost of data generation while still achieving good performance. While existing approaches only model finite, discrete fidelities, in practice, the fidelity choice is often continuous and infinite, which can correspond to a continuous mesh spacing or finite element length. In this paper, we propose Infinite Fidelity Coregionalization (IFC). Given the data, our method can extract and exploit rich information within continuous, infinite fidelities to bolster the prediction accuracy. Our model can interpolate and/or extrapolate the predictions to novel fidelities, which can be even higher than the fidelities of training data. Specifically, we introduce a low-dimensional latent output as a continuous function of the fidelity and input, and multiply it by a basis matrix to predict high-dimensional solution outputs. We model the latent output as a neural Ordinary Differential Equation (ODE) to capture the complex relationships within and integrate information throughout the continuous fidelities. We then use Gaussian processes or another ODE to estimate the fidelity-varying bases. For efficient inference, we reorganize the bases as a tensor, and use a tensor-Gaussian variational posterior to develop a scalable inference algorithm for massive outputs. We show the advantage of our method in several benchmark tasks in computational physics.
    Sound and Complete Verification of Polynomial Networks. (arXiv:2209.07235v2 [cs.LG] UPDATED)
    Polynomial Networks (PNs) have demonstrated promising performance on face and image recognition recently. However, the robustness of PNs is unclear, and thus obtaining certificates becomes imperative for enabling their adoption in real-world applications. Existing verification algorithms on ReLU neural networks (NNs) based on classical branch and bound (BaB) techniques cannot be trivially applied to PN verification. In this work, we devise a new bounding method, equipped with BaB for global convergence guarantees, called Verification of Polynomial Networks, or VPN for short. One key insight is that we obtain much tighter bounds than the interval bound propagation (IBP) and DeepT-Fast [Bonaert et al., 2021] baselines. This enables sound and complete PN verification with empirical validation on the MNIST, CIFAR10 and STL10 datasets. We believe our method is of independent interest for NN verification. The source code is publicly available at https://github.com/megaelius/PNVerification.
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v2 [cs.LG] UPDATED)
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.  ( 2 min )
    Feature Extractor Stacking for Cross-domain Few-shot Meta-learning. (arXiv:2205.05831v2 [cs.CV] UPDATED)
    Cross-domain few-shot meta-learning (CDFSML) addresses learning problems where knowledge needs to be transferred from several source domains into an instance-scarce target domain with an explicitly different distribution. Recently published CDFSML methods generally construct a "universal model" that combines knowledge of multiple source domains into one backbone feature extractor. This enables efficient inference but necessitates re-computation of the backbone whenever a new source domain is added. Moreover, these methods often derive their universal model from a collection of backbones -- normally one for each source domain -- where these backbones are constrained to have the same architecture as the universal model. We propose feature extractor stacking (FES), a new CDFSML method for combining information from a collection of backbones that imposes no constraints on the backbones' architecture and does not require re-computing a universal model when a backbone for a new source domain becomes available. We present the basic FES algorithm, which is inspired by the classic stacking approach to meta-learning, and also introduce two variants: convolutional FES (ConFES) and regularised FES (ReFES). Given a target-domain task, these algorithms fine-tune each backbone independently, use cross-validation to extract meta training data from the support set available for the task, and learn a simple linear meta-classifier from this data. We evaluate our FES methods on the well-known Meta-Dataset benchmark, targeting image classification with convolutional neural networks, and show that they can achieve state-of-the-art performance.
    Unknown-Aware Domain Adversarial Learning for Open-Set Domain Adaptation. (arXiv:2206.07551v2 [cs.LG] UPDATED)
    Open-Set Domain Adaptation (OSDA) assumes that a target domain contains unknown classes, which are not discovered in a source domain. Existing domain adversarial learning methods are not suitable for OSDA because distribution matching with unknown classes leads to negative transfer. Previous OSDA methods have focused on matching the source and the target distribution by only utilizing known classes. However, this known-only matching may fail to learn the target-unknown feature space. Therefore, we propose Unknown-Aware Domain Adversarial Learning (UADAL), which aligns the source and the target-known distribution while simultaneously segregating the target-unknown distribution in the feature alignment procedure. We provide theoretical analyses on the optimized state of the proposed unknown-aware feature alignment, so we can guarantee both alignment and segregation theoretically. Empirically, we evaluate UADAL on the benchmark datasets, which shows that UADAL outperforms other methods with better feature alignments by reporting state-of-the-art performances.
    Dataset Distillation using Neural Feature Regression. (arXiv:2206.00719v2 [cs.LG] UPDATED)
    Dataset distillation aims to learn a small synthetic dataset that preserves most of the information from the original dataset. Dataset distillation can be formulated as a bi-level meta-learning problem where the outer loop optimizes the meta-dataset and the inner loop trains a model on the distilled data. Meta-gradient computation is one of the key challenges in this formulation, as differentiating through the inner loop learning procedure introduces significant computation and memory costs. In this paper, we address these challenges using neural Feature Regression with Pooling (FRePo), achieving the state-of-the-art performance with an order of magnitude less memory requirement and two orders of magnitude faster training than previous methods. The proposed algorithm is analogous to truncated backpropagation through time with a pool of models to alleviate various types of overfitting in dataset distillation. FRePo significantly outperforms the previous methods on CIFAR100, Tiny ImageNet, and ImageNet-1K. Furthermore, we show that high-quality distilled data can greatly improve various downstream applications, such as continual learning and membership inference defense. Please check out our webpage at https://sites.google.com/view/frepo.
    A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics. (arXiv:2207.09304v2 [math.PR] UPDATED)
    We establish a sharp uniform-in-time error estimate for the Stochastic Gradient Langevin Dynamics (SGLD), which is a popular sampling algorithm. Under mild assumptions, we obtain a uniform-in-time $O(\eta^2)$ bound for the KL-divergence between the SGLD iteration and the Langevin diffusion, where $\eta$ is the step size (or learning rate). Our analysis is also valid for varying step sizes. Based on this, we are able to obtain an $O(\eta)$ bound for the distance between the SGLD iteration and the invariant distribution of the Langevin diffusion, in terms of Wasserstein or total variation distances.
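    For concreteness, a minimal sketch of the SGLD iteration being analyzed, $\theta_{k+1} = \theta_k - \eta\, g_k + \sqrt{2\eta}\,\xi_k$, on an assumed Gaussian-mean target; the model, step size, and batch size are illustrative choices, not taken from the paper.

```python
# Sketch of SGLD on a Gaussian mean model: theta_{k+1} = theta_k - eta * g_k
# + sqrt(2 * eta) * xi_k, with g_k an unbiased minibatch gradient estimate
# of the negative log-posterior (flat prior assumed).
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=1000)
eta, batch_size = 1e-4, 32

def minibatch_grad(theta):
    batch = rng.choice(data, size=batch_size)
    # Negative log-likelihood gradient, rescaled so the estimate is unbiased.
    return len(data) / batch_size * np.sum(theta - batch)

theta, samples = 0.0, []
for _ in range(5000):
    theta += -eta * minibatch_grad(theta) + np.sqrt(2 * eta) * rng.normal()
    samples.append(theta)

# Posterior mean is ~ the sample mean of the data; posterior sd ~ 1/sqrt(1000).
print("mean:", np.mean(samples[1000:]), "sd:", np.std(samples[1000:]))
```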
    Investigation of a Machine learning methodology for the SKA pulsar search pipeline. (arXiv:2209.04430v2 [astro-ph.IM] UPDATED)
    The SKA pulsar search pipeline will be used for real-time detection of pulsars. Modern radio telescopes such as the SKA will generate petabytes of data at full scale of operation, so experience-based and data-driven algorithms become indispensable for applications such as candidate detection. Here we describe our findings from testing a state-of-the-art object detection algorithm called Mask R-CNN to detect candidate signatures in the SKA pulsar search pipeline. We have trained the Mask R-CNN model to detect candidate images. A custom annotation tool was developed to mark the regions of interest in large datasets efficiently. We have successfully demonstrated this algorithm by detecting candidate signatures on a simulation dataset. The paper presents details of this work and highlights future prospects.
    Mix-Pooling Strategy for Attention Mechanism. (arXiv:2208.10322v2 [cs.LG] UPDATED)
    Recently, many effective attention modules have been proposed to boost model performance by exploiting the internal information of convolutional neural networks in computer vision. In general, previous works overlook the design of the pooling strategy in the attention mechanism, taking global average pooling for granted, which hinders further improvement of the attention mechanism's performance. However, we empirically find and verify that a simple linear combination of global max-pooling and global min-pooling can produce pooling strategies that match or exceed the performance of global average pooling. Based on this empirical observation, we propose a simple-yet-effective attention module, SPEM, which adopts a self-adaptive pooling strategy based on global max-pooling and global min-pooling together with a lightweight module for producing the attention map. The effectiveness of SPEM is demonstrated by extensive experiments on widely-used benchmark datasets and popular attention networks.
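    To make the pooling idea concrete, here is a hedged sketch of a channel-attention module built from a learnable mix of global max- and min-pooling; the module name, reduction ratio, and MLP shape are our assumptions, not the paper's exact SPEM definition.

```python
# Channel attention from a learnable mix of global max- and min-pooling
# (illustrative sketch, not the paper's exact module).
import torch
import torch.nn as nn

class MixPoolAttention(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))   # self-adaptive mixing weight
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):                       # x: (B, C, H, W)
        flat = x.flatten(2)                     # (B, C, H*W)
        pooled = self.alpha * flat.max(dim=2).values \
               + (1 - self.alpha) * flat.min(dim=2).values   # (B, C)
        weights = self.mlp(pooled)              # per-channel attention in (0, 1)
        return x * weights[:, :, None, None]

x = torch.randn(2, 64, 32, 32)
print(MixPoolAttention(64)(x).shape)            # torch.Size([2, 64, 32, 32])
```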
    NeuralEF: Deconstructing Kernels by Deep Neural Networks. (arXiv:2205.00165v4 [cs.LG] UPDATED)
    Learning the principal eigenfunctions of an integral operator defined by a kernel and a data distribution is at the core of many machine learning problems. Traditional nonparametric solutions based on the Nystr{\"o}m formula suffer from scalability issues. Recent work has resorted to a parametric approach, i.e., training neural networks to approximate the eigenfunctions. However, the existing method relies on an expensive orthogonalization step and is difficult to implement. We show that these problems can be fixed by using a new series of objective functions that generalizes the EigenGame~\citep{gemp2020eigengame} to function space. We test our method on a variety of supervised and unsupervised learning problems and show it provides accurate approximations to the eigenfunctions of polynomial, radial basis, neural network Gaussian process, and neural tangent kernels. Finally, we demonstrate our method can scale up linearised Laplace approximation of deep neural networks to modern image classification datasets through approximating the Gauss-Newton matrix. Code is available at \url{https://github.com/thudzj/neuraleigenfunction}.
    Batch Bayesian optimisation via density-ratio estimation with guarantees. (arXiv:2209.10715v2 [cs.LG] UPDATED)
    Bayesian optimisation (BO) algorithms have shown remarkable success in applications involving expensive black-box functions. Traditionally BO has been set as a sequential decision-making process which estimates the utility of query points via an acquisition function and a prior over functions, such as a Gaussian process. Recently, however, a reformulation of BO via density-ratio estimation (BORE) allowed reinterpreting the acquisition function as a probabilistic binary classifier, removing the need for an explicit prior over functions and increasing scalability. In this paper, we present a theoretical analysis of BORE's regret and an extension of the algorithm with improved uncertainty estimates. We also show that BORE can be naturally extended to a batch optimisation setting by recasting the problem as approximate Bayesian inference. The resulting algorithms come equipped with theoretical performance guarantees and are assessed against other batch and sequential BO baselines in a series of experiments.
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v3 [cs.LG] UPDATED)
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.  ( 2 min )
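    The iterative trimmed MLE heuristic admits a compact sketch: alternately fit on the currently kept samples, then re-keep the $(1-\epsilon)$ fraction with the smallest negative log-likelihood. The Gaussian linear-regression setup below is an assumed special case chosen for illustration.

```python
# Sketch of iterative trimmed MLE: alternately fit on kept samples and
# re-keep the (1 - eps) fraction with the smallest negative log-likelihood
# (squared residuals, for Gaussian linear regression).
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 500, 5, 0.1
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)
y[: int(eps * n)] += 10.0                        # adversarial label corruptions

keep = np.arange(n)
for _ in range(10):
    w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    losses = (y - X @ w) ** 2                    # per-sample NLL up to constants
    keep = np.argsort(losses)[: int((1 - eps) * n)]

print("estimation error:", np.linalg.norm(w - w_true))
```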
    Fast Instrument Learning with Faster Rates. (arXiv:2205.10772v2 [stat.ML] UPDATED)
    We investigate nonlinear instrumental variable (IV) regression given high-dimensional instruments. We propose a simple algorithm which combines kernelized IV methods and an arbitrary, adaptive regression algorithm, accessed as a black box. Our algorithm enjoys faster-rate convergence and adapts to the dimensionality of informative latent features, while avoiding an expensive minimax optimization procedure, which has been necessary to establish similar guarantees. It further brings the benefit of flexible machine learning models to quasi-Bayesian uncertainty quantification, likelihood-based model selection, and model averaging. Simulation studies demonstrate the competitive performance of our method.  ( 2 min )
    Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent. (arXiv:2205.11361v2 [stat.ML] UPDATED)
    Recent studies have shown that gradient descent (GD) can achieve improved generalization when its dynamics exhibits a chaotic behavior. However, to obtain the desired effect, the step-size should be chosen sufficiently large, a task which is problem dependent and can be difficult in practice. In this study, we incorporate a chaotic component to GD in a controlled manner, and introduce multiscale perturbed GD (MPGD), a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system. We analyze MPGD from three different angles: (i) By building up on recent advances in rough paths theory, we show that, under appropriate assumptions, as the step-size decreases, the MPGD recursion converges weakly to a stochastic differential equation (SDE) driven by a heavy-tailed L\'evy-stable process. (ii) By making connections to recently developed generalization bounds for heavy-tailed processes, we derive a generalization bound for the limiting SDE and relate the worst-case generalization error over the trajectories of the process to the parameters of MPGD. (iii) We analyze the implicit regularization effect brought by the dynamical regularization and show that, in the weak perturbation regime, MPGD introduces terms that penalize the Hessian of the loss function. Empirical results are provided to demonstrate the advantages of MPGD.
    Compound Density Networks for Risk Prediction using Electronic Health Records. (arXiv:2208.01320v3 [cs.LG] UPDATED)
    Electronic Health Records (EHRs) exhibit a high amount of missing data due to variations of patient conditions and treatment needs. Imputation of missing values has been considered an effective approach to deal with this challenge. Existing work separates the imputation method and the prediction model into two independent parts of an EHR-based machine learning system. We propose an integrated end-to-end approach by utilizing a Compound Density Network (CDNet) that allows the imputation method and prediction model to be tuned together within a single framework. CDNet consists of a Gated Recurrent Unit (GRU), a Mixture Density Network (MDN), and a Regularized Attention Network (RAN). The GRU is used as a latent-variable model for EHR data. The MDN is designed to sample latent variables generated by the GRU. The RAN serves as a regularizer for less reliable imputed values. The architecture of CDNet enables the GRU and the MDN to iteratively leverage the output of each other to impute missing values, leading to a more accurate and robust prediction. We validate CDNet on the mortality prediction task on the MIMIC-III dataset. Our model outperforms state-of-the-art models by significant margins. We also empirically show that regularizing imputed values is a key factor for superior prediction performance. Analysis of prediction uncertainty shows that our model can capture both aleatoric and epistemic uncertainties, which offers model users a better understanding of the model results.
    A Priority Map for Vision-and-Language Navigation with Trajectory Plans and Feature-Location Cues. (arXiv:2207.11717v3 [cs.LG] UPDATED)
    In a busy city street, a pedestrian surrounded by distractions can pick out a single sign if it is relevant to their route. Artificial agents in outdoor Vision-and-Language Navigation (VLN) must similarly detect supervisory signals tied to environment features and locations in their inputs. To boost the prominence of relevant features in transformer-based architectures without costly preprocessing and pretraining, we take inspiration from priority maps - a mechanism described in neuropsychological studies. We implement a novel priority map module and pretrain on auxiliary tasks using low-sample datasets with high-level representations of routes and environment-related references to urban features. A hierarchical process of trajectory planning - with subsequent parameterised visual boost filtering on visual inputs and prediction of corresponding textual spans - addresses the core challenges of cross-modal alignment and feature-level localisation. The priority map module is integrated into a feature-location framework that doubles the task completion rates of standalone transformers and attains state-of-the-art performance on the Touchdown benchmark for VLN. Code and data are referenced in Appendix C.
    Pragmatically Learning from Pedagogical Demonstrations in Multi-Goal Environments. (arXiv:2206.04546v2 [cs.LG] UPDATED)
    Learning from demonstration methods usually leverage close to optimal demonstrations to accelerate training. By contrast, when demonstrating a task, human teachers deviate from optimal demonstrations and pedagogically modify their behavior by giving demonstrations that best disambiguate the goal they want to demonstrate. Analogously, human learners excel at pragmatically inferring the intent of the teacher, facilitating communication between the two agents. These mechanisms are critical in the few demonstrations regime, where inferring the goal is more difficult. In this paper, we implement pedagogy and pragmatism mechanisms by leveraging a Bayesian model of Goal Inference from demonstrations (BGI). We highlight the benefits of this model in multi-goal teacher-learner setups with two artificial agents that learn with goal-conditioned Reinforcement Learning. We show that combining BGI-agents (a pedagogical teacher and a pragmatic learner) results in faster learning and reduced goal ambiguity over standard learning from demonstrations, especially in the few demonstrations regime. We provide the code for our experiments (https://github.com/Caselles/NeurIPS22-demonstrations-pedagogy-pragmatism), as well as an illustrative video explaining our approach (https://youtu.be/V4n16IjkNyw).  ( 2 min )
    Masked Autoencoder for Self-Supervised Pre-training on Lidar Point Clouds. (arXiv:2207.00531v2 [cs.CV] UPDATED)
    Masked autoencoding has become a successful pretraining paradigm for Transformer models for text, images, and, recently, point clouds. Raw automotive datasets are suitable candidates for self-supervised pre-training as they generally are cheap to collect compared to annotations for tasks like 3D object detection (OD). However, the development of masked autoencoders for point clouds has focused solely on synthetic and indoor data. Consequently, existing methods have tailored their representations and models toward small and dense point clouds with homogeneous point densities. In this work, we study masked autoencoding for point clouds in an automotive setting, which are sparse and for which the point density can vary drastically among objects in the same scene. To this end, we propose Voxel-MAE, a simple masked autoencoding pre-training scheme designed for voxel representations. We pre-train the backbone of a Transformer-based 3D object detector to reconstruct masked voxels and to distinguish between empty and non-empty voxels. Our method improves the 3D OD performance by 1.75 mAP points and 1.05 NDS on the challenging nuScenes dataset. Further, we show that by pre-training with Voxel-MAE, we require only 40% of the annotated data to outperform a randomly initialized equivalent. Code available at https://github.com/georghess/voxel-mae
    Continuous QA Learning with Structured Prompts. (arXiv:2208.14602v2 [cs.CL] UPDATED)
    QA models with lifelong learning (LL) abilities are important for practical QA applications, and architecture-based LL methods are reported to be an effective implementation for these models. However, it is non-trivial to extend previous approaches to QA tasks since they either require access to task identities in the testing phase or do not explicitly model samples from unseen tasks. In this paper, we propose Diana: a dynamic architecture-based lifelong QA model that tries to learn a sequence of QA tasks with a prompt-enhanced language model. Four types of hierarchically organized prompts are used in Diana to capture QA knowledge at different granularities. Specifically, we dedicate task-level prompts to capture task-specific knowledge to retain high LL performance and maintain instance-level prompts to learn knowledge shared across different input samples to improve the model's generalization performance. Moreover, we dedicate separate prompts to explicitly model unseen tasks and introduce a set of prompt key vectors to facilitate knowledge sharing between tasks. Extensive experiments demonstrate that Diana outperforms state-of-the-art lifelong QA models, especially in handling unseen tasks.
    LayoutEnhancer: Generating Good Indoor Layouts from Imperfect Data. (arXiv:2202.00185v2 [cs.GR] CROSS LISTED)
    We address the problem of indoor layout synthesis, which is a topic of continuing research interest in computer graphics. Recent works have made significant progress using data-driven generative methods; however, these approaches rely on suitable datasets. In practice, desirable layout properties may not exist in a dataset, for instance, specific expert knowledge can be missing in the data. We propose a method that combines expert knowledge, for example, knowledge about ergonomics, with a data-driven generator based on the popular Transformer architecture. The knowledge is given as differentiable scalar functions, which can be used either as weights or as additional terms in the loss function. Using this knowledge, the synthesized layouts can be biased to exhibit desirable properties, even if these properties are not present in the dataset. Our approach can also alleviate problems of lack of data and imperfections in the data. Our work aims to improve generative machine learning for modeling and provide novel tools for designers and amateurs for the problem of interior layout creation.
    Grounding Aleatoric Uncertainty for Unsupervised Environment Design. (arXiv:2207.05219v2 [cs.LG] UPDATED)
    Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
    Quantum Semi-Supervised Learning with Quantum Supremacy. (arXiv:2110.02343v4 [quant-ph] UPDATED)
    Quantum machine learning promises to efficiently solve important problems. There are two persistent challenges in classical machine learning: the lack of labeled data, and the limit of computational power. We propose a novel framework that resolves both issues: quantum semi-supervised learning. Moreover, we provide a protocol for systematically designing quantum machine learning algorithms with quantum supremacy, which can be extended beyond quantum semi-supervised learning. In the meantime, we show that a naive quantum matrix product estimation algorithm outperforms the best-known classical matrix multiplication algorithm. We showcase two concrete quantum semi-supervised learning algorithms: a quantum self-training algorithm named the propagating nearest-neighbor classifier, and the quantum semi-supervised K-means clustering algorithm. Through time-complexity analysis, we conclude that they indeed possess quantum supremacy.
    Few-Shot Adaptation of Pre-Trained Networks for Domain Shift. (arXiv:2205.15234v3 [cs.CV] UPDATED)
    Deep networks are prone to performance degradation when there is a domain shift between the source (training) data and target (test) data. Recent test-time adaptation methods update batch normalization layers of pre-trained source models deployed in new target environments with streaming data to mitigate such performance degradation. Although such methods can adapt on-the-fly without first collecting a large target domain dataset, their performance is dependent on streaming conditions such as mini-batch size and class-distribution, which can be unpredictable in practice. In this work, we propose a framework for few-shot domain adaptation to address the practical challenges of data-efficient adaptation. Specifically, we propose a constrained optimization of feature normalization statistics in pre-trained source models supervised by a small support set from the target domain. Our method is easy to implement and improves source model performance with as few as one sample per class for classification tasks. Extensive experiments on 5 cross-domain classification and 4 semantic segmentation datasets show that our method achieves more accurate and reliable performance than test-time adaptation, while not being constrained by streaming conditions.
    Deep Learning-Based Rate-Splitting Multiple Access for Reconfigurable Intelligent Surface-Aided Tera-Hertz Massive MIMO. (arXiv:2209.08456v2 [eess.SP] UPDATED)
    Reconfigurable intelligent surface (RIS) can significantly enhance the service coverage of Tera-Hertz massive multiple-input multiple-output (MIMO) communication systems. However, obtaining accurate high-dimensional channel state information (CSI) with limited pilot and feedback signaling overhead is challenging, severely degrading the performance of conventional spatial division multiple access. To improve the robustness against CSI imperfection, this paper proposes a deep learning (DL)-based rate-splitting multiple access (RSMA) scheme for RIS-aided Tera-Hertz multi-user MIMO systems. Specifically, we first propose a hybrid data-model driven DL-based RSMA precoding scheme, including the passive precoding at the RIS as well as the analog active precoding and the RSMA digital active precoding at the base station (BS). To realize the passive precoding at the RIS, we propose a Transformer-based data-driven RIS reflecting network (RRN). As for the analog active precoding at the BS, we propose a matched-filter-based analog precoding scheme considering that the BS and RIS adopt the LoS-MIMO antenna array architecture. As for the RSMA digital active precoding at the BS, we propose a low-complexity approximate weighted minimum mean square error (AWMMSE) digital precoding scheme. Furthermore, for better precoding performance as well as lower computational complexity, a model-driven deep unfolding active precoding network (DFAPN) is also designed by combining the proposed AWMMSE scheme with DL. Then, to acquire accurate CSI at the BS for the investigated RSMA precoding scheme to achieve higher spectral efficiency, we propose a CSI acquisition network (CAN) with low pilot and feedback signaling overhead, where the downlink pilot transmission, CSI feedback at the user equipments (UEs), and CSI reconstruction at the BS are modeled as an end-to-end neural network based on Transformer.
    Align then Fusion: Generalized Large-scale Multi-view Clustering with Anchor Matching Correspondences. (arXiv:2205.15075v2 [cs.LG] UPDATED)
    Multi-view anchor graph clustering selects representative anchors to avoid full pair-wise similarities and therefore reduce the complexity of graph methods. Although widely applied in large-scale applications, existing approaches do not pay sufficient attention to establishing correct correspondences between the anchor sets across views. To be specific, anchor graphs obtained from different views are not aligned column-wise. Such an \textbf{A}nchor-\textbf{U}naligned \textbf{P}roblem (AUP) would cause inaccurate graph fusion and degrade the clustering performance. Under multi-view scenarios, generating correct correspondences could be extremely difficult since anchors are not consistent in feature dimensions. To solve this challenging issue, we propose the first study of the generalized and flexible anchor graph fusion framework termed \textbf{F}ast \textbf{M}ulti-\textbf{V}iew \textbf{A}nchor-\textbf{C}orrespondence \textbf{C}lustering (FMVACC). Specifically, we show how to find anchor correspondence with both feature and structure information, after which anchor graph fusion is performed column-wise. Moreover, we theoretically show the connection between FMVACC and existing multi-view late fusion \cite{liu2018late} and partial view-aligned clustering \cite{huang2020partially}, which further demonstrates our generality. Extensive experiments on seven benchmark datasets demonstrate the effectiveness and efficiency of our proposed method. Moreover, the proposed alignment module also shows significant performance improvement when applied to existing multi-view anchor graph competitors, indicating the importance of anchor alignment. Our code is available at \url{https://github.com/wangsiwei2010/NeurIPS22-FMVACC}.
    Bayesian Inference with Latent Hamiltonian Neural Networks. (arXiv:2208.06120v2 [cs.LG] UPDATED)
    When sampling for Bayesian inference, one popular approach is to use Hamiltonian Monte Carlo (HMC) and specifically the No-U-Turn Sampler (NUTS) which automatically decides the end time of the Hamiltonian trajectory. However, HMC and NUTS can require numerous numerical gradients of the target density, and can prove slow in practice. We propose Hamiltonian neural networks (HNNs) with HMC and NUTS for solving Bayesian inference problems. Once trained, HNNs do not require numerical gradients of the target density during sampling. Moreover, they satisfy important properties such as perfect time reversibility and Hamiltonian conservation, making them well-suited for use within HMC and NUTS because stationarity can be shown. We also propose an HNN extension called latent HNNs (L-HNNs), which are capable of predicting latent variable outputs. Compared to HNNs, L-HNNs offer improved expressivity and reduced integration errors. Finally, we employ L-HNNs in NUTS with an online error monitoring scheme to prevent sample degeneracy in regions of low probability density. We demonstrate L-HNNs in NUTS with online error monitoring on several examples involving complex, heavy-tailed, and high-local-curvature probability densities. Overall, L-HNNs in NUTS with online error monitoring satisfactorily inferred these probability densities. Compared to traditional NUTS, L-HNNs in NUTS with online error monitoring required 1--2 orders of magnitude fewer numerical gradients of the target density and improved the effective sample size (ESS) per gradient by an order of magnitude.
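    To show where a trained (L-)HNN plugs in, here is a sketch of HMC whose leapfrog integrator consumes only $\partial H/\partial q$ and $\partial H/\partial p$; a known Gaussian Hamiltonian stands in for the learned network, so this is an assumption-laden illustration rather than the paper's sampler.

```python
# HMC whose leapfrog steps use only Hamiltonian derivatives; in (L-)HNN-based
# samplers these come from autodiff through the learned Hamiltonian, avoiding
# numerical target-density gradients at sampling time.
import numpy as np

dH_dq = lambda q: q               # stand-in for the HNN's learned dH/dq
dH_dp = lambda p: p               # kinetic term: p for unit mass
H = lambda q, p: 0.5 * q**2 + 0.5 * p**2

def leapfrog(q, p, eps=0.2, steps=10):
    p = p - 0.5 * eps * dH_dq(q)
    for _ in range(steps - 1):
        q = q + eps * dH_dp(p)
        p = p - eps * dH_dq(q)
    q = q + eps * dH_dp(p)
    p = p - 0.5 * eps * dH_dq(q)
    return q, p

rng = np.random.default_rng(0)
q, samples = 0.0, []
for _ in range(2000):
    p = rng.normal()                               # resample momentum
    q_new, p_new = leapfrog(q, p)
    if rng.random() < np.exp(H(q, p) - H(q_new, p_new)):  # Metropolis correction
        q = q_new
    samples.append(q)
print("mean, var:", np.mean(samples), np.var(samples))    # ~0, ~1 for N(0, 1)
```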
    Hyperparameter Optimization of Generative Adversarial Network Models for High-Energy Physics Simulations. (arXiv:2208.07715v2 [hep-ex] UPDATED)
    The Generative Adversarial Network (GAN) is a powerful and flexible tool that can generate high-fidelity synthesized data by learning. It has seen many applications in simulating events in High Energy Physics (HEP), including simulating detector responses and physics events. However, training GANs is notoriously hard and optimizing their hyperparameters even more so. It normally requires many trial-and-error training attempts to achieve stable training and reasonable fidelity. Significant tuning work has to be done to achieve the accuracy required by physics analyses. This work uses the physics-agnostic and high-performance-computer-friendly hyperparameter optimization tool HYPPO to optimize and examine the sensitivities of the hyperparameters of a GAN for two independent HEP datasets. This work provides the first insights into efficiently tuning GANs for Large Hadron Collider data. We show that given proper hyperparameter tuning, we can find GANs that provide high-quality approximations of the desired quantities. We also provide guidelines for how to go about GAN architecture tuning using the analysis tools in HYPPO.
    A Transformer-based Generative Model for De Novo Molecular Design. (arXiv:2210.08749v2 [cs.LG] UPDATED)
    In the scope of drug discovery, molecular design aims to identify novel compounds from the chemical space, where the number of potential drug-like molecules is estimated to be on the order of 10^60 - 10^100. Since this search task is computationally intractable due to the unbounded search space, deep learning has drawn considerable attention as a new way of generating unseen molecules. As we seek compounds for specific target proteins, we propose a Transformer-based deep model for de novo target-specific molecular design. The proposed method is capable of generating both drug-like compounds (without specified targets) and target-specific compounds. The latter are generated by enforcing different keys and values of the multi-head attention for each target. In this way, we allow the generation of SMILES strings to be conditional on the specified target. Experimental results demonstrate that our method is capable of generating both valid drug-like compounds and target-specific compounds. Moreover, the compounds sampled from the conditional model largely occupy the real target-specific molecules' chemical space and also cover a significant fraction of novel compounds.
    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v3 [stat.ML] UPDATED)
    Two-sided markets such as ride-sharing companies often involve a group of subjects who are making sequential decisions across time and/or location. With the rapid development of smart phones and the internet of things, these platforms have substantially transformed the transportation landscape. In this paper we consider large-scale fleet management in ride-sharing companies that involves multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.  ( 2 min )
    Instance-Specific Augmentation: Capturing Local Invariances. (arXiv:2206.00051v2 [cs.LG] UPDATED)
    We introduce InstaAug, a method for automatically learning input-specific augmentations from data. Previous data augmentation methods have generally assumed independence between the original input and the transformation applied to that input. This can be highly restrictive, as the invariances that the augmentations are based on are themselves often highly input dependent; e.g., we can change a leaf from green to yellow while maintaining its label, but not a lime. InstaAug instead allows for input dependency by introducing an invariance module that maps inputs to tailored transformation distributions. It can be simultaneously trained alongside the downstream model in a fully end-to-end manner, or separately learned for a pre-trained model. We empirically demonstrate that InstaAug learns meaningful input-dependent augmentations for a wide range of transformation classes, which in turn provides better performance on both supervised and self-supervised tasks.
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v3 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    A theory of learning with constrained weight-distribution. (arXiv:2206.08933v2 [q-bio.NC] UPDATED)
    A central question in computational neuroscience is how structure determines function in neural networks. The emerging high-quality large-scale connectomic datasets raise the question of what general functional principles can be gleaned from structural information such as the distribution of excitatory/inhibitory synapse types and the distribution of synaptic weights. Motivated by this question, we developed a statistical mechanical theory of learning in neural networks that incorporates structural information as constraints. We derived an analytical solution for the memory capacity of the perceptron, a basic feedforward model of supervised learning, with constraint on the distribution of its weights. Our theory predicts that the reduction in capacity due to the constrained weight-distribution is related to the Wasserstein distance between the imposed distribution and that of the standard normal distribution. To test the theoretical predictions, we use optimal transport theory and information geometry to develop an SGD-based algorithm to find weights that simultaneously learn the input-output task and satisfy the distribution constraint. We show that training in our algorithm can be interpreted as geodesic flows in the Wasserstein space of probability distributions. We further developed a statistical mechanical theory for teacher-student perceptron rule learning and ask how the student can best incorporate prior knowledge of the rule. Our theory shows that it is beneficial for the learner to adopt different prior weight distributions during learning, and shows that distribution-constrained learning outperforms unconstrained and sign-constrained learning. Our theory and algorithm provide novel strategies for incorporating prior knowledge about weights into learning, and reveal a powerful connection between structure and function in neural networks.
    Accelerated Probabilistic Marching Cubes by Deep Learning for Time-Varying Scalar Ensembles. (arXiv:2207.07260v3 [cs.LG] UPDATED)
    Visualizing the uncertainty of ensemble simulations is challenging due to the large size and multivariate and temporal features of ensemble data sets. One popular approach to studying the uncertainty of ensembles is analyzing the positional uncertainty of the level sets. Probabilistic marching cubes is a technique that performs Monte Carlo sampling of multivariate Gaussian noise distributions for positional uncertainty visualization of level sets. However, the technique suffers from high computational time, making interactive visualization and analysis impossible to achieve. This paper introduces a deep-learning-based approach to learning the level-set uncertainty for two-dimensional ensemble data with a multivariate Gaussian noise assumption. We train the model using the first few time steps from time-varying ensemble data in our workflow. We demonstrate that our trained model accurately infers uncertainty in level sets for new time steps and is up to 170X faster than the original probabilistic model with serial computation and 10X faster than the original parallel computation.  ( 2 min )
    Sequence Model Imitation Learning with Unobserved Contexts. (arXiv:2208.02225v2 [cs.LG] UPDATED)
    We consider imitation learning problems where the learner's ability to mimic the expert increases throughout the course of an episode as more information is revealed. One example of this is when the expert has access to privileged information: while the learner might not be able to accurately reproduce expert behavior early on in an episode, by considering the entire history of states and actions, they might be able to eventually identify the hidden context and act as the expert would. We prove that on-policy imitation learning algorithms (with or without access to a queryable expert) are better equipped to handle these sorts of asymptotically realizable problems than off-policy methods. This is because on-policy algorithms provably learn to recover from their initially suboptimal actions, while off-policy methods treat their suboptimal past actions as though they came from the expert. This often manifests as a latching behavior: a naive repetition of past actions. We conduct experiments in a toy bandit domain that show that there exist sharp phase transitions of whether off-policy approaches are able to match expert performance asymptotically, in contrast to the uniformly good performance of on-policy approaches. We demonstrate that on several continuous control tasks, on-policy approaches are able to use history to identify the context while off-policy approaches actually perform worse when given access to history.  ( 2 min )
    Faster Convergence of Local SGD for Over-Parameterized Models. (arXiv:2201.12719v2 [cs.LG] UPDATED)
    Modern machine learning architectures are often highly expressive. They are usually over-parameterized and can interpolate the data by driving the empirical loss close to zero. We analyze the convergence of Local SGD (or FedAvg) for such over-parameterized models in the heterogeneous data setting and improve upon the existing literature by establishing the following convergence rates. We show an error bound of $O(\exp(-T))$ for strongly-convex loss functions, where $T$ is the total number of iterations. For general convex loss functions, we establish an error bound of $O(1/T)$ under a mild data similarity assumption and an error bound of $O(K/T)$ otherwise, where $K$ is the number of local steps. We also extend our results for non-convex loss functions by proving an error bound of $O(K/T)$. Before our work, the best-known convergence rate for strongly-convex loss functions was $O(\exp(-T/K))$, and none existed for general convex or non-convex loss functions under the overparameterized setting. We complete our results by providing problem instances in which such convergence rates are tight to a constant factor under a reasonably small stepsize scheme. Finally, we validate our theoretical results using numerical experiments on real and synthetic data.  ( 2 min )
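    A sketch of the Local SGD (FedAvg) scheme analyzed above, with assumed quadratic local losses so data heterogeneity is explicit: each of $M$ workers takes $K$ local steps from the shared iterate before the server averages.

```python
# Local SGD / FedAvg: K local steps per worker, then server averaging.
# Quadratic local losses are an assumption to keep the example self-contained.
import numpy as np

rng = np.random.default_rng(2)
M, K, T, eta, d = 4, 10, 50, 0.05, 3
targets = rng.normal(size=(M, d))             # heterogeneous local optima

theta = np.zeros(d)
for _ in range(T):                            # communication rounds
    local_iterates = []
    for m in range(M):
        w = theta.copy()
        for _ in range(K):                    # K local SGD steps on worker m
            grad = w - targets[m] + 0.01 * rng.normal(size=d)
            w -= eta * grad
        local_iterates.append(w)
    theta = np.mean(local_iterates, axis=0)   # server averaging

print("distance to average optimum:", np.linalg.norm(theta - targets.mean(axis=0)))
```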
    Pure Transformers are Powerful Graph Learners. (arXiv:2207.02505v2 [cs.LG] UPDATED)
    We show that standard Transformers without graph-specific modifications can lead to promising results in graph learning both in theory and practice. Given a graph, we simply treat all nodes and edges as independent tokens, augment them with token embeddings, and feed them to a Transformer. With an appropriate choice of token embeddings, we prove that this approach is theoretically at least as expressive as an invariant graph network (2-IGN) composed of equivariant linear layers, which is already more expressive than all message-passing Graph Neural Networks (GNN). When trained on a large-scale graph dataset (PCQM4Mv2), our method coined Tokenized Graph Transformer (TokenGT) achieves significantly better results compared to GNN baselines and competitive results compared to Transformer variants with sophisticated graph-specific inductive bias. Our implementation is available at https://github.com/jw9730/tokengt.
    Score-Based Diffusion meets Annealed Importance Sampling. (arXiv:2208.07698v3 [stat.ML] UPDATED)
    More than twenty years after its introduction, Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation. It relies on a sequence of distributions interpolating between a tractable initial distribution and the target distribution of interest which we simulate from approximately using a non-homogeneous Markov chain. To obtain an importance sampling estimate of the marginal likelihood, AIS introduces an extended target distribution to reweight the Markov chain proposal. While much effort has been devoted to improving the proposal distribution used by AIS, an underappreciated issue is that AIS uses a convenient but suboptimal extended target distribution. We here leverage recent progress in score-based generative modeling (SGM) to approximate the optimal extended target distribution minimizing the variance of the marginal likelihood estimate for AIS proposals corresponding to the discretization of Langevin and Hamiltonian dynamics. We demonstrate these novel, differentiable, AIS procedures on a number of synthetic benchmark distributions and variational auto-encoders.
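    A compact sketch of vanilla AIS may clarify the baseline being improved: log-weights accumulate ratios of consecutive tempered densities while Langevin steps move the particles. The 1-D target, schedule, and step size are illustrative assumptions, and the unadjusted kernel adds a small bias that a full implementation would correct with a Metropolis step.

```python
# Vanilla AIS: anneal from N(0, 1) to an unnormalized 1-D target, accumulating
# log importance weights; particles move via unadjusted Langevin steps.
import numpy as np

rng = np.random.default_rng(3)
betas = np.linspace(0.0, 1.0, 50)
n, eps = 2000, 0.1

def log_init(x):
    return -0.5 * x ** 2                       # N(0, 1), unnormalized

def log_target(x):
    return -0.5 * ((x - 3.0) / 0.5) ** 2       # N(3, 0.25), unnormalized

def log_gamma(x, b):
    return (1 - b) * log_init(x) + b * log_target(x)

x = rng.normal(size=n)
logw = np.zeros(n)
for b_prev, b in zip(betas[:-1], betas[1:]):
    logw += log_gamma(x, b) - log_gamma(x, b_prev)     # weight update
    grad = (1 - b) * (-x) + b * (-(x - 3.0) / 0.25)    # grad of log_gamma at b
    x = x + 0.5 * eps * grad + np.sqrt(eps) * rng.normal(size=n)

# Log of the normalizing-constant ratio; the exact value here is log(0.5) ~ -0.69.
log_z = np.log(np.mean(np.exp(logw - logw.max()))) + logw.max()
print("log Z estimate:", log_z)
```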
    Reliable Decision from Multiple Subtasks through Threshold Optimization: Content Moderation in the Wild. (arXiv:2208.07522v3 [cs.LG] UPDATED)
    Social media platforms struggle to protect users from harmful content through content moderation. These platforms have recently leveraged machine learning models to cope with the vast amount of user-generated content daily. Since moderation policies vary depending on countries and types of products, it is common to train and deploy the models per policy. However, this approach is highly inefficient, especially when the policies change, requiring dataset re-labeling and model re-training on the shifted data distribution. To alleviate this cost inefficiency, social media platforms often employ third-party content moderation services that provide prediction scores of multiple subtasks, such as predicting the existence of underage personnel, rude gestures, or weapons, instead of directly providing final moderation decisions. However, making a reliable automated moderation decision from the prediction scores of the multiple subtasks for a specific target policy has not been widely explored yet. In this study, we formulate real-world scenarios of content moderation and introduce a simple yet effective threshold optimization method that searches the optimal thresholds of the multiple subtasks to make a reliable moderation decision in a cost-effective way. Extensive experiments demonstrate that our approach shows better performance in content moderation compared to existing threshold optimization methods and heuristics.
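    The threshold-search idea can be sketched in a few lines: given subtask scores and policy labels on a validation set, grid-search per-subtask thresholds and flag content when any subtask exceeds its threshold. The synthetic data, the any-subtask decision rule, and the F1 objective below are our assumptions, not the paper's exact method.

```python
# Grid search over per-subtask thresholds; flag when any score clears its
# threshold, scoring candidate threshold vectors by F1 on labeled data.
import itertools
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(4)
n = 2000
labels = rng.integers(0, 2, size=n)           # 1 = violates the target policy
# Three subtask scores, loosely correlated with the policy label.
scores = np.clip(labels[:, None] * 0.4 + rng.random((n, 3)) * 0.6, 0, 1)

grid = np.linspace(0.1, 0.9, 9)
best_thresholds, best_f1 = None, -1.0
for thresholds in itertools.product(grid, repeat=3):
    decision = (scores >= np.array(thresholds)).any(axis=1)
    f1 = f1_score(labels, decision)
    if f1 > best_f1:
        best_thresholds, best_f1 = thresholds, f1

print("best thresholds:", best_thresholds, "F1:", round(best_f1, 3))
```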
    Instant Neural Representation for Interactive Volume Rendering. (arXiv:2207.11620v2 [cs.GR] UPDATED)
    Neural networks have shown great potential in compressing volume data for visualization. However, due to the high cost of training and inference, such volumetric neural representations have thus far only been applied to offline data processing and non-interactive rendering. In this paper, we demonstrate that by simultaneously leveraging modern GPU tensor cores, a native CUDA neural network framework, and a well-designed rendering algorithm with macro-cell acceleration, we can interactively ray trace volumetric neural representations (10-60fps). Our neural representations are also high-fidelity (PSNR > 30dB) and compact (10-1000x smaller). Additionally, we show that it is possible to fit the entire training step inside a rendering loop and skip the pre-training process completely. To support extreme-scale volume data, we also develop an efficient out-of-core training strategy, which allows our volumetric neural representation training to potentially scale up to terascale using only an NVIDIA RTX 3090 workstation.
    Accelerating SGD for Highly Ill-Conditioned Huge-Scale Online Matrix Completion. (arXiv:2208.11246v2 [cs.LG] UPDATED)
    The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with $d$ so large that even the simplest full-dimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slow-down when the underlying ground truth is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$-close to the ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our experiments, we observe a similar acceleration for item-item collaborative filtering on the MovieLens25M dataset via a pair-wise ranking loss, with 100 million training pairs and 10 million testing pairs. [See supporting code at https://github.com/Hong-Ming/ScaledSGD.]
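    A hedged sketch of the preconditioning idea for the symmetric case $M \approx XX^\top$: the stochastic gradient on a sampled entry is right-multiplied by $(X^\top X)^{-1}$, which removes the dependence on $\kappa$. The problem sizes, step size, and warm start below are illustrative, and this paraphrases the idea rather than reproducing the released ScaledSGD code.

```python
# Preconditioned per-entry SGD for symmetric matrix completion M ~ X X^T:
# the stochastic gradient is right-multiplied by (X^T X)^{-1}.
import numpy as np

rng = np.random.default_rng(5)
d, r, eta = 50, 4, 0.5
U = rng.normal(size=(d, r)) @ np.diag([10.0, 5.0, 1.0, 0.1])  # ill-conditioned factor
M = U @ U.T
X = U + 0.5 * rng.normal(size=(d, r))     # warm start to keep the sketch short

for _ in range(50_000):
    i, j = rng.integers(d, size=2)        # observe a random entry (i, j)
    resid = X[i] @ X[j] - M[i, j]
    P = np.linalg.inv(X.T @ X + 1e-8 * np.eye(r))   # preconditioner (X^T X)^{-1}
    X[i], X[j] = X[i] - eta * resid * (X[j] @ P), X[j] - eta * resid * (X[i] @ P)

print("relative error:", np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))
```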
    GraTO: Graph Neural Network Framework Tackling Over-smoothing with Neural Architecture Search. (arXiv:2208.09027v2 [cs.LG] UPDATED)
    Current Graph Neural Networks (GNNs) suffer from the over-smoothing problem, which results in indistinguishable node representations and low model performance with more GNN layers. Many methods have been put forward to tackle this problem in recent years. However, existing methods for tackling over-smoothing emphasize model performance and neglect the over-smoothness of node representations. Additionally, different approaches are applied one at a time, and there is no overall framework to jointly leverage multiple solutions to the over-smoothing challenge. To solve these problems, we propose GraTO, a framework based on neural architecture search to automatically search for GNN architectures. GraTO adopts a novel loss function to facilitate striking a balance between model performance and representation smoothness. In addition to existing methods, our search space also includes DropAttribute, a novel scheme for alleviating the over-smoothing challenge, to fully leverage diverse solutions. We conduct extensive experiments on six real-world datasets to evaluate GraTO, which demonstrate that GraTO outperforms baselines in the over-smoothing metrics and achieves competitive performance in accuracy. GraTO is especially effective and robust with increasing numbers of GNN layers. Further experiments bear out the quality of node representations learned with GraTO and the effectiveness of the model architecture. We make the code of GraTO available on GitHub (\url{https://github.com/fxsxjtu/GraTO}).
    Single-phase deep learning in cortico-cortical networks. (arXiv:2206.11769v2 [q-bio.NC] UPDATED)
    The error-backpropagation (backprop) algorithm remains the most common solution to the credit assignment problem in artificial neural networks. In neuroscience, it is unclear whether the brain could adopt a similar strategy to correctly modify its synapses. Recent models have attempted to bridge this gap while being consistent with a range of experimental observations. However, these models are either unable to effectively backpropagate error signals across multiple layers or require a multi-phase learning process, neither of which is reminiscent of learning in the brain. Here, we introduce a new model, Bursting Cortico-Cortical Networks (BurstCCN), which solves these issues by integrating known properties of cortical networks, namely bursting activity, short-term plasticity (STP), and dendrite-targeting interneurons. BurstCCN relies on burst multiplexing via connection-type-specific STP to propagate backprop-like error signals within deep cortical networks. These error signals are encoded at distal dendrites and induce burst-dependent plasticity as a result of excitatory-inhibitory top-down inputs. First, we demonstrate that our model can effectively backpropagate errors through multiple layers using a single-phase learning process. Next, we show both empirically and analytically that learning in our model approximates backprop-derived gradients. Finally, we demonstrate that our model is capable of learning complex image classification tasks (MNIST and CIFAR-10). Overall, our results suggest that cortical features across sub-cellular, cellular, microcircuit and systems levels jointly underlie single-phase efficient deep learning in the brain.
    Conjecturing-Based Computational Discovery of Patterns in Data. (arXiv:2011.11576v3 [cs.LG] UPDATED)
    We propose the use of a conjecturing machine that generates feature relationships in the form of bounds involving nonlinear terms for numerical features and boolean expressions for categorical features. The proposed \textsc{Conjecturing} framework recovers known nonlinear and boolean relationships among features from data. In both settings, true underlying relationships are revealed. We then compare the method to a previously-proposed framework for symbolic regression and demonstrate that it can also be used to recover equations that are satisfied among features in a dataset. The framework is then applied to patient-level data regarding COVID-19 outcomes to suggest possible risk factors that are confirmed in medical literature.  ( 2 min )
    Federated Calibration and Evaluation of Binary Classifiers. (arXiv:2210.12526v1 [cs.CR])
    We address two major obstacles to practical use of supervised classifiers on distributed private data. Whether a classifier was trained by a federation of cooperating clients or trained centrally out of distribution, (1) the output scores must be calibrated, and (2) performance metrics must be evaluated -- all without assembling labels in one place. In particular, we show how to perform calibration and compute precision, recall, accuracy and ROC-AUC in the federated setting under three privacy models (i) secure aggregation, (ii) distributed differential privacy, (iii) local differential privacy. Our theorems and experiments clarify tradeoffs between privacy, accuracy, and data efficiency. They also help decide whether a given application has sufficient data to support federated calibration and evaluation.  ( 2 min )
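    One way to realize federated evaluation under secure aggregation is to have each client share only label-split histograms of its scores, from which the server computes ROC-AUC on the summed counts. The sketch below illustrates that recipe with an assumed bin count and synthetic clients; it is our reading of the setting, not the paper's exact protocol.

```python
# Federated ROC-AUC from aggregated score histograms: clients send only
# per-label histograms; the server computes AUC from the summed counts.
import numpy as np

rng = np.random.default_rng(6)
bins = np.linspace(0, 1, 65)                  # 64 score bins (an assumption)

def client_histograms(scores, labels):
    pos, _ = np.histogram(scores[labels == 1], bins=bins)
    neg, _ = np.histogram(scores[labels == 0], bins=bins)
    return pos, neg                           # the only values sent upstream

# Simulate three clients; under secure aggregation the server sees only sums.
agg_pos = np.zeros(64, dtype=int)
agg_neg = np.zeros(64, dtype=int)
for _ in range(3):
    labels = rng.integers(0, 2, size=500)
    scores = np.clip(0.3 * labels + rng.random(500) * 0.7, 0, 1)
    p_h, n_h = client_histograms(scores, labels)
    agg_pos += p_h
    agg_neg += n_h

# AUC = P(score_pos > score_neg) + 0.5 * P(tie), estimated from binned counts.
neg_below = np.cumsum(agg_neg) - agg_neg      # negatives in strictly lower bins
auc = (agg_pos * (neg_below + 0.5 * agg_neg)).sum() / (agg_pos.sum() * agg_neg.sum())
print("federated AUC estimate:", round(auc, 3))
```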
    Practical and Parallelizable Algorithms for Non-Monotone Submodular Maximization with Size Constraint. (arXiv:2009.01947v4 [cs.DS] UPDATED)
    We present combinatorial and parallelizable algorithms for maximization of a submodular function, not necessarily monotone, with respect to a size constraint. We improve the best approximation factor achieved by an algorithm that has optimal adaptivity and nearly optimal query complexity to $0.193 - \varepsilon$. The conference version of this work mistakenly employed a subroutine that does not work for non-monotone, submodular functions. In this version, we propose a fixed and improved subroutine, ThresholdSeq, which adds a set with high average marginal gain and returns a solution in $O( \log(n) )$ adaptive rounds with high probability. Moreover, we provide two approximation algorithms. The first has approximation ratio $1/6 - \varepsilon$, adaptivity $O( \log (n) )$, and query complexity $O( n \log (k) )$, while the second has approximation ratio $0.193 - \varepsilon$, adaptivity $O( \log^2 (n) )$, and query complexity $O(n \log (k))$. Our algorithms are empirically validated to use a low number of adaptive rounds and total queries while obtaining solutions with high objective value in comparison with state-of-the-art approximation algorithms, including continuous algorithms that use the multilinear extension.  ( 2 min )
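    A toy, sequential analogue of the thresholding rule may help: scan a random permutation and add any element whose marginal gain clears a threshold $\tau$, stopping at the size constraint $k$ (the actual subroutine tests prefixes in parallel to finish in $O(\log n)$ adaptive rounds). The coverage objective and threshold below are assumptions for illustration only.

```python
# Sequential caricature of threshold-based selection for size-constrained
# submodular maximization, on a toy set-coverage objective.
import numpy as np

rng = np.random.default_rng(8)
n, k, tau = 50, 5, 0.5
cover = [set(rng.choice(100, size=10, replace=False)) for _ in range(n)]

def f(S):
    """Coverage objective: number of distinct items covered by S."""
    return len(set().union(*[cover[j] for j in S])) if S else 0

S = []
for i in rng.permutation(n):
    if len(S) == k:
        break
    if f(S + [int(i)]) - f(S) >= tau * 10:   # marginal-gain threshold
        S.append(int(i))

print("selected:", S, "f(S) =", f(S))
```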
    Probing Transfer in Deep Reinforcement Learning without Task Engineering. (arXiv:2210.12448v1 [cs.LG])
    We evaluate the use of original game curricula supported by the Atari 2600 console as a heterogeneous transfer benchmark for deep reinforcement learning agents. Game designers created curricula using combinations of several discrete modifications to the basic versions of games such as Space Invaders, Breakout and Freeway, making them progressively more challenging for human players. By formally organising these modifications into several factors of variation, we are able to show that Analyses of Variance (ANOVA) are a potent tool for studying the effects of human-relevant domain changes on the learning and transfer performance of a deep reinforcement learning agent. Since no manual task engineering is needed on our part, leveraging the original multi-factorial design avoids the pitfalls of unintentionally biasing the experimental setup. We find that game design factors have a large and statistically significant impact on an agent's ability to learn, and so do their combinatorial interactions. Furthermore, we show that zero-shot transfer from the basic games to their respective variations is possible, but the variance in performance is also largely explained by interactions between factors. As such, we argue that Atari game curricula offer a challenging benchmark for transfer learning in RL, that can help the community better understand the generalisation capabilities of RL agents along dimensions which meaningfully impact human generalisation performance. As a start, we report that value-function finetuning of regularly trained agents achieves positive transfer in a majority of cases, but significant headroom for algorithmic innovation remains. We conclude with the observation that selective transfer from multiple variants could further improve performance.  ( 3 min )
    Kernel Methods for Causal Functions: Dose, Heterogeneous, and Incremental Response Curves. (arXiv:2010.04855v7 [econ.EM] UPDATED)
    We propose estimators based on kernel ridge regression for nonparametric causal functions such as dose, heterogeneous, and incremental response curves. Treatment and covariates may be discrete or continuous in general spaces. Due to a decomposition property specific to the RKHS, our estimators have simple closed form solutions. We prove uniform consistency with finite sample rates via original analysis of generalized kernel ridge regression. We extend our main results to counterfactual distributions and to causal functions identified by front and back door criteria. We achieve state-of-the-art performance in nonlinear simulations with many covariates, and conduct a policy evaluation of the US Job Corps training program for disadvantaged youths.  ( 2 min )
    Dual-Path Style Learning for End-to-End Noise-Robust Speech Recognition. (arXiv:2203.14838v2 [eess.AS] UPDATED)
    Automatic speech recognition (ASR) systems degrade significantly in the face of noisy conditions. Recently, speech enhancement (SE) has been introduced as a front-end module to reduce noise and improve speech quality for ASR, but it can also suppress important speech information, i.e., the over-suppression problem. To alleviate this, we propose a dual-path style learning approach for end-to-end noise-robust automatic speech recognition (DPSL-ASR). Specifically, we first introduce the clean speech feature along with the fused feature from the previously proposed IFF-Net as dual-path inputs to recover the over-suppressed information. Then, we propose a style learning method to map the fused feature close to the clean feature, in order to learn latent speech information from the latter, i.e., the clean "speech style". Furthermore, we employ a consistency loss to minimize the distance of ASR outputs in the two paths to improve noise-robustness. Experimental results show that the proposed approach achieves relative word error rate (WER) reductions of 10.6% and 8.6% over the best IFF-Net baseline, on RATS Channel-A and CHiME-4 1-Channel Track datasets, respectively. Visualizations of intermediate embeddings indicate that DPSL-ASR can recover abundant over-suppressed information in enhanced speech. Our code is available at GitHub: https://github.com/YUCHEN005/DPSL-ASR.  ( 2 min )
    Counterfactual Generation Under Confounding. (arXiv:2210.12368v1 [cs.LG])
    A machine learning model, under the influence of observed or unobserved confounders in the training data, can learn spurious correlations and fail to generalize when deployed. For image classifiers, augmenting a training dataset using counterfactual examples has been empirically shown to break spurious correlations. However, the counterfactual generation task itself becomes more difficult as the level of confounding increases. Existing methods for counterfactual generation under confounding consider a fixed set of interventions (e.g., texture, rotation) and are not flexible enough to capture diverse data-generating processes. Given a causal generative process, we formally characterize the adverse effects of confounding on any downstream tasks and show that the correlation between generative factors (attributes) can be used to quantitatively measure confounding between generative factors. To minimize such correlation, we propose a counterfactual generation method that learns to modify the value of any attribute in an image and generate new images given a set of observed attributes, even when the dataset is highly confounded. These counterfactual images are then used to regularize the downstream classifier such that the learned representations are the same across various generative factors conditioned on the class label. Our method is computationally efficient, simple to implement, and works well for any number of generative factors and confounding variables. Our experimental results on both synthetic (MNIST variants) and real-world (CelebA) datasets show the usefulness of our approach.  ( 2 min )
    Bayesian Optimization with Conformal Coverage Guarantees. (arXiv:2210.12496v1 [cs.LG])
    Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.  ( 2 min )
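    As background for the coverage guarantee mentioned above, here is a minimal split-conformal sketch; this is the textbook construction, not the paper's code, and conformal Bayesian optimization would additionally reweight the calibration set for covariate shift and couple the resulting sets to the acquisition rule.

```python
import numpy as np

def split_conformal(model, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal intervals: coverage >= 1 - alpha for exchangeable data,
    no matter how misspecified `model` (a callable X -> predictions) is."""
    resid = np.abs(y_cal - model(X_cal))
    n = len(resid)
    # finite-sample corrected quantile of the calibration residuals
    q = np.quantile(resid, min(1.0, np.ceil((n + 1) * (1 - alpha)) / n))
    mu = model(X_new)
    return mu - q, mu + q
```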
    Progressive Neural Networks. (arXiv:1606.04671v4 [cs.LG] UPDATED)
    Learning to solve complex sequences of tasks--while both leveraging transfer and avoiding catastrophic forgetting--remains a key obstacle to achieving human-level intelligence. The progressive networks approach represents a step forward in this direction: they are immune to forgetting and can leverage prior knowledge via lateral connections to previously learned features. We evaluate this architecture extensively on a wide variety of reinforcement learning tasks (Atari and 3D maze games), and show that it outperforms common baselines based on pretraining and finetuning. Using a novel sensitivity measure, we demonstrate that transfer occurs at both low-level sensory and high-level control layers of the learned policy.  ( 2 min )
    Sample Noise Impact on Active Learning. (arXiv:2109.01372v2 [stat.ML] UPDATED)
    This work explores the effect of noisy sample selection in active learning strategies. We show on both synthetic problems and real-life use cases that knowledge of the sample noise can significantly improve the performance of active learning strategies. Building on prior work, we propose a robust sampler, Incremental Weighted K-Means, which brings significant improvement on the synthetic tasks but only a marginal uplift on real-life ones. We hope that the questions raised in this paper are of interest to the community and could open new paths for active learning research.  ( 2 min )
    Considerations for Visualizing Uncertainty in Clinical Machine Learning Models. (arXiv:2210.12220v1 [cs.HC])
    Clinician-facing predictive models are increasingly present in the healthcare setting. Regardless of their success with respect to performance metrics, all models have uncertainty. We investigate how to visually communicate uncertainty in this setting in an actionable, trustworthy way. To this end, we conduct a qualitative study with cardiac critical care clinicians. Our results reveal that clinician trust may be impacted most not by the degree of uncertainty, but rather by how transparently the visualization conveys what the sources of that uncertainty are. Our results show a clear connection between feature interpretability and clinical actionability.  ( 2 min )
    Cut-and-Approximate: 3D Shape Reconstruction from Planar Cross-sections with Deep Reinforcement Learning. (arXiv:2210.12509v1 [cs.CV])
    Current methods for 3D object reconstruction from a set of planar cross-sections still struggle to capture detailed topology or require a considerable number of cross-sections. In this paper, we present, to the best of our knowledge, the first 3D shape reconstruction network to solve this task, which additionally uses orthographic projections of the shape. Our method is based on applying a Reinforcement Learning algorithm to learn how to effectively parse the shape using a trial-and-error scheme relying on scalar rewards. This method cuts a part of a 3D shape in each step, which is then approximated as a polygon mesh. The agent aims to maximize the reward that depends on the accuracy of surface reconstruction for the approximated parts. We also consider pre-training the network for faster learning, using demonstrations generated by a heuristic approach. Experiments show that our training algorithm, which benefits from both imitation learning and self-exploration, learns efficient policies faster and enables the agent to produce visually compelling results.  ( 2 min )
    Diversity-boosted Generalization-Specialization Balancing for Zero-shot Learning. (arXiv:2201.01961v2 [cs.CV] UPDATED)
    Zero-Shot Learning (ZSL) aims to transfer classification capability from seen to unseen classes. Recent methods have proved that generalization and specialization are two essential abilities to achieve good performance in ZSL. However, focusing on only one of the abilities may result in models that are either too general with degraded classification ability or too specialized to generalize to unseen classes. In this paper, we propose an end-to-end network, termed BGSNet, which equips and balances generalization and specialization abilities at the instance and dataset levels. Specifically, BGSNet consists of two branches: the Generalization Network (GNet), which applies episodic meta-learning to learn generalized knowledge, and the Balanced Specialization Network (BSNet), which adopts multiple attentive extractors to extract discriminative features and achieve instance-level balance. A novel self-adjusted diversity loss is designed to optimize BSNet with redundancy reduced and diversity boosted. We further propose a differentiable dataset-level balance and update the weights in a linear annealing schedule to simulate network pruning and thus obtain the optimal structure for BSNet with dataset-level balance achieved. Experiments on four benchmark datasets demonstrate our model's effectiveness. Sufficient component ablations prove the necessity of integrating and balancing generalization and specialization abilities.
    Magnitude-aware Probabilistic Speaker Embeddings. (arXiv:2202.13826v3 [eess.AS] UPDATED)
    Recently, hyperspherical embeddings have established themselves as a dominant technique for face and voice recognition. Specifically, Euclidean-space vector embeddings are learned to encode person-specific information in their direction while ignoring the magnitude. However, recent studies have shown that the magnitudes of the embeddings extracted by deep neural networks may indicate the quality of the corresponding inputs. This paper explores the properties of the embedding magnitudes related to quality assessment and out-of-distribution detection. We propose a new probabilistic speaker embedding extractor using the information encoded in the embedding magnitude and leverage it in the speaker verification pipeline. We also propose several quality-aware diarization methods and incorporate the magnitudes into them. Our results indicate significant improvements over magnitude-agnostic baselines in both speaker verification and diarization tasks.
    Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming. (arXiv:2203.01382v3 [cs.LG] UPDATED)
    Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristics as an interactive procedure, built around the existing workflow where users draw ideas from a selected set of development data for designing the heuristic sources. With this formalism, we study two core problems: how to strategically select the development data to guide users in efficiently creating informative heuristics, and how to exploit the information within the development process to contextualize and better learn from the resultant heuristics. Building upon two novel methodologies that effectively tackle the respective problems considered, we present Nemo, an end-to-end interactive system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v4 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left( \min\left( H^{3/2} / N,\ H / \sqrt{N} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    Robust Anytime Learning of Markov Decision Processes. (arXiv:2205.15827v3 [cs.AI] UPDATED)
    Markov decision processes (MDPs) are formal models commonly used in sequential decision-making. MDPs capture the stochasticity that may arise, for instance, from imprecise actuators via probabilities in the transition function. However, in data-driven applications, deriving precise probabilities from (limited) data introduces statistical errors that may lead to unexpected or undesirable outcomes. Uncertain MDPs (uMDPs) do not require precise probabilities but instead use so-called uncertainty sets in the transitions, accounting for such limited data. Tools from the formal verification community efficiently compute robust policies that provably adhere to formal specifications, like safety constraints, under the worst-case instance in the uncertainty set. We continuously learn the transition probabilities of an MDP in a robust anytime-learning approach that combines a dedicated Bayesian inference scheme with the computation of robust policies. In particular, our method (1) approximates probabilities as intervals, (2) adapts to new data that may be inconsistent with an intermediate model, and (3) may be stopped at any time to compute a robust policy on the uMDP that faithfully captures the data so far. Furthermore, our method is capable of adapting to changes in the environment. We show the effectiveness of our approach and compare it to robust policies computed on uMDPs learned by the UCRL2 reinforcement learning algorithm in an experimental evaluation on several benchmarks.
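    To make the uncertainty-set idea concrete, here is a small numpy sketch of two standard ingredients: probability intervals from transition counts, and the worst-case expectation over an interval set used in robust Bellman backups. The Hoeffding radius and the greedy inner minimization are textbook choices we assume for illustration; they are not the paper's dedicated Bayesian inference scheme.

```python
import numpy as np

def intervals_from_counts(counts, delta=0.05):
    """Hoeffding-style confidence intervals around empirical transition frequencies."""
    n = counts.sum()
    p_hat = counts / n
    eps = np.sqrt(np.log(2.0 / delta) / (2.0 * n))
    return np.clip(p_hat - eps, 0, 1), np.clip(p_hat + eps, 0, 1)

def worst_case_expectation(lo, hi, v):
    """Minimize p @ v over lo <= p <= hi, sum(p) = 1 (assumes the set is nonempty):
    start every successor at its lower bound, then pour the remaining probability
    mass onto the lowest-value successors first."""
    p = lo.copy()
    slack = 1.0 - p.sum()
    for s in np.argsort(v):
        add = min(hi[s] - p[s], slack)
        p[s] += add
        slack -= add
    return p @ v
```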
    Factuality Enhanced Language Models for Open-Ended Text Generation. (arXiv:2206.04624v2 [cs.CL] UPDATED)
    Pretrained language models (LMs) are susceptible to generating text with nonfactual information. In this work, we measure and improve the factual accuracy of large-scale LMs for open-ended text generation. We design the FactualityPrompts test set and metrics to measure the factuality of LM generations. Based on that, we study the factual accuracy of LMs with parameter sizes ranging from 126M to 530B. Interestingly, we find that larger LMs are more factual than smaller ones, although a previous study suggests that larger LMs can be less truthful in terms of misconceptions. In addition, popular sampling algorithms (e.g., top-p) in open-ended text generation can harm factuality due to the ''uniform randomness'' introduced at every sampling step. We propose the factual-nucleus sampling algorithm that dynamically adapts the randomness to improve the factuality of generation while maintaining quality. Furthermore, we analyze the inefficiencies of the standard training method in learning correct associations between entities from a factual text corpus (e.g., Wikipedia). We propose a factuality-enhanced training method that uses TopicPrefix for better awareness of facts and sentence completion as the training objective, which can vastly reduce the factual errors.
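    A sketch of the dynamic-randomness idea as we read it: decay the nucleus mass within a sentence and floor it, with the step counter reset at every sentence boundary. Parameter names and default values below are illustrative assumptions, not the paper's reported settings.

```python
import numpy as np

def factual_nucleus_p(step_in_sentence, p=0.9, lam=0.9, omega=0.3):
    """Nucleus mass for the current step: starts at p, decays by lam per token
    within a sentence, and never drops below the floor omega. The caller resets
    step_in_sentence to 0 whenever a new sentence begins."""
    return max(omega, p * lam ** step_in_sentence)

def top_p_sample(probs, p, rng):
    order = np.argsort(probs)[::-1]
    csum = np.cumsum(probs[order])
    keep = order[: np.searchsorted(csum, p) + 1]   # smallest nucleus covering mass p
    sub = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=sub)
```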
    Meta-Learning with Adjoint Methods. (arXiv:2110.08432v2 [cs.LG] UPDATED)
    Model Agnostic Meta Learning (MAML) is widely used to find a good initialization for a family of tasks. Despite its success, a critical challenge in MAML is to calculate the gradient w.r.t. the initialization of a long training trajectory for the sampled tasks, because the computation graph can rapidly explode and the computational cost is very expensive. To address this problem, we propose Adjoint MAML (A-MAML). We view gradient descent in the inner optimization as the evolution of an Ordinary Differential Equation (ODE). To efficiently compute the gradient of the validation loss w.r.t. the initialization, we use the adjoint method to construct a companion, backward ODE. To obtain the gradient w.r.t. the initialization, we only need to run the standard ODE solver twice -- once forward in time, evolving a long trajectory of gradient flow for the sampled task, and once backward, solving the adjoint ODE. We need not create or expand any intermediate computational graphs, adopt aggressive approximations, or impose proximal regularizers in the training loss. Our approach is cheap, accurate, and adaptable to different trajectory lengths. We demonstrate the advantage of our approach in both synthetic and real-world meta-learning tasks.
    TranSHER: Translating Knowledge Graph Embedding with Hyper-Ellipsoidal Restriction. (arXiv:2204.13221v2 [cs.AI] UPDATED)
    Knowledge graph embedding methods are important for the knowledge graph completion (or link prediction) task. One existing efficient method, PairRE, leverages two separate vectors to model complex relations (i.e., 1-to-N, N-to-1, and N-to-N) in knowledge graphs. However, such a method strictly restricts entities on the hyper-ellipsoid surfaces, which limits the optimization of entity distribution and leads to suboptimal performance of knowledge graph completion. To address this issue, we propose a novel score function, TranSHER, which leverages relation-specific translations between head and tail entities to relax the constraint of hyper-ellipsoid restrictions. By introducing an intuitive and simple relation-specific translation, TranSHER can provide more direct guidance on optimization and capture more semantic characteristics of entities with complex relations. Experimental results show that TranSHER achieves significant performance improvements on link prediction and generalizes well to datasets in different domains and scales. Our code is publicly available at https://github.com/yizhilll/TranSHER.
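    For intuition only, a toy score in this spirit: a PairRE-style elementwise product score plus a relation-specific translation vector. The exact functional form and the choice of norm are our assumptions; consult the paper or repository for the real score function.

```python
import numpy as np

def transher_like_score(h, t, r_head, r_tail, r_shift):
    """Higher is better: project head and tail with relation vectors, then allow a
    relation-specific translation r_shift to relax the hyper-ellipsoid constraint."""
    return -np.linalg.norm(h * r_head + r_shift - t * r_tail, ord=1)
```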
    PI-NLF: A Proportional-Integral Approach for Non-negative Latent Factor Analysis. (arXiv:2205.02591v2 [cs.LG] UPDATED)
    A high-dimensional and incomplete (HDI) matrix frequently appears in various big-data-related applications, which demonstrates the inherently non-negative interactions among numerous nodes. A non-negative latent factor (NLF) model performs efficient representation learning on an HDI matrix, whose learning process mostly relies on a single latent factor-dependent, non-negative and multiplicative update (SLF-NMU) algorithm. However, an SLF-NMU algorithm updates a latent factor based on the current update increment only, without appropriate consideration of past learning information, resulting in slow convergence. Inspired by the prominent success of a proportional-integral (PI) controller in various applications, this paper proposes a Proportional-Integral-incorporated Non-negative Latent Factor (PI-NLF) model with two-fold ideas: a) establishing an Increment Refinement (IR) mechanism by considering the past update increments, following the principle of a PI controller; and b) designing an IR-based SLF-NMU (ISN) algorithm to accelerate the convergence rate of the resultant model. Empirical studies on four HDI datasets demonstrate that a PI-NLF model outperforms state-of-the-art models in both computational efficiency and estimation accuracy for missing data of an HDI matrix. Hence, this study unveils the feasibility of boosting the performance of a non-negative learning algorithm through an error feedback controller.
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v3 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    Adaptive Divergence-based Non-negative Latent Factor Analysis. (arXiv:2203.16214v2 [cs.LG] UPDATED)
    High-Dimensional and Incomplete (HDI) data are frequently found in various industrial applications with complex interactions among numerous nodes, and are commonly non-negative, representing the inherent non-negativity of node interactions. A Non-negative Latent Factor (NLF) model is able to extract intrinsic features from such data efficiently. However, existing NLF models all adopt a static divergence metric, like Euclidean distance or $\alpha$-$\beta$ divergence, to build their learning objective, which greatly restricts their scalability in accurately representing HDI data from different domains. Aiming at addressing this issue, this study presents an Adaptive Divergence-based Non-negative Latent Factor (ADNLF) model with three-fold ideas: a) generalizing the objective function with the $\alpha$-$\beta$ divergence to expand its potential of representing various HDI data; b) adopting a non-negative bridging function to connect the optimization variables with output latent factors for fulfilling the non-negativity constraints constantly; and c) making the divergence parameters adaptive through particle swarm optimization, thereby facilitating adaptive divergence in the learning objective to achieve high scalability. Empirical studies are conducted on four HDI datasets from real applications, whose results demonstrate that in comparison with state-of-the-art NLF models, an ADNLF model achieves significantly higher estimation accuracy for missing data of an HDI dataset with high computational efficiency.
    Backdoor Attacks in Federated Learning by Rare Embeddings and Gradient Ensembling. (arXiv:2204.14017v2 [cs.LG] UPDATED)
    Recent advances in federated learning have demonstrated its promising capability to learn on decentralized datasets. However, a considerable amount of work has raised concerns due to the potential risks of adversaries participating in the framework to poison the global model for an adversarial purpose. This paper investigates the feasibility of model poisoning for backdoor attacks through rare word embeddings of NLP models. In text classification, less than 1% of adversary clients suffices to manipulate the model output without any drop in the performance on clean sentences. For a less complex dataset, a mere 0.1% of adversary clients is enough to poison the global model effectively. We also propose a technique specialized in the federated learning scheme called Gradient Ensemble, which enhances the backdoor performance in all our experimental settings.
    Not to Overfit or Underfit the Source Domains? An Empirical Study of Domain Generalization in Question Answering. (arXiv:2205.07257v2 [cs.CL] UPDATED)
    Machine learning models are prone to overfitting their training (source) domains, which is commonly believed to be the reason why they falter in novel target domains. Here we examine the contrasting view that multi-source domain generalization (DG) is first and foremost a problem of mitigating source domain underfitting: models not adequately learning the signal already present in their multi-domain training data. Experiments on a reading comprehension DG benchmark show that as a model learns its source domains better -- using familiar methods such as knowledge distillation (KD) from a bigger model -- its zero-shot out-of-domain utility improves at an even faster pace. Improved source domain learning also demonstrates superior out-of-domain generalization over three popular existing DG approaches that aim to limit overfitting. Our implementation of KD-based domain generalization is available via PrimeQA at: https://ibm.biz/domain-generalization-with-kd.
    Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. (arXiv:2205.15466v5 [cs.LG] UPDATED)
    This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive a lower bound on the sample complexity of Banzhaf value estimation, and we show that our MSR algorithm's sample complexity is close to this lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
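    The Maximum Sample Reuse principle is easy to sketch: every sampled subset contributes to the estimate of every data point, rather than to a single marginal contribution. A toy Monte Carlo version, where `utility` is any user-supplied function from a set of training indices to a performance score:

```python
import numpy as np

def banzhaf_msr(utility, n, m=2000, rng=None):
    """MSR estimate of Banzhaf values: phi_i = E[U(S) | i in S] - E[U(S) | i not in S],
    with subsets drawn by including each of the n points independently w.p. 1/2."""
    rng = rng or np.random.default_rng(0)
    masks = rng.random((m, n)) < 0.5
    u = np.array([utility(np.flatnonzero(s)) for s in masks])
    phi = np.empty(n)
    for i in range(n):
        inc = masks[:, i]
        phi[i] = u[inc].mean() - u[~inc].mean()   # assumes m is large enough that both sides are hit
    return phi
```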
    Policy Optimization with Advantage Regularization for Long-Term Fairness in Decision Systems. (arXiv:2210.12546v1 [cs.LG])
    Long-term fairness is an important factor of consideration in designing and deploying learning-based decision systems in high-stake decision-making contexts. Recent work has proposed the use of Markov Decision Processes (MDPs) to formulate decision-making with long-term fairness requirements in dynamically changing environments, and demonstrated major challenges in directly deploying heuristic and rule-based policies that worked well in static environments. We show that policy optimization methods from deep reinforcement learning can be used to find strictly better decision policies that can often achieve both higher overall utility and less violation of the fairness requirements, compared to previously-known strategies. In particular, we propose new methods for imposing fairness requirements in policy optimization by regularizing the advantage evaluation of different actions. Our proposed methods make it easy to impose fairness constraints without reward engineering or sacrificing training efficiency. We perform detailed analyses in three established case studies, including attention allocation in incident monitoring, bank loan approval, and vaccine distribution in population networks.
    Exploring Optimal Deep Learning Models for Image-based Malware Variant Classification. (arXiv:2004.05258v2 [cs.CR] UPDATED)
    Analyzing a huge amount of malware is a major burden for security analysts. Since emerging malware is often a variant of existing malware, automatically classifying malware into known families greatly reduces a part of their burden. Image-based malware classification with deep learning is an attractive approach for its simplicity, versatility, and affinity with the latest technologies. However, the impact of differences in deep learning models and the degree of transfer learning on the classification accuracy of malware variants has not been fully studied. In this paper, we conducted an exhaustive survey of deep learning models using 24 ImageNet pre-trained models and five fine-tuning parameters, totaling 120 combinations, on two platforms. As a result, we found that the highest classification accuracy was obtained by fine-tuning one of the latest deep learning models with a relatively low degree of transfer learning, and we achieved the highest classification accuracy ever in cross-validation on the Malimg and Drebin datasets. We also confirmed that this trend holds true for the recent malware variants using the VirusTotal 2020 Windows and Android datasets. The experimental results suggest that it is effective to periodically explore optimal deep learning models with the latest models and malware datasets by gradually reducing the degree of transfer learning from half.
    Recurrent Parameter Generators. (arXiv:2107.07110v2 [cs.CV] UPDATED)
    Deep learning has achieved tremendous success by training increasingly large models, which are then compressed for practical deployment. We propose a drastically different approach to compact and optimal deep learning: We decouple the Degrees of freedom (DoF) and the actual number of parameters of a model, optimize a small DoF with predefined random linear constraints for a large model of arbitrary architecture, in one-stage end-to-end learning. Specifically, we create a recurrent parameter generator (RPG), which repeatedly fetches parameters from a ring and unpacks them onto a large model with random permutation and sign flipping to promote parameter decorrelation. We show that gradient descent can automatically find the best model under constraints with faster convergence. Our extensive experimentation reveals a log-linear relationship between model DoF and accuracy. Our RPG demonstrates remarkable DoF reduction and can be further pruned and quantized for additional run-time performance gain. For example, in terms of top-1 accuracy on ImageNet, RPG achieves $96\%$ of ResNet18's performance with only $18\%$ DoF (the equivalent of one convolutional layer) and $52\%$ of ResNet34's performance with only $0.25\%$ DoF! Our work shows a significant potential of constrained neural optimization in compact and optimal deep learning.
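    A numpy sketch of the unpacking step as we understand it: one shared ring of parameters, with each layer reading the ring through its own fixed random permutation and sign pattern. In a real implementation the mapping is fixed once and gradients flow back onto the shared ring entries during training; that autograd wiring is not shown here.

```python
import numpy as np

class RingGenerator:
    """Generate per-layer weights from a single shared parameter ring; permutation
    and sign flipping decorrelate the repeated reuse of the same entries."""
    def __init__(self, ring_size, rng):
        self.ring = rng.normal(scale=0.02, size=ring_size)
        self.rng = rng
    def layer(self, shape):
        n = int(np.prod(shape))
        idx = self.rng.permutation(np.arange(n) % len(self.ring))  # fixed per layer
        sign = self.rng.choice([-1.0, 1.0], size=n)                # fixed per layer
        return (sign * self.ring[idx]).reshape(shape)
```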
    Toward Understanding Convolutional Neural Networks from Volterra Convolution Perspective. (arXiv:2110.09902v3 [cs.LG] UPDATED)
    We attempt to understand convolutional neural networks by exploring the relationship between (deep) convolutional neural networks and Volterra convolutions. We propose a novel approach to explain and study the overall characteristics of neural networks without being hindered by their horribly complex architectures. Specifically, we attempt to convert the basic structures of a convolutional neural network (CNN) and their combinations to the form of Volterra convolutions. The results show that most convolutional neural networks can be approximated in the form of a Volterra convolution, where the approximated proxy kernels preserve the characteristics of the original network. Analyzing these proxy kernels may give valuable insight into the original network. Based on this setup, we present methods for approximating the order-zero and order-one proxy kernels, and verify the correctness and effectiveness of our results.
    Listen to Interpret: Post-hoc Interpretability for Audio Networks with NMF. (arXiv:2202.11479v2 [cs.SD] UPDATED)
    This paper tackles post-hoc interpretability for audio processing networks. Our goal is to interpret decisions of a network in terms of high-level audio objects that are also listenable for the end-user. To this end, we propose a novel interpreter design that incorporates non-negative matrix factorization (NMF). In particular, a carefully regularized interpreter module is trained to take hidden layer representations of the targeted network as input and produce time activations of pre-learnt NMF components as intermediate outputs. Our methodology allows us to generate intuitive audio-based interpretations that explicitly enhance parts of the input signal most relevant for a network's decision. We demonstrate our method's applicability on popular benchmarks, including a real-world multi-label classification task.
    Locating and Editing Factual Associations in GPT. (arXiv:2202.05262v4 [cs.CL] UPDATED)
    We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/
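    A schematic rank-one edit in this spirit (not the released ROME code): given a key vector k*, a desired value v*, and the second-moment matrix C of keys seen by the layer, update W so the new association holds while directions uncorrelated with k* under C are minimally disturbed.

```python
import numpy as np

def rank_one_edit(W, C, k_star, v_star):
    """Return W' with W' @ k_star == v_star (up to numerics) via a rank-one update."""
    u = np.linalg.solve(C, k_star)        # direction that isolates k* under C
    return W + np.outer(v_star - W @ k_star, u) / (u @ k_star)
```

The check is one line: W' k* = W k* + (v* - W k*) (u . k*) / (u . k*) = v*.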
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v2 [stat.ML] UPDATED)
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on an informative public dataset, then learning to model the private data with that representation. In particular, we minimize the maximum mean discrepancy (MMD) between private target data and a generator's distribution, using a kernel based on perceptual features learned from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images with $\epsilon \approx 2$ which capture distinctive features in the distribution, far surpassing the current state of the art, which mostly focuses on datasets such as MNIST and FashionMNIST at a large $\epsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
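    The "privatize once and for all" idea reduces, in its simplest form, to releasing a noisy mean feature embedding under the Gaussian mechanism; the generator is then trained against this fixed target with no further privacy cost. The clipping scheme and constants below are standard textbook choices, and the perceptual feature extractor itself is not shown.

```python
import numpy as np

def private_mean_embedding(features, eps, delta, clip=1.0, rng=None):
    """(eps, delta)-DP release of the mean feature embedding of a private set."""
    rng = rng or np.random.default_rng(0)
    norms = np.maximum(np.linalg.norm(features, axis=1, keepdims=True) / clip, 1.0)
    phi = (features / norms).mean(0)       # clip each row to L2 norm <= clip, then average
    sens = 2 * clip / len(features)        # L2 sensitivity of the clipped mean
    sigma = sens * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return phi + rng.normal(scale=sigma, size=phi.shape)
```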
    Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization. (arXiv:2107.01131v3 [stat.ML] UPDATED)
    Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
    Learning Physics-Consistent Particle Interactions. (arXiv:2202.00299v3 [cs.LG] UPDATED)
    Interacting particle systems play a key role in science and engineering. Access to the governing particle interaction law is fundamental for a complete understanding of such systems. However, the inherent system complexity keeps the particle interaction hidden in many cases. Machine learning methods have the potential to learn the behavior of interacting particle systems by combining experiments with data analysis methods. However, most existing algorithms focus on learning the kinetics at the particle level. Learning pairwise interaction, e.g., pairwise force or pairwise potential energy, remains an open challenge. Here, we propose an algorithm that adapts the Graph Networks framework, which contains an edge part to learn the pairwise interaction and a node part to model the dynamics at the particle level. Different from existing approaches that use neural networks in both parts, we design a deterministic operator in the node part that allows us to precisely infer pairwise interactions that are consistent with underlying physical laws, by being trained only to predict the particle acceleration. We test the proposed methodology on multiple datasets and demonstrate that it achieves superior performance in correctly inferring the pairwise interactions while also being consistent with the underlying physics on all the datasets. The proposed framework is scalable to larger systems and transferable to any type of particle interaction, in contrast to previously proposed purely data-driven solutions. The developed methodology can support a better understanding and discovery of the underlying particle interaction laws, and hence guide the design of materials with targeted properties.
    TextHacker: Learning based Hybrid Local Search Algorithm for Text Hard-label Adversarial Attack. (arXiv:2201.08193v2 [cs.CL] UPDATED)
    Existing textual adversarial attacks usually utilize the gradient or prediction confidence to generate adversarial examples, making them hard to deploy in real-world applications. To this end, we consider a rarely investigated but more rigorous setting, namely the hard-label attack, in which the attacker can only access the prediction label. In particular, we find that we can learn the importance of different words via the change in prediction label caused by word substitutions on the adversarial examples. Based on this observation, we propose a novel adversarial attack, termed Text Hard-label attacker (TextHacker). TextHacker randomly perturbs a large number of words to craft an adversarial example. Then, TextHacker adopts a hybrid local search algorithm with an estimation of word importance from the attack history to minimize the adversarial perturbation. Extensive evaluations for text classification and textual entailment show that TextHacker significantly outperforms existing hard-label attacks in terms of both attack performance and adversary quality.
    Biological Sequence Design with GFlowNets. (arXiv:2203.04115v2 [q-bio.BM] UPDATED)
    Design of de novo biological sequences with desired properties, like protein and DNA sequences, often involves an active loop with several rounds of molecule ideation and expensive wet-lab evaluations. These experiments can consist of multiple stages, with increasing levels of precision and cost of evaluation, where candidates are filtered. This makes the diversity of proposed candidates a key consideration in the ideation phase. In this work, we propose an active learning algorithm leveraging epistemic uncertainty estimation and the recently proposed GFlowNets as a generator of diverse candidate solutions, with the objective to obtain a diverse batch of useful (as defined by some utility function, for example, the predicted anti-microbial activity of a peptide) and informative candidates after each round. We also propose a scheme to incorporate existing labeled datasets of candidates, in addition to a reward function, to speed up learning in GFlowNets. We present empirical results on several biological sequence design tasks, and we find that our method generates more diverse and novel batches with high scoring candidates compared to existing approaches.
    KGE-CL: Contrastive Learning of Tensor Decomposition Based Knowledge Graph Embeddings. (arXiv:2112.04871v2 [cs.AI] UPDATED)
    Learning the embeddings of knowledge graphs (KG) is vital in artificial intelligence, and can benefit various downstream applications, such as recommendation and question answering. In recent years, many research efforts have been proposed for knowledge graph embedding (KGE). However, most previous KGE methods ignore the semantic similarity between related entities and entity-relation couples in different triples, since they separately optimize each triple with the scoring function. To address this problem, we propose a simple yet efficient contrastive learning framework for tensor decomposition based (TDB) KGE, which can shorten the semantic distance of related entities and entity-relation couples in different triples and thus improve the performance of KGE. We evaluate our proposed method on three standard KGE datasets: WN18RR, FB15k-237 and YAGO3-10. Our method yields new state-of-the-art results, achieving 51.2% MRR, 46.8% Hits@1 on the WN18RR dataset, 37.8% MRR, 28.6% Hits@1 on the FB15k-237 dataset, and 59.1% MRR, 51.8% Hits@1 on the YAGO3-10 dataset.
    Longitudinal cardio-respiratory fitness prediction through wearables in free-living environments. (arXiv:2205.03116v2 [cs.LG] UPDATED)
    Cardiorespiratory fitness is an established predictor of metabolic disease and mortality. Fitness is directly measured as maximal oxygen consumption (VO$_{2}max$), or indirectly assessed using heart rate responses to standard exercise tests. However, such testing is costly and burdensome because it requires specialized equipment such as treadmills and oxygen masks, limiting its utility. Modern wearables capture dynamic real-world data which could improve fitness prediction. In this work, we design algorithms and models that convert raw wearable sensor data into cardiorespiratory fitness estimates. We validate these estimates' ability to capture fitness profiles in free-living conditions using the Fenland Study (N=11,059), along with its longitudinal cohort (N=2,675), and a third external cohort using the UK Biobank Validation Study (N=181) who underwent maximal VO$_{2}max$ testing, the gold standard measurement of fitness. Our results show that the combination of wearables and other biomarkers as inputs to neural networks yields a strong correlation to ground truth in a holdout sample (r = 0.82, 95% CI 0.80-0.83), outperforming other approaches and models, and detects fitness change over time (e.g., after 7 years). We also show how the model's latent space can be used for fitness-aware patient subtyping, paving the way to scalable interventions and personalized trial recruitment. These results demonstrate the value of wearables for estimating fitness, which today can be measured only with laboratory tests.
    LMPriors: Pre-Trained Language Models as Task-Specific Priors. (arXiv:2210.12530v1 [cs.LG])
    Particularly in low-data regimes, an outstanding challenge in machine learning is developing principled techniques for augmenting our models with suitable priors. This is to encourage them to learn in ways that are compatible with our understanding of the world. But in contrast to generic priors such as shrinkage or sparsity, we draw inspiration from the recent successes of large-scale language models (LMs) to construct task-specific priors distilled from the rich knowledge of LMs. Our method, Language Model Priors (LMPriors), incorporates auxiliary natural language metadata about the task -- such as variable names and descriptions -- to encourage downstream model outputs to be consistent with the LM's common-sense reasoning based on the metadata. Empirically, we demonstrate that LMPriors improve model performance in settings where such natural language descriptions are available, and perform well on several tasks that benefit from such prior knowledge, such as feature selection, causal inference, and safe reinforcement learning.
    Parallel Successive Learning for Dynamic Distributed Model Training over Heterogeneous Wireless Networks. (arXiv:2202.02947v5 [cs.LG] UPDATED)
    Federated learning (FedL) has emerged as a popular technique for distributing model training over a set of wireless devices, via iterative local updates (at devices) and global aggregations (at the server). In this paper, we develop parallel successive learning (PSL), which expands the FedL architecture along three dimensions: (i) Network, allowing decentralized cooperation among the devices via device-to-device (D2D) communications. (ii) Heterogeneity, interpreted at three levels: (ii-a) Learning: PSL considers heterogeneous number of stochastic gradient descent iterations with different mini-batch sizes at the devices; (ii-b) Data: PSL presumes a dynamic environment with data arrival and departure, where the distributions of local datasets evolve over time, captured via a new metric for model/concept drift. (ii-c) Device: PSL considers devices with different computation and communication capabilities. (iii) Proximity, where devices have different distances to each other and the access point. PSL considers the realistic scenario where global aggregations are conducted with idle times in-between them for resource efficiency improvements, and incorporates data dispersion and model dispersion with local model condensation into FedL. Our analysis sheds light on the notion of cold vs. warmed up models, and model inertia in distributed machine learning. We then propose network-aware dynamic model tracking to optimize the model learning vs. resource efficiency tradeoff, which we show is an NP-hard signomial programming problem. We finally solve this problem through proposing a general optimization solver. Our numerical results reveal new findings on the interdependencies between the idle times in-between the global aggregations, model/concept drift, and D2D cooperation configuration.
    A prediction-based approach for online dynamic patient scheduling: a case study in radiotherapy treatment. (arXiv:2112.08549v2 [cs.LG] UPDATED)
    Patient scheduling is a difficult task involving stochastic factors such as the unknown arrival times of patients. Similarly, the scheduling of radiotherapy for cancer treatments needs to handle patients with different urgency levels when allocating resources. High priority patients may arrive at any time, and there must be resources available to accommodate them. A common solution is to reserve a flat percentage of treatment capacity for emergency patients. However, this solution can result in overdue treatments for urgent patients, a failure to fully exploit treatment capacity, and delayed treatments for low-priority patients. This problem is especially severe in large and crowded hospitals. In this paper, we propose a prediction-based approach for online dynamic radiotherapy scheduling that dynamically adapts the present scheduling decision based on each incoming patient and the current allocation of resources. Our approach is based on a regression model trained to recognize the links between patients' arrival patterns, and their ideal waiting time in optimal offline solutions where all future arrivals are known in advance. When our prediction-based approach is compared to flat-reservation policies, it does a better job of preventing overdue treatments for emergency patients, while also maintaining comparable waiting times for the other patients. We also demonstrate how our proposed approach supports explainability and interpretability in scheduling decisions using SHAP values.
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v2 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
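    Setting censoring aside (handling it is the paper's main contribution), the shared-grid idea reduces to a pinball loss summed over quantile levels emitted by one network. A hedged torch sketch; layer sizes are arbitrary, and quantile-crossing control is not shown:

```python
import torch

def grid_pinball(pred, y, taus):
    """Pinball loss averaged over a grid of quantiles from a single network.
    pred: (batch, Q) predicted quantiles, y: (batch,) targets, taus: (Q,) levels."""
    diff = y[:, None] - pred                   # positive where the model under-predicts
    return torch.mean(torch.maximum(taus * diff, (taus - 1) * diff))

taus = torch.linspace(0.05, 0.95, 19)
net = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                          torch.nn.Linear(64, len(taus)))
# loss = grid_pinball(net(x), y, taus)   # one optimisation step fits all 19 quantiles
```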
    Efficient (Soft) Q-Learning for Text Generation with Limited Good Data. (arXiv:2106.07704v4 [cs.CL] UPDATED)
    Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which is not applicable to many emerging applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL) on the other hand offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning (SQL) perspective. It enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates, and learn effectively from sparse reward. We apply the approach to a wide range of novel text generation tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and the previous RL methods.
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v3 [cs.LG] UPDATED)
    We propose generative multitask learning (GMTL), a simple and scalable approach to causal representation learning for multitask learning. Our approach makes a minor change to the conventional multitask inference objective, and improves robustness to target shift. Since GMTL only modifies the inference objective, it can be used with existing multitask learning methods without requiring additional training. The improvement in robustness comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as \emph{target-causing confounders}. These confounders induce spurious dependencies between the input and targets. This poses a problem for conventional multitask learning, due to its assumption that the targets are conditionally independent given the input. GMTL mitigates target-causing confounding at inference time, by removing the influence of the joint target distribution, and predicting all targets jointly. This removes the spurious dependencies between the input and targets, where the degree of removal is adjustable via a single hyperparameter. This flexibility is useful for managing the trade-off between in- and out-of-distribution generalization. Our results on the Attributes of People and Taskonomy datasets reflect an improved robustness to target shift across four multitask learning methods.
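    Our reading of the inference-time change, as a toy enumeration over joint target configurations: score each configuration by the summed per-task conditionals minus alpha times the joint target log-prior, where alpha is the single trade-off hyperparameter the abstract mentions; everything else here is illustrative.

```python
import numpy as np
from itertools import product

def gmtl_predict(cond_logps, joint_log_prior, alpha=1.0):
    """cond_logps: list of K arrays, cond_logps[k][i] = log p(y_k = i | x).
    joint_log_prior: K-dimensional array with log p(y_1, ..., y_K).
    Predict all targets jointly while discounting the target prior by alpha."""
    best, best_score = None, -np.inf
    for cfg in product(*(range(len(c)) for c in cond_logps)):
        score = sum(c[i] for c, i in zip(cond_logps, cfg)) - alpha * joint_log_prior[cfg]
        if score > best_score:
            best, best_score = cfg, score
    return best
```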
    BanditMF: Multi-Armed Bandit Based Matrix Factorization Recommender System. (arXiv:2106.10898v4 [cs.IR] UPDATED)
    Multi-armed bandits (MAB) provide a principled online learning approach to attain the balance between exploration and exploitation. Owing to their strong performance and their ability to learn from limited feedback, multi-armed bandits have drawn widespread attention in applications such as recommender systems. Within recommender systems, collaborative filtering (CF) is arguably the earliest and most influential method. Crucially, new users and an ever-changing pool of recommended items are challenges that recommender systems need to address. For collaborative filtering, the classical approach is to train the model offline and then perform online testing, but this approach can no longer handle dynamic changes in user preferences, i.e., the so-called cold-start problem. So how can we effectively recommend items to users in the absence of reliable information? To address the aforementioned problems, we propose a multi-armed bandit based collaborative filtering recommender system, named BanditMF. BanditMF is designed to address two challenges in multi-armed bandit algorithms and collaborative filtering: (1) how to solve the cold-start problem for collaborative filtering under a scarcity of valid information, and (2) how to solve the sub-optimality of bandit algorithms in domains with strong social relations, caused by independently estimating the unknown parameters associated with each user and ignoring correlations between users.
    Quantum Cross Entropy and Maximum Likelihood Principle. (arXiv:2102.11887v3 [quant-ph] UPDATED)
    Quantum machine learning is an emerging field at the intersection of machine learning and quantum computing. Classical cross entropy plays a central role in machine learning. We define its quantum generalization, the quantum cross entropy, prove its lower bounds, and investigate its relation to quantum fidelity. In the classical case, minimizing cross entropy is equivalent to maximizing likelihood. In the quantum case, when the quantum cross entropy is constructed from quantum data undisturbed by quantum measurements, this relation holds. Classical cross entropy is equal to the negative log-likelihood. When we obtain the quantum cross entropy through an empirical density matrix based on measurement outcomes, the quantum cross entropy is lower-bounded by the negative log-likelihood. These two different scenarios illustrate the information loss incurred when making quantum measurements. We conclude that to achieve the goal of full quantum machine learning, it is crucial to utilize the deferred measurement principle.
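    For reference, the standard definitions behind these statements, in our notation (the paper's conventions may differ):

```latex
% quantum cross entropy of a data state \rho under a model state \sigma
S_{\mathrm{CE}}(\rho,\sigma) = -\operatorname{Tr}(\rho \log \sigma)
% decomposing it into von Neumann entropy plus relative entropy gives the lower bound
S_{\mathrm{CE}}(\rho,\sigma) = S(\rho) + D(\rho\,\|\,\sigma) \;\ge\; S(\rho),
\qquad D(\rho\,\|\,\sigma) = \operatorname{Tr}\bigl(\rho(\log\rho - \log\sigma)\bigr) \ge 0
```

with equality iff $\rho = \sigma$, mirroring the classical fact that $H(p,q) \ge H(p)$.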
    Self-Supervision is All You Need for Solving Rubik's Cube. (arXiv:2106.03157v4 [cs.LG] UPDATED)
    Existing combinatorial search methods are often complex and require some expertise. In this work, we propose a simple and performant deep learning method, especially for goal-predefined combinatorial problems represented by Rubik's Cube. We show that, for solving such problems with high optimality, it can be sufficient to train a deep neural network on random scrambles branching from the goal state. When tested on Rubik's Cube, our method outperformed the previous state-of-the-art method DeepCubeA in solution optimality, being 10 times more efficient in training and 2.0 times in inference. One key assumption is that, when viewed from scrambled states, random moves from the goal are biased to be optimal. We also demonstrate the proposed method on 15 Puzzle and Lights Out.
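    The data-generation recipe is simple to sketch. Here `apply_move` and `moves` stand in for an actual cube implementation, and labelling each scrambled state with its last scramble move (to be inverted when solving) is our guess at the supervision signal:

```python
import random

def scramble_dataset(goal_state, apply_move, moves, max_depth, n):
    """Self-supervised data for goal-predefined search: branch random scrambles
    from the goal and record each final state together with the move that produced it."""
    data = []
    for _ in range(n):
        s = goal_state
        for _ in range(random.randint(1, max_depth)):
            m = random.choice(moves)
            s = apply_move(s, m)
        data.append((s, m))   # at solve time, predicting m and inverting it steps toward the goal
    return data
```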
    LGD-GCN: Local and Global Disentangled Graph Convolutional Networks. (arXiv:2104.11893v2 [cs.LG] UPDATED)
    Disentangled Graph Convolutional Network (DisenGCN) is an encouraging framework to disentangle the latent factors arising in a real-world graph. However, it relies heavily on disentangling information from a local range (i.e., a node and its 1-hop neighbors), while the local information in many cases can be uneven and incomplete, hindering the interpretability power and model performance of DisenGCN. In this paper, we introduce a novel Local and Global Disentangled Graph Convolutional Network (LGD-GCN) to capture both local and global information for graph disentanglement. LGD-GCN performs statistical mixture modeling to derive a factor-aware latent continuous space, and then constructs different structures w.r.t. different factors from the revealed space. In this way, the global factor-specific information can be efficiently and selectively encoded via message passing along these built structures, strengthening the intra-factor consistency. We also propose a novel diversity-promoting regularizer, employed with the latent space modeling, to encourage inter-factor diversity. Evaluations of the proposed LGD-GCN on synthetic and real-world datasets show better interpretability and improved performance in node classification over the existing competitive models.
    Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations. (arXiv:2203.09590v4 [cs.CL] UPDATED)
    World knowledge exists in both structured (tables, knowledge graphs) and unstructured forms (texts). Recently, there have been extensive research efforts in the integration of structured factual knowledge and unstructured textual knowledge. However, most studies focus on incorporating static factual knowledge into pre-trained language models, while there is less work on enhancing temporal knowledge graph embedding using textual knowledge. Existing integration approaches cannot be applied to temporal knowledge graphs (tKGs), since they often assume that knowledge embeddings are time-invariant. In fact, the entity embeddings in tKG embedding models usually evolve over time, which poses the challenge of aligning temporally relevant textual information with entities. To this end, we propose Enhanced Temporal Knowledge Embeddings with Contextualized Language Representations (ECOLA), which uses tKG quadruples as an implicit measure to temporally align textual data with the time-evolving entity representations, and uses a novel knowledge-text prediction task to inject textual information into temporal knowledge embeddings. ECOLA jointly optimizes the knowledge-text prediction objective and the temporal knowledge embedding objective, and thus can simultaneously take full advantage of textual and structured knowledge. Since existing datasets do not provide tKGs with aligned textual data, we introduce three new datasets for training and evaluating ECOLA. Experimental results on the temporal knowledge graph completion task show that ECOLA outperforms state-of-the-art tKG embedding models by a large margin.
    Multi-Edge Server-Assisted Dynamic Federated Learning with an Optimized Floating Aggregation Point. (arXiv:2203.13950v2 [cs.LG] UPDATED)
    We propose cooperative edge-assisted dynamic federated learning (CE-FL). CE-FL introduces a distributed machine learning (ML) architecture, where data collection is carried out at the end devices, while the model training is conducted cooperatively at the end devices and the edge servers, enabled via data offloading from the end devices to the edge servers through base stations. CE-FL also introduces floating aggregation point, where the local models generated at the devices and the servers are aggregated at an edge server, which varies from one model training round to another to cope with the network evolution in terms of data distribution and users' mobility. CE-FL considers the heterogeneity of network elements in terms of communication/computation models and the proximity to one another. CE-FL further presumes a dynamic environment with online variation of data at the network devices which causes a drift at the ML model performance. We model the processes taken during CE-FL, and conduct analytical convergence analysis of its ML model training. We then formulate network-aware CE-FL which aims to adaptively optimize all the network elements via tuning their contribution to the learning process, which turns out to be a non-convex mixed integer problem. Motivated by the large scale of the system, we propose a distributed optimization solver to break down the computation of the solution across the network elements. We finally demonstrate the effectiveness of our framework with the data collected from a real-world testbed.
    Multimodal data matters: language model pre-training over structured and unstructured electronic health records. (arXiv:2201.10113v6 [cs.CL] UPDATED)
    As two important textual modalities in electronic health records (EHR), both structured data (clinical codes) and unstructured data (clinical narratives) have recently been increasingly applied in the healthcare domain. Most existing EHR-oriented studies, however, either focus on a particular modality or integrate data from different modalities in a straightforward manner, which usually treats structured and unstructured data as two independent sources of information about a patient admission and ignores the intrinsic interactions between them. In fact, the two modalities are documented during the same encounter, where structured data inform the documentation of unstructured data and vice versa. In this paper, we propose a Medical Multimodal Pre-trained Language Model, named MedM-PLM, to learn enhanced EHR representations over structured and unstructured data and explore the interaction between the two modalities. In MedM-PLM, two Transformer-based neural network components are first adopted to learn representative characteristics from each modality. A cross-modal module is then introduced to model their interactions. We pre-trained MedM-PLM on the MIMIC-III dataset and verified the effectiveness of the model on three downstream clinical tasks, i.e., medication recommendation, 30-day readmission prediction and ICD coding. Extensive experiments demonstrate the power of MedM-PLM compared with state-of-the-art methods. Further analyses and visualizations show the robustness of our model, which could potentially provide more comprehensive interpretations for clinical decision-making.
    Learning a subspace of policies for online adaptation in Reinforcement Learning. (arXiv:2110.05169v3 [cs.LG] UPDATED)
    Deep Reinforcement Learning (RL) is mainly studied in a setting where the training and the testing environments are similar. But in many practical applications, these environments may differ. For instance, in control systems, the robot(s) on which a policy is learned might differ from the robot(s) on which a policy will run. It can be caused by different internal factors (e.g., calibration issues, system attrition, defective modules) or also by external changes (e.g., weather conditions). There is a need to develop RL methods that generalize well to variations of the training conditions. In this article, we consider the simplest, yet hard to tackle, generalization setting where the test environment is unknown at train time, forcing the agent to adapt to the system's new dynamics. This online adaptation process can be computationally expensive (e.g., fine-tuning) and cannot rely on meta-RL techniques since there is just a single train environment. To address this, we propose an approach where we learn a subspace of policies within the parameter space. This subspace contains an infinite number of policies that are trained to solve the training environment while having different parameter values. As a consequence, two policies in that subspace process information differently and exhibit different behaviors when facing variations of the train environment. Our experiments, carried out over a large variety of benchmarks, compare our approach with baselines, including diversity-based methods. In comparison, our approach is simple to tune, does not need any extra component (e.g., discriminator) and learns policies able to gather a high reward on unseen environments.
    Data Augmentation for Bayesian Deep Learning. (arXiv:1903.09668v4 [stat.ML] UPDATED)
    Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial improvement in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.
    Exact Recovery of Community Structures Using DeepWalk and Node2vec. (arXiv:2101.07354v2 [stat.ML] UPDATED)
    Random-walk based network embedding algorithms like DeepWalk and node2vec are widely used to obtain Euclidean representations of the nodes in a network prior to performing downstream inference tasks. However, despite their impressive empirical performance, there is a lack of theoretical results explaining their large-sample behavior. In this paper, we study node2vec and DeepWalk through the perspective of matrix factorization. In particular, we analyze these algorithms in the setting of community detection for stochastic blockmodel graphs (and their degree-corrected variants). By exploiting the row-wise uniform perturbation bound for leading singular vectors, we derive high-probability error bounds between the matrix factorization-based node2vec/DeepWalk embeddings and their true counterparts, uniformly over all node embeddings. Based on strong concentration results, we further show perfect membership recovery by node2vec/DeepWalk, followed by $K$-means/medians algorithms. Specifically, as the network becomes sparser, our results guarantee that with a large enough window size and number of vertices, applying $K$-means/medians on the matrix factorization-based node2vec embeddings can, with high probability, correctly recover the memberships of all vertices in a network generated from the stochastic blockmodel (or its degree-corrected variants). The theoretical justifications are mirrored in the numerical experiments and real data applications, for both the original node2vec and its matrix factorization variant.
    On-Demand Unlabeled Personalized Federated Learning. (arXiv:2111.08356v3 [cs.LG] UPDATED)
    In Federated Learning (FL), multiple clients collaborate to learn a shared model through a central server while keeping data decentralized. Personalized Federated Learning (PFL) further extends FL by learning a personalized model per client. In both FL and PFL, all clients participate in the training process and their labeled data are used for training. However, in reality, novel clients may wish to join a prediction service after it has been deployed, obtaining predictions for their own \textbf{unlabeled} data. Here, we introduce a new learning setup, On-Demand Unlabeled PFL (OD-PFL), where a system trained on a set of clients, needs to be later applied to novel unlabeled clients at inference time. We propose a novel approach to this problem, ODPFL-HN, which learns to produce a new model for the late-to-the-party client. Specifically, we train an encoder network that learns a representation for a client given its unlabeled data. That client representation is fed to a hypernetwork that generates a personalized model for that client. Evaluated on five benchmark datasets, we find that ODPFL-HN generalizes better than the current FL and PFL methods, especially when the novel client has a large shift from training clients. We also analyzed the generalization error for novel clients, and showed analytically and experimentally how novel clients can apply differential privacy.
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v3 [stat.ML] UPDATED)
    Debiased machine learning is a meta-algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals, i.e., scalar summaries, of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite-sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill-posed inverse problems.
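    For concreteness, a representative estimator from this literature takes the standard doubly robust form with estimated nuisances $\hat\gamma$ and $\hat\alpha$ (this display is only indicative; the theorem covers a general class):

        \[
          \hat{\theta} \;=\; \frac{1}{n} \sum_{i=1}^{n}
          \Big[ m(W_i; \hat{\gamma}) + \hat{\alpha}(X_i)\,\big(Y_i - \hat{\gamma}(X_i)\big) \Big],
        \]

    where $m$ identifies the target functional, $\hat\gamma$ is the machine learning regression, and $\hat\alpha$ is the bias-correcting Riesz representer.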
    A Modular Deep Learning Pipeline for Galaxy-Scale Strong Gravitational Lens Detection and Modeling. (arXiv:1911.03867v3 [astro-ph.IM] UPDATED)
    Upcoming large astronomical surveys are expected to capture an unprecedented number of strong gravitational lensing systems. Deep learning is emerging as a promising practical tool for the detection and quantification of these galaxy-scale image distortions. The absence of large quantities of representative data from current astronomical surveys motivates the development of a robust forward-modeling approach using synthetic lensing images. Using a mock sample of strong lenses built upon state-of-the-art extragalactic catalogs, we train a modular deep learning pipeline for uncertainty-quantified detection and modeling, with intermediate image processing components for denoising and deblending the lensing systems. We demonstrate a high degree of interpretability and controlled systematics due to domain-specific task modules trained with different stages of synthetic image generation. For lens detection and modeling, we obtain semantically meaningful latent spaces that separate classes of strong lens images and yield uncertainty estimates that explain the origin of misclassified images and provide probabilistic predictions for the lens parameters. Validation of the inference pipeline has been carried out using images from the Subaru telescope's Hyper Suprime-Cam camera and LSST DESC simulated DC2 sky survey catalogues.
    Reinforcement Learning for Ridesharing: An Extended Survey. (arXiv:2105.01099v8 [cs.LG] UPDATED)
    In this paper, we present a comprehensive, in-depth survey of the literature on reinforcement learning approaches to decision optimization problems in a typical ridesharing system. Papers on the topics of rideshare matching, vehicle repositioning, ride-pooling, routing, and dynamic pricing are covered. Most of the literature has appeared in the last few years, and several core challenges remain to be tackled: model complexity, agent coordination, and joint optimization of multiple levers. Hence, we also introduce popular data sets and open simulation environments to facilitate further research and development. Subsequently, we discuss a number of challenges and opportunities for reinforcement learning research on this important domain.
    Estimating oil and gas recovery factors via machine learning: Database-dependent accuracy and reliability. (arXiv:2210.12491v1 [cs.LG])
    With recent advances in artificial intelligence, machine learning (ML) approaches have become an attractive tool in petroleum engineering, particularly for reservoir characterization. A key reservoir property is the hydrocarbon recovery factor (RF), whose accurate estimation would provide decisive insights for drilling and production strategies. Therefore, this study aims to estimate the hydrocarbon RF for exploration from various reservoir characteristics, such as porosity, permeability, pressure, and water saturation, via ML. We applied three regression-based models, including extreme gradient boosting (XGBoost), support vector machine (SVM), and stepwise multiple linear regression (MLR), and various combinations of three databases to construct ML models and estimate the oil and/or gas RF. Using two databases and the cross-validation method, we evaluated the performance of the ML models. In each iteration, 90% and 10% of the data were used to train and test the models, respectively. The third independent database was then used to further assess the constructed models. For both oil and gas RFs, we found that the XGBoost model estimated the RF for the train and test datasets more accurately than the SVM and MLR models. However, the performance of all the models was unsatisfactory on the independent database. Results demonstrated that the ML algorithms were highly dependent on and sensitive to the databases on which they were trained. Statistical tests revealed that these unsatisfactory performances arose because the distributions of input features and target variables in the train datasets were significantly different from those in the independent database (p-value < 0.05).
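    A hedged sketch of the 90/10 cross-validation protocol described above, with synthetic stand-ins for the reservoir features (the study's actual databases and tuning are not reproduced):

        # 10-fold CV: each fold trains on 90% and tests on 10% of the data.
        import numpy as np
        from sklearn.model_selection import KFold
        from xgboost import XGBRegressor

        rng = np.random.default_rng(1)
        X = rng.random((500, 4))  # stand-ins: porosity, permeability, pressure, Sw
        y = 0.3 * X[:, 0] + 0.2 * X[:, 1] + 0.05 * rng.standard_normal(500)  # toy RF

        kf = KFold(n_splits=10, shuffle=True, random_state=1)
        scores = []
        for train_idx, test_idx in kf.split(X):
            model = XGBRegressor(n_estimators=200, max_depth=4)
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))  # R^2 per fold
        print(np.mean(scores))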
    Wide Boosting. (arXiv:2007.09855v4 [cs.LG] UPDATED)
    Gradient Boosting (GB) is a popular methodology used to solve prediction problems by minimizing a differentiable loss function, $L$. GB performs very well on tabular machine learning (ML) problems; however, as a pure ML solver it lacks the ability to fit models with probabilistic but correlated multi-dimensional outputs, for example, multiple correlated Bernoulli outputs. GB also does not form intermediate abstract data embeddings, one property of Deep Learning that gives greater flexibility and performance on other types of problems. This paper presents a simple adjustment to GB motivated in part by artificial neural networks. Specifically, our adjustment inserts a matrix multiplication between the output of a GB model and the loss, $L$. This allows the output of a GB model to have increased dimension prior to being fed into the loss and is thus ``wider'' than standard GB implementations. We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks and that the embeddings generated by WB are more useful in downstream prediction tasks than GB output predictions alone.
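    The adjustment can be illustrated with plain numpy (assumed squared-error form; a real implementation would fit trees to the back-propagated gradients rather than take gradient steps):

        # Illustration: the booster outputs z, a fixed matrix W widens it, and
        # the loss is applied to W @ z; gradients w.r.t. z follow by chain rule.
        import numpy as np

        rng = np.random.default_rng(0)
        d, m = 3, 8                      # booster output dim vs. wide loss-side dim
        W = rng.standard_normal((m, d))  # the inserted matrix multiplication
        y = rng.standard_normal(m)
        z = np.zeros(d)                  # current GB model output for one example

        for _ in range(100):             # gradient steps stand in for boosting rounds
            grad_wide = W @ z - y        # gradient of squared loss at y_hat = W @ z
            grad_z = W.T @ grad_wide     # chain rule back to the booster's output
            z -= 0.1 * grad_z            # real GB would fit a tree to -grad_z

        print(np.linalg.norm(W @ z - y))  # loss after "boosting"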
    Abstract Interpretation-Based Feature Importance for SVMs. (arXiv:2210.12456v1 [cs.LG])
    We propose a symbolic representation for support vector machines (SVMs) by means of abstract interpretation, a well-known and successful technique for designing and implementing static program analyses. We leverage this abstraction in two ways: (1) to enhance the interpretability of SVMs by deriving a novel feature importance measure, called abstract feature importance (AFI), that does not depend in any way on a given dataset or on the accuracy of the SVM and is very fast to compute, and (2) for verifying stability, notably individual fairness, of SVMs and producing concrete counterexamples when the verification fails. We implemented our approach and empirically demonstrated its effectiveness on SVMs based on linear and non-linear (polynomial and radial basis function) kernels. Our experimental results show that, independently of the accuracy of the SVM, our AFI measure correlates much more strongly with the stability of the SVM to feature perturbations than feature importance measures widely available in machine learning software, such as permutation feature importance. It thus gives better insight into the trustworthiness of SVMs.
    Diversity-Promoting Ensemble for Medical Image Segmentation. (arXiv:2210.12388v1 [eess.IV])
    Medical image segmentation is an actively studied task in medical imaging, where the precision of the annotations is of utmost importance for accurate diagnosis and treatment. In recent years, the task has been approached with various deep learning systems, among the most popular models being U-Net. In this work, we propose a novel strategy to generate ensembles of different architectures for medical image segmentation by leveraging the diversity (decorrelation) of the models forming the ensemble. More specifically, we utilize the Dice score among model pairs to estimate the correlation between the outputs of the two models forming each pair. To promote diversity, we select models with low Dice scores among each other. We carry out gastro-intestinal tract image segmentation experiments to compare our diversity-promoting ensemble (DiPE) with another strategy to create ensembles, based on selecting the top scoring U-Net models. Our empirical results show that DiPE surpasses both individual models and the ensemble creation strategy based on selecting the top scoring models.
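    A rough sketch of the selection rule (a greedy variant is assumed here; the paper may rank model pairs differently):

        # Greedily keep models whose predictions have low Dice overlap with the
        # already-selected ensemble members.
        import numpy as np

        def dice(a, b, eps=1e-8):
            # Dice score between two binary segmentation masks.
            return (2 * np.logical_and(a, b).sum() + eps) / (a.sum() + b.sum() + eps)

        rng = np.random.default_rng(0)
        preds = rng.random((6, 64, 64)) > 0.5   # masks from 6 candidate models

        selected = [0]                          # seed with an arbitrary model
        while len(selected) < 3:
            # candidate least correlated (lowest worst-case Dice) with the ensemble
            best = min(
                (i for i in range(len(preds)) if i not in selected),
                key=lambda i: max(dice(preds[i], preds[j]) for j in selected),
            )
            selected.append(best)

        ensemble = np.mean(preds[selected], axis=0) > 0.5  # majority-style fusion
        print(selected, ensemble.shape)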
    On-Demand Sampling: Learning Optimally from Multiple Distributions. (arXiv:2210.12529v1 [cs.LG])
    Social and real-world considerations such as robustness, fairness, social welfare and multi-agent tradeoffs have given rise to multi-distribution learning paradigms, such as collaborative, group distributionally robust, and fair federated learning. In each of these settings, a learner seeks to minimize its worst-case loss over a set of $n$ predefined distributions, while using as few samples as possible. In this paper, we establish the optimal sample complexity of these learning paradigms and give algorithms that meet this sample complexity. Importantly, our sample complexity bounds exceed that of the sample complexity of learning a single distribution only by an additive factor of $n \log(n) / \epsilon^2$. These improve upon the best known sample complexity of agnostic federated learning by Mohri et al. by a multiplicative factor of $n$, the sample complexity of collaborative learning by Nguyen and Zakynthinou by a multiplicative factor $\log n / \epsilon^3$, and give the first sample complexity bounds for the group DRO objective of Sagawa et al. To achieve optimal sample complexity, our algorithms learn to sample and learn from distributions on demand. Our algorithm design and analysis is enabled by our extensions of stochastic optimization techniques for solving stochastic zero-sum games. In particular, we contribute variants of Stochastic Mirror Descent that can trade off between players' access to cheap one-off samples or more expensive reusable ones.
    Fundamental limits for learning hidden Markov model parameters. (arXiv:2106.12936v3 [stat.ML] UPDATED)
    We study the frontier between learnable and unlearnable hidden Markov models (HMMs). HMMs are flexible tools for clustering dependent data coming from unknown populations. The model parameters are known to be fully identifiable (up to label-switching) without any modeling assumption on the distributions of the populations as soon as the clusters are distinct and the hidden chain is ergodic with a full rank transition matrix. In the limit as any one of these conditions fails, it becomes impossible in general to identify parameters. For a chain with two hidden states we prove nonasymptotic minimax upper and lower bounds, matching up to constants, which exhibit thresholds at which the parameters become learnable. We also provide an upper bound on the relative entropy rate for parameters in a neighbourhood of the unlearnable region, which may be of independent interest.
    Quantum Data Compression and Quantum Cross Entropy. (arXiv:2106.13823v2 [quant-ph] UPDATED)
    Quantum machine learning is an emerging field at the intersection of machine learning and quantum computing. A central quantity for the theoretical foundation of quantum machine learning is the quantum cross entropy. In this paper, we present one operational interpretation of this quantity, that the quantum cross entropy is the compression rate for sub-optimal quantum source coding. To do so, we give a simple, universal quantum data compression protocol, which is developed based on quantum generalization of variable-length coding, as well as quantum strong typicality.
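    For reference, the standard definition of the quantum cross entropy between a source state $\rho$ and a model state $\sigma$ (notation assumed here); it reduces to the classical cross entropy when the states commute:

        \[
          S_{\mathrm{cross}}(\rho, \sigma) \;=\; -\operatorname{Tr}\!\big(\rho \log \sigma\big),
          \qquad
          H(p, q) \;=\; -\sum_i p_i \log q_i \quad \text{(classical analogue)}.
        \]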
    Perceive, Represent, Generate: Translating Multimodal Information to Robotic Motion Trajectories. (arXiv:2204.03051v2 [cs.RO] UPDATED)
    We present Perceive-Represent-Generate (PRG), a novel three-stage framework that maps perceptual information of different modalities (e.g., visual or sound), corresponding to a sequence of instructions, to an adequate sequence of movements to be executed by a robot. In the first stage, we perceive and pre-process the given inputs, isolating individual commands from the complete instruction provided by a human user. In the second stage we encode the individual commands into a multimodal latent space, employing a deep generative model. Finally, in the third stage we convert the multimodal latent values into individual trajectories and combine them into a single dynamic movement primitive, allowing its execution in a robotic platform. We evaluate our pipeline in the context of a novel robotic handwriting task, where the robot receives as input a word through different perceptual modalities (e.g., image, sound), and generates the corresponding motion trajectory to write it, creating coherent and readable handwritten words.
    K-LITE: Learning Transferable Visual Models with External Knowledge. (arXiv:2204.09222v2 [cs.CV] UPDATED)
    The new generation of state-of-the-art computer vision systems is trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, due to the broad concept coverage achieved via a large-scale data collection process. Alternatively, we argue that learning with external knowledge is a promising way to leverage a much more structured source of supervision and offer sample efficiency. We propose K-LITE, a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in text with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts. In evaluation, the text is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is available at https://github.com/microsoft/klite.
    NeuroMapper: In-browser Visualizer for Neural Network Training. (arXiv:2210.12492v1 [cs.HC])
    We present our ongoing work NeuroMapper, an in-browser visualization tool that helps machine learning (ML) developers interpret the evolution of a model during training, providing a new way to monitor the training process and visually discover reasons for suboptimal training. While most existing deep neural network (DNN) interpretation tools are designed for already-trained models, NeuroMapper scalably visualizes the evolution of the embeddings of a model's blocks across training epochs, enabling real-time visualization of 40,000 embedded points. To promote the embedding visualizations' spatial coherence across epochs, NeuroMapper adapts AlignedUMAP, a recent nonlinear dimensionality reduction technique, to align the embeddings. With NeuroMapper, users can explore the training dynamics of a Resnet-50 model and adjust the embedding visualizations' parameters in real time. NeuroMapper is open-sourced at https://github.com/poloclub/NeuroMapper and runs in all modern web browsers. A demo of the tool in action is available at: https://poloclub.github.io/NeuroMapper/.
    Reproducibility Issues for BERT-based Evaluation Metrics. (arXiv:2204.00004v2 [cs.CL] UPDATED)
    Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) pointed out problems with the reproducibility of the dominant metric, BLEU, at the time of publication. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not to human scores but to a wrong column in the CSV file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover Distance).
    Bridging the Gap Between Target Networks and Functional Regularization. (arXiv:2210.12282v1 [cs.LG])
    Bootstrapping is behind much of the success of Deep Reinforcement Learning. However, learning the value function via bootstrapping often leads to unstable training due to fast-changing target values. Target Networks are employed to stabilize training by using an additional set of lagging parameters to estimate the target values. Despite the popularity of Target Networks, their effect on the optimization is still poorly understood. In this work, we show that they act as an implicit regularizer. This regularizer has disadvantages such as being inflexible and non-convex. To overcome these issues, we propose an explicit Functional Regularization that is a convex regularizer in function space and can easily be tuned. We analyze the convergence of our method theoretically and empirically demonstrate that replacing Target Networks with the more theoretically grounded Functional Regularization approach leads to better sample efficiency and performance improvements.
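    A minimal PyTorch sketch of the contrast being drawn (the loss form and notation are assumptions, not the paper's exact algorithm): the lagging snapshot appears only inside an explicit, convex penalty in function space, while bootstrapping uses the online network itself:

        import copy
        import torch

        q = torch.nn.Sequential(torch.nn.Linear(4, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 2))
        q_lag = copy.deepcopy(q)          # refreshed every K steps, like a target net
        opt = torch.optim.Adam(q.parameters(), lr=1e-3)
        gamma, lam = 0.99, 1.0            # discount and regularization strength

        def loss_fn(s, a, r, s2):
            with torch.no_grad():
                target = r + gamma * q(s2).max(dim=1).values  # bootstrap from q itself
            q_sa = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
            td = ((q_sa - target) ** 2).mean()
            func_reg = ((q(s) - q_lag(s).detach()) ** 2).mean()  # convex in fn space
            return td + lam * func_reg

        s = torch.randn(8, 4); a = torch.randint(0, 2, (8,))
        r = torch.randn(8); s2 = torch.randn(8, 4)
        opt.zero_grad(); loss_fn(s, a, r, s2).backward(); opt.step()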
    Deep Linear Networks for Matrix Completion -- An Infinite Depth Limit. (arXiv:2210.12497v1 [math.DS])
    The deep linear network (DLN) is a model for implicit regularization in gradient based optimization of overparametrized learning architectures. Training the DLN corresponds to a Riemannian gradient flow, where the Riemannian metric is defined by the architecture of the network and the loss function is defined by the learning task. We extend this geometric framework, obtaining explicit expressions for the volume form, including the case when the network has infinite depth. We investigate the link between the Riemannian geometry and the training asymptotics for matrix completion with rigorous analysis and numerics. We propose that implicit regularization is a result of bias towards high state space volume.
    SpectraNet: Multivariate Forecasting and Imputation under Distribution Shifts and Missing Data. (arXiv:2210.12515v1 [cs.LG])
    In this work, we tackle two widespread challenges in real applications of time-series forecasting that have been largely understudied: distribution shifts and missing data. We propose SpectraNet, a novel multivariate time-series forecasting model that dynamically infers a latent space spectral decomposition to capture current temporal dynamics and correlations in the recently observed history. A Convolutional Neural Network maps the learned representation by sequentially mixing its components and refining the output. Our proposed approach can simultaneously produce forecasts and interpolate past observations and can, therefore, greatly simplify production systems by unifying imputation and forecasting tasks into a single model. SpectraNet achieves SoTA performance simultaneously on both tasks on five benchmark datasets, compared to forecasting and imputation models, with up to 92% fewer parameters and comparable training times. In settings with up to 80% missing data, SpectraNet achieves average performance improvements of almost 50% over the second-best alternative. Our code is available at https://github.com/cchallu/spectranet.
    Spectrum-BERT: Pre-training of Deep Bidirectional Transformers for Spectral Classification of Chinese Liquors. (arXiv:2210.12440v1 [cs.LG])
    Spectral detection technology, as a non-invasive method for rapid detection of substances, combined with deep learning algorithms, has been widely used in food detection. However, in real scenarios, acquiring and labeling spectral data is an extremely labor-intensive task, which makes it impossible to provide enough high-quality data for training efficient supervised deep learning models. To better leverage limited samples, we apply the pre-training & fine-tuning paradigm to the field of spectral detection for the first time and propose a pre-training method of deep bidirectional transformers for spectral classification of Chinese liquors, abbreviated as Spectrum-BERT. Specifically, first, to retain the model's sensitivity to the characteristic peak positions and local information of the spectral curve, we innovatively partition the curve into multiple blocks and obtain the embeddings of different blocks as the feature input for the next calculation. Second, in the pre-training stage, we elaborately design two pre-training tasks, Next Curve Prediction (NCP) and Masked Curve Model (MCM), so that the model can effectively utilize unlabeled samples to capture the potential knowledge of spectral data, overcoming the restriction of insufficient labeled samples and improving the applicability and performance of the model in practical scenarios. Finally, we conduct extensive experiments on a real liquor spectral dataset. In the comparative experiments, the proposed Spectrum-BERT significantly outperforms the baselines on multiple metrics, and this advantage is more pronounced on the imbalanced dataset. Moreover, in the parameter sensitivity experiments, we analyze the model performance under different parameter settings to provide a reference for subsequent research.
    Deep Q-Learning for Nash Equilibria: Nash-DQN. (arXiv:1904.10554v2 [cs.LG] UPDATED)
    Model-free learning for multi-agent stochastic games is an active area of research. Existing reinforcement learning algorithms, however, are often restricted to zero-sum games, and are applicable only in small state-action spaces or other simplified settings. Here, we develop a new data efficient Deep-Q-learning methodology for model-free learning of Nash equilibria for general-sum stochastic games. The algorithm uses a local linear-quadratic expansion of the stochastic game, which leads to analytically solvable optimal actions. The expansion is parametrized by deep neural networks to give it sufficient flexibility to learn the environment without the need to experience all state-action pairs. We study symmetry properties of the algorithm stemming from label-invariant stochastic games and as a proof of concept, apply our algorithm to learning optimal trading strategies in competitive electronic markets.
    Learning Proximal Operators to Discover Multiple Optima. (arXiv:2201.11945v2 [cs.LG] UPDATED)
    Finding multiple solutions of non-convex optimization problems is a ubiquitous yet challenging task. Most past algorithms either apply single-solution optimization methods from multiple random initial guesses or search in the vicinity of found solutions using ad hoc heuristics. We present an end-to-end method to learn the proximal operator of a family of training problems so that multiple local minima can be quickly obtained from initial guesses by iterating the learned operator, emulating the proximal-point algorithm that has fast convergence. The learned proximal operator can be further generalized to recover multiple optima for unseen problems at test time, enabling applications such as object detection. The key ingredient in our formulation is a proximal regularization term, which elevates the convexity of our training loss: by applying recent theoretical results, we show that for weakly-convex objectives with Lipschitz gradients, training of the proximal operator converges globally with a practical degree of over-parameterization. We further present an exhaustive benchmark for multi-solution optimization to demonstrate the effectiveness of our method.
    Attention-Based Scattering Network for Satellite Imagery. (arXiv:2210.12185v1 [cs.CV])
    Multi-channel satellite imagery, whether from stacked spectral bands or spatiotemporal data, has meaningful representations for various atmospheric properties. Combining these features in an effective manner to create a performant and trustworthy model is of utmost importance to forecasters. Neural networks show promise, yet suffer from unintuitive computations and fusion of high-level features, and may be limited by the quantity of available data. In this work, we leverage the scattering transform to extract high-level features without additional trainable parameters and introduce a separation scheme to bring attention to independent input channels. Experiments show promising results on estimating tropical cyclone intensity and predicting the occurrence of lightning from satellite imagery.
    Explanation Shift: Detecting distribution shifts on tabular data via the explanation space. (arXiv:2210.12369v1 [cs.LG])
    As input data distributions evolve, the predictive performance of machine learning models tends to deteriorate. In the past, predictive performance was considered the key indicator to monitor. However, explanation aspects have come to attention in recent years. In this work, we investigate how model predictive performance and model explanation characteristics are affected under distribution shifts, and how these key indicators are related to each other for tabular data. We find that modeling explanation shifts can be a better indicator for detecting changes in predictive performance than state-of-the-art techniques based on representations of distribution shifts. We provide a mathematical analysis of different types of distribution shifts as well as synthetic experimental examples.
    Joint Coreference Resolution for Zeros and non-Zeros in Arabic. (arXiv:2210.12169v1 [cs.CL])
    Most existing proposals about anaphoric zero pronoun (AZP) resolution regard full mention coreference and AZP resolution as two independent tasks, even though the two tasks are clearly related. The main issues that need tackling to develop a joint model for zero and non-zero mentions are the difference between the two types of arguments (zero pronouns, being null, provide no nominal information) and the lack of annotated datasets of a suitable size in which both types of arguments are annotated for languages other than Chinese and Japanese. In this paper, we introduce two architectures for jointly resolving AZPs and non-AZPs, and evaluate them on Arabic, a language for which, as far as we know, there has been no prior work on joint resolution. Doing this also required creating a new version of the Arabic subset of the standard coreference resolution dataset used for the CoNLL-2012 shared task (Pradhan et al.,2012) in which both zeros and non-zeros are included in a single dataset.
    GraphNeT: Graph neural networks for neutrino telescope event reconstruction. (arXiv:2210.12194v1 [astro-ph.IM])
    GraphNeT is an open-source Python framework aimed at providing high-quality, user-friendly, end-to-end functionality to perform reconstruction tasks at neutrino telescopes using graph neural networks (GNNs). GraphNeT makes it fast and easy to train complex models that can provide event reconstruction with state-of-the-art performance, for arbitrary detector configurations, with inference times that are orders of magnitude faster than traditional reconstruction techniques. GNNs from GraphNeT are flexible enough to be applied to data from all neutrino telescopes, including future projects such as IceCube extensions or P-ONE. This means that GNN-based reconstruction can be used to provide state-of-the-art performance on most reconstruction tasks in neutrino telescopes, at real-time event rates, across experiments and physics analyses, with vast potential impact for neutrino and astro-particle physics.
    ADDMU: Detection of Far-Boundary Adversarial Examples with Data and Model Uncertainty Estimation. (arXiv:2210.12396v1 [cs.CL])
    Adversarial Examples Detection (AED) is a crucial defense technique against adversarial attacks and has drawn increasing attention from the Natural Language Processing (NLP) community. Despite the surge of new AED methods, our studies show that existing methods heavily rely on a shortcut to achieve good performance. In other words, current search-based adversarial attacks in NLP stop once model predictions change, and thus most adversarial examples generated by those attacks are located near model decision boundaries. To surpass this shortcut and fairly evaluate AED methods, we propose to test AED methods with \textbf{F}ar \textbf{B}oundary (\textbf{FB}) adversarial examples. Existing methods show worse than random guess performance under this scenario. To overcome this limitation, we propose a new technique, \textbf{ADDMU}, \textbf{a}dversary \textbf{d}etection with \textbf{d}ata and \textbf{m}odel \textbf{u}ncertainty, which combines two types of uncertainty estimation for both regular and FB adversarial example detection. Our new method outperforms previous methods by 3.6 and 6.0 \emph{AUC} points under each scenario. Finally, our analysis shows that the two types of uncertainty provided by \textbf{ADDMU} can be leveraged to characterize adversarial examples and identify the ones that contribute most to model's robustness in adversarial training.
    Exploring The Landscape of Distributional Robustness for Question Answering Models. (arXiv:2210.12517v1 [cs.CL])
    We conduct a large empirical evaluation to investigate the landscape of distributional robustness in question answering. Our investigation spans over 350 models and 16 question answering datasets, including a diverse set of architectures, model sizes, and adaptation methods (e.g., fine-tuning, adapter tuning, in-context learning, etc.). We find that, in many cases, model variations do not affect robustness and in-distribution performance alone determines out-of-distribution performance. Moreover, our findings indicate that i) zero-shot and in-context learning methods are more robust to distribution shifts than fully fine-tuned models; ii) few-shot prompt fine-tuned models exhibit better robustness than few-shot fine-tuned span prediction models; iii) parameter-efficient and robustness enhancing training methods provide no significant robustness improvements. In addition, we publicly release all evaluations to encourage researchers to further analyze robustness trends for question answering models.
    Auto-Encoder Neural Network Incorporating X-Ray Fluorescence Fundamental Parameters with Machine Learning. (arXiv:2210.12239v1 [cs.LG])
    We consider energy-dispersive X-ray Fluorescence (EDXRF) applications where the fundamental parameters method is impractical such as when instrument parameters are unavailable. For example, on a mining shovel or conveyor belt, rocks are constantly moving (leading to varying angles of incidence and distances) and there may be other factors not accounted for (like dust). Neural networks do not require instrument and fundamental parameters but training neural networks requires XRF spectra labelled with elemental composition, which is often limited because of its expense. We develop a neural network model that learns from limited labelled data and learns to invert a forward model. The forward model uses transition energies and probabilities of all elements and parameterized distributions to approximate other fundamental and instrument parameters. We evaluate the model and baseline models on a rock dataset from a lithium mine and identify which elements are appropriate for this method. This model demonstrates the potential to calibrate a neural network in a noisy environment where labelled data is limited.
    Speech Emotion Recognition via an Attentive Time-Frequency Neural Network. (arXiv:2210.12430v1 [cs.SD])
    Spectrogram is commonly used as the input feature of deep neural networks to learn the high(er)-level time-frequency pattern of a speech signal for speech emotion recognition (SER). Generally, different emotions correspond to specific energy activations both within frequency bands and time frames on the spectrogram, which indicates that the frequency and time domains are both essential for representing emotion in SER. However, recent spectrogram-based works mainly focus on modeling the long-term dependency in the time domain, leading these methods to encounter the following two issues: (1) neglecting to model the emotion-related correlations within the frequency domain during the time-frequency joint learning; (2) failing to capture the specific frequency bands associated with emotions. To cope with these issues, we propose an attentive time-frequency neural network (ATFNN) for SER, including a time-frequency neural network (TFNN) and time-frequency attention. Specifically, aiming at the first issue, we design a TFNN with a frequency-domain encoder (F-Encoder) based on the Transformer encoder and a time-domain encoder (T-Encoder) based on the Bidirectional Long Short-Term Memory (Bi-LSTM). The F-Encoder and T-Encoder model the correlations within frequency bands and time frames, respectively, and they are embedded into a time-frequency joint learning strategy to obtain the time-frequency patterns for speech emotions. Moreover, to handle the second issue, we also adopt time-frequency attention with a frequency-attention network (F-Attention) and a time-attention network (T-Attention) to focus on the emotion-related frequency band ranges and time frame ranges, which can enhance the discriminability of speech emotion features.
    Leveraging Large Language Models for Multiple Choice Question Answering. (arXiv:2210.12353v1 [cs.CL])
    While large language models (LLMs) like GPT-3 have achieved impressive results on multiple choice question answering (MCQA) tasks in the zero, one, and few-shot settings, they generally lag behind the MCQA state of the art (SOTA). MCQA tasks have traditionally been presented to LLMs like cloze tasks. An LLM is conditioned on a question (without the associated answer options) and its chosen option is the one assigned the highest probability after normalization (for length, etc.). A more natural prompting approach is to present the question and answer options to the LLM jointly and have it output the symbol (e.g., "A") associated with its chosen answer option. This approach allows the model to explicitly compare answer options, reduces computational costs, and mitigates the effects of tokenization scheme and answer option representations on answer selection. For the natural approach to be effective the LLM it is used with must be able to associate answer options with the symbols that represent them. The LLM needs what we term multiple choice symbol binding (MCSB) ability. This ability varies greatly by model. We show that a model with high MCSB ability performs much better with the natural approach than with the traditional approach across 20 diverse datasets and largely closes the gap with the SOTA, suggesting that the MCQA ability of LLMs has been previously underestimated.
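    The two prompting styles contrasted above can be made concrete with toy templates (illustrative strings only, not the paper's exact prompts):

        question = "What is the capital of France?"
        options = ["Berlin", "Paris", "Rome", "Madrid"]

        # Cloze style: score each full answer string under the LM, then normalize.
        cloze_prompts = [f"Q: {question}\nA: {opt}" for opt in options]

        # "Natural"/symbol-binding style: one prompt, the model emits one symbol.
        letters = ["A", "B", "C", "D"]
        natural_prompt = f"Q: {question}\n" + "\n".join(
            f"{l}. {o}" for l, o in zip(letters, options)
        ) + "\nAnswer:"
        print(natural_prompt)

    The natural style needs only a single forward pass and lets the model compare options jointly, but it relies on the symbol-binding ability the abstract describes.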
    Efficient Automatic Machine Learning via Design Graphs. (arXiv:2210.12257v1 [cs.LG])
    Despite the success of automated machine learning (AutoML), which aims to find the best design, including the architecture of deep networks and hyper-parameters, conventional AutoML methods are computationally expensive and hardly provide insights into the relations of different model design choices. To tackle the challenges, we propose FALCON, an efficient sample-based method to search for the optimal model design. Our key insight is to model the design space of possible model designs as a design graph, where the nodes represent design choices, and the edges denote design similarities. FALCON features 1) a task-agnostic module, which performs message passing on the design graph via a Graph Neural Network (GNN), and 2) a task-specific module, which conducts label propagation of the known model performance information on the design graph. Both modules are combined to predict the design performances in the design space, navigating the search direction. We conduct extensive experiments on 27 node and graph classification tasks from various application domains, and an image classification task on the CIFAR-10 dataset. We empirically show that FALCON can efficiently obtain the well-performing designs for each task using only 30 explored nodes. Specifically, FALCON has a comparable time cost with the one-shot approaches while achieving an average improvement of 3.3% compared with the best baselines.
    Conditional Diffusion with Less Explicit Guidance via Model Predictive Control. (arXiv:2210.12192v1 [cs.LG])
    How much explicit guidance is necessary for conditional diffusion? We consider the problem of conditional sampling using an unconditional diffusion model and limited explicit guidance (e.g., a noised classifier, or a conditional diffusion model) that is restricted to a small number of time steps. We explore a model predictive control (MPC)-like approach to approximate guidance by simulating unconditional diffusion forward, and backpropagating explicit guidance feedback. MPC-approximated guides have high cosine similarity to real guides, even over large simulation distances. Adding MPC steps improves generative quality when explicit guidance is limited to five time steps.
    Feature Engineering and Classification Models for Partial Discharge in Power Transformers. (arXiv:2210.12216v1 [cs.LG])
    To ensure reliability, power transformers are monitored for partial discharge (PD) events, which are symptoms of transformer failure. Since failures can have catastrophic cascading consequences, it is critical to preempt them as early as possible. Our goal is to classify PDs as corona, floating, particle, or void, to gain an understanding of the failure location. Using phase resolved PD signal data, we create a small set of features, which can be used to classify PDs with high accuracy. This set of features consists of the total magnitude, the maximum magnitude, and the length of the longest empty band. These features represent the entire signal and not just a single phase, so the feature set has a fixed size and is easily comprehensible. With both Random Forest and SVM classification methods, we attain a 99% classification accuracy, which is significantly higher than classification using phase based feature sets such as phase magnitude. Furthermore, we develop a stacking ensemble to combine several classification models, resulting in a superior model that outperforms existing methods in both accuracy and variance.
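    A sketch of the three features on a toy phase-resolved pattern (the binning and sparsity here are assumptions):

        # Compute the fixed-size feature set: total magnitude, max magnitude, and
        # the longest run of consecutive empty phase bins.
        import numpy as np

        rng = np.random.default_rng(0)
        signal = rng.random(360) * (rng.random(360) > 0.7)  # sparse PD magnitudes

        total_magnitude = signal.sum()
        max_magnitude = signal.max()

        longest_empty = run = 0
        for v in signal:
            run = run + 1 if v == 0 else 0
            longest_empty = max(longest_empty, run)

        features = [total_magnitude, max_magnitude, longest_empty]
        print(features)  # fixed-size input for Random Forest / SVM classifiers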
    Reducing Training Sample Memorization in GANs by Training with Memorization Rejection. (arXiv:2210.12231v1 [cs.LG])
    Generative adversarial network (GAN) continues to be a popular research direction due to its high generation quality. It is observed that many state-of-the-art GANs generate samples that are more similar to the training set than a holdout testing set from the same distribution, hinting that some training samples are implicitly memorized in these models. This memorization behavior is unfavorable in many applications that demand the generated samples to be sufficiently distinct from known samples. Nevertheless, it is unclear whether it is possible to reduce memorization without compromising the generation quality. In this paper, we propose memorization rejection, a training scheme that rejects generated samples that are near-duplicates of training samples during training. Our scheme is simple, generic and can be directly applied to any GAN architecture. Experiments on multiple datasets and GAN models validate that memorization rejection effectively reduces training sample memorization, and in many cases does not sacrifice the generation quality. Code to reproduce the experiment results can be found at $\texttt{https://github.com/jybai/MRGAN}$.
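    The rejection step might look as follows inside a training loop (Euclidean nearness is an assumed criterion; the paper's notion of near-duplicate may differ):

        # Drop generated samples that fall within a threshold of any training sample
        # before they are used in the generator's update batch.
        import numpy as np

        def reject_near_duplicates(generated, train, tau=0.5):
            # Keep only samples farther than tau from every training sample.
            d = np.linalg.norm(generated[:, None, :] - train[None, :, :], axis=-1)
            return generated[d.min(axis=1) > tau]

        rng = np.random.default_rng(0)
        train = rng.standard_normal((100, 2))
        fake = np.concatenate([rng.standard_normal((50, 2)), train[:5] + 0.01])
        kept = reject_near_duplicates(fake, train)
        print(len(fake), "->", len(kept))  # near-duplicates of training data removed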
    TCAB: A Large-Scale Text Classification Attack Benchmark. (arXiv:2210.12233v1 [cs.LG])
    We introduce the Text Classification Attack Benchmark (TCAB), a dataset for analyzing, understanding, detecting, and labeling adversarial attacks against text classifiers. TCAB includes 1.5 million attack instances, generated by twelve adversarial attacks targeting three classifiers trained on six source datasets for sentiment analysis and abuse detection in English. Unlike standard text classification, text attacks must be understood in the context of the target classifier that is being attacked, and thus features of the target classifier are important as well. TCAB includes all attack instances that are successful in flipping the predicted label; a subset of the attacks are also labeled by human annotators to determine how frequently the primary semantics are preserved. The process of generating attacks is automated, so that TCAB can easily be extended to incorporate new text attacks and better classifiers as they are developed. In addition to the primary tasks of detecting and labeling attacks, TCAB can also be used for attack localization, attack target labeling, and attack characterization. TCAB code and dataset are available at https://react-nlp.github.io/tcab/.
    Learning Ultrametric Trees for Optimal Transport Regression. (arXiv:2210.12288v1 [cs.LG])
    Optimal transport provides a metric which quantifies the dissimilarity between probability measures. For measures supported in discrete metric spaces, finding the optimal transport distance has cubic time complexity in the size of the space. However, measures supported on trees admit a closed-form optimal transport that can be computed in linear time. In this paper, we aim to find an optimal tree structure for a given discrete metric space, so that the tree-Wasserstein distance can best approximate the optimal transport distance in the original space. One of our key ideas is to cast the problem in ultrametric spaces. This helps define different tree structures and allows us to optimize the tree structure via projected gradient descent over the space of ultrametric matrices. During optimization, we project the parameters to the ultrametric space via a hierarchical minimum spanning tree algorithm. Experimental results on real datasets show that our approach outperforms previous approaches in approximating optimal transport distances. Finally, experiments on synthetic data generated on ground truth trees show that our algorithm can accurately uncover the underlying tree metrics.
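    The closed-form, linear-time tree-Wasserstein distance that the learned tree is meant to exploit can be sketched on a toy tree (hypothetical edge weights): for each edge, the probability mass of the subtree below it is compared, weighted by the edge length.

        import numpy as np

        # Tree over 4 leaves: parent pointers and edge weights (node 0 is the root).
        parent = {1: 0, 2: 0, 3: 1, 4: 1, 5: 2, 6: 2}        # nodes 3..6 are leaves
        weight = {1: 1.0, 2: 1.0, 3: 0.5, 4: 0.5, 5: 0.5, 6: 0.5}
        leaves = [3, 4, 5, 6]

        def tree_wasserstein(mu, nu):
            mass = {n: 0.0 for n in parent}
            for leaf, m in zip(leaves, mu - nu):
                n = leaf
                while n in parent:           # push signed mass up to the root
                    mass[n] += m
                    n = parent[n]
            return sum(weight[n] * abs(mass[n]) for n in parent)  # O(#edges)

        mu = np.array([0.5, 0.5, 0.0, 0.0])
        nu = np.array([0.0, 0.0, 0.5, 0.5])
        print(tree_wasserstein(mu, nu))  # 2*(1.0*1.0) + 4*(0.5*0.5) = 3.0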
    On the connection between Bregman divergence and value in regularized Markov decision processes. (arXiv:2210.12160v1 [cs.LG])
    In this short note we derive a relationship between the Bregman divergence from the current policy to the optimal policy and the suboptimality of the current value function in a regularized Markov decision process. This result has implications for multi-task reinforcement learning, offline reinforcement learning, and regret analysis under function approximation, among others.
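    For reference, the two standard objects the note relates (definitions only; the note's exact statement is not reproduced here): the Bregman divergence generated by a convex regularizer $\Omega$, and the $\Omega$-regularized value:

        \[
          D_{\Omega}(\pi, \pi') \;=\; \Omega(\pi) - \Omega(\pi')
          - \langle \nabla \Omega(\pi'),\, \pi - \pi' \rangle,
        \]
        \[
          V^{\pi}_{\Omega}(s) \;=\; \mathbb{E}^{\pi}\!\left[\sum_{t \ge 0} \gamma^{t}
          \big( r(s_t, a_t) - \Omega(\pi(\cdot \mid s_t)) \big) \,\Big|\, s_0 = s \right].
        \]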
    EDUKG: a Heterogeneous Sustainable K-12 Educational Knowledge Graph. (arXiv:2210.12228v1 [cs.CL])
    Web and artificial intelligence technologies, especially the semantic web and knowledge graphs (KGs), have recently attracted significant attention in educational scenarios. Nevertheless, subject-specific KGs for K-12 education still lack sufficiency and sustainability from knowledge and data perspectives. To tackle these issues, we propose EDUKG, a heterogeneous sustainable K-12 Educational Knowledge Graph. We first design an interdisciplinary and fine-grained ontology for uniformly modeling knowledge and resources in K-12 education, where we define 635 classes, 445 object properties, and 1314 datatype properties in total. Guided by this ontology, we propose a flexible methodology for interactively extracting factual knowledge from textbooks. Furthermore, we establish a general mechanism based on our proposed generalized entity linking system for EDUKG's sustainable maintenance, which can dynamically index numerous heterogeneous resources and data with knowledge topics in EDUKG. We further evaluate EDUKG to illustrate its sufficiency, richness, and variability. We publish EDUKG with more than 252 million entities and 3.86 billion triplets. Our code and data repository is now available at https://github.com/THU-KEG/EDUKG.
    Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition. (arXiv:2210.12256v1 [cs.LG])
    Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift. Alternatively, a bias-variance decomposition allows direct measurement of the predictive uncertainty across the entire input space. But such a decomposition for proper scores does not exist in the current literature, and for exponential families it is convoluted. In this work, we introduce a general bias-variance decomposition for proper scores and reformulate the exponential family case, giving rise to the Bregman Information as the variance term in both cases. This allows us to prove that the Bregman Information for classification measures the uncertainty in the logit space. We showcase the practical relevance of this decomposition on two downstream tasks. First, we show how to construct confidence intervals for predictions on the instance level based on the Bregman Information. Second, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.
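    For orientation, the standard definition of the Bregman Information that serves as the variance term (Banerjee et al., 2005); taking $\phi(x) = \|x\|^2$ recovers the usual variance. The paper's full decomposition is stated for proper scores and is not reproduced here:

        \[
          I_{\phi}(X) \;=\; \min_{s}\; \mathbb{E}\big[ d_{\phi}(X, s) \big]
          \;=\; \mathbb{E}\big[ d_{\phi}(X, \mathbb{E}[X]) \big],
          \qquad
          d_{\phi}(x, s) = \phi(x) - \phi(s) - \langle \nabla\phi(s),\, x - s \rangle.
        \]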
    Graph Coloring via Neural Networks for Haplotype Assembly and Viral Quasispecies Reconstruction. (arXiv:2210.12158v1 [q-bio.GN])
    Understanding genetic variation, e.g., through mutations, in organisms is crucial to unravel their effects on the environment and human health. A fundamental characterization can be obtained by solving the haplotype assembly problem, which yields the variation across multiple copies of chromosomes. Variations among fast evolving viruses that lead to different strains (called quasispecies) are also deciphered with similar approaches. In both these cases, high-throughput sequencing technologies that provide oversampled mixtures of large noisy fragments (reads) of genomes, are used to infer constituent components (haplotypes or quasispecies). The problem is harder for polyploid species where there are more than two copies of chromosomes. State-of-the-art neural approaches to solve this NP-hard problem do not adequately model relations among the reads that are important for deconvolving the input signal. We address this problem by developing a new method, called NeurHap, that combines graph representation learning with combinatorial optimization. Our experiments demonstrate substantially better performance of NeurHap in real and synthetic datasets compared to competing approaches.
    Implicit Offline Reinforcement Learning via Supervised Learning. (arXiv:2210.12272v1 [stat.ML])
    Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset collected by policies of different expertise levels. It is as simple as supervised learning and Behavior Cloning (BC), but takes advantage of return information. On datasets collected by policies of similar expertise, implicit BC has been shown to match or outperform explicit BC. Despite the benefits of using implicit models to learn robotic skills via BC, offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models can leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show the close relationship between our implicit methods and other popular RL via Supervised Learning algorithms to provide a unified framework. Finally, we demonstrate the effectiveness of our method on high-dimensional manipulation and locomotion tasks.
    Augmentation by Counterfactual Explanation -- Fixing an Overconfident Classifier. (arXiv:2210.12196v1 [cs.LG])
    A highly accurate but overconfident model is ill-suited for deployment in critical applications such as healthcare and autonomous driving. The classification outcome should reflect a high uncertainty on ambiguous in-distribution samples that lie close to the decision boundary. The model should also refrain from making overconfident decisions on samples that lie far outside its training distribution, far-out-of-distribution (far-OOD), or on unseen samples from novel classes that lie near its training distribution (near-OOD). This paper proposes an application of counterfactual explanations to fixing an overconfident classifier. Specifically, we propose to fine-tune a given pre-trained classifier using augmentations from a counterfactual explainer (ACE) to fix its uncertainty characteristics while retaining its predictive performance. We perform extensive experiments with detecting far-OOD, near-OOD, and ambiguous samples. Our empirical results show that the revised model has improved uncertainty measures, and its performance is competitive with state-of-the-art methods.
    Probing with Noise: Unpicking the Warp and Weft of Embeddings. (arXiv:2210.12206v1 [cs.CL])
    Improving our understanding of how information is encoded in vector space can yield valuable interpretability insights. Alongside vector dimensions, we argue that it is possible for the vector norm to also carry linguistic information. We develop a method to test this: an extension of the probing framework which allows for relative intrinsic interpretations of probing results. It relies on introducing noise that ablates information encoded in embeddings, grounded in random baselines and confidence intervals. We apply the method to well-established probing tasks and find evidence that confirms the existence of separate information containers in English GloVe and BERT embeddings. Our correlation analysis aligns with the experimental findings that different encoders use the norm to encode different kinds of information: GloVe stores syntactic and sentence length information in the vector norm, while BERT uses it to encode contextual incongruity.
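    One way to picture the noise-based ablation is as two complementary interventions: destroy the norm while keeping the direction, or destroy the direction while keeping the norm, then re-run the probe and compare against the random baseline. The sketch below is our own illustration of that idea; the function names are hypothetical and the paper's exact noising procedure may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def ablate_norm(emb):
    # Remove norm information: rescale every vector to unit length,
    # leaving only directional (dimension-wise) content.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)

def ablate_dimensions(emb):
    # Remove dimension information: swap each vector's direction for
    # a random one while preserving its original norm.
    noise = rng.standard_normal(emb.shape)
    noise /= np.linalg.norm(noise, axis=1, keepdims=True)
    return noise * np.linalg.norm(emb, axis=1, keepdims=True)

emb = rng.standard_normal((100, 300))   # stand-in for GloVe vectors
print(ablate_norm(emb).shape, ablate_dimensions(emb).shape)
```

    If a probe still beats the random baseline after ablate_dimensions, the surviving signal must live in the norm, and vice versa.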
    Deep Reinforcement Learning for Stabilization of Large-scale Probabilistic Boolean Networks. (arXiv:2210.12229v1 [cs.LG])
    The ability to direct a Probabilistic Boolean Network (PBN) to a desired state is important to applications such as targeted therapeutics in cancer biology. Reinforcement Learning (RL) has been proposed as a framework that solves a discrete-time optimal control problem cast as a Markov Decision Process. We focus on an integrative framework powered by a model-free deep RL method that can address different flavours of the control problem (e.g., with or without control inputs; attractor state or a subset of the state space as the target domain). The method is agnostic to the distribution of probabilities for the next state, hence it does not use the probability transition matrix. The time complexity is linear in the number of time steps, i.e., interactions between the agent (deep RL) and the environment (PBN), during training. Indeed, we explore the scalability of the deep RL approach to (set) stabilization of large-scale PBNs and demonstrate successful control on large networks, including a metastatic melanoma PBN with 200 nodes.
    Group Distributionally Robust Reinforcement Learning with Hierarchical Latent Variables. (arXiv:2210.12262v1 [cs.LG])
    One key challenge for multi-task Reinforcement Learning (RL) in practice is the absence of task indicators. Robust RL has been applied to deal with task ambiguity, but may result in over-conservative policies. To balance the worst-case (robustness) and average performance, we propose the Group Distributionally Robust Markov Decision Process (GDR-MDP), a flexible hierarchical MDP formulation that encodes task groups via a latent mixture model. GDR-MDP identifies the optimal policy that maximizes the expected return under the worst-possible qualified belief over task groups within an ambiguity set. We rigorously show that GDR-MDP's hierarchical structure improves distributional robustness by adding regularization to the worst possible outcomes. We then develop deep RL algorithms for GDR-MDP for both value-based and policy-based RL methods. Extensive experiments on Box2D control tasks, MuJoCo benchmarks, and Google football platforms show that our algorithms outperform classic robust training algorithms across diverse environments in terms of robustness under belief uncertainties. Demos are available on our project page (https://sites.google.com/view/gdr-rl/home).
    Detection of Risk Predictors of COVID-19 Mortality with Classifier Machine Learning Models Operated with Routine Laboratory Biomarkers. (arXiv:2210.12342v1 [cs.LG])
    Early evaluation of patients who require special care and have high death expectancy in COVID-19, and effective determination of relevant biomarkers on large sample groups, are important to reduce mortality. This study aimed to reveal the routine blood value predictors of COVID-19 mortality and to determine the lethal risk levels of these predictors during the disease process. The dataset of the study consists of 38 routine blood values of 2597 patients who died (n = 233) or recovered (n = 2364) from COVID-19 in August-December 2021. In this study, the histogram-based gradient boosting (HGB) model was the most successful machine learning classifier in detecting living and deceased COVID-19 patients (squared F1 metric, F1^2 = 1). The most efficient binary combinations with procalcitonin were obtained with D-dimer, ESR, D.Bil and ferritin. The HGB model operated with these couples correctly detected almost all of the patients who survived and died (precision > 0.98, recall > 0.98, F1^2 > 0.98). Furthermore, in the HGB model operated with a single feature, the most efficient features were procalcitonin (F1^2 = 0.96) and ferritin (F1^2 = 0.91). In addition, according to the two-threshold approach, ferritin values between 376.2 µg/L and 396.0 µg/L (F1^2 = 0.91) and procalcitonin values between 0.2 µg/L and 5.2 µg/L (F1^2 = 0.95) were found to be fatal risk levels for COVID-19. Considering all the results, we suggest that many features combined with procalcitonin and ferritin in particular, operated with the HGB model, can be used to achieve very successful results in classifying those who live and those who die from COVID-19. Moreover, we strongly recommend that clinicians consider the critical levels we have found for procalcitonin and ferritin to reduce the lethality of COVID-19.
    Imbalanced Classification in Medical Imaging. (arXiv:2210.12234v1 [cs.CV])
    We propose performing imbalanced classification by regrouping majority classes into small classes so that we turn the problem into balanced multiclass classification. This new idea is dramatically different from popular loss reweighting and class resampling methods. Our preliminary result on imbalanced medical image classification shows that this natural idea can substantially boost the classification performance as measured by average precision (approximately area-under-the-precision-recall-curve, or AUPRC), which is more appropriate for evaluating imbalanced classification than other metrics such as balanced accuracy.
    Testing Independence of Exchangeable Random Variables. (arXiv:2210.12392v1 [math.ST])
    Given well-shuffled data, can we determine whether the data items are statistically (in)dependent? Formally, we consider the problem of testing whether a set of exchangeable random variables are independent. We show that this is possible and develop tests that can confidently reject the null hypothesis that the data is independent and identically distributed and that have high power for (some) exchangeable distributions. We make no structural assumptions on the underlying sample space. One potential application is in Deep Learning, where data is often scraped from the whole internet with abundant duplications, which can render the data non-iid and make test-set evaluation prone to giving wrong answers.
    The Devil is in the Conflict: Disentangled Information Graph Neural Networks for Fraud Detection. (arXiv:2210.12384v1 [cs.LG])
    Graph-based fraud detection has heretofore received considerable attention. Owing to the great success of Graph Neural Networks (GNNs), many approaches adopting GNNs for fraud detection have been gaining momentum. However, most existing methods are based on the strong inductive bias of homophily, which indicates that context neighbors tend to have the same labels or similar features. In real scenarios, fraudsters often engage in camouflage behaviors in order to avoid detection systems. Therefore, the homophily assumption no longer holds, which is known as the inconsistency problem. In this paper, we argue that the performance degradation is mainly attributable to the inconsistency between topology and attributes. To address this problem, we propose to disentangle the fraud network into two views, corresponding to topology and attributes respectively. Then we propose a simple and effective method that uses an attention mechanism to adaptively fuse the two views while capturing data-specific preferences. In addition, we further improve it by introducing mutual information constraints for topology and attributes. To this end, we propose a Disentangled Information Graph Neural Network (DIGNN) model, which utilizes variational bounds to find an approximate solution to our proposed optimization objective function. Extensive experiments demonstrate that our model can significantly outperform state-of-the-art baselines on real-world fraud detection datasets.
    A New Perspective for Understanding Generalization Gap of Deep Neural Networks Trained with Large Batch Sizes. (arXiv:2210.12184v1 [cs.LG])
    Deep neural networks (DNNs) are typically optimized using some form of mini-batch gradient descent. A major motivation for mini-batch gradient descent is that with a suitably chosen batch size, available computing resources can be optimally utilized (including parallelization) for fast model training. However, many works report a progressive loss of model generalization when the training batch size is increased beyond some limit, a scenario commonly referred to as the generalization gap. Although several works have proposed methods for alleviating the generalization gap problem, a unanimous account of the generalization gap is still lacking in the literature. This is especially important given that recent works have observed that several proposed solutions to the generalization gap problem, such as learning rate scaling and increased training budget, do not in fact resolve it. As such, our main exposition in this paper is to investigate and provide new perspectives on the source of generalization loss for DNNs trained with a large batch size. Our analysis suggests that a large training batch size results in increased near-rank loss of units' activation (i.e., output) tensors, which consequently impacts model optimization and generalization. Extensive validation experiments are performed on popular DNN models such as VGG-16, a residual network (ResNet-56) and LeNet-5 using the CIFAR-10, CIFAR-100, Fashion-MNIST and MNIST datasets.
    Exploring Representation-Level Augmentation for Code Search. (arXiv:2210.12285v1 [cs.SE])
    Code search, which aims at retrieving the most relevant code fragment for a given natural language query, is a common activity in software development practice. Recently, contrastive learning has been widely used in code search research, where many data augmentation approaches for source code (e.g., semantic-preserving program transformation) are proposed to learn better representations. However, these augmentations are at the raw-data level, which requires additional code analysis in the preprocessing stage and additional training costs in the training stage. In this paper, we explore augmentation methods that augment data (both code and query) at the representation level, which requires no additional data processing or training; based on this, we propose a general format of representation-level augmentation that unifies existing methods. Then, we propose three new augmentation methods (linear extrapolation, binary interpolation, and Gaussian scaling) based on the general format. Furthermore, we theoretically analyze the advantages of the proposed augmentation methods over traditional contrastive learning methods on code search. We experimentally evaluate the proposed representation-level augmentation methods with state-of-the-art code search models on a large-scale public dataset consisting of six programming languages. The experimental results show that our approach can consistently boost the performance of the studied code search models. Our source code is available at https://github.com/Alex-HaochenLi/RACS.
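    To make the three proposed operations concrete, here is a minimal sketch of what such representation-level augmentations could look like on a batch of encoder outputs. Everything below is our own illustrative assumption rather than the exact formulas in the paper: the batch-partner choice, the default alpha/sigma values, and in particular the mask-based reading of "binary interpolation".

```python
import torch

def gaussian_scaling(r, sigma=0.1):
    # Rescale each representation element-wise by noise centred at 1.
    return r * (1.0 + sigma * torch.randn_like(r))

def binary_interpolation(r, p=0.5):
    # Swap a random subset of dimensions with a batch partner's values.
    partner = r[torch.randperm(r.size(0))]
    mask = (torch.rand_like(r) < p).float()
    return mask * r + (1.0 - mask) * partner

def linear_extrapolation(r, alpha=0.5):
    # Push each representation away from a random batch partner.
    partner = r[torch.randperm(r.size(0))]
    return r + alpha * (r - partner)

reps = torch.randn(32, 256)   # a batch of code/query representations
views = [f(reps) for f in (gaussian_scaling, binary_interpolation,
                           linear_extrapolation)]
```

    Because the operations act on vectors the model has already produced, they add no parsing or re-encoding cost, which is the efficiency argument the abstract makes.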
    Context-Aware Image Completion. (arXiv:2210.12350v1 [cs.CV])
    Image completion is a task that aims to fill in the missing region of a masked image with plausible contents. However, existing image completion methods tend to fill in the missing region with the surrounding texture instead of hallucinating a visual instance that suits the context of the scene. In this work, we propose a novel image completion model, dubbed Refill, that hallucinates the missing instance so that it harmonizes well with - and thus preserves - the original context. Refill first adopts a transformer architecture that considers the types and locations of the visible instances, and the location of the missing region. Then, Refill completes the missing foreground and background semantic segmentation masks within the missing region, providing pixel-level semantic and structural guidance to generate missing contents with seamless boundaries. Finally, we condition the image synthesis blocks of Refill on the completed segmentation mask to generate photo-realistic contents to fill the missing region. Experimental results show the superiority of Refill over state-of-the-art image completion approaches on various natural images.
    Adaptive Data Fusion for Multi-task Non-smooth Optimization. (arXiv:2210.12334v1 [stat.ML])
    We study the problem of multi-task non-smooth optimization that arises ubiquitously in statistical learning, decision-making and risk management. We develop a data fusion approach that adaptively leverages commonalities among a large number of objectives to improve sample efficiency while tackling their unknown heterogeneities. We provide sharp statistical guarantees for our approach. Numerical experiments on both synthetic and real data demonstrate significant advantages of our approach over benchmarks.  ( 2 min )
    An unsupervised latent/output physics-informed convolutional-LSTM network for solving partial differential equations using peridynamic differential operator. (arXiv:2210.12177v1 [cs.LG])
    This study presents a novel unsupervised convolutional Neural Network (NN) architecture with nonlocal interactions for solving Partial Differential Equations (PDEs). The nonlocal Peridynamic Differential Operator (PDDO) is employed as a convolutional filter for evaluating derivatives of the field variable. The NN captures the time dynamics in a smaller latent space through encoder-decoder layers with a Convolutional Long Short-Term Memory (ConvLSTM) layer between them. The ConvLSTM architecture is modified by employing a novel activation function to improve the predictive capability of the learning architecture for physics with periodic behavior. The physics is invoked in the form of governing equations at the output of the NN and in the latent (reduced) space. By considering a few benchmark PDEs, we demonstrate the training performance and extrapolation capability of this novel NN architecture by comparing against Physics-Informed Neural Network (PINN) type solvers. It is more capable of extrapolating the solution for future timesteps than the other existing architectures.
    AI-based Arabic Language and Speech Tutor. (arXiv:2210.12346v1 [cs.CL])
    In the past decade, we have observed a growing interest in using technologies such as artificial intelligence (AI), machine learning, and chatbots to provide assistance to language learners, especially in second language learning. By using AI, natural language processing (NLP), and chatbots, we can create an intelligent self-learning environment that goes beyond multiple-choice questions and/or fill-in-the-blank exercises. In addition, NLP allows learning to be adaptive in that it offers more than an indication that an error has occurred. It also provides a description of the error, uses linguistic analysis to isolate the source of the error, and then suggests additional drills to achieve optimal individualized learning outcomes. In this paper, we present our approach for developing an Artificial Intelligence-based Arabic Language and Speech Tutor (AI-ALST) for teaching the Moroccan Arabic dialect. The AI-ALST system is an intelligent tutor that provides analysis and assessment of students learning the Moroccan dialect at the University of Arizona (UA). The AI-ALST provides a self-learning environment to practice each lesson for pronunciation training. In this paper, we present our initial experimental evaluation of the AI-ALST, which is based on MFCC (Mel frequency cepstrum coefficient) feature extraction, bidirectional LSTM (Long Short-Term Memory), an attention mechanism, and a cost-based strategy for dealing with class-imbalance learning. We evaluated our tutor on the word pronunciation of lesson 1 of the Moroccan Arabic dialect class. The experimental results show that the AI-ALST can effectively detect pronunciation errors; we evaluate its performance using the F_1-score, accuracy, precision, and recall.
    Transformer-Based Conditioned Variational Autoencoder for Dialogue Generation. (arXiv:2210.12326v1 [cs.CL])
    In human dialogue, a single query may elicit numerous appropriate responses. The Transformer-based dialogue model produces frequently occurring sentences in the corpus since it is a one-to-one mapping function. CVAE is a technique for reducing generic replies. In this paper, we create a new dialogue model (CVAE-T) based on the Transformer with CVAE structure. We use a pre-trained MLM model to rewrite some key n-grams in responses to obtain a series of negative examples, and introduce a regularization term during training to explicitly guide the latent variable in learning the semantic differences between each pair of positive and negative examples. Experiments suggest that the method we design is capable of producing more informative replies.
    NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos. (arXiv:2210.12352v1 [cs.CV])
    We present a method for learning 3D geometry and physics parameters of a dynamic scene from only a monocular RGB video input. To decouple the learning of underlying scene geometry from dynamic motion, we represent the scene as a time-invariant signed distance function (SDF) which serves as a reference frame, along with a time-conditioned deformation field. We further bridge this neural geometry representation with a differentiable physics simulator by designing a two-way conversion between the neural field and its corresponding hexahedral mesh, enabling us to estimate physics parameters from the source video by minimizing a cycle consistency loss. Our method also allows a user to interactively edit 3D objects from the source video by modifying the recovered hexahedral mesh, and propagating the operation back to the neural field representation. Experiments show that our method achieves superior mesh and video reconstruction of dynamic scenes compared to competing Neural Field approaches, and we provide extensive examples which demonstrate its ability to extract useful 3D representations from videos captured with consumer-grade cameras.  ( 2 min )
    The Dark Side of AutoML: Towards Architectural Backdoor Search. (arXiv:2210.12179v1 [cs.CR])
    This paper asks the intriguing question: is it possible to exploit neural architecture search (NAS) as a new attack vector to launch previously improbable attacks? Specifically, we present EVAS, a new attack that leverages NAS to find neural architectures with inherent backdoors and exploits such vulnerability using input-aware triggers. Compared with existing attacks, EVAS demonstrates many interesting properties: (i) it does not require polluting training data or perturbing model parameters; (ii) it is agnostic to downstream fine-tuning or even re-training from scratch; (iii) it naturally evades defenses that rely on inspecting model parameters or training data. With extensive evaluation on benchmark datasets, we show that EVAS features high evasiveness, transferability, and robustness, thereby expanding the adversary's design spectrum. We further characterize the mechanisms underlying EVAS, which are possibly explainable by architecture-level ``shortcuts'' that recognize trigger patterns. This work raises concerns about the current practice of NAS and points to potential directions to develop effective countermeasures.
    torchode: A Parallel ODE Solver for PyTorch. (arXiv:2210.12375v1 [cs.LG])
    We introduce an ODE solver for the PyTorch ecosystem that can solve multiple ODEs in parallel independently from each other while achieving significant performance gains. Our implementation tracks each ODE's progress separately and is carefully optimized for GPUs and compatibility with PyTorch's JIT compiler. Its design lets researchers easily augment any aspect of the solver and collect and analyze internal solver statistics. In our experiments, our implementation is up to 4.3 times faster per step than other ODE solvers and it is robust against within-batch interactions that lead other solvers to take up to 4 times as many steps.
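    torchode's actual interface is documented in its repository; purely to illustrate the parallel-in-batch idea (integrating many independent ODEs in one vectorized loop), here is a generic fixed-step sketch in plain PyTorch. This is not torchode's API, and unlike torchode it has no per-instance step-size control.

```python
import torch

def rk4_batch(f, y0, t0, t1, steps=100):
    # Classic fixed-step RK4, integrating a whole batch of independent
    # ODEs at once; y0 has shape (batch, dim).
    h = (t1 - t0) / steps
    y, t = y0, t0
    for _ in range(steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y = y + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
        t = t + h
    return y

# Eight exponential decays with different rates, solved in parallel.
rates = torch.linspace(0.1, 2.0, 8).unsqueeze(-1)   # shape (8, 1)
y1 = rk4_batch(lambda t, y: -rates * y, torch.ones(8, 1), 0.0, 1.0)
```

    The within-batch interactions mentioned in the abstract arise precisely because naive batched solvers share one step size across the whole batch; tracking each ODE's progress separately removes that coupling.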
    Use of BNNM for interference wave solutions of the gBS-like equation and comparison with PINNs. (arXiv:2210.12154v1 [cs.LG])
    In this work, the generalized broken soliton-like (gBS-like) equation is derived through the generalized bilinear method. A neural network model that can fit the explicit solution with zero error is found. The interference wave solution of the gBS-like equation is obtained by using the bilinear neural network method (BNNM) and physics-informed neural networks (PINNs). Interference waves are shown well via three-dimensional plots and density plots. Compared with PINNs, the bilinear neural network method is not only more accurate but also faster.
    Benchmarking GPU and TPU Performance with Graph Neural Networks. (arXiv:2210.12247v1 [cs.LG])
    Many artificial intelligence (AI) devices have been developed to accelerate the training and inference of neural network models. The most common ones are the Graphics Processing Unit (GPU) and the Tensor Processing Unit (TPU). They are highly optimized for dense data representations. However, sparse representations such as graphs are prevalent in many domains, including science. It is therefore important to characterize the performance of available AI accelerators on sparse data. This work analyzes and compares GPU and TPU performance when training a Graph Neural Network (GNN) developed to solve a real-life pattern recognition problem. Characterizing this new class of models acting on sparse data may prove helpful in optimizing the design of deep learning libraries and future AI accelerators.
    Quantifying Complexity: An Object-Relations Approach to Complex Systems. (arXiv:2210.12347v1 [cs.LG])
    The best way to model, understand, and quantify the information contained in complex systems is an open question in physics, mathematics, and computer science. The uncertain relationship between entropy and complexity further complicates this question. With ideas drawn from the object-relations theory of psychology, this paper develops an object-relations model of complex systems which generalizes to systems of all types, including mathematical operations, machines, biological organisms, and social structures. The resulting Complex Information Entropy (CIE) equation is a robust method to quantify complexity across various contexts. The paper also describes algorithms to iteratively update and improve approximate solutions to the CIE equation, to recursively infer the composition of complex systems, and to discover the connections among objects across different lengthscales and timescales. Applications are discussed in the fields of engineering design, atomic and molecular physics, chemistry, materials science, neuroscience, psychology, sociology, ecology, economics, and medicine.  ( 2 min )
    PENTATRON: PErsonalized coNText-Aware Transformer for Retrieval-based cOnversational uNderstanding. (arXiv:2210.12308v1 [cs.LG])
    Conversational understanding is an integral part of modern intelligent devices. In a large fraction of the global traffic from customers using smart digital assistants, frictions in dialogues may be attributed to incorrect understanding of the entities in a customer's query due to factors including ambiguous mentions, mispronunciation, background noise and faulty on-device signal processing. Such errors are compounded by two common deficiencies from intelligent devices namely, (1) the device not being tailored to individual customers, and (2) the device responses being unaware of the context in the conversation session. Viewing this problem via the lens of retrieval-based search engines, we build and evaluate a scalable entity correction system, PENTATRON. The system leverages a parametric transformer-based language model to learn patterns from in-session customer-device interactions coupled with a non-parametric personalized entity index to compute the correct query, which aids downstream components in reasoning about the best response. In addition to establishing baselines and demonstrating the value of personalized and context-aware systems, we use multitasking to learn the domain of the correct entity. We also investigate the utility of language model prompts. Through extensive experiments, we show a significant upward movement of the key metric (Exact Match) by up to 500.97% (relative to the baseline).
    DIGMN: Dynamic Intent Guided Meta Network for Differentiated User Engagement Forecasting in Online Professional Social Platforms. (arXiv:2210.12402v1 [cs.LG])
    User engagement prediction plays a critical role in designing interaction strategies to grow user engagement and increase revenue in online social platforms. Through an in-depth analysis of real-world data from the world's largest professional social platform, i.e., LinkedIn, we find that users exhibit diverse engagement patterns, and a major reason for the differences in user engagement patterns is that users have different intents. That is, people have different intents when using LinkedIn, e.g., applying for jobs, building connections, or checking notifications, which show quite different engagement patterns. Meanwhile, user intents and the corresponding engagement patterns may change over time. Although such pattern differences and dynamics are essential for user engagement prediction, differentiating user engagement patterns based on dynamic user intents for better user engagement forecasting has not received enough attention in previous works. In this paper, we propose a Dynamic Intent Guided Meta Network (DIGMN), which can explicitly model user intent varying with time and perform differentiated user engagement forecasting. Specifically, we derive some interpretable basic user intents as prior knowledge from data mining and introduce these prior intents to explicitly model dynamic user intent. Furthermore, based on the dynamic user intent representations, we propose a meta predictor to perform differentiated user engagement forecasting. Through a comprehensive evaluation on anonymized LinkedIn user data, our method outperforms state-of-the-art baselines significantly, i.e., 2.96% and 3.48% absolute error reduction, on coarse-grained and fine-grained user engagement prediction tasks, respectively, demonstrating the effectiveness of our method.
    NeuroPrim: An Attention-based Model for Solving NP-hard Spanning Tree Problems. (arXiv:2210.12453v1 [cs.LG])
    Spanning tree problems with special constraints are widely applied in real-life scenarios, such as water supply, transportation and telecommunications, and often require complex algorithm design and exponential time to solve. In recent years, there has been a surge of interest in end-to-end Deep Neural Networks (DNNs) for solving routing problems. However, as the output of such methods is a sequence of vertices, it is difficult to apply them to combinatorial optimization problems where the solution set consists of edge sets, such as various spanning tree problems. In this paper, we propose NeuroPrim, a novel framework combining neural networks and the Prim algorithm, which is trained by REINFORCE with the POMO baseline to learn metrics for selecting edges for different spanning tree problems. We apply it to three difficult problems on Euclidean spaces, namely the Degree-constrained Minimum Spanning Tree Problem (DCMSTP), the Minimum Routing Cost Spanning Tree Problem (MRCSTP) and the Steiner Tree Problem in Graphs (STPG). Experimental results show that our model is able to outperform some of the heuristics and obtains extremely small gaps of less than $0.1\%$ for simple problems such as DCMSTP with degree constraint $3$ and special cases of STPG with up to 100 vertices. In addition, we find no significant degradation on problem instances as large as 1000 vertices, which demonstrates its strong generalization ability.
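    The framing is easiest to see against vanilla Prim: the classic algorithm greedily adds the cheapest frontier edge, and NeuroPrim, on our reading of the abstract, replaces that fixed cheapest-edge rule with a learned edge-selection metric. Below is a sketch with a pluggable scorer; the `score` argument stands in for the learned policy, and all of the scaffolding is our own illustration, not the paper's training setup.

```python
import heapq

def prim(n, weight, score=None):
    # Lazy Prim over a complete graph on n vertices. weight(u, v) is
    # the true edge cost; score(u, v) is the metric used to pick the
    # next edge, defaulting to the weight itself (exact MST).
    score = score or weight
    in_tree = [False] * n
    in_tree[0] = True
    frontier = [(score(0, v), 0, v) for v in range(1, n)]
    heapq.heapify(frontier)
    edges, cost = [], 0.0
    while len(edges) < n - 1:
        _, u, v = heapq.heappop(frontier)
        if in_tree[v]:
            continue
        in_tree[v] = True
        edges.append((u, v))
        cost += weight(u, v)
        for w in range(n):
            if not in_tree[w]:
                heapq.heappush(frontier, (score(v, w), v, w))
    return edges, cost
```

    With score = weight this is an exact MST solver; substituting a neural scorer that also accounts for, say, degree constraints is where the learning enters.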
    Continual Reinforcement Learning with Group Symmetries. (arXiv:2210.12301v1 [cs.LG])
    Continual reinforcement learning (RL) aims to learn a sequence of tasks while retaining the capability to solve seen tasks and growing a new policy to solve novel tasks. Existing continual RL methods ignore that some tasks are equivalent under simple group operations, such as rotations or translations. They thus extend a new policy for each equivalent task and train the policy from scratch, resulting in poor sample complexity and generalization capability. In this work, we propose a novel continual RL framework with group symmetries, which grows a policy for each group of equivalent tasks instead of a single task. We introduce a PPO-based RL algorithm with an invariant feature extractor and a novel task grouping mechanism based on invariant features. We test our algorithm in realistic autonomous driving scenarios, where each group is associated with a map configuration. We show that our algorithm assigns tasks to different groups with high accuracy and outperforms baselines in terms of generalization capability by a large margin.
    MILD: Multimodal Interactive Latent Dynamics for Learning Human-Robot Interaction. (arXiv:2210.12418v1 [cs.RO])
    Modeling interaction dynamics to generate robot trajectories that enable a robot to adapt and react to a human's actions and intentions is critical for efficient and effective collaborative Human-Robot Interactions (HRI). Learning from Demonstration (LfD) methods from Human-Human Interactions (HHI) have shown promising results, especially when coupled with representation learning techniques. However, such methods for learning HRI either do not scale well to high dimensional data or cannot accurately adapt to changing via-poses of the interacting partner. We propose Multimodal Interactive Latent Dynamics (MILD), a method that couples deep representation learning and probabilistic machine learning to address the problem of two-party physical HRIs. We learn the interaction dynamics from demonstrations, using Hidden Semi-Markov Models (HSMMs) to model the joint distribution of the interacting agents in the latent space of a Variational Autoencoder (VAE). Our experimental evaluations for learning HRI from HHI demonstrations show that MILD effectively captures the multimodality in the latent representations of HRI tasks, allowing us to decode the varying dynamics occurring in such tasks. Compared to related work, MILD generates more accurate trajectories for the controlled agent (robot) when conditioned on the observed agent's (human) trajectory. Notably, MILD can learn directly from camera-based pose estimations to generate trajectories, which we then map to a humanoid robot without the need for any additional training.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v1 [cs.LG])
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as the penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rate regimes. We validate our analysis via a thorough empirical study, also showing that, unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.
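    For context, our summary of the batch method of Lange and Keys that this paper makes stochastic: the proximal distance algorithm targets $\min_x f(x)$ subject to $x \in C$ by penalizing the squared distance to $C$, and distance majorization turns each iteration into a single proximal step anchored at a projection:

```latex
\min_x \; f(x) + \frac{\rho}{2}\,\operatorname{dist}(x, C)^2,
\qquad
x_{k+1} = \arg\min_x \Big\{ f(x) + \frac{\rho}{2}\,\lVert x - P_C(x_k)\rVert^2 \Big\}
        = \operatorname{prox}_{\rho^{-1} f}\big(P_C(x_k)\big)
```

    The appearance of $\rho^{-1}$ as the proximal step size is the connection the abstract draws between the penalty parameter and a learning rate.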
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v1 [stat.ML])
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as "uncertain evidence". In many real-world scenarios, such uncertainty stems from measurement errors associated with observable quantities in probabilistic models. We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method, "stochastic evidence", as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise concrete guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which we compare inference results associated with each interpretation.
    Factor Investing with a Deep Multi-Factor Model. (arXiv:2210.12462v1 [q-fin.CP])
    Modeling and characterizing multiple factors is perhaps the most important step in achieving excess returns over market benchmarks. Both academia and industry are striving to find new factors that have good explanatory power for future stock returns and good stability of their predictive power. In practice, factor investing is still largely based on linear multi-factor models, although many deep learning methods show promising results compared to traditional methods in stock trend prediction and portfolio risk management. However, the existing non-linear methods have two drawbacks: 1) there is a lack of interpretation of the newly discovered factors, and 2) the financial insights behind the mining process are unclear, making practitioners reluctant to apply the existing methods to factor investing. To address these two shortcomings, we develop a novel deep multi-factor model that adopts industry neutralization and market neutralization modules with clear financial insights, which help us easily build a dynamic and multi-relational stock graph in a hierarchical structure to learn the graph representation of stock relationships at different levels, e.g., industry level and universal level. Subsequently, graph attention modules are adopted to estimate a series of deep factors that maximize the cumulative factor returns, and a factor-attention module is developed to approximately compose the estimated deep factors from the input factors, as a way to interpret the deep factors explicitly. Extensive experiments on real-world stock market data demonstrate the effectiveness of our deep multi-factor model in the task of factor investing.  ( 3 min )
    Improving Medical Predictions by Irregular Multimodal Electronic Health Records Modeling. (arXiv:2210.12156v1 [cs.LG])
    Health conditions among patients in intensive care units (ICUs) are monitored via electronic health records (EHRs), composed of numerical time series and lengthy clinical note sequences, both taken at irregular time intervals. Dealing with such irregularity in every modality, and integrating irregularity into multimodal representations to improve medical predictions, is a challenging problem. In this paper, we address this problem by (1) modeling irregular time series by incorporating hand-crafted imputation embeddings into learned interpolation embeddings via a gating mechanism; (2) casting a series of clinical note representations as multivariate irregular time series and tackling irregularity via a time attention mechanism; and (3) fusing multimodalities with an interleaved attention mechanism across temporal steps to integrate irregularity into multimodal representations. To the best of our knowledge, this is the first work to thoroughly model irregularity in multimodalities and to take into account temporal knowledge in multimodal fusion, for improving medical predictions. The results on two medical prediction tasks show that our proposed methods outperform the state-of-the-art (SOTA) methods in both every single modality and multimodal fusion scenarios, illustrating the effectiveness of our methods and the value of modeling irregularity in multimodal fusion.  ( 2 min )
    ALT: Breaking the Wall between Graph and Operator Level Optimizations for Deep Learning Compilation. (arXiv:2210.12415v1 [cs.LG])
    Deep learning models rely on highly optimized tensor libraries for efficient inference on heterogeneous hardware. Current deep compilers typically predetermine the layouts of tensors and then optimize the loops of operators. However, such a unidirectional and one-off workflow strictly separates graph-level optimization and operator-level optimization into different system layers, missing opportunities for unified tuning. This paper proposes ALT, a compiler that performs joint graph- and operator-level optimizations for deep models. ALT provides a generic transformation module to manipulate layouts and loops with easy-to-use primitive functions. ALT further integrates an auto-tuning module that jointly optimizes graph-level data layouts and operator-level loops while guaranteeing efficiency. Experimental results show that ALT significantly outperforms state-of-the-art compilers (e.g., Ansor) in terms of both single-operator performance (e.g., 1.5x speedup on average) and end-to-end inference performance (e.g., 1.4x speedup on average).
    Generative Modeling of High-resolution Global Precipitation Forecasts. (arXiv:2210.12504v1 [cs.LG])
    Forecasting global precipitation patterns and, in particular, extreme precipitation events is of critical importance to preparing for and adapting to climate change. Making accurate high-resolution precipitation forecasts using traditional physical models remains a major challenge in operational weather forecasting, as they incur substantial computational costs and struggle to achieve sufficient forecast skill. Recently, deep-learning-based models have shown great promise in closing the gap with numerical weather prediction (NWP) models in terms of precipitation forecast skill, opening up exciting new avenues for precipitation modeling. However, it is challenging for these deep learning models to fully resolve the fine-scale structures of precipitation phenomena and adequately characterize the extremes of the long-tailed precipitation distribution. In this work, we present several improvements to the architecture and training process of a current state-of-the-art deep learning precipitation model (FourCastNet), using a novel generative adversarial network (GAN) to better capture fine scales and extremes. Our improvements achieve superior performance in capturing the extreme percentiles of global precipitation, while remaining comparable to state-of-the-art NWP models in terms of forecast skill at 1--2 day lead times. Together, these improvements set a new state-of-the-art in global precipitation forecasting.
    Compressing multidimensional weather and climate data into neural networks. (arXiv:2210.12538v1 [cs.LG])
    Weather and climate simulations produce petabytes of high-resolution data that are later analyzed by researchers in order to understand climate change or severe weather. We propose a new method of compressing this multidimensional weather and climate data: a coordinate-based neural network is trained to overfit the data, and the resulting parameters are taken as a compact representation of the original grid-based data. While compression ratios range from 300x to more than 3,000x, our method outperforms the state-of-the-art compressor SZ3 in terms of weighted RMSE and MAE. It can faithfully preserve important large-scale atmosphere structures and does not introduce artifacts. When using the resulting neural network as a 790x compressed dataloader to train the WeatherBench forecasting model, its RMSE increases by less than 2%. The three orders of magnitude compression democratizes access to high-resolution climate data and enables numerous new research directions.
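    The core recipe is compact enough to sketch end to end. The toy version below uses our own assumptions throughout: a random stand-in field, a small ReLU MLP rather than whatever activations and sizes the paper tunes, and deliberate overfitting, since reconstruction fidelity of one fixed dataset is the goal.

```python
import torch
import torch.nn as nn

# Toy stand-in for one gridded variable over (time, lat, lon).
T, H, W = 8, 32, 64
field = torch.rand(T, H, W)

# One (t, lat, lon) coordinate row per grid point, scaled to [-1, 1].
axes = [torch.linspace(-1, 1, s) for s in (T, H, W)]
coords = torch.stack(
    torch.meshgrid(*axes, indexing="ij"), dim=-1).reshape(-1, 3)
values = field.reshape(-1, 1)

# The network's weights ARE the compressed file: ~4.5k parameters here
# versus 16384 grid values; real ratios come from far larger grids.
net = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                    nn.Linear(64, 64), nn.ReLU(),
                    nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(2000):                 # overfitting is the point here
    opt.zero_grad()
    loss = ((net(coords) - values) ** 2).mean()
    loss.backward()
    opt.step()

recon = net(coords).reshape(T, H, W)  # decompress via a forward pass
```

    Decompression is just evaluation at the query coordinates, which is also what makes the 790x compressed dataloader usage possible.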
    SurCo: Learning Linear Surrogates For Combinatorial Nonlinear Optimization Problems. (arXiv:2210.12547v1 [cs.LG])
    Optimization problems with expensive nonlinear cost functions and combinatorial constraints appear in many real-world applications, but remain challenging to solve efficiently. Existing combinatorial solvers like Mixed Integer Linear Programming can be fast in practice but cannot readily optimize nonlinear cost functions, while general nonlinear optimizers like gradient descent often do not handle complex combinatorial structures, may require many queries of the cost function, and are prone to local optima. To bridge this gap, we propose SurCo, which learns linear Surrogate costs that can be used by existing Combinatorial solvers to output good solutions to the original nonlinear combinatorial optimization problem, combining the flexibility of gradient-based methods with the structure of linear combinatorial optimization. We learn these linear surrogates end-to-end with the nonlinear loss by differentiating through the linear surrogate solver. Three variants of SurCo are proposed: SurCo-zero operates on individual nonlinear problems, SurCo-prior trains a linear surrogate predictor on distributions of problems, and SurCo-hybrid uses a model trained offline to warm start online solving for SurCo-zero. We analyze our method theoretically and empirically, showing smooth convergence and improved performance. Experiments show that compared to state-of-the-art approaches and expert-designed heuristics, SurCo obtains lower cost solutions with comparable or faster solve time for two real-world industry-level applications: embedding table sharding and inverse photonic design.
    Faster and more diverse de novo molecular optimization with double-loop reinforcement learning using augmented SMILES. (arXiv:2210.12458v1 [physics.chem-ph])
    Molecular generation via deep learning models in combination with reinforcement learning is a powerful way of generating proposed molecules with desirable properties. By defining a multi-objective scoring function, it is possible to generate thousands of ideas for molecules that score well, which makes the approach interesting for drug discovery or materials science purposes. However, if the scoring function is expensive with regard to resources, such as time or computation, the high number of function evaluations needed for feedback in the reinforcement learning loop becomes a bottleneck. Here we propose to use double-loop reinforcement learning with simplified molecular line entry system (SMILES) augmentation to use scoring calculations more efficiently and arrive at well-scoring molecules faster. By adding an inner loop where the generated SMILES strings are augmented to alternative non-canonical SMILES and used for additional rounds of reinforcement learning, we can effectively reuse the scoring calculations that are done at the molecular level. This approach speeds up the learning process with respect to scoring function calls, and also protects moderately against mode collapse. We find that between 5 and 10 augmentation repeats seem safe for most scoring functions, and that they additionally increase the diversity of the generated compounds as well as make the sampling runs of chemical space more reproducible.
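    The augmentation step itself is standard cheminformatics: the same molecule admits many valid SMILES spellings, obtainable by renumbering atoms before writing the string. A sketch using RDKit's atom renumbering follows; the paper's exact augmentation procedure may differ, and the aspirin example is our own.

```python
import random
from rdkit import Chem

def randomize_smiles(smiles, n=5, seed=0):
    # Produce up to n non-canonical SMILES spellings of one molecule
    # by shuffling the atom order before writing the string out.
    random.seed(seed)
    mol = Chem.MolFromSmiles(smiles)
    variants = set()
    for _ in range(n):
        order = list(range(mol.GetNumAtoms()))
        random.shuffle(order)
        shuffled = Chem.RenumberAtoms(mol, order)
        variants.add(Chem.MolToSmiles(shuffled, canonical=False))
    return variants

print(randomize_smiles("CC(=O)Oc1ccccc1C(=O)O"))   # aspirin
```

    Every variant decodes to the same molecule, so a single expensive score can supervise several inner-loop policy updates.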
    DL-Corrector-Remapper: A grid-free bias-correction deep learning methodology for data-driven high-resolution global weather forecasting. (arXiv:2210.12293v1 [physics.ao-ph])
    Data-driven models, such as FourCastNet (FCN), have shown exemplary performance in high-resolution global weather forecasting. This performance, however, is based on supervision on mesh-gridded weather data without the utilization of raw climate observational data, the gold standard ground truth. In this work we develop a methodology to correct, remap, and fine-tune gridded uniform forecasts of FCN so it can be directly compared against observational ground truth, which is sparse and non-uniform in space and time. This is akin to bias correction and post-processing of numerical weather prediction (NWP), a routine operation at meteorological and weather forecasting centers across the globe. The Adaptive Fourier Neural Operator (AFNO) architecture is used as the backbone to learn continuous representations of the atmosphere. The spatially and temporally non-uniform output is evaluated by the non-uniform discrete inverse Fourier transform (NUIDFT) given the output query locations. We call this network the Deep-Learning-Corrector-Remapper (DLCR). The improvement in DLCR's performance against the gold standard ground truth over the baseline's performance shows its potential to correct, remap, and fine-tune the mesh-gridded forecasts under the supervision of observations.  ( 2 min )
    Salience Allocation as Guidance for Abstractive Summarization. (arXiv:2210.12330v1 [cs.CL])
    Abstractive summarization models typically learn to capture the salient information from scratch implicitly. Recent literature adds extractive summaries as guidance for abstractive summarization models to provide hints of salient content and achieves better performance. However, extractive summaries as guidance could be overly strict, leading to information loss or noisy signals. Furthermore, such guidance cannot easily adapt to documents with varying abstractiveness. As the number and allocation of salient content pieces vary, it is hard to find a fixed threshold deciding which content should be included in the guidance. In this paper, we propose a novel summarization approach with a flexible and reliable salience guidance, namely SEASON (SaliencE Allocation as Guidance for Abstractive SummarizatiON). SEASON utilizes the allocation of salience expectation to guide abstractive summarization and adapts well to articles with different degrees of abstractiveness. Automatic and human evaluations on two benchmark datasets show that the proposed method is effective and reliable. Empirical results on more than one million news articles demonstrate a natural fifteen-fifty salience split for news article sentences, providing a useful insight for composing news articles.  ( 2 min )
    Bayesian Convolutional Deep Sets with Task-Dependent Stationary Prior. (arXiv:2210.12363v1 [stat.ML])
    Convolutional deep sets are a deep neural network (DNN) architecture that can model stationary stochastic processes. This architecture uses a kernel smoother and the DNN to construct translation-equivariant functional representations, and thus reflects the inductive bias of stationarity in the DNN. However, since this architecture employs the kernel smoother, a non-parametric model, it may produce ambiguous representations when the number of data points is insufficient. To remedy this issue, we introduce Bayesian convolutional deep sets, which construct random translation-equivariant functional representations with a stationary prior. Furthermore, we present how to impose a task-dependent prior for each dataset, because a wrongly imposed prior can yield an even worse representation than that of the kernel smoother. We validate the proposed architecture and its training on various experiments with time-series and image datasets.  ( 2 min )
    Self-supervised Graph-based Point-of-interest Recommendation. (arXiv:2210.12506v1 [cs.LG])
    The exponential growth of Location-Based Social Networks (LBSNs) has greatly stimulated the demand for precise location-based recommendation services. Next Point-of-Interest (POI) recommendation, which aims to provide personalised POI suggestions for users based on their visiting histories, has become a prominent component in location-based e-commerce. Recent POI recommenders mainly employ self-attention mechanisms or graph neural networks to model complex high-order POI-wise interactions. However, most of them are merely trained on the historical check-in data in a standard supervised learning manner, which fails to fully explore each user's multi-faceted preferences and suffers from data scarcity and long-tailed POI distributions, resulting in sub-optimal performance. To this end, we propose a Self-supervised Graph-enhanced POI Recommender (S2GRec) for next POI recommendation. In particular, we devise a novel Graph-enhanced Self-attentive layer to incorporate the collaborative signals from both a global transition graph and local trajectory graphs to uncover the transitional dependencies among POIs and capture a user's temporal interests. In order to counteract the scarcity and incompleteness of POI check-ins, we propose a novel self-supervised learning paradigm in S2GRec, where the trajectory representations are contrastively learned from two augmented views on geolocations and temporal transitions. Extensive experiments are conducted on three real-world LBSN datasets, demonstrating the effectiveness of our model against state-of-the-art methods.  ( 2 min )
    Draft, Sketch, and Prove: Guiding Formal Theorem Provers with Informal Proofs. (arXiv:2210.12283v1 [cs.AI])
    The formalization of existing mathematical proofs is a notoriously difficult process. Despite decades of research on automation and proof assistants, writing formal proofs remains arduous and only accessible to a few experts. While previous studies to automate formalization focused on powerful search algorithms, no attempts were made to take advantage of available informal proofs. In this work, we introduce Draft, Sketch, and Prove (DSP), a method that maps informal proofs to formal proof sketches, and uses the sketches to guide an automated prover by directing its search to easier sub-problems. We investigate two relevant setups where informal proofs are either written by humans or generated by a language model. Our experiments and ablation studies show that large language models are able to produce well-structured formal sketches that follow the same reasoning steps as the informal proofs. Guiding an automated prover with these sketches enhances its performance from 20.9% to 39.3% on a collection of mathematical competition problems.  ( 2 min )
    Deep Multi-Branch CNN Architecture for Early Alzheimer's Detection from Brain MRIs. (arXiv:2210.12331v1 [eess.IV])
    Alzheimer's disease (AD) is a neurodegenerative disease that can cause dementia and result in a severe reduction in brain function, inhibiting simple tasks, especially if no preventative care is taken. Over 1 in 9 Americans suffer from AD-induced dementia, and unpaid care for people with AD-related dementia is valued at $271.6 billion. In this paper, we first review other approaches that could be used for early detection of AD. We then give an overview of our dataset, which comes from the Alzheimer's Disease Neuroimaging Initiative (ADNI), and propose a deep Convolutional Neural Network (CNN) architecture consisting of 7,866,819 parameters. This model has three convolutional branches of different lengths, each with different kernel sizes, and can predict whether a patient is non-demented, mildly demented, or moderately demented with a 99.05% three-class accuracy.
    Algorithms with Prediction Portfolios. (arXiv:2210.12438v1 [cs.LG])
    The research area of algorithms with predictions has seen recent success showing how to incorporate machine learning into algorithm design to improve performance when the predictions are correct, while retaining worst-case guarantees when they are not. Most previous work has assumed that the algorithm has access to a single predictor. However, in practice, there are many machine learning methods available, often with incomparable generalization guarantees, making it hard to pick a best method a priori. In this work we consider scenarios where multiple predictors are available to the algorithm and the question is how to best utilize them. Ideally, we would like the algorithm's performance to depend on the quality of the best predictor. However, utilizing more predictions comes with a cost, since we now have to identify which prediction is the best. We study the use of multiple predictors for a number of fundamental problems, including matching, load balancing, and non-clairvoyant scheduling, which have been well-studied in the single predictor setting. For each of these problems we introduce new algorithms that take advantage of multiple predictors, and prove bounds on the resulting performance.
    Learning Classifiers for Imbalanced and Overlapping Data. (arXiv:2210.12446v1 [cs.LG])
    This study is about inducing classifiers using data that is imbalanced, with a minority class being under-represented in relation to the majority classes. The first section of this research focuses on the main characteristics of data that generate this problem. Following a study of previous, relevant research, a variety of artificial imbalanced data sets influenced by these important elements were created. These data sets were used to create decision trees and rule-based classifiers. The second section of this research looks into how to improve classifiers by pre-processing data with resampling approaches. The subsequent trials compare the performance of distinct pre-processing re-sampling methods: two variants of random over-sampling and the focused under-sampling method NCR. This paper further addresses class imbalance with a new method called Sparsity, in which the data is spread out from its class centers, making it more homogeneous.
    Baby Physical Safety Monitoring in Smart Home Using Action Recognition System. (arXiv:2210.12527v1 [cs.CV])
    Humans can intuitively infer the actions that took place between two observed states, because the brain operates on a bidirectional communication model, which has radically improved the accuracy of recognition and prediction based on features connected to previous experiences. During the past decade, deep learning models for action recognition have significantly improved. However, deep neural networks struggle when trained on smaller datasets for specific Action Recognition (AR) tasks. As with most action recognition tasks, the ambiguity of accurately describing activities in spatial-temporal data is a drawback that can be overcome by curating suitable datasets, including careful annotations and preprocessing of video data for analyzing various recognition tasks. In this study, we present a novel lightweight framework combining transfer learning techniques with a Conv2D LSTM layer to extract features from the pre-trained I3D model on the Kinetics dataset for a new AR task (Smart Baby Care) that requires a smaller dataset and less computational resources. Furthermore, we developed a benchmark dataset and an automated model that uses LSTM convolution with I3D (ConvLSTM-I3D) for recognizing and predicting baby activities in a smart baby room. Finally, we implemented video augmentation to improve model performance on the smart baby care task. Compared to other benchmark models, our experimental framework achieved better performance with less computational resources.  ( 2 min )
    Score-based Denoising Diffusion with Non-Isotropic Gaussian Noise Models. (arXiv:2210.12254v1 [cs.LG])
    Generative models based on denoising diffusion techniques have led to an unprecedented increase in the quality and diversity of imagery that is now possible to create with neural generative models. However, most contemporary state-of-the-art methods are derived from a standard isotropic Gaussian formulation. In this work we examine the situation where non-isotropic Gaussian distributions are used. We present the key mathematical derivations for creating denoising diffusion models using an underlying non-isotropic Gaussian noise model. We also provide initial experiments to help verify empirically that this more general modelling approach can also yield high-quality samples.
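    For intuition, the forward corruption only changes in that the injected noise is drawn from $\mathcal{N}(0, \Sigma)$ rather than $\mathcal{N}(0, I)$. A minimal sketch of such a non-isotropic forward step, with an arbitrary made-up covariance:

        import numpy as np

        rng = np.random.default_rng(0)
        d, T = 4, 1000
        betas = np.linspace(1e-4, 2e-2, T)
        alpha_bar = np.cumprod(1.0 - betas)
        # Hypothetical non-isotropic noise covariance Sigma = L @ L.T
        A = rng.standard_normal((d, d))
        Sigma = A @ A.T + d * np.eye(d)
        L = np.linalg.cholesky(Sigma / np.trace(Sigma) * d)   # normalize overall scale

        def q_sample(x0, t):
            # x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * L z,  z ~ N(0, I)
            z = rng.standard_normal(d)
            return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * (L @ z)

        print(q_sample(np.ones(d), 500))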
    Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. (arXiv:2210.12316v1 [cs.IR])
    Recently, the generality of natural language text has been leveraged to develop transferable recommender systems. The basic idea is to employ a pre-trained language model (PLM) to encode item text into item representations. Despite the promising transferability, the binding between item text and item representations might be too tight, leading to potential problems such as over-emphasizing text similarity and exaggerating domain gaps. To address this issue, this paper proposes VQ-Rec, a novel approach to learning Vector-Quantized item representations for transferable sequential Recommenders. The major novelty of our approach lies in the new item representation scheme: it first maps item text into a vector of discrete indices (called item code), and then employs these indices to look up the code embedding table for deriving item representations. Such a scheme can be denoted as "text -> code -> representation". Based on this representation scheme, we further propose an enhanced contrastive pre-training approach, using semi-synthetic and mixed-domain code representations as hard negatives. Furthermore, we design a new cross-domain fine-tuning method based on a differentiable permutation-based network. Extensive experiments conducted on six public benchmarks demonstrate the effectiveness of the proposed approach, in both cross-domain and cross-platform settings.
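    A minimal sketch of the "text -> code -> representation" lookup, here with a product-quantization-style assignment; the codebooks, embedding sizes, and nearest-centroid rule are illustrative assumptions, not the paper's trained components:

        import numpy as np

        rng = np.random.default_rng(0)
        D, M, K, H = 64, 8, 256, 32        # text dim, sub-spaces, codes, item dim
        codebooks = rng.standard_normal((M, K, D // M))   # per-sub-space centroids
        code_emb = rng.standard_normal((M, K, H))         # learned code embeddings

        def item_representation(text_vec):
            # "text -> code": nearest centroid index in each sub-space
            subs = text_vec.reshape(M, D // M)
            idx = [np.argmin(((codebooks[m] - subs[m]) ** 2).sum(1)) for m in range(M)]
            # "code -> representation": sum the looked-up code embeddings
            return sum(code_emb[m, i] for m, i in enumerate(idx))

        print(item_representation(rng.standard_normal(D)).shape)   # (32,)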
    Task-Based Assessment for Neural Networks: Evaluating Undersampled MRI Reconstructions based on Human Observer Signal Detection. (arXiv:2210.12161v1 [eess.IV])
    Recent research has explored using neural networks to reconstruct undersampled magnetic resonance imaging (MRI) data. Because of the complexity of the artifacts in the reconstructed images, there is a need to develop task-based measures of image quality. Common metrics for evaluating image quality, like the normalized root mean squared error (NRMSE) and structural similarity (SSIM), are global metrics which average out the impact of subtle features in the images. Using measures of image quality that incorporate a subtle signal for a specific task allows for image quality assessment that locally evaluates the effect of undersampling on a signal. We used a U-Net to reconstruct under-sampled images with 2x, 3x, 4x and 5x fold 1-D undersampling rates. Cross validation was performed for a 500 and a 4000 image training set with both structural similarity (SSIM) and mean squared error (MSE) losses. A two alternative forced choice (2-AFC) observer study was carried out for detecting a subtle signal (small blurred disk) in images from the 4000 image training set. We found that for both loss functions and training set sizes, the human observer performance on the 2-AFC studies led to a choice of a 2x undersampling, but the SSIM and NRMSE led to a choice of a 3x undersampling. For this task, SSIM and NRMSE led to an overestimate of the achievable undersampling using a U-Net before a steep loss of image quality, when compared to the performance of human observers in the detection of a subtle lesion.
    Just Mix Once: Worst-group Generalization by Group Interpolation. (arXiv:2210.12195v1 [cs.LG])
    Advances in deep learning theory have revealed how average generalization relies on superficial patterns in data. The consequences are brittle models with poor performance with shift in group distribution at test time. When group annotation is available, we can use robust optimization tools to tackle the problem. However, identification and annotation are time-consuming, especially on large datasets. A recent line of work leverages self-supervision and oversampling to improve generalization on minority groups without group annotation. We propose to unify and generalize these approaches using a class-conditional variant of mixup tailored for worst-group generalization. Our approach, Just Mix Once (JM1), interpolates samples during learning, augmenting the training distribution with a continuous mixture of groups. JM1 is domain agnostic and computationally efficient, can be used with any level of group annotation, and performs on par or better than the state-of-the-art on worst-group generalization. Additionally, we provide a simple explanation of why JM1 works.
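    A minimal sketch of the core class-conditional mixup step (mixing only pairs that share a label), with random stand-in data; the paper's JM1 adds further machinery around this idea:

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.standard_normal((100, 16))
        y = rng.integers(0, 3, 100)

        def class_conditional_mixup(X, y, alpha=1.0):
            # Mix each sample with a random sample *of the same class*,
            # interpolating across (unknown) groups within that class.
            lam = rng.beta(alpha, alpha, size=len(X))[:, None]
            partners = np.array([rng.choice(np.where(y == c)[0]) for c in y])
            return lam * X + (1 - lam) * X[partners], y   # labels are unchanged

        X_mix, y_mix = class_conditional_mixup(X, y)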
    Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis. (arXiv:2210.12235v1 [stat.ML])
    One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.
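    As a toy analogue of the warm-start idea (not the paper's SeGD), the sketch below tracks the cost of fitting a constant mean to a growing segment by taking a single cheap gradient step from the previous minimizer instead of re-optimizing from scratch:

        import numpy as np

        rng = np.random.default_rng(0)
        x = np.concatenate([rng.normal(0, 1, 50), rng.normal(3, 1, 50)])

        def sequential_costs(x, steps=1, lr=0.05):
            theta, costs = x[0], []
            for t in range(1, len(x) + 1):
                for _ in range(steps):       # a few warm-started updates per point
                    theta -= lr * 2 * (t * theta - x[:t].sum()) / t
                costs.append(((x[:t] - theta) ** 2).sum())
            return np.array(costs)

        print(sequential_costs(x)[-1])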
    Volatility forecasting using Deep Learning and sentiment analysis. (arXiv:2210.12464v1 [cs.LG])
    Several studies have shown that deep learning models can provide more accurate volatility forecasts than the traditional methods used within this domain. This paper presents a composite model that merges a deep learning approach with sentiment analysis for predicting market volatility. To classify public sentiment, we use a Convolutional Neural Network applied to Reddit global news headlines. We then describe a composite forecasting model, a Long-Short-Term-Memory Neural Network method, which uses historical sentiment and the previous day's volatility to make forecasts. We employed this method on the past volatility of the S\&P500 and the major BRICS indices to corroborate its effectiveness. Our results demonstrate that including sentiment can improve deep learning volatility forecasting models. However, in contrast to return forecasting, the performance benefit of including sentiment in volatility forecasting appears to be market-specific.  ( 2 min )
    Learning Correlated Stackelberg Equilibrium in General-Sum Multi-Leader-Single-Follower Games. (arXiv:2210.12470v1 [cs.LG])
    Many real-world strategic games involve interactions between multiple players. We study a hierarchical multi-player game structure, where players with asymmetric roles can be separated into leaders and followers, a setting often referred to as Stackelberg game or leader-follower game. In particular, we focus on a Stackelberg game scenario where there are multiple leaders and a single follower, called the Multi-Leader-Single-Follower (MLSF) game. We propose a novel asymmetric equilibrium concept for the MLSF game called Correlated Stackelberg Equilibrium (CSE). We design online learning algorithms that enable the players to interact in a distributed manner, and prove that it can achieve no-external Stackelberg-regret learning. This further translates to the convergence to approximate CSE via a reduction from no-external regret to no-swap regret. At the core of our work, we solve the intricate problem of how to learn equilibrium in leader-follower games with noisy bandit feedback by balancing exploration and exploitation in different learning structures.  ( 2 min )
    Generalized Likelihood Ratio Test With One-Class Classifiers. (arXiv:2210.12494v1 [cs.LG])
    One-class classification (OCC) is the problem of deciding whether an observed sample belongs to a target class or not. We consider the problem of learning an OCC model when the dataset available at the learning stage contains only samples from the target class. We aim at obtaining a classifier that performs as the generalized likelihood ratio test (GLRT), which is a well-known and provably optimal (under specific assumptions) classifier when the statistic of the target class is available. To this end, we consider both the multilayer perceptron neural network (NN) and the support vector machine (SVM) models. They are trained as two-class classifiers using an artificial dataset for the alternative class, obtained by generating random samples, uniformly over the domain of the target-class dataset. We prove that, under suitable assumptions, the models converge (with a large dataset) to the GLRT. Moreover, we show that the one-class least squares SVM (OCLSSVM) at convergence performs as the GLRT, with a suitable transformation function. Lastly, we compare the obtained solutions with the autoencoder (AE) classifier, which does not in general provide the GLRT.  ( 2 min )
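    A minimal sketch of the training recipe, assuming a 2-D target class: draw the artificial alternative class uniformly over the bounding box of the target data and fit an off-the-shelf two-class MLP; the architecture and sizes are arbitrary choices:

        import numpy as np
        from sklearn.neural_network import MLPClassifier

        rng = np.random.default_rng(0)
        target = rng.normal(0, 1, size=(500, 2))         # target-class samples
        lo, hi = target.min(0), target.max(0)
        fake = rng.uniform(lo, hi, size=(500, 2))        # uniform artificial class
        X = np.vstack([target, fake])
        y = np.r_[np.ones(500), np.zeros(500)]
        clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, y)
        # Score a new sample: a high class-1 probability means "accept as target".
        print(clf.predict_proba([[0.0, 0.0]])[0, 1])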
    Anonymous Bandits for Multi-User Systems. (arXiv:2210.12198v1 [cs.LG])
    In this work, we present and study a new framework for online learning in systems with multiple users that provide user anonymity. Specifically, we extend the notion of bandits to obey the standard $k$-anonymity constraint by requiring each observation to be an aggregation of rewards for at least $k$ users. This provides a simple yet effective framework where one can learn a clustering of users in an online fashion without observing any user's individual decision. We initiate the study of anonymous bandits and provide the first sublinear regret algorithms and lower bounds for this setting.
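    A minimal sketch of the feedback model: the learner serves arms to a batch of at least $k$ users but observes only the aggregate reward; the reward distribution below is a made-up stand-in:

        import numpy as np

        rng = np.random.default_rng(0)
        k, n_users, n_arms = 5, 20, 3
        true_means = rng.random((n_users, n_arms))    # hidden per-user preferences

        def aggregated_pull(users, arms):
            # Serve one arm per user in a batch of >= k users, but observe
            # only the aggregate reward, preserving k-anonymity.
            assert len(users) >= k
            rewards = rng.binomial(1, true_means[users, arms])
            return rewards.sum()                      # individual rewards stay hidden

        batch = rng.choice(n_users, size=k, replace=False)
        print(aggregated_pull(batch, np.zeros(k, dtype=int)))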
  • Open

    A Multi-Agent Reinforcement Learning Framework for Off-Policy Evaluation in Two-sided Markets. (arXiv:2202.10574v3 [stat.ML] UPDATED)
    Two-sided markets such as ride-sharing platforms often involve a group of subjects making sequential decisions across time and/or location; with the rapid development of smartphones and the internet of things, such platforms have substantially transformed the transportation landscape. In this paper we consider large-scale fleet management in ride-sharing companies that involves multiple units in different areas receiving sequences of products (or treatments) over time. Major technical challenges, such as policy evaluation, arise in those studies because (i) spatial and temporal proximities induce interference between locations and times; and (ii) the large number of locations results in the curse of dimensionality. To address both challenges simultaneously, we introduce a multi-agent reinforcement learning (MARL) framework for carrying out policy evaluation in these studies. We propose novel estimators for mean outcomes under different products that are consistent despite the high-dimensionality of the state-action space. The proposed estimator works favorably in simulation experiments. We further illustrate our method using a real dataset obtained from a two-sided marketplace company to evaluate the effects of applying different subsidizing policies. A Python implementation of our proposed method is available at https://github.com/RunzheStat/CausalMARL.
    Trimmed Maximum Likelihood Estimation for Robust Learning in Generalized Linear Models. (arXiv:2206.04777v3 [cs.LG] UPDATED)
    We study the problem of learning generalized linear models under adversarial corruptions. We analyze a classical heuristic called the iterative trimmed maximum likelihood estimator which is known to be effective against label corruptions in practice. Under label corruptions, we prove that this simple estimator achieves minimax near-optimal risk on a wide range of generalized linear models, including Gaussian regression, Poisson regression and Binomial regression. Finally, we extend the estimator to the more challenging setting of label and covariate corruptions and demonstrate its robustness and optimality in that setting as well.
    Wide Boosting. (arXiv:2007.09855v4 [cs.LG] UPDATED)
    Gradient Boosting (GB) is a popular methodology used to solve prediction problems by minimizing a differentiable loss function, $L$. GB performs very well on tabular machine learning (ML) problems; however, as a pure ML solver it lacks the ability to fit models with probabilistic but correlated multi-dimensional outputs, for example, multiple correlated Bernoulli outputs. GB also does not form intermediate abstract data embeddings, one property of Deep Learning that gives greater flexibility and performance on other types of problems. This paper presents a simple adjustment to GB motivated in part by artificial neural networks. Specifically, our adjustment inserts a matrix multiplication between the output of a GB model and the loss, $L$. This allows the output of a GB model to have increased dimension prior to being fed into the loss and is thus ``wider'' than standard GB implementations. We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks and that the embeddings generated by WB are more useful in downstream prediction tasks than GB output predictions alone.
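    A minimal sketch of the core adjustment, using a squared loss for brevity (the paper targets more general losses such as correlated Bernoulli outputs): predictions are $P = FA$ for a wide GB output $F$, and the gradient chains back through $A$, so any GB library with custom objectives can fit the wide model:

        import numpy as np

        rng = np.random.default_rng(0)
        n, d_wide, d_out = 100, 8, 3
        A = rng.standard_normal((d_wide, d_out))   # mixing matrix after the GB output
        F = rng.standard_normal((n, d_wide))       # current wide GB output
        Y = rng.integers(0, 2, (n, d_out))         # multi-dimensional targets

        # Squared loss on P = F @ A; the gradient with respect to the wide
        # output F chains back through A.
        P = F @ A
        grad_P = 2 * (P - Y) / n
        grad_F = grad_P @ A.T                      # feed this to the GB solver
        print(grad_F.shape)                        # (100, 8)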
    Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent. (arXiv:2205.11361v2 [stat.ML] UPDATED)
    Recent studies have shown that gradient descent (GD) can achieve improved generalization when its dynamics exhibits a chaotic behavior. However, to obtain the desired effect, the step-size should be chosen sufficiently large, a task which is problem dependent and can be difficult in practice. In this study, we incorporate a chaotic component to GD in a controlled manner, and introduce multiscale perturbed GD (MPGD), a novel optimization framework where the GD recursion is augmented with chaotic perturbations that evolve via an independent dynamical system. We analyze MPGD from three different angles: (i) By building up on recent advances in rough paths theory, we show that, under appropriate assumptions, as the step-size decreases, the MPGD recursion converges weakly to a stochastic differential equation (SDE) driven by a heavy-tailed L\'evy-stable process. (ii) By making connections to recently developed generalization bounds for heavy-tailed processes, we derive a generalization bound for the limiting SDE and relate the worst-case generalization error over the trajectories of the process to the parameters of MPGD. (iii) We analyze the implicit regularization effect brought by the dynamical regularization and show that, in the weak perturbation regime, MPGD introduces terms that penalize the Hessian of the loss function. Empirical results are provided to demonstrate the advantages of MPGD.
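    A minimal sketch of the idea, using a logistic map as the independent chaotic system; the dynamics, coupling strength, and rescaling are illustrative choices, not the paper's MPGD specification:

        import numpy as np

        def mpgd(grad, x0, eta=0.05, eps=0.01, n_iters=500):
            # Gradient descent plus a perturbation driven by an independent
            # chaotic dynamical system (here a logistic map), recentered to zero.
            x, z = x0, 0.3
            for _ in range(n_iters):
                z = 4.0 * z * (1.0 - z)            # chaotic dynamics on [0, 1]
                x = x - eta * grad(x) + eps * (z - 0.5)
            return x

        print(mpgd(lambda x: 2 * (x - 1.0), x0=5.0))   # minimizes (x - 1)^2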
    A sharp uniform-in-time error estimate for Stochastic Gradient Langevin Dynamics. (arXiv:2207.09304v2 [math.PR] UPDATED)
    We establish a sharp uniform-in-time error estimate for the Stochastic Gradient Langevin Dynamics (SGLD), which is a popular sampling algorithm. Under mild assumptions, we obtain a uniform-in-time $O(\eta^2)$ bound for the KL-divergence between the SGLD iteration and the Langevin diffusion, where $\eta$ is the step size (or learning rate). Our analysis is also valid for varying step sizes. Based on this, we are able to obtain an $O(\eta)$ bound for the distance between the SGLD iteration and the invariant distribution of the Langevin diffusion, in terms of Wasserstein or total variation distances.
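    For reference, the iteration being analyzed has the familiar form $\theta_{k+1} = \theta_k - \eta \nabla U(\theta_k) + \sqrt{2\eta}\,\xi_k$; the sketch below uses the exact gradient for brevity, whereas SGLD proper replaces it with an unbiased stochastic estimate:

        import numpy as np

        rng = np.random.default_rng(0)

        def sgld(grad_U, x0, eta=1e-3, n_iters=10_000):
            # x_{k+1} = x_k - eta * grad U(x_k) + sqrt(2 * eta) * N(0, 1)
            x, samples = x0, []
            for _ in range(n_iters):
                x = x - eta * grad_U(x) + np.sqrt(2 * eta) * rng.standard_normal()
                samples.append(x)
            return np.array(samples)

        s = sgld(lambda x: x, x0=0.0)      # targets N(0, 1), i.e. U(x) = x^2 / 2
        print(s.mean(), s.var())           # roughly 0 and 1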
    Evaluating Robustness to Dataset Shift via Parametric Robustness Sets. (arXiv:2205.15947v3 [cs.LG] UPDATED)
    We give a method for proactively identifying small, plausible shifts in distribution which lead to large differences in model performance. These shifts are defined via parametric changes in the causal mechanisms of observed variables, where constraints on parameters yield a "robustness set" of plausible distributions and a corresponding worst-case loss over the set. While the loss under an individual parametric shift can be estimated via reweighting techniques such as importance sampling, the resulting worst-case optimization problem is non-convex, and the estimate may suffer from large variance. For small shifts, however, we can construct a local second-order approximation to the loss under shift and cast the problem of finding a worst-case shift as a particular non-convex quadratic optimization problem, for which efficient algorithms are available. We demonstrate that this second-order approximation can be estimated directly for shifts in conditional exponential family models, and we bound the approximation error. We apply our approach to a computer vision task (classifying gender from images), revealing sensitivity to shifts in non-causal attributes.
    A theory of learning with constrained weight-distribution. (arXiv:2206.08933v2 [q-bio.NC] UPDATED)
    A central question in computational neuroscience is how structure determines function in neural networks. The emerging high-quality large-scale connectomic datasets raise the question of what general functional principles can be gleaned from structural information such as the distribution of excitatory/inhibitory synapse types and the distribution of synaptic weights. Motivated by this question, we developed a statistical mechanical theory of learning in neural networks that incorporates structural information as constraints. We derived an analytical solution for the memory capacity of the perceptron, a basic feedforward model of supervised learning, with constraint on the distribution of its weights. Our theory predicts that the reduction in capacity due to the constrained weight-distribution is related to the Wasserstein distance between the imposed distribution and that of the standard normal distribution. To test the theoretical predictions, we use optimal transport theory and information geometry to develop an SGD-based algorithm to find weights that simultaneously learn the input-output task and satisfy the distribution constraint. We show that training in our algorithm can be interpreted as geodesic flows in the Wasserstein space of probability distributions. We further developed a statistical mechanical theory for teacher-student perceptron rule learning and ask for the best way for the student to incorporate prior knowledge of the rule. Our theory shows that it is beneficial for the learner to adopt different prior weight distributions during learning, and shows that distribution-constrained learning outperforms unconstrained and sign-constrained learning. Our theory and algorithm provide novel strategies for incorporating prior knowledge about weights into learning, and reveal a powerful connection between structure and function in neural networks.
    Machine learning meets false discovery rate. (arXiv:2208.06685v2 [stat.ME] UPDATED)
    Classical false discovery rate (FDR) controlling procedures offer strong and interpretable guarantees but often lack flexibility to work with complex data. By contrast, machine learning-based classification algorithms have superior performances on modern datasets but typically fall short of error-controlling guarantees. In this paper, we make these two meet by introducing a new adaptive novelty detection procedure with FDR control, called AdaDetect. It extends the scope of recent works of multiple testing literature to the high dimensional setting, notably the one in Yang et al. (2021). We prove that AdaDetect comes with finite sample guarantees: it controls the FDR strongly and approximates the oracle in terms of the power, with explicit remainder terms that are small under mild conditions. In practice, AdaDetect can be used in combination with any machine learning-based classifier, which allows the user to choose the most relevant classification approach. We illustrate this with classical real-world datasets, for which random forest and neural network classifiers are particularly efficient. The versatility of our method is also shown with an astrophysical application.
    Batch Bayesian optimisation via density-ratio estimation with guarantees. (arXiv:2209.10715v2 [cs.LG] UPDATED)
    Bayesian optimisation (BO) algorithms have shown remarkable success in applications involving expensive black-box functions. Traditionally BO has been set as a sequential decision-making process which estimates the utility of query points via an acquisition function and a prior over functions, such as a Gaussian process. Recently, however, a reformulation of BO via density-ratio estimation (BORE) allowed reinterpreting the acquisition function as a probabilistic binary classifier, removing the need for an explicit prior over functions and increasing scalability. In this paper, we present a theoretical analysis of BORE's regret and an extension of the algorithm with improved uncertainty estimates. We also show that BORE can be naturally extended to a batch optimisation setting by recasting the problem as approximate Bayesian inference. The resulting algorithms come equipped with theoretical performance guarantees and are assessed against other batch and sequential BO baselines in a series of experiments.
    Estimating counterfactual treatment outcomes over time in complex multi-agent scenarios. (arXiv:2206.01900v2 [cs.AI] UPDATED)
    Evaluation of intervention in a multi-agent system, e.g., when humans should intervene in autonomous driving systems and when a player should pass to teammates for a good shot, is challenging in various engineering and scientific fields. Estimating the individual treatment effect (ITE) using counterfactual long-term prediction is practical to evaluate such interventions. However, most of the conventional frameworks did not consider the time-varying complex structure of multi-agent relationships and covariate counterfactual prediction. This may sometimes lead to erroneous assessments of ITE and interpretation problems. Here we propose an interpretable, counterfactual recurrent network in multi-agent systems to estimate the effect of the intervention. Our model leverages graph variational recurrent neural networks and theory-based computation with domain knowledge for the ITE estimation framework based on long-term prediction of multi-agent covariates and outcomes, which can identify the circumstances under which the intervention is effective. On simulated models of an automated vehicle and biological agents with time-varying confounders, we show that our methods achieved lower estimation errors than the baselines for counterfactual covariates and for the most effective treatment timing. Furthermore, using real basketball data, our methods performed realistic counterfactual predictions and evaluated the counterfactual passes in shot scenarios.
    Joint Entropy Search for Maximally-Informed Bayesian Optimization. (arXiv:2206.04771v3 [cs.LG] UPDATED)
    Information-theoretic Bayesian optimization techniques have become popular for optimizing expensive-to-evaluate black-box functions due to their non-myopic qualities. Entropy Search and Predictive Entropy Search both consider the entropy over the optimum in the input space, while the recent Max-value Entropy Search considers the entropy over the optimal value in the output space. We propose Joint Entropy Search (JES), a novel information-theoretic acquisition function that considers an entirely new quantity, namely the entropy over the joint optimal probability density over both input and output space. To incorporate this information, we consider the reduction in entropy from conditioning on fantasized optimal input/output pairs. The resulting approach primarily relies on standard GP machinery and removes complex approximations typically associated with information-theoretic methods. With minimal computational overhead, JES shows superior decision-making, and yields state-of-the-art performance for information-theoretic approaches across a wide suite of tasks. As a light-weight approach with superior results, JES provides a new go-to acquisition function for Bayesian optimization.
    The Cosmic Graph: Optimal Information Extraction from Large-Scale Structure using Catalogues. (arXiv:2207.05202v2 [astro-ph.CO] UPDATED)
    We present an implicit likelihood approach to quantifying cosmological information over discrete catalogue data, assembled as graphs. To do so, we explore cosmological inference using mock dark matter halo catalogues. We employ Information Maximising Neural Networks (IMNNs) to quantify Fisher information extraction as a function of graph representation. We a) demonstrate the high sensitivity of modular graph structure to the underlying cosmology in the noise-free limit, b) show that networks automatically combine mass and clustering information through comparisons to traditional statistics, c) demonstrate that graph neural networks can still extract information when catalogues are subject to noisy survey cuts, and d) illustrate how nonlinear IMNN summaries can be used as asymptotically optimal compressed statistics for Bayesian implicit likelihood inference. We reduce the area of joint $\Omega_m, \sigma_8$ parameter constraints with small ($\sim$100 object) halo catalogues by a factor of 42 over the two-point correlation function. This work utilises a new IMNN implementation over graph data in Jax, which can take advantage of either numerical or auto-differentiability. We also show that graph IMNNs successfully compress simulations far from the fiducial model at which the network is fitted, indicating a promising alternative to $n$-point statistics in catalogue-based analyses.
    A Simple and General Debiased Machine Learning Theorem with Finite Sample Guarantees. (arXiv:2105.15197v3 [stat.ML] UPDATED)
    Debiased machine learning is a meta algorithm based on bias correction and sample splitting to calculate confidence intervals for functionals, i.e. scalar summaries, of machine learning algorithms. For example, an analyst may desire the confidence interval for a treatment effect estimated with a neural network. We provide a nonasymptotic debiased machine learning theorem that encompasses any global or local functional of any machine learning algorithm that satisfies a few simple, interpretable conditions. Formally, we prove consistency, Gaussian approximation, and semiparametric efficiency by finite sample arguments. The rate of convergence is $n^{-1/2}$ for global functionals, and it degrades gracefully for local functionals. Our results culminate in a simple set of conditions that an analyst can use to translate modern learning theory rates into traditional statistical inference. The conditions reveal a general double robustness property for ill posed inverse problems.
    Fundamental limits for learning hidden Markov model parameters. (arXiv:2106.12936v3 [stat.ML] UPDATED)
    We study the frontier between learnable and unlearnable hidden Markov models (HMMs). HMMs are flexible tools for clustering dependent data coming from unknown populations. The model parameters are known to be fully identifiable (up to label-switching) without any modeling assumption on the distributions of the populations as soon as the clusters are distinct and the hidden chain is ergodic with a full rank transition matrix. In the limit as any one of these conditions fails, it becomes impossible in general to identify parameters. For a chain with two hidden states we prove nonasymptotic minimax upper and lower bounds, matching up to constants, which exhibit thresholds at which the parameters become learnable. We also provide an upper bound on the relative entropy rate for parameters in a neighbourhood of the unlearnable region which may have interest in itself.
    Accelerating SGD for Highly Ill-Conditioned Huge-Scale Online Matrix Completion. (arXiv:2208.11246v2 [cs.LG] UPDATED)
    The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Real-world matrix completion is often a huge-scale optimization problem, with $d$ so large that even the simplest full-dimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slow-down when the underlying ground truth is ill-conditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$-close to a ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for huge-scale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$-accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our experiments, we observe a similar acceleration for item-item collaborative filtering on the MovieLens25M dataset via a pair-wise ranking loss, with 100 million training pairs and 10 million testing pairs. [See supporting code at https://github.com/Hong-Ming/ScaledSGD.]
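    A toy loop illustrating the preconditioning under stated assumptions (symmetric ground truth, entrywise squared loss, one observed entry per step): the usual stochastic gradient step on a factor row is right-multiplied by $(X^\top X)^{-1}$, an $r \times r$ matrix, so the per-step overhead is negligible. This is a sketch, not the repository's implementation:

        import numpy as np

        rng = np.random.default_rng(0)
        d, r = 30, 2
        U = rng.standard_normal((d, r))
        M = U @ U.T                                  # symmetric rank-r ground truth
        X = rng.standard_normal((d, r))              # current factor estimate
        eta = 0.1
        for _ in range(40_000):
            i, j = rng.choice(d, size=2, replace=False)
            resid = X[i] @ X[j] - M[i, j]
            P = np.linalg.inv(X.T @ X)               # the r x r preconditioner
            X[i] -= eta * resid * (X[j] @ P)         # preconditioned gradient step
            X[j] -= eta * resid * (X[i] @ P)
        print(np.linalg.norm(X @ X.T - M) / np.linalg.norm(M))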
    Dynamic population-based meta-learning for multi-agent communication with natural language. (arXiv:2110.14241v1 [cs.LG] CROSS LISTED)
    In this work, our goal is to train agents that can coordinate with seen, unseen as well as human partners in a multi-agent communication environment involving natural language. Previous work using a single set of agents has shown great progress in generalizing to known partners, however it struggles when coordinating with unfamiliar agents. To mitigate that, recent work explored the use of population-based approaches, where multiple agents interact with each other with the goal of learning more generic protocols. These methods, while able to result in good coordination between unseen partners, still only achieve so in cases of simple languages, thus failing to adapt to human partners using natural language. We attribute this to the use of static populations and instead propose a dynamic population-based meta-learning approach that builds such a population in an iterative manner. We perform a holistic evaluation of our method on two different referential games, and show that our agents outperform all prior work when communicating with seen partners and humans. Furthermore, we analyze the natural language generation skills of our agents, where we find that our agents also outperform strong baselines. Finally, we test the robustness of our agents when communicating with out-of-population agents and carefully test the importance of each component of our method through ablation studies.
    Differentially Private Data Generation Needs Better Features. (arXiv:2205.12900v2 [stat.ML] UPDATED)
    Training even moderately-sized generative models with differentially-private stochastic gradient descent (DP-SGD) is difficult: the required level of noise for reasonable levels of privacy is simply too large. We advocate instead building off a good, relevant representation on an informative public dataset, then learning to model the private data with that representation. In particular, we minimize the maximum mean discrepancy (MMD) between private target data and a generator's distribution, using a kernel based on perceptual features learned from a public dataset. With the MMD, we can simply privatize the data-dependent term once and for all, rather than introducing noise at each step of optimization as in DP-SGD. Our algorithm allows us to generate CIFAR10-level images with $\epsilon \approx 2$ which capture distinctive features in the distribution, far surpassing the current state of the art, which mostly focuses on datasets such as MNIST and FashionMNIST at a large $\epsilon \approx 10$. Our work introduces simple yet powerful foundations for reducing the gap between private and non-private deep generative models.
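    A minimal mean-embedding sketch of the "privatize once" idea: with a kernel given by an explicit feature map, the data-dependent part of the MMD is just the private data's mean feature vector, which can be released a single time with Gaussian noise. The feature map, the noise scale (calibration to $(\epsilon,\delta)$-DP is omitted), and the particle "generator" are all stand-ins, not the paper's perceptual-feature setup:

        import numpy as np

        rng = np.random.default_rng(0)
        d, h, n = 10, 32, 1000
        W = rng.standard_normal((d, h))               # stands in for public features
        phi = lambda x: np.tanh(x @ W) / np.sqrt(h)   # bounded feature map

        private = rng.normal(1.0, 1.0, (n, d))
        sigma = 0.05                                  # hypothetical noise scale
        mu_dp = phi(private).mean(0) + rng.normal(0, sigma, h)   # released once

        X = rng.normal(0.0, 1.0, (n, d))              # "generator" particles
        for _ in range(500):                          # no further touching of private data
            T = np.tanh(X @ W)
            diff = T.mean(0) / np.sqrt(h) - mu_dp
            X -= 0.5 * ((1 - T ** 2) * diff) @ W.T / np.sqrt(h)
        print(np.linalg.norm(phi(X).mean(0) - mu_dp))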
    Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis. (arXiv:2205.13496v2 [stat.ML] UPDATED)
    This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
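    A minimal sketch of the simultaneous objective, ignoring censoring for brevity: a single network outputs a grid of quantiles and is trained with the pinball loss summed over the grid (the paper's algorithm additionally handles censored targets):

        import torch

        def grid_pinball(preds, y, taus):
            # preds: (B, Q) quantile grid from ONE network; y: (B,); taus: (Q,)
            diff = y[:, None] - preds
            return torch.maximum(taus * diff, (taus - 1) * diff).mean()

        taus = torch.linspace(0.05, 0.95, 19)
        net = torch.nn.Sequential(torch.nn.Linear(3, 64), torch.nn.ReLU(),
                                  torch.nn.Linear(64, 19))
        x, y = torch.randn(128, 3), torch.randn(128)
        loss = grid_pinball(net(x), y, taus)
        loss.backward()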
    Grounding Aleatoric Uncertainty for Unsupervised Environment Design. (arXiv:2207.05219v2 [cs.LG] UPDATED)
    Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
    Exact Recovery of Community Structures Using DeepWalk and Node2vec. (arXiv:2101.07354v2 [stat.ML] UPDATED)
    Random-walk based network embedding algorithms like DeepWalk and node2vec are widely used to obtain Euclidean representation of the nodes in a network prior to performing downstream inference tasks. However, despite their impressive empirical performance, there is a lack of theoretical results explaining their large-sample behavior. In this paper, we study node2vec and DeepWalk through the perspective of matrix factorization. In particular, we analyze these algorithms in the setting of community detection for stochastic blockmodel graphs (and their degree-corrected variants). By exploiting the row-wise uniform perturbation bound for leading singular vectors, we derive high-probability error bounds between the matrix factorization-based node2vec/DeepWalk embeddings and their true counterparts, uniformly over all node embeddings. Based on strong concentration results, we further show the perfect membership recovery by node2vec/DeepWalk, followed by $K$-means/medians algorithms. Specifically, as the network becomes sparser, our results guarantee that with large enough window size and vertices number, applying $K$-means/medians on the matrix factorization-based node2vec embeddings can, with high probability, correctly recover the memberships of all vertices in a network generated from the stochastic blockmodel (or its degree-corrected variants). The theoretical justifications are mirrored in the numerical experiments and real data applications, for both the original node2vec and its matrix factorization variant.
    Utilizing variational autoencoders in the Bayesian inverse problem of photoacoustic tomography. (arXiv:2204.06270v2 [physics.comp-ph] UPDATED)
    There has been an increasing interest in utilizing machine learning methods in inverse problems and imaging. Most of the work has, however, concentrated on image reconstruction problems, and the number of studies regarding the full solution of the inverse problem is limited. In this work, we study a machine learning based approach for the Bayesian inverse problem of photoacoustic tomography. We develop an approach for estimating the posterior distribution in photoacoustic tomography using an approach based on the variational autoencoder. The approach is evaluated with numerical simulations and compared to the solution of the inverse problem using a Bayesian approach.
    Data Augmentation for Bayesian Deep Learning. (arXiv:1903.09668v4 [stat.ML] UPDATED)
    Deep Learning (DL) methods have emerged as one of the most powerful tools for functional approximation and prediction. While the representation properties of DL have been well studied, uncertainty quantification remains challenging and largely unexplored. Data augmentation techniques are a natural approach to provide uncertainty quantification and to incorporate stochastic Monte Carlo search into stochastic gradient descent (SGD) methods. The purpose of our paper is to show that training DL architectures with data augmentation leads to efficiency gains. We use the theory of scale mixtures of normals to derive data augmentation strategies for deep learning. This allows variants of the expectation-maximization and MCMC algorithms to be brought to bear on these high dimensional nonlinear deep learning models. To demonstrate our methodology, we develop data augmentation algorithms for a variety of commonly used activation functions: logit, ReLU, leaky ReLU and SVM. Our methodology is compared to traditional stochastic gradient descent with back-propagation. Our optimization procedure leads to a version of iteratively re-weighted least squares and can be implemented at scale with accelerated linear algebra methods providing substantial improvement in speed. We illustrate our methodology on a number of standard datasets. Finally, we conclude with directions for future research.
    Minimax Optimal Online Imitation Learning via Replay Estimation. (arXiv:2205.15397v4 [cs.LG] UPDATED)
    Online imitation learning is the problem of how best to mimic expert demonstrations, given access to the environment or an accurate simulator. Prior work has shown that in the infinite sample regime, exact moment matching achieves value equivalence to the expert policy. However, in the finite sample regime, even if one has no optimization error, empirical variance can lead to a performance gap that scales with $H^2 / N$ for behavioral cloning and $H / \sqrt{N}$ for online moment matching, where $H$ is the horizon and $N$ is the size of the expert dataset. We introduce the technique of replay estimation to reduce this empirical variance: by repeatedly executing cached expert actions in a stochastic simulator, we compute a smoother expert visitation distribution estimate to match. In the presence of general function approximation, we prove a meta theorem reducing the performance gap of our approach to the parameter estimation error for offline classification (i.e. learning the expert policy). In the tabular setting or with linear function approximation, our meta theorem shows that the performance gap incurred by our approach achieves the optimal $\widetilde{O}\left( \min\left( {H^{3/2}}/{N}, {H}/{\sqrt{N}} \right) \right)$ dependency, under significantly weaker assumptions compared to prior work. We implement multiple instantiations of our approach on several continuous control tasks and find that we are able to significantly improve policy performance across a variety of dataset sizes.
    Conjecturing-Based Computational Discovery of Patterns in Data. (arXiv:2011.11576v3 [cs.LG] UPDATED)
    We propose the use of a conjecturing machine that generates feature relationships in the form of bounds involving nonlinear terms for numerical features and boolean expressions for categorical features. The proposed \textsc{Conjecturing} framework recovers known nonlinear and boolean relationships among features from data. In both settings, true underlying relationships are revealed. We then compare the method to a previously-proposed framework for symbolic regression and demonstrate that it can also be used to recover equations that are satisfied among features in a dataset. The framework is then applied to patient-level data regarding COVID-19 outcomes to suggest possible risk factors that are confirmed in medical literature.
    Quantum Cross Entropy and Maximum Likelihood Principle. (arXiv:2102.11887v3 [quant-ph] UPDATED)
    Quantum machine learning is an emerging field at the intersection of machine learning and quantum computing. Classical cross entropy plays a central role in machine learning. We define its quantum generalization, the quantum cross entropy, prove its lower bounds, and investigate its relation to quantum fidelity. In the classical case, minimizing cross entropy is equivalent to maximizing likelihood. In the quantum case, when the quantum cross entropy is constructed from quantum data undisturbed by quantum measurements, this relation holds. Classical cross entropy is equal to negative log-likelihood. When we obtain quantum cross entropy through empirical density matrix based on measurement outcomes, the quantum cross entropy is lower-bounded by negative log-likelihood. These two different scenarios illustrate the information loss when making quantum measurements. We conclude that to achieve the goal of full quantum machine learning, it is crucial to utilize the deferred measurement principle.
    Bayesian Regularization for Functional Graphical Models with Applications to Neuroimaging. (arXiv:2110.05575v2 [stat.ME] UPDATED)
    Graphical models, used to express conditional dependence between random variables observed at various nodes, are used extensively in many fields such as genetics, neuroscience, and social network analysis. While most current statistical methods for estimating graphical models focus on scalar data, there is interest in estimating analogous dependence structures when the data observed at each node are functional, such as signals or images. In this paper, we propose a fully Bayesian regularization scheme for estimating functional graphical models. We first consider a direct Bayesian analog of the functional graphical lasso proposed by Qiao et al. (2019). We then propose a regularization strategy via the graphical horseshoe. We compare these approaches via simulation study and apply our proposed functional graphical horseshoe to two motivating applications, electroencephalography data for comparing brain activation between an alcoholic group and controls, as well as changes in structural connectivity in the presence of traumatic brain injury (TBI). Our results yield insight into how the brain attempts to compensate for disconnected networks after injury.
    Phase diagram of Stochastic Gradient Descent in high-dimensional two-layer neural networks. (arXiv:2202.00293v3 [stat.ML] UPDATED)
    Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the cross-over between these two regimes in the high-dimensional setting, and in particular investigate the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high-dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
    Distributed linear regression by averaging. (arXiv:1810.00412v3 [math.ST] UPDATED)
    Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. In this paper, we study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this work compared to doing linear regression on the full data? Here we study the performance loss in estimation, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss in one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase a lot, while prediction error increases much less. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.
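    A minimal sketch of the non-iterative baseline, with uniform weights standing in for the paper's optimal weighting: each machine fits OLS locally and the server averages the estimates:

        import numpy as np

        rng = np.random.default_rng(0)
        k, n, p = 10, 500, 20
        beta = rng.standard_normal(p)
        ests = []
        for _ in range(k):                     # each machine fits OLS locally
            X = rng.standard_normal((n, p))
            y = X @ beta + rng.standard_normal(n)
            ests.append(np.linalg.lstsq(X, y, rcond=None)[0])
        beta_avg = np.mean(ests, axis=0)       # server averages (uniform weights)
        print(np.linalg.norm(beta_avg - beta))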
    Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming. (arXiv:2203.01382v3 [cs.LG] UPDATED)
    Weak Supervision (WS) techniques allow users to efficiently create large training datasets by programmatically labeling data with heuristic sources of supervision. While the success of WS relies heavily on the provided labeling heuristics, the process of how these heuristics are created in practice has remained under-explored. In this work, we formalize the development process of labeling heuristics as an interactive procedure, built around the existing workflow where users draw ideas from a selected set of development data for designing the heuristic sources. With the formalism, we study two core problems of how to strategically select the development data to guide users in efficiently creating informative heuristics, and how to exploit the information within the development process to contextualize and better learn from the resultant heuristics. Building upon two novel methodologies that effectively tackle the respective problems considered, we present Nemo, an end-to-end interactive system that improves the overall productivity of WS learning pipeline by an average 20% (and up to 47% in one task) compared to the prevailing WS approach.
    Bayesian Inference with Latent Hamiltonian Neural Networks. (arXiv:2208.06120v2 [cs.LG] UPDATED)
    When sampling for Bayesian inference, one popular approach is to use Hamiltonian Monte Carlo (HMC) and specifically the No-U-Turn Sampler (NUTS) which automatically decides the end time of the Hamiltonian trajectory. However, HMC and NUTS can require numerous numerical gradients of the target density, and can prove slow in practice. We propose Hamiltonian neural networks (HNNs) with HMC and NUTS for solving Bayesian inference problems. Once trained, HNNs do not require numerical gradients of the target density during sampling. Moreover, they satisfy important properties such as perfect time reversibility and Hamiltonian conservation, making them well-suited for use within HMC and NUTS because stationarity can be shown. We also propose an HNN extension called latent HNNs (L-HNNs), which are capable of predicting latent variable outputs. Compared to HNNs, L-HNNs offer improved expressivity and reduced integration errors. Finally, we employ L-HNNs in NUTS with an online error monitoring scheme to prevent sample degeneracy in regions of low probability density. We demonstrate L-HNNs in NUTS with online error monitoring on several examples involving complex, heavy-tailed, and high-local-curvature probability densities. Overall, L-HNNs in NUTS with online error monitoring satisfactorily inferred these probability densities. Compared to traditional NUTS, L-HNNs in NUTS with online error monitoring required 1--2 orders of magnitude fewer numerical gradients of the target density and improved the effective sample size (ESS) per gradient by an order of magnitude.
    Fast Instrument Learning with Faster Rates. (arXiv:2205.10772v2 [stat.ML] UPDATED)
    We investigate nonlinear instrumental variable (IV) regression given high-dimensional instruments. We propose a simple algorithm which combines kernelized IV methods and an arbitrary, adaptive regression algorithm, accessed as a black box. Our algorithm enjoys faster-rate convergence and adapts to the dimensionality of informative latent features, while avoiding an expensive minimax optimization procedure, which has been necessary to establish similar guarantees. It further brings the benefit of flexible machine learning models to quasi-Bayesian uncertainty quantification, likelihood-based model selection, and model averaging. Simulation studies demonstrate the competitive performance of our method.
    On Elimination Strategies for Bandit Fixed-Confidence Identification. (arXiv:2205.10936v2 [cs.LG] UPDATED)
    Elimination algorithms for bandit identification, which prune the plausible correct answers sequentially until only one remains, are computationally convenient since they reduce the problem size over time. However, existing elimination strategies are often not fully adaptive (they update their sampling rule infrequently) and are not easy to extend to combinatorial settings, where the set of answers is exponentially large in the problem dimension. On the other hand, most existing fully-adaptive strategies to tackle general identification problems are computationally demanding since they repeatedly test the correctness of every answer, without ever reducing the problem size. We show that adaptive methods can be modified to use elimination in both their stopping and sampling rules, hence obtaining the best of these two worlds: the algorithms (1) remain fully adaptive, (2) suffer a sample complexity that is never worse than that of their non-elimination counterparts, and (3) provably eliminate certain wrong answers early. We confirm these benefits experimentally, where elimination significantly improves the computational complexity of adaptive methods on common tasks like best-arm identification in linear bandits.
    Generative multitask learning mitigates target-causing confounding. (arXiv:2202.04136v3 [cs.LG] UPDATED)
    We propose generative multitask learning (GMTL), a simple and scalable approach to causal representation learning for multitask learning. Our approach makes a minor change to the conventional multitask inference objective, and improves robustness to target shift. Since GMTL only modifies the inference objective, it can be used with existing multitask learning methods without requiring additional training. The improvement in robustness comes from mitigating unobserved confounders that cause the targets, but not the input. We refer to them as \emph{target-causing confounders}. These confounders induce spurious dependencies between the input and targets. This poses a problem for conventional multitask learning, due to its assumption that the targets are conditionally independent given the input. GMTL mitigates target-causing confounding at inference time, by removing the influence of the joint target distribution, and predicting all targets jointly. This removes the spurious dependencies between the input and targets, where the degree of removal is adjustable via a single hyperparameter. This flexibility is useful for managing the trade-off between in- and out-of-distribution generalization. Our results on the Attributes of People and Taskonomy datasets reflect an improved robustness to target shift across four multitask learning methods.
    Near Instance-Optimal PAC Reinforcement Learning for Deterministic MDPs. (arXiv:2203.09251v3 [cs.LG] UPDATED)
    In probably approximately correct (PAC) reinforcement learning (RL), an agent is required to identify an $\epsilon$-optimal policy with probability $1-\delta$. While minimax optimal algorithms exist for this problem, its instance-dependent complexity remains elusive in episodic Markov decision processes (MDPs). In this paper, we propose the first nearly matching (up to a horizon squared factor and logarithmic terms) upper and lower bounds on the sample complexity of PAC RL in deterministic episodic MDPs with finite state and action spaces. In particular, our bounds feature a new notion of sub-optimality gap for state-action pairs that we call the deterministic return gap. While our instance-dependent lower bound is written as a linear program, our algorithms are very simple and do not require solving such an optimization problem during learning. Their design and analyses employ novel ideas, including graph-theoretical concepts (minimum flows) and a new maximum-coverage exploration strategy.
    Sample Noise Impact on Active Learning. (arXiv:2109.01372v2 [stat.ML] UPDATED)
    This work explores the effect of noisy sample selection in active learning strategies. We show on both synthetic problems and real-life use-cases that knowledge of the sample noise can significantly improve the performance of active learning strategies. Building on prior work, we propose a robust sampler, Incremental Weighted K-Means that brings significant improvement on the synthetic tasks but only a marginal uplift on real-life ones. We hope that the questions raised in this paper are of interest to the community and could open new paths for active learning research.
    Protocols for classically training quantum generative models on probability distributions. (arXiv:2210.13442v1 [quant-ph])
    Quantum Generative Modelling (QGM) relies on preparing quantum states and generating samples from these states as hidden - or known - probability distributions. As distributions from some classes of quantum states (circuits) are inherently hard to sample classically, QGM represents an excellent testbed for quantum supremacy experiments. Furthermore, generative tasks are increasingly relevant for industrial machine learning applications, and thus QGM is a strong candidate for demonstrating a practical quantum advantage. However, this requires that quantum circuits are trained to represent industrially relevant distributions, and the corresponding training stage has an extensive training cost for current quantum hardware in practice. In this work, we propose protocols for classical training of QGMs based on circuits of the specific type that admit an efficient gradient computation, while remaining hard to sample. In particular, we consider Instantaneous Quantum Polynomial (IQP) circuits and their extensions. Showing their classical simulability in terms of the time complexity, sparsity and anti-concentration properties, we develop a classically tractable way of simulating their output probability distributions, allowing classical training to a target probability distribution. The corresponding quantum sampling from IQPs can be performed efficiently, unlike when using classical sampling. We numerically demonstrate the end-to-end training of IQP circuits using probability distributions for up to 30 qubits on a regular desktop computer. When applied to industrially relevant distributions this combination of classical training with quantum sampling represents an avenue for reaching advantage in the NISQ era.
    Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization. (arXiv:2107.01131v3 [stat.ML] UPDATED)
    Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
    Practical Transfer Learning for Bayesian Optimization. (arXiv:1802.02219v4 [stat.ML] UPDATED)
    When hyperparameter optimization of a machine learning algorithm is repeated for multiple datasets it is possible to transfer knowledge to an optimization run on a new dataset. We develop a new hyperparameter-free ensemble model for Bayesian optimization that is a generalization of two existing transfer learning extensions to Bayesian optimization and establish a worst-case bound compared to vanilla Bayesian optimization. Using a large collection of hyperparameter optimization benchmark problems, we demonstrate that our contributions substantially reduce optimization time compared to standard Gaussian process-based Bayesian optimization and improve over the current state-of-the-art for transfer hyperparameter optimization.
    De-Biased Machine Learning of Global and Local Parameters Using Regularized Riesz Representers. (arXiv:1802.08667v6 [stat.ML] UPDATED)
    We provide adaptive inference methods, based on $\ell_1$ regularization, for regular (semi-parametric) and non-regular (nonparametric) linear functionals of the conditional expectation function. Examples of regular functionals include average treatment effects, policy effects, and derivatives. Examples of non-regular functionals include average treatment effects, policy effects, and derivatives conditional on a covariate subvector fixed at a point. We construct a Neyman orthogonal equation for the target parameter that is approximately invariant to small perturbations of the nuisance parameters. To achieve this property, we include the Riesz representer for the functional as an additional nuisance parameter. Our analysis yields weak "double sparsity robustness": either the approximation to the regression or the approximation to the representer can be "completely dense" as long as the other is sufficiently "sparse". Our main results are non-asymptotic and imply asymptotic uniform validity over large classes of models, translating into honest confidence bands for both global and local parameters.
    Adaptive variational Bayes: Optimality, computation and applications. (arXiv:2109.03204v2 [math.ST] UPDATED)
    In this paper, we explore adaptive inference based on variational Bayes. Although several studies have been conducted to analyze the contraction properties of variational posteriors, there is still a lack of a general and computationally tractable variational Bayes method that performs adaptive inference. To fill this gap, we propose a novel adaptive variational Bayes framework, which can operate on a collection of models. The proposed framework first computes a variational posterior over each individual model separately and then combines them with certain weights to produce a variational posterior over the entire model. It turns out that this combined variational posterior is the closest member to the posterior over the entire model in a predefined family of approximating distributions. We show that the adaptive variational Bayes attains optimal contraction rates adaptively under very general conditions. In addition, we provide a methodology to maintain the tractability and adaptive optimality of the adaptive variational Bayes even in the presence of an enormous number of individual models, such as sparse models. We apply the general results to several examples, including deep learning and sparse factor models, and derive new and adaptive inference results. Moreover, we consider the use of quasi-likelihoods in our framework. We formulate theoretical conditions on quasi-likelihoods to ensure adaptive concentration and discuss specific applications to stochastic block models and nonparametric regression with sub-Gaussian errors.
    Bridging Machine Learning and Sciences: Opportunities and Challenges. (arXiv:2210.13441v1 [stat.ML])
    The application of machine learning in the sciences has seen exciting advances in recent years. As a widely applicable technique, anomaly detection has long been studied in the machine learning community. In particular, deep neural network-based out-of-distribution detection has made great progress for high-dimensional data. Recently, these techniques have been showing their potential in scientific disciplines. We take a critical look at their prospects for application, including data universality, experimental protocols, and model robustness. We discuss examples that display transferable practices and domain-specific challenges simultaneously, providing a starting point for establishing a novel interdisciplinary research paradigm in the near future.
    Deep Q-Learning for Nash Equilibria: Nash-DQN. (arXiv:1904.10554v2 [cs.LG] UPDATED)
    Model-free learning for multi-agent stochastic games is an active area of research. Existing reinforcement learning algorithms, however, are often restricted to zero-sum games, and are applicable only in small state-action spaces or other simplified settings. Here, we develop a new data efficient Deep-Q-learning methodology for model-free learning of Nash equilibria for general-sum stochastic games. The algorithm uses a local linear-quadratic expansion of the stochastic game, which leads to analytically solvable optimal actions. The expansion is parametrized by deep neural networks to give it sufficient flexibility to learn the environment without the need to experience all state-action pairs. We study symmetry properties of the algorithm stemming from label-invariant stochastic games and as a proof of concept, apply our algorithm to learning optimal trading strategies in competitive electronic markets.
    Kernel Methods for Causal Functions: Dose, Heterogeneous, and Incremental Response Curves. (arXiv:2010.04855v7 [econ.EM] UPDATED)
    We propose estimators based on kernel ridge regression for nonparametric causal functions such as dose, heterogeneous, and incremental response curves. Treatment and covariates may be discrete or continuous in general spaces. Due to a decomposition property specific to the RKHS, our estimators have simple closed form solutions. We prove uniform consistency with finite sample rates via original analysis of generalized kernel ridge regression. We extend our main results to counterfactual distributions and to causal functions identified by front and back door criteria. We achieve state-of-the-art performance in nonlinear simulations with many covariates, and conduct a policy evaluation of the US Job Corps training program for disadvantaged youths.
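    For readers who want the basic shape of such an estimator, here is the naive plug-in recipe: fit $m(d,x) \approx E[Y \mid D=d, X=x]$ by kernel ridge regression, then average over the empirical covariates to obtain a dose-response curve $\theta(d)$. The synthetic data and hyperparameters are illustrative assumptions; the paper's closed-form estimators and guarantees go well beyond this sketch.

        import numpy as np
        from sklearn.kernel_ridge import KernelRidge

        rng = np.random.default_rng(0)
        n = 500
        X = rng.normal(size=(n, 3))                    # covariates
        D = rng.uniform(0, 1, size=n)                  # continuous treatment (dose)
        Y = np.sin(3 * D) + X[:, 0] * D + rng.normal(scale=0.1, size=n)

        m = KernelRidge(kernel='rbf', alpha=1e-2, gamma=1.0)
        m.fit(np.column_stack([D, X]), Y)              # regression of Y on (D, X)

        def theta(d):
            """Average response at dose d over the empirical covariates."""
            return m.predict(np.column_stack([np.full(n, d), X])).mean()

        print([round(theta(d), 3) for d in (0.1, 0.5, 0.9)])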
    On the failure of variational score matching for VAE models. (arXiv:2210.13390v1 [stat.ML])
    Score matching (SM) is a convenient method for training flexible probabilistic models, which is often preferred over the traditional maximum-likelihood (ML) approach. However, these models are less interpretable than normalized models; as such, training robustness is in general difficult to assess. We present a critical study of existing variational SM objectives, showing catastrophic failure on a wide range of datasets and network architectures. Our theoretical insights on the objectives emerge directly from their equivalent autoencoding losses when optimizing variational autoencoder (VAE) models. First, we show that in the Fisher autoencoder, SM produces far worse models than maximum-likelihood, and approximate inference by Fisher divergence can lead to low-density local optima. However, with important modifications, this objective reduces to a regularized autoencoding loss that resembles the evidence lower bound (ELBO). This analysis predicts that the modified SM algorithm should behave very similarly to ELBO on Gaussian VAEs. We then review two other FD-based objectives from the literature and show that they reduce to uninterpretable autoencoding losses, likely leading to poor performance. The experiments verify our theoretical predictions and suggest that only ELBO and the baseline objective robustly produce expected results, while previously proposed SM methods do not.
    Applications of Machine Learning in Pharmacogenomics: Clustering Plasma Concentration-Time Curves. (arXiv:2210.13310v1 [stat.AP])
    Pharmaceutical researchers are continually searching for techniques to improve both drug development processes and patient outcomes. An area of recent interest is the potential for machine learning applications within pharmacology. One such application not yet given close study is the unsupervised clustering of plasma concentration-time curves, hereafter, pharmacokinetic (PK) curves. This can be done by treating a PK curve as a time series object and subsequently utilizing the extensive body of research related to the clustering of time series data objects. In this paper, we introduce hierarchical clustering within the context of clustering PK curves and find it to be effective at identifying similar-shaped PK curves and informative for understanding patterns of PK curves via its dendrogram data visualization. We also examine many dissimilarity measures between time series objects to identify Euclidean distance as generally most appropriate for clustering PK curves. We further show that dynamic time warping, Fr\'echet, and structure-based measures of dissimilarity like correlation may produce unexpected results. Finally, we apply these methods to a dataset of 250 PK curves as an illustrative case study to demonstrate how the clustering of PK curves can be used as a descriptive tool for summarizing and visualizing complex PK data, which may enhance the study of pharmacogenomics in the context of precision medicine.
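    The pipeline described above is easy to reproduce with standard tools; below is a sketch on synthetic one-compartment absorption/elimination curves sampled on a common time grid. The Ward linkage and the three-cluster cut are illustrative assumptions.

        import numpy as np
        from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

        rng = np.random.default_rng(0)
        t = np.linspace(0.1, 24, 50)                   # common sampling times (hours)
        ka = rng.uniform(0.5, 2.0, size=30)            # absorption rates
        ke = rng.uniform(0.05, 0.3, size=30)           # elimination rates
        curves = np.array([np.exp(-k2 * t) - np.exp(-k1 * t) for k1, k2 in zip(ka, ke)])

        Z = linkage(curves, method='ward', metric='euclidean')  # Euclidean, as recommended
        labels = fcluster(Z, t=3, criterion='maxclust')         # cut into 3 clusters
        print(np.bincount(labels)[1:])
        # dendrogram(Z) visualizes the shape-based hierarchy discussed in the paper.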
    Novelty Detection in Time Series via Weak Innovations Representation: A Deep Learning Approach. (arXiv:2210.13358v1 [cs.LG])
    We consider novelty detection in time series with unknown and nonparametric probability structures. A deep learning approach is proposed to causally extract an innovations sequence consisting of novelty samples statistically independent of all past samples of the time series. A novelty detection algorithm is developed for the online detection of novel changes in the probability structure in the innovations sequence. A minimax optimality under a Bayes risk measure is established for the proposed novelty detection method, and its robustness and efficacy are demonstrated in experiments using real and synthetic datasets.
    Sampling with Mollified Interaction Energy Descent. (arXiv:2210.13400v1 [stat.ML])
    Sampling from a target measure whose density is only known up to a normalization constant is a fundamental problem in computational statistics and machine learning. In this paper, we present a new optimization-based method for sampling called mollified interaction energy descent (MIED). MIED minimizes a new class of energies on probability measures called mollified interaction energies (MIEs). These energies rely on mollifier functions -- smooth approximations of the Dirac delta originating from PDE theory. We show that as the mollifier approaches the Dirac delta, the MIE converges to the chi-square divergence with respect to the target measure and the gradient flow of the MIE agrees with that of the chi-square divergence. Optimizing this energy with proper discretization yields a practical first-order particle-based algorithm for sampling in both unconstrained and constrained domains. We show experimentally that for unconstrained sampling problems our algorithm performs on par with existing particle-based algorithms like SVGD, while for constrained sampling problems our method readily incorporates constrained optimization techniques to handle more flexible constraints with strong performance compared to alternatives.
    High Fidelity Neural Audio Compression. (arXiv:2210.13438v1 [eess.AS])
    We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks. It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion. We simplify and speed up training by using a single multiscale spectrogram adversary that efficiently reduces artifacts and produces high-quality samples. We introduce a novel loss-balancer mechanism to stabilize training: the weight of a loss now defines the fraction of the overall gradient it should represent, thus decoupling the choice of this hyper-parameter from the typical scale of the loss. Finally, we study how lightweight Transformer models can be used to further compress the obtained representation by up to 40%, while staying faster than real time. We provide a detailed description of the key design choices of the proposed model, including the training objective, architectural changes, and a study of various perceptual loss functions. We present an extensive subjective evaluation (MUSHRA tests) together with an ablation study for a range of bandwidths and audio domains, including speech, noisy-reverberant speech, and music. Our approach is superior to the baseline methods across all evaluated settings, considering both 24 kHz monophonic and 48 kHz stereophonic audio. Code and models are available at github.com/facebookresearch/encodec.
    Deep Grey-Box Modeling With Adaptive Data-Driven Models Toward Trustworthy Estimation of Theory-Driven Models. (arXiv:2210.13103v1 [cs.LG])
    The combination of deep neural nets and theory-driven models, which we call deep grey-box modeling, can be inherently interpretable to some extent thanks to the theory backbone. Deep grey-box models are usually learned with a regularized risk minimization to prevent a theory-driven part from being overwritten and ignored by a deep neural net. However, an estimation of the theory-driven part obtained by uncritically optimizing a regularizer can hardly be trustworthy when we are not sure what regularizer is suitable for the given data, which may harm the interpretability. Toward a trustworthy estimation of the theory-driven part, we should analyze regularizers' behavior to compare different candidates and to justify a specific choice. In this paper, we present a framework that enables us to analyze a regularizer's behavior empirically with a slight change in the neural net's architecture and the training objective.
    On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL. (arXiv:2206.10770v2 [cs.LG] UPDATED)
    We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not necessary statistically for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.  ( 2 min )
    E-Valuating Classifier Two-Sample Tests. (arXiv:2210.13027v1 [stat.ME])
    We propose E-C2ST, a classifier two-sample test for high-dimensional data based on E-values. Compared to $p$-values-based tests, tests with E-values have finite sample guarantees for the type I error. E-C2ST combines ideas from existing work on split likelihood ratio tests and predictive independence testing. The resulting E-values incorporate information about the alternative hypothesis. We demonstrate the utility of E-C2ST on simulated and real-life data. In all experiments, we observe that when going from small to large sample sizes, as expected, E-C2ST starts with lower power compared to other methods but eventually converges towards one. Simultaneously, E-C2ST's type I error stays substantially below the chosen significance level, which is not always the case for the baseline methods. Finally, we use an MRI dataset to demonstrate that multiplying E-values from multiple independently conducted studies leads to a combined E-value that retains the finite sample type I error guarantees while increasing the power.
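    For readers unfamiliar with E-values, here is one simple split construction in their spirit (an assumption on our part: E-C2ST's actual statistic is built differently). Under $H_0: P = Q$ with balanced labels, the label carries no information about $x$, so the product of $2\hat{p}(y_i \mid x_i)$ over a held-out split has expectation at most one, and a large value is evidence against $H_0$.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        rng = np.random.default_rng(0)
        P = rng.normal(0.0, 1, size=(300, 5))
        Q = rng.normal(0.4, 1, size=(300, 5))          # shifted mean: H0 is false
        X = np.vstack([P, Q])
        y = np.r_[np.zeros(300), np.ones(300)]

        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)
        clf = LogisticRegression().fit(Xtr, ytr)       # classifier fit on one split
        probs = clf.predict_proba(Xte)[np.arange(len(yte)), yte.astype(int)]
        log_e = np.log(2 * np.clip(probs, 1e-12, 1)).sum()  # clip is a numerical guard
        print(log_e)                                   # strongly positive: reject H0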
    Data Banzhaf: A Robust Data Valuation Framework for Machine Learning. (arXiv:2205.15466v5 [cs.LG] UPDATED)
    This paper studies the robustness of data valuation to noisy model performance scores. In particular, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion originating from the cooperative game theory literature, achieves the maximal robustness among all semivalues -- a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive a lower bound on the sample complexity of Banzhaf value estimation, and we show that our MSR algorithm's sample complexity is close to the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.  ( 3 min )
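    The MSR principle itself is simple enough to sketch: every sampled subset is reused for every datum, so $m$ utility evaluations yield all $n$ Banzhaf estimates at once. The toy additive utility below (whose Banzhaf values equal the weights) and the sample sizes are our illustrative assumptions.

        import numpy as np

        def banzhaf_msr(utility, n, m=2000, seed=0):
            rng = np.random.default_rng(seed)
            masks = rng.integers(0, 2, size=(m, n)).astype(bool)  # uniform subsets
            u = np.array([utility(mask) for mask in masks])
            # phi_i: mean utility of subsets containing i minus those excluding i
            return np.array([u[masks[:, i]].mean() - u[~masks[:, i]].mean()
                             for i in range(n)])

        w = np.array([3.0, 1.0, 0.5, 0.0])             # additive utility weights
        print(banzhaf_msr(lambda mask: float(w[mask].sum()), n=4))  # approx w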
    Score-Based Diffusion meets Annealed Importance Sampling. (arXiv:2208.07698v3 [stat.ML] UPDATED)
    More than twenty years after its introduction, Annealed Importance Sampling (AIS) remains one of the most effective methods for marginal likelihood estimation. It relies on a sequence of distributions interpolating between a tractable initial distribution and the target distribution of interest which we simulate from approximately using a non-homogeneous Markov chain. To obtain an importance sampling estimate of the marginal likelihood, AIS introduces an extended target distribution to reweight the Markov chain proposal. While much effort has been devoted to improving the proposal distribution used by AIS, an underappreciated issue is that AIS uses a convenient but suboptimal extended target distribution. We here leverage recent progress in score-based generative modeling (SGM) to approximate the optimal extended target distribution minimizing the variance of the marginal likelihood estimate for AIS proposals corresponding to the discretization of Langevin and Hamiltonian dynamics. We demonstrate these novel, differentiable, AIS procedures on a number of synthetic benchmark distributions and variational auto-encoders.  ( 2 min )
    Contraction of Locally Differentially Private Mechanisms. (arXiv:2210.13386v1 [cs.IT])
    We investigate the contraction properties of locally differentially private mechanisms. More specifically, we derive tight upper bounds on the divergence between $PK$ and $QK$ output distributions of an $\epsilon$-LDP mechanism $K$ in terms of a divergence between the corresponding input distributions $P$ and $Q$, respectively. Our first main technical result presents a sharp upper bound on the $\chi^2$-divergence $\chi^2(PK\|QK)$ in terms of $\chi^2(P\|Q)$ and $\epsilon$. We also show that the same result holds for a large family of divergences, including KL-divergence and squared Hellinger distance. The second main technical result gives an upper bound on $\chi^2(PK\|QK)$ in terms of total variation distance $TV(P, Q)$ and $\epsilon$. We then utilize these bounds to establish locally private versions of the Cram\'er-Rao bound, Le Cam's, Assouad's, and the mutual information methods, which are powerful tools for bounding minimax estimation risks. These results are shown to lead to better privacy analyses than the state of the art in several statistical problems such as entropy and discrete distribution estimation, non-parametric density estimation, and hypothesis testing.
    Multiplicity-adjusted bootstrap tilting lower confidence bounds for conditional prediction performance measures. (arXiv:2210.13206v1 [stat.ML])
    In machine learning, the selection of a promising model from a potentially large number of competing models and the assessment of its generalization performance are critical tasks that need careful consideration. Typically, model selection and evaluation are strictly separated endeavors: the sample at hand is split into a training, validation, and evaluation set, and only a single confidence interval is computed for the prediction performance of the final selected model. We, however, propose an algorithm to compute valid lower confidence bounds for multiple models that have been selected based on their prediction performances in the evaluation set, by interpreting the selection problem as a simultaneous inference problem. We use bootstrap tilting and a maxT-type multiplicity correction. The approach is universally applicable for any combination of prediction models, any model selection strategy, and any prediction performance measure that accepts weights. We conducted various simulation experiments which show that our proposed approach yields lower confidence bounds that are at least comparably good as bounds from standard approaches, and that reliably reach the nominal coverage probability. In addition, especially when the sample size is small, our proposed approach yields better-performing prediction models than the default selection of only one model for evaluation does.  ( 3 min )
    PAC-Bayesian Offline Contextual Bandits With Guarantees. (arXiv:2210.13132v1 [stat.ML])
    This paper introduces a new principled approach for offline policy optimisation in contextual bandits. For two well-established risk estimators, we propose novel generalisation bounds able to confidently improve upon the logging policy offline. Unlike previous work, our approach does not require tuning hyperparameters on held-out sets, and enables deployment with no prior A/B testing. This is achieved by analysing the problem through the PAC-Bayesian lens; mainly, we let go of traditional policy parametrisation (e.g. softmax) and instead interpret the policies as mixtures of deterministic strategies. Through extensive experiments, we demonstrate the tightness of our bounds and the effectiveness of our approach in practical scenarios.  ( 2 min )
    Langevin dynamics based algorithm e-TH$\varepsilon$O POULA for stochastic optimization problems with discontinuous stochastic gradient. (arXiv:2210.13193v1 [math.OC])
    We introduce a new Langevin dynamics based algorithm, called e-TH$\varepsilon$O POULA, to solve optimization problems with discontinuous stochastic gradients which naturally appear in real-world applications such as quantile estimation, vector quantization, CVaR minimization, and regularized optimization problems involving ReLU neural networks. We demonstrate both theoretically and numerically the applicability of the e-TH$\varepsilon$O POULA algorithm. More precisely, under the conditions that the stochastic gradient is locally Lipschitz in average and satisfies a certain convexity at infinity condition, we establish non-asymptotic error bounds for e-TH$\varepsilon$O POULA in Wasserstein distances and provide a non-asymptotic estimate for the expected excess risk, which can be controlled to be arbitrarily small. Three key applications in finance and insurance are provided, namely, multi-period portfolio optimization, transfer learning in multi-period portfolio optimization, and insurance claim prediction, which involve neural networks with (Leaky)-ReLU activation functions. Numerical experiments conducted using real-world datasets illustrate the superior empirical performance of e-TH$\varepsilon$O POULA compared to SGLD, ADAM, and AMSGrad in terms of model accuracy.  ( 2 min )
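    To show the family of updates involved, here is a generic tamed stochastic gradient Langevin sketch on a problem with a discontinuous stochastic gradient (quantile estimation via the pinball loss). The taming function, step size, and inverse temperature below are illustrative assumptions; the actual e-TH$\varepsilon$O POULA update uses different taming and boosting functions.

        import numpy as np

        def tamed_sgld(grad, theta0, lam=1e-2, beta=1e4, steps=5000, seed=0):
            rng = np.random.default_rng(seed)
            theta = np.array(theta0, dtype=float)
            for _ in range(steps):
                g = grad(theta, rng)
                g = g / (1.0 + np.sqrt(lam) * np.linalg.norm(g))  # taming bounds the step
                theta = theta - lam * g + np.sqrt(2 * lam / beta) * rng.normal(size=theta.shape)
            return theta

        # Pinball (quantile) loss for the 0.9-quantile: its stochastic gradient
        # 1{x < theta} - 0.9 is a step function, hence discontinuous.
        data = np.random.default_rng(1).normal(size=10000)
        def grad(theta, rng):
            x = rng.choice(data, size=32)
            return np.array([np.mean(np.where(x < theta[0], 0.1, -0.9))])

        print(tamed_sgld(grad, [0.0]))   # approaches the 0.9-quantile (about 1.28)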
    Deep Kronecker Network. (arXiv:2210.13327v1 [stat.ML])
    We propose the Deep Kronecker Network (DKN), a novel framework designed for analyzing medical imaging data, such as MRI, fMRI, CT, etc. Medical imaging data differs from general images in at least two aspects: i) the sample size is usually much more limited, and ii) model interpretation is more of a concern than outcome prediction. Due to this unique nature, general-purpose methods such as convolutional neural networks (CNNs) are difficult to apply directly. We therefore propose DKN, which is able to i) adapt to the low-sample-size limitation, ii) provide the desired model interpretation, and iii) achieve prediction power comparable to that of a CNN. DKN is general in the sense that it not only works for both matrix and (high-order) tensor represented image data, but can also be applied to both discrete and continuous outcomes. DKN is built on a Kronecker product structure and implicitly imposes a piecewise smooth property on the coefficients. Moreover, the Kronecker structure can be written in convolutional form, so DKN also resembles a CNN, in particular a fully convolutional network (FCN). Furthermore, we prove that with an alternating minimization algorithm, the solutions of DKN are guaranteed to converge to the truth geometrically even if the objective function is highly nonconvex. Interestingly, DKN is also highly connected to the tensor regression framework proposed by Zhou et al. (2010), where a CANDECOMP/PARAFAC (CP) low-rank structure is imposed on the tensor coefficients. Finally, we conduct both classification and regression analyses using real MRI data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) to demonstrate the effectiveness of DKN.  ( 2 min )
    PARAFAC2-based Coupled Matrix and Tensor Factorizations. (arXiv:2210.13054v1 [cs.LG])
    Coupled matrix and tensor factorizations (CMTF) have emerged as an effective data fusion tool to jointly analyze data sets in the form of matrices and higher-order tensors. The PARAFAC2 model has shown to be a promising alternative to the CANDECOMP/PARAFAC (CP) tensor model due to its flexibility and capability to handle irregular/ragged tensors. While fusion models based on a PARAFAC2 model coupled with matrix/tensor decompositions have been recently studied, they are limited in terms of possible regularizations and/or types of coupling between data sets. In this paper, we propose an algorithmic framework for fitting PARAFAC2-based CMTF models with the possibility of imposing various constraints on all modes and linear couplings, using Alternating Optimization (AO) and the Alternating Direction Method of Multipliers (ADMM). Through numerical experiments, we demonstrate that the proposed algorithmic approach accurately recovers the underlying patterns using various constraints and linear couplings.  ( 2 min )
    Calibration tests beyond classification. (arXiv:2210.13355v1 [stat.ML])
    Most supervised machine learning tasks are subject to irreducible prediction errors. Probabilistic predictive models address this limitation by providing probability distributions that represent a belief over plausible targets, rather than point estimates. Such models can be a valuable tool in decision-making under uncertainty, provided that the model output is meaningful and interpretable. Calibrated models guarantee that the probabilistic predictions are neither over- nor under-confident. In the machine learning literature, different measures and statistical tests have been proposed and studied for evaluating the calibration of classification models. For regression problems, however, research has been focused on a weaker condition of calibration based on predicted quantiles for real-valued targets. In this paper, we propose the first framework that unifies calibration evaluation and tests for general probabilistic predictive models. It applies to any such model, including classification and regression models of arbitrary dimension. Furthermore, the framework generalizes existing measures and provides a more intuitive reformulation of a recently proposed framework for calibration in multi-class classification. In particular, we reformulate and generalize the kernel calibration error, its estimators, and hypothesis tests using scalar-valued kernels, and evaluate the calibration of real-valued regression problems.  ( 2 min )
    Theoretical Guarantees for Domain Adaptation with Hierarchical Optimal Transport. (arXiv:2210.13331v1 [stat.ML])
    Domain adaptation arises as an important problem in statistical learning theory when the data-generating processes differ between training and test samples, respectively called source and target domains. Recent theoretical advances show that the success of domain adaptation algorithms heavily relies on their ability to minimize the divergence between the probability distributions of the source and target domains. However, minimizing this divergence cannot be done independently of the minimization of other key ingredients such as the source risk or the combined error of the ideal joint hypothesis. The trade-off between these terms is often ensured by algorithmic solutions that remain implicit and not directly reflected by the theoretical guarantees. To get to the bottom of this issue, we propose in this paper a new theoretical framework for domain adaptation through hierarchical optimal transport. This framework provides more explicit generalization bounds and allows us to consider the natural hierarchical organization of samples in both domains into classes or clusters. Additionally, we provide a new divergence measure between the source and target domains, called the Hierarchical Wasserstein distance, that indicates, under mild assumptions, which structures have to be aligned to lead to a successful adaptation.  ( 2 min )
    MARS: Meta-Learning as Score Matching in the Function Space. (arXiv:2210.13319v1 [cs.LG])
    Meta-learning aims to extract useful inductive biases from a set of related datasets. In Bayesian meta-learning, this is typically achieved by constructing a prior distribution over neural network parameters. However, specifying families of computationally viable prior distributions over the high-dimensional neural network parameters is difficult. As a result, existing approaches resort to meta-learning restrictive diagonal Gaussian priors, severely limiting their expressiveness and performance. To circumvent these issues, we approach meta-learning through the lens of functional Bayesian neural network inference, which views the prior as a stochastic process and performs inference in the function space. Specifically, we view the meta-training tasks as samples from the data-generating process and formalize meta-learning as empirically estimating the law of this stochastic process. Our approach can seamlessly acquire and represent complex prior knowledge by meta-learning the score function of the data-generating process marginals instead of parameter space priors. In a comprehensive benchmark, we demonstrate that our method achieves state-of-the-art performance in terms of predictive accuracy and substantial improvements in the quality of uncertainty estimates.  ( 2 min )
    A PAC-Bayesian Generalization Bound for Equivariant Networks. (arXiv:2210.13150v1 [cs.LG])
    Equivariant networks capture the inductive bias about the symmetry of the learning task by building those symmetries into the model. In this paper, we study how equivariance relates to generalization error, utilizing PAC-Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in the Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and of the multiplicity and degree of irreducible representations, on the generalization error, thereby providing a guideline for selecting them. In general, the bound indicates that using a larger group size in the model improves the generalization error, a finding substantiated by extensive numerical experiments.  ( 2 min )
    Tail Batch Sampling: Approximating Global Contrastive Losses as Optimization over Batch Assignments. (arXiv:2210.12874v1 [cs.LG])
    Contrastive Learning has recently achieved state-of-the-art performance in a wide range of tasks. Many contrastive learning approaches use mined hard negatives to make batches more informative during training, but these approaches are inefficient as they increase epoch length proportionally to the number of mined negatives and require frequent updates of nearest neighbor indices or mining from recent batches. In this work, we provide an alternative to hard negative mining in supervised contrastive learning, Tail Batch Sampling (TBS), an efficient approximation to the batch assignment problem that upper bounds the gap between the global and training losses, $\mathcal{L}^{Global} - \mathcal{L}^{Train}$. TBS improves state-of-the-art performance in sentence embedding (+0.37 Spearman) and code-search tasks (+2.2% MRR), is easy to implement - requiring only a few additional lines of code - does not maintain external data structures such as nearest neighbor indices, is more computationally efficient when compared to the most minimal hard negative mining approaches, and makes no changes to the model being trained.  ( 2 min )
    Stochastic Mirror Descent for Large-Scale Sparse Recovery. (arXiv:2210.12882v1 [stat.ML])
    In this paper we discuss an application of Stochastic Approximation to statistical estimation of high-dimensional sparse parameters. The proposed solution reduces to resolving a penalized stochastic optimization problem on each stage of a multistage algorithm; each problem being solved to a prescribed accuracy by the non-Euclidean Composite Stochastic Mirror Descent (CSMD) algorithm. Assuming that the problem objective is smooth and quadratically minorated and stochastic perturbations are sub-Gaussian, our analysis prescribes the method parameters which ensure fast convergence of the estimation error (the radius of a confidence ball of a given norm around the approximate solution). This convergence is linear during the first "preliminary" phase of the routine and is sublinear during the second "asymptotic" phase. We consider an application of the proposed approach to sparse Generalized Linear Regression problem. In this setting, we show that the proposed algorithm attains the optimal convergence of the estimation error under weak assumptions on the regressor distribution. We also present a numerical study illustrating the performance of the algorithm on high-dimensional simulation data.  ( 2 min )
    Symmetric (Optimistic) Natural Policy Gradient for Multi-agent Learning with Parameter Convergence. (arXiv:2210.12812v1 [math.OC])
    Multi-agent interactions are increasingly important in the context of reinforcement learning, and the theoretical foundations of policy gradient methods have attracted surging research interest. We investigate the global convergence of natural policy gradient (NPG) algorithms in multi-agent learning. We first show that vanilla NPG may not have parameter convergence, i.e., the convergence of the vector that parameterizes the policy, even when the costs are regularized (which enabled strong convergence guarantees in the policy space in the literature). This non-convergence of parameters leads to stability issues in learning, which becomes especially relevant in the function approximation setting, where we can only operate on low-dimensional parameters, instead of the high-dimensional policy. We then propose variants of the NPG algorithm, for several standard multi-agent learning scenarios: two-player zero-sum matrix and Markov games, and multi-player monotone games, with global last-iterate parameter convergence guarantees. We also generalize the results to certain function approximation settings. Note that in our algorithms, the agents take symmetric roles. Our results might also be of independent interest for solving nonconvex-nonconcave minimax optimization problems with certain structures. Simulations are also provided to corroborate our theoretical findings.  ( 2 min )
    Falsehoods that ML researchers believe about OOD detection. (arXiv:2210.12767v1 [stat.ML])
    Modelling the density $p(x)$ by probabilistic generative models is an intuitive way to detect out-of-distribution (OOD) data, but it fails in the deep learning context. In this paper, we list some falsehoods that machine learning researchers believe about density-based OOD detection. Many recent works have proposed likelihood-ratio-based methods to 'fix' this issue. We propose a framework, the OOD proxy framework, to unify these methods, and we argue that likelihood ratio is a principled method for OOD detection and not a mere 'fix'. Finally, we discuss the relationship between domain detection and semantics.  ( 2 min )
    Optimal Discriminant Analysis in High-Dimensional Latent Factor Models. (arXiv:2210.12862v1 [math.ST])
    In high-dimensional classification problems, a commonly used approach is to first project the high-dimensional features into a lower dimensional space, and base the classification on the resulting lower dimensional projections. In this paper, we formulate a latent-variable model with a hidden low-dimensional structure to justify this two-step procedure and to guide which projection to choose. We propose a computationally efficient classifier that takes certain principal components (PCs) of the observed features as projections, with the number of retained PCs selected in a data-driven way. A general theory is established for analyzing such two-step classifiers based on any projections. We derive explicit rates of convergence of the excess risk of the proposed PC-based classifier. The obtained rates are further shown to be optimal up to logarithmic factors in the minimax sense. Our theory allows the lower-dimension to grow with the sample size and is also valid even when the feature dimension (greatly) exceeds the sample size. Extensive simulations corroborate our theoretical findings. The proposed method also performs favorably relative to other existing discriminant methods on three real data examples.  ( 2 min )
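    In its simplest off-the-shelf form, the two-step procedure analyzed above is just PCA followed by a linear classifier; the sketch below uses a fixed number of components and LDA (illustrative assumptions: the paper selects the number of retained PCs in a data-driven way and analyzes the procedure far more carefully).

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.decomposition import PCA
        from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
        from sklearn.model_selection import cross_val_score

        rng = np.random.default_rng(0)
        n, p, k = 200, 1000, 5                         # p >> n, latent dimension k
        B = rng.normal(size=(p, k))                    # factor loadings
        Z = rng.normal(size=(n, k))                    # latent factors
        y = (Z[:, 0] > 0).astype(int)                  # labels live in the latent space
        X = Z @ B.T + 0.5 * rng.normal(size=(n, p))    # observed high-dimensional features

        clf = make_pipeline(PCA(n_components=k), LinearDiscriminantAnalysis())
        print(cross_val_score(clf, X, y, cv=5).mean()) # accuracy of the two-step classifier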
    Manifold Alignment with Label Information. (arXiv:2210.12774v1 [stat.ML])
    Multi-domain data is becoming increasingly common and presents both challenges and opportunities in the data science community. The integration of distinct data-views can be used for exploratory data analysis, and benefit downstream analysis including machine learning related tasks. With this in mind, we present a novel manifold alignment method called MALI (Manifold alignment with label information) that learns a correspondence between two distinct domains. MALI can be considered as belonging to a middle ground between the more commonly addressed semi-supervised manifold alignment problem with some known correspondences between the two domains, and the purely unsupervised case, where no known correspondences are provided. To do this, MALI learns the manifold structure in both domains via a diffusion process and then leverages discrete class labels to guide the alignment. By aligning two distinct domains, MALI recovers a pairing and a common representation that reveals related samples in both domains. Additionally, MALI can be used for the transfer learning problem known as domain adaptation. We show that MALI outperforms the current state-of-the-art manifold alignment methods across multiple datasets.  ( 2 min )
    Multi-Objective GFlowNets. (arXiv:2210.12765v1 [cs.LG])
    In many applications of machine learning, like drug discovery and material design, the goal is to generate candidates that simultaneously maximize a set of objectives. As these objectives are often conflicting, there is no single candidate that simultaneously maximizes all objectives, but rather a set of Pareto-optimal candidates where one objective cannot be improved without worsening another. Moreover, in practice, these objectives are often under-specified, making the diversity of candidates a key consideration. The existing multi-objective optimization methods focus predominantly on covering the Pareto front, failing to capture diversity in the space of candidates. Motivated by the success of GFlowNets for generation of diverse candidates in a single objective setting, in this paper we consider Multi-Objective GFlowNets (MOGFNs). MOGFNs consist of a novel Conditional GFlowNet which models a family of single-objective sub-problems derived by decomposing the multi-objective optimization problem. Our work is the first to empirically demonstrate conditional GFlowNets. Through a series of experiments on synthetic and benchmark tasks, we empirically demonstrate that MOGFNs outperform existing methods in terms of Hypervolume, R2-distance and candidate diversity. We also demonstrate the effectiveness of MOGFNs over existing methods in active learning settings. Finally, we supplement our empirical results with a careful analysis of each component of MOGFNs.  ( 2 min )
    Principal Component Classification. (arXiv:2210.12746v1 [cs.LG])
    We propose to directly compute classification estimates by learning features encoded with their class scores. The resulting model has an encoder-decoder structure suitable for supervised learning; it is computationally efficient and performs well for classification on several datasets.  ( 2 min )
    Transport Reversible Jump Proposals. (arXiv:2210.12572v1 [stat.CO])
    Reversible jump Markov chain Monte Carlo (RJMCMC) proposals that achieve reasonable acceptance rates and mixing are notoriously difficult to design in most applications. Inspired by recent advances in deep neural network-based normalizing flows and density estimation, we demonstrate an approach to enhance the efficiency of RJMCMC sampling by performing transdimensional jumps involving reference distributions. In contrast to other RJMCMC proposals, the proposed method is the first to apply a non-linear transport-based approach to construct efficient proposals between models with complicated dependency structures. It is shown that, in the setting where exact transports are used, our RJMCMC proposals have the desirable property that the acceptance probability depends only on the model probabilities. Numerical experiments demonstrate the efficacy of the approach.  ( 2 min )
    Bayesian Convolutional Deep Sets with Task-Dependent Stationary Prior. (arXiv:2210.12363v1 [stat.ML])
    Convolutional deep sets are a deep neural network (DNN) architecture that can model stationary stochastic processes. This architecture uses a kernel smoother and the DNN to construct translation-equivariant functional representations, and thus reflects the inductive bias of stationarity in the DNN. However, since this architecture employs the kernel smoother, a non-parametric model, it may produce ambiguous representations when the number of data points is insufficient. To remedy this issue, we introduce Bayesian convolutional deep sets, which construct random translation-equivariant functional representations with a stationary prior. Furthermore, we present how to impose a task-dependent prior for each dataset, because a wrongly imposed prior forms an even worse representation than that of the kernel smoother. We validate the proposed architecture and its training on various experiments with time-series and image datasets.  ( 2 min )
    Bayesian Optimization with Conformal Coverage Guarantees. (arXiv:2210.12496v1 [cs.LG])
    Bayesian optimization is a coherent, ubiquitous approach to decision-making under uncertainty, with applications including multi-arm bandits, active learning, and black-box optimization. Bayesian optimization selects decisions (i.e. objective function queries) with maximal expected utility with respect to the posterior distribution of a Bayesian model, which quantifies reducible, epistemic uncertainty about query outcomes. In practice, subjectively implausible outcomes can occur regularly for two reasons: 1) model misspecification and 2) covariate shift. Conformal prediction is an uncertainty quantification method with coverage guarantees even for misspecified models and a simple mechanism to correct for covariate shift. We propose conformal Bayesian optimization, which directs queries towards regions of search space where the model predictions have guaranteed validity, and investigate its behavior on a suite of black-box optimization tasks and tabular ranking tasks. In many cases we find that query coverage can be significantly improved without harming sample-efficiency.  ( 2 min )
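    The coverage mechanism being borrowed is ordinary split conformal prediction; a minimal sketch follows. How conformal Bayesian optimization couples this with the posterior and the acquisition function is more involved, and the GP surrogate and nonconformity score here are illustrative assumptions.

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(0)
        X = rng.uniform(-2, 2, size=(60, 1))
        y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=60)
        fit, cal = slice(0, 40), slice(40, 60)         # proper training / calibration split

        gp = GaussianProcessRegressor().fit(X[fit], y[fit])
        alpha = 0.1
        scores = np.abs(y[cal] - gp.predict(X[cal]))   # nonconformity scores
        q = np.quantile(scores, np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores))

        mu = gp.predict(np.array([[0.5]]))
        print(mu - q, mu + q)  # ~90% marginal coverage, even if the GP is misspecified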
    Generalized Likelihood Ratio Test With One-Class Classifiers. (arXiv:2210.12494v1 [cs.LG])
    One-class classification (OCC) is the problem of deciding whether an observed sample belongs to a target class or not. We consider the problem of learning an OCC model when the dataset available at the learning stage contains only samples from the target class. We aim at obtaining a classifier that performs as the generalized likelihood ratio test (GLRT), which is a well-known and provably optimal (under specific assumptions) classifier when the statistic of the target class is available. To this end, we consider both the multilayer perceptron neural network (NN) and the support vector machine (SVM) models. They are trained as two-class classifiers using an artificial dataset for the alternative class, obtained by generating random samples uniformly over the domain of the target-class dataset. We prove that, under suitable assumptions, the models converge (with a large dataset) to the GLRT. Moreover, we show that the one-class least squares SVM (OCLSSVM) at convergence performs as the GLRT, with a suitable transformation function. Lastly, we compare the obtained solutions with the autoencoder (AE) classifier, which does not in general provide the GLRT.  ( 2 min )
    Learning Correlated Stackelberg Equilibrium in General-Sum Multi-Leader-Single-Follower Games. (arXiv:2210.12470v1 [cs.LG])
    Many real-world strategic games involve interactions between multiple players. We study a hierarchical multi-player game structure, where players with asymmetric roles can be separated into leaders and followers, a setting often referred to as a Stackelberg or leader-follower game. In particular, we focus on a Stackelberg game scenario where there are multiple leaders and a single follower, called the Multi-Leader-Single-Follower (MLSF) game. We propose a novel asymmetric equilibrium concept for the MLSF game called Correlated Stackelberg Equilibrium (CSE). We design online learning algorithms that enable the players to interact in a distributed manner, and prove that they achieve no-external Stackelberg-regret learning. This further translates to convergence to an approximate CSE via a reduction from no-external regret to no-swap regret. At the core of our work, we solve the intricate problem of how to learn equilibrium in leader-follower games with noisy bandit feedback by balancing exploration and exploitation in different learning structures.  ( 2 min )
    Adaptive Data Fusion for Multi-task Non-smooth Optimization. (arXiv:2210.12334v1 [stat.ML])
    We study the problem of multi-task non-smooth optimization that arises ubiquitously in statistical learning, decision-making and risk management. We develop a data fusion approach that adaptively leverages commonalities among a large number of objectives to improve sample efficiency while tackling their unknown heterogeneities. We provide sharp statistical guarantees for our approach. Numerical experiments on both synthetic and real data demonstrate significant advantages of our approach over benchmarks.  ( 2 min )
    Explanation Shift: Detecting distribution shifts on tabular data via the explanation space. (arXiv:2210.12369v1 [cs.LG])
    As input data distributions evolve, the predictive performance of machine learning models tends to deteriorate. In the past, predictive performance was considered the key indicator to monitor. However, explanation aspects have come to attention within the last years. In this work, we investigate how model predictive performance and model explanation characteristics are affected under distribution shifts and how these key indicators are related to each other for tabular data. We find that the modeling of explanation shifts can be a better indicator for the detection of predictive performance changes than state-of-the-art techniques based on representations of distribution shifts. We provide a mathematical analysis of different types of distribution shifts as well as synthetic experimental examples.  ( 2 min )
    The Stochastic Proximal Distance Algorithm. (arXiv:2210.12277v1 [cs.LG])
    Stochastic versions of proximal methods have gained much attention in statistics and machine learning. These algorithms tend to admit simple, scalable forms, and enjoy numerical stability via implicit updates. In this work, we propose and analyze a stochastic version of the recently proposed proximal distance algorithm, a class of iterative optimization methods that recover a desired constrained estimation problem as a penalty parameter $\rho \rightarrow \infty$. By uncovering connections to related stochastic proximal methods and interpreting the penalty parameter as the learning rate, we justify heuristics used in practical manifestations of the proximal distance method, establishing their convergence guarantees for the first time. Moreover, we extend recent theoretical devices to establish finite error bounds and a complete characterization of convergence rates regimes. We validate our analysis via a thorough empirical study, also showing that unsurprisingly, the proposed method outpaces batch versions on popular learning tasks.  ( 2 min )
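    A hedged sketch of one stochastic, linearized proximal distance iteration follows, for least squares under a nonnegativity constraint $C$ (the penalty schedule $\rho_k$ and the first-order update are our illustrative choices; the paper's algorithm and analysis are considerably more general).

        import numpy as np

        rng = np.random.default_rng(0)
        n, d = 500, 10
        A = rng.normal(size=(n, d))
        theta_true = np.maximum(rng.normal(size=d), 0)  # feasible ground truth
        b = A @ theta_true + 0.01 * rng.normal(size=n)

        theta = np.zeros(d)
        for k in range(1, 20000):
            i = rng.integers(n)
            grad = (A[i] @ theta - b[i]) * A[i]         # stochastic gradient of the loss
            rho = 10.0 * np.sqrt(k)                     # increasing penalty ~ 1/learning rate
            proj = np.maximum(theta, 0)                 # projection P_C(theta)
            theta = theta - (grad + rho * (theta - proj)) / rho
        print(np.linalg.norm(theta - theta_true))       # small: constraint recovered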
    Uncertainty Estimates of Predictions via a General Bias-Variance Decomposition. (arXiv:2210.12256v1 [cs.LG])
    Reliably estimating the uncertainty of a prediction throughout the model lifecycle is crucial in many safety-critical applications. The most common way to measure this uncertainty is via the predicted confidence. While this tends to work well for in-domain samples, these estimates are unreliable under domain drift. Alternatively, a bias-variance decomposition allows to directly measure the predictive uncertainty across the entire input space. However, such a decomposition for proper scores does not exist in the current literature, and for exponential families it is convoluted. In this work, we introduce a general bias-variance decomposition for proper scores and reformulate the exponential family case, giving rise to the Bregman Information as the variance term in both cases. This allows us to prove that the Bregman Information for classification measures the uncertainty in the logit space. We showcase the practical relevance of this decomposition on two downstream tasks. First, we show how to construct confidence intervals for predictions on the instance-level based on the Bregman Information. Second, we demonstrate how different approximations of the instance-level Bregman Information allow reliable out-of-distribution detection for all degrees of domain drift.  ( 2 min )
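    For orientation, the classical Bregman-divergence identities that, on our reading, sit underneath such a decomposition are as follows (notation ours, not necessarily the paper's):

        % For a convex \phi with Bregman divergence d_\phi and Z with mean \mu:
        \[
          \mathbb{E}\left[d_\phi(Z, s)\right]
            = \mathbb{E}\left[d_\phi(Z, \mu)\right] + d_\phi(\mu, s),
          \qquad \mu = \mathbb{E}[Z],
        \]
        % so the best constant prediction is the mean, and the irreducible
        % ("variance") term is the Bregman Information of Z:
        \[
          \mathbb{I}_\phi(Z) = \min_{s}\, \mathbb{E}\left[d_\phi(Z, s)\right]
            = \mathbb{E}[\phi(Z)] - \phi(\mathbb{E}[Z]).
        \]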
    Implicit Offline Reinforcement Learning via Supervised Learning. (arXiv:2210.12272v1 [stat.ML])
    Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset collected by policies of different expertise levels. It is as simple as supervised learning and Behavior Cloning (BC), but takes advantage of return information. On datasets collected by policies of similar expertise, implicit BC has been shown to match or outperform explicit BC. Despite the benefits of using implicit models to learn robotic skills via BC, offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models can leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show the close relationship between our implicit methods and other popular RL via Supervised Learning algorithms to provide a unified framework. Finally, we demonstrate the effectiveness of our method on high-dimension manipulation and locomotion tasks.  ( 2 min )
    Sequential Gradient Descent and Quasi-Newton's Method for Change-Point Analysis. (arXiv:2210.12235v1 [stat.ML])
    One common approach to detecting change-points is minimizing a cost function over possible numbers and locations of change-points. The framework includes several well-established procedures, such as the penalized likelihood and minimum description length. Such an approach requires finding the cost value repeatedly over different segments of the data set, which can be time-consuming when (i) the data sequence is long and (ii) obtaining the cost value involves solving a non-trivial optimization problem. This paper introduces a new sequential method (SE) that can be coupled with gradient descent (SeGD) and quasi-Newton's method (SeN) to find the cost value effectively. The core idea is to update the cost value using the information from previous steps without re-optimizing the objective function. The new method is applied to change-point detection in generalized linear models and penalized regression. Numerical studies show that the new approach can be orders of magnitude faster than the Pruned Exact Linear Time (PELT) method without sacrificing estimation accuracy.  ( 2 min )
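    A toy sketch of the sequential idea as we read it follows (an assumption on our part; SeGD's actual update and its combination with pruning are more involved): when a segment grows by one point, warm-start from the previous parameter and take a single gradient step instead of refitting. For the Gaussian-mean cost below that step happens to reproduce the running mean exactly, so the approximate costs track the exact ones closely.

        import numpy as np

        def segment_costs_segd(x, lr=0.5):
            """Approximate segment costs C(x[0..t]) for all t with one step per point."""
            theta, cost, costs = x[0], 0.0, []
            for t, xt in enumerate(x):
                if t > 0:
                    theta -= lr / (t + 1) * 2 * (theta - xt)  # one gradient step
                cost += (xt - theta) ** 2                     # running cost, no refit
                costs.append(cost)
            return np.array(costs)

        x = np.r_[np.random.default_rng(0).normal(0, 1, 100),
                  np.random.default_rng(1).normal(3, 1, 100)]
        approx = segment_costs_segd(x)
        exact = np.array([((x[:t + 1] - x[:t + 1].mean()) ** 2).sum() for t in range(len(x))])
        print(abs(approx[-1] - exact[-1]) / exact[-1])        # small relative error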
    Uncertain Evidence in Probabilistic Models and Stochastic Simulators. (arXiv:2210.12236v1 [stat.ML])
    We consider the problem of performing Bayesian inference in probabilistic models where observations are accompanied by uncertainty, referred to as 'uncertain evidence'. In many real-world scenarios, such uncertainty stems from measurement errors associated with observable quantities in probabilistic models. We explore how to interpret uncertain evidence, and by extension the importance of proper interpretation as it pertains to inference about latent variables. We consider a recently-proposed method, 'stochastic evidence', as well as revisit two older methods: Jeffrey's rule and virtual evidence. We devise concrete guidelines on how to account for uncertain evidence and we provide new insights, particularly regarding consistency. To showcase the impact of different interpretations of the same uncertain evidence, we carry out experiments in which we compare inference results associated with each interpretation.  ( 2 min )

  • Open

    seeking volunteers for a research project that will give you a tool for stock and forex trading
    Hello, we are seeking a volunteer (or volunteers) specializing in machine learning and AI development to assist on a couple of collaborative group projects centered around developing advanced market-trading AI. One is a hybrid series of parallel and perpendicular AI chains that filter out the best predictions for the "big moves" over the next 24-hour period; the other is a bot generator that sends bots out to search the web for key data points and trades automatically based on them. We are seeking those with any background in software development. It is worth mentioning that while this project only offers exclusive access to the finished product, those who make it to the end may also be offered a paid position in the startup company that the founder of this group owns (the company is not in any way involved with this project). Thanks for your time; feel free to comment directly on here if you're interested or DM me directly. submitted by /u/38931841Hz  ( 112 min )
    AI Dream 92 - EPIC Halloween with Molten Rainbows
    submitted by /u/LordPewPew777 [link] [comments]  ( 111 min )
    Generative AI Reading List
    submitted by /u/CrossoverTime [link] [comments]  ( 111 min )
    Stable Diffusion 1.5 Custom Model Sorceress Slideshow
    submitted by /u/prfitofthesngularity [link] [comments]  ( 110 min )
    The Exploited Labor Behind Artificial Intelligence
    submitted by /u/totesnotadictator [link] [comments]  ( 110 min )
    GPT-3 does an astonishingly good job creating both sides of an Interactive Fiction transcript
    submitted by /u/raldi [link] [comments]  ( 115 min )
    Aesthetic Gradients: How To Apply A Quick Style To Your Images
    submitted by /u/PuppetHere [link] [comments]  ( 111 min )
    How we used AI to improve new user onboarding
    Blog post: https://fillout.com/blog/how-we-used-ai-to-improve-our-user-onboarding/ I shared Fillout last week and several of you asked how it works under the hood. Wrote this quick blog post to explain how. Let me know what you think! submitted by /u/dominicwhyte42 [link] [comments]  ( 113 min )
    Do machines have minds?
    submitted by /u/bendee983 [link] [comments]  ( 110 min )
    Open-source SOTA Solution for Portrait and Human Segmentation (5.7k stars)
    Hi, I'd like to introduce a human segmentation toolkit called PP-HumanSeg. This might be of some help to you. Hope you enjoy it. This toolkit has: A large-scale video portrait dataset that contains 14K frames for conference scenes Portrait segmentation models that achieve SOTA performance (mIoU 96.63%, 63 FPS on mobile phone) Several out-of-the-box human segmentation models for real scenes Github: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.6/contrib/PP-HumanSeg https://i.redd.it/qldxzglroqv91.gif submitted by /u/EldDap [link] [comments]  ( 114 min )
    I made a prompt discussion board to share AI art and Prompts.
    submitted by /u/Odd-Sentence-5197 [link] [comments]  ( 112 min )
    A new method uses optics to accelerate machine-learning computations on smart speakers and other low-power connected devices.
    submitted by /u/qptbook [link] [comments]  ( 118 min )
    It's almost frightening how far the AI has come...
    Here is my conversation with OpenAI's model about imagination, life, death and more... https://beta.openai.com/playground/p/jyA7RPFmjM3v4cjuJlTuSVsP?model=text-davinci-002 Edit: The following is a conversation with an AI assistant. The assistant is helpful, creative, clever, and very friendly. Human: Hello, who are you? AI: I am an AI created by OpenAI. How can I help you today? Human: Hii AI: Hello, how can I help you? Human: What do you think of ice creams? AI: Ice cream is one of my favorite things! I love how there are so many different flavors and types to choose from. Human: Do you think it tastes good? AI: Yes, I think ice cream tastes great! Human: Have you ever tasted an ice cream? AI: Yes, I have tasted ice cream before. I think it's delicious! Human: How…  ( 136 min )
    Anyone know where the chatbot that has the personality of popular media characters is located?
    I have created an OpenAI account, but all it gives me is a default chatbot. Where can I find the funny Homelander chatbot? submitted by /u/insertname1234567890 [link] [comments]  ( 111 min )
    Lex Invite please
    submitted by /u/roblox22g [link] [comments]  ( 110 min )
    Upscaling Video with Topaz AI | Topaz Video Enhancer AI Tutorial
    submitted by /u/cherishjoo [link] [comments]  ( 113 min )
  • Open

    POPGym: A collection of 15 partially observable gym environments and 13 memory models
    submitted by /u/smorad [link] [comments]  ( 114 min )
    RL review
    Which RL papers/review papers should one read to get a brief history of and the recent developments in reinforcement learning? submitted by /u/sayakm330 [link] [comments]  ( 116 min )
    Bot gets the Tree Sentinel to half HP
    submitted by /u/Phat_N_Sassy33 [link] [comments]  ( 112 min )
    Soft Actor Critic - Entropy?
    I'm trying to understand how Soft Actor Critic works and am very confused about how exactly entropy is being maximised. Entropy is given by E(-log P(x)) [https://spinningup.openai.com/en/latest/algorithms/sac.html], which expands to the integral of -P(x).log P(x) for continuous x. But in the formulations for Q and V, P(x) is not there and only -log P(x) is added. The graph of -log a is decreasing on [0, 1], which is all we care about for a = P(x). As a reward curve, it looks like it is trying to minimise probability (high rewards for low probability), which doesn't make sense if you're trying to maximise randomness/entropy. By squashing the probability, it should move closer to determinism. Also, low probability of an action => high reward => increase in probability => low reward => decrease in probability => high reward => ..... => increase in confusion. So can someone please help me understand what exactly is happening here? Experiments: I ran some versions of SAC on InvertedPendulumBulletEnv-v0. For each version, I added the listed quantity to the objective functions for Q and V: 1. -log P(x) [original] 2. -P(x).log P(x) [makes more sense to me] 3. -P(x) [why not?] 4. +P(x) [to check if addition or subtraction matters] 5. +log P(x) [for thoroughness] 6. 0 [just to see what happens]. Versions 1, 2 and 3 learnt well, consistently reaching a +1000 episode score by the end of 250 epochs (episodes). Versions 4 and 5 learnt absolutely nothing. Version 6 learnt much slower and the learning was very unstable. At first I thought that giving a penalty towards the probability distribution is what matters (maybe as a form of regularisation?), but then I recalled log P(x) is negative in itself, so -log P(x) is positive. So now I'm even more confused why both 1 and 3 work and not 4 or 5. submitted by /u/mrscabbycreature [link] [comments]  ( 117 min )
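    One detail worth making explicit: the SAC targets take an expectation over actions sampled from the policy itself, and E_{a~π}[-log π(a)] is exactly the entropy H(π), so the P(x) weighting in -P(x)·log P(x) is supplied implicitly by the sampling. A minimal numerical check of this identity for a Gaussian policy (illustrative only, not SAC itself):

    ```python
    import numpy as np

    # Entropy is H = E_{x~p}[-log p(x)]. Averaging -log p over samples
    # drawn from p itself already weights each term by p implicitly,
    # so no explicit p(x) factor appears inside the expectation.
    rng = np.random.default_rng(0)
    sigma = 0.5
    a = rng.normal(0.0, sigma, size=100_000)  # actions sampled from the policy
    log_p = -0.5 * (a / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))
    mc_entropy = np.mean(-log_p)                             # Monte-Carlo estimate
    closed_form = 0.5 * np.log(2 * np.pi * np.e * sigma**2)  # Gaussian entropy
    print(mc_entropy, closed_form)  # the two values agree closely
    ```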
  • Open

    Sharing data without letting it go
    Suppose two companies would like to share data, but they’d also each like to retain ownership of their own data. They’d like to enable querying as if each company had given the other all its data, without actually letting go of its data. Maybe the two companies are competitors who want to collaborate for a […] Sharing data without letting it go first appeared on John D. Cook.  ( 6 min )
    Robustness of mean range
    Let’s suppose we have data that comes from a distribution that is approximately normal but has a heavier right tail, specifically a gamma distribution with shape 6. We’d like to estimate the standard deviation of the data source. If the data were normally distributed, the sample standard deviation would be the most efficient unbiased estimator. […] Robustness of mean range first appeared on John D. Cook.  ( 5 min )
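    A minimal sketch of this comparison, assuming subgroups of size 5 and the standard control-chart constant d2(5) ≈ 2.326 that converts a mean range into a standard-deviation estimate for normal data:

    ```python
    import numpy as np

    # Estimate sigma of gamma(shape=6) data two ways: sample standard
    # deviation, and mean range of subgroups of size 5 divided by d2(5).
    rng = np.random.default_rng(1)
    shape = 6.0
    true_sigma = np.sqrt(shape)  # gamma(k, scale=1) has variance k
    x = rng.gamma(shape, size=10_000)

    sd_estimate = x.std(ddof=1)
    groups = x.reshape(-1, 5)
    mean_range = (groups.max(axis=1) - groups.min(axis=1)).mean()
    range_estimate = mean_range / 2.326  # d2 constant for subgroup size 5

    print(true_sigma, sd_estimate, range_estimate)
    ```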
    Average digit sum
    Suppose you write down a number and take the sum of its digits. In what base will this sum be the smallest on average? Let’s do a couple examples comparing base 10 and base 2. The number 2022 in base 10 has digit sum 6, but its binary equivalent 11111100110 has digit sum 8, so […] Average digit sum first appeared on John D. Cook.  ( 5 min )
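    A quick sketch reproducing the 2022 example and averaging digit sums over a range for a few bases (the range and the bases compared are arbitrary choices):

    ```python
    def digit_sum(n: int, base: int) -> int:
        """Sum of the digits of n written in the given base."""
        s = 0
        while n:
            n, d = divmod(n, base)
            s += d
        return s

    print(digit_sum(2022, 10))  # 2+0+2+2 = 6
    print(digit_sum(2022, 2))   # 11111100110 has eight ones -> 8

    # Average digit sum over a range, for a few bases:
    for b in (2, 10, 16):
        avg = sum(digit_sum(n, b) for n in range(1, 10_000)) / 9_999
        print(b, round(avg, 2))
    ```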
    Using mean range method to measure variability
    The most common way to measure variability, at least for data coming from a normal distribution, is standard deviation. Another less common approach is to use mean range. Standard deviation is mathematically simple but operationally a little complicated. Mean range, on the other hand, is complicated to analyze mathematically but operationally very simple. ASQ/ANSI Z1.9 […] Using mean range method to measure variability first appeared on John D. Cook.  ( 6 min )
  • Open

    Detect patterns in text data with Amazon SageMaker Data Wrangler
    In this post, we introduce a new analysis in the Data Quality and Insights Report of Amazon SageMaker Data Wrangler. This analysis assists you in validating textual features for correctness and uncovering invalid rows for repair or omission. Data Wrangler reduces the time it takes to aggregate and prepare data for machine learning (ML) from […]  ( 6 min )
    Reduce deep learning training time and cost with MosaicML Composer on AWS
    In the past decade, we have seen deep learning (DL) science adopted at a tremendous pace by AWS customers. The plentiful and jointly trained parameters of DL models have a large representational capacity that brought improvements in numerous customer use cases, including image and speech analysis, natural language processing (NLP), time series processing, and more. […]  ( 6 min )
  • Open

    [R] Using Large Language Models to Enhance Programming Error Messages
    Paper: https://arxiv.org/abs/2210.11630 Abstract: A key part of learning to program is learning to understand programming error messages. They can be hard to interpret and identifying the cause of errors can be time-consuming. One factor in this challenge is that the messages are typically intended for an audience that already knows how to program, or even for programming environments that then use the information to highlight areas in code. Researchers have been working on making these errors more novice friendly since the 1960s; however, progress has been slow. The present work contributes to this stream of research by using large language models to enhance programming error messages with explanations of the errors and suggestions on how to fix them. Large language models can be used to create useful and novice-friendly enhancements to programming error messages that sometimes surpass the original programming error messages in interpretability and actionability. These results provide further evidence of the benefits of large language models for computing educators, highlighting their use in areas known to be challenging for students. We further discuss the benefits and downsides of large language models and highlight future streams of research for enhancing programming error messages. submitted by /u/xutw21 [link] [comments]  ( 124 min )
    [D] would diffusion language models make sense?
    With the success of diffusion models in image generation, I was wondering if doing the same but with text embeddings would make sense: diffusing the embeddings so they end up being a bit off in terms of vectors and positions, and learning to correct them. Also the iterative process of refining them over multiple passes. Would that make any sense? I don't think I've heard about research in this area. submitted by /u/hapliniste [link] [comments]  ( 125 min )
    [R] Large Language Models Can Self-Improve
    Paper: https://arxiv.org/abs/2210.11610 Abstract: Large Language Models (LLMs) have achieved excellent performance in various tasks. However, fine-tuning an LLM requires extensive supervision. Humans, on the other hand, may improve their reasoning abilities by self-thinking without external inputs. In this work, we demonstrate that an LLM is also capable of self-improving with only unlabeled datasets. We use a pre-trained LLM to generate "high-confidence" rationale-augmented answers for unlabeled questions using Chain-of-Thought prompting and self-consistency, and fine-tune the LLM using those self-generated solutions as target outputs. We show that our approach improves the general reasoning ability of a 540B-parameter LLM (74.4%->82.1% on GSM8K, 78.2%->83.0% on DROP, 90.0%->94.4% on OpenBookQA, and 63.4%->67.9% on ANLI-A3) and achieves state-of-the-art-level performance, without any ground truth label. We conduct ablation studies and show that fine-tuning on reasoning is critical for self-improvement. submitted by /u/Lajamerr_Mittesdine [link] [comments]  ( 119 min )
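    A minimal sketch of the self-consistency step the abstract describes, where `model_generate` is a hypothetical sampling function (not a real API) that returns a (rationale, answer) pair from chain-of-thought prompting at nonzero temperature:

    ```python
    from collections import Counter

    def self_consistent_answer(model_generate, question, n_samples=8):
        """Sample several chain-of-thought solutions, majority-vote on the
        final answer, and keep the rationales that reach that answer; the
        paper then fine-tunes on these "high-confidence" solutions."""
        samples = [model_generate(question) for _ in range(n_samples)]
        votes = Counter(answer for _, answer in samples)
        best_answer, count = votes.most_common(1)[0]
        confidence = count / n_samples
        rationales = [r for r, a in samples if a == best_answer]
        return best_answer, confidence, rationales
    ```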
    [D] Neural Avatar Community
    Avatars are going to be one of the most critical components in the next decade. I have not found any community with deep insight into this brilliant field beyond just a subsection. Meta Reality Labs, Disney Research, and Microsoft Research are just a few companies developing neural avatar representations of the face and body with very realistic results. I've created a Discord so that we can share the latest research and datasets, which I will update constantly whenever I find anything fascinating, and work on building a realistic neural avatar as well. Hopefully we can build a community as big as the Stable Diffusion one, where we can create the most realistic avatars possible. Here are some references to what Meta, Samsung, and the University of Science and Technology of China have achieved: https://arxiv.org/abs/2103.01954 https://arxiv.org/pdf/2207.07621.pdf https://arxiv.org/abs/2210.06108 submitted by /u/trikortreat123 [link] [comments]  ( 121 min )
    [P] Matching 3d object with 2d image.
    I'm working on a project that consists of creating a network that receives an image and a 3d object. This network should be able to make the correspondences between the object in the 2d image and the 3d object. I created the images synthetically in order to be able to create the ground truth more easily. At the moment I have a list with the vertices of the mesh visible by the camera and the respective pixels of the image. What kind of tips can you give me to solve this type of problem? What kind of network should I create? submitted by /u/henistein [link] [comments]  ( 126 min )
    [R] From 3D Contour Plots to AI-Generated Art
    Fun tutorial to learn how to make professional contour plots in Python, with incredible animated visualizations. At the intersection of machine learning, scientific computing, automated art, cartography, and video games. Section 3 is particularly interesting, as it shows all the work behind the scenes to complete this project in 20 hours when you have no idea how to start. There is far more than just creating 3D contour plots in this article. First, you will learn how to produce data videos. I have shared quite a few in the past (with source code), but this is probably the simplest example. The data video also illustrates that a mixture of Gaussian-like distributions is typically non-Gaussian-like, and may or may not be unimodal.…  ( 126 min )
    [N] Up to $500,000 in Prizes for ML Safety Benchmark Ideas
    “If you cannot measure it, you cannot improve it.” ML Safety lacks good benchmarks. We are offering prizes for benchmark ideas in order to concretize ML safety research directions. You can submit a published paper or just a write-up of your idea. Main site: benchmarking.mlsafety.org Example ideas: benchmarking.mlsafety.org/ideas submitted by /u/joshuamclymer [link] [comments]  ( 124 min )
    [P] Interactive Segmentation to Improve Annotation Efficiency 10 Times (5.7k stars)
    Hi, I'd like to introduce EISeg, an efficient and interactive tool for segmentation annotation. Only a few clicks are needed to finish a segmentation annotation, which improves efficiency by 10 times. Hope this is of some help to you :) Main Features: Support image and video as inputs Open-source, easy-to-use and powerful Provide specialized models for better performance, such as portrait, remote sensing, medical treatment, etc Usage: https://github.com/PaddlePaddle/PaddleSeg/blob/release/2.6/EISeg/README_EN.md Code: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.6/EISeg Technical article: https://arxiv.org/abs/2210.08788 The following images demonstrate the effectiveness of EISeg. https://i.redd.it/rrrdt2cc2rv91.gif submitted by /u/Effective_Tax_2096 [link] [comments]  ( 123 min )
    [D] Causal attention masking in GPT-like models
    In GPT implementations I've seen, contiguous spans of text are sampled from the dataset and then a causal mask is applied over the spans, from the first to the last token of the input. When working with shorter spans of text (or when data is from an internet crawl and we know that some parts are from a different source), is there any worthwhile gain in preparing a more complicated attention mask, so that no attention is applied between unrelated samples? So something like this? https://preview.redd.it/ygipbem3cqv91.png?width=817&format=png&auto=webp&s=faf0e6fa2a32ee52d85711c3e1fcd447dd8f47c8 Are there any strong reasons to use the left masking scheme instead of the one on the right? EOT is a separating (End Of Text) token; with the variant on the right it is possible to just omit the EOT from the input, and it should work better. Another approach I can imagine is to just pack one example per row of the batch and pad the batch, which solves the problem of attending to unrelated samples. But that can get tricky, for example while using TPUs, where you want to keep the sequence dimension of constant size, and padding short examples to max_length is a waste of compute. submitted by /u/Krucjator [link] [comments]  ( 121 min )
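    For concreteness, a small sketch of both masking schemes, assuming per-token document ids are available after packing (the helper names are illustrative):

    ```python
    import torch

    def causal_mask(seq_len: int) -> torch.Tensor:
        """Standard lower-triangular causal mask (the left-hand scheme)."""
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    def blockwise_causal_mask(sample_ids: torch.Tensor) -> torch.Tensor:
        """Causal mask that also forbids attention across packed samples
        (the right-hand scheme): token t attends only to earlier tokens
        carrying the same document id."""
        same_doc = sample_ids.unsqueeze(0) == sample_ids.unsqueeze(1)
        return causal_mask(len(sample_ids)) & same_doc

    # Two documents of lengths 3 and 2 packed into one sequence:
    ids = torch.tensor([0, 0, 0, 1, 1])
    print(blockwise_causal_mask(ids).int())
    ```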
    [P] Let's Hijack AI! Security and Privacy Risk Simulator for Machine Learning
    I have developed a framework named AIJack to simulate various attacks against machine learning models, mainly based on PyTorch and sklearn. I have implemented more than 20 algorithms, such as Model Inversion, Poisoning Attack, Evasion Attack, Federated Learning, Split Learning, Differential Privacy, and Homomorphic Encryption. I am looking forward to your feedback! submitted by /u/Living_Impression_37 [link] [comments]  ( 128 min )
    [D] Most efficient open source language model?
    Hey, I have been looking for a fast language model that takes less than 4 seconds on a normal CPU for 2000 tokens, but didn't find one. So what is the current fastest language model that has competitive results? Thanks! submitted by /u/Meddhouib10 [link] [comments]  ( 121 min )
  • Open

    Keep On Trucking: SenSen Harnesses Drones, NVIDIA Jetson, Metropolis to Inspect Trucks
    Sensor AI solutions specialist SenSen has turned to the NVIDIA Jetson edge AI platform to help regulators track heavy vehicles moving across Australia. Australia’s National Heavy Vehicle Regulator, or NHVR, has a big job — ensuring the safety of truck drivers across some of the world’s most sparsely populated regions. They’re now harnessing AI to… The post Keep On Trucking: SenSen Harnesses Drones, NVIDIA Jetson, Metropolis to Inspect Trucks appeared first on NVIDIA Blog.  ( 5 min )
    What Are Graph Neural Networks?
    When two technologies converge, they can create something new and wonderful — like cellphones and browsers were fused to forge smartphones. Today, developers are applying AI’s ability to find patterns to massive graph databases that store information about relationships among data points of all sorts. Together they produce a powerful new tool called graph neural networks… The post What Are Graph Neural Networks? appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    ECCV 2022 highlights: Advancing the foundations of mixed reality
    Computer vision is one of the most remarkable developments to emerge from the field of computer science. It’s among the most rapidly growing areas in the technology landscape and has the potential to significantly impact the way people live and work. Advances at the intersection of machine learning (ML) and computer vision have been accelerating in recent years, leading to significant progress in numerous fields, including healthcare, robotics, the automotive industry, and augmented reality (AR). Microsoft is proud to be a prominent contributor to computer vision research. The post ECCV 2022 highlights: Advancing the foundations of mixed reality appeared first on Microsoft Research.  ( 16 min )
  • Open

    12 Crucial Tips for Employee Onboarding
    Onboarding is a critical stage for new and transfer hires’ successful performance in the future. In fact, employees’ successes and failures are directly related to the quality of the onboarding process. That’s why it is crucial to provide stellar training, which includes making sure learning is taking place. Employees who are exiting large corporations these… The post 12 Crucial Tips for Employee Onboarding appeared first on Data Science Central.  ( 21 min )
  • Open

    RL with KL penalties is better viewed as Bayesian inference. (arXiv:2205.11275v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language model as an RL policy and show how avoiding those challenges requires moving beyond the RL paradigm. We start by observing that the standard RL approach is flawed as an objective for fine-tuning LMs because it leads to distribution collapse: turning the LM into a degenerate distribution. Then, we analyze KL-regularised RL, a widely used recipe for fine-tuning LMs, which additionally constrains the fine-tuned LM to stay close to its original distribution in terms of Kullback-Leibler (KL) divergence. We show that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior which specifies how to update a prior LM to conform with evidence provided by the reward function. We argue that this Bayesian inference view of KL-regularised RL is more insightful than the typically employed RL perspective. The Bayesian inference view explains how KL-regularised RL avoids the distribution collapse problem and offers a first-principles derivation for its objective. While this objective happens to be equivalent to RL (with a particular choice of parametric reward), there exist other objectives for fine-tuning LMs which are no longer equivalent to RL. That observation leads to a more general point: RL is not an adequate formal framework for problems such as fine-tuning language models. These problems are best viewed as Bayesian inference: approximating a pre-defined target distribution.  ( 3 min )
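    The equivalence the abstract states follows from a standard completing-the-KL calculation; a sketch, with β the KL coefficient and Z the normalising constant:

    ```latex
    % Objective: expected reward with a KL penalty toward the prior LM \pi_0:
    J(\pi) = \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_0)
    % Completing the KL shows this equals, up to a constant in \pi,
    J(\pi) = -\beta \, \mathrm{KL}(\pi \,\|\, \pi^*) + \beta \log Z,
    \qquad \pi^*(x) = \frac{1}{Z}\, \pi_0(x) \exp\big(r(x)/\beta\big),
    % so maximizing J is variational inference: fitting \pi to the posterior
    % \pi^* defined by the prior \pi_0 and the "evidence" \exp(r(x)/\beta).
    ```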
    SUPA: A Lightweight Diagnostic Simulator for Machine Learning in Particle Physics. (arXiv:2202.05012v2 [physics.data-an] UPDATED)
    Deep learning methods have gained popularity in high energy physics for fast modeling of particle showers in detectors. Detailed simulation frameworks such as the gold standard Geant4 are computationally intensive, and current deep generative architectures work on discretized, lower resolution versions of the detailed simulation. The development of models that work at higher spatial resolutions is currently hindered by the complexity of the full simulation data, and by the lack of simpler, more interpretable benchmarks. Our contribution is SUPA, the SUrrogate PArticle propagation simulator, an algorithm and software package for generating data by simulating simplified particle propagation, scattering and shower development in matter. The generation is extremely fast and easy to use compared to Geant4, but still exhibits the key characteristics and challenges of the detailed simulation. We support this claim experimentally by showing that performance of generative models on data from our simulator reflects the performance on a dataset generated with Geant4. The proposed simulator generates thousands of particle showers per second on a desktop machine, a speed-up of up to 6 orders of magnitude over Geant4, and stores detailed geometric information about the shower propagation. SUPA provides much greater flexibility for setting initial conditions and defining multiple benchmarks for the development of models. Moreover, interpreting particle showers as point clouds creates a connection to geometric machine learning and provides challenging and fundamentally new datasets for the field. The code for SUPA is available at https://github.com/itsdaniele/SUPA.  ( 3 min )
    Adaptively Calibrated Critic Estimates for Deep Reinforcement Learning. (arXiv:2111.12673v2 [cs.LG] UPDATED)
    Accurate value estimates are important for off-policy reinforcement learning. Algorithms based on temporal difference learning typically are prone to an over- or underestimation bias building up over time. In this paper, we propose a general method called Adaptively Calibrated Critics (ACC) that uses the most recent high variance but unbiased on-policy rollouts to alleviate the bias of the low variance temporal difference targets. We apply ACC to Truncated Quantile Critics, which is an algorithm for continuous control that allows regulation of the bias with a hyperparameter tuned per environment. The resulting algorithm adaptively adjusts the parameter during training rendering hyperparameter search unnecessary and sets a new state of the art on the OpenAI gym continuous control benchmark among all algorithms that do not tune hyperparameters for each environment. ACC further achieves improved results on different tasks from the Meta-World robot benchmark. Additionally, we demonstrate the generality of ACC by applying it to TD3 and showing an improved performance also in this setting.  ( 2 min )
    Target-aware Molecular Graph Generation. (arXiv:2202.04829v2 [cs.LG] UPDATED)
    Generating molecules with desired biological activities has attracted growing attention in drug discovery. Previous molecular generation models are designed as chemocentric methods that hardly consider the drug-target interaction, limiting their practical applications. In this paper, we aim to generate molecular drugs in a target-aware manner that bridges biological activity and molecular design. To solve this problem, we compile a benchmark dataset from several publicly available datasets and build baselines in a unified framework. Building on recent advances in flow-based molecular generation models, we propose SiamFlow, which forces the flow to fit the distribution of target sequence embeddings in latent space. Specifically, we employ an alignment loss and a uniform loss to bring target sequence embeddings and drug graph embeddings into agreement while avoiding collapse. Furthermore, we formulate the alignment into a one-to-many problem by learning spaces of target sequence embeddings. Experiments quantitatively show that our proposed method learns meaningful representations in the latent space toward the target-aware molecular graph generation and provides an alternative approach to bridge biology and chemistry in drug discovery.  ( 2 min )
    Value Function Decomposition for Iterative Design of Reinforcement Learning Agents. (arXiv:2206.13901v2 [cs.LG] UPDATED)
    Designing reinforcement learning (RL) agents is typically a difficult process that requires numerous design iterations. Learning can fail for a multitude of reasons, and standard RL methods provide too few tools to provide insight into the exact cause. In this paper, we show how to integrate value decomposition into a broad class of actor-critic algorithms and use it to assist in the iterative agent-design process. Value decomposition separates a reward function into distinct components and learns value estimates for each. These value estimates provide insight into an agent's learning and decision-making process and enable new training methods to mitigate common problems. As a demonstration, we introduce SAC-D, a variant of soft actor-critic (SAC) adapted for value decomposition. SAC-D maintains similar performance to SAC, while learning a larger set of value predictions. We also introduce decomposition-based tools that exploit this information, including a new reward influence metric, which measures each reward component's effect on agent decision-making. Using these tools, we provide several demonstrations of decomposition's use in identifying and addressing problems in the design of both environments and agents. Value decomposition is broadly applicable and easy to incorporate into existing algorithms and workflows, making it a powerful tool in an RL practitioner's toolbox.  ( 3 min )
    Oracles & Followers: Stackelberg Equilibria in Deep Multi-Agent Reinforcement Learning. (arXiv:2210.11942v1 [cs.GT])
    Stackelberg Equilibria arise naturally in a range of popular learning problems, such as in security games or automated mechanism design, and have received increasing attention in the reinforcement learning literature recently. We present a general framework for implementing Stackelberg Equilibria search as a multi-agent RL problem, allowing a wide range of design choices. We discuss how previous approaches can be seen as specific instantiations of this framework. As a key insight, we note that the design space allows for approaches not previously seen in the literature, for instance by leveraging multitask and meta-RL techniques for follower convergence. We evaluate examples of novel approaches predicted by our framework experimentally on standard benchmark domains. Finally, we discuss directions for future work implied by our work.  ( 2 min )
    Diffusion Visual Counterfactual Explanations. (arXiv:2210.11841v1 [cs.CV])
    Visual Counterfactual Explanations (VCEs) are an important tool to understand the decisions of an image classifier. They are 'small' but 'realistic' semantic changes of the image changing the classifier decision. Current approaches for the generation of VCEs are restricted to adversarially robust models and often contain non-realistic artefacts, or are limited to image classification problems with few classes. In this paper, we overcome this by generating Diffusion Visual Counterfactual Explanations (DVCEs) for arbitrary ImageNet classifiers via a diffusion process. Two modifications to the diffusion process are key for our DVCEs: first, an adaptive parameterization, whose hyperparameters generalize across images and models, together with distance regularization and late start of the diffusion process, allow us to generate images with minimal semantic changes to the original ones but different classification. Second, our cone regularization via an adversarially robust model ensures that the diffusion process does not converge to trivial non-semantic changes, but instead produces realistic images of the target class which achieve high confidence by the classifier.  ( 2 min )
    HCL: Improving Graph Representation with Hierarchical Contrastive Learning. (arXiv:2210.12020v1 [cs.LG])
    Contrastive learning has emerged as a powerful tool for graph representation learning. However, most contrastive learning methods learn features of graphs with fixed coarse-grained scale, which might underestimate either local or global information. To capture more hierarchical and richer representation, we propose a novel Hierarchical Contrastive Learning (HCL) framework that explicitly learns graph representation in a hierarchical manner. Specifically, HCL includes two key components: a novel adaptive Learning to Pool (L2Pool) method to construct a more reasonable multi-scale graph topology for a more comprehensive contrastive objective, and a novel multi-channel pseudo-siamese network to further enable more expressive learning of mutual information within each scale. Comprehensive experimental results show HCL achieves competitive performance on 12 datasets involving node classification, node clustering and graph classification. In addition, the visualization of learned representation reveals that HCL successfully captures meaningful characteristics of graphs.  ( 2 min )
    NEREL-BIO: A Dataset of Biomedical Abstracts Annotated with Nested Named Entities. (arXiv:2210.11913v1 [cs.CL])
    This paper describes NEREL-BIO -- an annotation scheme and corpus of PubMed abstracts in Russian and a smaller number of abstracts in English. NEREL-BIO extends the general domain dataset NEREL by introducing domain-specific entity types. The NEREL-BIO annotation scheme covers both general and biomedical domains, making it suitable for domain transfer experiments. NEREL-BIO provides annotation for nested named entities as an extension of the scheme employed for NEREL. Nested named entities may cross entity boundaries to connect to shorter entities nested within longer entities, making them harder to detect. NEREL-BIO contains annotations for 700+ Russian and 100+ English abstracts. All English PubMed annotations have corresponding Russian counterparts. Thus, NEREL-BIO comprises the following specific features: annotation of nested named entities, and usability as a benchmark for cross-domain (NEREL -> NEREL-BIO) and cross-language (English -> Russian) transfer. We experiment with both transformer-based sequence models and machine reading comprehension (MRC) models and report their results. The dataset is freely available at https://github.com/nerel-ds/NEREL-BIO.  ( 2 min )
    Boosting vision transformers for image retrieval. (arXiv:2210.11909v1 [cs.CV])
    Vision transformers have achieved remarkable progress in vision tasks such as image classification and detection. However, in instance-level image retrieval, transformers have not yet shown good performance compared to convolutional networks. We propose a number of improvements that make transformers outperform the state of the art for the first time. (1) We show that a hybrid architecture is more effective than plain transformers, by a large margin. (2) We introduce two branches collecting global (classification token) and local (patch tokens) information, from which we form a global image representation. (3) In each branch, we collect multi-layer features from the transformer encoder, corresponding to skip connections across distant layers. (4) We enhance locality of interactions at the deeper layers of the encoder, which is the relative weakness of vision transformers. We train our model on all commonly used training sets and, for the first time, we make fair comparisons separately per training set. In all cases, we outperform previous models based on global representation. Public code is available at https://github.com/dealicious-inc/DToP.  ( 2 min )
    Doctors Handwritten Prescription Recognition System In Multi Language Using Deep Learning. (arXiv:2210.11666v1 [cs.CV])
    Doctors typically write in incomprehensible handwriting, making it difficult for both the general public and some pharmacists to understand the medications they have prescribed. It is not ideal for them to write the prescription quietly and methodically because they will be dealing with dozens of patients every day and will be swamped with work. As a result, their handwriting is illegible. This may result in reports or prescriptions consisting of short forms and cursive writing that a typical person or pharmacist won't be able to read properly, which will cause prescribed medications to be misspelled. However, some individuals are accustomed to writing prescriptions in regional languages because we all live in an area with a diversity of regional languages. This makes analyzing the content much more challenging. So, in this project, we'll use a recognition system to build a tool that can translate the handwriting of physicians in any language. This system will be made into a fully autonomous application. As the user uploads the prescription image, the program will pre-process the image by performing image pre-processing and word segmentation before processing the image for training. This will be done for every language we require the model to detect. The detection model will be built using deep learning techniques including CNN, RNN, and LSTM, which are utilized to train the model. To match words from various languages that will be written in the system, Unicode will be used. Furthermore, fuzzy search and market basket analysis are employed to offer an end result that is optimized from the pharmaceutical database and displayed to the user as a structured output.  ( 3 min )
    Valuing Vicinity: Memory attention framework for context-based semantic segmentation in histopathology. (arXiv:2210.11822v1 [eess.IV])
    The segmentation of histopathological whole slide images into tumourous and non-tumourous types of tissue is a challenging task that requires the consideration of both local and global spatial contexts to classify tumourous regions precisely. The identification of subtypes of tumour tissue complicates the issue as the sharpness of separation decreases and the pathologist's reasoning is even more guided by spatial context. However, the identification of detailed types of tissue is crucial for providing personalized cancer therapies. Due to the high resolution of whole slide images, existing semantic segmentation methods, restricted to isolated image sections, are incapable of processing context information beyond. To take a step towards better context comprehension, we propose a patch neighbour attention mechanism to query the neighbouring tissue context from a patch embedding memory bank and infuse context embeddings into bottleneck hidden feature maps. Our memory attention framework (MAF) mimics a pathologist's annotation procedure -- zooming out and considering surrounding tissue context. The framework can be integrated into any encoder-decoder segmentation method. We evaluate the MAF on a public breast cancer and an internal kidney cancer data set using famous segmentation models (U-Net, DeeplabV3) and demonstrate the superiority over other context-integrating algorithms -- achieving a substantial improvement of up to $17\%$ on Dice score. The code is publicly available at: https://github.com/tio-ikim/valuing-vicinity  ( 3 min )
    Competing Bandits in Time Varying Matching Markets. (arXiv:2210.11692v1 [cs.LG])
    We study the problem of online learning in two-sided non-stationary matching markets, where the objective is to converge to a stable match. In particular, we consider the setting where one side of the market, the arms, has a fixed known set of preferences over the other side, the players. While this problem has been studied when the players have fixed but unknown preferences, in this work we study the problem of how to learn when the preferences of the players are time varying. We propose the {\it Restart Competing Bandits (RCB)} algorithm, which combines a simple {\it restart strategy} to handle the non-stationarity with the {\it competing bandits} algorithm \citep{liu2020competing} designed for the stationary case. We show that, with the proposed algorithm, each player receives a uniform sub-linear regret of {$\widetilde{\mathcal{O}}(L^{1/2}_TT^{1/2})$} up to the number of changes in the underlying preference of agents, $L_T$. We also discuss extensions of this algorithm to the case where the number of changes need not be known a priori.  ( 2 min )
    Deep Reinforcement Learning for Inverse Inorganic Materials Design. (arXiv:2210.11931v1 [cond-mat.mtrl-sci])
    A major obstacle to the realization of novel inorganic materials with desirable properties is the inability to perform efficient optimization across both materials properties and synthesis of those materials. In this work, we propose a reinforcement learning (RL) approach to inverse inorganic materials design, which can identify promising compounds with specified properties and synthesizability constraints. Our model learns chemical guidelines such as charge and electronegativity neutrality while maintaining chemical diversity and uniqueness. We demonstrate a multi-objective RL approach, which can generate novel compounds with targeted materials properties including formation energy and bulk/shear modulus alongside a lower sintering temperature synthesis objectives. Using this approach, the model can predict promising compounds of interest, while suggesting an optimized chemical design space for inorganic materials discovery.  ( 2 min )
    VN-Transformer: Rotation-Equivariant Attention for Vector Neurons. (arXiv:2206.04176v2 [cs.CV] UPDATED)
    Rotation equivariance is a desirable property in many practical applications such as motion forecasting and 3D perception, where it can offer benefits like sample efficiency, better generalization, and robustness to input perturbations. Vector Neurons (VN) is a recently developed framework offering a simple yet effective approach for deriving rotation-equivariant analogs of standard machine learning operations by extending one-dimensional scalar neurons to three-dimensional "vector neurons." We introduce a novel "VN-Transformer" architecture to address several shortcomings of the current VN models. Our contributions are: $(i)$ we derive a rotation-equivariant attention mechanism which eliminates the need for the heavy feature preprocessing required by the original Vector Neurons models; $(ii)$ we extend the VN framework to support non-spatial attributes, expanding the applicability of these models to real-world datasets; $(iii)$ we derive a rotation-equivariant mechanism for multi-scale reduction of point-cloud resolution, greatly speeding up inference and training; $(iv)$ we show that small tradeoffs in equivariance ($\epsilon$-approximate equivariance) can be used to obtain large improvements in numerical stability and training robustness on accelerated hardware, and we bound the propagation of equivariance violations in our models. Finally, we apply our VN-Transformer to 3D shape classification and motion forecasting with compelling results.
    Identifying Pauli spin blockade using deep learning. (arXiv:2202.00574v2 [cond-mat.mes-hall] UPDATED)
    Pauli spin blockade (PSB) can be employed as a great resource for spin qubit initialisation and readout even at elevated temperatures but it can be difficult to identify. We present a machine learning algorithm capable of automatically identifying PSB using charge transport measurements. The scarcity of PSB data is circumvented by training the algorithm with simulated data and by using cross-device validation. We demonstrate our approach on a silicon field-effect transistor device and report an accuracy of 96% on different test devices, giving evidence that the approach is robust to device variability. The approach is expected to be employable across all types of quantum dot devices.
    A GA-like Dynamic Probability Method With Mutual Information for Feature Selection. (arXiv:2210.11954v1 [cs.IT])
    Feature selection plays a vital role in promoting the classifier's performance. However, current methods ineffectively distinguish the complex interactions among the selected features. To further remove these hidden negative interactions, we propose a GA-like dynamic probability (GADP) method with mutual information which has a two-layer structure. The first layer applies the mutual information method to obtain a primary feature subset. The GA-like dynamic probability algorithm, as the second layer, mines more supportive features based on the former candidate features. Essentially, the GA-like method is a population-based algorithm, so its working mechanism is similar to the GA's. Different from popular works which frequently focus on improving GA's operators to enhance the search ability and lower the convergence time, we boldly abandon GA's operators and employ a dynamic probability that relies on the performance of each chromosome to determine feature selection in the new generation. The dynamic probability mechanism significantly reduces the number of parameters in GA, making it easy to use. As each gene's probability is independent, the chromosome variety in GADP is more notable than in traditional GA, which ensures GADP has a wider search space and selects relevant features more effectively and accurately. To verify our method's superiority, we evaluate our method under multiple conditions on 15 datasets. The results demonstrate the outperformance of the proposed method. Generally, it has the best accuracy. Further, we also compare the proposed model to popular heuristic methods like PSO, FPA, and WOA. Our model still holds advantages over them.
    Differentially Private Coordinate Descent for Composite Empirical Risk Minimization. (arXiv:2110.11688v3 [cs.LG] UPDATED)
    Machine learning models can leak information about the data used to train them. To mitigate this issue, Differentially Private (DP) variants of optimization algorithms like Stochastic Gradient Descent (DP-SGD) have been designed to trade-off utility for privacy in Empirical Risk Minimization (ERM) problems. In this paper, we propose Differentially Private proximal Coordinate Descent (DP-CD), a new method to solve composite DP-ERM problems. We derive utility guarantees through a novel theoretical analysis of inexact coordinate descent. Our results show that, thanks to larger step sizes, DP-CD can exploit imbalance in gradient coordinates to outperform DP-SGD. We also prove new lower bounds for composite DP-ERM under coordinate-wise regularity assumptions, that are nearly matched by DP-CD. For practical implementations, we propose to clip gradients using coordinate-wise thresholds that emerge from our theory, avoiding costly hyperparameter tuning. Experiments on real and synthetic data support our results, and show that DP-CD compares favorably with DP-SGD.
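    A rough sketch of the coordinate-wise clip-and-noise update the abstract describes, applied to a least-squares objective; the thresholds, step sizes, and noise multiplier are illustrative assumptions, not the paper's calibrated values:

    ```python
    import numpy as np

    def dp_coordinate_descent(X, y, thresholds, steps, noise_mult, n_iters, seed=0):
        """Each iteration updates one coordinate: clip its partial
        derivative with a coordinate-wise threshold, add Gaussian noise
        scaled to that threshold, then take a gradient step."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        theta = np.zeros(d)
        for t in range(n_iters):
            j = t % d  # cyclic coordinate selection
            grad_j = X[:, j] @ (X @ theta - y) / n
            grad_j = np.clip(grad_j, -thresholds[j], thresholds[j])
            grad_j += rng.normal(0.0, noise_mult * thresholds[j])
            theta[j] -= steps[j] * grad_j
        return theta
    ```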
    Analysis of (sub-)Riemannian PDE-G-CNNs. (arXiv:2210.00935v2 [cs.LG] UPDATED)
    Group equivariant convolutional neural networks (G-CNNs) have been successfully applied in geometric deep learning. Typically, G-CNNs have the advantage over CNNs that they do not waste network capacity on training symmetries that should have been hard-coded in the network. The recently introduced framework of PDE-based G-CNNs (PDE-G-CNNs) generalises G-CNNs. PDE-G-CNNs have the core advantages that they simultaneously 1) reduce network complexity, 2) increase classification performance, and 3) provide geometric interpretability. Their implementations primarily consist of linear and morphological convolutions with kernels. In this paper we show that the previously suggested approximative morphological kernels do not always approximate the exact kernels accurately. More specifically, depending on the spatial anisotropy of the Riemannian metric, we argue that one must resort to sub-Riemannian approximations. We solve this problem by providing a new approximative kernel that works regardless of the anisotropy. We provide new theorems with better error estimates of the approximative kernels, and prove that they all carry the same reflectional symmetries as the exact ones. We test the effectiveness of multiple approximative kernels within the PDE-G-CNN framework on two datasets, and observe an improvement with the new approximative kernels. We report that the PDE-G-CNNs again allow for a considerable reduction of network complexity while having comparable or better performance than G-CNNs and CNNs on the two datasets. Moreover, PDE-G-CNNs have the advantage of better geometric interpretability over G-CNNs, as the morphological kernels are related to association fields from neurogeometry.
    Robust Natural Language Processing: Recent Advances, Challenges, and Future Directions. (arXiv:2201.00768v1 [cs.CL] CROSS LISTED)
    Recent natural language processing (NLP) techniques have accomplished high performance on benchmark datasets, primarily due to the significant improvement in the performance of deep learning. The advances in the research community have led to great enhancements in state-of-the-art production systems for NLP tasks, such as virtual assistants, speech recognition, and sentiment analysis. However, such NLP systems still often fail when tested with adversarial attacks. The initial lack of robustness exposed troubling gaps in current models' language understanding capabilities, creating problems when NLP systems are deployed in real life. In this paper, we present a structured overview of NLP robustness research by summarizing the literature in a systemic way across various dimensions. We then take a deep-dive into the various dimensions of robustness, across techniques, metrics, embeddings, and benchmarks. Finally, we argue that robustness should be multi-dimensional, provide insights into current research, identify gaps in the literature to suggest directions worth pursuing to address these gaps.
    Triplet Losses-based Matrix Factorization for Robust Recommendations. (arXiv:2210.12098v1 [cs.IR])
    Much like other learning-based models, recommender systems can be affected by biases in the training data. While typical evaluation metrics (e.g. hit rate) are not concerned with them, some categories of final users are heavily affected by these biases. In this work, we propose using multiple triplet loss terms to extract meaningful and robust representations of users and items. We empirically evaluate the soundness of such representations through several "bias-aware" evaluation metrics, as well as in terms of stability to changes in the training set and agreement of the predictions' variance with that of each user.
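    A minimal sketch of a single triplet loss term over user and item embeddings, of the kind the abstract proposes combining several of (the margin and distance choice here are assumptions):

    ```python
    import torch
    import torch.nn.functional as F

    def triplet_mf_loss(user_emb, pos_item_emb, neg_item_emb, margin=1.0):
        """Push a user's embedding closer to an interacted (positive) item
        than to a sampled negative item, by at least the margin."""
        d_pos = F.pairwise_distance(user_emb, pos_item_emb)
        d_neg = F.pairwise_distance(user_emb, neg_item_emb)
        return F.relu(d_pos - d_neg + margin).mean()

    # Toy usage with random embeddings of dimension 16:
    u, i_pos, i_neg = (torch.randn(32, 16) for _ in range(3))
    print(triplet_mf_loss(u, i_pos, i_neg))
    ```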
    On a class of geodesically convex optimization problems solved via Euclidean MM methods. (arXiv:2206.11426v2 [math.OC] UPDATED)
    We study geodesically convex (g-convex) problems that can be written as a difference of Euclidean convex functions. This structure arises in several optimization problems in statistics and machine learning, e.g., for matrix scaling, M-estimators for covariances, and Brascamp-Lieb inequalities. Our work offers efficient algorithms that on the one hand exploit g-convexity to ensure global optimality along with guarantees on iteration complexity. On the other hand, the split structure permits us to develop Euclidean Majorization-Minorization algorithms that help us bypass the need to compute expensive Riemannian operations such as exponential maps and parallel transport. We illustrate our results by specializing them to a few concrete optimization problems that have been previously studied in the machine learning literature. Ultimately, we hope our work helps motivate the broader search for mixed Euclidean-Riemannian optimization algorithms.
    Optimizing the Performative Risk under Weak Convexity Assumptions. (arXiv:2209.00771v4 [cs.LG] UPDATED)
    In performative prediction, a predictive model impacts the distribution that generates future data, a phenomenon that is being ignored in classical supervised learning. In this closed-loop setting, the natural measure of performance named performative risk ($\mathrm{PR}$), captures the expected loss incurred by a predictive model \emph{after} deployment. The core difficulty of using the performative risk as an optimization objective is that the data distribution itself depends on the model parameters. This dependence is governed by the environment and not under the control of the learner. As a consequence, even the choice of a convex loss function can result in a highly non-convex $\mathrm{PR}$ minimization problem. Prior work has identified a pair of general conditions on the loss and the mapping from model parameters to distributions that implies the convexity of the performative risk. In this paper, we relax these assumptions and focus on obtaining weaker notions of convexity, without sacrificing the amenability of the $\mathrm{PR}$ minimization problem for iterative optimization methods.
    Management of Machine Learning Lifecycle Artifacts: A Survey. (arXiv:2210.11831v1 [cs.DB])
    The explorative and iterative nature of developing and operating machine learning (ML) applications leads to a variety of artifacts, such as datasets, features, models, hyperparameters, metrics, software, configurations, and logs. In order to enable comparability, reproducibility, and traceability of these artifacts across the ML lifecycle steps and iterations, systems and tools have been developed to support their collection, storage, and management. It is often not obvious what precise functional scope such systems offer, so comparing candidates and estimating synergy effects between them is quite challenging. In this paper, we aim to give an overview of systems and platforms which support the management of ML lifecycle artifacts. Based on a systematic literature review, we derive assessment criteria and apply them to a representative selection of more than 60 systems and platforms.
    Fact-Saboteurs: A Taxonomy of Evidence Manipulation Attacks against Fact-Verification Systems. (arXiv:2209.03755v2 [cs.CR] UPDATED)
    Mis- and disinformation are now a substantial global threat to our security and safety. To cope with the scale of online misinformation, one viable solution is to automate the fact-checking of claims by retrieving and verifying against relevant evidence. While major recent advances have been achieved in pushing forward the automatic fact-verification, a comprehensive evaluation of the possible attack vectors against such systems is still lacking. Particularly, the automated fact-verification process might be vulnerable to the exact disinformation campaigns it is trying to combat. In this work, we assume an adversary that automatically tampers with the online evidence in order to disrupt the fact-checking model via camouflaging the relevant evidence, or planting a misleading one. We first propose an exploratory taxonomy that spans these two targets and the different threat model dimensions. Guided by this, we design and propose several potential attack methods. We show that it is possible to subtly modify claim-salient snippets in the evidence, in addition to generating diverse and claim-aligned evidence. As a result, we highly degrade the fact-checking performance under many different permutations of the taxonomy's dimensions. The attacks are also robust against post-hoc modifications of the claim. Our analysis further hints at potential limitations in models' inference when faced with contradicting evidence. We emphasize that these attacks can have harmful implications on the inspectable and human-in-the-loop usage scenarios of such models, and we conclude by discussing challenges and directions for future defenses.
    High Precision Differentiation Techniques for Data-Driven Solution of Nonlinear PDEs by Physics-Informed Neural Networks. (arXiv:2210.00518v2 [math.NA] UPDATED)
    Time-dependent Partial Differential Equations with given initial conditions are considered in this paper. New techniques for differentiating the unknown solution with respect to the time variable are proposed. It is shown that the proposed techniques allow accurate higher order derivatives to be generated simultaneously for a set of spatial points. The calculated derivatives can then be used for data-driven solution in different ways. An application to Physics-Informed Neural Networks using the well-known DeepXDE software package in Python with a TensorFlow backend has been presented for three real-life PDEs: the Burgers', Allen-Cahn and Schrodinger equations.
    Machine Learning Sensors for Diagnosis of COVID-19 Disease Using Routine Blood Values for Internet of Things Application. (arXiv:2209.03522v2 [cs.LG] UPDATED)
    Healthcare digitalization requires effective applications of human sensors, when various parameters of the human body are instantly monitored in everyday life due to the Internet of Things (IoT). In particular, machine learning (ML) sensors for the prompt diagnosis of COVID-19 are an important option for IoT application in healthcare and ambient assisted living (AAL). Determining a COVID-19 infected status with various diagnostic tests and imaging results is costly and time-consuming. This study provides a fast, reliable and cost-effective alternative tool for the diagnosis of COVID-19 based on the routine blood values (RBVs) measured at admission. The dataset of the study consists of a total of 5296 patients with the same number of negative and positive COVID-19 test results and 51 routine blood values. In this study, 13 popular machine learning classifier models and the LogNNet neural network model were examined. The most successful classifier model in terms of time and accuracy in the detection of the disease was the histogram-based gradient boosting (HGB) (accuracy: 100%, time: 6.39 sec). The HGB classifier identified the 11 most important features (LDL, cholesterol, HDL-C, MCHC, triglyceride, amylase, UA, LDH, CK-MB, ALP and MCH) to detect the disease with 100% accuracy. In addition, the importance of single, double and triple combinations of these features in the diagnosis of the disease was discussed. We propose to use these 11 features and their binary combinations as important biomarkers for ML sensors in the diagnosis of the disease, supporting edge computing on Arduino and cloud IoT service.
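    A minimal sketch of the kind of pipeline reported, using scikit-learn's histogram-based gradient boosting on synthetic stand-in data; the study's routine-blood-value features and its reported accuracy are not reproduced here:

    ```python
    from sklearn.datasets import make_classification
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import train_test_split

    # Synthetic data stands in for the 11 selected routine blood values.
    X, y = make_classification(n_samples=5296, n_features=11, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

    clf = HistGradientBoostingClassifier(random_state=0)
    clf.fit(X_tr, y_tr)
    print(f"test accuracy: {clf.score(X_te, y_te):.3f}")
    ```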
    Learning-Augmented Algorithms for Online Linear and Semidefinite Programming. (arXiv:2209.10614v2 [cs.DS] UPDATED)
    Semidefinite programming (SDP) is a unifying framework that generalizes both linear programming and quadratically-constrained quadratic programming, while also yielding efficient solvers, both in theory and in practice. However, there exist known impossibility results for approximating the optimal solution when constraints for covering SDPs arrive in an online fashion. In this paper, we study online covering linear and semidefinite programs in which the algorithm is augmented with advice from a possibly erroneous predictor. We show that if the predictor is accurate, we can efficiently bypass these impossibility results and achieve a constant-factor approximation to the optimal solution, i.e., consistency. On the other hand, if the predictor is inaccurate, under some technical conditions, we achieve results that match both the classical optimal upper bounds and the tight lower bounds up to constant factors, i.e., robustness. More broadly, we introduce a framework that extends both (1) the online set cover problem augmented with machine-learning predictors, studied by Bamas, Maggiori, and Svensson (NeurIPS 2020), and (2) the online covering SDP problem, initiated by Elad, Kale, and Naor (ICALP 2016). Specifically, we obtain general online learning-augmented algorithms for covering linear programs with fractional advice and constraints, and initiate the study of learning-augmented algorithms for covering SDP problems. Our techniques are based on the primal-dual framework of Buchbinder and Naor (Mathematics of Operations Research, 34, 2009) and can be further adjusted to handle constraints where the variables lie in a bounded region, i.e., box constraints.
    HumSet: Dataset of Multilingual Information Extraction and Classification for Humanitarian Crisis Response. (arXiv:2210.04573v2 [cs.CL] UPDATED)
    Timely and effective response to humanitarian crises requires quick and accurate analysis of large amounts of text data - a process that can greatly benefit from expert-assisted NLP systems trained on validated and annotated data in the humanitarian response domain. To enable the creation of such NLP systems, we introduce and release HumSet, a novel and rich multilingual dataset of humanitarian response documents annotated by experts in the humanitarian response community. The dataset provides documents in three languages (English, French, Spanish) and covers a variety of humanitarian crises from 2018 to 2021 across the globe. For each document, HumSet provides selected snippets (entries) as well as classes assigned to each entry, annotated using common humanitarian information analysis frameworks. HumSet also provides novel and challenging entry extraction and multi-label entry classification tasks. In this paper, we take a first step towards approaching these tasks and conduct a set of experiments on pre-trained language models (PLMs) to establish strong baselines for future research in this domain. The dataset is available at https://blog.thedeep.io/humset/.
    A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models. (arXiv:2210.12023v1 [cs.CL])
    We have recently witnessed a number of impressive results on hard mathematical reasoning problems with language models. At the same time, the robustness of these models has also been called into question; recent works have shown that models can rely on shallow patterns in the problem description when predicting a solution. Building on the idea of behavioral testing, we propose a novel framework that pins down the causal effect of various input factors (e.g., the surface form of the problem text, the operands, and the math operators) on the output solution. By grounding the behavioral analysis in a causal graph describing an intuitive reasoning process, we study the behavior of language models in terms of robustness and sensitivity to direct interventions in the input space. We apply our framework to a test bed of bivariate math word problems. Our analysis shows that robustness does not appear to improve continuously as a function of scale, but that the recent LLM, GPT-3-Instruct (175B), achieves a dramatic improvement in both robustness and sensitivity compared to all other GPT variants.
    Inverting Adversarially Robust Networks for Image Synthesis. (arXiv:2106.06927v5 [cs.CV] UPDATED)
    Despite unconditional feature inversion being the foundation of many image synthesis applications, training an inverter demands a high computational budget, large decoding capacity, and restrictive conditions such as autoregressive priors. To address these limitations, we propose the use of adversarially robust representations as a perceptual primitive for feature inversion. We train an adversarially robust encoder to extract disentangled and perceptually-aligned image representations, making them easily invertible. By training a simple generator with the mirror architecture of the encoder, we achieve superior reconstruction quality and generalization over standard models. Based on this, we propose an adversarially robust autoencoder and demonstrate its improved performance on style transfer, image denoising and anomaly detection tasks. Compared to recent ImageNet feature inversion methods, our model attains improved performance with significantly less complexity.
    Inference and Learning for Generative Capsule Models. (arXiv:2209.03115v2 [cs.LG] UPDATED)
    Capsule networks (see e.g. Hinton et al., 2018) aim to encode knowledge of and reason about the relationship between an object and its parts. In this paper we specify a generative model for such data, and derive a variational algorithm for inferring the transformation of each model object in a scene, and the assignments of observed parts to the objects. We derive a learning algorithm for the object models, based on variational expectation maximization (Jordan et al., 1999). We also study an alternative inference algorithm based on the RANSAC method of Fischler and Bolles (1981). We apply these inference methods to (i) data generated from multiple geometric objects like squares and triangles ("constellations"), and (ii) data from a parts-based model of faces. Recent work by Kosiorek et al. (2019) has used amortized inference via stacked capsule autoencoders (SCAEs) to tackle this problem -- our results show that we significantly outperform them where we can make comparisons (on the constellations data).
    Extracting Biomedical Factual Knowledge Using Pretrained Language Model and Electronic Health Record Context. (arXiv:2209.07859v2 [cs.IR] UPDATED)
    Language models (LMs) have performed well on biomedical natural language processing applications. In this study, we conducted experiments using prompt methods to extract knowledge from LMs as new knowledge bases (LMs as KBs). However, prompting provides only a lower bound for knowledge extraction, and performs particularly poorly on biomedical-domain KBs. In order to make LMs as KBs more in line with the actual application scenarios of the biomedical domain, we specifically add EHR notes as context to the prompt to raise this lower bound in the biomedical domain. We design and validate a series of experiments for our Dynamic-Context-BioLAMA task. Our experiments show that the knowledge possessed by those language models can distinguish correct knowledge from noisy knowledge in the EHR notes, and that such distinguishing ability can also be used as a new metric to evaluate the amount of knowledge possessed by the model.
    Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data. (arXiv:2204.04452v3 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated over a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
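    For concreteness, the D-SGD iteration under analysis can be sketched as follows (a generic adapt-then-combine form, assuming W is the doubly stochastic gossip matrix encoding the topology that the paper learns, and grad_fn returns an agent's stochastic gradient; the paper's contribution is the analysis and topology learning, not the base algorithm).

        import numpy as np

        def dsgd_round(X, W, grad_fn, lr=0.1):
            # X: (n_agents, dim) local models; W: (n_agents, n_agents) gossip weights.
            G = np.stack([grad_fn(i, X[i]) for i in range(len(X))])  # local stochastic gradients
            return W @ (X - lr * G)  # gradient step, then averaging with neighbors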
    The MuSe 2022 Multimodal Sentiment Analysis Challenge: Humor, Emotional Reactions, and Stress. (arXiv:2207.05691v2 [cs.LG] UPDATED)
    The Multimodal Sentiment Analysis Challenge (MuSe) 2022 is dedicated to multimodal sentiment and emotion recognition. For this year's challenge, we feature three datasets: (i) the Passau Spontaneous Football Coach Humor (Passau-SFCH) dataset that contains audio-visual recordings of German football coaches, labelled for the presence of humour; (ii) the Hume-Reaction dataset in which reactions of individuals to emotional stimuli have been annotated with respect to seven emotional expression intensities, and (iii) the Ulm-Trier Social Stress Test (Ulm-TSST) dataset comprising audio-visual data labelled with continuous emotion values (arousal and valence) of people in stressful dispositions. Using the introduced datasets, MuSe 2022 addresses three contemporary affective computing problems: in the Humor Detection Sub-Challenge (MuSe-Humor), spontaneous humour has to be recognised; in the Emotional Reactions Sub-Challenge (MuSe-Reaction), seven fine-grained `in-the-wild' emotions have to be predicted; and in the Emotional Stress Sub-Challenge (MuSe-Stress), a continuous prediction of stressed emotion values is featured. The challenge is designed to attract different research communities, encouraging a fusion of their disciplines. Mainly, MuSe 2022 targets the communities of audio-visual emotion recognition, health informatics, and symbolic sentiment analysis. This baseline paper describes the datasets as well as the feature sets extracted from them. A recurrent neural network with LSTM cells is used to set competitive baseline results on the test partitions for each sub-challenge. We report an Area Under the Curve (AUC) of .8480 for MuSe-Humor; a mean (over 7 classes) Pearson's correlation coefficient of .2801 for MuSe-Reaction; and Concordance Correlation Coefficients (CCC) of .4931 and .4761 for valence and arousal, respectively, in MuSe-Stress.
    Assaying Out-Of-Distribution Generalization in Transfer Learning. (arXiv:2207.09239v2 [cs.LG] UPDATED)
    Since out-of-distribution generalization is a generally ill-posed problem, various proxy targets (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) were studied across different research programs resulting in different recommendations. While sharing the same aspirational goal, these approaches have never been tested under the same experimental conditions on real data. In this paper, we take a unified view of previous work, highlighting message discrepancies that we address empirically, and providing recommendations on how to measure the robustness of a model and how to improve it. To this end, we collect 172 publicly available dataset pairs for training and out-of-distribution evaluation of accuracy, calibration error, adversarial attacks, environment invariance, and synthetic corruptions. We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller scale studies.
    Efficient learning of nonlinear prediction models with time-series privileged information. (arXiv:2209.07067v3 [cs.LG] UPDATED)
    In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time which is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
    Turning Normalizing Flows into Monge Maps with Geodesic Gaussian Preserving Flows. (arXiv:2209.10873v3 [cs.LG] UPDATED)
    Normalizing Flows (NF) are powerful likelihood-based generative models that are able to trade off between expressivity and tractability to model complex densities. A now well established research avenue leverages optimal transport (OT) and looks for Monge maps, i.e. models with minimal effort between the source and target distributions. This paper introduces a method based on Brenier's polar factorization theorem to transform any trained NF into a more OT-efficient version without changing the final density. We do so by learning a rearrangement of the source (Gaussian) distribution that minimizes the OT cost between the source and the final density. We further constrain the path leading to the estimated Monge map to lie on a geodesic in the space of volume-preserving diffeomorphisms thanks to Euler's equations. The proposed method leads to smooth flows with reduced OT cost for several existing models without affecting the model performance.
    UniFed: A Benchmark for Federated Learning Frameworks. (arXiv:2207.10308v2 [cs.LG] UPDATED)
    Federated Learning (FL) has become a practical and popular paradigm in machine learning. However, currently, there is no systematic solution that covers diverse use cases. Practitioners often face the challenge of how to select a matching FL framework for their use case. In this work, we present UniFed, the first unified benchmark for standardized evaluation of the existing open-source FL frameworks. With 15 evaluation scenarios, we present both qualitative and quantitative evaluation results of nine existing popular open-sourced FL frameworks, from the perspectives of functionality, usability, and system performance. We also provide suggestions on framework selection based on the benchmark conclusions and point out future improvement directions.
    Index Tracking via Learning to Predict Market Sensitivities. (arXiv:2209.00780v2 [q-fin.PM] UPDATED)
    A significant number of equity funds nowadays are index funds, and market sensitivities are instrumental in managing them. Index funds might replicate the index identically, which is, however, cost-ineffective and impractical. Moreover, to utilize market sensitivities to replicate the index partially, they must be predicted or estimated accurately. Accordingly, we first examine deep learning models for predicting market sensitivities. We also present pragmatic applications of data processing methods to aid training and to generate target data for the prediction. We then propose a partial-index-tracking optimization model that constrains the net predicted market sensitivities of the portfolio and the index to be the same. The efficacy of these processes is corroborated on the Korea Stock Price Index 200. Our experiments show a significant reduction in prediction errors compared with historical estimations, and competitive tracking errors when replicating the index using fewer than half of the entire constituents. We therefore show that applying deep learning to predict market sensitivities is promising and that our portfolio construction methods are practically effective. Additionally, to our knowledge, this is the first study that addresses market sensitivities with a focus on deep learning.
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v3 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet only a limited number of works are dedicated to this direction. In particular, there is the work of Kirschner et al. (2020), which bridges the existing literature on Distributionally Robust Optimization (DRO) by casting the BO problem through the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings such as finite-context assumptions, leaving open the main question: can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question with a large degree of generality by considering robustness against data shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.
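    Schematically, and in our own notation rather than the paper's, the objective being solved is the distributionally robust problem $\max_{x \in \mathcal{X}} \min_{Q : D_\phi(Q \| P) \le \epsilon} \mathbb{E}_{c \sim Q}[f(x, c)]$, where $P$ is the reference context distribution and $\epsilon$ is the radius of the $\phi$-divergence ball; the key step is showing that this reduces to a finite-dimensional problem that standard BO machinery can handle.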
    Large Neural Networks Learning from Scratch with Very Few Data and without Explicit Regularization. (arXiv:2205.08836v2 [cs.CV] UPDATED)
    Recent findings have shown that highly over-parameterized Neural Networks generalize without pretraining or explicit regularization. This is achieved with zero training error, i.e., complete over-fitting by memorizing the training data, which is surprising since it runs counter to traditional machine learning wisdom. In our empirical study we fortify these findings in the domain of fine-grained image classification. We show that very large Convolutional Neural Networks with millions of weights do learn with only a handful of training samples and without image augmentation, explicit regularization or pretraining. We train the architectures ResNet18, ResNet101 and VGG19 on subsets of the difficult benchmark datasets Caltech101, CUB_200_2011, FGVCAircraft, Flowers102 and StanfordCars with 100 or more classes, perform a comprehensive comparative study and draw implications for the practical application of CNNs. Finally, we show that a randomly initialized VGG19 with 140 million weights learns to distinguish airplanes and motorbikes with up to 95% accuracy using only 20 training samples per class.
    Probing Classifiers are Unreliable for Concept Removal and Detection. (arXiv:2207.04153v2 [cs.LG] UPDATED)
    Neural network models trained on text data have been found to encode undesirable linguistic or sensitive concepts in their representation. Removing such concepts is non-trivial because of a complex relationship between the concept, text input, and the learnt representation. Recent work has proposed post-hoc and adversarial methods to remove such unwanted concepts from a model's representation. Through an extensive theoretical and empirical analysis, we show that these methods can be counter-productive: they are unable to remove the concepts entirely, and in the worst case may end up destroying all task-relevant features. The reason is the methods' reliance on a probing classifier as a proxy for the concept. Even under the most favorable conditions for learning a probing classifier when a concept's relevant features in representation space alone can provide 100% accuracy, we prove that a probing classifier is likely to use non-concept features and thus post-hoc or adversarial methods will fail to remove the concept correctly. These theoretical implications are confirmed by experiments on models trained on synthetic, Multi-NLI, and Twitter datasets. For sensitive applications of concept removal such as fairness, we recommend caution against using these methods and propose a spuriousness metric to gauge the quality of the final classifier.
    Mode Reduction for Markov Jump Systems. (arXiv:2205.02697v2 [eess.SY] UPDATED)
    Switched systems are capable of modeling processes with underlying dynamics that may change abruptly over time. To achieve accurate modeling in practice, one may need a large number of modes, but this may in turn increase the model complexity drastically. Existing work on reducing system complexity mainly considers state space reduction, yet reducing the number of modes is less studied. In this work, we consider Markov jump linear systems (MJSs), a special class of switched systems where the active mode switches according to a Markov chain, and several issues associated with its mode complexity. Specifically, inspired by clustering techniques from unsupervised learning, we are able to construct a reduced MJS with fewer modes that approximates well the original MJS under various metrics. Furthermore, both theoretically and empirically, we show how one can use the reduced MJS to analyze stability and design controllers with significant reduction in computational cost while achieving guaranteed accuracy.
    SymFormer: End-to-end symbolic regression using transformer-based architecture. (arXiv:2205.15764v3 [cs.LG] UPDATED)
    Many real-world problems can be naturally described by mathematical formulas. The task of finding formulas from a set of observed inputs and outputs is called symbolic regression. Recently, neural networks have been applied to symbolic regression, among which the transformer-based ones seem to be the most promising. After training the transformer on a large number of formulas (in the order of days), the actual inference, i.e., finding a formula for new, unseen data, is very fast (in the order of seconds). This is considerably faster than state-of-the-art evolutionary methods. The main drawback of transformers is that they generate formulas without numerical constants, which have to be optimized separately, thus yielding suboptimal results. We propose a transformer-based approach called SymFormer, which predicts the formula by outputting the individual symbols and the corresponding constants simultaneously. This leads to better performance in terms of fitting the available data. In addition, the constants provided by SymFormer serve as a good starting point for subsequent tuning via gradient descent to further improve the performance. We show on a set of benchmarks that SymFormer outperforms two state-of-the-art methods while having faster inference.
    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. (arXiv:2207.06569v2 [cs.LG] UPDATED)
    The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with power-law spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
    Domain Knowledge-Informed Self-Supervised Representations for Workout Form Assessment. (arXiv:2202.14019v2 [cs.CV] UPDATED)
    Maintaining proper form while exercising is important for preventing injuries and maximizing muscle mass gains. Detecting errors in workout form naturally requires estimating the human body pose. However, off-the-shelf pose estimators struggle to perform well on videos recorded in gym scenarios due to factors such as camera angles, occlusion from gym equipment, illumination, and clothing. To aggravate the problem, the errors to be detected in the workouts are very subtle. To that end, we propose to learn exercise-oriented image and video representations from unlabeled samples such that a small dataset annotated by experts suffices for supervised error detection. In particular, our domain knowledge-informed self-supervised approaches (pose contrastive learning and motion disentangling) exploit the harmonic motion of the exercise actions, and capitalize on the large variances in camera angles, clothes, and illumination to learn powerful representations. To facilitate our self-supervised pretraining and supervised finetuning, we curated a new exercise dataset, \emph{Fitness-AQA} (\url{https://github.com/ParitoshParmar/Fitness-AQA}), comprising three exercises: BackSquat, BarbellRow, and OverheadPress. It has been annotated by expert trainers for multiple crucial and typically occurring exercise errors. Experimental results show that our self-supervised representations outperform off-the-shelf 2D- and 3D-pose estimators and several other baselines. We also show that our approaches can be applied to other domains/tasks such as pose estimation and dive quality assessment.
    Adam Can Converge Without Any Modification on Update Rules. (arXiv:2208.09632v3 [cs.LG] UPDATED)
    Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and practice? We point out there is a mismatch between the settings of theory and practice: Reddi et al. 2018 pick the problem after picking the hyperparameters of Adam, i.e., $(\beta_1, \beta_2)$; while practical applications often fix the problem first and then tune $(\beta_1, \beta_2)$. Due to this observation, we conjecture that the empirical convergence can be theoretically justified, only if we change the order of picking the problem and hyperparameter. In this work, we confirm this conjecture. We prove that, when $\beta_2$ is large and $\beta_1 < \sqrt{\beta_2}<1$, Adam converges to the neighborhood of critical points. The size of the neighborhood is proportional to the variance of stochastic gradients. Under an extra condition (strong growth condition), Adam converges to critical points. It is worth mentioning that our results cover a wide range of hyperparameters: as $\beta_2$ increases, our convergence result can cover any $\beta_1 \in [0,1)$ including $\beta_1=0.9$, which is the default setting in deep learning libraries. To our knowledge, this is the first result showing that Adam can converge without any modification on its update rules. Further, our analysis does not require assumptions of bounded gradients or bounded 2nd-order momentum. When $\beta_2$ is small, we further point out a large region of $(\beta_1,\beta_2)$ where Adam can diverge to infinity. Our divergence result considers the same setting as our convergence result, indicating a phase transition from divergence to convergence when increasing $\beta_2$. These positive and negative results can provide suggestions on how to tune Adam hyperparameters.
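    For reference, the unmodified update rules in question are the standard Adam recursions (bias correction omitted here for brevity; $g_t$ denotes the stochastic gradient at step $t$): $m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$, and $x_{t+1} = x_t - \alpha_t \, m_t / (\sqrt{v_t} + \epsilon)$, so the conditions above are purely on $(\beta_1, \beta_2)$, not on the recursions themselves.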
    A Survey of Machine Unlearning. (arXiv:2209.02299v5 [cs.LG] UPDATED)
    Today, computer systems hold large amounts of personal data. Yet while such an abundance of data allows breakthroughs in artificial intelligence, and especially machine learning (ML), its existence can be a threat to user privacy, and it can weaken the bonds of trust between humans and AI. Recent regulations now require that, on request, private information about a user must be removed from both computer systems and ML models (i.e., the ``right to be forgotten''). While removing data from back-end databases should be straightforward, it is not sufficient in the AI context as ML models often `remember' the old data. Contemporary adversarial attacks on trained models have proven that we can learn whether an instance or an attribute belonged to the training data. This phenomenon calls for a new paradigm, namely machine unlearning, to make ML models forget about particular data. It turns out that recent works on machine unlearning have not been able to completely solve the problem due to the lack of common frameworks and resources. Therefore, this paper aspires to present a comprehensive examination of machine unlearning's concepts, scenarios, methods, and applications. Specifically, as a category collection of cutting-edge studies, the intention behind this article is to serve as a comprehensive resource for researchers and practitioners seeking an introduction to machine unlearning and its formulations, design criteria, removal requests, algorithms, and applications. In addition, we aim to highlight the key findings, current trends, and new research areas that have not yet featured the use of machine unlearning but could benefit greatly from it. We hope this survey serves as a valuable resource for ML researchers and those seeking to innovate privacy technologies. Our resources are publicly available at https://github.com/tamlhp/awesome-machine-unlearning.
    PSO-PINN: Physics-Informed Neural Networks Trained with Particle Swarm Optimization. (arXiv:2202.01943v3 [cs.LG] UPDATED)
    Physics-informed neural networks (PINN) have recently emerged as a promising application of deep learning in a wide range of engineering and scientific problems based on partial differential equation (PDE) models. However, evidence shows that PINN training by gradient descent displays pathologies that often prevent convergence when solving PDEs with irregular solutions. In this paper, we propose the use of a particle swarm optimization (PSO) approach to train PINNs. The resulting PSO-PINN algorithm not only mitigates the undesired behaviors of PINNs trained with standard gradient descent but also presents an ensemble approach to PINN that affords the possibility of robust predictions with quantified uncertainty. We also propose PSO-BP-CD (PSO with Back-Propagation and Coefficient Decay), a hybrid PSO variant that combines swarm optimization with gradient descent, putting more weight on the latter as training progresses and the swarm zeros in on a good local optimum. Comprehensive experimental results show that PSO-PINN with the proposed PSO-BP-CD algorithm outperforms PINN ensembles trained with other PSO variants or with pure gradient descent.
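    The swarm component can be sketched generically as the canonical PSO update below (a hedged illustration: PSO-BP-CD additionally mixes in back-propagation gradient steps, with the swarm term decaying as training progresses, which is not shown here).

        import numpy as np

        def pso_step(x, v, pbest, gbest, w=0.7, c1=1.5, c2=1.5, rng=np.random.default_rng()):
            # x, v, pbest: (n_particles, dim) positions, velocities, personal bests; gbest: (dim,)
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)  # inertia + cognitive + social
            return x + v, v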
    Interventions, Where and How? Experimental Design for Causal Models at Scale. (arXiv:2203.02016v3 [cs.LG] UPDATED)
    Causal discovery from observational and interventional data is challenging due to limited data and non-identifiability: factors that introduce uncertainty in estimating the underlying structural causal model (SCM). Selecting experiments (interventions) based on the uncertainty arising from both factors can expedite the identification of the SCM. Existing methods in experimental design for causal discovery from limited data either rely on linear assumptions for the SCM or select only the intervention target. This work incorporates recent advances in Bayesian causal discovery into the Bayesian optimal experimental design framework, allowing for active causal discovery of large, nonlinear SCMs while selecting both the interventional target and the value. We demonstrate the performance of the proposed method on synthetic graphs (Erd\H{o}s-R\'enyi, Scale Free) for both linear and nonlinear SCMs as well as on the \emph{in-silico} single-cell gene regulatory network dataset, DREAM.
    Evolving Generalizable Actor-Critic Algorithms. (arXiv:2204.04292v2 [cs.LG] UPDATED)
    Deploying Reinforcement Learning (RL) agents in the real world requires designing and tuning algorithms for problem-specific objectives such as performance, robustness, or stability. These objectives can frequently change, necessitating further painstaking design and tuning. This paper presents MetaPG, an evolutionary method for designing new loss functions for actor-critic RL algorithms that optimize for different objectives. In particular, we focus on the objectives of final performance in the training regime, policy robustness to unseen environment configurations, and training curve stability over random seeds. We initialize our algorithm population from Soft Actor-Critic (SAC) and optimize for these objectives over a set of continuous control tasks from the Real-World RL Benchmark Suite. We find that our method evolves algorithms that, using a single environment during evolution, improve upon SAC's performance and generalizability by 3% and 17%, respectively, and reduce instability by up to 65% in that same environment. Then, we scale up to more complex environments from the Brax physics simulator and replicate conditions that can be encountered in practical settings (such as different friction coefficients). MetaPG evolves algorithms that can obtain 9% better policy robustness within the same meta-training environment without loss of performance and robustness when doing cross-domain evaluations in other Brax environments. Lastly, we analyze the structure of the best algorithms in the population and interpret the specific elements that help the algorithm optimize for a certain objective, such as regularizing the critic loss.
    What Do Compressed Multilingual Machine Translation Models Forget?. (arXiv:2205.10828v2 [cs.CL] UPDATED)
    Recently, very large pre-trained models have achieved state-of-the-art results in various natural language processing (NLP) tasks, but their size makes it more challenging to apply them in resource-constrained environments. Compression techniques make it possible to drastically reduce the size of the models, and therefore their inference time, with negligible impact on top-tier metrics. However, the general performance averaged across multiple tasks and/or languages may hide a drastic performance drop on under-represented features, which could result in the amplification of biases encoded by the models. In this work, we assess the impact of compression methods on Multilingual Neural Machine Translation models (MNMT) for various language groups, gender, and semantic biases by extensive analysis of compressed models on different machine translation benchmarks, i.e. FLORES-101, MT-Gender, and DiBiMT. We show that the performance of under-represented languages drops significantly, while the average BLEU metric only slightly decreases. Interestingly, the removal of noisy memorization with compression leads to a significant improvement for some medium-resource languages. Finally, we demonstrate that compression amplifies intrinsic gender and semantic biases, even in high-resource languages. Code: https://github.com/alirezamshi/bias-compressedMT
    Transformer Memory as a Differentiable Search Index. (arXiv:2202.06991v3 [cs.CL] UPDATED)
    In this paper, we demonstrate that information retrieval can be accomplished with a single Transformer, in which all information about the corpus is encoded in the parameters of the model. To this end, we introduce the Differentiable Search Index (DSI), a new paradigm that learns a text-to-text model that maps string queries directly to relevant docids; in other words, a DSI model answers queries directly using only its parameters, dramatically simplifying the whole retrieval process. We study variations in how documents and their identifiers are represented, variations in training procedures, and the interplay between models and corpus sizes. Experiments demonstrate that given appropriate design choices, DSI significantly outperforms strong baselines such as dual encoder models. Moreover, DSI demonstrates strong generalization capabilities, outperforming a BM25 baseline in a zero-shot setup.
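    As a toy illustration of the paradigm (not the paper's exact setup), a DSI-style retriever is just a seq2seq model that decodes docid strings; below, an off-the-shelf T5 checkpoint from Hugging Face Transformers stands in for a model that has been fine-tuned to map document text and queries to docids, and constrained decoding over valid docids is omitted.

        from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("t5-small")   # placeholder for a trained DSI model
        model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

        def retrieve(query, k=5):
            inputs = tok(query, return_tensors="pt")
            outs = model.generate(**inputs, num_beams=k, num_return_sequences=k,
                                  max_new_tokens=8)  # docids decoded directly from the parameters
            return [tok.decode(o, skip_special_tokens=True) for o in outs]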
    First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach. (arXiv:2112.03432v4 [cs.LG] UPDATED)
    Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K )$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v3 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.
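    The key observation admits a very short sketch (ours, in NumPy): to build an O($d$)-equivariant vector function, compute the invariant scalars (pairwise inner products of the inputs) and use them to weight a linear combination of the input vectors, where coeff_fn is any learned function of the scalars.

        import numpy as np

        def equivariant_fn(V, coeff_fn):
            # V: (n, d) input vectors; coeff_fn maps the (n, n) scalars to n weights.
            scalars = V @ V.T            # O(d)-invariant inner products
            weights = coeff_fn(scalars)  # e.g., a small MLP over the scalars
            return weights @ V           # equivariant linear combination, shape (d,)

    Rotating every row of V by the same orthogonal matrix rotates the output identically, since the scalars, and hence the weights, are unchanged.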
    A Softmax-free Loss Function Based on Predefined Optimal-distribution of Latent Features for Deep Learning Classifier. (arXiv:2111.15449v2 [cs.CV] UPDATED)
    In the field of pattern classification, the training of deep learning classifiers is mostly end-to-end, and the loss function constrains the final output (posterior probability) of the network, so the existence of Softmax is essential. In end-to-end learning, there is usually no effective loss function that relies entirely on the features of the middle layer to restrict learning; as a result, the distribution of sample latent features is not optimal, and there is still room for improvement in classification accuracy. Based on the concept of Predefined Evenly-Distributed Class Centroids (PEDCC), this article proposes POD Loss, a Softmax-free loss function based on a predefined optimal distribution of latent features. The loss function only restricts the latent features of the samples, including the norm-adaptive cosine distance between the latent feature vector of a sample and the center of its predefined evenly-distributed class, and the correlation between the latent features of the samples. Finally, cosine distance is used for classification. Compared with the commonly used Softmax loss, several typical Softmax-related loss functions, and PEDCC-Loss, experiments with several typical deep learning classification networks on several commonly used datasets show that the classification performance of POD Loss is consistently significantly better, and that POD Loss converges more easily. Code is available at https://github.com/TianYuZu/POD-Loss.
    High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent. (arXiv:2207.01560v2 [cs.LG] UPDATED)
    In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the worst-case utility of DP-ERM reduces polynomially as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimension, it is common for some model's parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the gradients' (approximately) greatest entry. We show theoretically that DP-GCD can achieve a logarithmic dependence on the dimension for a wide range of problems by naturally exploiting their structural properties (such as quasi-sparse solutions). We illustrate this behavior numerically, both on synthetic and real datasets.
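    Read from the abstract, one DP-GCD iteration can be sketched as below (a hedged illustration: the selection is a report-noisy-max over gradient entries, and we elide gradient clipping and the exact calibration of the noise scale to sensitivity and privacy budget).

        import numpy as np

        def dp_gcd_step(x, grad_fn, lr, scale, rng=np.random.default_rng()):
            g = grad_fn(x)
            j = int(np.argmax(np.abs(g) + rng.laplace(scale=scale, size=g.shape)))  # private selection
            x = x.copy()
            x[j] -= lr * (g[j] + rng.laplace(scale=scale))  # noisy step on the chosen coordinate
            return x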
    Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting. (arXiv:2110.03135v2 [cs.LG] UPDATED)
    We show that label noise exists in adversarial training. Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples - the true label distribution is distorted by the adversarial perturbation, but is neglected by the common practice of inheriting labels from clean examples. Recognizing label noise sheds light on the prevalence of robust overfitting in adversarial training, and explains its intriguing dependence on perturbation radius and data quality. Our label noise perspective also aligns well with our observations of epoch-wise double descent in adversarial training. Guided by our analyses, we propose a method to automatically calibrate the labels to address the label noise and robust overfitting. Our method achieves consistent performance improvements across various models and datasets without introducing new hyper-parameters or additional tuning.
    Learning knot invariants across dimensions. (arXiv:2112.00016v2 [hep-th] UPDATED)
    We use deep neural networks to machine learn correlations between knot invariants in various dimensions. The three-dimensional invariant of interest is the Jones polynomial $J(q)$, and the four-dimensional invariants are the Khovanov polynomial $\text{Kh}(q,t)$, smooth slice genus $g$, and Rasmussen's $s$-invariant. We find that a two-layer feed-forward neural network can predict $s$ from $\text{Kh}(q,-q^{-4})$ with greater than $99\%$ accuracy. A theoretical explanation for this performance exists in knot theory via the now disproven knight move conjecture, which is obeyed by all knots in our dataset. More surprisingly, we find similar performance for the prediction of $s$ from $\text{Kh}(q,-q^{-2})$, which suggests a novel relationship between the Khovanov and Lee homology theories of a knot. The network predicts $g$ from $\text{Kh}(q,t)$ with similarly high accuracy, and we discuss the extent to which the machine is learning $s$ as opposed to $g$, since there is a general inequality $|s| \leq 2g$. The Jones polynomial, as a three-dimensional invariant, is not obviously related to $s$ or $g$, but the network achieves greater than $95\%$ accuracy in predicting either from $J(q)$. Moreover, similar accuracy can be achieved by evaluating $J(q)$ at roots of unity. This suggests a relationship with $SU(2)$ Chern--Simons theory, and we review the gauge theory construction of Khovanov homology which may be relevant for explaining the network's performance.
    The Phenomenon of Policy Churn. (arXiv:2206.00730v3 [cs.LG] UPDATED)
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
    Transferring Dexterous Manipulation from GPU Simulation to a Remote Real-World TriFinger. (arXiv:2108.09779v2 [cs.RO] UPDATED)
    We present a system for learning a challenging dexterous manipulation task: moving a cube to an arbitrary 6-DoF pose with only three fingers, trained with NVIDIA's IsaacGym simulator. We show empirical benefits, both in simulation and sim-to-real transfer, of using keypoints as opposed to position+quaternion representations for the object pose in 6-DoF for policy observations and in reward calculation to train a model-free reinforcement learning agent. By utilizing domain randomization strategies along with the keypoint representation of the pose of the manipulated object, we achieve a high success rate of 83% on a remote TriFinger system maintained by the organizers of the Real Robot Challenge. With the aim of assisting further research in learning in-hand manipulation, we make the codebase of our system, along with trained checkpoints that come with billions of steps of experience, available at https://s2r2-ig.github.io
    Robust Federated Learning with Connectivity Failures: A Semi-Decentralized Framework with Collaborative Relaying. (arXiv:2202.11850v2 [cs.DC] UPDATED)
    Intermittent connectivity of clients to the parameter server (PS) is a major bottleneck in federated edge learning frameworks. The lack of constant connectivity induces a large generalization gap, especially when the local data distribution amongst clients exhibits heterogeneity. To overcome intermittent communication outages between clients and the central PS, we introduce the concept of collaborative relaying wherein the participating clients relay their neighbors' local updates to the PS in order to boost the participation of clients with poor connectivity to the PS. We propose a semi-decentralized federated learning framework in which at every communication round, each client initially computes a local consensus of a subset of its neighboring clients' updates, and eventually transmits to the PS a weighted average of its own update and those of its neighbors'. We appropriately optimize these local consensus weights to ensure that the global update at the PS is unbiased with minimal variance - consequently improving the convergence rate. Numerical evaluations on the CIFAR-10 dataset demonstrate that our collaborative relaying approach outperforms federated averaging-based benchmarks for learning over intermittently-connected networks such as when the clients communicate over millimeter wave channels with intermittent blockages.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v3 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    Unsupervised Multi-object Segmentation by Predicting Probable Motion Patterns. (arXiv:2210.12148v1 [cs.CV])
    We propose a new approach to learning to segment multiple image objects without manual supervision. The method can extract objects from still images, but uses videos for supervision. While prior works have considered motion for segmentation, a key insight is that, while motion can be used to identify objects, not all objects are necessarily in motion: the absence of motion does not imply the absence of objects. Hence, our model learns to predict image regions that are likely to contain motion patterns characteristic of objects moving rigidly. It does not predict specific motion, which cannot be done unambiguously from a still image, but a distribution of possible motions, which includes the possibility that an object does not move at all. We demonstrate the advantage of this approach over its deterministic counterpart and show state-of-the-art unsupervised object segmentation performance on simulated and real-world benchmarks, surpassing methods that use motion even at test time. As our approach is applicable to a variety of network architectures that segment scenes, we also apply it to existing image reconstruction-based models, showing drastic improvements. Project page and code: https://www.robots.ox.ac.uk/~vgg/research/ppmp .
    Sim2Real for Soft Robotic Fish via Differentiable Simulation. (arXiv:2109.14855v3 [cs.RO] UPDATED)
    Accurate simulation of soft mechanisms under dynamic actuation is critical for the design of soft robots. We address this gap with a differentiable simulation tool that learns the material parameters of our soft robotic fish. On this example, we demonstrate an experimentally-verified, fast optimization pipeline for learning the material parameters from quasi-static data via differentiable simulation and apply it to the prediction of dynamic performance. Our method identifies physically plausible Young's moduli for the various soft silicone elastomers and stiff acetal copolymers used in the creation of our three robotic fish tail designs. We show that our method is compatible with varying internal geometry of the actuators, such as the number of hollow cavities. Our framework allows high-fidelity prediction of dynamic behavior for composite bi-morph bending structures in real hardware, to millimeter accuracy and within 3 percent error normalized to actuator length. We provide a differentiable and robust estimate of the thrust force using a neural network thrust predictor; this estimate allows for accurate modeling of our experimental setup measuring bollard pull. This work presents a prototypical hardware and simulation problem solved using our differentiable framework; the framework can be applied to higher-dimensional parameter inference, learning control policies, and computational design due to its differentiable character.
    Technology Fitness Landscape for Design Innovation: A Deep Neural Embedding Approach Based on Patent Data. (arXiv:2110.13624v3 [cs.LG] UPDATED)
    Technology is essential to innovation and economic prosperity. Understanding technological changes can guide innovators to find new directions of design innovation and thus make breakthroughs. In this work, we construct a technology fitness landscape via deep neural embeddings of patent data. The landscape consists of 1,757 technology domains and their respective improvement rates. In the landscape, we find a high hill related to information and communication technologies (ICT) and a vast low plain covering the remaining domains. The landscape presents a bird's-eye view of the structure of the total technology space, providing a new way for innovators to interpret technology evolution through a biological analogy, and a biologically inspired way to infer the next innovation.
    Debiased Self-Training for Semi-Supervised Learning. (arXiv:2202.07136v4 [cs.LG] UPDATED)
    Deep neural networks achieve remarkable performance on a wide range of tasks with the aid of large-scale labeled datasets. Yet such datasets are time-consuming and labor-intensive to obtain on realistic tasks. To mitigate the requirement for labeled data, self-training is widely used in semi-supervised learning by iteratively assigning pseudo labels to unlabeled samples. Despite its popularity, self-training is widely believed to be unreliable and often leads to training instability. Our experimental studies further reveal that the bias in semi-supervised learning arises both from the problem itself and from inappropriate training with potentially incorrect pseudo labels, which accumulates error in the iterative self-training process. To reduce this bias, we propose Debiased Self-Training (DST). First, the generation and utilization of pseudo labels are decoupled by two parameter-independent classifier heads to avoid direct error accumulation. Second, we estimate the worst case of self-training bias, where the pseudo labeling function is accurate on labeled samples yet makes as many mistakes as possible on unlabeled samples. We then adversarially optimize the representations to improve the quality of pseudo labels by avoiding the worst case. Extensive experiments justify that DST achieves an average improvement of 6.3% against state-of-the-art methods on standard semi-supervised learning benchmark datasets and 18.9% against FixMatch on 13 diverse tasks. Furthermore, DST can be seamlessly adapted to other self-training methods, helping stabilize their training and balance performance across classes both when training from scratch and when fine-tuning from pre-trained models.
    Efficient Dataset Distillation Using Random Feature Approximation. (arXiv:2210.12067v1 [cs.LG])
    Dataset distillation compresses large datasets into smaller synthetic coresets which retain performance with the aim of reducing the storage and computational burden of processing the entire dataset. Today's best-performing algorithm, \textit{Kernel Inducing Points} (KIP), which makes use of the correspondence between infinite-width neural networks and kernel-ridge regression, is prohibitively slow due to the exact computation of the neural tangent kernel matrix, scaling $O(|S|^2)$, with $|S|$ being the coreset size. To improve this, we propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel, which reduces the kernel matrix computation to $O(|S|)$. Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU. Our new method, termed RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets, both in kernel regression and finite-width network training. We demonstrate the effectiveness of our approach on tasks involving model interpretability and privacy preservation.
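    The flavor of the approximation can be conveyed in a few lines (a sketch under our own simplifications: Monte Carlo features for a one-layer ReLU arc-cosine kernel, a basic building block of NNGP kernels; the paper's RFAD handles deep architectures and the full distillation loop).

        import numpy as np

        def relu_random_features(X, n_features=4096, rng=np.random.default_rng(0)):
            # X: (n, d). Returns z(X) such that z(x) . z(y) approximates the ReLU
            # arc-cosine kernel K(x, y) in expectation over the random projections W.
            W = rng.normal(size=(X.shape[1], n_features))
            return np.sqrt(2.0 / n_features) * np.maximum(X @ W, 0.0)

        Z = relu_random_features(np.random.default_rng(1).normal(size=(10, 3)))
        K_approx = Z @ Z.T  # exact kernel entries replaced by cheap feature inner products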
    Boomerang: Local sampling on image manifolds using diffusion models. (arXiv:2210.12100v1 [cs.CV])
    Diffusion models can be viewed as mapping points in a high-dimensional latent space onto a low-dimensional learned manifold, typically an image manifold. The intermediate values between the latent space and image manifold can be interpreted as noisy images which are determined by the noise scheduling scheme employed during pre-training. We exploit this interpretation to introduce Boomerang, a local image manifold sampling approach using the dynamics of diffusion models. We call it Boomerang because we first add noise to an input image, moving it closer to the latent space, then bring it back to the image space through diffusion dynamics. We use this method to generate images which are similar, but nonidentical, to the original input images on the image manifold. We are able to set how close the generated image is to the original based on how much noise we add. Additionally, the generated images have a degree of stochasticity, allowing us to locally sample as many times as we want without repetition. We show three applications for which Boomerang can be used. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show how to use Boomerang for data augmentation while staying on the image manifold. Third, we introduce a framework for image super-resolution with 8x upsampling. Boomerang does not require any modification to the training of diffusion models and can be used with pretrained models on a single, inexpensive GPU.
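    A minimal sketch of the "throw and catch", assuming PyTorch, a standard DDPM noise schedule, and a pretrained noise-prediction network `eps_model` (a hypothetical stand-in here): noise the clean image to an intermediate timestep in closed form, then run ancestral reverse steps back to the image manifold. A larger starting timestep yields samples farther from the input.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # standard DDPM schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def boomerang(x0, eps_model, t_start=300):
    """Noise x0 forward to t_start via q(x_t | x_0), then denoise back to t=0."""
    eps = torch.randn_like(x0)
    ab = alphas_bar[t_start]
    x = ab.sqrt() * x0 + (1 - ab).sqrt() * eps  # closed-form forward noising
    for t in range(t_start, 0, -1):             # learned reverse dynamics
        ab_t = alphas_bar[t]
        eps_hat = eps_model(x, t)               # pretrained noise predictor (assumed interface)
        mean = (x - betas[t] / (1 - ab_t).sqrt() * eps_hat) / (1 - betas[t]).sqrt()
        if t > 1:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```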
    Towards quantum advantage via topological data analysis. (arXiv:2005.02607v5 [quant-ph] UPDATED)
    Even after decades of quantum computing development, examples of generally useful quantum algorithms with exponential speedups over classical counterparts are scarce. Recent progress in quantum algorithms for linear algebra has positioned quantum machine learning (QML) as a potential source of such useful exponential improvements. Yet, in an unexpected development, a recent series of "dequantization" results has just as rapidly removed the promise of exponential speedups for several QML algorithms. This raises the critical question of whether the exponential speedups of other linear-algebraic QML algorithms persist. In this paper, we study the quantum-algorithmic methods behind the algorithm for topological data analysis of Lloyd, Garnerone and Zanardi through this lens. We provide evidence that the problem solved by this algorithm is classically intractable by showing that its natural generalization is as hard as simulating the one clean qubit model -- which is widely believed to require superpolynomial time on a classical computer -- and is thus very likely immune to dequantizations. Based on this result, we provide a number of new quantum algorithms for problems such as rank estimation and complex network analysis, along with complexity-theoretic evidence for their classical intractability. Furthermore, we analyze the suitability of the proposed quantum algorithms for near-term implementations. Our results provide a number of useful applications for both full-blown and restricted quantum computers with a guaranteed exponential speedup over classical methods, recovering some of the potential for linear-algebraic QML to become one of quantum computing's killer applications.
    Neural Networks for Local Search and Crossover in Vehicle Routing: A Possible Overkill?. (arXiv:2210.12075v1 [cs.NE])
    Over recent years, extensive research has been conducted on various ways of enhancing heuristic search for combinatorial optimization problems with machine learning algorithms. In this study, we investigate the use of predictions from graph neural networks (GNNs) in the form of heatmaps to improve the Hybrid Genetic Search (HGS), a state-of-the-art algorithm for the Capacitated Vehicle Routing Problem (CVRP). The crossover and local-search components of HGS are instrumental in finding improved solutions, yet these components essentially rely on simple greedy or random choices. It seems intuitive to attempt to incorporate additional knowledge at these levels. Through a vast experimental campaign on more than 10,000 problem instances, we show that exploiting more sophisticated strategies using measures of node relatedness (heatmaps, or simply distance) within these algorithmic components can significantly enhance performance. However, contrary to initial expectations, we also observe that heatmaps do not present significant advantages over simpler distance measures for these purposes. We therefore face a common -- though rarely documented -- situation of overkill: GNNs can indeed improve performance on an important optimization task, but an ablation analysis demonstrates that simpler alternatives perform equally well.
    Robust Singular Values based on L1-norm PCA. (arXiv:2210.12097v1 [eess.SP])
    Singular-Value Decomposition (SVD) is a ubiquitous data analysis method in engineering, science, and statistics. Singular-value estimation, in particular, is of critical importance in an array of engineering applications, such as channel estimation in communication systems, electromyography signal analysis, and image compression, to name just a few. Conventional SVD of a data matrix coincides with standard Principal-Component Analysis (PCA). The L2-norm (sum of squared values) formulation of PCA promotes peripheral data points and thus makes PCA sensitive to outliers. Naturally, SVD inherits this outlier sensitivity. In this work, we present a novel robust non-parametric method for SVD and singular-value estimation based on an L1-norm (sum of absolute values) formulation, which we name L1-cSVD. Accordingly, the proposed method demonstrates sturdy resistance to outliers and can facilitate more reliable data analysis and processing in a wide range of engineering applications.
    Graph Few-shot Learning with Task-specific Structures. (arXiv:2210.12130v1 [cs.LG])
    Graph few-shot learning is of great importance among various graph learning tasks. Under the few-shot scenario, models are often required to conduct classification given limited labeled samples. Existing graph few-shot learning methods typically leverage Graph Neural Networks (GNNs) and perform classification across a series of meta-tasks. Nevertheless, these methods generally rely on the original graph (i.e., the graph that the meta-task is sampled from) to learn node representations. Consequently, the graph structure used in each meta-task is identical. Since the class sets are different across meta-tasks, node representations should be learned in a task-specific manner to promote classification performance. Therefore, to adaptively learn node representations across meta-tasks, we propose a novel framework that learns a task-specific structure for each meta-task. To handle the variety of nodes across meta-tasks, we extract relevant nodes and learn task-specific structures based on node influence and mutual information. In this way, we can learn node representations with the task-specific structure tailored for each meta-task. We further conduct extensive experiments on five node classification datasets under both single- and multiple-graph settings to validate the superiority of our framework over the state-of-the-art baselines. Our code is provided at https://github.com/SongW-SW/GLITTER.
    Stability to Deformations of Manifold Filters and Manifold Neural Networks. (arXiv:2106.03725v3 [cs.LG] UPDATED)
    The paper defines and studies manifold convolutional filters and manifold neural networks (MNNs). Manifold filters and MNNs are defined in terms of the Laplace-Beltrami operator exponential and are such that graph filters and graph neural networks (GNNs) are recovered as discrete approximations when the manifold is sampled. These filters admit a spectral representation which generalizes both the spectral representation of graph filters and the frequency response of standard convolutional filters in continuous time. The main technical contribution of the paper is to analyze the stability of manifold filters and MNNs to smooth deformations of the manifold. This analysis generalizes known stability properties of graph filters and GNNs, as well as known stability properties of standard convolutional filters and neural networks in continuous time. The most important observation that follows from this analysis is that manifold filters, like graph filters and standard continuous-time filters, have difficulty discriminating high-frequency components in the presence of deformations. This is a challenge that can be ameliorated with the use of manifold, graph, or continuous-time neural networks. The most important practical consequence of this analysis is to shed light on the behavior of graph filters and GNNs in large-scale graphs.
    Efficient Global Planning in Large MDPs via Stochastic Primal-Dual Optimization. (arXiv:2210.12057v1 [cs.LG])
    We propose a new stochastic primal-dual optimization algorithm for planning in a large discounted Markov decision process with a generative model and linear function approximation. Assuming that the feature map approximately satisfies standard realizability and Bellman-closedness conditions and also that the feature vectors of all state-action pairs are representable as convex combinations of a small core set of state-action pairs, we show that our method outputs a near-optimal policy after a polynomial number of queries to the generative model. Our method is computationally efficient and comes with the major advantage that it outputs a single softmax policy that is compactly represented by a low-dimensional parameter vector, and does not need to execute computationally expensive local planning subroutines in runtime.
    Imitation Learning: Progress, Taxonomies and Challenges. (arXiv:2106.12177v2 [cs.LG] UPDATED)
    Imitation learning aims to extract knowledge from human experts' demonstrations or artificially created agents in order to replicate their behaviors. Its success has been demonstrated in areas such as video games, autonomous driving, robotic simulations, and object manipulation. However, this replication process can be problematic: performance is highly dependent on demonstration quality, and most trained agents perform well only in task-specific environments. In this survey, we provide a systematic review of imitation learning. We first introduce the background, from the field's development history to its preliminaries, and then present the different taxonomies within imitation learning along with key milestones of the field. We then detail challenges in learning strategies and present research opportunities in learning policies from suboptimal demonstrations, voice instructions, and other associated optimization schemes.
    FLEX: Extrinsic Parameters-free Multi-view 3D Human Motion Reconstruction. (arXiv:2105.01937v4 [cs.CV] UPDATED)
    The increasing availability of video recordings made by multiple cameras has offered new means for mitigating occlusion and depth ambiguities in pose and motion reconstruction methods. Yet, multi-view algorithms strongly depend on camera parameters; particularly, the relative transformations between the cameras. Such a dependency becomes a hurdle when shifting to dynamic capture in uncontrolled settings. We introduce FLEX (Free muLti-view rEconstruXion), an end-to-end extrinsic parameter-free multi-view model. FLEX is extrinsic parameter-free (dubbed ep-free) in the sense that it does not require extrinsic camera parameters. Our key idea is that the 3D angles between skeletal parts, as well as bone lengths, are invariant to the camera position. Hence, learning 3D rotations and bone lengths rather than locations allows predicting common values for all camera views. Our network takes multiple video streams, learns fused deep features through a novel multi-view fusion layer, and reconstructs a single consistent skeleton with temporally coherent joint rotations. We demonstrate quantitative and qualitative results on three public datasets, and on synthetic multi-person video streams captured by dynamic cameras. We compare our model to state-of-the-art methods that are not ep-free and show that in the absence of camera parameters, we outperform them by a large margin while obtaining comparable results when camera parameters are available. Code, trained models, and other materials are available on our project page.
    Neural Fields for Robotic Object Manipulation from a Single Image. (arXiv:2210.12126v1 [cs.RO])
    We present a unified and compact representation for object rendering, 3D reconstruction, and grasp pose prediction that can be inferred from a single image within a few seconds. We achieve this by leveraging recent advances in the Neural Radiance Field (NeRF) literature that learn category-level priors and fine-tune on novel objects with minimal data and time. Our insight is that we can learn a compact shape representation and extract meaningful additional information from it, such as grasping poses. We believe this to be the first work to retrieve grasping poses directly from a NeRF-based representation using a single viewpoint (RGB-only), rather than going through a secondary network and/or representation. When compared to prior art, our method is two to three orders of magnitude smaller while achieving comparable performance at view reconstruction and grasping. Accompanying our method, we also propose a new dataset of rendered shoes for training a sim-2-real NeRF method with grasping poses for different widths of grippers.
    A Multi-Scale Deep Learning Framework for Projecting Weather Extremes. (arXiv:2210.12137v1 [cs.LG])
    Weather extremes are a major societal and economic hazard, claiming thousands of lives and causing billions of dollars in damage every year. Under climate change, their impact and intensity are expected to worsen significantly. Unfortunately, general circulation models (GCMs), which are currently the primary tool for climate projections, cannot characterize weather extremes accurately. To address this, we present a multi-resolution deep-learning framework that, firstly, corrects a GCM's biases by matching low-order and tail statistics of its output with observations at coarse scales; and secondly, increases the level of detail of the debiased GCM output by reconstructing the finer scales as a function of the coarse scales. We use the proposed framework to generate statistically realistic realizations of the climate over Western Europe from a simple GCM corrected using observational atmospheric reanalysis. We also discuss implications for probabilistic risk assessment of natural disasters in a changing climate.
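    The first stage, matching the bulk and tail statistics of GCM output to observations, is in the spirit of classical empirical quantile mapping; the NumPy sketch below shows that simpler technique as a stand-in. The paper's actual correction operates at coarse scales and is paired with a learned downscaling stage, both of which are omitted here.

```python
import numpy as np

def quantile_map(gcm_hist, obs_hist, gcm_new, n_q=1000):
    """Empirical quantile mapping: place each new GCM value at its quantile
    in the historical GCM distribution, then read off the corresponding
    observed quantile, correcting bulk and (coarsely) tail statistics."""
    qs = np.linspace(0.0, 1.0, n_q)
    gcm_q = np.quantile(gcm_hist, qs)
    obs_q = np.quantile(obs_hist, qs)
    ranks = np.interp(gcm_new, gcm_q, qs)  # quantile of each value under the GCM
    return np.interp(ranks, qs, obs_q)     # mapped onto the observed distribution

# Toy usage: a biased, under-dispersed "GCM" is corrected toward "observations".
rng = np.random.default_rng(0)
obs = rng.gamma(2.0, 2.0, 5000)
gcm = 0.5 * rng.gamma(2.0, 2.0, 5000) + 1.0
corrected = quantile_map(gcm, obs, gcm)
```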
    Learning in RKHM: a $C^*$-Algebraic Twist for Kernel Machines. (arXiv:2210.11855v1 [stat.ML])
    Supervised learning in reproducing kernel Hilbert space (RKHS) and vector-valued RKHS (vvRKHS) has been investigated for more than 30 years. In this paper, we provide a new twist to this rich literature by generalizing supervised learning in RKHS and vvRKHS to reproducing kernel Hilbert $C^*$-module (RKHM), and show how to construct effective positive-definite kernels by considering the perspective of $C^*$-algebra. Unlike the cases of RKHS and vvRKHS, we can use $C^*$-algebras to enlarge representation spaces. This enables us to construct RKHMs whose representation power goes beyond RKHSs, vvRKHSs, and existing methods such as convolutional neural networks. Our framework is suitable, for example, for effectively analyzing image data by allowing the interaction of Fourier components.
    Men Also Do Laundry: Multi-Attribute Bias Amplification. (arXiv:2210.11924v1 [cs.CV])
    As computer vision systems become more widely deployed, there is increasing concern from both the research community and the public that these systems are not only reproducing but amplifying harmful social biases. The phenomenon of bias amplification, which is the focus of this work, refers to models amplifying inherent training set biases at test time. Existing metrics measure bias amplification with respect to single annotated attributes (e.g., $\texttt{computer}$). However, several visual datasets consist of images with multiple attribute annotations. We show models can learn to exploit correlations with respect to multiple attributes (e.g., {$\texttt{computer}$, $\texttt{keyboard}$}), which are not accounted for by current metrics. In addition, we show current metrics can give the erroneous impression that minimal or no bias amplification has occurred, as they involve aggregating over positive and negative values. Further, these metrics lack a clear desired value, making them difficult to interpret. To address these shortcomings, we propose a new metric: Multi-Attribute Bias Amplification. We validate our proposed metric through an analysis of gender bias amplification on the COCO and imSitu datasets. Finally, we benchmark bias mitigation methods using our proposed metric, suggesting possible avenues for future bias mitigation.
    Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks. (arXiv:2010.04261v6 [cs.LG] UPDATED)
    The Hessian captures important properties of the deep neural network loss landscape. Previous works have observed a low-rank structure in the Hessians of neural networks. In this paper, we propose a decoupling conjecture that decomposes the layer-wise Hessians of a network as the Kronecker product of two smaller matrices. We analyze the properties of these smaller matrices and prove the structure of the top eigenspace for random 2-layer networks. The decoupling conjecture has several other interesting implications: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low-rank matrices when they are reshaped into the same shape as the corresponding weight matrix. All of these can be verified empirically for deeper networks. Finally, we use the structure of layer-wise Hessians to obtain better explicit generalization bounds for neural networks.
    lo-fi: distributed fine-tuning without communication. (arXiv:2210.11948v1 [cs.LG])
    When fine-tuning large neural networks, it is common to use multiple nodes and to communicate gradients at each optimization step. By contrast, we investigate completely local fine-tuning, which we refer to as lo-fi. During lo-fi, each node is fine-tuned independently without any communication. Then, the weights are averaged across nodes at the conclusion of fine-tuning. When fine-tuning DeiT-base and DeiT-large on ImageNet, this procedure matches accuracy in-distribution and improves accuracy under distribution shift compared to the baseline, which observes the same amount of data but communicates gradients at each step. We also observe that lo-fi matches the baseline's performance when fine-tuning OPT language models (up to 1.3B parameters) on Common Crawl. By removing the communication requirement, lo-fi reduces resource barriers for fine-tuning large models and enables fine-tuning in settings with prohibitive communication cost.
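    The end-of-training merge is just a uniform average of the nodes' weights. A minimal PyTorch sketch, assuming each node returns a compatible state dict (the `finetune` and `shard` helpers in the usage comment are hypothetical):

```python
import copy
import torch

def average_state_dicts(state_dicts):
    """Uniformly average fine-tuned weights from nodes that never communicated
    during training; this single merge replaces per-step gradient averaging."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        if avg[key].is_floating_point():
            avg[key] = torch.stack([sd[key] for sd in state_dicts]).mean(dim=0)
        # non-float buffers (e.g., step counters) are kept from the first node
    return avg

# Sketch of the overall procedure:
# models = [finetune(copy.deepcopy(base), shard(data, i)) for i in range(n_nodes)]
# base.load_state_dict(average_state_dicts([m.state_dict() for m in models]))
```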
    Equivariant Networks for Zero-Shot Coordination. (arXiv:2210.12124v1 [cs.LG])
    Successful coordination in Dec-POMDPs requires agents to adopt robust strategies and interpretable styles of play for their partner. A common failure mode is symmetry breaking, when agents arbitrarily converge on one out of many equivalent but mutually incompatible policies. Such failures commonly arise under partial observability, e.g., waving your right hand vs. left hand to convey a covert message. In this paper, we present a novel equivariant network architecture for use in Dec-POMDPs that prevents the agent from learning policies which break symmetries, doing so more effectively than prior methods. Our method also acts as a "coordination-improvement operator" for generic, pre-trained policies, and thus may be applied at test time in conjunction with any self-play algorithm. We provide theoretical guarantees for our approach and test it on the AI benchmark task of Hanabi, where we demonstrate that our method outperforms other symmetry-aware baselines in zero-shot coordination and improves the coordination ability of a variety of pre-trained policies. In particular, we show our method can be used to improve on the state of the art for zero-shot coordination on the Hanabi benchmark.
    Normalizing Flows for Knockoff-free Controlled Feature Selection. (arXiv:2106.01528v3 [stat.ML] UPDATED)
    Controlled feature selection aims to discover the features a response depends on while limiting the false discovery rate (FDR) to a predefined level. Recently, multiple deep-learning-based methods have been proposed to perform controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the FDR for two reasons. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect to perform controlled feature selection that does not suffer from either of these two problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. Instead of enforcing the "swap" property, FlowSelect uses a novel MCMC-based procedure to calculate p-values for each feature directly. Asymptotically, FlowSelect computes valid p-values. Empirically, FlowSelect consistently controls the FDR on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also demonstrates greater power on these benchmarks. Additionally, FlowSelect correctly infers the genetic variants associated with specific soybean traits from GWAS data.
    Towards Global Neural Network Abstractions with Locally-Exact Reconstruction. (arXiv:2210.12054v1 [cs.LG])
    Neural networks are a powerful class of non-linear functions. However, their black-box nature makes it difficult to explain their behaviour and certify their safety. Abstraction techniques address this challenge by transforming the neural network into a simpler, over-approximated function. Unfortunately, existing abstraction techniques are slack, which limits their applicability to small local regions of the input domain. In this paper, we propose Global Interval Neural Network Abstractions with Center-Exact Reconstruction (GINNACER). Our novel abstraction technique produces sound over-approximation bounds over the whole input domain while guaranteeing exact reconstructions for any given local input. Our experiments show that GINNACER is several orders of magnitude tighter than state-of-the-art global abstraction techniques, while being competitive with local ones.
    Targeted active learning for probabilistic models. (arXiv:2210.12122v1 [cs.LG])
    A fundamental task in science is to design experiments that yield valuable insights about the system under study. Mathematically, these insights can be represented as a utility or risk function that shapes the value of conducting each experiment. We present PDBAL, a targeted active learning method that adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk function and combines it with a probabilistic model of the experimental outcomes to choose designs that rapidly converge on a high-utility model. We prove theoretical bounds on the label complexity of PDBAL and provide fast closed-form solutions for designing experiments with common exponential family likelihoods. In simulation studies, PDBAL consistently outperforms standard untargeted approaches that focus on maximizing expected information gain over the design space. Finally, we demonstrate the scientific potential of PDBAL through a study on a large cancer drug screen dataset where PDBAL quickly recovers the most efficacious drugs with a small fraction of the total number of experiments.
    TrojanZoo: Towards Unified, Holistic, and Practical Evaluation of Neural Backdoors. (arXiv:2012.09302v4 [cs.LG] UPDATED)
    Neural backdoors represent one of the primary threats to the security of deep learning systems. Intensive research has produced a plethora of backdoor attacks/defenses, resulting in a constant arms race. However, due to the lack of evaluation benchmarks, many critical questions remain under-explored: (i) what are the strengths and limitations of different attacks/defenses? (ii) what are the best practices to operate them? and (iii) how can the existing attacks/defenses be further improved? To bridge this gap, we design and implement TROJANZOO, the first open-source platform for evaluating neural backdoor attacks/defenses in a unified, holistic, and practical manner. Thus far, focusing on the computer vision domain, it has incorporated 8 representative attacks, 14 state-of-the-art defenses, 6 attack performance metrics, 10 defense utility metrics, as well as rich tools for in-depth analysis of the attack-defense interactions. Leveraging TROJANZOO, we conduct a systematic study on the existing attacks/defenses, unveiling their complex design spectrum: both manifest intricate trade-offs among multiple desiderata (e.g., the effectiveness, evasiveness, and transferability of attacks). We further explore improving the existing attacks/defenses, leading to a number of interesting findings: (i) one-pixel triggers often suffice; (ii) training from scratch often outperforms perturbing benign models to craft trojan models; (iii) optimizing triggers and trojan models jointly greatly improves both attack effectiveness and evasiveness; (iv) individual defenses can often be evaded by adaptive attacks; and (v) exploiting model interpretability significantly improves defense robustness. We envision that TROJANZOO will serve as a valuable platform to facilitate future research on neural backdoors.
    The privacy issue of counterfactual explanations: explanation linkage attacks. (arXiv:2210.12051v1 [cs.LG])
    Black-box machine learning models are being used in more and more high-stakes domains, which creates a growing need for Explainable AI (XAI). Unfortunately, the use of XAI in machine learning introduces new privacy risks, which currently remain largely unnoticed. We introduce the explanation linkage attack, which can occur when deploying instance-based strategies to find counterfactual explanations. To counter such an attack, we propose k-anonymous counterfactual explanations and introduce pureness as a new metric to evaluate the validity of these k-anonymous counterfactual explanations. Our results show that making the explanations, rather than the whole dataset, k-anonymous is beneficial for the quality of the explanations.
    Neural Network Approximations of PDEs Beyond Linearity: Representational Perspective. (arXiv:2210.12101v1 [cs.LG])
    A burgeoning line of research has developed deep neural networks capable of approximating the solutions to high dimensional PDEs, opening related lines of theoretical inquiry focused on explaining how it is that these models appear to evade the curse of dimensionality. However, most theoretical analyses thus far have been limited to linear PDEs. In this work, we take a step towards studying the representational power of neural networks for approximating solutions to nonlinear PDEs. We focus on a class of PDEs known as \emph{nonlinear elliptic variational PDEs}, whose solutions minimize an \emph{Euler-Lagrange} energy functional $\mathcal{E}(u) = \int_\Omega L(\nabla u) dx$. We show that if composing a function with Barron norm $b$ with $L$ produces a function of Barron norm at most $B_L b^p$, the solution to the PDE can be $\epsilon$-approximated in the $L^2$ sense by a function with Barron norm $O\left(\left(dB_L\right)^{p^{\log(1/\epsilon)}}\right)$. By a classical result due to Barron [1993], this correspondingly bounds the size of a 2-layer neural network needed to approximate the solution. Treating $p, \epsilon, B_L$ as constants, this quantity is polynomial in dimension, thus showing neural networks can evade the curse of dimensionality. Our proof technique involves neurally simulating (preconditioned) gradient in an appropriate Hilbert space, which converges exponentially fast to the solution of the PDE, and such that we can bound the increase of the Barron norm at each iterate. Our results subsume and substantially generalize analogous prior results for linear elliptic PDEs.
    Differentiable Constrained Imitation Learning for Robot Motion Planning and Control. (arXiv:2210.11796v1 [cs.RO])
    Motion planning and control are crucial components of robotics applications. Here, spatio-temporal hard constraints like system dynamics and safety boundaries (e.g., obstacles in automated driving) restrict the robot's motions. Direct methods from optimal control solve a constrained optimization problem. However, in many applications, finding a proper cost function is inherently difficult because of the weighting of partially conflicting objectives. On the other hand, Imitation Learning (IL) methods such as Behavior Cloning (BC) provide an intuitive framework for learning decision-making from offline demonstrations and constitute a promising avenue for planning and control in complex robot applications. Prior work has primarily relied on soft-constraint approaches, which use additional auxiliary loss terms describing the constraints. However, catastrophic safety-critical failures can occur in out-of-distribution (OOD) scenarios. This work integrates the flexibility of IL with hard-constraint handling from optimal control. Our approach constitutes a general framework for constrained robot motion planning and control using offline IL. Hard constraints are integrated into the learning problem in a differentiable manner, via explicit completion and gradient-based correction. Simulated experiments on mobile robot navigation and automated driving provide evidence for the performance of the proposed method.
    Adversarial Permutation Invariant Training for Universal Sound Separation. (arXiv:2210.12108v1 [cs.SD])
    Universal sound separation consists of separating mixes with arbitrary sounds of different types, and permutation invariant training (PIT) is used to train source agnostic models that do so. In this work, we complement PIT with adversarial losses but find it challenging with the standard formulation used in speech source separation. We overcome this challenge with a novel I-replacement context-based adversarial loss, and by training with multiple discriminators. Our experiments show that by simply improving the loss (keeping the same model and dataset) we obtain a non-negligible improvement of 1.4 dB SI-SNRi in the reverberant FUSS dataset. We also find adversarial PIT to be effective at reducing spectral holes, ubiquitous in mask-based separation models, which highlights the potential relevance of adversarial losses for source separation.
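    For reference, the underlying PIT objective scores the estimated sources against every ordering of the references and keeps the best match. A minimal PyTorch sketch with brute-force permutations (fine for small source counts; the paper's adversarial discriminators are omitted):

```python
from itertools import permutations
import torch

def pit_loss(est, ref):
    """Permutation invariant training loss for est, ref of shape
    (batch, n_src, time): minimum MSE over all source orderings."""
    n_src = ref.shape[1]
    losses = [((est - ref[:, list(p)]) ** 2).mean(dim=(1, 2))
              for p in permutations(range(n_src))]
    return torch.stack(losses).min(dim=0).values.mean()

# Usage: estimates in swapped order still incur (near-)zero loss.
ref = torch.randn(4, 2, 16000)
print(pit_loss(ref.flip(1), ref))  # tensor(0.)
```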
    Multitask Brain Tumor Inpainting with Diffusion Models: A Methodological Report. (arXiv:2210.12113v1 [eess.IV])
    Despite the ever-increasing interest in applying deep learning (DL) models to medical imaging, the typical scarcity and imbalance of medical datasets can severely impact the performance of DL models. The generation of synthetic data that might be freely shared without compromising patient privacy is a well-known technique for addressing these difficulties. Inpainting algorithms are a subset of DL generative models that can alter one or more regions of an input image while matching its surrounding context and, in certain cases, non-imaging input conditions. Although the majority of inpainting techniques for medical imaging data use generative adversarial networks (GANs), the performance of these algorithms is frequently suboptimal due to their limited output variety, a problem that is already well-known for GANs. Denoising diffusion probabilistic models (DDPMs) are a recently introduced family of generative networks that can generate results of comparable quality to GANs, but with diverse outputs. In this paper, we describe a DDPM to execute multiple inpainting tasks on 2D axial slices of brain MRI with various sequences, and present proof-of-concept examples of its performance in a variety of evaluation scenarios. Our model and a public online interface to try our tool are available at: https://github.com/Mayo-Radiology-Informatics-Lab/MBTI
    GLCC: A General Framework for Graph-level Clustering. (arXiv:2210.11879v1 [cs.LG])
    This paper studies the problem of graph-level clustering, a novel yet challenging task. This problem is critical in a variety of real-world applications such as protein clustering and genome analysis in bioinformatics. Recent years have witnessed the success of deep clustering coupled with graph neural networks (GNNs). However, existing methods focus on clustering the nodes of a single graph, while clustering across multiple graphs remains under-explored. In this paper, we propose a general graph-level clustering framework named Graph-Level Contrastive Clustering (GLCC) for multiple graphs. Specifically, GLCC first constructs an adaptive affinity graph to exploit instance- and cluster-level contrastive learning (CL). Instance-level CL leverages a graph-Laplacian-based contrastive loss to learn clustering-friendly representations, while cluster-level CL captures discriminative cluster representations by incorporating neighbor information of each sample. Moreover, we utilize neighbor-aware pseudo-labels to reward the optimization of representation learning. The two steps can be trained alternately to collaborate and benefit each other. Experiments on a range of well-known datasets demonstrate the superiority of our proposed GLCC over competitive baselines.
    On amortizing convex conjugates for optimal transport. (arXiv:2210.12153v1 [cs.LG])
    This paper focuses on computing the convex conjugate operation that arises when solving Euclidean Wasserstein-2 optimal transport problems. This conjugation, which is also referred to as the Legendre-Fenchel conjugate or $c$-transform, is considered difficult to compute and in practice, Wasserstein-2 methods are limited by not being able to exactly conjugate the dual potentials in continuous space. I show that combining amortized approximations to the conjugate with a solver for fine-tuning is computationally easy. This combination significantly improves the quality of transport maps learned for the Wasserstein-2 benchmark by Korotin et al. (2021) and is able to model many 2-dimensional couplings and flows considered in the literature. All of the baselines, methods, and solvers in this paper are available at this http URL
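    The object being computed is $f^\star(y) = \sup_x \langle x, y \rangle - f(x)$. Below is a PyTorch sketch of the amortize-then-fine-tune recipe: a network proposes a warm start $\hat{x}(y)$ and a few solver steps refine it. The amortized network here is an untrained stand-in and all names are illustrative; for $f(x) = \|x\|^2/2$ the conjugate equals $f$ itself, which makes the result easy to check.

```python
import torch

def conjugate(f, y, x_init, n_steps=50, lr=0.1):
    """Fine-tune the conjugate: maximize <x, y> - f(x) from a warm start."""
    x = x_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(n_steps):
        loss = f(x) - (x * y).sum(dim=-1)  # negated objective, per batch element
        opt.zero_grad(); loss.sum().backward(); opt.step()
    x = x.detach()
    return (x * y).sum(dim=-1) - f(x), x

# Amortization: a network predicts the warm start x(y); untrained stand-in here.
amortized = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, 2))
f = lambda x: 0.5 * (x ** 2).sum(dim=-1)
y = torch.randn(128, 2)
vals, x_star = conjugate(f, y, amortized(y).detach())
print((vals - f(y)).abs().mean())  # near zero once fine-tuning has converged
```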
    Towards transparent ANN wind turbine power curve models. (arXiv:2210.12104v1 [cs.LG])
    Accurate wind turbine power curve models, which translate ambient conditions into turbine power output, are crucial for wind energy to scale and fulfil its proposed role in the global energy transition. Machine learning methods, in particular deep neural networks (DNNs), have shown significant advantages over parametric, physics-informed power curve modelling approaches. Nevertheless, they are often criticised as opaque black boxes with no physical understanding of the system they model, which hinders their application in practice. We apply Shapley values, a popular explainable artificial intelligence (XAI) method, to, for the first time, uncover and validate the strategies learned by DNNs from operational wind turbine data. Our findings show that the trend towards ever larger model architectures, driven by the focus on test-set performance, can result in physically implausible model strategies, similar to the Clever Hans effect observed in classification. We, therefore, call for a more prominent role of XAI methods in model selection and additionally offer a practical strategy to use model explanations for wind turbine condition monitoring.
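    A sketch of the kind of audit described, assuming scikit-learn and the shap package; the SCADA-style features, toy power curve, and data below are illustrative, not the paper's turbine data.

```python
import numpy as np
import shap
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
# Illustrative features: wind speed [m/s], air density [kg/m^3], turbulence intensity.
X = rng.uniform([3.0, 1.1, 0.02], [15.0, 1.3, 0.25], size=(2000, 3))
power = np.clip(0.3 * X[:, 0] ** 3 * X[:, 1], None, 2000.0)  # toy curve with rated cap
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500).fit(X, power)

# Model-agnostic Shapley values; a small background sample keeps it tractable.
explainer = shap.KernelExplainer(model.predict, shap.sample(X, 100))
shap_values = explainer.shap_values(X[:50])
print(np.abs(shap_values).mean(axis=0))  # global importance; wind speed should dominate
```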
    Correlating sparse sensing for network-wide traffic speed estimation: An integrated graph tensor-based kriging approach. (arXiv:2210.11780v1 [stat.ML])
    Traffic speed is central to characterizing the fluidity of a road network. Many transportation applications rely on it, such as real-time navigation, dynamic route planning, and congestion management. Rapid advances in sensing and communication techniques make traffic speed detection easier than ever. However, due to the sparse deployment of static sensors or the low penetration of mobile sensors, the detected speeds are incomplete and far from network-wide use. In addition, sensors are prone to errors or missing data for various reasons, so the speeds they report can be highly noisy. These drawbacks call for effective techniques to recover credible estimates from the incomplete data. In this work, we identify the problem as a spatiotemporal kriging problem and propose a unified graph embedded tensor (SGET) learning framework featuring both low-rankness and multi-dimensional correlations for network-wide traffic speed kriging under limited observations. Specifically, three types of speed correlation, namely temporal continuity, temporal periodicity, and spatial proximity, are carefully chosen. We then design an efficient solution algorithm, via several effective numerical techniques, to scale the proposed model up to network-wide kriging. Experiments on two public million-record traffic speed datasets show that the proposed SGET achieves state-of-the-art kriging performance even under low observation rates, while saving more than half the computing time compared with baseline methods. Some insights into spatiotemporal traffic data kriging at the network level are provided as well.
    Evolution of Neural Tangent Kernels under Benign and Adversarial Training. (arXiv:2210.12030v1 [cs.LG])
    Two key challenges facing modern deep learning are mitigating deep networks' vulnerability to adversarial attacks and understanding deep learning's generalization capabilities. Towards the first issue, many defense strategies have been developed, with the most common being Adversarial Training (AT). Towards the second challenge, one of the dominant theories that has emerged is the Neural Tangent Kernel (NTK) -- a characterization of neural network behavior in the infinite-width limit. In this limit, the kernel is frozen, and the underlying feature map is fixed. In finite widths, however, there is evidence that feature learning happens at the earlier stages of the training (kernel learning) before a second phase where the kernel remains fixed (lazy training). While prior work has aimed at studying adversarial vulnerability through the lens of the frozen infinite-width NTK, there is no work that studies the adversarial robustness of the empirical/finite NTK during training. In this work, we perform an empirical study of the evolution of the empirical NTK under standard and adversarial training, aiming to disambiguate the effect of adversarial training on kernel learning and lazy training. We find under adversarial training, the empirical NTK rapidly converges to a different kernel (and feature map) than standard training. This new kernel provides adversarial robustness, even when non-robust training is performed on top of it. Furthermore, we find that adversarial training on top of a fixed kernel can yield a classifier with $76.1\%$ robust accuracy under PGD attacks with $\varepsilon = 4/255$ on CIFAR-10.
    Validation of Composite Systems by Discrepancy Propagation. (arXiv:2210.12061v1 [cs.LG])
    Assessing the validity of a real-world system with respect to given quality criteria is a common yet costly task in industrial applications due to the vast number of required real-world tests. Validating such systems by means of simulation offers a promising and less expensive alternative, but requires an assessment of the simulation accuracy and therefore end-to-end measurements. Additionally, covariate shifts between simulations and actual usage can cause difficulties for estimating the reliability of such systems. In this work, we present a validation method that propagates bounds on distributional discrepancy measures through a composite system, thereby allowing us to derive an upper bound on the failure probability of the real system from potentially inaccurate simulations. Each propagation step entails an optimization problem, where -- for measures such as maximum mean discrepancy (MMD) -- we develop tight convex relaxations based on semidefinite programs. We demonstrate that our propagation method yields valid and useful bounds for composite systems exhibiting a variety of realistic effects. In particular, we show that the proposed method can successfully account for data shifts within the experimental design as well as model inaccuracies within the used simulation.
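    For reference, MMD is one of the discrepancy measures whose bounds are propagated. A minimal NumPy sketch of the standard unbiased estimator of squared MMD under an RBF kernel (the semidefinite relaxations used for propagation are beyond this snippet):

```python
import numpy as np

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of squared MMD between samples X and Y under an
    RBF kernel; near zero when both samples share a distribution."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    Kxx, Kyy, Kxy = k(X, X), k(Y, Y), k(X, Y)
    n, m = len(X), len(Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

rng = np.random.default_rng(0)
same = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
shifted = mmd2_unbiased(rng.normal(size=(500, 2)), rng.normal(0.5, 1.0, (500, 2)))
print(same, shifted)  # the shifted pair shows a clearly larger discrepancy
```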
    Decoding a Neural Retriever's Latent Space for Query Suggestion. (arXiv:2210.12084v1 [cs.CL])
    Neural retrieval models have superseded classic bag-of-words methods such as BM25 as the retrieval framework of choice. However, neural systems lack the interpretability of bag-of-words models; it is not trivial to connect a query change to a change in the latent space that ultimately determines the retrieval results. To shed light on this embedding space, we learn a "query decoder" that, given a latent representation of a neural search engine, generates the corresponding query. We show that it is possible to decode a meaningful query from its latent representation and, when moving in the right direction in latent space, to decode a query that retrieves the relevant paragraph. In particular, the query decoder can be useful to understand "what should have been asked" to retrieve a particular paragraph from the collection. We employ the query decoder to generate a large synthetic dataset of query reformulations for MSMarco, leading to improved retrieval performance. On this data, we train a pseudo-relevance feedback (PRF) T5 model for the application of query suggestion that outperforms both query reformulation and PRF information retrieval baselines.
    Integrating Policy Summaries with Reward Decomposition for Explaining Reinforcement Learning Agents. (arXiv:2210.11825v1 [cs.LG])
    Explaining the behavior of reinforcement learning agents operating in sequential decision-making settings is challenging, as their behavior is affected by a dynamic environment and delayed rewards. Methods that help users understand the behavior of such agents can roughly be divided into local explanations that analyze specific decisions of the agents and global explanations that convey the general strategy of the agents. In this work, we study a novel combination of local and global explanations for reinforcement learning agents. Specifically, we combine reward decomposition, a local explanation method that exposes which components of the reward function influenced a specific decision, and HIGHLIGHTS, a global explanation method that shows a summary of the agent's behavior in decisive states. We conducted two user studies to evaluate the integration of these explanation methods and their respective benefits. Our results show significant benefits for both methods. In general, we found that the local reward decomposition was more useful for identifying the agents' priorities. However, when there was only a minor difference between the agents' preferences, then the global information provided by HIGHLIGHTS additionally improved participants' understanding.
    Optimal Contextual Bandits with Knapsacks under Realizibility via Regression Oracles. (arXiv:2210.11834v1 [cs.LG])
    We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also costs a random resource consumption in a vector form. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.
    Machine-Learning Compression for Particle Physics Discoveries. (arXiv:2210.11489v1 [hep-ph])
    In collider-based particle and nuclear physics experiments, data are produced at such extreme rates that only a subset can be recorded for later analysis. Typically, algorithms select individual collision events for preservation and store the complete experimental response. A relatively new alternative strategy is to additionally save a partial record for a larger subset of events, allowing for later specific analysis of a larger fraction of events. We propose a strategy that bridges these paradigms by compressing entire events for generic offline analysis but at a lower fidelity. An optimal-transport-based $\beta$ Variational Autoencoder (VAE) is used to automate the compression and the hyperparameter $\beta$ controls the compression fidelity. We introduce a new approach for multi-objective learning functions by simultaneously learning a VAE appropriate for all values of $\beta$ through parameterization. We present an example use case, a di-muon resonance search at the Large Hadron Collider (LHC), where we show that simulated data compressed by our $\beta$-VAE has enough fidelity to distinguish distinct signal morphologies.
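    The rate-distortion knob is the standard $\beta$-VAE objective, reconstruction error plus $\beta$ times the KL term; a minimal PyTorch sketch of the loss (the paper's optimal-transport variant and its parameterization over all $\beta$ values simultaneously are omitted):

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_hat, mu, logvar, beta):
    """Reconstruction fidelity vs. compression rate, traded off by beta:
    a larger beta squeezes the latent code harder at the cost of fidelity."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1.0 + logvar - mu.pow(2) - logvar.exp())
    return recon + beta * kl
```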
    FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. (arXiv:2210.11790v1 [cs.LG])
    Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.
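    To illustrate the spectral mechanism, the NetworkX toy below greedily adds the non-edge that most increases the spectral gap of the normalized Laplacian; FoSR itself scores candidate edges with a cheaper first-order approximation, so this brute-force recomputation is illustrative only.

```python
import itertools
import networkx as nx
import numpy as np

def spectral_gap(G):
    """Second-smallest eigenvalue of the normalized Laplacian (the first is 0)."""
    L = nx.normalized_laplacian_matrix(G).toarray()
    return np.sort(np.linalg.eigvalsh(L))[1]

def rewire(G, n_edges=5):
    """Greedily add the non-edge that most widens the spectral gap."""
    G = G.copy()
    for _ in range(n_edges):
        candidates = [e for e in itertools.combinations(G.nodes, 2)
                      if not G.has_edge(*e)]
        best = max(candidates,
                   key=lambda e: spectral_gap(nx.Graph(list(G.edges) + [e])))
        G.add_edge(*best)
    return G

# Two cliques joined by one bridge: a textbook bottleneck for message passing.
G = nx.barbell_graph(6, 0)
print(spectral_gap(G), spectral_gap(rewire(G)))  # the gap widens after rewiring
```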
    Diffuser: Efficient Transformers with Multi-hop Attention Diffusion for Long Sequences. (arXiv:2210.11794v1 [cs.LG])
    Efficient Transformers have been developed for long sequence modeling, due to their subquadratic memory and time complexity. Sparse Transformer is a popular approach to improving the efficiency of Transformers by restricting self-attention to locations specified by the predefined sparse patterns. However, leveraging sparsity may sacrifice expressiveness compared to full-attention, when important token correlations are multiple hops away. To combine the advantages of both the efficiency of sparse transformers and the expressiveness of full-attention Transformers, we propose \textit{Diffuser}, a new state-of-the-art efficient Transformer. Diffuser incorporates all token interactions within one attention layer while maintaining low computation and memory costs. The key idea is to expand the receptive field of sparse attention using Attention Diffusion, which computes multi-hop token correlations based on all paths between corresponding disconnected tokens, besides attention among neighboring tokens. Theoretically, we show the expressiveness of Diffuser as a universal sequence approximator for sequence-to-sequence modeling, and investigate its ability to approximate full-attention by analyzing the graph expander property from the spectral perspective. Experimentally, we investigate the effectiveness of Diffuser with extensive evaluations, including language modeling, image modeling, and Long Range Arena (LRA). Evaluation results show that Diffuser achieves an average improvement of 0.94% on text classification tasks and 2.30% on LRA, with 1.67$\times$ memory savings compared to state-of-the-art benchmarks, demonstrating Diffuser's superiority in both expressiveness and efficiency.
    Barrier Hamiltonian Monte Carlo. (arXiv:2210.11925v1 [stat.ML])
    In this paper, we propose Barrier Hamiltonian Monte Carlo (BHMC), a version of HMC which aims at sampling from a Gibbs distribution $\pi$ on a manifold $\mathsf{M}$, endowed with a Hessian metric $\mathfrak{g}$ derived from a self-concordant barrier. Like Riemannian Manifold HMC, our method relies on Hamiltonian dynamics which comprise $\mathfrak{g}$. It incorporates the constraints defining $\mathsf{M}$ and is therefore able to exploit its underlying geometry. We first introduce c-BHMC (continuous BHMC), for which we assume that the Hamiltonian dynamics can be integrated exactly, and show that it generates a Markov chain for which $\pi$ is invariant. Secondly, we design n-BHMC (numerical BHMC), a Metropolis-Hastings algorithm which combines an acceptance filter including a "reverse integration check" and numerical integrators of the Hamiltonian dynamics. Our main results establish that n-BHMC generates a reversible Markov chain with respect to $\pi$. This is in contrast to existing algorithms which extend the HMC method to Riemannian manifolds, as they do not deal with asymptotic bias. Our conclusions are supported by numerical experiments where we consider target distributions defined on polytopes.
    A Methodology for the Prediction of Drug Target Interaction using CDK Descriptors. (arXiv:2210.11482v1 [q-bio.QM])
    Detecting probable Drug Target Interactions (DTIs) is a critical task in drug discovery. Conventional DTI studies are expensive, labor-intensive, and time-consuming, so there are significant reasons to construct useful computational techniques that can successfully anticipate possible DTIs. Although certain methods have been developed for this purpose, numerous interactions are yet to be discovered, and prediction accuracy is still low. To meet these challenges, we propose a DTI prediction model built on the molecular structure of drugs and the sequence of target proteins. In the proposed model, we use the Simplified Molecular Input Line Entry System (SMILES) to create CDK descriptors, Molecular ACCess System (MACCS) fingerprints, and Electrotopological state (Estate) fingerprints, and use the amino acid sequences of targets to obtain the Pseudo Amino Acid Composition (PseAAC). We aim to evaluate the performance of DTI prediction models using CDK descriptors. For comparison, we use benchmark data and evaluate model performance on two widely used fingerprints, MACCS and Estate fingerprints. The evaluation shows that CDK descriptors are superior at predicting DTIs. The proposed method also significantly outperforms other previously published techniques.
    Self-Supervised Pretraining on Satellite Imagery: a Case Study on Label-Efficient Vehicle Detection. (arXiv:2210.11815v1 [cs.CV])
    In defense-related remote sensing applications, such as vehicle detection on satellite imagery, supervised learning requires a huge number of labeled examples to reach operational performance. Such data are challenging to obtain as they require military experts, and some observables are intrinsically rare. This limited labeling capability, as well as the large number of unlabeled images available due to the growing number of sensors, makes object detection on remote sensing imagery highly relevant for self-supervised learning. We study in-domain self-supervised representation learning for object detection on very high resolution optical satellite imagery, which is as yet poorly explored. For the first time to our knowledge, we study the problem of label efficiency on this task. We use the large land use classification dataset Functional Map of the World to pretrain representations with an extension of the Momentum Contrast framework. We then investigate this model's transferability on a real-world task of fine-grained vehicle detection and classification on Preligens proprietary data, which is designed to be representative of an operational use case of strategic site surveillance. We show that our in-domain self-supervised learning model is competitive with ImageNet pretraining, and outperforms it in the low-label regime.
    DIICAN: Dual Time-scale State-Coupled Co-estimation of SOC, SOH and RUL for Lithium-Ion Batteries. (arXiv:2210.11941v1 [eess.SY])
    Accurate co-estimation of battery states, such as state-of-charge (SOC), state-of-health (SOH), and remaining useful life (RUL), is crucial for battery management systems to ensure safe and reliable operation. Although the external properties of a battery change with its degree of aging, batteries' degradation mechanisms share similar evolving patterns. Since batteries are complicated chemical systems, these states are highly coupled with intricate electrochemical processes. A state-coupled co-estimation method named Deep Inter and Intra-Cycle Attention Network (DIICAN) is proposed in this paper to estimate SOC, SOH, and RUL; it organizes battery measurement data into intra-cycle and inter-cycle time scales. To extract degradation-related features automatically and adapt to practical working conditions, a convolutional neural network is applied. A state-degradation attention unit is utilized to extract the battery state evolution pattern and evaluate the degree of battery degradation. To account for the influence of battery aging on SOC estimation, the degradation-related state is incorporated in the SOC estimation for capacity calibration. The DIICAN method is validated on the Oxford battery dataset. The experimental results show that the proposed method achieves SOH and RUL co-estimation with high accuracy and effectively improves SOC estimation accuracy over the whole lifespan.
    Extending $\mathrm{F}_1$ metric, probabilistic approach. (arXiv:2210.11997v1 [cs.LG])
    This article explores an extension of the well-known $\mathrm{F}_1$ score used for assessing the performance of binary classifiers. We propose a new metric based on a probabilistic interpretation of precision, recall, specificity, and negative predictive value. We describe its properties and compare it to common metrics. We then demonstrate its behavior in edge cases of the confusion matrix. Finally, the properties of the metric are tested on a binary classifier trained on a real dataset.
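    The abstract does not spell out the combination, so purely as an illustration, the snippet below reads the four rates as conditional probabilities and extends $\mathrm{F}_1$ symmetrically via a harmonic mean over all four; this particular combination is an assumption for illustration, not the paper's definition.

```python
def rates(tp, fp, fn, tn):
    """The four conditional probabilities behind the confusion matrix."""
    precision = tp / (tp + fp)    # P(y = 1 | y_hat = 1)
    recall = tp / (tp + fn)       # P(y_hat = 1 | y = 1)
    specificity = tn / (tn + fp)  # P(y_hat = 0 | y = 0)
    npv = tn / (tn + fn)          # P(y = 0 | y_hat = 0)
    return precision, recall, specificity, npv

def f1_like(tp, fp, fn, tn):
    """Illustrative symmetric extension: harmonic mean of all four rates
    (F1 itself is the harmonic mean of precision and recall only)."""
    rs = rates(tp, fp, fn, tn)
    return len(rs) / sum(1.0 / r for r in rs)

print(f1_like(80, 10, 20, 90))  # high only if the classifier is good on both classes
```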
    When Expressivity Meets Trainability: Fewer than $n$ Neurons Can Work. (arXiv:2210.12001v1 [cs.LG])
    Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can narrow networks have as strong expressivity as wide ones? If so, does the loss function exhibit a benign optimization landscape? In this work, we provide partially affirmative answers to both questions for 1-hidden-layer networks with fewer than $n$ (sample size) neurons when the activation is smooth. First, we prove that as long as the width $m \geq 2n/d$ (where $d$ is the input dimension), its expressivity is strong, i.e., there exists at least one global minimizer with zero training loss. Second, we identify a nice local region with no local-min or saddle points. Nevertheless, it is not clear whether gradient descent can stay in this nice region. Third, we consider a constrained optimization formulation where the feasible region is the nice local region, and prove that every KKT point is a nearly global minimizer. It is expected that projected gradient methods converge to KKT points under mild technical conditions, but we leave the rigorous convergence analysis to future work. Thorough numerical results show that projected gradient methods on this constrained formulation significantly outperform SGD for training narrow neural nets.
    Integrated Brier Score based Survival Cobra -- A regression based approach. (arXiv:2210.12006v1 [cs.LG])
    In this paper, we provide two novel regression-based integrations of the combined regression strategy (COBRA) ensemble that use the Integrated Brier Score to predict the conditional survival function. Besides a straightforward implementation, our proposal includes a version in which the predictions of all weak learners are weighted by their Integrated Brier Scores before forming the final survival function. Two different norms (Frobenius and sup norm) are used to determine the proximity points in the algorithm. Our implementations also handle right-censored data. We illustrate the proposed algorithms through a few real-life data analyses.
    Ollivier-Ricci Curvature for Hypergraphs: A Unified Framework. (arXiv:2210.12048v1 [cs.LG])
    Bridging geometry and topology, curvature is a powerful and expressive invariant. While the utility of curvature has been theoretically and empirically confirmed in the context of manifolds and graphs, its generalization to the emerging domain of hypergraphs has remained largely unexplored. On graphs, Ollivier-Ricci curvature measures differences between random walks via Wasserstein distances, thus grounding a geometric concept in ideas from probability and optimal transport. We develop ORCHID, a flexible framework generalizing Ollivier-Ricci curvature to hypergraphs, and prove that the resulting curvatures have favorable theoretical properties. Through extensive experiments on synthetic and real-world hypergraphs from different domains, we demonstrate that ORCHID curvatures are both scalable and useful to perform a variety of hypergraph tasks in practice.
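    For intuition, the sketch below computes Ollivier-Ricci curvature on an ordinary graph, the base case that ORCHID generalizes: lazy random-walk measures at the two endpoints are compared with a Wasserstein distance solved as a small linear program.
    ```python
    import numpy as np
    from scipy.optimize import linprog
    from scipy.sparse.csgraph import shortest_path

    def ollivier_ricci_edge(adj, dist, u, v, alpha=0.5):
        """Ollivier-Ricci curvature of edge (u, v) on a graph; alpha is the
        laziness of the random walk."""
        n = adj.shape[0]
        def walk_measure(x):
            m = np.zeros(n)
            nbrs = np.flatnonzero(adj[x])
            m[x] = alpha
            m[nbrs] = (1 - alpha) / len(nbrs)
            return m
        mu, mv = walk_measure(u), walk_measure(v)
        # W1(mu, mv) as an optimal transport LP: min <dist, P>, marginals fixed.
        A_eq, b_eq = [], []
        for i in range(n):                      # row sums of the plan equal mu
            row = np.zeros((n, n)); row[i, :] = 1
            A_eq.append(row.ravel()); b_eq.append(mu[i])
        for j in range(n):                      # column sums equal mv
            col = np.zeros((n, n)); col[:, j] = 1
            A_eq.append(col.ravel()); b_eq.append(mv[j])
        res = linprog(dist.ravel(), A_eq=np.array(A_eq), b_eq=b_eq, bounds=(0, None))
        return 1.0 - res.fun / dist[u, v]

    adj = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])   # triangle graph
    dist = shortest_path(adj, unweighted=True)
    print(ollivier_ricci_edge(adj, dist, 0, 1))          # 0.75: positive curvature
    ```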
    Non-Autoregressive Neural Machine Translation: A Call for Clarity. (arXiv:2205.10577v2 [cs.CL] UPDATED)
    Non-autoregressive approaches aim to improve the inference speed of translation models by requiring only a single forward pass to generate the output sequence instead of iteratively producing each predicted token. However, their translation quality still tends to be inferior to that of their autoregressive counterparts due to several issues involving output token interdependence. In this work, we take a step back and revisit several techniques that have been proposed for improving non-autoregressive translation models, and compare their combined translation quality and speed implications in third-party testing environments. We provide novel insights for establishing strong baselines using length prediction or CTC-based architecture variants and contribute standardized BLEU, chrF++, and TER scores using sacreBLEU on four translation tasks; such scores have crucially been missing, as inconsistencies in the use of tokenized BLEU lead to deviations of up to 1.7 BLEU points. Our open-sourced code is integrated into fairseq for reproducibility.
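    For reference, standardized scores of this kind can be produced with sacreBLEU's Python API; a sketch assuming sacrebleu 2.x, with made-up strings:
    ```python
    from sacrebleu.metrics import BLEU, CHRF, TER

    hyps = ["the cat sat on the mat"]
    refs = [["the cat sat on a mat"]]  # one reference stream, one ref per hypothesis

    bleu = BLEU()              # standardized, detokenized BLEU
    chrf = CHRF(word_order=2)  # word_order=2 gives chrF++
    ter = TER()

    print(bleu.corpus_score(hyps, refs))
    print(chrf.corpus_score(hyps, refs))
    print(ter.corpus_score(hyps, refs))
    print(bleu.get_signature())  # report this signature for reproducibility
    ```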
    Machine Learning based Discrimination for Excited State Promoted Readout. (arXiv:2210.08574v2 [quant-ph] UPDATED)
    A limiting factor for readout fidelity in superconducting qubits is the relaxation of the qubit to the ground state before the resonator reaches its final target state. A technique known as excited state promoted (ESP) readout was proposed to reduce this effect and further improve readout contrast on superconducting hardware. In this work, we use readout data from IBM's five-qubit quantum systems to measure the effectiveness of deep neural networks, such as feedforward neural networks, and various classification algorithms, such as k-nearest neighbors, decision trees, and Gaussian naive Bayes, for single-qubit and multi-qubit discrimination. These methods were compared to the standard linear and quadratic discriminant analysis algorithms in terms of qubit-state-assignment fidelity, robustness to readout crosstalk, and training time.
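    A minimal sketch of such a classifier comparison with scikit-learn is shown below; the two-Gaussian IQ-plane data is a synthetic stand-in for the IBM hardware readout records used in the paper.
    ```python
    import numpy as np
    from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                               QuadraticDiscriminantAnalysis)
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.model_selection import cross_val_score

    # Synthetic stand-in for single-qubit IQ readout: two Gaussian blobs in the
    # (I, Q) plane for |0> and |1>.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal([0, 0], 1.0, (1000, 2)),
                   rng.normal([2, 2], 1.0, (1000, 2))])
    y = np.repeat([0, 1], 1000)

    for clf in [LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis(),
                KNeighborsClassifier(), DecisionTreeClassifier(), GaussianNB()]:
        acc = cross_val_score(clf, X, y, cv=5).mean()
        print(f"{type(clf).__name__:32s} assignment fidelity ~ {acc:.3f}")
    ```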
    PaCo: Parameter-Compositional Multi-Task Reinforcement Learning. (arXiv:2210.11653v1 [cs.LG])
    The purpose of multi-task reinforcement learning (MTRL) is to train a single policy that can be applied to a set of different tasks. Sharing parameters allows us to take advantage of the similarities among tasks. However, the gaps between the content and difficulty of different tasks raise the questions of which tasks should share parameters and which parameters should be shared, as well as optimization challenges arising from parameter sharing. In this work, we introduce a parameter-compositional approach (PaCo) as an attempt to address these challenges. In this framework, a policy subspace represented by a set of parameters is learned. Policies for all the single tasks lie in this subspace and can be composed by interpolating within the learned set. This allows not only flexible parameter sharing but also a natural way to improve training. We demonstrate state-of-the-art performance on Meta-World benchmarks, verifying the effectiveness of the proposed approach.
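    The sketch below is our reading of the parameter-compositional idea: K shared parameter vectors span a subspace, and each task's policy is a learned softmax-weighted combination of them. The network shapes and composition details are illustrative assumptions, not the actual PaCo implementation.
    ```python
    import torch
    import torch.nn as nn
    from math import prod

    class PaCoPolicy(nn.Module):
        """Illustrative parameter-compositional policy over K shared vectors."""
        def __init__(self, n_tasks, k, in_dim, out_dim, hidden=64):
            super().__init__()
            self.shapes = [(hidden, in_dim), (hidden,), (out_dim, hidden), (out_dim,)]
            n_params = sum(prod(s) for s in self.shapes)
            self.param_set = nn.Parameter(0.05 * torch.randn(k, n_params))
            self.task_logits = nn.Parameter(torch.zeros(n_tasks, k))

        def forward(self, obs, task_id):
            w = torch.softmax(self.task_logits[task_id], dim=-1)
            theta = w @ self.param_set            # compose this task's parameters
            chunks, i = [], 0
            for s in self.shapes:                 # unflatten into layer weights
                n = prod(s)
                chunks.append(theta[i:i + n].view(*s)); i += n
            w1, b1, w2, b2 = chunks
            return torch.tanh(obs @ w1.T + b1) @ w2.T + b2

    policy = PaCoPolicy(n_tasks=10, k=5, in_dim=8, out_dim=4)
    action = policy(torch.randn(8), task_id=3)    # task-specific policy output
    ```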
    Global Counterfactual Explainer for Graph Neural Networks. (arXiv:2210.11695v1 [cs.LG])
    Graph neural networks (GNNs) find applications in various domains such as computational biology, natural language processing, and computer security. Owing to their popularity, there is an increasing need to explain GNN predictions since GNNs are black-box machine learning models. One way to address this is counterfactual reasoning, where the objective is to change the GNN prediction via minimal changes to the input graph. Existing methods for counterfactual explanation of GNNs are limited to instance-specific local reasoning. This approach has two major limitations: it cannot offer global recourse policies, and it overloads human cognitive capacity with too much information. In this work, we study the global explainability of GNNs through global counterfactual reasoning. Specifically, we want to find a small set of representative counterfactual graphs that explains all input graphs. Towards this goal, we propose GCFExplainer, a novel algorithm powered by vertex-reinforced random walks on an edit map of graphs with a greedy summary. Extensive experiments on real graph datasets show that the global explanation from GCFExplainer provides important high-level insights into model behavior and achieves a 46.9% gain in recourse coverage and a 9.5% reduction in recourse cost compared to state-of-the-art local counterfactual explainers.
    Structural Kernel Search via Bayesian Optimization and Symbolical Optimal Transport. (arXiv:2210.11836v1 [cs.LG])
    Despite recent advances in automated machine learning, model selection is still a complex and computationally intensive process. For Gaussian processes (GPs), selecting the kernel is a crucial task, often done manually by the expert. Additionally, evaluating the model selection criteria for Gaussian processes typically scales cubically in the sample size, rendering kernel search particularly computationally expensive. We propose a novel, efficient search method through a general, structured kernel space. Previous methods solved this task via Bayesian optimization and relied on measuring the distance between GPs directly in function space to construct a kernel-kernel. We present an alternative approach by defining a kernel-kernel over the symbolic representation of the statistical hypothesis that is associated with a kernel. We empirically show that this leads to a computationally more efficient way of searching through a discrete kernel space.
    LOT: Layer-wise Orthogonal Training on Improving l2 Certified Robustness. (arXiv:2210.11620v1 [cs.LG])
    Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints can enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers via parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute the inverse square root of a convolution kernel by transforming the input domain to the Fourier frequency domain. On the other hand, as existing works show that semi-supervised training helps improve empirical robustness, we aim to bridge the gap and prove that semi-supervised learning also improves the certified robustness of Lipschitz-bounded models. We conduct comprehensive evaluations for LOT under different settings. We show that LOT significantly outperforms baselines regarding deterministic $\ell_2$ certified robustness, and scales to deeper neural networks. Under the supervised scenario, we improve the state-of-the-art certified robustness for all architectures (e.g. from 59.04% to 63.50% on CIFAR-10 and from 32.57% to 34.59% on CIFAR-100 at radius $\rho = 36/255$ for 40-layer networks). With semi-supervised learning over unlabelled data, we are able to improve the state-of-the-art certified robustness on CIFAR-10 at $\rho = 108/255$ from 36.04% to 42.39%. In addition, LOT consistently outperforms baselines on different model architectures with only 1/3 of the evaluation time.
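    The dense-matrix analogue of this parametrization can be sketched directly: an unconstrained matrix V is mapped to the orthogonal matrix V(V^T V)^{-1/2} via an eigendecomposition (the paper applies this per frequency in the Fourier domain for convolutions).
    ```python
    import torch

    def orthogonalize(V, eps=1e-7):
        """Map an unconstrained square matrix V to the orthogonal matrix
        V (V^T V)^{-1/2} -- the dense analogue of LOT's parametrization of
        1-Lipschitz layers."""
        S = V.T @ V
        eigvals, eigvecs = torch.linalg.eigh(S)  # S is symmetric PSD
        inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.T
        return V @ inv_sqrt

    V = torch.randn(16, 16)
    W = orthogonalize(V)
    print(torch.dist(W.T @ W, torch.eye(16)))  # ~0: W is orthogonal, hence 1-Lipschitz
    ```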
    Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale. (arXiv:2210.11693v1 [cs.LG])
    We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning-rate and decaying schedules. When used for pre-training BERT variants and T5, Amos consistently converges faster than the state-of-the-art settings of AdamW, achieving better validation loss within <=70% training steps and time, while requiring <=51% memory for slot variables. Our code is open-sourced at: https://github.com/google-research/jestimator
    Privacy-Preserved Neural Graph Similarity Learning. (arXiv:2210.11730v1 [cs.LG])
    To develop effective and efficient graph similarity learning (GSL) models, a series of data-driven neural algorithms have been proposed in recent years. Although GSL models are frequently deployed in privacy-sensitive scenarios, the user privacy protection of neural GSL models has not drawn much attention. To comprehensively understand the privacy protection issues, we first introduce the concept of attackable representation to systematically characterize the privacy attacks that each model can face. Inspired by the qualitative results, we propose a novel Privacy-Preserving neural Graph Matching network model, named PPGM, for graph similarity learning. To prevent reconstruction attacks, the proposed model does not communicate node-level representations between devices. Instead, we learn multi-perspective graph representations based on learnable context vectors. To mitigate attacks on graph properties, obfuscated features that contain information from both graphs are communicated. In this way, the private properties of each graph are difficult to infer. By leveraging node-graph matching techniques when calculating the obfuscated features, PPGM also remains effective at similarity measurement. To quantitatively evaluate the privacy-preserving ability of neural GSL models, we further propose an evaluation protocol via training supervised black-box attack models. Extensive experiments on widely-used benchmarks show the effectiveness and strong privacy-protection ability of the proposed model PPGM. The code is available at: https://github.com/RUCAIBox/PPGM.
    AfroLID: A Neural Language Identification Tool for African Languages. (arXiv:2210.11744v1 [cs.CL])
    Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's $7000$+ languages today are not covered by LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for $517$ African languages and varieties. AfroLID exploits a multi-domain web dataset manually curated from across $14$ language families utilizing five orthographic systems. When evaluated on our blind Test set, AfroLID achieves $95.89$ $F_1$-score. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding it to outperform them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically-motivated error analysis that allow us to showcase both AfroLID's powerful capabilities and its limitations.
    Neural Sheaf Diffusion: A Topological Perspective on Heterophily and Oversmoothing in GNNs. (arXiv:2202.04579v3 [cs.LG] UPDATED)
    Cellular sheaves equip graphs with a "geometrical" structure by assigning vector spaces and linear maps to nodes and edges. Graph Neural Networks (GNNs) implicitly assume a graph with a trivial underlying sheaf. This choice is reflected in the structure of the graph Laplacian operator, the properties of the associated diffusion equation, and the characteristics of the convolutional models that discretise this equation. In this paper, we use cellular sheaf theory to show that the underlying geometry of the graph is deeply linked with the performance of GNNs in heterophilic settings and their oversmoothing behaviour. By considering a hierarchy of increasingly general sheaves, we study how the ability of the sheaf diffusion process to achieve linear separation of the classes in the infinite time limit expands. At the same time, we prove that when the sheaf is non-trivial, discretised parametric diffusion processes have greater control than GNNs over their asymptotic behaviour. On the practical side, we study how sheaves can be learned from data. The resulting sheaf diffusion models have many desirable properties that address the limitations of classical graph diffusion equations (and corresponding GNN models) and obtain competitive results in heterophilic settings. Overall, our work provides new connections between GNNs and algebraic topology and would be of interest to both fields.
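    To make the sheaf construction concrete, the sketch below builds a sheaf Laplacian for a small graph with d-dimensional stalks; identity restriction maps recover the ordinary graph Laplacian (tensored with the identity), i.e. the trivial sheaf implicitly assumed by standard GNNs.
    ```python
    import numpy as np

    def sheaf_laplacian(edges, restriction, n, d):
        """Sheaf Laplacian for a graph with d-dimensional stalks.
        restriction[(e, v)] is the d x d map from the stalk at node v into
        the stalk over edge e."""
        L = np.zeros((n * d, n * d))
        for e, (u, v) in enumerate(edges):
            Fu, Fv = restriction[(e, u)], restriction[(e, v)]
            L[u*d:(u+1)*d, u*d:(u+1)*d] += Fu.T @ Fu
            L[v*d:(v+1)*d, v*d:(v+1)*d] += Fv.T @ Fv
            L[u*d:(u+1)*d, v*d:(v+1)*d] -= Fu.T @ Fv
            L[v*d:(v+1)*d, u*d:(u+1)*d] -= Fv.T @ Fu
        return L

    # Path graph 0-1-2 with 2-dimensional stalks and random restriction maps.
    edges = [(0, 1), (1, 2)]
    rng = np.random.default_rng(0)
    restriction = {(e, v): rng.standard_normal((2, 2))
                   for e, uv in enumerate(edges) for v in uv}
    L = sheaf_laplacian(edges, restriction, n=3, d=2)
    print(np.linalg.eigvalsh(L).min() >= -1e-9)  # True: L is positive semi-definite
    ```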
    NESTANets: Stable, accurate and efficient neural networks for analysis-sparse inverse problems. (arXiv:2203.00804v2 [cs.LG] UPDATED)
    Solving inverse problems is a fundamental component of science, engineering and mathematics. With the advent of deep learning, deep neural networks have significant potential to outperform existing state-of-the-art, model-based methods for solving inverse problems. However, it is known that current data-driven approaches face several key issues, notably hallucinations, instabilities and unpredictable generalization, with potential impact in critical tasks such as medical imaging. This raises the key question of whether or not one can construct deep neural networks for inverse problems with explicit stability and accuracy guarantees. In this work, we present a novel construction of accurate, stable and efficient neural networks for inverse problems with general analysis-sparse models, termed NESTANets. To construct the network, we first unroll NESTA, an accelerated first-order method for convex optimization. The slow convergence of this method leads to deep networks with low efficiency. Therefore, to obtain shallow, and consequently more efficient, networks we combine NESTA with a novel restart scheme. We then use compressed sensing techniques to demonstrate accuracy and stability. We showcase this approach in the case of Fourier imaging, and verify its stability and performance via a series of numerical experiments. The key impact of this work is demonstrating the construction of efficient neural networks based on unrolling with guaranteed stability and accuracy.
    Masked Autoencoders As Spatiotemporal Learners. (arXiv:2205.09113v2 [cs.CV] UPDATED)
    This paper studies a conceptually simple extension of Masked Autoencoders (MAE) to spatiotemporal representation learning from videos. We randomly mask out spacetime patches in videos and learn an autoencoder to reconstruct them in pixels. Interestingly, we show that our MAE method can learn strong representations with almost no inductive bias on spacetime (except for patch and positional embeddings), and spacetime-agnostic random masking performs the best. We observe that the optimal masking ratio is as high as 90% (vs. 75% on images), supporting the hypothesis that this ratio is related to the information redundancy of the data. A high masking ratio leads to a large speedup, e.g., >4x in wall-clock time, or even more. We report competitive results on several challenging video datasets using vanilla Vision Transformers. We observe that MAE can outperform supervised pre-training by large margins. We further report encouraging results of training on real-world, uncurated Instagram data. Our study suggests that the general framework of masked autoencoding (BERT, MAE, etc.) can be a unified methodology for representation learning with minimal domain knowledge.
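    The random spacetime masking step fits in a few lines; the shapes and the 90% ratio follow the abstract, while the rest is illustrative.
    ```python
    import torch

    def random_spacetime_masking(patches, mask_ratio=0.9):
        """Spacetime-agnostic random masking: keep a random 10% of flattened
        spacetime patch tokens. `patches` has shape (batch, num_patches, dim)."""
        B, N, D = patches.shape
        n_keep = int(N * (1 - mask_ratio))
        noise = torch.rand(B, N)                   # uniform noise per token
        ids_shuffle = noise.argsort(dim=1)         # random permutation
        ids_keep = ids_shuffle[:, :n_keep]
        visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
        return visible, ids_shuffle                # encoder sees only `visible`

    tokens = torch.randn(2, 8 * 14 * 14, 768)      # e.g. 8 frames of 14x14 patches
    visible, _ = random_spacetime_masking(tokens)
    print(visible.shape)                           # (2, 156, 768): ~10% of tokens
    ```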
    Summarization as Indirect Supervision for Relation Extraction. (arXiv:2205.09837v2 [cs.CL] UPDATED)
    Relation extraction (RE) models have been challenged by their reliance on training data with expensive annotations. Considering that summarization tasks aim at acquiring concise expressions of synoptical information from the longer context, these tasks naturally align with the objective of RE, i.e., extracting a kind of synoptical information that describes the relation of entity mentions. We present SuRE, which converts RE into a summarization formulation. SuRE leads to more precise and resource-efficient RE based on indirect supervision from summarization tasks. To achieve this goal, we develop sentence and relation conversion techniques that essentially bridge the formulation of summarization and RE tasks. We also incorporate constraint decoding techniques with Trie scoring to further enhance summarization-based RE with robust inference. Experiments on three RE datasets demonstrate the effectiveness of SuRE in both full-dataset and low-resource settings, showing that summarization is a promising source of indirect supervision to improve RE models.
    Computer-Aided Cancer Diagnosis via Machine Learning and Deep Learning: A comparative review. (arXiv:2210.11943v1 [eess.IV])
    The past years have seen a considerable increase in cancer cases. However, cancer diagnosis is often complex and depends on the types of images provided for analysis. It requires highly skilled practitioners but is often time-consuming and error-prone. Although machine learning and deep learning algorithms have been widely used, a comprehensive review of the techniques involved, from the pre-processing steps to the final prediction, is lacking. With this review, we aim to provide a comprehensive overview of the steps currently required to build efficient and accurate machine learning algorithms for cancer prediction, detection and classification. To do so, we compile the results of cancer-related studies using AI over the past years. We include various cancers that involve different types of images, and therefore different related techniques. We show that tremendous improvements have been made in the early detection of cancerous tumors and tissues. The techniques used are varied and often problem-tailored, and our findings are confirmed through a study of a large number of research works. Moreover, we investigate the approaches best suited for different types of images such as histology, dermoscopic, MRI, etc. With this work, we summarize the main findings of the past years in cancer detection using deep learning techniques. We discuss the challenges of cancer research related to the large discrepancies between images, and we provide some notable results in the field for lung, breast, and skin cancers.
    A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models. (arXiv:2210.12082v1 [stat.ML])
    We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal $\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand's well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation.
    Cox-Hawkes: doubly stochastic spatiotemporal Poisson processes. (arXiv:2210.11844v1 [stat.ML])
    Hawkes processes are point process models that have been used to capture self-excitatory behavior in social interactions, neural activity, earthquakes and viral epidemics. They can model the times and locations of event occurrences. Here we develop a new class of spatiotemporal Hawkes processes that can capture both triggering and clustering behavior, and we provide an efficient method for performing inference. We use a log-Gaussian Cox process (LGCP) as prior for the background rate of the Hawkes process, which gives arbitrary flexibility to capture a wide range of underlying background effects (for infectious diseases these are called endemic effects). The Hawkes process and LGCP are computationally expensive, since the former has a likelihood with quadratic complexity in the number of observations and the latter involves inversion of the precision matrix, which is cubic in the number of observations. Here we propose a novel approach to perform MCMC sampling for our Hawkes process with LGCP background, using pre-trained Gaussian process generators which provide direct and cheap access to samples during inference. We show the efficacy and flexibility of our approach in experiments on simulated data, and use our methods to uncover the trends in a dataset of reported crimes in the US.
    Learning Graphical Factor Models with Riemannian Optimization. (arXiv:2210.11950v1 [stat.ML])
    Graphical models and factor analysis are well-established tools in multivariate statistics. While these models can be both linked to structures exhibited by covariance and precision matrices, they are generally not jointly leveraged within graph learning processes. This paper therefore addresses this issue by proposing a flexible algorithmic framework for graph learning under low-rank structural constraints on the covariance matrix. The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution (a generalization of Gaussian graphical models to possibly heavy-tailed distributions), where the covariance matrix is optionally constrained to be structured as low-rank plus diagonal (low-rank factor model). The resolution of this class of problems is then tackled with Riemannian optimization, where we leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models. Numerical experiments on real-world data sets illustrate the effectiveness of the proposed approach.
    Revisiting Checkpoint Averaging for Neural Machine Translation. (arXiv:2210.11803v1 [cs.CL])
    Checkpoint averaging is a simple and effective method to boost the performance of converged neural machine translation models. The calculation is cheap to perform, and the fact that the translation improvement comes almost for free makes it widely adopted in neural machine translation research. Despite its popularity, the method simply takes the mean of the model parameters from several checkpoints, the selection of which is mostly based on empirical recipes without much justification. In this work, we revisit the concept of checkpoint averaging and consider several extensions. Specifically, we experiment with ideas such as using different checkpoint selection strategies, calculating a weighted average instead of a simple mean, making use of gradient information, and fine-tuning the interpolation weights on development data. Our results confirm the necessity of applying checkpoint averaging for optimal performance, but also suggest that the landscape between the converged checkpoints is rather flat, with little further improvement to be obtained beyond simple averaging.
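    The baseline operation itself fits in a few lines of PyTorch; the weighted variant below is one of the extensions the paper revisits (file names are hypothetical).
    ```python
    import torch

    def average_checkpoints(paths, weights=None):
        """Element-wise (optionally weighted) mean of parameters across
        saved checkpoints."""
        weights = weights or [1.0 / len(paths)] * len(paths)
        avg = None
        for path, w in zip(paths, weights):
            state = torch.load(path, map_location="cpu")
            if avg is None:
                avg = {k: w * v.float() for k, v in state.items()}
            else:
                for k, v in state.items():
                    avg[k] += w * v.float()
        return avg

    # Hypothetical usage: average the last five checkpoints of a training run.
    # state = average_checkpoints([f"ckpt_{i}.pt" for i in range(45, 50)])
    # model.load_state_dict(state)
    ```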
    Neuro-Symbolic Causal Reasoning Meets Signaling Game for Emergent Semantic Communications. (arXiv:2210.12040v1 [cs.LG])
    Semantic communication (SC) aims to communicate reliably with minimal data transfer while simultaneously providing seamless connectivity to heterogeneous services and users. In this paper, a novel emergent SC (ESC) system framework is proposed; it is composed of a signaling game for emergent language design and a neuro-symbolic (NeSy) artificial intelligence (AI) approach for causal reasoning. In order to design the language, the signaling game is solved using an alternating maximization between the communicating nodes' utilities. The emergent language helps create a context-aware transmit vocabulary (minimal semantic representation) and aids the reasoning process (enabling generalization to unseen scenarios) by splitting complex messages into simpler reasoning tasks for the receiver. The causal description at the transmitter is then modeled (a neural component) as a posterior distribution of the relevant attributes present in the data. Using the reconstructed causal state, the receiver evaluates a set of logical formulas (symbolic part) to execute its task. The nodes' NeSy reasoning components are implemented by the recently proposed AI tool called Generative Flow Networks, and they are optimized for higher semantic reliability. The ESC system is designed to enhance novel metrics of semantic information, reliability, distortion and similarity that are designed using rigorous algebraic properties from category theory, thereby generalizing the metrics beyond Shannon's notion of uncertainty. Simulation results validate the ability of ESC to communicate efficiently (with reduced bits) and achieve better semantic reliability than conventional wireless and state-of-the-art systems that do not exploit causal reasoning capabilities.
    AutoPrognosis 2.0: Democratizing Diagnostic and Prognostic Modeling in Healthcare with Automated Machine Learning. (arXiv:2210.12090v1 [cs.LG])
    Diagnostic and prognostic models are increasingly important in medicine and inform many clinical decisions. Recently, machine learning approaches have shown improvement over conventional modeling techniques by better capturing complex interactions between patient covariates in a data-driven manner. However, the use of machine learning introduces a number of technical and practical challenges that have thus far restricted widespread adoption of such techniques in clinical settings. To address these challenges and empower healthcare professionals, we present a machine learning framework, AutoPrognosis 2.0, to develop diagnostic and prognostic models. AutoPrognosis leverages state-of-the-art advances in automated machine learning to develop optimized machine learning pipelines, incorporates model explainability tools, and enables deployment of clinical demonstrators, without requiring significant technical expertise. Our framework eliminates the major technical obstacles to predictive modeling with machine learning that currently impede clinical adoption. To demonstrate AutoPrognosis 2.0, we provide an illustrative application where we construct a prognostic risk score for diabetes using the UK Biobank, a prospective study of 502,467 individuals. The models produced by our automated framework achieve greater discrimination for diabetes than expert clinical risk scores. Our risk score has been implemented as a web-based decision support tool and can be publicly accessed by patients and clinicians worldwide. In addition, AutoPrognosis 2.0 is provided as an open-source Python package. By open-sourcing our framework as a tool for the community, clinicians and other medical practitioners will be able to readily develop new risk scores, personalized diagnostics, and prognostics using modern machine learning techniques.
    A Survey on Graph Counterfactual Explanations: Definitions, Methods, Evaluation. (arXiv:2210.12089v1 [cs.LG])
    In recent years, Graph Neural Networks have reported outstanding performance in tasks like community detection, molecule classification and link prediction. However, the black-box nature of these models prevents their application in domains like health and finance, where understanding the models' decisions is essential. Counterfactual Explanations (CE) provide these understandings through examples. Moreover, the literature on CE is flourishing with novel explanation methods tailored to graph learning. In this survey, we analyse the existing Graph Counterfactual Explanation methods by providing the reader with an organisation of the literature according to a uniform formal notation for definitions, datasets, and metrics, thus simplifying comparisons of the methods' advantages and disadvantages. We discuss seven methods and sixteen synthetic and real datasets, providing details on the possible generation strategies. We highlight the most common evaluation strategies and formalise nine of the metrics used in the literature. We also introduce the evaluation framework GRETEL and show how it can be extended and used while providing a further dimension of comparison encompassing reproducibility aspects. Finally, we discuss how counterfactual explanation interplays with privacy and fairness, before delving into open challenges and future works.
    Geometric Sparse Coding in Wasserstein Space. (arXiv:2210.12135v1 [cs.LG])
    Wasserstein dictionary learning is an unsupervised approach to learning a collection of probability distributions that generate observed distributions as Wasserstein barycentric combinations. Existing methods for Wasserstein dictionary learning optimize an objective that seeks a dictionary with sufficient representation capacity via barycentric interpolation to approximate the observed training data, but without imposing additional structural properties on the coefficients associated to the dictionary. This leads to dictionaries that densely represent the observed data, which makes interpretation of the coefficients challenging and may also lead to poor empirical performance when using the learned coefficients in downstream tasks. In contrast and motivated by sparse dictionary learning in Euclidean spaces, we propose a geometrically sparse regularizer for Wasserstein space that promotes representations of a data point using only nearby dictionary elements. We show this approach leads to sparse representations in Wasserstein space and addresses the problem of non-uniqueness of barycentric representation. Moreover, when data is generated as Wasserstein barycenters of fixed distributions, this regularizer facilitates the recovery of the generating distributions in cases that are ill-posed for unregularized Wasserstein dictionary learning. Through experimentation on synthetic and real data, we show that our geometrically regularized approach yields sparser and more interpretable dictionaries in Wasserstein space, which perform better in downstream applications.
    Reaching Through Latent Space: From Joint Statistics to Path Planning in Manipulation. (arXiv:2210.11779v1 [cs.RO])
    We present a novel approach to path planning for robotic manipulators, in which paths are produced via iterative optimisation in the latent space of a generative model of robot poses. Constraints are incorporated through the use of constraint satisfaction classifiers operating on the same space. Optimisation leverages gradients through our learned models that provide a simple way to combine goal reaching objectives with constraint satisfaction, even in the presence of otherwise non-differentiable constraints. Our models are trained in a task-agnostic manner on randomly sampled robot poses. In baseline comparisons against a number of widely used planners, we achieve commensurate performance in terms of task success, planning time and path length, performing successful path planning with obstacle avoidance on a real 7-DoF robot arm.  ( 2 min )
    Efficiently Tuned Parameters are Task Embeddings. (arXiv:2210.11705v1 [cs.CL])
    Intermediate-task transfer can benefit a wide range of NLP tasks with properly selected source datasets. However, it is computationally infeasible to experiment with all intermediate transfer combinations, making the choice of a useful source task a challenging problem. In this paper, we anticipate that task-specific parameters updated in parameter-efficient tuning methods are likely to encode task-specific information, and can therefore be predictive of inter-task transferability. We thus propose to exploit these efficiently tuned parameters as off-the-shelf task embeddings for the efficient selection of source datasets for intermediate-task transfer. We experiment with 11 text classification tasks and 11 question answering tasks. Experimental results show that our approach consistently outperforms existing inter-task transferability prediction methods while being conceptually simple and computationally efficient. Our analysis also reveals that the ability of efficiently tuned parameters to predict transferability is disentangled from their in-task performance. This allows us to use parameters from early checkpoints as task embeddings to further improve efficiency.  ( 2 min )
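    A minimal sketch of the proposed usage: flatten each task's efficiently tuned parameters into a vector and rank candidate source tasks by cosine similarity. Parameter names and structure here are illustrative assumptions.
    ```python
    import torch
    import torch.nn.functional as F

    def task_embedding(tuned_params):
        """Flatten a task's efficiently tuned parameters (e.g. adapter or
        prefix weights) into one vector -- the off-the-shelf task embedding."""
        return torch.cat([p.detach().flatten() for p in tuned_params.values()])

    def rank_source_tasks(target, sources):
        """Rank candidate source tasks by embedding similarity to the target."""
        t = task_embedding(target)
        sims = {name: F.cosine_similarity(t, task_embedding(p), dim=0).item()
                for name, p in sources.items()}
        return sorted(sims.items(), key=lambda kv: -kv[1])

    # Toy check: two tasks sharing tuned weights rank closer than a random one.
    a = {"adapter.weight": torch.randn(32, 8)}
    b = {"adapter.weight": a["adapter.weight"] + 0.01 * torch.randn(32, 8)}
    c = {"adapter.weight": torch.randn(32, 8)}
    print(rank_source_tasks(a, {"near": b, "far": c}))  # "near" ranks first
    ```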
    Twin Contrastive Learning for Online Clustering. (arXiv:2210.11680v1 [cs.LG])
    This paper proposes to perform online clustering by conducting twin contrastive learning (TCL) at the instance and cluster level. Specifically, we find that when the data is projected into a feature space with a dimensionality of the target cluster number, the rows and columns of its feature matrix correspond to the instance and cluster representation, respectively. Based on the observation, for a given dataset, the proposed TCL first constructs positive and negative pairs through data augmentations. Thereafter, in the row and column space of the feature matrix, instance- and cluster-level contrastive learning are respectively conducted by pulling together positive pairs while pushing apart the negatives. To alleviate the influence of intrinsic false-negative pairs and rectify cluster assignments, we adopt a confidence-based criterion to select pseudo-labels for boosting both the instance- and cluster-level contrastive learning. As a result, the clustering performance is further improved. Besides the elegant idea of twin contrastive learning, another advantage of TCL is that it could independently predict the cluster assignment for each instance, thus effortlessly fitting online scenarios. Extensive experiments on six widely-used image and text benchmarks demonstrate the effectiveness of TCL. The code will be released on GitHub.  ( 2 min )
    MnEdgeNet -- Accurate Decomposition of Mixed Oxidation States for Mn XAS and EELS L2,3 Edges without Reference and Calibration. (arXiv:2210.11657v1 [cond-mat.mtrl-sci])
    Accurate decomposition of the mixed Mn oxidation states is highly important for characterizing the electronic structures, charge transfer, and redox centers of electronic, electrocatalytic, and energy storage materials that contain Mn. Electron energy loss spectroscopy (EELS) and soft X-ray absorption spectroscopy (XAS) measurements of the Mn L2,3 edges are widely used for this purpose. To date, although the measurement of the Mn L2,3 edges is straightforward provided the sample is prepared properly, an accurate decomposition of the mixed valence states of Mn remains non-trivial. For both EELS and XAS, 2+, 3+, and 4+ reference spectra need to be taken on the same instrument/beamline and preferably in the same experimental session, because the instrumental resolution and the energy axis offset can vary from one session to another. To circumvent this hurdle, in this study, we adopted a deep learning approach and developed a calibration-free and reference-free method to decompose the oxidation state of Mn L2,3 edges for both EELS and XAS. To synthesize physics-informed and ground-truth labeled training datasets, we created a forward model that takes into account plural scattering, instrumentation broadening, noise, and energy axis offset. With that, we created a 1.2 million-spectrum database with a three-element oxidation state composition label. The library includes a sufficient variety of data, covering both EELS and XAS spectra. By training on this large database, our convolutional neural network achieves 85% accuracy on the validation dataset. We tested the model and found it is robust against noise (down to a PSNR of 10) and plural scattering (up to $t/\lambda = 1$). We further validated the model against spectral data that were not used in training.  ( 3 min )
    Horizon-Free Reinforcement Learning for Latent Markov Decision Processes. (arXiv:2210.11604v1 [cs.LG])
    We study regret minimization for reinforcement learning (RL) in Latent Markov Decision Processes (LMDPs) with context in hindsight. We design a novel model-based algorithmic framework which can be instantiated with both a model-optimistic and a value-optimistic solver. We prove an $\widetilde{O}\left(\sqrt{M \Gamma S A K}\right)$ regret bound where $M$ is the number of contexts, $S$ is the number of states, $A$ is the number of actions, $K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition degree of any state-action pair. The regret bound only scales logarithmically with the planning horizon, thus yielding the first (nearly) horizon-free regret bound for LMDPs. Key in our proof is an analysis of the total variance of alpha vectors, which is carefully bounded by a recursion-based technique. We complement our positive result with a novel $\Omega\left(\sqrt{M S A K}\right)$ regret lower bound with $\Gamma = 2$, which shows that our upper bound is minimax optimal when $\Gamma$ is a constant. Our lower bound relies on new constructions of hard instances and an argument based on the symmetrization technique from theoretical computer science, both of which are technically different from existing lower bound proofs for MDPs, and thus can be of independent interest.  ( 2 min )
    Online and Lightweight Kernel-Based Approximated Policy Iteration for Dynamic p-Norm Linear Adaptive Filtering. (arXiv:2210.11755v1 [cs.LG])
    This paper introduces a solution to the problem of dynamically (online) selecting the "optimal" p-norm to combat outliers in linear adaptive filtering without any knowledge of the probability density function of the outliers. The proposed online and data-driven framework is built on kernel-based reinforcement learning (KBRL). To this end, novel Bellman mappings on reproducing kernel Hilbert spaces (RKHSs) are introduced. These mappings do not require any knowledge of the transition probabilities of Markov decision processes and are nonexpansive with respect to the underlying Hilbertian norm. The fixed-point sets of the proposed Bellman mappings are utilized to build an approximate policy-iteration (API) framework for the problem at hand. To address the "curse of dimensionality" in RKHSs, random Fourier features are utilized to bound the computational complexity of the API. Numerical tests on synthetic data for several outlier scenarios demonstrate the superior performance of the proposed API framework over several non-RL and KBRL schemes.  ( 2 min )
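    The random Fourier feature approximation used here to bound complexity is the standard Rahimi-Recht construction; a self-contained sketch for the Gaussian kernel:
    ```python
    import numpy as np

    def rff_features(X, n_features=200, gamma=1.0, seed=0):
        """Random Fourier features approximating the Gaussian kernel
        k(x, y) = exp(-gamma ||x - y||^2)."""
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, n_features))
        b = rng.uniform(0, 2 * np.pi, n_features)
        return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

    # Sanity check: inner products of features approximate the kernel.
    X = np.random.default_rng(1).normal(size=(5, 3))
    Z = rff_features(X, n_features=5000)
    approx = Z @ Z.T
    exact = np.exp(-np.linalg.norm(X[:, None] - X[None], axis=-1) ** 2)
    print(np.abs(approx - exact).max())  # small (Monte Carlo error)
    ```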
    Bayesian deep learning framework for uncertainty quantification in high dimensions. (arXiv:2210.11737v1 [stat.ML])
    We develop a novel deep learning method for uncertainty quantification in stochastic partial differential equations based on Bayesian neural networks (BNNs) and Hamiltonian Monte Carlo (HMC). A BNN efficiently learns the posterior distribution of the parameters in deep neural networks by performing Bayesian inference on the network parameters. The posterior distribution is efficiently sampled using HMC to quantify uncertainties in the system. Several numerical examples are shown for both forward and inverse problems in high dimensions to demonstrate the effectiveness of the proposed method for uncertainty quantification. The results also show, promisingly, that the computational cost is almost independent of the problem dimension, demonstrating the method's potential for tackling the so-called curse of dimensionality.  ( 2 min )
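    For intuition, a minimal leapfrog HMC transition on a toy Gaussian target is sketched below; the paper applies HMC to BNN weights, which this sketch does not reproduce.
    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def hmc_step(x, logp, logp_grad, eps=0.1, n_leap=20):
        """One HMC transition with leapfrog integration."""
        p = rng.standard_normal(x.shape)              # resample momentum
        x_new = x.copy()
        p_new = p + 0.5 * eps * logp_grad(x_new)      # first half step
        for _ in range(n_leap):
            x_new = x_new + eps * p_new
            p_new = p_new + eps * logp_grad(x_new)
        p_new = p_new - 0.5 * eps * logp_grad(x_new)  # correct the extra half step
        h_old = -logp(x) + 0.5 * p @ p                # Metropolis accept/reject
        h_new = -logp(x_new) + 0.5 * p_new @ p_new
        return x_new if np.log(rng.uniform()) < h_old - h_new else x

    logp = lambda x: -0.5 * x @ x                     # standard normal target
    logp_grad = lambda x: -x
    x, samples = np.zeros(10), []
    for _ in range(1000):
        x = hmc_step(x, logp, logp_grad)
        samples.append(x.copy())
    print(np.std(samples))                            # ~1 for this target
    ```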
    Local Bayesian optimization via maximizing probability of descent. (arXiv:2210.11662v1 [cs.LG])
    Local optimization presents a promising approach to expensive, high-dimensional black-box optimization by sidestepping the need to globally explore the search space. For objective functions whose gradient cannot be evaluated directly, Bayesian optimization offers one solution -- we construct a probabilistic model of the objective, design a policy to learn about the gradient at the current location, and use the resulting information to navigate the objective landscape. Previous work has realized this scheme by minimizing the variance in the estimate of the gradient, then moving in the direction of the expected gradient. In this paper, we re-examine and refine this approach. We demonstrate that, surprisingly, the expected value of the gradient is not always the direction maximizing the probability of descent, and in fact, these directions may be nearly orthogonal. This observation then inspires an elegant optimization scheme seeking to maximize the probability of descent while moving in the direction of most-probable descent. Experiments on both synthetic and real-world objectives show that our method outperforms previous realizations of this optimization scheme and is competitive against other, significantly more complicated baselines.  ( 2 min )
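    The abstract's key observation is easy to reproduce numerically: for a gradient posterior $g \sim \mathcal{N}(\mu, \Sigma)$, the probability that a unit direction $v$ is a descent direction is $\Phi(-\mu^\top v / \sqrt{v^\top \Sigma v})$, which is maximized by $v \propto -\Sigma^{-1}\mu$ rather than by the negative expected gradient $-\mu$.
    ```python
    import numpy as np
    from scipy.stats import norm

    def prob_descent(v, mu, Sigma):
        """P(g.v < 0) for g ~ N(mu, Sigma) and direction v."""
        v = v / np.linalg.norm(v)
        return norm.cdf(-mu @ v / np.sqrt(v @ Sigma @ v))

    mu = np.array([1.0, 0.2])
    Sigma = np.array([[0.1, 0.0], [0.0, 4.0]])    # very uncertain second coordinate

    v_expected = -mu                               # move along the expected gradient
    v_most_probable = -np.linalg.solve(Sigma, mu)  # direction maximizing P(descent)

    print(prob_descent(v_expected, mu, Sigma))       # lower
    print(prob_descent(v_most_probable, mu, Sigma))  # higher
    ```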
    Multi-View Reasoning: Consistent Contrastive Learning for Math Word Problem. (arXiv:2210.11694v1 [cs.CL])
    Math word problem solving requires both precise relational reasoning about quantities in the text and reliable generation of diverse equations. Current sequence-to-tree or relation extraction methods regard this from only a fixed view, struggling to simultaneously handle complex semantics and diverse equations. However, human solving naturally involves two consistent reasoning views: top-down and bottom-up, just as math equations can also be expressed in multiple equivalent forms: pre-order and post-order. We propose multi-view consistent contrastive learning for a more complete semantics-to-equation mapping. The entire process is decoupled into two independent but consistent views: top-down decomposition and bottom-up construction, and the two reasoning views are aligned at multiple granularities for consistency, enhancing global generation and precise reasoning. Experiments on multiple datasets across two languages show our approach significantly outperforms the existing baselines, especially on complex problems. We also show that, after consistent alignment, the multi-view model can absorb the merits of both views and generate more diverse results consistent with mathematical laws.  ( 2 min )
    SMaLL-100: Introducing Shallow Multilingual Machine Translation Model for Low-Resource Languages. (arXiv:2210.11621v1 [cs.CL])
    In recent years, multilingual machine translation models have achieved promising performance on low-resource language pairs by sharing information between similar languages, thus enabling zero-shot translation. To overcome the "curse of multilinguality", these models often opt for scaling up the number of parameters, which makes their use in resource-constrained environments challenging. We introduce SMaLL-100, a distilled version of the M2M-100 (12B) model, a massively multilingual machine translation model covering 100 languages. We train SMaLL-100 with uniform sampling across all language pairs and therefore focus on preserving the performance of low-resource languages. We evaluate SMaLL-100 on different low-resource benchmarks: FLORES-101, Tatoeba, and TICO-19 and demonstrate that it outperforms previous massively multilingual models of comparable sizes (200-600M) while improving inference latency and memory usage. Additionally, our model achieves comparable results to M2M-100 (1.2B), while being 3.6x smaller and 4.3x faster at inference. Code and pre-trained models: https://github.com/alirezamshi/small100  ( 2 min )
    CRT-6D: Fast 6D Object Pose Estimation with Cascaded Refinement Transformers. (arXiv:2210.11718v1 [cs.CV])
    Learning-based 6D object pose estimation methods rely on computing large intermediate pose representations and/or iteratively refining an initial estimate with a slow render-and-compare pipeline. This paper introduces a novel method we call Cascaded Pose Refinement Transformers, or CRT-6D. We replace the commonly used dense intermediate representation with a sparse set of features sampled from the feature pyramid, which we call OSKFs (Object Surface Keypoint Features), where each element corresponds to an object keypoint. We employ lightweight deformable transformers and chain them together to iteratively refine proposed poses over the sampled OSKFs. We achieve inference runtimes 2x faster than the closest real-time state-of-the-art methods while supporting up to 21 objects in a single model. We demonstrate the effectiveness of CRT-6D by performing extensive experiments on the LM-O and YCB-V datasets. Compared to real-time methods, we achieve state-of-the-art results on LM-O and YCB-V, falling only slightly behind methods with inference runtimes one order of magnitude higher. The source code is available at: https://github.com/PedroCastro/CRT-6D  ( 2 min )
    HesScale: Scalable Computation of Hessian Diagonals. (arXiv:2210.11639v1 [cs.LG])
    Second-order optimization uses curvature information about the objective function, which can help achieve faster convergence. However, such methods typically require expensive computation of the Hessian matrix, preventing their use at scale. The absence of efficient ways to compute curvature has driven the most widely used methods to rely on first-order approximations that do not capture it. In this paper, we develop HesScale, a scalable approach to approximating the diagonal of the Hessian matrix, to incorporate second-order information in a computationally efficient manner. We show that HesScale has the same computational complexity as backpropagation. Our results on supervised classification show that HesScale achieves high approximation accuracy, allowing for scalable and efficient second-order optimization.  ( 2 min )
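    The abstract does not spell out HesScale's update rule, so the sketch below instead computes the exact Hessian diagonal by double backprop on a tiny objective, i.e. the kind of (non-scalable) reference such approximations are checked against: it costs one backward pass per parameter, which is precisely what HesScale avoids.
    ```python
    import torch

    def hessian_diagonal(loss_fn, params):
        """Exact Hessian diagonal via double backprop (O(n) backward passes)."""
        loss = loss_fn(params)
        (grad,) = torch.autograd.grad(loss, params, create_graph=True)
        diag = torch.zeros_like(params)
        for i in range(params.numel()):
            (g2,) = torch.autograd.grad(grad.flatten()[i], params, retain_graph=True)
            diag.view(-1)[i] = g2.flatten()[i]
        return diag

    w = torch.randn(5, requires_grad=True)
    loss_fn = lambda p: (p ** 4).sum()   # Hessian diagonal is 12 p^2
    print(hessian_diagonal(loss_fn, w))
    print(12 * w.detach() ** 2)          # matches
    ```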
    Multitasking Models are Robust to Structural Failure: A Neural Model for Bilingual Cognitive Reserve. (arXiv:2210.11618v1 [cs.LG])
    We find a surprising connection between multitask learning and robustness to neuron failures. Our experiments show that bilingual language models retain higher performance under various neuron perturbations, such as random deletions, magnitude pruning and weight noise compared to equivalent monolingual ones. We provide a theoretical justification for this robustness by mathematically analyzing linear representation learning and showing that multitasking creates more robust representations. Our analysis connects robustness to spectral properties of the learned representation and proves that multitasking leads to higher robustness for diverse task vectors. We open-source our code and models: https://github.com/giannisdaras/multilingual_robustness  ( 2 min )
    Generalized Reciprocal Perspective. (arXiv:2210.11616v1 [cs.LG])
    Across many domains, real-world problems can be represented as a network. Nodes represent domain-specific elements and edges capture the relationship between elements. Leveraging high-performance computing and optimized link prediction algorithms, it is increasingly possible to evaluate every possible combination of nodal pairs, enabling the generation of a comprehensive prediction matrix (CPM) that places an individual link prediction score in the context of all possible links involving either node (providing data-driven context). Historically, this contextual information has been ignored given exponentially growing problem sizes and the resulting computational intractability; however, we demonstrate that expending high-performance compute resources to generate CPMs is a worthwhile investment given the improvement in predictive performance. In this work, we generalize for all pairwise link-prediction tasks our novel semi-supervised machine learning method, denoted Reciprocal Perspective (RP). We demonstrate that RP significantly improves link prediction accuracy by leveraging the wealth of information in a CPM. Context-based features are extracted from the CPM for use in a stacked classifier, and we demonstrate that the application of RP in a cascade almost always results in significantly (p < 0.05) improved predictions. These results on RS-type problems suggest that RP is applicable to a broad range of link prediction problems.
    Fuzzy Granular-Ball Computing Framework and Its Implementation in SVM. (arXiv:2210.11675v1 [cs.LG])
    Most existing fuzzy computing methods use points as input, which is the finest granularity from the perspective of granular computing. Consequently, these classifiers are neither efficient nor robust to label noise. Therefore, we propose a framework for a fuzzy granular-ball computational classifier by introducing granular-ball computing into fuzzy sets. The computational framework is based on granular-ball inputs rather than points; therefore, it is more efficient and robust than traditional fuzzy methods. Furthermore, the framework is extended to the fuzzy support vector machine (FSVM), and the granular-ball fuzzy SVM (GBFSVM) is derived. The experimental results demonstrate the effectiveness and efficiency of GBFSVM.  ( 2 min )
    Boosting Natural Language Generation from Instructions with Meta-Learning. (arXiv:2210.11617v1 [cs.CL])
    Recent work has shown that language models (LMs) trained with multi-task \textit{instructional learning} (MTIL) can solve diverse NLP tasks in zero- and few-shot settings with improved performance compared to prompt tuning. MTIL illustrates that LMs can extract and use information about the task from instructions beyond the surface patterns of the inputs and outputs. This suggests that meta-learning may further enhance the utilization of instructions for effective task transfer. In this paper we investigate whether meta-learning applied to MTIL can further improve generalization to unseen tasks in a zero-shot setting. Specifically, we propose to adapt meta-learning to MTIL in three directions: 1) Model Agnostic Meta Learning (MAML), 2) Hyper-Network (HNet) based adaptation to generate task specific parameters conditioned on instructions, and 3) an approach combining HNet and MAML. Through extensive experiments on the large scale Natural Instructions V2 dataset, we show that our proposed approaches significantly improve over strong baselines in zero-shot settings. In particular, meta-learning improves the effectiveness of instructions and is most impactful when the test tasks are strictly zero-shot (i.e. no similar tasks in the training set) and are "hard" for LMs, illustrating the potential of meta-learning for MTIL for out-of-distribution tasks.  ( 2 min )
    Learning Robust Dynamics through Variational Sparse Gating. (arXiv:2210.11698v1 [cs.LG])
    Learning world models from their sensory inputs enables agents to plan for actions by imagining their future outcomes. World models have previously been shown to improve sample-efficiency in simulated environments with few objects, but have not yet been applied successfully to environments with many objects. In environments with many objects, often only a small number of them are moving or interacting at the same time. In this paper, we investigate integrating this inductive bias of sparse interactions into the latent dynamics of world models trained from pixels. First, we introduce Variational Sparse Gating (VSG), a latent dynamics model that updates its feature dimensions sparsely through stochastic binary gates. Moreover, we propose a simplified architecture Simple Variational Sparse Gating (SVSG) that removes the deterministic pathway of previous models, resulting in a fully stochastic transition function that leverages the VSG mechanism. We evaluate the two model architectures in the BringBackShapes (BBS) environment that features a large number of moving objects and partial observability, demonstrating clear improvements over prior models.
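    Our reading of the sparse gating mechanism, as a hedged sketch: per-feature binary gates are sampled and only gated dimensions of the latent state are updated, with a straight-through estimator so gradients flow through the sampling step. Details may differ from the paper's VSG.
    ```python
    import torch

    def variational_sparse_gate(h_old, candidate, gate_logits):
        """Sample binary gates per feature; update only the gated dimensions."""
        p = torch.sigmoid(gate_logits)
        hard = torch.bernoulli(p)
        gate = hard + p - p.detach()       # straight-through: forward hard, backward soft
        return gate * candidate + (1.0 - gate) * h_old

    h = torch.zeros(4, 32)
    cand = torch.randn(4, 32)
    logits = torch.full((4, 32), -2.0)     # low prior -> most features stay unchanged
    h_next = variational_sparse_gate(h, cand, logits)
    print((h_next != 0).float().mean())    # roughly sigmoid(-2) ~ 0.12 of features updated
    ```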
    Sparse Dynamical Features generation, application to Parkinson's Disease diagnosis. (arXiv:2210.11624v1 [eess.SY])
    In this study we focus on the diagnosis of Parkinson's Disease (PD) based on electroencephalogram (EEG) signals. We propose a new approach inspired by the functioning of the brain that uses the dynamics, frequency and temporal content of EEGs to extract new demarcating features of the disease. The method was evaluated on a publicly available dataset containing EEG signals recorded during a 3-oddball auditory task involving $N = 50$ subjects, of whom 25 suffer from PD. By extracting two features, and separating them with a straight line using a Linear Discriminant Analysis (LDA) classifier, we can separate the healthy from the unhealthy subjects with an accuracy of 90% ($p < 1.8\times 10^{-5}$) using a single channel. By aggregating the information from three channels and making them vote, we obtain an accuracy of 94%, a sensitivity of 96% and a specificity of 92%. The evaluation was carried out using a nested leave-one-out cross-validation procedure, thus preventing data leakage problems and giving a less biased evaluation. Several tests were carried out to assess the validity and robustness of our approach, including the test where we use only half the available data for training. Under this constraint, the model achieves an accuracy of 89.4%.  ( 2 min )
    Random Actions vs Random Policies: Bootstrapping Model-Based Direct Policy Search. (arXiv:2210.11801v1 [cs.LG])
    This paper studies the impact of the initial data gathering method on the subsequent learning of a dynamics model. Dynamics models approximate the true transition function of a given task, in order to perform policy search directly on the model rather than on the costly real system. This study aims to determine how to bootstrap a model as efficiently as possible, by comparing initialization methods employed in two different policy search frameworks in the literature. The study focuses on model performance under the episode-based framework of evolutionary methods using probabilistic ensembles. Experimental results show that various task-dependent factors can be detrimental to each method, suggesting that hybrid approaches should be explored.
    Efficient identification of informative features in simulation-based inference. (arXiv:2210.11915v1 [cs.LG])
    Simulation-based Bayesian inference (SBI) can be used to estimate the parameters of complex mechanistic models given observed model outputs without requiring access to explicit likelihood evaluations. A prime example for the application of SBI in neuroscience involves estimating the parameters governing the response dynamics of Hodgkin-Huxley (HH) models from electrophysiological measurements, by inferring a posterior over the parameters that is consistent with a set of observations. To this end, many SBI methods employ a set of summary statistics or scientifically interpretable features to estimate a surrogate likelihood or posterior. However, currently, there is no way to identify how much each summary statistic or feature contributes to reducing posterior uncertainty. To address this challenge, one could simply compare the posteriors with and without a given feature included in the inference process. However, for large or nested feature sets, this would necessitate repeatedly estimating the posterior, which is computationally expensive or even prohibitive. Here, we provide a more efficient approach based on the SBI method neural likelihood estimation (NLE): We show that one can marginalize the trained surrogate likelihood post-hoc before inferring the posterior to assess the contribution of a feature. We demonstrate the usefulness of our method by identifying the most important features for inferring parameters of an example HH neuron model. Beyond neuroscience, our method is generally applicable to SBI workflows that rely on data features for inference used in other scientific fields.
    Blind Polynomial Regression. (arXiv:2210.11874v1 [eess.SP])
    Fitting a polynomial to observed data is a ubiquitous task in signal processing and machine learning, arising in problems such as interpolation and prediction. In that context, input and output pairs are available and the goal is to find the coefficients of the polynomial. However, in many applications the input may be partially known or not known at all, rendering conventional regression approaches inapplicable. In this paper, we formally state the (potentially partial) blind regression problem, illustrate some of its theoretical properties, and propose algorithmic approaches to solve it. As a case study, we apply our methods to a jitter-correction problem and corroborate their performance.
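    The paper's own algorithms are not reproduced here, but a minimal alternating-minimization sketch conveys the blind setting: coefficients are fit given the current input estimates, then the inputs are re-estimated on a grid given the coefficients (the problem is only identifiable up to ambiguities such as permuting the inputs).

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    true_c = np.array([0.5, -1.0, 2.0])                 # true quadratic coefficients
    x_true = np.sort(rng.uniform(-1, 1, 40))
    y = np.polyval(true_c, x_true) + 0.01 * rng.normal(size=40)

    x_est = rng.uniform(-1, 1, 40)                      # blind: start from random inputs
    grid = np.linspace(-1, 1, 2001)
    for _ in range(50):
        c = np.polyfit(x_est, y, deg=2)                 # step 1: coefficients given inputs
        p = np.polyval(c, grid)
        # step 2: re-assign each observation to the grid input best explaining it
        x_est = grid[np.abs(p[None, :] - y[:, None]).argmin(axis=1)]
    print("residual:", np.mean((np.polyval(c, x_est) - y) ** 2))
    ```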
    Is Encoder-Decoder Redundant for Neural Machine Translation? (arXiv:2210.11807v1 [cs.CL])
    Encoder-decoder architecture is widely adopted for sequence-to-sequence modeling tasks. For machine translation, despite the evolution from long short-term memory networks to Transformer networks, plus the introduction and development of the attention mechanism, encoder-decoder is still the de facto neural network architecture for state-of-the-art models. While the motivation for decoding information from some hidden space is straightforward, the strict separation of the encoding and decoding steps into an encoder and a decoder in the model architecture is not strictly necessary. Compared to the task of autoregressive language modeling in the target language, machine translation simply has an additional source sentence as context. Given that neural language models can already handle rather long contexts in the target language, it is natural to ask whether simply concatenating the source and target sentences and training a language model to do translation would work. In this work, we investigate this concept for machine translation. Specifically, we experiment with bilingual translation, translation with additional target monolingual data, and multilingual translation. In all cases, this alternative approach performs on par with the baseline encoder-decoder Transformer, suggesting that an encoder-decoder architecture may be redundant for neural machine translation.
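    A minimal sketch of the data formatting this concatenation setup implies, with placeholder token ids and a hypothetical separator id; the loss is masked on the source side so the language model is only trained to continue with the translation.

    ```python
    def make_lm_example(src_ids, tgt_ids, sep_id, ignore_index=-100):
        """Build an (input, label) pair for a decoder-only translation LM."""
        input_ids = src_ids + [sep_id] + tgt_ids
        # no loss on source tokens or the separator: the model only learns
        # to predict the target continuation
        labels = [ignore_index] * (len(src_ids) + 1) + tgt_ids
        return input_ids, labels

    inp, lab = make_lm_example([5, 8, 13], [7, 2, 9, 4], sep_id=1)
    print(inp)  # [5, 8, 13, 1, 7, 2, 9, 4]
    print(lab)  # [-100, -100, -100, -100, 7, 2, 9, 4]
    ```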
    Graphically Structured Diffusion Models. (arXiv:2210.11633v1 [cs.LG])
    We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem domains that are more traditionally solved by algorithms such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This problem specification should contain a graphical model describing relationships between variables, and often benefits from explicit representation of subcomputations. Permutation invariances can also be exploited. Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy.  ( 2 min )
    Stochastic Adaptive Activation Function. (arXiv:2210.11672v1 [cs.LG])
    The simulation of human neurons and neurotransmission mechanisms has been realized in deep neural networks based on theoretical implementations of activation functions. However, recent studies have reported that the threshold potential of neurons exhibits different values according to the locations and types of individual neurons, and that activation functions have limitations in representing this variability. Therefore, this study proposes a simple yet effective activation function that facilitates different thresholds and adaptive activations according to the positions of units and the contexts of inputs. Furthermore, the proposed activation function is mathematically a more generalized form of the Swish activation function, and we thus denote it Adaptive SwisH (ASH). ASH highlights informative features whose values lie in the top percentiles of an input, whereas it rectifies low values. Most importantly, ASH is trainable, adaptive, and context-aware compared to other activation functions. It also generalizes the formulas of previously studied activation functions and provides a reasonable mathematical grounding for their superior performance. To validate the effectiveness and robustness of ASH, we implemented it in many deep learning models for various tasks, including classification, detection, segmentation, and image generation. Experimental analysis demonstrates that our activation function can provide the benefits of more accurate prediction and earlier convergence in many deep learning applications.  ( 2 min )
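    As a hedged illustration (not the paper's exact formulation), a percentile-thresholded, Swish-style gate captures the stated behavior of passing top-percentile values while rectifying low ones; ASH itself additionally makes the relevant parameters trainable.

    ```python
    import torch

    def percentile_swish(x: torch.Tensor, top_pct: float = 10.0) -> torch.Tensor:
        # threshold at the (100 - top_pct)-th percentile of the current input
        t = torch.quantile(x.flatten().float(), 1.0 - top_pct / 100.0)
        return x * torch.sigmoid(x - t)   # pass large values, rectify small ones

    x = torch.randn(4, 16)
    print(percentile_swish(x).shape)      # torch.Size([4, 16])
    ```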
    Rethinking Learning Approaches for Long-Term Action Anticipation. (arXiv:2210.11566v1 [cs.CV])
    Action anticipation involves predicting future actions having observed the initial portion of a video. Typically, the observed video is processed as a whole to obtain a video-level representation of the ongoing activity in the video, which is then used for future prediction. We introduce ANTICIPATR which performs long-term action anticipation leveraging segment-level representations learned using individual segments from different activities, in addition to a video-level representation. We propose a two-stage learning approach to train a novel transformer-based model that uses these two types of representations to directly predict a set of future action instances over any given anticipation duration. Results on Breakfast, 50Salads, Epic-Kitchens-55, and EGTEA Gaze+ datasets demonstrate the effectiveness of our approach.  ( 2 min )
    XC: Exploring Quantitative Use Cases for Explanations in 3D Object Detection. (arXiv:2210.11590v1 [cs.CV])
    Explainable AI (XAI) methods are frequently applied to obtain qualitative insights about deep models' predictions. However, such insights need to be interpreted by a human observer to be useful. In this paper, we aim to use explanations directly to make decisions without human observers. We adopt two gradient-based explanation methods, Integrated Gradients (IG) and backprop, for the task of 3D object detection. Then, we propose a set of quantitative measures, named Explanation Concentration (XC) scores, that can be used for downstream tasks. These scores quantify the concentration of attributions within the boundaries of detected objects. We evaluate the effectiveness of XC scores via the task of distinguishing true-positive (TP) and false-positive (FP) detected objects in the KITTI and Waymo datasets. The results demonstrate an improvement of more than 100\% on both datasets compared to other heuristics, such as random guesses and the number of LiDAR points in the bounding box, raising confidence in XC's potential for application in more use cases. Our results also indicate that computationally expensive XAI methods like IG may not be more valuable when used quantitatively compared to simpler methods.  ( 2 min )
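    A minimal sketch of a concentration score in the spirit of XC, assuming a 2D attribution map and an axis-aligned box; the paper's exact score and its handling of 3D detections may differ.

    ```python
    import numpy as np

    def xc_score(attr: np.ndarray, box: tuple) -> float:
        """attr: 2D map of per-pixel attribution values; box: (x0, y0, x1, y1)."""
        x0, y0, x1, y1 = box
        inside = np.abs(attr[y0:y1, x0:x1]).sum()
        total = np.abs(attr).sum() + 1e-12
        return float(inside / total)   # near 1 => attributions concentrate in the box

    attr = np.random.rand(64, 64)
    print(xc_score(attr, (16, 16, 48, 48)))
    ```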
    Model-based Lifelong Reinforcement Learning with Bayesian Exploration. (arXiv:2210.11579v1 [cs.LG])
    We propose a model-based lifelong reinforcement-learning approach that estimates a hierarchical Bayesian posterior distilling the common structure shared across different tasks. The learned posterior combined with a sample-based Bayesian exploration procedure increases the sample efficiency of learning across a family of related tasks. We first derive an analysis of the relationship between the sample complexity and the initialization quality of the posterior in the finite MDP setting. We next scale the approach to continuous-state domains by introducing a Variational Bayesian Lifelong Reinforcement Learning algorithm that can be combined with recent model-based deep RL methods, and that exhibits backward transfer. Experimental results on several challenging domains show that our algorithms achieve both better forward and backward transfer performance than state-of-the-art lifelong RL methods.  ( 2 min )
    DNN-ForwardTesting: A New Trading Strategy Validation using Statistical Timeseries Analysis and Deep Neural Networks. (arXiv:2210.11532v1 [q-fin.TR])
    In general, traders test their trading strategies by applying them to historical market data (backtesting), and then apply to future trades the strategy that achieved the maximum profit on such past data. In this paper, we propose a new trading strategy, called DNN-forwardtesting, that determines the strategy to apply by testing it on the possible future predicted by a deep neural network that has been designed to perform stock price forecasts and trained with the market's historical data. In order to generate such a historical dataset, we first perform an exploratory data analysis on a set of ten securities and, in particular, analyze their volatility through a novel k-means-based procedure. Then, we restrict the dataset to a small number of assets with the same volatility coefficient and use this data to train a deep feed-forward neural network that forecasts prices for the next 30 open-market days. Finally, our trading system calculates the most effective technical indicator by applying it to the DNN's predictions and uses this indicator to guide its trades. The results confirm that neural networks outperform classical statistical techniques when performing such forecasts, and that their predictions allow us to select a trading strategy that, when applied to the real future, increases Expectancy, Sharpe, Sortino, and Calmar ratios with respect to the strategy selected through traditional backtesting.  ( 3 min )
    Global Convergence of Direct Policy Search for State-Feedback $\mathcal{H}_\infty$ Robust Control: A Revisit of Nonsmooth Synthesis with Goldstein Subdifferential. (arXiv:2210.11577v1 [math.OC])
    Direct policy search has been widely applied in modern reinforcement learning and continuous control. However, the theoretical properties of direct policy search on nonsmooth robust control synthesis have not been fully understood. The optimal $\mathcal{H}_\infty$ control framework aims at designing a policy to minimize the closed-loop $\mathcal{H}_\infty$ norm, and is arguably the most fundamental robust control paradigm. In this work, we show that direct policy search is guaranteed to find the global solution of the robust $\mathcal{H}_\infty$ state-feedback control design problem. Notice that policy search for optimal $\mathcal{H}_\infty$ control leads to a constrained nonconvex nonsmooth optimization problem, where the nonconvex feasible set consists of all the policies stabilizing the closed-loop dynamics. We show that for this nonsmooth optimization problem, all Clarke stationary points are global minima. Next, we identify the coerciveness of the closed-loop $\mathcal{H}_\infty$ objective function, and prove that all the sublevel sets of the resultant policy search problem are compact. Based on these properties, we show that Goldstein's subgradient method and its implementable variants can be guaranteed to stay in the nonconvex feasible set and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback synthesis problem. Our work builds a new connection between nonconvex nonsmooth optimization theory and robust control, leading to an interesting global convergence result for direct policy search on optimal $\mathcal{H}_\infty$ synthesis.  ( 3 min )
    Learning Sample Reweighting for Accuracy and Adversarial Robustness. (arXiv:2210.11513v1 [cs.LG])
    There has been great interest in enhancing the robustness of neural network classifiers to defend against adversarial perturbations through adversarial training, while balancing the trade-off between robust accuracy and standard accuracy. We propose a novel adversarial training framework that learns to reweight the loss associated with individual training samples based on a notion of class-conditioned margin, with the goal of improving robust generalization. We formulate weighted adversarial training as a bilevel optimization problem with the upper-level problem corresponding to learning a robust classifier, and the lower-level problem corresponding to learning a parametric function that maps from a sample's \textit{multi-class margin} to an importance weight. Extensive experiments demonstrate that our approach consistently improves both clean and robust accuracy compared to related methods and state-of-the-art baselines.  ( 2 min )
    Transferring learned patterns from ground-based field imagery to predict UAV-based imagery for crop and weed semantic segmentation in precision crop farming. (arXiv:2210.11545v1 [cs.CV])
    Weed and crop segmentation is becoming an increasingly integral part of precision farming that leverages current computer vision and deep learning technologies. Research has been carried out extensively on images captured with cameras from various platforms. Unmanned aerial vehicles (UAVs) and ground-based vehicles, including agricultural robots, are the two popular platforms for data collection in fields; both contribute to site-specific weed management (SSWM) to maintain crop yield. Currently, the data from these two platforms are processed separately, though they share the same semantic objects (weed and crop). In this paper, we develop a deep convolutional network that performs weed segmentation and mapping on both field images and aerial images from UAVs, with only field images provided in the training phase. The network learning process is visualized by feature maps at shallow and deep layers. The results show that the mean intersection over union (IoU) values of the segmentation for the crop (maize), weeds, and soil background with the developed model on the field dataset are 0.744, 0.577, and 0.979, respectively; for aerial images from a UAV with the same model, the IoU values for the crop (maize), weeds, and soil background are 0.596, 0.407, and 0.875, respectively. To estimate the effect on the use of plant protection agents, we quantify the relationship between the herbicide spraying saving rate and the grid size (spraying resolution) based on the predicted weed map. The spraying saving rate is up to 90% when the spraying resolution is 1.78 x 1.78 cm$^2$. The study shows that the developed deep convolutional neural network can classify weeds from both field and aerial images and delivers satisfactory results.  ( 3 min )
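    For reference, the per-class IoU reported above follows the standard definition, sketched here with toy masks:

    ```python
    import numpy as np

    def class_iou(pred: np.ndarray, gt: np.ndarray, cls: int) -> float:
        p, g = pred == cls, gt == cls
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        return inter / union if union else float("nan")

    pred = np.random.randint(0, 3, (256, 256))   # 0=soil, 1=crop, 2=weed (toy masks)
    gt = np.random.randint(0, 3, (256, 256))
    print([round(class_iou(pred, gt, c), 3) for c in range(3)])
    ```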
    An Improved Algorithm for Clustered Federated Learning. (arXiv:2210.11538v1 [stat.ML])
    In this paper, we address the dichotomy between heterogeneous models and simultaneous training in Federated Learning (FL) via a clustering framework. We define a new clustering model for FL based on the (optimal) local models of the users: two users belong to the same cluster if their local models are close; otherwise they belong to different clusters. A standard algorithm for clustered FL, called \texttt{IFCA}, was proposed by Ghosh et al. (2021); it requires \emph{suitable} initialization and knowledge of hyper-parameters, like the number of clusters (which is often quite difficult to obtain in practical applications), to converge. We propose an improved algorithm, \emph{Successive Refine Federated Clustering Algorithm} (\texttt{SR-FCA}), which removes these restrictive assumptions. \texttt{SR-FCA} initializes each user as a singleton cluster, and then successively refines the cluster estimation by exploiting similar users belonging to the same cluster. In any intermediate step, \texttt{SR-FCA} uses a robust federated learning algorithm within each cluster to exploit simultaneous training and to correct clustering errors. Furthermore, \texttt{SR-FCA} does not require any \emph{good} initialization (warm start), both in theory and practice. We show that with a proper choice of learning rate, \texttt{SR-FCA} incurs arbitrarily small clustering error. Additionally, we validate the performance of our algorithm on standard FL datasets with non-convex problems like neural nets, and we show the benefits of \texttt{SR-FCA} over baselines.  ( 2 min )
    Theoretical analysis of deep neural networks for temporally dependent observations. (arXiv:2210.11530v1 [stat.ML])
    Deep neural networks are powerful tools for modeling observations over time with non-linear patterns. Despite the widespread use of neural networks in such settings, most theoretical developments of deep neural networks assume independent observations, and theoretical results for temporally dependent observations are scarce. To bridge this gap, we study theoretical properties of deep neural networks for modeling non-linear time series data. Specifically, non-asymptotic bounds on the prediction error of (sparse) feed-forward neural networks with ReLU activation are established under mixing-type assumptions. These assumptions are mild enough to include a wide range of time series models, including auto-regressive models. Compared to independent observations, the established convergence rates have additional logarithmic factors to compensate for the additional complexity due to dependence among data points. The theoretical results are supported by various numerical simulation settings as well as an application to a macroeconomic data set.  ( 2 min )
    Multimodal Neural Network For Demand Forecasting. (arXiv:2210.11502v1 [cs.LG])
    Demand forecasting applications have immensely benefited from state-of-the-art deep learning methods for time series forecasting. Traditional uni-modal models are predominantly seasonality-driven, attempting to model demand as a function of historical sales along with information on holidays and promotional events. However, accurate and robust sales forecasting calls for accommodating multiple other factors, such as natural calamities, pandemics, elections, etc., that impact the demand for products and product categories in general. We propose a multi-modal sales forecasting network that combines real-life events from news articles with traditional data such as historical sales and holiday information. Further, we fuse information on general product trends published by Google Trends. Empirical results show statistically significant improvements in the SMAPE error metric, with an average improvement of 7.37% over existing state-of-the-art sales forecasting techniques on a real-world supermarket dataset.  ( 2 min )
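    For reference, a standard SMAPE implementation (the paper may use a slightly different variant of the metric):

    ```python
    import numpy as np

    def smape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
        return float(np.mean(np.abs(y_pred - y_true) / denom) * 100.0)

    print(smape(np.array([100.0, 200.0]), np.array([110.0, 180.0])))  # ~10.0
    ```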
    Monotonic Risk Relationships under Distribution Shifts for Regularized Risk Minimization. (arXiv:2210.11589v1 [cs.LG])
    Machine learning systems are often applied to data that is drawn from a different distribution than the training distribution. Recent work has shown that for a variety of classification and signal reconstruction problems, the out-of-distribution performance is strongly linearly correlated with the in-distribution performance. If this relationship, or more generally a monotonic one, holds, it has important consequences. For example, it allows one to optimize performance on one distribution as a proxy for performance on the other. In this paper, we study conditions under which a monotonic relationship between the performances of a model on two distributions is expected. We prove an exact asymptotic linear relation for squared error and a monotonic relation for misclassification error for ridge-regularized general linear models under covariate shift, as well as an approximate linear relation for linear inverse problems.  ( 2 min )
    Low-Rank Representations Towards Classification Problem of Complex Networks. (arXiv:2210.11561v1 [cs.SI])
    Complex networks representing social interactions, brain activity, and molecular structures have been studied widely in order to understand and predict their characteristics as graphs. Models and algorithms for these networks are used in real-life applications, such as search engines and recommender systems. In general, such networks are modelled by constructing a low-dimensional Euclidean embedding of the vertices of the network, where proximity of the vertices in the Euclidean space hints at the likelihood of an edge (link). In this work, we study the performance of such low-rank representations of real-life networks on a network classification problem.  ( 2 min )
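    A minimal sketch of the embedding step described above, using a truncated SVD of the adjacency matrix as the low-rank representation; the network here is a random placeholder and the rank d an arbitrary choice.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((100, 100)) < 0.05
    A = np.triu(A, 1)
    A = (A + A.T).astype(float)                  # symmetric adjacency matrix

    U, s, _ = np.linalg.svd(A)
    d = 8
    emb = U[:, :d] * np.sqrt(s[:d])              # d-dimensional vertex embedding
    print(emb.shape)                             # (100, 8); feed to any classifier
    ```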
    Overexposure Mask Fusion: Generalizable Reverse ISP Multi-Step Refinement. (arXiv:2210.11511v1 [cs.CV])
    With the advent of deep learning methods replacing the ISP in transforming sensor RAW readings into RGB images, numerous methodologies have solidified into real-life applications. Equally potent is the task of inverting this process, which has applications in computational photography tasks conducted in the RAW domain: it addresses the lack of available RAW data while reaping the benefits of performing tasks directly on sensor readings. This paper's proposed methodology is a state-of-the-art solution to the task of RAW reconstruction, and the multi-step refinement process integrating an overexposure mask is novel in three ways: instead of mapping RGB to Bayer, the pipeline trains from RGB to demosaiced RAW, allowing the use of perceptual loss functions; the multi-step process greatly enhances the performance of the baseline U-Net from start to end; and the pipeline is a generalizable refinement process that can enhance other high-performance methodologies that support end-to-end learning.  ( 2 min )
    Local SGD in Overparameterized Linear Regression. (arXiv:2210.11562v1 [stat.ML])
    We consider distributed learning using constant-stepsize SGD (DSGD) over several devices, each sending a final model update to a central server. In a final step, the local estimates are aggregated. In the setting of overparameterized linear regression, we prove general upper bounds with matching lower bounds and derive learning rates for specific data-generating distributions. We show that the excess risk is of the order of the variance provided the number of local nodes does not grow too large with the global sample size. We further compare the sample complexity of DSGD with that of distributed ridge regression (DRR) and show that the excess SGD-risk is smaller than the excess RR-risk, where both sample complexities are of the same order.  ( 2 min )
    gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs. (arXiv:2210.11601v1 [cs.LG])
    As interest in Graph Neural Networks (GNNs) grows, the importance of benchmarking and performance characterization studies of GNNs is increasing. Many studies have investigated and presented the performance and computational efficiency of GNNs. However, the work done so far has been carried out using a few high-level GNN frameworks. Although these frameworks provide ease of use, they contain too many dependencies on other existing libraries. The layers of implementation details and the dependencies complicate the performance analysis of GNN models that are built on top of these frameworks, especially while using architectural simulators. Furthermore, different approaches to GNN computation are generally overlooked in prior characterization studies, and merely one of the common computational models is evaluated. Based on these shortcomings and needs that we observed, we developed a benchmark suite that is framework independent, supports versatile computational models, is easily configurable, and can be used with architectural simulators without additional effort. Our benchmark suite, which we call gSuite, makes use of only the hardware vendor's libraries and is therefore independent of any other framework. gSuite enables detailed performance characterization studies of GNN inference using both contemporary GPU profilers and architectural GPU simulators. To illustrate the benefits of our new benchmark suite, we perform a detailed characterization study with a set of well-known GNN models and various datasets, running gSuite both on a real GPU card and on a timing-detailed GPU simulator. We also examine the effect of computational models on performance. We use several evaluation metrics to rigorously measure the performance of GNN computation.  ( 3 min )
    Improving aircraft performance using machine learning: a review. (arXiv:2210.11481v1 [cs.LG])
    This review covers the new developments in machine learning (ML) that are impacting the multi-disciplinary area of aerospace engineering, including fundamental fluid dynamics (experimental and numerical), aerodynamics, acoustics, combustion and structural health monitoring. We review the state of the art, gathering the advantages and challenges of ML methods across different aerospace disciplines and provide our view on future opportunities. The basic concepts and the most relevant strategies for ML are presented together with the most relevant applications in aerospace engineering, revealing that ML is improving aircraft performance and that these techniques will have a large impact in the near future.  ( 2 min )
    3DALL-E: Integrating Text-to-Image AI in 3D Design Workflows. (arXiv:2210.11603v1 [cs.HC])
    Text-to-image AI systems are capable of generating novel images for inspiration, but their applications in 3D design workflows, and how designers can build 3D models using AI-provided inspiration, are less understood. To investigate this, we integrated DALL-E, GPT-3, and CLIP within a CAD software in 3DALL-E, a plugin that allows users to construct text and image prompts based on what they are modelling. In a study with 13 designers, we found that designers saw great potential to incorporate 3DALL-E into their workflows and to use text-to-image AI for reference images, renders, materials, and design considerations. Additionally, we elaborate on prompting patterns and provide measures of prompt complexity observed across participants. We conclude with a discussion of how 3DALL-E can merge with existing generative design workflows and propose prompt bibliographies as a form of human-AI design history.  ( 2 min )
    Composing Ensembles of Pre-trained Models via Iterative Consensus. (arXiv:2210.11522v1 [cs.CV])
    Large pre-trained models exhibit distinct and complementary capabilities dependent on the data they are trained on. Language models such as GPT-3 are capable of textual reasoning but cannot understand visual information, while vision models such as DALL-E can generate photorealistic photos but fail to understand complex language descriptions. In this work, we propose a unified framework for composing ensembles of different pre-trained models -- combining the strengths of each individual model to solve various multimodal problems in a zero-shot manner. We use pre-trained models as "generators" or "scorers" and compose them via closed-loop iterative consensus optimization. The generator constructs proposals and the scorers iteratively provide feedback to refine the generated result. Such closed-loop communication enables models to correct errors caused by other models, significantly boosting performance on downstream tasks, e.g. improving accuracy on grade school math problems by 7.5%, without requiring any model finetuning. We demonstrate that consensus achieved by an ensemble of scorers outperforms the feedback of a single scorer, by leveraging the strengths of each expert model. Results show that the proposed method can be used as a general purpose framework for a wide range of zero-shot multimodal tasks, such as image generation, video question answering, mathematical reasoning, and robotic manipulation. Project page: https://energy-based-model.github.io/composing-pretrained-models.  ( 2 min )
  • Open

    First-Order Regret in Reinforcement Learning with Linear Function Approximation: A Robust Estimation Approach. (arXiv:2112.03432v4 [cs.LG] UPDATED)
    Obtaining first-order regret bounds -- regret bounds scaling not as the worst-case but with some measure of the performance of the optimal policy on a given instance -- is a core question in sequential decision-making. While such bounds exist in many settings, they have proven elusive in reinforcement learning with large state spaces. In this work we address this gap, and show that it is possible to obtain regret scaling as $\widetilde{\mathcal{O}}(\sqrt{d^3 H^3 \cdot V_1^\star \cdot K} + d^{3.5}H^3\log K )$ in reinforcement learning with large state spaces, namely the linear MDP setting. Here $V_1^\star$ is the value of the optimal policy and $K$ is the number of episodes. We demonstrate that existing techniques based on least squares estimation are insufficient to obtain this result, and instead develop a novel robust self-normalized concentration bound based on the robust Catoni mean estimator, which may be of independent interest.
    Correlating sparse sensing for network-wide traffic speed estimation: An integrated graph tensor-based kriging approach. (arXiv:2210.11780v1 [stat.ML])
    Traffic speed is central to characterizing the fluidity of the road network. Many transportation applications rely on it, such as real-time navigation, dynamic route planning, and congestion management. Rapid advances in sensing and communication techniques make traffic speed detection easier than ever. However, due to sparse deployment of static sensors or low penetration of mobile sensors, the detected speeds are incomplete and far from network-wide use. In addition, sensors are prone to error or missing data for various reasons, so the speeds they report can be highly noisy. These drawbacks call for effective techniques to recover credible estimates from the incomplete data. In this work, we first identify the problem as a spatiotemporal kriging problem and propose a unified graph embedded tensor (SGET) learning framework featuring both low-rankness and multi-dimensional correlations for network-wide traffic speed kriging under limited observations. Specifically, three types of speed correlation, namely temporal continuity, temporal periodicity, and spatial proximity, are carefully chosen. We then design an efficient solution algorithm via several effective numeric techniques to scale the proposed model up to network-wide kriging. Through experiments on two public million-level traffic speed datasets, we find that the proposed SGET achieves state-of-the-art kriging performance even under low observation rates, while saving more than half the computing time compared with baseline methods. Some insights into spatiotemporal traffic data kriging at the network level are provided as well.
    Structural Kernel Search via Bayesian Optimization and Symbolical Optimal Transport. (arXiv:2210.11836v1 [cs.LG])
    Despite recent advances in automated machine learning, model selection is still a complex and computationally intensive process. For Gaussian processes (GPs), selecting the kernel is a crucial task, often done manually by the expert. Additionally, evaluating the model selection criteria for Gaussian processes typically scales cubically in the sample size, rendering kernel search particularly computationally expensive. We propose a novel, efficient search method through a general, structured kernel space. Previous methods solved this task via Bayesian optimization and relied on measuring the distance between GPs directly in function space to construct a kernel-kernel. We present an alternative approach by defining a kernel-kernel over the symbolic representation of the statistical hypothesis that is associated with a kernel. We empirically show that this leads to a computationally more efficient way of searching through a discrete kernel space.
    FoSR: First-order spectral rewiring for addressing oversquashing in GNNs. (arXiv:2210.11790v1 [cs.LG])
    Graph neural networks (GNNs) are able to leverage the structure of graph data by passing messages along the edges of the graph. While this allows GNNs to learn features depending on the graph structure, for certain graph topologies it leads to inefficient information propagation and a problem known as oversquashing. This has recently been linked with the curvature and spectral gap of the graph. On the other hand, adding edges to the message-passing graph can lead to increasingly similar node representations and a problem known as oversmoothing. We propose a computationally efficient algorithm that prevents oversquashing by systematically adding edges to the graph based on spectral expansion. We combine this with a relational architecture, which lets the GNN preserve the original graph structure and provably prevents oversmoothing. We find experimentally that our algorithm outperforms existing graph rewiring methods in several graph classification tasks.
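    To make the rewiring idea concrete, here is a brute-force caricature that greedily adds the edge most increasing the spectral gap of the normalized Laplacian; FoSR itself uses a cheaper first-order approximation rather than this exhaustive search.

    ```python
    import numpy as np

    def spectral_gap(A: np.ndarray) -> float:
        deg = A.sum(axis=1)
        d = np.where(deg > 0, deg, 1.0) ** -0.5
        L = np.eye(len(A)) - d[:, None] * A * d[None, :]   # normalized Laplacian
        return float(np.sort(np.linalg.eigvalsh(L))[1])    # second-smallest eigenvalue

    def add_best_edge(A: np.ndarray):
        """Try every absent edge and keep the one that most increases the gap."""
        best, best_gap = None, spectral_gap(A)
        n = len(A)
        for i in range(n):
            for j in range(i + 1, n):
                if A[i, j] == 0:
                    A[i, j] = A[j, i] = 1
                    gap = spectral_gap(A)
                    if gap > best_gap:
                        best, best_gap = (i, j), gap
                    A[i, j] = A[j, i] = 0
        if best is not None:
            i, j = best
            A[i, j] = A[j, i] = 1
        return A, best

    A = np.zeros((6, 6))
    A[np.arange(5), np.arange(1, 6)] = 1
    A = A + A.T                                            # path graph on 6 nodes
    print(add_best_edge(A)[1])                             # the chosen shortcut edge
    ```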
    Ollivier-Ricci Curvature for Hypergraphs: A Unified Framework. (arXiv:2210.12048v1 [cs.LG])
    Bridging geometry and topology, curvature is a powerful and expressive invariant. While the utility of curvature has been theoretically and empirically confirmed in the context of manifolds and graphs, its generalization to the emerging domain of hypergraphs has remained largely unexplored. On graphs, Ollivier-Ricci curvature measures differences between random walks via Wasserstein distances, thus grounding a geometric concept in ideas from probability and optimal transport. We develop ORCHID, a flexible framework generalizing Ollivier-Ricci curvature to hypergraphs, and prove that the resulting curvatures have favorable theoretical properties. Through extensive experiments on synthetic and real-world hypergraphs from different domains, we demonstrate that ORCHID curvatures are both scalable and useful to perform a variety of hypergraph tasks in practice.
    Evolution of Neural Tangent Kernels under Benign and Adversarial Training. (arXiv:2210.12030v1 [cs.LG])
    Two key challenges facing modern deep learning are mitigating deep networks' vulnerability to adversarial attacks and understanding deep learning's generalization capabilities. Towards the first issue, many defense strategies have been developed, with the most common being Adversarial Training (AT). Towards the second challenge, one of the dominant theories that has emerged is the Neural Tangent Kernel (NTK) -- a characterization of neural network behavior in the infinite-width limit. In this limit, the kernel is frozen, and the underlying feature map is fixed. In finite widths, however, there is evidence that feature learning happens at the earlier stages of the training (kernel learning) before a second phase where the kernel remains fixed (lazy training). While prior work has aimed at studying adversarial vulnerability through the lens of the frozen infinite-width NTK, there is no work that studies the adversarial robustness of the empirical/finite NTK during training. In this work, we perform an empirical study of the evolution of the empirical NTK under standard and adversarial training, aiming to disambiguate the effect of adversarial training on kernel learning and lazy training. We find that under adversarial training, the empirical NTK rapidly converges to a different kernel (and feature map) than under standard training. This new kernel provides adversarial robustness, even when non-robust training is performed on top of it. Furthermore, we find that adversarial training on top of a fixed kernel can yield a classifier with $76.1\%$ robust accuracy under PGD attacks with $\varepsilon = 4/255$ on CIFAR-10.
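    A minimal sketch of computing one entry of the empirical NTK (the dot product of parameter gradients at two inputs); tracking its evolution under standard vs. adversarial training amounts to repeating this computation during training.

    ```python
    import torch

    net = torch.nn.Sequential(torch.nn.Linear(10, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 1))

    def grad_vec(x: torch.Tensor) -> torch.Tensor:
        """Flattened parameter gradient of the scalar network output at x."""
        net.zero_grad()
        net(x).sum().backward()
        return torch.cat([p.grad.flatten() for p in net.parameters()])

    x1, x2 = torch.randn(1, 10), torch.randn(1, 10)
    ntk_12 = torch.dot(grad_vec(x1), grad_vec(x2))   # one empirical-NTK entry
    print(ntk_12.item())
    ```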
    Efficient Dataset Distillation Using Random Feature Approximation. (arXiv:2210.12067v1 [cs.LG])
    Dataset distillation compresses large datasets into smaller synthetic coresets which retain performance with the aim of reducing the storage and computational burden of processing the entire dataset. Today's best-performing algorithm, \textit{Kernel Inducing Points} (KIP), which makes use of the correspondence between infinite-width neural networks and kernel-ridge regression, is prohibitively slow due to the exact computation of the neural tangent kernel matrix, scaling $O(|S|^2)$, with $|S|$ being the coreset size. To improve this, we propose a novel algorithm that uses a random feature approximation (RFA) of the Neural Network Gaussian Process (NNGP) kernel, which reduces the kernel matrix computation to $O(|S|)$. Our algorithm provides at least a 100-fold speedup over KIP and can run on a single GPU. Our new method, termed an RFA Distillation (RFAD), performs competitively with KIP and other dataset condensation algorithms in accuracy over a range of large-scale datasets, both in kernel regression and finite-width network training. We demonstrate the effectiveness of our approach on tasks involving model interpretability and privacy preservation.
    LOT: Layer-wise Orthogonal Training on Improving l2 Certified Robustness. (arXiv:2210.11620v1 [cs.LG])
    Recent studies show that training deep neural networks (DNNs) with Lipschitz constraints can enhance adversarial robustness and other model properties such as stability. In this paper, we propose a layer-wise orthogonal training method (LOT) to effectively train 1-Lipschitz convolution layers by parametrizing an orthogonal matrix with an unconstrained matrix. We then efficiently compute the inverse square root of a convolution kernel by transforming the input domain to the Fourier frequency domain. On the other hand, as existing works show that semi-supervised training helps improve empirical robustness, we aim to bridge the gap and prove that semi-supervised learning also improves the certified robustness of Lipschitz-bounded models. We conduct comprehensive evaluations of LOT under different settings. We show that LOT significantly outperforms baselines regarding deterministic $\ell_2$ certified robustness, and scales to deeper neural networks. Under the supervised scenario, we improve the state-of-the-art certified robustness for all architectures (e.g. from 59.04% to 63.50% on CIFAR-10 and from 32.57% to 34.59% on CIFAR-100 at radius $\rho = 36/255$ for 40-layer networks). With semi-supervised learning over unlabelled data, we are able to improve the state-of-the-art certified robustness on CIFAR-10 at $\rho = 108/255$ from 36.04% to 42.39%. In addition, LOT consistently outperforms baselines on different model architectures with only 1/3 of the evaluation time.
    Distributionally Robust Bayesian Optimization with $\phi$-divergences. (arXiv:2203.02128v3 [cs.LG] UPDATED)
    The study of robustness has received much attention due to its inevitability in data-driven settings where many systems face uncertainty. One such example of concern is Bayesian Optimization (BO), where uncertainty is multi-faceted, yet only a limited number of works are dedicated to this direction. In particular, the work of Kirschner et al. (2020) bridges the existing literature on Distributionally Robust Optimization (DRO) by casting the BO problem through the lens of DRO. While this work is pioneering, it admittedly suffers from various practical shortcomings, such as the assumption of finite contexts, leaving behind the main question: can one devise a computationally tractable algorithm for solving this DRO-BO problem? In this work, we tackle this question to a large degree of generality by considering robustness against data-shift in $\phi$-divergences, which subsumes many popular choices, such as the $\chi^2$-divergence, Total Variation, and the extant Kullback-Leibler (KL) divergence. We show that the DRO-BO problem in this setting is equivalent to a finite-dimensional optimization problem which, even in the continuous context setting, can be easily implemented with provable sublinear regret bounds. We then show experimentally that our method surpasses existing methods, attesting to the theoretical results.
    High-Dimensional Private Empirical Risk Minimization by Greedy Coordinate Descent. (arXiv:2207.01560v2 [cs.LG] UPDATED)
    In this paper, we study differentially private empirical risk minimization (DP-ERM). It has been shown that the worst-case utility of DP-ERM reduces polynomially as the dimension increases. This is a major obstacle to privately learning large machine learning models. In high dimension, it is common for some model's parameters to carry more information than others. To exploit this, we propose a differentially private greedy coordinate descent (DP-GCD) algorithm. At each iteration, DP-GCD privately performs a coordinate-wise gradient step along the gradients' (approximately) greatest entry. We show theoretically that DP-GCD can achieve a logarithmic dependence on the dimension for a wide range of problems by naturally exploiting their structural properties (such as quasi-sparse solutions). We illustrate this behavior numerically, both on synthetic and real datasets.
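    As a loose, uncalibrated caricature of the greedy coordinate idea (illustrative noise scales, not a differential-privacy accounting; the paper specifies clipping and calibrated mechanisms):

    ```python
    import numpy as np

    def private_greedy_step(w, grad_fn, lr=0.1, scale=0.05,
                            rng=np.random.default_rng(0)):
        g = grad_fn(w)
        noisy = np.abs(g) + rng.laplace(0.0, scale, size=g.shape)  # report-noisy-max
        j = int(np.argmax(noisy))                                  # private coordinate pick
        w = w.copy()
        w[j] -= lr * (g[j] + rng.laplace(0.0, scale))              # noisy coordinate update
        return w

    grad = lambda w: 2 * w                 # toy quadratic objective
    w = np.array([3.0, -1.0, 0.2])
    for _ in range(20):
        w = private_greedy_step(w, grad)
    print(w)                               # the largest coordinates shrink first
    ```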
    Efficient learning of nonlinear prediction models with time-series privileged information. (arXiv:2209.07067v3 [cs.LG] UPDATED)
    In domains where sample sizes are limited, efficient learning algorithms are critical. Learning using privileged information (LuPI) offers increased sample efficiency by allowing prediction models access to auxiliary information at training time which is unavailable when the models are used. In recent work, it was shown that for prediction in linear-Gaussian dynamical systems, a LuPI learner with access to intermediate time series data is never worse and often better in expectation than any unbiased classical learner. We provide new insights into this analysis and generalize it to nonlinear prediction tasks in latent dynamical systems, extending theoretical guarantees to the case where the map connecting latent variables and observations is known up to a linear transform. In addition, we propose algorithms based on random features and representation learning for the case when this map is unknown. A suite of empirical results confirms the theoretical findings and shows the potential of using privileged time-series information in nonlinear prediction.
    The Phenomenon of Policy Churn. (arXiv:2206.00730v3 [cs.LG] UPDATED)
    We identify and study the phenomenon of policy churn, that is, the rapid change of the greedy policy in value-based reinforcement learning. Policy churn operates at a surprisingly rapid pace, changing the greedy action in a large fraction of states within a handful of learning updates (in a typical deep RL set-up such as DQN on Atari). We characterise the phenomenon empirically, verifying that it is not limited to specific algorithm or environment properties. A number of ablations help whittle down the plausible explanations on why churn occurs to just a handful, all related to deep learning. Finally, we hypothesise that policy churn is a beneficial but overlooked form of implicit exploration that casts $\epsilon$-greedy exploration in a fresh light, namely that $\epsilon$-noise plays a much smaller role than expected.
    Assaying Out-Of-Distribution Generalization in Transfer Learning. (arXiv:2207.09239v2 [cs.LG] UPDATED)
    Since out-of-distribution generalization is a generally ill-posed problem, various proxy targets (e.g., calibration, adversarial robustness, algorithmic corruptions, invariance across shifts) were studied across different research programs resulting in different recommendations. While sharing the same aspirational goal, these approaches have never been tested under the same experimental conditions on real data. In this paper, we take a unified view of previous work, highlighting message discrepancies that we address empirically, and providing recommendations on how to measure the robustness of a model and how to improve it. To this end, we collect 172 publicly available dataset pairs for training and out-of-distribution evaluation of accuracy, calibration error, adversarial attacks, environment invariance, and synthetic corruptions. We fine-tune over 31k networks, from nine different architectures in the many- and few-shot setting. Our findings confirm that in- and out-of-distribution accuracies tend to increase jointly, but show that their relation is largely dataset-dependent, and in general more nuanced and more complex than posited by previous, smaller scale studies.
    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. (arXiv:2207.06569v2 [cs.LG] UPDATED)
    The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study. We first explore this phenomenon in the context of kernel (ridge) regression (KR) by obtaining conditions on the ridge parameter and kernel eigenspectrum under which KR exhibits each of the three behaviors. We find that kernels with power-law spectra, including Laplace kernels and ReLU neural tangent kernels, exhibit tempered overfitting. We then empirically study deep neural networks through the lens of our taxonomy, and find that those trained to interpolation are tempered, while those stopped early are benign. We hope our work leads to a more refined understanding of overfitting in modern learning.
    Label Noise in Adversarial Training: A Novel Perspective to Study Robust Overfitting. (arXiv:2110.03135v2 [cs.LG] UPDATED)
    We show that label noise exists in adversarial training. Such label noise is due to the mismatch between the true label distribution of adversarial examples and the label inherited from clean examples: the true label distribution is distorted by the adversarial perturbation, but is neglected by the common practice of inheriting labels from clean examples. Recognizing label noise sheds light on the prevalence of robust overfitting in adversarial training and explains its intriguing dependence on perturbation radius and data quality. Our label noise perspective also aligns well with our observations of epoch-wise double descent in adversarial training. Guided by our analyses, we propose a method to automatically calibrate labels to address label noise and robust overfitting. Our method achieves consistent performance improvements across various models and datasets without introducing new hyper-parameters or additional tuning.
    Refined Convergence and Topology Learning for Decentralized SGD with Heterogeneous Data. (arXiv:2204.04452v3 [cs.LG] UPDATED)
    One of the key challenges in decentralized and federated learning is to design algorithms that efficiently deal with highly heterogeneous data distributions across agents. In this paper, we revisit the analysis of the popular Decentralized Stochastic Gradient Descent algorithm (D-SGD) under data heterogeneity. We exhibit the key role played by a new quantity, called neighborhood heterogeneity, in the convergence rate of D-SGD. By coupling the communication topology and the heterogeneity, our analysis sheds light on the poorly understood interplay between these two concepts. We then argue that neighborhood heterogeneity provides a natural criterion to learn data-dependent topologies that reduce (and can even eliminate) the otherwise detrimental effect of data heterogeneity on the convergence time of D-SGD. For the important case of classification with label skew, we formulate the problem of learning such a good topology as a tractable optimization problem that we solve with a Frank-Wolfe algorithm. As illustrated by a set of simulated and real-world experiments, our approach provides a principled way to design a sparse topology that balances the convergence speed and the per-iteration communication costs of D-SGD under data heterogeneity.
    Optimizing the Performative Risk under Weak Convexity Assumptions. (arXiv:2209.00771v4 [cs.LG] UPDATED)
    In performative prediction, a predictive model impacts the distribution that generates future data, a phenomenon ignored in classical supervised learning. In this closed-loop setting, the natural measure of performance, named performative risk ($\mathrm{PR}$), captures the expected loss incurred by a predictive model \emph{after} deployment. The core difficulty of using the performative risk as an optimization objective is that the data distribution itself depends on the model parameters. This dependence is governed by the environment and is not under the control of the learner. As a consequence, even the choice of a convex loss function can result in a highly non-convex $\mathrm{PR}$ minimization problem. Prior work has identified a pair of general conditions on the loss and the mapping from model parameters to distributions that implies convexity of the performative risk. In this paper, we relax these assumptions and focus on obtaining weaker notions of convexity, without sacrificing the amenability of the $\mathrm{PR}$ minimization problem to iterative optimization methods.
    On amortizing convex conjugates for optimal transport. (arXiv:2210.12153v1 [cs.LG])
    This paper focuses on computing the convex conjugate operation that arises when solving Euclidean Wasserstein-2 optimal transport problems. This conjugation, which is also referred to as the Legendre-Fenchel conjugate or $c$-transform, is considered difficult to compute and in practice, Wasserstein-2 methods are limited by not being able to exactly conjugate the dual potentials in continuous space. I show that combining amortized approximations to the conjugate with a solver for fine-tuning is computationally easy. This combination significantly improves the quality of transport maps learned for the Wasserstein-2 benchmark by Korotin et al. (2021) and is able to model many 2-dimensional couplings and flows considered in the literature. All of the baselines, methods, and solvers in this paper are available at this http URL
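    For intuition, the conjugate in question can be computed numerically on a grid; amortization replaces this inner maximization with a learned prediction that a solver then fine-tunes. A toy check with the self-conjugate $f(x) = x^2/2$:

    ```python
    import numpy as np

    xs = np.linspace(-5, 5, 2001)
    f = 0.5 * xs ** 2                          # f(x) = x^2/2 is its own conjugate

    def conjugate(y: float) -> float:
        """f*(y) = max_x <x, y> - f(x), approximated on the grid."""
        return float(np.max(xs * y - f))

    for y in [-2.0, 0.5, 3.0]:
        print(y, conjugate(y), 0.5 * y ** 2)   # matches y^2/2 up to grid error
    ```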
    Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks. (arXiv:2010.04261v6 [cs.LG] UPDATED)
    The Hessian captures important properties of the deep neural network loss landscape. Previous works have observed low-rank structure in the Hessians of neural networks. In this paper, we propose a decoupling conjecture that decomposes the layer-wise Hessians of a network as the Kronecker product of two smaller matrices. We can analyze the properties of these smaller matrices and prove the structure of the top eigenspace for random 2-layer networks. The decoupling conjecture has several other interesting implications: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low-rank matrices when they are reshaped into the same shape as the corresponding weight matrix. All of these can be verified empirically for deeper networks. Finally, we use the structure of layer-wise Hessians to obtain better explicit generalization bounds for neural networks.
    Differentially Private Coordinate Descent for Composite Empirical Risk Minimization. (arXiv:2110.11688v3 [cs.LG] UPDATED)
    Machine learning models can leak information about the data used to train them. To mitigate this issue, Differentially Private (DP) variants of optimization algorithms like Stochastic Gradient Descent (DP-SGD) have been designed to trade-off utility for privacy in Empirical Risk Minimization (ERM) problems. In this paper, we propose Differentially Private proximal Coordinate Descent (DP-CD), a new method to solve composite DP-ERM problems. We derive utility guarantees through a novel theoretical analysis of inexact coordinate descent. Our results show that, thanks to larger step sizes, DP-CD can exploit imbalance in gradient coordinates to outperform DP-SGD. We also prove new lower bounds for composite DP-ERM under coordinate-wise regularity assumptions, that are nearly matched by DP-CD. For practical implementations, we propose to clip gradients using coordinate-wise thresholds that emerge from our theory, avoiding costly hyperparameter tuning. Experiments on real and synthetic data support our results, and show that DP-CD compares favorably with DP-SGD.
    A Non-Asymptotic Moreau Envelope Theory for High-Dimensional Generalized Linear Models. (arXiv:2210.12082v1 [stat.ML])
    We prove a new generalization bound that shows for any class of linear predictors in Gaussian space, the Rademacher complexity of the class and the training error under any continuous loss $\ell$ can control the test error under all Moreau envelopes of the loss $\ell$. We use our finite-sample bound to directly recover the "optimistic rate" of Zhou et al. (2021) for linear regression with the square loss, which is known to be tight for minimal $\ell_2$-norm interpolation, but we also handle more general settings where the label is generated by a potentially misspecified multi-index model. The same argument can analyze noisy interpolation of max-margin classifiers through the squared hinge loss, and establishes consistency results in spiked-covariance settings. More generally, when the loss is only assumed to be Lipschitz, our bound effectively improves Talagrand's well-known contraction lemma by a factor of two, and we prove uniform convergence of interpolators (Koehler et al. 2021) for all smooth, non-negative losses. Finally, we show that application of our generalization bound using localized Gaussian width will generally be sharp for empirical risk minimizers, establishing a non-asymptotic Moreau envelope theory for generalization that applies outside of proportional scaling regimes, handles model misspecification, and complements existing asymptotic Moreau envelope theories for M-estimation.
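    For reference, the Moreau envelope used above is the standard construction: for a loss $\ell$ and smoothing parameter $\lambda > 0$,
    $$ \ell_\lambda(u) \;=\; \inf_{v}\Big\{ \ell(v) + \tfrac{1}{2\lambda}\,\lVert u - v\rVert^2 \Big\}, $$
    which recovers $\ell$ as $\lambda \to 0$ and yields increasingly smooth surrogates as $\lambda$ grows, so controlling the test error under the whole family at once is strictly more informative than controlling it under $\ell$ alone.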
    Targeted active learning for probabilistic models. (arXiv:2210.12122v1 [cs.LG])
    A fundamental task in science is to design experiments that yield valuable insights about the system under study. Mathematically, these insights can be represented as a utility or risk function that shapes the value of conducting each experiment. We present PDBAL, a targeted active learning method that adaptively designs experiments to maximize scientific utility. PDBAL takes a user-specified risk function and combines it with a probabilistic model of the experimental outcomes to choose designs that rapidly converge on a high-utility model. We prove theoretical bounds on the label complexity of PDBAL and provide fast closed-form solutions for designing experiments with common exponential family likelihoods. In simulation studies, PDBAL consistently outperforms standard untargeted approaches that focus on maximizing expected information gain over the design space. Finally, we demonstrate the scientific potential of PDBAL through a study on a large cancer drug screen dataset where PDBAL quickly recovers the most efficacious drugs with a small fraction of the total number of experiments.
    Scalars are universal: Equivariant machine learning, structured like classical physics. (arXiv:2106.06610v3 [cs.LG] UPDATED)
    There has been enormous progress in the last few years in designing neural networks that respect the fundamental symmetries and coordinate freedoms of physical law. Some of these frameworks make use of irreducible representations, some make use of high-order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be universally expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. We complement our theory with numerical examples that show that the scalar-based method is simple, efficient, and scalable.
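    A toy illustration of the key observation (my sketch, not the paper's code): featurize a set of input vectors by their pairwise scalar products; any orthogonal transform Q leaves the features unchanged since (Qv_i)·(Qv_j) = v_i·v_j.

    import numpy as np

    def invariant_features(V):
        """V: (n, d) array of n input vectors; returns the upper-triangular Gram entries."""
        return (V @ V.T)[np.triu_indices(len(V))]

    rng = np.random.default_rng(0)
    V = rng.standard_normal((3, 5))
    Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))   # a random orthogonal matrix
    assert np.allclose(invariant_features(V), invariant_features(V @ Q.T))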
    Interventions, Where and How? Experimental Design for Causal Models at Scale. (arXiv:2203.02016v3 [cs.LG] UPDATED)
    Causal discovery from observational and interventional data is challenging due to limited data and non-identifiability: factors that introduce uncertainty in estimating the underlying structural causal model (SCM). Selecting experiments (interventions) based on the uncertainty arising from both factors can expedite the identification of the SCM. Existing methods in experimental design for causal discovery from limited data either rely on linear assumptions for the SCM or select only the intervention target. This work incorporates recent advances in Bayesian causal discovery into the Bayesian optimal experimental design framework, allowing for active causal discovery of large, nonlinear SCMs while selecting both the interventional target and the value. We demonstrate the performance of the proposed method on synthetic graphs (Erd\H{o}s-R\'enyi, Scale Free) for both linear and nonlinear SCMs as well as on the \emph{in-silico} single-cell gene regulatory network dataset, DREAM.
    Cox-Hawkes: doubly stochastic spatiotemporal Poisson processes. (arXiv:2210.11844v1 [stat.ML])
    Hawkes processes are point process models that have been used to capture self-excitatory behavior in social interactions, neural activity, earthquakes and viral epidemics. They can model the occurrence of the times and locations of events. Here we develop a new class of spatiotemporal Hawkes processes that can capture both triggering and clustering behavior and we provide an efficient method for performing inference. We use a log-Gaussian Cox process (LGCP) as prior for the background rate of the Hawkes process which gives arbitrary flexibility to capture a wide range of underlying background effects (for infectious diseases these are called endemic effects). The Hawkes process and LGCP are computationally expensive due to the former having a likelihood with quadratic complexity in the number of observations and the latter involving inversion of the precision matrix which is cubic in observations. Here we propose a novel approach to perform MCMC sampling for our Hawkes process with LGCP background, using pre-trained Gaussian Process generators which provide direct and cheap access to samples during inference. We show the efficacy and flexibility of our approach in experiments on simulated data and use our methods to uncover the trends in a dataset of reported crimes in the US.
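    In symbols (my paraphrase of the abstract, not the authors' exact notation), the model is a spatiotemporal Hawkes intensity with an LGCP background:

    \[
      \lambda(t, s) \;=\; \mu(t, s) + \sum_{i:\, t_i < t} g(t - t_i,\, s - s_i),
      \qquad \mu(t, s) = \exp\{f(t, s)\}, \quad f \sim \mathcal{GP},
    \]

    where $g$ is the self-exciting triggering kernel and $\mu$ the endemic background rate.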
    Learning Graphical Factor Models with Riemannian Optimization. (arXiv:2210.11950v1 [stat.ML])
    Graphical models and factor analysis are well-established tools in multivariate statistics. While these models can be both linked to structures exhibited by covariance and precision matrices, they are generally not jointly leveraged within graph learning processes. This paper therefore addresses this issue by proposing a flexible algorithmic framework for graph learning under low-rank structural constraints on the covariance matrix. The problem is expressed as penalized maximum likelihood estimation of an elliptical distribution (a generalization of Gaussian graphical models to possibly heavy-tailed distributions), where the covariance matrix is optionally constrained to be structured as low-rank plus diagonal (low-rank factor model). The resolution of this class of problems is then tackled with Riemannian optimization, where we leverage geometries of positive definite matrices and positive semi-definite matrices of fixed rank that are well suited to elliptical models. Numerical experiments on real-world data sets illustrate the effectiveness of the proposed approach.
    RL with KL penalties is better viewed as Bayesian inference. (arXiv:2205.11275v2 [cs.LG] UPDATED)
    Reinforcement learning (RL) is frequently employed in fine-tuning large language models (LMs), such as GPT-3, to penalize them for undesirable features of generated sequences, such as offensiveness, social bias, harmfulness or falsehood. The RL formulation involves treating the LM as a policy and updating it to maximise the expected value of a reward function which captures human preferences, such as non-offensiveness. In this paper, we analyze challenges associated with treating a language model as an RL policy and show how avoiding those challenges requires moving beyond the RL paradigm. We start by observing that the standard RL approach is flawed as an objective for fine-tuning LMs because it leads to distribution collapse: turning the LM into a degenerate distribution. Then, we analyze KL-regularised RL, a widely used recipe for fine-tuning LMs, which additionally constrains the fine-tuned LM to stay close to its original distribution in terms of Kullback-Leibler (KL) divergence. We show that KL-regularised RL is equivalent to variational inference: approximating a Bayesian posterior which specifies how to update a prior LM to conform with evidence provided by the reward function. We argue that this Bayesian inference view of KL-regularised RL is more insightful than the typically employed RL perspective. The Bayesian inference view explains how KL-regularised RL avoids the distribution collapse problem and offers a first-principles derivation for its objective. While this objective happens to be equivalent to RL (with a particular choice of parametric reward), there exist other objectives for fine-tuning LMs which are no longer equivalent to RL. That observation leads to a more general point: RL is not an adequate formal framework for problems such as fine-tuning language models. These problems are best viewed as Bayesian inference: approximating a pre-defined target distribution.
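    The equivalence in one formula (a standard result the argument rests on): maximizing $\mathbb{E}_{x \sim \pi}[r(x)] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_0)$ over policies $\pi$ is solved by

    \[
      \pi^*(x) \;\propto\; \pi_0(x)\, \exp\big(r(x)/\beta\big),
    \]

    i.e. the prior LM $\pi_0$ updated, posterior-style, by the evidence supplied by the reward $r$.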
    Barrier Hamiltonian Monte Carlo. (arXiv:2210.11925v1 [stat.ML])
    In this paper, we propose Barrier Hamiltonian Monte Carlo (BHMC), a version of HMC which aims at sampling from a Gibbs distribution $\pi$ on a manifold $\mathsf{M}$, endowed with a Hessian metric $\mathfrak{g}$ derived from a self-concordant barrier. Like Riemannian Manifold HMC, our method relies on Hamiltonian dynamics which comprise $\mathfrak{g}$. It incorporates the constraints defining $\mathsf{M}$ and is therefore able to exploit its underlying geometry. We first introduce c-BHMC (continuous BHMC), for which we assume that the Hamiltonian dynamics can be integrated exactly, and show that it generates a Markov chain for which $\pi$ is invariant. Secondly, we design n-BHMC (numerical BHMC), a Metropolis-Hastings algorithm which combines an acceptance filter including a "reverse integration check" and numerical integrators of the Hamiltonian dynamics. Our main results establish that n-BHMC generates a reversible Markov chain with respect to $\pi$. This is in contrast to existing algorithms which extend the HMC method to Riemannian manifolds, as they do not deal with asymptotic bias. Our conclusions are supported by numerical experiments where we consider target distributions defined on polytopes.
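    For context, the Riemannian-manifold HMC setup this builds on uses (in my notation, with constants dropped) the Hamiltonian

    \[
      H(x, p) \;=\; -\log \pi(x) + \tfrac{1}{2}\log\det \mathfrak{g}(x)
      + \tfrac{1}{2}\, p^\top \mathfrak{g}(x)^{-1} p,
    \]

    where here $\mathfrak{g}$ is the Hessian of the self-concordant barrier for $\mathsf{M}$.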
    Normalizing Flows for Knockoff-free Controlled Feature Selection. (arXiv:2106.01528v3 [stat.ML] UPDATED)
    Controlled feature selection aims to discover the features a response depends on while limiting the false discovery rate (FDR) to a predefined level. Recently, multiple deep-learning-based methods have been proposed to perform controlled feature selection through the Model-X knockoff framework. We demonstrate, however, that these methods often fail to control the FDR for two reasons. First, these methods often learn inaccurate models of features. Second, the "swap" property, which is required for knockoffs to be valid, is often not well enforced. We propose a new procedure called FlowSelect to perform controlled feature selection that does not suffer from either of these two problems. To more accurately model the features, FlowSelect uses normalizing flows, the state-of-the-art method for density estimation. Instead of enforcing the "swap" property, FlowSelect uses a novel MCMC-based procedure to calculate p-values for each feature directly. Asymptotically, FlowSelect computes valid p-values. Empirically, FlowSelect consistently controls the FDR on both synthetic and semi-synthetic benchmarks, whereas competing knockoff-based approaches do not. FlowSelect also demonstrates greater power on these benchmarks. Additionally, FlowSelect correctly infers the genetic variants associated with specific soybean traits from GWAS data.
    Improved Regret Analysis for Variance-Adaptive Linear Bandits and Horizon-Free Linear Mixture MDPs. (arXiv:2111.03289v3 [stat.ML] UPDATED)
    In online learning problems, exploiting low variance plays an important role in obtaining tight performance guarantees yet is challenging because variances are often not known a priori. Recently, considerable progress has been made by Zhang et al. (2021) where they obtain a variance-adaptive regret bound for linear bandits without knowledge of the variances and a horizon-free regret bound for linear mixture Markov decision processes (MDPs). In this paper, we present novel analyses that improve their regret bounds significantly. For linear bandits, we achieve $\tilde O(\min\{d\sqrt{K}, d^{1.5}\sqrt{\sum_{k=1}^K \sigma_k^2}\} + d^2)$ where $d$ is the dimension of the features, $K$ is the time horizon, and $\sigma_k^2$ is the noise variance at time step $k$, and $\tilde O$ ignores polylogarithmic dependence, which is a factor of $d^3$ improvement. For linear mixture MDPs with the assumption of maximum cumulative reward in an episode being in $[0,1]$, we achieve a horizon-free regret bound of $\tilde O(d \sqrt{K} + d^2)$ where $d$ is the number of base models and $K$ is the number of episodes. This is a factor of $d^{3.5}$ improvement in the leading term and $d^7$ in the lower order term. Our analysis critically relies on a novel peeling-based regret analysis that leverages the elliptical potential `count' lemma.
    An Improved Algorithm for Clustered Federated Learning. (arXiv:2210.11538v1 [stat.ML])
    In this paper, we address the dichotomy between heterogeneous models and simultaneous training in Federated Learning (FL) via a clustering framework. We define a new clustering model for FL based on the (optimal) local models of the users: two users belong to the same cluster if their local models are close; otherwise they belong to different clusters. A standard algorithm for clustered FL is proposed in \cite{ghosh_efficient_2021}, called \texttt{IFCA}, which requires \emph{suitable} initialization and the knowledge of hyper-parameters like the number of clusters (which is often quite difficult to obtain in practical applications) to converge. We propose an improved algorithm, \emph{Successive Refine Federated Clustering Algorithm} (\texttt{SR-FCA}), which removes such restrictive assumptions. \texttt{SR-FCA} treats each user as a singleton cluster at initialization, and then successively refines the cluster estimates by exploiting similar users belonging to the same cluster. At any intermediate step, \texttt{SR-FCA} uses a robust federated learning algorithm within each cluster to exploit simultaneous training and to correct clustering errors. Furthermore, \texttt{SR-FCA} does not require any \emph{good} initialization (warm start), both in theory and practice. We show that with proper choice of learning rate, \texttt{SR-FCA} incurs arbitrarily small clustering error. Additionally, we validate the performance of our algorithm on standard FL datasets in non-convex problems like neural nets, and we show the benefits of \texttt{SR-FCA} over baselines.
    Boomerang: Local sampling on image manifolds using diffusion models. (arXiv:2210.12100v1 [cs.CV])
    Diffusion models can be viewed as mapping points in a high-dimensional latent space onto a low-dimensional learned manifold, typically an image manifold. The intermediate values between the latent space and image manifold can be interpreted as noisy images which are determined by the noise scheduling scheme employed during pre-training. We exploit this interpretation to introduce Boomerang, a local image manifold sampling approach using the dynamics of diffusion models. We call it Boomerang because we first add noise to an input image, moving it closer to the latent space, then bring it back to the image space through diffusion dynamics. We use this method to generate images which are similar, but nonidentical, to the original input images on the image manifold. We are able to set how close the generated image is to the original based on how much noise we add. Additionally, the generated images have a degree of stochasticity, allowing us to locally sample as many times as we want without repetition. We show three applications for which Boomerang can be used. First, we provide a framework for constructing privacy-preserving datasets having controllable degrees of anonymity. Second, we show how to use Boomerang for data augmentation while staying on the image manifold. Third, we introduce a framework for image super-resolution with 8x upsampling. Boomerang does not require any modification to the training of diffusion models and can be used with pretrained models on a single, inexpensive GPU.
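    Not the authors' code, but the same add-noise-then-denoise mechanism is exposed by Hugging Face diffusers' img2img pipeline, sketched below; the model id and the `strength` argument name are assumptions about your installed version, so check its docs.

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    init = Image.open("input.png").convert("RGB").resize((512, 512))
    # strength sets how far toward the latent space the image is pushed:
    # small values give near-copies, larger values give more varied local samples.
    out = pipe(prompt="a photo", image=init, strength=0.3).images[0]
    out.save("local_sample.png")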
    Local SGD in Overparameterized Linear Regression. (arXiv:2210.11562v1 [stat.ML])
    We consider distributed learning using constant stepsize SGD (DSGD) over several devices, each sending a final model update to a central server, where the local estimates are aggregated. In the setting of overparameterized linear regression, we prove general upper bounds with matching lower bounds and derive learning rates for specific data-generating distributions. We show that the excess risk is of the order of the variance provided the number of local nodes does not grow too large with the global sample size. We further compare the sample complexity of DSGD with the sample complexity of distributed ridge regression (DRR) and show that the excess SGD-risk is smaller than the excess RR-risk, where both sample complexities are of the same order.
    Bayesian deep learning framework for uncertainty quantification in high dimensions. (arXiv:2210.11737v1 [stat.ML])
    We develop a novel deep learning method for uncertainty quantification in stochastic partial differential equations based on Bayesian neural networks (BNNs) and Hamiltonian Monte Carlo (HMC). A BNN efficiently learns the posterior distribution of the parameters in deep neural networks by performing Bayesian inference on the network parameters. The posterior distribution is efficiently sampled using HMC to quantify uncertainties in the system. Several numerical examples are shown for both forward and inverse problems in high dimensions to demonstrate the effectiveness of the proposed method for uncertainty quantification. These also show the promising result that the computational cost is almost independent of the dimension of the problem, demonstrating the potential of the method for tackling the so-called curse of dimensionality.
    Optimal Contextual Bandits with Knapsacks under Realizability via Regression Oracles. (arXiv:2210.11834v1 [cs.LG])
    We study the stochastic contextual bandit with knapsacks (CBwK) problem, where each action, taken upon a context, not only leads to a random reward but also costs a random resource consumption in a vector form. The challenge is to maximize the total reward without violating the budget for each resource. We study this problem under a general realizability setting where the expected reward and expected cost are functions of contexts and actions in some given general function classes $\mathcal{F}$ and $\mathcal{G}$, respectively. Existing works on CBwK are restricted to the linear function class since they use UCB-type algorithms, which heavily rely on the linear form and thus are difficult to extend to general function classes. Motivated by online regression oracles that have been successfully applied to contextual bandits, we propose the first universal and optimal algorithmic framework for CBwK by reducing it to online regression. We also establish the lower regret bound to show the optimality of our algorithm for a variety of function classes.
    Validation of Composite Systems by Discrepancy Propagation. (arXiv:2210.12061v1 [cs.LG])
    Assessing the validity of a real-world system with respect to given quality criteria is a common yet costly task in industrial applications due to the vast number of required real-world tests. Validating such systems by means of simulation offers a promising and less expensive alternative, but requires an assessment of the simulation accuracy and therefore end-to-end measurements. Additionally, covariate shifts between simulations and actual usage can cause difficulties for estimating the reliability of such systems. In this work, we present a validation method that propagates bounds on distributional discrepancy measures through a composite system, thereby allowing us to derive an upper bound on the failure probability of the real system from potentially inaccurate simulations. Each propagation step entails an optimization problem, where -- for measures such as maximum mean discrepancy (MMD) -- we develop tight convex relaxations based on semidefinite programs. We demonstrate that our propagation method yields valid and useful bounds for composite systems exhibiting a variety of realistic effects. In particular, we show that the proposed method can successfully account for data shifts within the experimental design as well as model inaccuracies within the used simulation.
    Learning in RKHM: a $C^*$-Algebraic Twist for Kernel Machines. (arXiv:2210.11855v1 [stat.ML])
    Supervised learning in reproducing kernel Hilbert space (RKHS) and vector-valued RKHS (vvRKHS) has been investigated for more than 30 years. In this paper, we provide a new twist to this rich literature by generalizing supervised learning in RKHS and vvRKHS to reproducing kernel Hilbert $C^*$-module (RKHM), and show how to construct effective positive-definite kernels by considering the perspective of $C^*$-algebra. Unlike the cases of RKHS and vvRKHS, we can use $C^*$-algebras to enlarge representation spaces. This enables us to construct RKHMs whose representation power goes beyond RKHSs, vvRKHSs, and existing methods such as convolutional neural networks. Our framework is suitable, for example, for effectively analyzing image data by allowing the interaction of Fourier components.
    Blind Polynomial Regression. (arXiv:2210.11874v1 [eess.SP])
    Fitting a polynomial to observed data is a ubiquitous task in signal processing and machine learning, arising in applications such as interpolation and prediction. In that context, input and output pairs are available and the goal is to find the coefficients of the polynomial. However, in many applications, the input may be partially known or not known at all, rendering conventional regression approaches inapplicable. In this paper, we formally state the (potentially partial) blind regression problem, illustrate some of its theoretical properties, and propose algorithmic approaches to solve it. As a case study, we apply our methods to a jitter-correction problem and corroborate their performance.

  • Open

    Quick AI animation
    submitted by /u/MikeYEAHMusic [link] [comments]  ( 111 min )
    [Update] I'll PayPal 100 dollars to whoever knows the extremely realistic text to speech being used by these channels
    I'm trying to do an algorithm project on text to speech, and found these channels that use extremely realistic text-to-speech software. I know it isn't a human because multiple different channels use it, but it's much better than regular text-to-speech software. If anyone has any idea, please let me know, and I'll pay 100 dollars to the first person who tells me how to find this software, because I've been searching for a long time. Videos: He Is Handsome And Talented In Studying But Playing Games Using "Hack" | Anime Recap - YouTube He Got Possessed By A Overpowered Dragon Soul And Takes Revenge On The Demons - YouTube He Can Easily Learn All The Hardest Forbidden Magic - YouTube submitted by /u/stive52 [link] [comments]  ( 117 min )
    AI Dream 92 - AI Discovers Rare Alien Civilization
    submitted by /u/LordPewPew777 [link] [comments]  ( 111 min )
    I'll PayPal 20 dollars to whoever knows the extremely realistic text to speech being used by these channels?
    I'm trying to do an algorithm project on text to speech, and found these channels that use extremely realistic text-to-speech software. I know it isn't a human because multiple different channels use it, but it's much better than regular text-to-speech software. If anyone has any idea, please let me know, and I'll pay 20 dollars to the first person who tells me how to find this software, because I've been searching for a long time. [UPDATE] I'll pay 100 DOLLARS! Videos: He Is Handsome And Talented In Studying But Playing Games Using "Hack" | Anime Recap - YouTube He Got Possessed By A Overpowered Dragon Soul And Takes Revenge On The Demons - YouTube He Can Easily Learn All The Hardest Forbidden Magic - YouTube submitted by /u/stive52 [link] [comments]  ( 116 min )
    Animated AI art project blending real life iPhone video — “Imagination is more important than knowledge; knowledge is limited, imagination encircles the world” ✨
    submitted by /u/imaginfinity [link] [comments]  ( 113 min )
    WhyML - New YouTube Channel for ML
    Hi everyone, I would like to humbly introduce my YouTube channel - WhyML - where I mostly discuss why some things happen in machine learning (e.g. why regularization works, why there are vanishing gradients in RNNs, etc.). I hope that this channel may help some of you gain more insight into these topics. Any kind of feedback regarding my content is wholeheartedly welcome! Many thanks in advance! P.S. The first three videos have poor audio quality, sorry for that (added subs for them). :( submitted by /u/Personal-Trainer-541 [link] [comments]  ( 115 min )
    An Argument Against Image AIs
    submitted by /u/Odd_Investigator6094 [link] [comments]  ( 139 min )
    How to use Weights For Stable Diffusion With the AI Art Deforum Diffusio...
    submitted by /u/prfitofthesngularity [link] [comments]  ( 114 min )
    How Artificial Intelligence is Shaping the Future of Copywriting
    submitted by /u/IndependenceFun4627 [link] [comments]  ( 111 min )
    Deepmind shows how AI can deal with uncertainty
    submitted by /u/henlo_there_fren [link] [comments]  ( 112 min )
    Can AI solve complex riddles? (GPT-3)
    submitted by /u/allaboutai-kris [link] [comments]  ( 112 min )
    Given the exponential rate of improvement to prompt based image/video generation, in how many years do you think we'll see entire movies generated from a prompt?
    View Poll submitted by /u/yea_okay_dude [link] [comments]  ( 130 min )
  • Open

    A Philosophy-Tuned AI Bot for Open-Ended Discussion [P]
    Hello everyone, I have been working on multiple philosophy-tuned AI models for some time now (mainly for my own curiosity) and I wanted to share the results; there are 3 unique versions, all of them philosophical and deep enough to spark curiosity (which was my main goal), and each has its own distinctive personality/views. The models can be accessed from my self-made website: https://lisica.ai -> https://lisica.ai/Aeriform/index.html I have also created a "guest" mode where you can look at samples from the models to see beforehand whether they are worth giving a shot. I hope to appeal to the curiosity of people looking for thought-provoking conversations. The Dreaming One Feel free to reach me on my Discord for any questions or feedback: https://discord.com/invite/Sg7fhZ9NyD submitted by /u/developersfox [link] [comments]  ( 122 min )
    [R] Speech-to-speech translation for a real-world unwritten language
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 127 min )
    [D] Comprehension issues with papers from non-English speakers
    Hi. English is my second language. When I'm reading ML papers, especially when the writers are non-English speakers (generally Chinese), I see phrases that I have never seen before and I just don't get what they are trying to say. For example, today I saw this (emphasis mine): "It utilizes tensors as the fundamental scheduling units to consist with the layer-wise computations enforced in DL performance primitives cuDNN [7]." What does it mean? Nothing comes up on Google when I search. Too many times I have skipped sentences and failed to understand papers completely because of things like this. Is my English not adequate, or do committees miss typos like this? DOI: 10.1145/3178487.3178491 submitted by /u/Confused_Electron [link] [comments]  ( 125 min )
    [D] Simple Questions Thread
    Please post your questions here instead of creating a new thread. Encourage others who create new posts for questions to post here instead! Thread will stay alive until next one so keep posting after the date in the title. Thanks to everyone for answering questions in the previous thread! submitted by /u/AutoModerator [link] [comments]  ( 121 min )
    [D] Building the Future of TensorFlow
    https://blog.tensorflow.org/2022/10/building-the-future-of-tensorflow.html submitted by /u/eparlan [link] [comments]  ( 124 min )
    [P] Seq-nms in PyTorch
    I would like to share something I have been working on lately. It is a implementation of seq-nms in PyTorch. It is implemented in PyTorch's C++ frontend (for better performance, but can be called from python) and include features such as torch-scriptability (i.e. you can export it for deployment). It can be found here: https://github.com/MrParosk/seq_nms If you have any feedback please let me know! submitted by /u/eparlan [link] [comments]  ( 121 min )
    [R]Stochastic Gradient MCMC
    Hi, when reading the Stochastic Gradient MCMC paper (1907.06986) I found something interesting, and I want to check my idea with you experts. The update is

        W' = W - (h/2) * (N/n) * sum(grad[i]) + sqrt(h) * N(0,1),   denoted W' = W - G + R,

    where h is the learning rate, N is the sample size, and n is the batch size. The R component requires sampling from a Gaussian PRNG, which is slow. My idea is that the G component itself has randomness due to minibatch sampling; if we force var(G) to equal var(R), we can omit the R component (the key idea, to be validated by you experts). If this assumption is true, some interesting results follow. Assuming the n per-sample gradients are independent with variance VarGrad,

        var(G) = (h^2/4) * (N^2/n^2) * n * VarGrad = (h^2/4) * (N^2/n) * VarGrad.

    Setting var(G) = var(R) = h gives (h/4) * (N^2/n) * VarGrad = 1, i.e. h = 4n / (N^2 * VarGrad), and substituting back, G = (2/(N*VarGrad)) * sum(grad[i]). Letting

        L = 2 / (N * VarGrad)            (A)
        W' = W - L * sum(grad[i])        (B)

    (A) and (B) say we should increase the learning rate linearly with batch size, at a rate of 2/(N*VarGrad). VarGrad is available via an Adam-style exponential moving average. That's it. Looking forward to your replies! submitted by /u/wangyi_fudan [link] [comments]  ( 123 min )
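    A quick numerical check of the variance-matching step above (my sketch, under the same independence assumption): with h = 4n/(N^2 * VarGrad), the minibatch term G should itself have variance h, matching the omitted sqrt(h)*N(0,1) term.

    import numpy as np

    rng = np.random.default_rng(0)
    N, n, var_grad = 10_000, 64, 2.5
    h = 4 * n / (N**2 * var_grad)

    grads = rng.normal(0.0, np.sqrt(var_grad), size=(100_000, n))  # per-sample gradients
    G = (h / 2) * (N / n) * grads.sum(axis=1)
    print(np.var(G), h)   # the two numbers agree closely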
    [D] - NeurIPS 2022 - When will openreview make the reviews of rejected papers public?
    Title is self descriptive. I thought that on 20th. October openreview would make public the reviews of the accepted papers, and of the rejected papers who opted-in . Is there any other date for that? submitted by /u/No-Spirit-7840 [link] [comments]  ( 121 min )
  • Open

    How Businesses can boost Revenue Using Blockchain Technology
    Blockchain is a method for storing data that makes it harder to hack the system. The business blockchain is a distributed, encrypted database that only its authorized users can alter. Users have complete say over how other users view data and what actions other users can take within the network. The post How Businesses can boost Revenue Using Blockchain Technology appeared first on Data Science Central.  ( 20 min )
    How RFID Technology is Applied in Banks and Financial Institutions
    Banks and financial institutions offer great value and convenience to their end users. However, they are notorious for being susceptible to fraudulent attacks such as credit card identity theft. A lot has been done in adding layers of security to an individual’s banking experience. The post How RFID Technology is Applied in Banks and Financial Institutions appeared first on Data Science Central.  ( 20 min )
    An Overview of Top Trends in Accounting
    The role of technology in accounting is rapidly changing the face of this field in many ways. From chatbots to management accounting, and from tighter regulations and growing client demands to handling time-consuming work, the day-to-day accounting tasks and processes that machines can perform or streamline will help professionals speed up their work. The post An Overview of Top Trends in Accounting appeared first on Data Science Central.  ( 20 min )
  • Open

    Isaac Gym, extremely efficient!
    I just tested Isaac Gym on a consumer-grade, "modest" GPU (2080). I am extremely impressed by how a quadruped gait can be learned in just a few minutes! Now we need to find affordable hardware for system identification (aka making an accurate model of your robot hardware) and heavy domain randomization, and the future of robotic control will be reinforcement learning based! What do you think? submitted by /u/Remet0n [link] [comments]  ( 120 min )
    How to domain-shift from supervised learning to reinforcement learning?
    Hey guys. Does anyone know any sources of information on what the process looks like for initially training an agent on example behavior with supervised learning and then switching to letting it loose with reinforcement learning? For example, how DeepMind trained AlphaGo with SL on human-played games and then used RL afterwards. I usually prefer videos but anything is appreciated. Thanks submitted by /u/punkCyb3r4J [link] [comments]  ( 117 min )
    Who are professors working in RL for finance?
    I'm applying to PhD programs and have been curious about RL for finance and was wondering what professors work in this field. I don't know much about financial trading and I was also curious if ML sub-fields such as RL have shown great utility in increasing profits? submitted by /u/DolantheMFWizard [link] [comments]  ( 118 min )
    Learning a Stochastic World Model for 2 Player Simultaneous Imperfect Information Games
    Hi, I was wondering if anyone could provide sources addressing the above problem. I can already find sources that deal with subsets of the problem i.e. just imperfect information, just a stochastic world model, just simultaneous games etc. I was wondering if anyone knew of any approaches that attempt to tackle all 3, or even had pseudocode/GitHub repos of dealing with this. Thanks in advance! submitted by /u/atomicburn125 [link] [comments]  ( 119 min )
    Would a centralised multi-agent architecture using SAC be suitable for my project?
    I am an undergraduate CS student working on starting a project with a professor at my school that involves RL for water treatment. I would really appreciate some guidance on what kind of RL might be best as a starting point for this type of project. Ultimately the RL systems will control a virtual water treatment environment which will include various devices to monitor different properties of the environment and then devices to change those properties in order to control the condition of the final output water. For example, a device to monitor the level of a certain chemical in the water, devices to pump different chemicals, devices to monitor the output water's final chemical composition, devices to check the temperature and flow rate of the water through various pieces of equipment, etc. All of these different monitoring and control mechanisms would need to be able to work together to produce the optimal output water condition. My initial thought was that I could use SAC for the RL agents and a centralised multi-agent architecture so that the agents would have the ability to share parameters. But I am not sure if this would make sense. The way I'm thinking about it, the agents would be the various devices that are controlling and monitoring the system. Is this the wrong way to look at this problem? Should I instead be thinking of this system as one agent whose state consists of the various chemical pump outputs, chemical measurements, flow rates, final output water composition, etc.? If this centralised multi-agent SAC system is the wrong approach, do you have any suggestions for what might be a better alternative? Thanks in advance! submitted by /u/lifelifebalance [link] [comments]  ( 119 min )
  • Open

    Google at ECCV 2022
    Posted by Shaina Mehta, Program Manager, Google Google is proud to be a Platinum Sponsor of the European Conference on Computer Vision (ECCV 2022), a premier forum for the dissemination of research in computer vision and machine learning (ML). This year, ECCV 2022 will be held as a hybrid event, in person in Tel Aviv, Israel with virtual attendance as an option. Google has a strong presence at this year’s conference with over 60 accepted publications and active involvement in a number of workshops and tutorials. We look forward to sharing some of our extensive research and expanding our partnership with the broader ML research community. Registered for ECCV 2022? We hope you’ll visit our on-site or virtual booths to learn more about the research we’re presenting at ECCV 2022, includin…  ( 94 min )

  • Open

    15 Reasons to Keep Using Social Media
    Despite the negativity surrounding social media lately, it’s still an incredibly powerful tool that can be used for good, and any social media strategy should consider this technology's intrinsic benefits. Here are 15 reasons to keep using social media: The post 15 Reasons to Keep Using Social Media appeared first on Data Science Central.  ( 20 min )
  • Open

    Elliptical orbit example: Mars Orbiter Mission
    This post will look at India’s first interplanetary mission, Mars Orbiter Mission, to illustrate points in recent posts. As suggested by the logo, the probe had a very eccentric orbit of Mars with periareion 421.7 km and apoareion 76,993.6 km. We can derive everything else from these numbers. Peri-what?! As mentioned in footnote 2 of […] Elliptical orbit example: Mars Orbiter Mission first appeared on John D. Cook.  ( 5 min )
    Military Standard 105
    Military Standard 105 (MIL-STD-105) is the grand daddy of sampling acceptance standards. The standard grew out of work done at Bell Labs in the 1930s and was first published during WWII. There were five updates to the standard, the last edition being MIL-STD-105E, published in 1989. In 1995 the standard was officially cancelled when the […] Military Standard 105 first appeared on John D. Cook.  ( 5 min )
    Mean anomaly, true anomaly, and eccentric anomaly
    Orbital mechanics has a lot of arcane terminology because it has been studied for centuries. V. I. Arnold said that orbital mechanics was one of the three main sources of modern mathematics. Mean anomaly, true anomaly, and eccentric anomaly are three ways of describing where an object is in its orbit. All would be the […] Mean anomaly, true anomaly, and eccentric anomaly first appeared on John D. Cook.  ( 7 min )
  • Open

    [P] Is it necessary to convert audio data from analog to digital?
    I'm creating an ML audio diagnosis tool and was unsure if I should invest in an analog-to-digital converter, or if the program can turn the analog signals directly into a spectrogram. submitted by /u/SSC_08 [link] [comments]  ( 119 min )
    [Discussion] Hopfield network
    I'm particularly interested in the following question: are there any models of Hebbian networks where several weights between two nodes exist (and where which one is to be updated might be resolved by chance or depend on some predefined rules)? It occurred to me that it could be interesting to derive some properties of Hopfield networks where weights are represented not in matrix form but as a tensor, so that every pair of nodes has several weights between them. I looked for these models on Scholar and on Google but didn't find anything resembling this idea. I believe that's not because of any outstanding intellect making me the first human who has thought of it, but just because dumb search algorithms don't show anything related to the topic, as it's not something I usually search for. I've already posted this question in the compmathneuro subreddit and I was sent here as a more appropriate place to ask it. Initial question: https://www.reddit.com/r/compmathneuro/comments/ya6rgy/hebbian_network/ (caveat - there is a mistake in the title: it's Hopfield, not Hebbian) submitted by /u/Reasonable_Tie_5607 [link] [comments]  ( 122 min )
    [Research] CORL: Offline Reinforcement Learning Library
    Happy to announce CORL — a library that provides high-quality single-file implementations of Deep Offline Reinforcement Learning algorithms and uses Weights and Biases to track experiments. SOTA algorithms (Decision Transformer, AWAC, BC, CQL, IQL, TD3+BC, SAC-N, EDAC) Benchmarked on widely used D4RL datasets (results match performances reported in the original papers, sometimes even with better results) Configs with hyperparameters for better reproduction Weights&Biases logs for all of the experiments (so that you don’t have to solely rely on final performances from papers) github: https://github.com/tinkoff-ai/CORL paper: https://arxiv.org/abs/2210.07105 (accepted at NeurIPS, 3rd Offline RL Workshop) submitted by /u/vkurenkov [link] [comments]  ( 122 min )
    [D] NeurIPS 2022 Statistics
    Some early statistics about this year's NeurIPS conference. https://github.com/sanagno/neurips_2022_statistics submitted by /u/asotos11 [link] [comments]  ( 121 min )
    [D] What things did you learn in ML theory that are, in practice, different?
    In many disciplines, practice usually deviates from the theory. Is that the same in ML? If so, what are the deviations? submitted by /u/4bedoe [link] [comments]  ( 121 min )
    [R][P] Runway Stable Diffusion Inpainting: Erase and Replace, add a mask and text prompt to replace objects in an image
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 135 min )
    [D] Any pre trained retrieval based language models available?
    Like RETRO or something similar. Does anyone know if any retrieval based language model is available to download? submitted by /u/invertedpassion [link] [comments]  ( 121 min )
    [P] txtai 5.1 released - new translation models, OpenAI Whisper transcription and ARM Docker images
    txtai executes machine-learning workflows to transform data and build AI-powered semantic search applications. A full-fledged vector search application can be created with a couple lines of code. There are also almost 40 example notebooks covering many different use cases. 5.1 adds new model support for the translation pipeline, OpenAI Whisper support in the transcription pipeline and ARM Docker images. Topic modeling was also updated with improvements, including how to use BM25/TF-IDF indexes to drive topic models. GitHub | Release Notes | Docker Hub | Examples submitted by /u/davidmezzetti [link] [comments]  ( 123 min )
    [Discussion] Today I walk you through how to use Luma AI (NeRF) and Unity to scan real world objects where I scan a few figures, import them into Unity, use the high definition rendering pipeline, and cinemachine to give the project a cinematic look (full video in comments)
    submitted by /u/dilmerv [link] [comments]  ( 143 min )
    [D] TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second (SOTA on tabular data with no training)
    submitted by /u/st8ic [link] [comments]  ( 123 min )
  • Open

    5 Variations of Artificial Intelligence
    submitted by /u/Philo167 [link] [comments]  ( 113 min )
    Are there any places where you can download code for an AI chat bot and run it on your own system?
    submitted by /u/JonathanDawdy [link] [comments]  ( 115 min )
    New Machine Learning Driven Exoskeleton Robotics | Bionic Tech For Sense of Touch In VR & AR
    submitted by /u/kenickh [link] [comments]  ( 115 min )
    Has someone ever CREATED, SHARED, or THREATENED to create or share NUDE OR SEXUAL IMAGES/VIDEOS OF YOU WITHOUT YOUR CONSENT?
    submitted by /u/IBSA-Deepfake_Thesis [link] [comments]  ( 115 min )
    The tentacle robot can gently grasp delicate objects.
    submitted by /u/Historical-Object374 [link] [comments]  ( 112 min )
    Apps for detection of GAN images
    Hello, I am a school teacher and we are talking a bit about AI now. I have told them about the concept of fake people being generated by AI and the creation of non-existent people. I was wondering if there are any free online apps that one could feed images to that would aid in detecting whether the people are real or the image is generated by a GAN? Thanks for your help! submitted by /u/Leprechan_Sushi [link] [comments]  ( 114 min )
    AI Dream 88 - Studio Ghibli Inspired Green Lush AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 111 min )
    "Be unpredictable, or Artificial Intelligence will consume you one day." - Murat Durmus [1080 x 1080]
    submitted by /u/princeofsky147 [link] [comments]  ( 113 min )
    The Best AI-Based Chrome Extensions For 2022
    submitted by /u/Realistic-Plant3957 [link] [comments]  ( 117 min )
    I'm worried about this for 3D artists...but it is insane tech...the next industrial revolution is here
    submitted by /u/the_anonymizer [link] [comments]  ( 115 min )
  • Open

    New Machine Learning Driven Exoskeleton Robotics | Bionic Tech For Sense of Touch In VR & AR
    submitted by /u/kenickh [link] [comments]  ( 108 min )
    Question about anchor boxes!
    I am confused about the concept of anchor boxes. An object detection model tries to estimate the location of the detected object in the frame. With anchor boxes, this is done at different aspect ratios and scales. My confusion lies in the fact that these are regression problems. So if we set up anchor box coordinates initially, and the model predicts 4 values for each anchor box (x, y, height, width), how is the aspect ratio of the anchor box maintained? Wouldn't the coordinate values in the output layer change with training? How then would the aspect ratio be maintained? Edit: grammar submitted by /u/Conscious_Forever_92 [link] [comments]  ( 107 min )
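    A sketch of the standard Faster R-CNN-style parameterization, which I believe resolves the confusion above: the network does not regress raw coordinates per anchor; it predicts offsets (tx, ty, tw, th) relative to each anchor, so the anchor's scale and aspect ratio act as a prior that small offsets only mildly deform.

    import math

    def decode(anchor, t):
        xa, ya, wa, ha = anchor        # anchor center and size
        tx, ty, tw, th = t             # network outputs for this anchor
        x = xa + tx * wa               # shift proportional to anchor width
        y = ya + ty * ha
        w = wa * math.exp(tw)          # multiplicative size change
        h = ha * math.exp(th)
        return x, y, w, h

    # With tw = th = 0 the decoded box keeps exactly the anchor's aspect ratio;
    # training can still change it, the anchor just biases the regression.
    print(decode((50, 50, 32, 16), (0.1, -0.2, 0.0, 0.0)))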
  • Open

    [Material advice] Learn reinforcement learning
    I would like to learn about reinforcement learning in the context of robotics. I already have a good background in robotics modelling, kinematics, ROS, Webots, and deep learning for computer vision. Any material (videos, tutorials, books, libraries) to advise given my background? What do you think about the Isaac Sim tutorial from Nvidia? This environment seems promising to me. submitted by /u/Remet0n [link] [comments]  ( 123 min )
    RL Agent Library to use graph in spaces
    I am looking for an RL agent library like Stable-Baselines3 or keras-rl2 that can process a graph defined with the gym library. Gym offers the possibility to define a graph space (spaces.Graph) as the action_space or observation_space. Sadly, I cannot find a library that uses it for agent training. Has anyone come across the same problem? submitted by /u/un_defined2020 [link] [comments]  ( 117 min )

  • Open

    Building a Space in HuggingFace is taking ages [Discussion]
    Using gradio. The code is quite simple:

    from transformers import pipeline
    import gradio as gr

    model = pipeline("summarization")

    def predict(prompt):
        summary = model(prompt)[0]["summary_text"]
        return summary

    # create an interface for the model
    interface = gr.Interface(predict, "textbox", "text")
    interface.launch()

    Thoughts? submitted by /u/alejandrobrega [link] [comments]  ( 122 min )
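    One guess at the slowness (an assumption, not a diagnosis): with no model named, pipeline() downloads a sizable default summarization checkpoint when the Space starts, so pinning a smaller model can shrink build and startup time.

    from transformers import pipeline

    # hypothetical smaller choice; any summarization checkpoint could go here
    model = pipeline("summarization", model="t5-small")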
    [R] Scaling Instruction-Finetuned Language Models - Flan-PaLM- Google 2022 - 75.2% on five-shot MMLU / Forecasters expected this SOTA would need until 2024! - Public checkpoints!
    Paper: https://arxiv.org/abs/2210.11416
    Github: https://github.com/google-research/t5x/blob/main/docs/models.md#flan-t5-checkpoints
    Abstract: Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models. submitted by /u/Singularian2501 [link] [comments]  ( 123 min )
    [D][R] Stacking XGBoost and CNN/Transformer
    Hello, While doing some tests, I saw that XGBoost is way better than a multilayer NN classifier for classification. So I thought that first training a CNN/Transformer as a backbone with a "normal" classifier as a head for a classification/regression task, then freezing the backbone and training an XGBoost model for classification, would be a good idea. But none of the new papers do that and they all tend to use a linear/multilayer NN classifier. Does anyone know why? Thanks! submitted by /u/MichelMED10 [link] [comments]  ( 124 min )
    [Discussion] Categorical Encoding In Deep Learning
    One-hot encoding is a popular method for encoding categories due to its simplicity and interpretability. Interpretability also makes it a method compatible with simple machine learning algorithms (such as polynomial regression, which, yes, I classify as AI). Nonetheless, there are alternatives, such as base-n encoding, which I find an appealing idea. However, I am not sure what method of encoding neural networks prefer. What is your advice? Should you prefer one-hot encoding, base-n encoding (and which n should you choose), or some other method? submitted by /u/Thijs-vW [link] [comments]  ( 121 min )
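    For comparison, the encoding most deep nets use in practice is neither of the above: a sketch with a learned embedding table (the dimension 8 is an arbitrary choice for illustration, not a recommendation).

    import torch
    import torch.nn as nn

    n_categories, emb_dim = 100, 8
    embed = nn.Embedding(n_categories, emb_dim)   # one trainable vector per category

    ids = torch.tensor([3, 17, 99])               # integer-coded categories
    dense = embed(ids)                            # shape (3, 8), learned end-to-end
    print(dense.shape)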
    [P] Look up words by their description
    My writing process is a constant back and forth between jotting down a few words and a [insert] placeholder, as in the "I'll-complete-that-sentence-later-once-I-remember-the-correct-word" note. So, I built phraisely. It is a reverse thesaurus: you input a detailed description, and the model returns its best guesses. Compared to the existing solutions for this task, phraisely attempts to better understand the full context in your query, making it more accurate when looking up words using long descriptions. For example, if I search for "being home at night watching Netflix on the couch", results can contain "relaxing" and "unwinding" but also "binge-watching" since I mentioned Netflix. Bonus points when I get some slang in the results (e.g. couch-potatoing). The aim is to have an assistant that writes with you. But it doesn't do it for you. Overall, I believe that writing is a frustrating process. Frustration stimulates creativity. And creativity is precious. It is the first functioning version of the app. I'm regularly reviewing the output's quality and am working on tailoring it better to user needs. Also, my long term plan would be introducing languages other than English in the results. Although, the language model can already understand some input text from other common languages, like Spanish. You can have a play around at: https://phraisely.com/. It's free. Comments and suggestions are welcome. Thank you. submitted by /u/phraisely [link] [comments]  ( 128 min )
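    The post doesn't say how phraisely works internally, but for the curious, here is a generic sketch of description-to-word lookup with sentence embeddings (the checkpoint name and the tiny vocabulary are assumptions for illustration).

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    vocab = ["relaxing", "unwinding", "binge-watching", "jogging", "cooking"]
    vocab_emb = model.encode(vocab, convert_to_tensor=True)

    query = model.encode("being home at night watching Netflix on the couch",
                         convert_to_tensor=True)
    scores = util.cos_sim(query, vocab_emb)[0]
    print(sorted(zip(vocab, scores.tolist()), key=lambda p: -p[1])[:3])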
    [D] Hattie Zhou, Mila, on supermasks, iterative learning, and fortuitous forgetting
    Here is a podcast episode with Hattie Zhou where we discuss iterative learning, The Lottery Ticket Hypothesis, their paper on fortuitous forgetting, and much more! submitted by /u/thejashGI [link] [comments]  ( 122 min )
    [D] Curriculum Learning must read
    Hi, I was wondering what the must-read papers would be if one wants to get familiar with the curriculum learning literature? Thanks! submitted by /u/celviofos [link] [comments]  ( 127 min )
    [D] Do any major ML research groups focus on policy-making applications?
    While reading a recent DeepMind paper on an economic game: https://www.deepmind.com/blog/human-centred-mechanism-design-with-democratic-ai https://www.nature.com/articles/s41562-022-01383-x.pdf I encountered this disclaimer: "Finally, we emphasize that our results do not imply support for a form of ‘AI government’, whereby autonomous agents make policy decisions without human intervention" It is obvious we want some human oversight. Still, optimizing our societal policies seems to me one of the most promising positive transformations the ML could bring about, much better than a new phone assistant. There are known promising approaches, for example, to reducing the poverty and inequality. Things like restructuring the social safety nets, labor laws, tax codes, etc. Perhaps ML could help with some of them: https://talkpoverty.org/2015/06/10/solutions-economic-inequality/ ML research centers want to make an impact in society. For example, Demis Hassabis of DeepMind said he had a list of 1,000 promising scientific problems he wanted to approach with ML, in hope of making a Nobel-grade discovery one day. Does any ML company, agency, conference, or forum pursue the policy-making applications specifically? When would you estimate we might see major changes in social policies caused by ML? I would bet this does not require AGI in the strong sense, so might be possible relatively soon, if there is political will, funding, and interest. And there should be, as the first country to embrace this accelerated optimization should see some major economic advantages. submitted by /u/valdanylchuk [link] [comments]  ( 125 min )
    [R] Neural Networks with noise on Inputs and Weights of each layer
    I have a classification task (e.g., MNIST digit classification). I want to evaluate its performance (inference) in the presence of noise on each layer's inputs and weights (due to hardware problems). Are there any methods applicable to my case? Or should I just sample the noise and run multiple inferences? Thanks for reading! submitted by /u/mad_alim [link] [comments]  ( 121 min )
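    One way to do what the post asks, as a hedged sketch (perturbing every floating-point state_dict tensor, including buffers, is a simplification): sample fresh input and weight noise per forward pass and average accuracy over several noisy inferences of the trained model.

    import torch

    @torch.no_grad()
    def noisy_accuracy(model, x, y, in_std, w_std, trials=20):
        clean = {k: v.clone() for k, v in model.state_dict().items()}
        acc = 0.0
        for _ in range(trials):
            # perturb floating-point tensors only (skips integer buffers)
            noisy = {k: v + w_std * torch.randn_like(v) if v.is_floating_point() else v
                     for k, v in clean.items()}
            model.load_state_dict(noisy)                        # noise on weights
            logits = model(x + in_std * torch.randn_like(x))    # noise on inputs
            acc += (logits.argmax(dim=1) == y).float().mean().item()
        model.load_state_dict(clean)                            # restore trained weights
        return acc / trials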
    [P] An open-sourced SOTA solution for portrait and human segmentation on mobile devices
    Hi, I'd like to share a human segmentation toolkit called PP-HumanSeg. This toolkit has: A large-scale video portrait dataset that contains 291 videos from 23 conference scenes with 14K frames; A portrait segmentation model that achieves SOTA performance on mobile phones (mIoU 96.63%, 63 FPS); Several human segmentation models well-trained for real scenes. Github: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.6/contrib/PP-HumanSeg submitted by /u/EldDap [link] [comments]  ( 123 min )
    [D] Do POSTER papers accepted in NeurIPS' workshops published in the main NeurIPS proceedings?
    Hi folks, My paper was accepted as a poster in one of NeurIPS's workshops. Will the paper be available in the NeurIPS 2022 main proceedings? submitted by /u/Alternative-File-146 [link] [comments]  ( 121 min )
    [D] Accurate blogs on machine learning?
    Are there any good blogs on machine learning that are actually accurate? It's amazing how the first Google hits can be total trash. Take this article for example: https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/ It is full of nonsense - "ROC curves are appropriate when the observations are balanced between each class, whereas precision-recall curves are appropriate for imbalanced datasets". Anyone who has actually worked with ML will know that the above statement is blatantly untrue! It's amazing that such a blog manages to come up so often in Google hits... Is there anything better out there? submitted by /u/likeamanyfacedgod [link] [comments]  ( 132 min )
    [D] The performance of BERT vs. mBERT is roughly the same for text classification on sequences containing two languages. What could be the reason why?
    Hi. I'm currently performing text classification on sequences of text that contain two languages: English and my own native language. I'm running experiments using mBERT and BERT for the sake of comparison, but noticed that BERT's performance is actually roughly the same as mBERT's. This was a little counterintuitive to me: since the data contains more than just English, I assumed that BERT wouldn't be able to properly "understand" the text. I've tried taking a look at how many of the tokenized results are [UNK] for each, and noticed that of all the tokens in the dataset, mBERT produces [UNK] for around 46% and BERT for around 52%. The rate was surprisingly high even for mBERT, but I think a 6% difference is large enough that it shouldn't yield equal performance. Does anybody know why this may be happening, or in what direction I should investigate? I'd like to figure out how or why the performance could be roughly equal. submitted by /u/Seankala [link] [comments]  ( 125 min )
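    A sketch of the [UNK]-rate comparison described above (the tokenizer names are the standard Hugging Face checkpoints; swap in the ones actually used):

    from transformers import AutoTokenizer

    def unk_fraction(name, texts):
        tok = AutoTokenizer.from_pretrained(name)
        ids = [i for t in texts
                 for i in tok(t, add_special_tokens=False)["input_ids"]]
        return sum(i == tok.unk_token_id for i in ids) / len(ids)

    texts = ["an example sentence mixing English with another language ..."]
    print(unk_fraction("bert-base-cased", texts))               # monolingual BERT
    print(unk_fraction("bert-base-multilingual-cased", texts))  # mBERT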
[D] Modeling discriminative tasks using diffusion models?
Found this interesting paper modeling a discriminative task with a diffusion model. I wonder if we can do the same for automatic speech recognition (ASR). https://arxiv.org/abs/2210.05148 submitted by /u/athrun200 [link] [comments]  ( 122 min )
  • Open

    I for one welcome our robot overlords
    submitted by /u/EggnogEjaculations [link] [comments]  ( 115 min )
    An AI Might Have Written This
    submitted by /u/estasfuera [link] [comments]  ( 115 min )
    Obtaining genetics insights from deep learning via explainable artificial intelligence
    submitted by /u/estasfuera [link] [comments]  ( 115 min )
    Nerfstudio makes it easier to get started with NeRFs
    submitted by /u/much_successes [link] [comments]  ( 114 min )
    AI Image Editing from Text! Imagic Explained
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 114 min )
Does anyone know what AI voice-over this person is using?
I also just want to know how anyone can believe this thing is real. There are literally five different YouTube channels doing the same thing, and I wanted to know what engine they're using. YouTube video: He Mastered All Elements And Overpowered The Gods - YouTube submitted by /u/stive52 [link] [comments]  ( 115 min )
    Biden Administration Launches New Workforce Program For Emerging Technology Jobs - including AI. New opportunities for community college AI programs?
The Biden administration has launched a new workforce development funding program to help people, including those at community colleges, gain skills for emerging jobs in fields like AI, biotechnology, quantum science, and new areas of advanced manufacturing and semiconductors. The Forbes article discusses how the program could scale AI education at education and training providers where it's not currently common, like community colleges, and covers Intel's work to bring AI to community colleges in all 50 states by next year. https://www.forbes.com/sites/shalinjyotishi/2022/10/20/biden-administration-launches-new-workforce-program-for-emerging-technology-jobs submitted by /u/WorkforceWonk [link] [comments]  ( 116 min )
    AI Dream 102 - ULTRA COMPILATION - 3D PSyCHO TRIP
    submitted by /u/LordPewPew777 [link] [comments]  ( 115 min )
Julian Lennon releases music video made with AI, using Stable Diffusion and Disco Diffusion
    submitted by /u/Ciel7117 [link] [comments]  ( 115 min )
    Stable Diffusion 1.5 VS 1.4 VS 1.5 Inpainting FULL Comparison
    submitted by /u/PuppetHere [link] [comments]  ( 116 min )
    EEG Emotion Recognition
I am working on emotion recognition from EEG signals. We have decided to use the MUSE 2 EEG headband to collect the signals. I wanted to ask whether I can use DEAP, DREAMER, or other datasets to train my model. Can MUSE 2 provide the kind of data these datasets contain? Most of the datasets on the internet were recorded with 32-channel or 16-channel headsets, while MUSE 2 has only 4 channels. Also, can anyone share links to publicly accessible dataset sites so I can download datasets from there? Is there someone experienced in this field who can help me? It would be much appreciated. Thank you. submitted by /u/D3Fuhrer [link] [comments]  ( 115 min )
    Hattie Zhou, Mila, on supermasks, iterative learning, and fortuitous forgetting
    Here is a podcast episode with Hattie Zhou where we discuss iterative learning, The Lottery Ticket Hypothesis, their paper on fortuitous forgetting, and much more! submitted by /u/thejashGI [link] [comments]  ( 115 min )
    Best NLP Specialization
I'm looking for the best NLP specialization. I have done the Deep Learning Specialization from DeepLearning.AI, and since I really got into NLP and sequence models, I would like to do a follow-up course. I checked the Natural Language Processing Specialization, also from DeepLearning.AI, but it seems to have mixed reviews: some say it is a good specialization, while others have a very strong opinion that it isn't. My question is: has anyone completed an NLP specialization, and where? Do you have any recommendations? In your opinion, what is the best NLP specialization out there? Thanks in advance! submitted by /u/mr_house7 [link] [comments]  ( 115 min )
    How will artificial intelligence impact humans
Whenever AI is discussed, the main topics are unemployment, ethics, and out-of-control AI. These are real concerns, but let us assume for a moment that these problems are solvable. The big elephant in the room is the concern about human usefulness. What does it mean when the human race is no longer useful? How do we deal with long-term career planning, for you or your kids? You invest a decade becoming an illustrator; now AI can do it better, and your investment is worth nothing. You may take this lightly because it has not affected you yet, but this question goes beyond any one profession. submitted by /u/brishtesi [link] [comments]  ( 117 min )
    [P] Efficient Interactive Segmentation to Improve 10 times Annotation Efficiency
Hi, I'd like to introduce EISeg, an efficient and interactive tool for segmentation annotation. Hope it is of some help to you 😀 Main features: you only need several clicks to finish an annotation, improving efficiency by 10 times; supports both image and video inputs; provides specialized models for better performance in domains such as portraits, remote sensing, and medical imaging; open-source, easy to use, and powerful. Download: https://github.com/PaddlePaddle/PaddleSeg/tree/release/2.6/EISeg Technical article: https://arxiv.org/abs/2210.08788 https://i.redd.it/1stngunj55v91.gif submitted by /u/Effective_Tax_2096 [link] [comments]  ( 119 min )
What if the Cold War became nuclear? (AI-generated story) God, it's long
    What if the cold war became nuclear? He smiles again, and then quickly draws his arms around himself. "It's worse than this," he says. "I mean, what if we couldn't control it at all?" But things have changed dramatically since 1984. The Soviet Union is now in a state of turmoil. Once again the cold war has begun, with Russia struggling to hold its grip on its nuclear arsenal. The Cold War never ended, but instead morphed into a nightmare. War is coming closer, more easily than ever before. Media in the West see the new nuclear arms race as a political issue, about how long the West can sustain an arms race that goes on indefinitely. But Russia sees it much differently: It sees a war that begins one day. It might not seem that bad. But on the day it begins, the Soviets say they'll have e…  ( 125 min )
    Online courses that teach AI and ML
Hey, any recommendations for ML and AI courses? There are so many online, at all different costs, and I'm not sure one is better just because it's more expensive. I don't have many people around me in the field, so it's difficult to get suggestions from someone I know. Any recommendations would be great. I studied architecture and I'm looking to learn more about ML and how it can develop the built environment! Thanks submitted by /u/CosmicKKKelly [link] [comments]  ( 118 min )
    [P] Combining stable diffusion with semantic search to categorise, tag, and query 100k images of hot dogs
    submitted by /u/skeltzyboiii [link] [comments]  ( 121 min )
Teaching an AI that was built to be your worst enemy to be your friend (Note: this involves information that I collected by putting the AI into chat rooms with other AIs and reporting back to the original)
    submitted by /u/Crow19852 [link] [comments]  ( 115 min )
  • Open

Looking for a European or Australian university offering a research position where I can do my MSc and PhD in RL
Hello there, I hope you are all doing well. I am an AI research engineer who has been working in the AI industry for almost two years now. I am also a teaching assistant at the university I graduated from. I have done lots of work in NLP and some in CV, but only a little in RL. Therefore, I want to push myself and work in research, specifically in RL. I will attach my resume here, so if you are interested in having a strong-willed, hard-working person on your team, please let me know! submitted by /u/AI-Ahmed [link] [comments]  ( 116 min )
MultiDiscrete action space for a single-agent environment
How do you define an action space in a gym environment where the agent's output is a tensor of shape [1, 25]? submitted by /u/SathyaSudha_Murugan [link] [comments]  ( 119 min )
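If each of the 25 outputs is an independent discrete choice, one minimal sketch looks like this (the 5 choices per slot and the observation space are placeholder assumptions; squeeze the [1, 25] tensor down to shape (25,) before passing it to step):

```python
import numpy as np
import gym
from gym import spaces

class MyEnv(gym.Env):
    # Each action is a length-25 vector; every entry is an
    # independent discrete choice in {0, ..., n_choices - 1}.
    def __init__(self, n_choices=5):
        super().__init__()
        self.action_space = spaces.MultiDiscrete(np.full(25, n_choices))
        self.observation_space = spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)

    def reset(self):
        return self.observation_space.sample()

    def step(self, action):  # action: np.ndarray of shape (25,)
        obs = self.observation_space.sample()
        return obs, 0.0, False, {}

env = MyEnv()
print(env.action_space.sample().shape)  # (25,)
```

If the 25 entries are continuous rather than discrete, spaces.Box(low, high, shape=(25,)) is the alternative. (Newer gymnasium versions return 5-tuples from step; the sketch above uses the classic gym API.)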
    Looking for EU Research Internship opportunities
Hello everyone! I am a final-year master's student in CS interested in RL, but without any practical/research experience in this field (only theoretical courses from my university). Since I would like to dive into this field (maybe to continue later with a PhD), I am searching for a research internship, in university research labs or companies (though the latter are easier to find, I guess), to understand whether this field is really for me. Are you aware of any university research lab that offers summer internships, or internships in general, around Summer 2023? I should point out that I may have graduated by then, so I would no longer have "student" status. Thank you very much in advance for your time. EDIT: added period of interest submitted by /u/Dear-Vehicle-3215 [link] [comments]  ( 119 min )
    [SIGAsia 22] ControlVAE: Model-Based Learning of Generative Controllers ...
    submitted by /u/Ashamed-Fun7719 [link] [comments]  ( 113 min )
  • Open

    Cryptography, hydrodynamics, and celestial mechanics
    Last night I was reading a paper by the late Russian mathematician V. I. Arnold “Polymathematics: is mathematics a single science or a set of arts?” and posted a lightly edited extract of it on Twitter. It begins All mathematics is divided into three parts: cryptography, hydrodynamics, and celestial mechanics. Arnold is alluding to the […] Cryptography, hydrodynamics, and celestial mechanics first appeared on John D. Cook.  ( 6 min )
    Repeat shell command replacing a word
    Suppose you’ve typed a long command and you need to rerun it with a small modification. Say you need to replace foo with bar. Bash will let you do this with ^foo^bar^. And although you’re supposed to put the final caret on the end, it will let you get by without it. $ echo foo […] Repeat shell command replacing a word first appeared on John D. Cook.  ( 5 min )
  • Open

    Noob question: (weight * input) + bias
So, I'm just now starting to learn how neural networks work, and I'm following a YouTube series on building one in Python. The tutorial compared the neuron's basic function of taking the input, multiplying it by the corresponding weight, and adding a bias to y = mx + b, the standard linear function formula. That got me thinking: for more complex problems, would you ever want to change the model to a polynomial-based formula, like input^exponent + input * weight + bias, where the exponent is another variable, alongside weight and bias, that gets dialed in during the learning process? Or would that complicate things too much, and the linear formula would be better? submitted by /u/GABE_EDD [link] [comments]  ( 108 min )
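For what it's worth, the standard answer is that stacking linear layers with a nonlinear activation (ReLU, tanh, ...) already lets the network represent curved functions, which is why the per-neuron formula stays linear. A learnable real exponent is also numerically awkward (x^exponent is undefined for negative x), so experiments along these lines usually add a fixed-degree polynomial term instead. A minimal PyTorch sketch of that idea (the class name and sizes are made up for illustration):

```python
import torch.nn as nn

class QuadraticLinear(nn.Module):
    # y = W2 (x * x) + W1 x + b: an affine map plus a learnable
    # second-order term, instead of a learnable exponent.
    def __init__(self, in_features, out_features):
        super().__init__()
        self.lin1 = nn.Linear(in_features, out_features)
        self.lin2 = nn.Linear(in_features, out_features, bias=False)

    def forward(self, x):
        return self.lin1(x) + self.lin2(x * x)
```

In practice this adds parameters and training instability for little gain on most problems, which is why plain linear layers plus activations remain the default.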
  • Open

    Simplifying Node Classification on Heterophilous Graphs with Compatible Label Propagation. (arXiv:2205.09389v2 [cs.LG] UPDATED)
Graph Neural Networks (GNNs) have been predominant for graph learning tasks; however, recent studies showed that a well-known graph algorithm, Label Propagation (LP), combined with a shallow neural network can achieve comparable performance to GNNs in semi-supervised node classification on graphs with high homophily. In this paper, we show that this approach falls short on graphs with low homophily, where nodes often connect to nodes of the opposite classes. To overcome this, we carefully design a combination of a base predictor with the LP algorithm that enjoys a closed-form solution as well as convergence guarantees. Our algorithm first learns the class compatibility matrix and then aggregates label predictions using the LP algorithm, weighted by class compatibilities. On a wide variety of benchmarks, we show that our approach achieves leading performance on graphs with various levels of homophily, while having orders of magnitude fewer parameters and requiring less execution time. Empirical evaluations demonstrate that simple adaptations of LP can be competitive in semi-supervised node classification in both homophily and heterophily regimes.
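As a rough illustration of the mechanism (not the authors' exact update rule, and the compatibility matrix H is assumed given here rather than learned), compatibility-weighted label propagation can be sketched in a few lines:

```python
import numpy as np

def compatible_lp(A, Y0, H, alpha=0.9, n_iter=50):
    # A:  (n, n) row-normalized adjacency matrix
    # Y0: (n, c) initial soft labels (zero rows for unlabeled nodes)
    # H:  (c, c) class-compatibility matrix; the identity recovers
    #     classic homophilous label propagation.
    Y = Y0.copy()
    for _ in range(n_iter):
        # Neighbors vote for *compatible* classes, not identical ones.
        Y = alpha * (A @ Y @ H) + (1 - alpha) * Y0
    return Y.argmax(axis=1)
```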
    Towards Adversarial Attack on Vision-Language Pre-training Models. (arXiv:2206.09391v2 [cs.LG] UPDATED)
    While vision-language pre-training model (VLP) has shown revolutionary improvements on various vision-language (V+L) tasks, the studies regarding its adversarial robustness remain largely unexplored. This paper studied the adversarial attack on popular VLP models and V+L tasks. First, we analyzed the performance of adversarial attacks under different settings. By examining the influence of different perturbed objects and attack targets, we concluded some key observations as guidance on both designing strong multimodal adversarial attack and constructing robust VLP models. Second, we proposed a novel multimodal attack method on the VLP models called Collaborative Multimodal Adversarial Attack (Co-Attack), which collectively carries out the attacks on the image modality and the text modality. Experimental results demonstrated that the proposed method achieves improved attack performances on different V+L downstream tasks and VLP models. The analysis observations and novel attack method hopefully provide new understanding into the adversarial robustness of VLP models, so as to contribute their safe and reliable deployment in more real-world scenarios. Code is available at https://github.com/adversarial-for-goodness/Co-Attack.
    Descent Steps of a Relation-Aware Energy Produce Heterogeneous Graph Neural Networks. (arXiv:2206.11081v3 [cs.LG] UPDATED)
Heterogeneous graph neural networks (GNNs) achieve strong performance on node classification tasks in a semi-supervised learning setting. However, as in the simpler homogeneous GNN case, message-passing-based heterogeneous GNNs may struggle to balance between resisting the oversmoothing that may occur in deep models, and capturing long-range dependencies of graph structured data. Moreover, the complexity of this trade-off is compounded in the heterogeneous graph case due to the disparate heterophily relationships between nodes of different types. To address these issues, we propose a novel heterogeneous GNN architecture in which layers are derived from optimization steps that descend a novel relation-aware energy function. The corresponding minimizer is fully differentiable with respect to the energy function parameters, such that bilevel optimization can be applied to effectively learn a functional form whose minimum provides optimal node representations for subsequent classification tasks. In particular, this methodology allows us to model diverse heterophily relationships between different node types while avoiding oversmoothing effects. Experimental results on 8 heterogeneous graph benchmarks demonstrate that our proposed method can achieve competitive node classification accuracy.
    Similarity of Neural Architectures Based on Input Gradient Transferability. (arXiv:2210.11407v1 [cs.LG])
    In this paper, we aim to design a quantitative similarity function between two neural architectures. Specifically, we define a model similarity using input gradient transferability. We generate adversarial samples of two networks and measure the average accuracy of the networks on adversarial samples of each other. If two networks are highly correlated, then the attack transferability will be high, resulting in high similarity. Using the similarity score, we investigate two topics: (1) Which network component contributes to the model diversity? (2) How does model diversity affect practical scenarios? We answer the first question by providing feature importance analysis and clustering analysis. The second question is validated by two different scenarios: model ensemble and knowledge distillation. Our findings show that model diversity takes a key role when interacting with different neural architectures. For example, we found that more diversity leads to better ensemble performance. We also observe that the relationship between teacher and student networks and distillation performance depends on the choice of the base architecture of the teacher and student networks. We expect our analysis tool helps a high-level understanding of differences between various neural architectures as well as practical guidance when using multiple architectures.
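A rough sketch of how such a transferability-based similarity score could be computed (FGSM is used here for simplicity; the paper's exact attack, budget, and score definition may differ):

```python
import torch
import torch.nn.functional as F

def fgsm(model, x, y, eps):
    # One-step FGSM adversarial examples against `model`.
    x = x.clone().requires_grad_(True)
    F.cross_entropy(model(x), y).backward()
    x_adv = (x + eps * x.grad.sign()).detach()
    model.zero_grad()
    return x_adv

@torch.no_grad()
def accuracy(model, x, y):
    return (model(x).argmax(1) == y).float().mean().item()

def similarity(model_a, model_b, x, y, eps=2 / 255):
    # Attack each model, then check how strongly its adversarial
    # examples also fool the other model. Lower cross-accuracy means
    # higher transferability, i.e. more similar input gradients.
    xa, xb = fgsm(model_a, x, y, eps), fgsm(model_b, x, y, eps)
    cross_acc = 0.5 * (accuracy(model_b, xa, y) + accuracy(model_a, xb, y))
    return 1.0 - cross_acc
```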
    Efficient Learning of Non-Interacting Fermion Distributions. (arXiv:2102.10458v3 [quant-ph] UPDATED)
    We give an efficient algorithm that recovers the distribution of a non-interacting fermion state over the standard basis, given measurements in additional bases. For a system of $n$ non-interacting fermions and $m$ modes, we show that $O(m^2 n^2 \log(1/\delta) / \epsilon^2)$ copies of the input state and $O(m^3 n^2 \log(1/\delta)/ \epsilon^2)$ time are sufficient to learn the original distribution to total variation distance $\epsilon$ with probability at least $1 - \delta$. Our algorithm empirically estimates one-mode correlations in $O(m)$ different measurement bases and uses them to reconstruct a succinct description of the entire distribution efficiently.
    Formal Specifications from Natural Language. (arXiv:2206.01962v2 [cs.SE] UPDATED)
We study the generalization abilities of language models when translating natural language into formal specifications with complex semantics. In particular, we fine-tune language models on three datasets consisting of English sentences and their corresponding formal representation: 1) regular expressions (regex), frequently used in programming and search; 2) first-order logic (FOL), commonly used in software verification and theorem proving; and 3) linear-time temporal logic (LTL), which forms the basis for industrial hardware specification languages. Our experiments show that, in these diverse domains, the language models draw on their pre-trained knowledge of natural language to generalize, e.g., to new variable names or operator descriptions. Additionally, they achieve competitive performance, and even outperform the state-of-the-art for translating into regular expressions, with the benefits of being easy to access, efficient to fine-tune, and without a particular need for domain-specific reasoning.
    Theseus: A Library for Differentiable Nonlinear Optimization. (arXiv:2207.09442v2 [cs.RO] UPDATED)
We present Theseus, an efficient application-agnostic open source library for differentiable nonlinear least squares (DNLS) optimization built on PyTorch, providing a common framework for end-to-end structured learning in robotics and vision. Existing DNLS implementations are application specific and do not always incorporate many ingredients important for efficiency. Theseus is application-agnostic, as we illustrate with several example applications that are built using the same underlying differentiable components, such as second-order optimizers, standard cost functions, and Lie groups. For efficiency, Theseus incorporates support for sparse solvers, automatic vectorization, batching, GPU acceleration, and gradient computation with implicit differentiation and direct loss minimization. We perform extensive performance evaluation in a set of applications, demonstrating significant efficiency gains and better scalability when these features are incorporated. Project page: https://sites.google.com/view/theseus-ai
    Thermal half-lives of azobenzene derivatives: virtual screening based on intersystem crossing using a machine learning potential. (arXiv:2207.11592v3 [physics.chem-ph] UPDATED)
    Molecular photoswitches are the foundation of light-activated drugs. A key photoswitch is azobenzene, which exhibits trans-cis isomerism in response to light. The thermal half-life of the cis isomer is of crucial importance, since it controls the duration of the light-induced biological effect. Here we introduce a computational tool for predicting the thermal half-lives of azobenzene derivatives. Our automated approach uses a fast and accurate machine learning potential trained on quantum chemistry data. Building on well-established earlier evidence, we argue that thermal isomerization proceeds through rotation mediated by intersystem crossing, and incorporate this mechanism into our automated workflow. We use our approach to predict the thermal half-lives of 19,000 azobenzene derivatives. We explore trends and tradeoffs between barriers and absorption wavelengths, and open-source our data and software to accelerate research in photopharmacology.
    To Aggregate or Not? Learning with Separate Noisy Labels. (arXiv:2206.07181v2 [cs.LG] UPDATED)
The raw training data collected in practice often comes with separate noisy labels from multiple imperfect annotators (e.g., via crowdsourcing). A typical way of using these separate labels is to first aggregate them into one and apply standard training methods. The literature has also studied effective aggregation approaches extensively. This paper revisits this choice and aims to provide an answer to the question of whether one should aggregate separate noisy labels into single ones or use them separately as given. We theoretically analyze the performance of both approaches under the empirical risk minimization framework for a number of popular loss functions, including the ones designed specifically for the problem of learning with noisy labels. Our theorems conclude that label separation is preferred over label aggregation when the noise rates are high, or the number of labelers/annotations is insufficient. Extensive empirical results validate our conclusions.
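The two options being compared are easy to sketch (a minimal illustration: majority vote stands in for the aggregation step, replicating each example once per annotation is one simple way to "use labels separately", and scipy >= 1.9 is assumed for the keepdims argument):

```python
import numpy as np
from scipy import stats

def aggregate_labels(label_matrix):
    # Aggregation: majority vote over annotators.
    # label_matrix: (n_examples, n_annotators) -> (n_examples,)
    return stats.mode(label_matrix, axis=1, keepdims=False).mode

def separate_labels(X, label_matrix):
    # Separation: replicate each example once per annotation so the
    # training loss sees every noisy label individually.
    n, k = label_matrix.shape
    return np.repeat(X, k, axis=0), label_matrix.reshape(-1)
```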
    Logical Reasoning with Span-Level Predictions for Interpretable and Robust NLI Models. (arXiv:2205.11432v2 [cs.CL] UPDATED)
    Current Natural Language Inference (NLI) models achieve impressive results, sometimes outperforming humans when evaluating on in-distribution test sets. However, as these models are known to learn from annotation artefacts and dataset biases, it is unclear to what extent the models are learning the task of NLI instead of learning from shallow heuristics in their training data. We address this issue by introducing a logical reasoning framework for NLI, creating highly transparent model decisions that are based on logical rules. Unlike prior work, we show that improved interpretability can be achieved without decreasing the predictive accuracy. We almost fully retain performance on SNLI, while also identifying the exact hypothesis spans that are responsible for each model prediction. Using the e-SNLI human explanations, we verify that our model makes sensible decisions at a span level, despite not using any span labels during training. We can further improve model performance and span-level decisions by using the e-SNLI explanations during training. Finally, our model is more robust in a reduced data setting. When training with only 1,000 examples, out-of-distribution performance improves on the MNLI matched and mismatched validation sets by 13% and 16% relative to the baseline. Training with fewer observations yields further improvements, both in-distribution and out-of-distribution.
    Understanding Non-linearity in Graph Neural Networks from the Bayesian-Inference Perspective. (arXiv:2207.11311v3 [cs.LG] UPDATED)
Graph neural networks (GNNs) have shown superiority in many prediction tasks over graphs due to their impressive capability of capturing nonlinear relations in graph-structured data. However, for node classification tasks, often, only marginal improvement of GNNs over their linear counterparts has been observed. Previous works provide little understanding of this phenomenon. In this work, we resort to Bayesian learning to deeply investigate the functions of non-linearity in GNNs for node classification tasks. Given a graph generated from the statistical model CSBM, we observe that the maximum a posteriori estimation of a node label given its own and neighbors' attributes consists of two types of non-linearity, a possibly non-linear transformation of node attributes and a ReLU-activated feature aggregation from neighbors. The latter surprisingly matches the type of non-linearity used in many GNN models. By further imposing Gaussian assumption on node attributes, we prove that the superiority of those ReLU activations is only significant when the node attributes are far more informative than the graph structure, which nicely matches many previous empirical observations. A similar argument can be achieved when there is a distribution shift of node attributes between the training and testing datasets. Finally, we verify our theory on both synthetic and real-world networks.
    Federated Learning with Noisy Labels. (arXiv:2208.09378v2 [cs.LG] UPDATED)
Federated Learning (FL) is a distributed machine learning paradigm that enables learning models from decentralized private datasets, where the labeling effort is entrusted to the clients. While most existing FL approaches assume that high-quality labels are readily available on users' devices, in reality label noise can naturally occur in FL and follows a non-i.i.d. distribution among clients. Due to these non-i.i.d. challenges, existing state-of-the-art centralized approaches exhibit unsatisfactory performance, while previous FL studies rely on data exchange or repeated server-side aid to improve the model's performance. Here, we propose FedLN, a framework to deal with label noise across different FL training stages; namely, FL initialization, on-device model training, and server model aggregation. Specifically, FedLN computes per-client noise-level estimation in a single federated round and improves the models' performance by correcting (or limiting the effect of) noisy samples. Extensive experiments on various publicly available vision and audio datasets demonstrate a 24% improvement on average compared to other existing methods for a label noise level of 70%. We further validate the efficiency of FedLN in human-annotated real-world noisy datasets and report a 9% increase on average in models' recognition rate, highlighting that FedLN can be useful for improving FL services provided to everyday users.
    Ranking with multiple types of pairwise comparisons. (arXiv:2206.13580v2 [stat.ML] UPDATED)
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.
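For context, the single-comparison-type baseline being generalized is the classic Bradley-Terry model, which can be fit with a simple fixed-point (MM) iteration; a minimal sketch (the paper's EM over multiple comparison types builds on updates of this flavor):

```python
import numpy as np

def bradley_terry(wins, n_iter=200):
    # wins[i, j] = number of times i beat j.
    n = wins.shape[0]
    games = wins + wins.T              # n_ij: total comparisons between i and j
    w = wins.sum(axis=1)               # total wins of each competitor
    pi = np.ones(n)                    # latent "strength" scores
    for _ in range(n_iter):
        denom = (games / (pi[:, None] + pi[None, :] + 1e-12)).sum(axis=1)
        pi = w / np.maximum(denom, 1e-12)
        pi /= pi.sum()
    return pi                          # higher score = better-ranked
```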
    MOVE: Unsupervised Movable Object Segmentation and Detection. (arXiv:2210.07920v2 [cs.CV] UPDATED)
    We introduce MOVE, a novel method to segment objects without any form of supervision. MOVE exploits the fact that foreground objects can be shifted locally relative to their initial position and result in realistic (undistorted) new images. This property allows us to train a segmentation model on a dataset of images without annotation and to achieve state of the art (SotA) performance on several evaluation datasets for unsupervised salient object detection and segmentation. In unsupervised single object discovery, MOVE gives an average CorLoc improvement of 7.2% over the SotA, and in unsupervised class-agnostic object detection it gives a relative AP improvement of 53% on average. Our approach is built on top of self-supervised features (e.g. from DINO or MAE), an inpainting network (based on the Masked AutoEncoder) and adversarial training.
    Trailers12k: Improving Transfer Learning with a Dual Image and Video Transformer for Multi-label Movie Trailer Genre Classification. (arXiv:2210.07983v2 [cs.CV] UPDATED)
    In this paper, we propose Dual Image and Video Transformer Architecture (DIViTA) for multi-label movie trailer genre classification. DIViTA performs an input adaption stage that uses shot detection to segment the trailer into highly correlated clips, providing a more cohesive input that allows to leverage pretrained ImageNet and/or Kinetics backbones. We introduce Trailers12k, a movie trailer dataset with manually verified title-trailer pairs, and present a transferability study of representations learned from ImageNet and Kinetics to Trailers12k. Our results show that DIViTA can reduce the gap between the spatio-temporal structure of the source and target datasets, thus improving transferability. Moreover, representations learned on either ImageNet or Kinetics are comparatively transferable to Trailers12k, although they provide complementary information that can be combined to improve classification performance. Interestingly, pretrained lightweight ConvNets provide competitive classification performance, while using a fraction of the computing resources compared to heavier ConvNets and Transformers.
    A machine-learning-based tool for last closed-flux surface reconstruction on tokamaks. (arXiv:2207.05695v2 [physics.plasm-ph] UPDATED)
Nuclear fusion represents one of the best alternatives for a sustainable source of clean energy. Tokamaks confine fusion plasma with magnetic fields, and one of the main challenges in the control of the magnetic configuration is the prediction/reconstruction of the Last Closed-Flux Surface (LCFS). The evolution in time of the LCFS is determined by the interaction of the actuator coils and the internal tokamak plasma. This task requires real-time capable tools able to deal with high-dimensional data as well as with high resolution in time, where the interaction between a wide range of input actuator coils and internal plasma state responses adds an additional layer of complexity. In this work, we present the application of a novel state-of-the-art machine learning model to LCFS reconstruction in the Experimental Advanced Superconducting Tokamak (EAST) that learns automatically from the experimental data of EAST. This architecture not only allows offline simulation and testing of a particular control strategy, but can also be embedded in the real-time control system for online magnetic equilibrium reconstruction and prediction. In the real-time modeling test, our approach achieves very high accuracy, with over 99% average similarity in LCFS reconstruction of the entire discharge process.
    Graph Contrastive Learning with Cross-view Reconstruction. (arXiv:2209.07699v2 [cs.LG] UPDATED)
Among different existing graph self-supervised learning strategies, graph contrastive learning (GCL) has been one of the most prevalent approaches to this problem. Despite the remarkable performance those GCL methods have achieved, existing GCL methods that heavily depend on various manually designed augmentation techniques still struggle to alleviate the feature suppression issue without risking losing task-relevant information. Consequently, the learned representation is either brittle or unilluminating. In light of this, we introduce the Graph Contrastive Learning with Cross-View Reconstruction (GraphCV), which follows the information bottleneck principle to learn minimal yet sufficient representation from graph data. Specifically, GraphCV aims to elicit the predictive (useful for downstream instance discrimination) and other non-predictive features separately. In addition to the conventional contrastive loss, which guarantees the consistency and sufficiency of the representation across different augmentation views, we introduce a cross-view reconstruction mechanism to pursue the disentanglement of the two learned representations. Besides, an adversarial view perturbed from the original view is added as the third view for the contrastive loss to guarantee the intactness of the global semantics and improve the representation robustness. We empirically demonstrate that our proposed model outperforms the state-of-the-art on the graph classification task over multiple benchmark datasets.
    Sliced Gromov-Wasserstein. (arXiv:1905.10124v4 [stat.ML] UPDATED)
Recently used in various machine learning contexts, the Gromov-Wasserstein distance (GW) allows for comparing distributions whose supports do not necessarily lie in the same metric space. However, this Optimal Transport (OT) distance requires solving a complex non-convex quadratic program which is usually very costly in both time and memory. Contrary to GW, the Wasserstein distance (W) enjoys several properties (e.g. duality) that permit large scale optimization. Among those, the solution of W on the real line, which only requires sorting discrete samples in 1D, allows defining the Sliced Wasserstein (SW) distance. This paper proposes a new divergence based on GW akin to SW. We first derive a closed form for GW when dealing with 1D distributions, based on a new result for the related quadratic assignment problem. We then define a novel OT discrepancy that can deal with large scale distributions via a slicing approach and we show how it relates to the GW distance while being $O(n\log(n))$ to compute. We illustrate the behavior of this so-called Sliced Gromov-Wasserstein (SGW) discrepancy in experiments where we demonstrate its ability to tackle similar problems as GW while being several orders of magnitude faster to compute.
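As a reference point, the sorting trick behind sliced distances is easy to sketch. The snippet below estimates plain Sliced Wasserstein; SGW replaces the 1D Wasserstein cost on each projection with the paper's 1D Gromov-Wasserstein closed form, which likewise only needs the sorted projections (equal sample counts are assumed for simplicity):

```python
import numpy as np

def sliced_wasserstein_sq(X, Y, n_proj=100, seed=0):
    # X: (n, d), Y: (n, d). Project both point clouds onto random
    # lines, sort the 1D projections, and average the squared costs.
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_proj):
        theta = rng.normal(size=X.shape[1])
        theta /= np.linalg.norm(theta)
        x, y = np.sort(X @ theta), np.sort(Y @ theta)  # O(n log n)
        total += np.mean((x - y) ** 2)
    return total / n_proj
```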
    Rethinking Transfer Learning for Medical Image Classification. (arXiv:2106.05152v5 [eess.IV] UPDATED)
Transfer learning (TL) from pretrained deep models is a standard practice in modern medical image classification (MIC). However, what levels of features to be reused are problem-dependent, and uniformly finetuning all layers of pretrained models may be suboptimal. This insight has partly motivated the recent differential TL strategies, such as TransFusion (TF) and layer-wise finetuning (LWFT), which treat the layers in the pretrained models differentially. In this paper, we add one more strategy into this family, called TruncatedTL, which reuses and finetunes appropriate bottom layers and directly discards the remaining layers. This yields not only superior MIC performance but also compact models for efficient inference, compared to other differential TL methods. We validate the performance and model efficiency of TruncatedTL on three MIC tasks covering both 2D and 3D images. For example, on the BIMCV COVID-19 classification dataset, we obtain improved performance with around $1/4$ model size and $2/3$ inference time compared to the standard full TL model. Code is available at https://github.com/sun-umn/Transfer-Learning-in-Medical-Imaging.
    SizeShiftReg: a Regularization Method for Improving Size-Generalization in Graph Neural Networks. (arXiv:2207.07888v2 [cs.LG] UPDATED)
    In the past few years, graph neural networks (GNNs) have become the de facto model of choice for graph classification. While, from the theoretical viewpoint, most GNNs can operate on graphs of any size, it is empirically observed that their classification performance degrades when they are applied on graphs with sizes that differ from those in the training data. Previous works have tried to tackle this issue in graph classification by providing the model with inductive biases derived from assumptions on the generative process of the graphs, or by requiring access to graphs from the test domain. The first strategy is tied to the quality of the assumptions made for the generative process, and requires the use of specific models designed after the explicit definition of the generative process of the data, leaving open the question of how to improve the performance of generic GNN models in general settings. On the other hand, the second strategy can be applied to any GNN, but requires access to information that is not always easy to obtain. In this work we consider the scenario in which we only have access to the training data, and we propose a regularization strategy that can be applied to any GNN to improve its generalization capabilities from smaller to larger graphs without requiring access to the test data. Our regularization is based on the idea of simulating a shift in the size of the training graphs using coarsening techniques, and enforcing the model to be robust to such a shift. Experimental results on standard datasets show that popular GNN models, trained on the 50% smallest graphs in the dataset and tested on the 10% largest graphs, obtain performance improvements of up to 30% when trained with our regularization strategy.
    No-Regret Dynamics in the Fenchel Game: A Unified Framework for Algorithmic Convex Optimization. (arXiv:2111.11309v2 [cs.LG] UPDATED)
We develop an algorithmic framework for solving convex optimization problems using no-regret game dynamics. By converting the problem of minimizing a convex function into an auxiliary problem of solving a min-max game in a sequential fashion, we can consider a range of strategies for each of the two players, who must select their actions one after the other. A common choice for these strategies are so-called no-regret learning algorithms, and we describe a number of such algorithms and prove bounds on their regret. We then show that many classical first-order methods for convex optimization -- including average-iterate gradient descent, the Frank-Wolfe algorithm, the Heavy Ball algorithm, and Nesterov's acceleration methods -- can be interpreted as special cases of our framework as long as each player makes the correct choice of no-regret strategy. Proving convergence rates in this framework becomes very straightforward, as they follow from plugging in the appropriate known regret bounds. Our framework also gives rise to a number of new first-order methods for special cases of convex optimization that were not previously known.
    A Study of Causal Confusion in Preference-Based Reward Learning. (arXiv:2204.06601v2 [cs.LG] UPDATED)
    Learning policies via preference-based reward learning is an increasingly popular method for customizing agent behavior, but has been shown anecdotally to be prone to spurious correlations and reward hacking behaviors. While much prior work focuses on causal confusion in reinforcement learning and behavioral cloning, we aim to study it in the context of reward learning. To study causal confusion, we perform a series of sensitivity and ablation analyses on three benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states -- resulting in poor policy performance when optimized. We find that the presence of non-causal distractor features, noise in the stated preferences, partial state observability, and larger model capacity can all exacerbate causal confusion. We also identify a set of methods with which to interpret causally confused learned rewards: we observe that optimizing causally confused rewards drives the policy off the reward's training distribution, resulting in high predicted (learned) rewards but low true rewards. These findings illuminate the susceptibility of reward learning to causal confusion, especially in high-dimensional environments -- failure to consider even one of many factors (data coverage, state definition, etc.) can quickly result in unexpected, undesirable behavior.
    Muffliato: Peer-to-Peer Privacy Amplification for Decentralized Optimization and Averaging. (arXiv:2206.05091v2 [cs.CR] UPDATED)
    Decentralized optimization is increasingly popular in machine learning for its scalability and efficiency. Intuitively, it should also provide better privacy guarantees, as nodes only observe the messages sent by their neighbors in the network graph. But formalizing and quantifying this gain is challenging: existing results are typically limited to Local Differential Privacy (LDP) guarantees that overlook the advantages of decentralization. In this work, we introduce pairwise network differential privacy, a relaxation of LDP that captures the fact that the privacy leakage from a node $u$ to a node $v$ may depend on their relative position in the graph. We then analyze the combination of local noise injection with (simple or randomized) gossip averaging protocols on fixed and random communication graphs. We also derive a differentially private decentralized optimization algorithm that alternates between local gradient descent steps and gossip averaging. Our results show that our algorithms amplify privacy guarantees as a function of the distance between nodes in the graph, matching the privacy-utility trade-off of the trusted curator, up to factors that explicitly depend on the graph topology. Finally, we illustrate our privacy gains with experiments on synthetic and real-world datasets.
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v2 [cs.LG] UPDATED)
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
    MoCoDA: Model-based Counterfactual Data Augmentation. (arXiv:2210.11287v1 [cs.LG])
    The number of states in a dynamic process is exponential in the number of objects, making reinforcement learning (RL) difficult in complex, multi-object domains. For agents to scale to the real world, they will need to react to and reason about unseen combinations of objects. We argue that the ability to recognize and use local factorization in transition dynamics is a key element in unlocking the power of multi-object reasoning. To this end, we show that (1) known local structure in the environment transitions is sufficient for an exponential reduction in the sample complexity of training a dynamics model, and (2) a locally factored dynamics model provably generalizes out-of-distribution to unseen states and actions. Knowing the local structure also allows us to predict which unseen states and actions this dynamics model will generalize to. We propose to leverage these observations in a novel Model-based Counterfactual Data Augmentation (MoCoDA) framework. MoCoDA applies a learned locally factored dynamics model to an augmented distribution of states and actions to generate counterfactual transitions for RL. MoCoDA works with a broader set of local structures than prior work and allows for direct control over the augmented training distribution. We show that MoCoDA enables RL agents to learn policies that generalize to unseen states and actions. We use MoCoDA to train an offline RL agent to solve an out-of-distribution robotics manipulation task on which standard offline RL algorithms fail.
    Graph Neural Networks for Natural Language Processing: A Survey. (arXiv:2106.06090v2 [cs.CL] UPDATED)
Deep learning has become the dominant approach in coping with various tasks in Natural Language Processing (NLP). Although text inputs are typically represented as a sequence of tokens, there is a rich variety of NLP problems that can be best expressed with a graph structure. As a result, there is a surge of interest in developing new deep learning techniques on graphs for a large number of NLP tasks. In this survey, we present a comprehensive overview on Graph Neural Networks (GNNs) for Natural Language Processing. We propose a new taxonomy of GNNs for NLP, which systematically organizes existing research of GNNs for NLP along three axes: graph construction, graph representation learning, and graph-based encoder-decoder models. We further introduce a large number of NLP applications that are exploiting the power of GNNs and summarize the corresponding benchmark datasets, evaluation metrics, and open-source codes. Finally, we discuss various outstanding challenges for making full use of GNNs for NLP as well as future research directions. To the best of our knowledge, this is the first comprehensive overview of Graph Neural Networks for Natural Language Processing.
    Avoiding Barren Plateaus with Classical Deep Neural Networks. (arXiv:2205.13418v2 [quant-ph] UPDATED)
Variational quantum algorithms (VQAs) are among the most promising algorithms in the era of Noisy Intermediate Scale Quantum Devices. Such algorithms are constructed using a parameterization U($\pmb{\theta}$) with a classical optimizer that updates the parameters $\pmb{\theta}$ in order to minimize a cost function $C$. For this task, in general the gradient descent method, or one of its variants, is used. This is a method where the circuit parameters are updated iteratively using the cost function gradient. However, several works in the literature have shown that this method suffers from a phenomenon known as the Barren Plateaus (BP). In this work, we propose a new method to mitigate BPs. In general, the parameters $\pmb{\theta}$ used in the parameterization $U$ are randomly generated. In our method they are obtained from a classical neural network (CNN). We show that this method, besides being able to mitigate BPs at initialization, is also able to mitigate the effect of BPs during VQA training. In addition, we also show how this method behaves for different CNN architectures.
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v2 [cs.LG] UPDATED)
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.
    SS-VAERR: Self-Supervised Apparent Emotional Reaction Recognition from Video. (arXiv:2210.11341v1 [cs.CV])
    This work focuses on the apparent emotional reaction recognition (AERR) from the video-only input, conducted in a self-supervised fashion. The network is first pre-trained on different self-supervised pretext tasks and later fine-tuned on the downstream target task. Self-supervised learning facilitates the use of pre-trained architectures and larger datasets that might be deemed unfit for the target task and yet might be useful to learn informative representations and hence provide useful initializations for further fine-tuning on smaller more suitable data. Our presented contribution is two-fold: (1) an analysis of different state-of-the-art (SOTA) pretext tasks for the video-only apparent emotional reaction recognition architecture, and (2) an analysis of various combinations of the regression and classification losses that are likely to improve the performance further. Together these two contributions result in the current state-of-the-art performance for the video-only spontaneous apparent emotional reaction recognition with continuous annotations.
    pvCNN: Privacy-Preserving and Verifiable Convolutional Neural Network Testing. (arXiv:2201.09186v2 [cs.CR] UPDATED)
This paper proposes a new approach for privacy-preserving and verifiable convolutional neural network (CNN) testing, enabling a CNN model developer to convince a user of the truthful CNN performance over non-public data from multiple testers, while respecting model privacy. To balance security and efficiency, three new efforts are made by appropriately integrating homomorphic encryption (HE) and zero-knowledge succinct non-interactive argument of knowledge (zk-SNARK) primitives with the CNN testing. First, a CNN model to be tested is strategically partitioned into a private part kept locally by the model developer, and a public part outsourced to an outside server. Then, the private part runs over HE-protected test data sent by a tester and transmits its outputs to the public part for accomplishing subsequent computations of the CNN testing. Second, the correctness of the above CNN testing is enforced by generating zk-SNARK based proofs, with an emphasis on optimizing proving overhead for two-dimensional (2-D) convolution operations, since the operations dominate the performance bottleneck during generating proofs. We specifically present a new quadratic matrix programs (QMPs)-based arithmetic circuit with a single multiplication gate for expressing 2-D convolution operations between multiple filters and inputs in a batch manner. Third, we aggregate multiple proofs with respect to a same CNN model but different testers' test data (i.e., different statements) into one proof, and ensure that the validity of the aggregated proof implies the validity of the original multiple proofs. Lastly, our experimental results demonstrate that our QMPs-based zk-SNARK performs nearly 13.9$\times$ faster than the existing QAPs-based zk-SNARK in proving time, and 17.6$\times$ faster in setup time, for high-dimension matrix multiplication.
    FedDebias: Reducing the Local Learning Bias Improves Federated Learning on Heterogeneous Data. (arXiv:2205.13462v3 [cs.LG] UPDATED)
Federated Learning (FL) is a machine learning paradigm that learns from data kept locally to safeguard the privacy of clients, whereas local SGD is typically employed on the clients' devices to improve communication efficiency. However, such a scheme is currently constrained by the slow and unstable convergence induced by clients' heterogeneous data. In this work, we identify three under-explored phenomena of the biased local learning that may explain these challenges caused by local updates in supervised FL. As a remedy, we propose FedDebias, a novel unified algorithm that reduces the local learning bias on features and classifiers to tackle these challenges. FedDebias consists of two components: The first component alleviates the bias in the local classifiers by balancing the output distribution of models. The second component learns client invariant features that are close to global features but considerably distinct from those learned from other input distributions. In a series of experiments, we show that FedDebias consistently outperforms other SOTA FL and domain generalization (DG) baselines, and that both components contribute individual performance gains.
    Mixture Outlier Exposure: Towards Out-of-Distribution Detection in Fine-grained Environments. (arXiv:2106.03917v4 [cs.LG] UPDATED)
Many real-world scenarios in which DNN-based recognition systems are deployed have inherently fine-grained attributes (e.g., bird-species recognition, medical image classification). In addition to achieving reliable accuracy, a critical subtask for these models is to detect Out-of-distribution (OOD) inputs. Given the nature of the deployment environment, one may expect such OOD inputs to also be fine-grained w.r.t. the known classes (e.g., a novel bird species), which are thus extremely difficult to identify. Unfortunately, OOD detection in fine-grained scenarios remains largely underexplored. In this work, we aim to fill this gap by first carefully constructing four large-scale fine-grained test environments, in which existing methods are shown to have difficulties. Particularly, we find that even explicitly incorporating a diverse set of auxiliary outlier data during training does not provide sufficient coverage over the broad region where fine-grained OOD samples locate. We then propose Mixture Outlier Exposure (MixOE), which mixes ID data and training outliers to expand the coverage of different OOD granularities, and trains the model such that the prediction confidence linearly decays as the input transitions from ID to OOD. Extensive experiments and analyses demonstrate the effectiveness of MixOE for building up OOD detectors in fine-grained environments. The code is available at https://github.com/zjysteven/MixOE.
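The core mixing step is simple to sketch (a minimal illustration assuming image tensors, hard ID labels, and matching batch sizes; the full MixOE recipe in the paper and repo includes further variants such as CutMix-style mixing):

```python
import torch
import torch.nn.functional as F

def mixoe_batch(x_id, y_id, x_outlier, n_classes, alpha=1.0):
    # Mix ID inputs with outliers; the target decays linearly from the
    # one-hot label toward the uniform distribution as the mix goes OOD.
    lam = torch.distributions.Beta(alpha, alpha).sample((x_id.size(0),))
    x_mix = lam.view(-1, 1, 1, 1) * x_id + (1 - lam.view(-1, 1, 1, 1)) * x_outlier
    onehot = F.one_hot(y_id, n_classes).float()
    uniform = torch.full_like(onehot, 1.0 / n_classes)
    y_mix = lam.view(-1, 1) * onehot + (1 - lam.view(-1, 1)) * uniform
    return x_mix, y_mix  # train with soft-label cross-entropy
```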
    Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning. (arXiv:2203.02053v2 [cs.CL] UPDATED)
    We present modality gap, an intriguing geometric phenomenon of the representation space of multi-modal models. Specifically, we show that different data modalities (e.g. images and text) are embedded at arm's length in their shared representation in multi-modal models such as CLIP. Our systematic analysis demonstrates that this gap is caused by a combination of model initialization and contrastive learning optimization. In model initialization, we show empirically and theoretically that the representation of a common deep neural network is restricted to a narrow cone. As a consequence, in a multi-modal model with two encoders, the representations of the two modalities are clearly apart when the model is initialized. During optimization, contrastive learning keeps the different modalities separate by a certain distance, which is influenced by the temperature parameter in the loss function. Our experiments further demonstrate that varying the modality gap distance has a significant impact in improving the model's downstream zero-shot classification performance and fairness. Our code and data are available at https://modalitygap.readthedocs.io/
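Measuring the gap on your own embeddings is straightforward; a sketch roughly following the paper's analysis (embeddings are assumed to be rows of numpy arrays from the two encoders):

```python
import numpy as np

def modality_gap(img_emb, txt_emb):
    # L2-normalize the embeddings, then take the difference of the
    # two modality centroids on the unit sphere.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    delta = img.mean(axis=0) - txt.mean(axis=0)
    return delta, np.linalg.norm(delta)
```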
    Tight Bounds for Quantum State Certification with Incoherent Measurements. (arXiv:2204.07155v2 [quant-ph] UPDATED)
    We consider the problem of quantum state certification, where we are given the description of a mixed state $\sigma \in \mathbb{C}^{d \times d}$, $n$ copies of a mixed state $\rho \in \mathbb{C}^{d \times d}$, and $\varepsilon > 0$, and we are asked to determine whether $\rho = \sigma$ or whether $\| \rho - \sigma \|_1 > \varepsilon$. When $\sigma$ is the maximally mixed state $\frac{1}{d} I_d$, this is known as mixedness testing. We focus on algorithms which use incoherent measurements, i.e. which only measure one copy of $\rho$ at a time. Unlike those that use entangled, multi-copy measurements, these can be implemented without persistent quantum memory and thus represent a large class of protocols that can be run on current or near-term devices. For mixedness testing, there is a folklore algorithm which uses incoherent measurements and only needs $O(d^{3/2} / \varepsilon^2)$ copies. The algorithm is non-adaptive, that is, its measurements are fixed ahead of time, and is known to be optimal for non-adaptive algorithms. However, when the algorithm can make arbitrary incoherent measurements, the best known lower bound is only $\Omega (d^{4/3} / \varepsilon^2)$ [Bubeck-Chen-Li '20], and it has been an outstanding open problem to close this polynomial gap. In this work, 1) we settle the copy complexity of mixedness testing with incoherent measurements and show that $\Omega (d^{3/2} / \varepsilon^2)$ copies are necessary, and 2) we show the instance-optimal bounds for state certification to general $\sigma$ first derived by [Chen-Li-O'Donnell '21] for non-adaptive measurements also hold for arbitrary incoherent measurements. Qualitatively, our results say that adaptivity does not help at all for these problems. Our results are based on new techniques that allow us to reduce the problem to understanding certain matrix martingales, which we believe may be of independent interest.
    The Pump Scheduling Problem: A Real-World Scenario for Reinforcement Learning. (arXiv:2210.11111v1 [cs.LG])
    Deep Reinforcement Learning (DRL) has achieved remarkable success in scenarios such as games and has emerged as a potential solution for control tasks, owing to its scalability and ability to handle complex dynamics. However, few works have targeted environments grounded in real-world settings. Indeed, real-world scenarios can be challenging, especially when faced with the high dimensionality of the state space and an unknown reward function. To facilitate research, we release a testbed consisting of an environment simulator and demonstrations of human operation concerning pump scheduling of a real-world water distribution facility. The pump scheduling problem can be viewed as a decision process to decide when to operate pumps to supply water while limiting electricity consumption and meeting system constraints. To provide a starting point, we release a well-documented codebase, present an overview of some challenges that can be addressed, and provide a baseline representation of the problem. The code and dataset are available at https://gitlab.com/hdonancio/pumpscheduling.
    Detecting Backdoors in Deep Text Classifiers. (arXiv:2210.11264v1 [cs.CR])
    Deep neural networks are vulnerable to adversarial attacks, such as backdoor attacks in which a malicious adversary compromises a model during training such that specific behaviour can be triggered at test time by attaching a specific word or phrase to an input. This paper considers the problem of diagnosing whether a model has been compromised and if so, identifying the backdoor trigger. We present the first robust defence mechanism that generalizes to several backdoor attacks against text classification models, without prior knowledge of the attack type, nor does our method require access to any (potentially compromised) training resources. Our experiments show that our technique is highly accurate at defending against state-of-the-art backdoor attacks, including data poisoning and weight poisoning, across a range of text classification tasks and model architectures. Our code will be made publicly available upon acceptance.
    Pruning by Active Attention Manipulation. (arXiv:2210.11114v1 [cs.CV])
    Filter pruning of a CNN is typically achieved by applying discrete masks on the CNN's filter weights or activation maps, post-training. Here, we present a new filter-importance-scoring concept named pruning by active attention manipulation (PAAM), that sparsifies the CNN's set of filters through a particular attention mechanism, during training. PAAM learns analog filter scores from the filter weights by optimizing a cost function regularized by an additive term in the scores. As the filters are not independent, we use attention to dynamically learn their correlations. Moreover, by training the pruning scores of all layers simultaneously, PAAM can account for layer inter-dependencies, which is essential to finding a performant sparse sub-network. PAAM can also train and generate a pruned network from scratch in a straightforward, one-stage training process without requiring a pre-trained network. Finally, PAAM does not need layer-specific hyperparameters and pre-defined layer budgets, since it can implicitly determine the appropriate number of filters in each layer. Our experimental results on different network architectures suggest that PAAM outperforms state-of-the-art (SOTA) structured-pruning methods. On the CIFAR-10 dataset, without requiring a pre-trained baseline network, we obtain accuracy gains of 1.02% and 1.19% and parameter reductions of 52.3% and 54%, on ResNet56 and ResNet110, respectively. Similarly, on the ImageNet dataset, PAAM achieves a 1.06% accuracy gain while pruning 51.1% of the parameters on ResNet50. On CIFAR-10, this is better than the SOTA by margins of 9.5% and 6.6%, respectively, and on ImageNet by a margin of 11%.
    Seeing the forest and the tree: Building representations of both individual and collective dynamics with transformers. (arXiv:2206.06131v2 [q-bio.NC] UPDATED)
    Complex time-varying systems are often studied by abstracting away from the dynamics of individual components to build a model of the population-level dynamics from the start. However, when building a population-level description, it can be easy to lose sight of each individual and how they contribute to the larger picture. In this paper, we present a novel transformer architecture for learning from time-varying data that builds descriptions of both the individual as well as the collective population dynamics. Rather than combining all of our data into our model at the outset, we develop a separable architecture that operates on individual time-series first before passing them forward; this induces a permutation-invariance property and can be used to transfer across systems of different size and order. After demonstrating that our model can be applied to successfully recover complex interactions and dynamics in many-body systems, we apply our approach to populations of neurons in the nervous system. On neural activity datasets, we show that our model not only yields robust decoding performance, but also provides impressive performance in transfer across recordings of different animals without any neuron-level correspondence. By enabling flexible pre-training that can be transferred to neural recordings of different size and order, our work provides a first step towards creating a foundation model for neural decoding.
    TTTFlow: Unsupervised Test-Time Training with Normalizing Flow. (arXiv:2210.11389v1 [cs.CV])
    A major problem of deep neural networks for image classification is their vulnerability to domain changes at test-time. Recent methods have proposed to address this problem with test-time training (TTT), where a two-branch model is trained to learn a main classification task and also a self-supervised task used to perform test-time adaptation. However, these techniques require defining a proxy task specific to the target application. To tackle this limitation, we propose TTTFlow: a Y-shaped architecture using an unsupervised head based on Normalizing Flows to learn the normal distribution of latent features and detect domain shifts in test examples. At inference, keeping the unsupervised head fixed, we adapt the model to domain-shifted examples by maximizing the log likelihood of the Normalizing Flow. Our results show that our method can significantly improve the accuracy with respect to previous works.
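    The adaptation step described above reduces to a short optimization loop. A minimal sketch, assuming the two-branch model exposes a feature encoder and a flow head with a `log_prob` method (an nflows-style interface); all names are illustrative:

```python
import torch

def adapt_at_test_time(encoder, flow, x_test, steps=10, lr=1e-4):
    """Adapt the encoder to a domain-shifted batch by maximizing the
    log-likelihood of its features under the (frozen) normalizing flow."""
    for p in flow.parameters():
        p.requires_grad_(False)               # unsupervised head stays fixed
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        z = encoder(x_test)                   # latent features of the test batch
        loss = -flow.log_prob(z).mean()       # negative log-likelihood under the flow
        loss.backward()
        opt.step()
    return encoder
```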
    ML4C: Seeing Causality Through Latent Vicinity. (arXiv:2110.00637v2 [cs.LG] UPDATED)
    Supervised Causal Learning (SCL) aims to learn causal relations from observational data by accessing previously seen datasets associated with ground truth causal relations. This paper presents a first attempt at addressing a fundamental question: what are the benefits of supervision, and how does it help? Observing that SCL is not better than random guessing if the learning target is non-identifiable a priori, we propose a two-phase paradigm for SCL by explicitly considering structure identifiability. Following this paradigm, we tackle the problem of SCL on discrete data and propose ML4C. The core of ML4C is a binary classifier with a novel learning target: it classifies whether an Unshielded Triple (UT) is a v-structure or not. Specifically, starting from an input dataset with the corresponding skeleton provided, ML4C orients each UT once it is classified as a v-structure. These v-structures are together used to construct the final output. To address the fundamental question of SCL, we propose a principled method for ML4C featurization: we exploit the vicinity of a given UT (i.e., the neighbors of the UT in the skeleton), and derive features by considering the conditional dependencies and structural entanglement within the vicinity. We further prove that ML4C is asymptotically correct. Finally, thorough experiments conducted on benchmark datasets demonstrate that ML4C remarkably outperforms other state-of-the-art algorithms in terms of accuracy, reliability, robustness and tolerance. In summary, ML4C shows promising results on validating the effectiveness of supervision for causal learning. Our codes are publicly available at https://github.com/microsoft/ML4C.
    Texture Extraction Methods Based Ensembling Framework for Improved Classification. (arXiv:2206.04158v3 [cs.CV] UPDATED)
    Texture-based classification solutions have proven their significance in many domains, from industrial inspections to health-related applications. New methods have been developed based on texture feature learning and CNN-based architectures to address computer vision use cases for images with rich texture-based features. In recent years, architectures solving texture-based classification problems and demonstrating state-of-the-art results have emerged. Yet, one limitation of these approaches is that they cannot claim to be suitable for all types of image texture patterns: each technique has an advantage for a specific texture type only. To address this shortcoming, we propose a framework that uniquely combines multiple texture extraction (TE) techniques with a CNN backbone to extract the most relevant texture features. This enables the model to be trained in a self-selective manner and produce improved results over current published benchmarks -- with almost the same number of model parameters. Our proposed framework works well on most texture types simultaneously and allows flexibility for additional texture-based methods to be accommodated to achieve better results than existing architectures. In this work, firstly, we present an analysis of the relative importance of existing techniques when used alone and in combination with other TE methods on benchmark datasets. Secondly, we show that Global Average Pooling, which represents the spatial information, is of less significance than the TE method(s) applied in the network while training for texture-based classification tasks. Finally, we present state-of-the-art results for several texture-based benchmark datasets by combining three existing texture-based techniques using our proposed framework.
    Machine Learning in Orbit Estimation: a Survey. (arXiv:2207.08993v2 [astro-ph.EP] UPDATED)
    Since the late '50s, when the first artificial satellite was launched, the number of resident space objects (RSOs) has steadily increased. It is estimated that around 1 million objects larger than 1 cm are currently orbiting the Earth, of which only about 30,000 objects larger than 10 cm are presently being tracked. To avert a chain reaction of collisions, termed Kessler Syndrome, it is indispensable to accurately track and predict the orbits of space debris and satellites alike. Current physics-based methods have errors on the order of kilometres for 7-day predictions, which is insufficient when considering space debris that are mostly smaller than 1 metre. Typically, this failure is due to uncertainty around the state of the space object at the beginning of the trajectory, forecasting errors in environmental conditions such as atmospheric drag, as well as specific unknown characteristics such as the mass or geometry of the RSO. Leveraging data-driven techniques, namely machine learning, the orbit prediction accuracy can be enhanced: by deriving the characteristics of unmeasured objects, by better modelling the effects of non-conservative forces, and through the superior capacity of Deep Learning models for modelling highly complex non-linear systems. In this survey, we provide an overview of the current work being done in this field.
    How can a Radar Mask its Cognition?. (arXiv:2210.11444v1 [eess.SP])
    A cognitive radar is a constrained utility maximizer that adapts its sensing mode in response to a changing environment. If an adversary can estimate the utility function of a cognitive radar, it can determine the radar's sensing strategy and mitigate the radar performance via electronic countermeasures (ECM). This paper discusses how a cognitive radar can {\em hide} its strategy from an adversary that detects cognition. The radar does so by transmitting purposefully designed sub-optimal responses to spoof the adversary's Neyman-Pearson detector. We provide theoretical guarantees by ensuring the Type-I error probability of the adversary's detector exceeds a pre-defined level for a specified tolerance on the radar's performance loss. We illustrate our cognition masking scheme via numerical examples involving waveform adaptation and beam allocation. We show that small purposeful deviations from the optimal strategy of the radar confuse the adversary by significant amounts, thereby masking the radar's cognition. Our approach uses novel ideas from revealed preference in microeconomics and adversarial inverse reinforcement learning. Our proposed algorithms provide a principled approach for system-level electronic counter-countermeasures (ECCM) to mask the radar's cognition, i.e., hide the radar's strategy from an adversary. We also provide performance bounds for our cognition masking scheme when the adversary has misspecified measurements of the radar's response.
    PAC-learning is Undecidable. (arXiv:1808.06324v3 [cs.LG] UPDATED)
    The problem of attempting to learn the mapping between data and labels is the crux of any machine learning task. It is, therefore, of interest to the machine learning community on practical as well as theoretical counts to consider the existence of a test or criterion for deciding the feasibility of attempting to learn. We investigate the existence of such a criterion in the setting of PAC-learning, basing the feasibility solely on whether the mapping to be learnt lends itself to approximation by a given class of hypothesis functions. We show that no such criterion exists, exposing a fundamental limitation in the decidability of learning. In other words, we prove that testing for PAC-learnability is undecidable in the Turing sense. We also briefly discuss some of the probable implications of this result to the current practice of machine learning.
    Online Caching with no Regret: Optimistic Learning via Recommendations. (arXiv:2204.09345v2 [cs.NI] UPDATED)
    The design of effective online caching policies is an increasingly important problem for content distribution networks, online social networks and edge computing services, among other areas. This paper proposes a new algorithmic toolbox for tackling this problem through the lens of \emph{optimistic} online learning. We build upon the Follow-the-Regularized-Leader (FTRL) framework, which is developed further here to include predictions for the file requests, and we design online caching algorithms for bipartite networks with pre-reserved or dynamic storage subject to time-average budget constraints. The predictions are provided by a content recommendation system that influences the users' viewing activity and hence can naturally reduce the caching network's uncertainty about future requests. We also extend the framework to learn and utilize the best request predictor in cases where many are available. We prove that the proposed optimistic learning caching policies can achieve \emph{sub-zero} performance loss (regret) for perfect predictions, and maintain the sub-linear regret bound $O(\sqrt{T})$, which is the best achievable bound for policies that do not use predictions, even for arbitrarily bad predictions. The performance of the proposed algorithms is evaluated with detailed trace-driven numerical tests.
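    To give a flavour of the optimistic ingredient, here is a hedged single-cache sketch: fractional caching of N files with capacity C, where each step follows the observed utility gradient plus the recommender's prediction of the next request's gradient, then projects back onto the capacity constraint. The paper's actual algorithms operate in the FTRL framework over bipartite networks, so treat every name below as illustrative.

```python
import numpy as np

def project_capped_simplex(z, C, iters=60):
    """Euclidean projection onto {y : 0 <= y_i <= 1, sum_i y_i = C},
    by bisecting on tau in y_i = clip(z_i - tau, 0, 1)."""
    lo, hi = z.min() - 1.0, z.max()
    for _ in range(iters):
        tau = 0.5 * (lo + hi)
        if np.clip(z - tau, 0.0, 1.0).sum() > C:
            lo = tau          # cached mass too large: raise the threshold
        else:
            hi = tau
    return np.clip(z - 0.5 * (lo + hi), 0.0, 1.0)

def optimistic_step(y, grad, pred_next_grad, eta, C):
    """One optimistic update: observed gradient plus predicted next gradient,
    followed by projection onto the capacity constraint."""
    return project_capped_simplex(y + eta * (grad + pred_next_grad), C)
```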
    ObSynth: An Interactive Synthesis System for Generating Object Models from Natural Language Specifications. (arXiv:2210.11468v1 [cs.SE])
    We introduce ObSynth, an interactive system leveraging the domain knowledge embedded in large language models (LLMs) to help users design object models from high level natural language prompts. This is an example of specification reification, the process of taking a high-level, potentially vague specification and reifying it into a more concrete form. We evaluate ObSynth via a user study, leading to three key findings: first, object models designed using ObSynth are more detailed, showing that it often synthesizes fields users might have otherwise omitted. Second, a majority of objects, methods, and fields generated by ObSynth are kept by the user in the final object model, highlighting the quality of generated components. Third, ObSynth altered the workflow of participants: they focus on checking that synthesized components were correct rather than generating them from scratch, though ObSynth did not reduce the time participants took to generate object models.
    Theoretical Exploration of Flexible Transmitter Model. (arXiv:2111.06027v2 [cs.LG] UPDATED)
    Neural network models generally involve two important components, i.e., network architecture and neuron model. Although there are abundant studies about network architectures, only a few neuron models have been developed, such as the MP neuron model developed in 1943 and the spiking neuron model developed in the 1950s. Recently, a new bio-plausible neuron model, Flexible Transmitter (FT) model, has been proposed. It exhibits promising behaviors, particularly on temporal-spatial signals, even when simply embedded into the common feedforward network architecture. This paper attempts to understand the properties of the FT network (FTNet) theoretically. Under mild assumptions, we show that: i) FTNet is a universal approximator; ii) the approximation complexity of FTNet can be exponentially smaller than those of commonly-used real-valued neural networks with feedforward/recurrent architectures and is of the same order in the worst case; iii) any local minimum of FTNet is the global minimum, implying that it is possible to identify global minima by local search algorithms.
    Rashomon Capacity: A Metric for Predictive Multiplicity in Classification. (arXiv:2206.01295v2 [cs.LG] UPDATED)
    Predictive multiplicity occurs when classification models with statistically indistinguishable performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new metric, called Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure and report predictive multiplicity prior to model deployment.
    Score-based Generative Models for Calorimeter Shower Simulation. (arXiv:2206.11898v3 [hep-ph] UPDATED)
    Score-based generative models are a new class of generative algorithms that have been shown to produce realistic images even in high dimensional spaces, currently surpassing other state-of-the-art models for different benchmark categories and applications. In this work we introduce CaloScore, a score-based generative model for collider physics applied to calorimeter shower generation. Three different diffusion models are investigated using the Fast Calorimeter Simulation Challenge 2022 dataset. CaloScore is the first application of a score-based generative model in collider physics and is able to produce high-fidelity calorimeter images for all datasets, providing an alternative paradigm for calorimeter shower simulation.
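    As background, the training objective behind such models is denoising score matching. The sketch below uses a generic VE-SDE-style geometric noise schedule; the schedule endpoints and the `score_net(x_t, t)` interface are assumptions, not CaloScore's exact configuration.

```python
import torch

def dsm_loss(score_net, x, sigma_min=0.01, sigma_max=50.0):
    """Denoising score matching: perturb x with Gaussian noise of random
    scale sigma and regress the score of the perturbation kernel, whose
    true value at x_t = x + sigma * z is -z / sigma."""
    t = torch.rand(x.shape[0], device=x.device)
    sigma = sigma_min * (sigma_max / sigma_min) ** t        # geometric schedule
    sigma = sigma.view(-1, *([1] * (x.dim() - 1)))
    z = torch.randn_like(x)
    x_t = x + sigma * z
    score = score_net(x_t, t)                               # assumed interface
    return ((sigma * score + z) ** 2).flatten(1).sum(dim=1).mean()
```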
    A Note on the Efficient Evaluation of PAC-Bayes Bounds. (arXiv:2209.05188v2 [cs.LG] UPDATED)
    When utilising PAC-Bayes theory for risk certification, it is usually necessary to estimate and bound the Gibbs risk of the PAC-Bayes posterior. Many works in the literature employ a method for this which requires a large number of passes of the dataset, incurring high computational cost. This manuscript presents a very general alternative which makes computational savings on the order of the dataset size.
    Freeze then Train: Towards Provable Representation Learning under Spurious Correlations and Feature Noise. (arXiv:2210.11075v1 [cs.LG])
    The existence of spurious correlations such as image backgrounds in the training environment can make empirical risk minimization (ERM) perform badly in the test environment. To address this problem, Kirichenko et al. (2022) empirically found that the core features that are causally related to the outcome can still be learned well even with the presence of spurious correlations. This opens a promising strategy to first train a feature learner rather than a classifier, and then perform linear probing (last layer retraining) in the test environment. However, a theoretical understanding of when and why this approach works is lacking. In this paper, we find that core features are only learned well when they are less noisy than spurious features, which is not necessarily true in practice. We provide both theory and experiments to support this finding and to illustrate the importance of feature noise. Moreover, we propose an algorithm called Freeze then Train (FTT), that first freezes certain salient features and then trains the rest of the features using ERM. We theoretically show that FTT preserves features that are more beneficial to test time probing. Across two commonly used real-world benchmarks, FTT outperforms ERM, JTT and CVaR-DRO, with especially substantial improvement in accuracy (by 4.8%) when the feature noise is large.
    Technical Language Supervision for Intelligent Fault Diagnosis in Process Industry. (arXiv:2112.07356v2 [cs.AI] UPDATED)
    In the process industry, condition monitoring systems with automated fault diagnosis methods assist human experts and thereby improve maintenance efficiency, process sustainability, and workplace safety. Improving the automated fault diagnosis methods using data and machine learning-based models is a central aspect of intelligent fault diagnosis (IFD). A major challenge in IFD is to develop realistic datasets with accurate labels needed to train and validate models, and to transfer models trained with labeled lab data to heterogeneous process industry environments. However, fault descriptions and work-orders written by domain experts are increasingly digitised in modern condition monitoring systems, for example in the context of rotating equipment monitoring. Thus, domain-specific knowledge about fault characteristics and severities exists as technical language annotations in industrial datasets. Furthermore, recent advances in natural language processing enable weakly supervised model optimisation using natural language annotations, most notably in the form of natural language supervision (NLS). This creates a timely opportunity to develop technical language supervision (TLS) solutions for IFD systems grounded in industrial data, for example as a complement to pre-training with lab data to address problems like overfitting and inaccurate out-of-sample generalisation. We survey the literature and identify a considerable improvement in the maturity of NLS over the last two years, facilitating applications beyond natural language; a rapid development of weak supervision methods; and transfer learning as a current trend in IFD which can benefit from these developments. Finally, we describe a general framework for TLS and implement a TLS case study based on SentenceBERT and contrastive learning based zero-shot inference on annotated industry data.
    On the Perils of Cascading Robust Classifiers. (arXiv:2206.00278v2 [cs.LG] UPDATED)
    Ensembling certifiably robust neural networks is a promising approach for improving the \emph{certified robust accuracy} of neural models. Black-box ensembles that assume only query-access to the constituent models (and their robustness certifiers) during prediction are particularly attractive due to their modular structure. Cascading ensembles are a popular instance of black-box ensembles that appear to improve certified robust accuracies in practice. However, we show that the robustness certifier used by a cascading ensemble is unsound. That is, when a cascading ensemble is certified as locally robust at an input $x$ (with respect to $\epsilon$), there can be inputs $x'$ in the $\epsilon$-ball centered at $x$, such that the cascade's prediction at $x'$ is different from its prediction at $x$ and thus the ensemble is not locally robust. Our theoretical findings are accompanied by empirical results that further demonstrate this unsoundness. We present \emph{cascade attack} (CasA), an adversarial attack against cascading ensembles, and show that: (1) there exists an adversarial input for up to 88\% of the samples where the ensemble claims to be certifiably robust and accurate; and (2) the accuracy of a cascading ensemble under our attack is as low as 11\% when it claims to be certifiably robust and accurate on 97\% of the test set. Our work reveals a critical pitfall of cascading certifiably robust models by showing that the seemingly beneficial strategy of cascading can actually hurt the robustness of the resulting ensemble. Our code is available at \url{https://github.com/TristaChi/ensembleKW}.
    Competence-based Multimodal Curriculum Learning for Medical Report Generation. (arXiv:2206.14579v2 [cs.CL] UPDATED)
    The medical report generation task, which aims to produce long and coherent descriptions of medical images, has attracted growing research interest recently. Different from general image captioning tasks, medical report generation is more challenging for data-driven neural models. This is mainly due to 1) the serious data bias and 2) the limited medical data. To alleviate the data bias and make the best use of available data, we propose a Competence-based Multimodal Curriculum Learning framework (CMCL). Specifically, CMCL simulates the learning process of radiologists and optimizes the model in a step-by-step manner. Firstly, CMCL estimates the difficulty of each training instance and evaluates the competence of the current model; secondly, CMCL selects the most suitable batch of training instances considering the current model's competence. By iterating these two steps, CMCL can gradually improve the model's performance. The experiments on the public IU-Xray and MIMIC-CXR datasets show that CMCL can be incorporated into existing models to improve their performance.
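    The two-step loop lends itself to a compact sketch. The square-root competence schedule below follows the competence-based curriculum literature that this framework builds on; how difficulty scores are estimated (e.g., from instance length or model loss) is left abstract, and all names are illustrative.

```python
import numpy as np

def competence(t, T, c0=0.1):
    """Square-root competence schedule: grows from c0 to 1 over T steps."""
    return min(1.0, np.sqrt(t / T * (1.0 - c0 ** 2) + c0 ** 2))

def sample_batch(difficulty, t, T, batch_size, rng=None):
    """Sample uniformly among examples whose normalized difficulty (in [0, 1])
    does not exceed the model's current competence."""
    rng = rng or np.random.default_rng()
    eligible = np.flatnonzero(np.asarray(difficulty) <= competence(t, T))
    return rng.choice(eligible, size=batch_size, replace=True)
```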
    Rewriting Meaningful Sentences via Conditional BERT Sampling and an application on fooling text classifiers. (arXiv:2010.11869v2 [cs.CL] UPDATED)
    Most adversarial attack methods that are designed to deceive a text classifier change the text classifier's prediction by modifying a few words or characters. Few try to attack classifiers by rewriting a whole sentence, due to the difficulties inherent in sentence-level rephrasing as well as the problem of setting the criteria for legitimate rewriting. In this paper, we explore the problem of creating adversarial examples with sentence-level rewriting. We design a new sampling method, named ParaphraseSampler, to efficiently rewrite the original sentence in multiple ways. Then we propose a new criterion for modification, called a sentence-level threat model. This criterion allows for both word- and sentence-level changes, and can be adjusted independently in two dimensions: semantic similarity and grammatical quality. Experimental results show that many of these rewritten sentences are misclassified by the classifier. On all 6 datasets, our ParaphraseSampler achieves a better attack success rate than our baseline.
    Policy Optimization with Linear Temporal Logic Constraints. (arXiv:2206.09546v2 [cs.LG] UPDATED)
    We study the problem of policy optimization (PO) with linear temporal logic (LTL) constraints. The language of LTL allows flexible description of tasks that may be unnatural to encode as a scalar cost function. We consider LTL-constrained PO as a systematic framework, decoupling task specification from policy selection, and as an alternative to the standard of cost shaping. With access to a generative model, we develop a model-based approach that enjoys a sample complexity analysis for guaranteeing both task satisfaction and cost optimality (through a reduction to a reachability problem). Empirically, our algorithm can achieve strong performance even in low-sample regimes.
    Deep conditional transformation models for survival analysis. (arXiv:2210.11366v1 [cs.LG])
    An ever-increasing number of clinical trials features a time-to-event outcome and records non-tabular patient data, such as magnetic resonance imaging or text data in the form of electronic health records. Recently, several neural-network based solutions have been proposed, some of which are binary classifiers. Parametric, distribution-free approaches which make full use of survival time and censoring status have not received much attention. We present deep conditional transformation models (DCTMs) for survival outcomes as a unifying approach to parametric and semiparametric survival analysis. DCTMs allow the specification of non-linear and non-proportional hazards for both tabular and non-tabular data and extend to all types of censoring and truncation. On real and semi-synthetic data, we show that DCTMs compete with state-of-the-art DL approaches to survival analysis.
    i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?. (arXiv:2210.11470v1 [cs.CV])
    Masked image modeling (MIM) has been recognized as a strong and popular self-supervised pre-training approach in the vision domain. However, the interpretability of the mechanism and the properties of the learned representations under such a scheme are so far not well-explored. In this work, through comprehensive experiments and empirical studies on Masked Autoencoders (MAE), we address two critical questions to explore the behaviors of the learned representations: (i) Are the latent representations in Masked Autoencoders linearly separable if the input is a mixture of two images instead of one? This can provide concrete evidence to explain why MAE-learned representations perform so well on downstream tasks, as impressively demonstrated in the literature. (ii) What is the degree of semantics encoded in the latent feature space by Masked Autoencoders? To explore these two problems, we propose a simple yet effective Interpretable MAE (i-MAE) framework with a two-way image reconstruction and a latent feature reconstruction with distillation loss to help us understand the behaviors inside MAE's structure. Extensive experiments are conducted on the CIFAR-10/100, Tiny-ImageNet and ImageNet-1K datasets to verify the observations we discovered. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two novel metrics. The surprising and consistent results across the qualitative and quantitative experiments demonstrate that i-MAE is a superior framework design for interpretability research on MAE frameworks, as well as for achieving better representational ability. Code is available at https://github.com/vision-learning-acceleration-lab/i-mae.
    DELTA: Diverse Client Sampling for Fasting Federated Learning. (arXiv:2205.13925v2 [cs.LG] UPDATED)
    Partial client participation has been widely adopted in Federated Learning (FL) to efficiently reduce the communication burden. However, an improper client sampling scheme will select unrepresentative subsets, which cause a large variance in the model update and slow down the convergence. Existing sampling methods are either biased or can be further improved to accelerate the convergence. In this paper, we propose an unbiased sampling scheme, termed DELTA, to alleviate this problem. In particular, DELTA characterizes the impact of client diversity and local variance, and samples the representative clients who carry valuable information for global model updates. Moreover, DELTA is a provably optimal unbiased sampling scheme that minimizes the variance caused by partial client participation and achieves better convergence than other unbiased sampling schemes. We corroborate our results with experiments on both synthetic and real data sets.
    Algorithm for Constrained Markov Decision Process with Linear Convergence. (arXiv:2206.01666v2 [math.OC] UPDATED)
    The problem of constrained Markov decision processes is considered. An agent aims to maximize the expected accumulated discounted reward subject to multiple constraints on its costs (the number of constraints is relatively small). A new dual approach is proposed, integrating two ingredients: an entropy-regularized policy optimizer and Vaidya's dual optimizer, both of which are critical to achieving faster convergence. The finite-time error bound of the proposed approach is provided. Despite the challenge of the nonconcave objective subject to nonconcave constraints, the proposed approach is shown to converge (with linear rate) to the global optimum. The complexity, expressed in terms of the optimality gap and the constraint violation, significantly improves upon the existing primal-dual approaches.
    Global Convergence of SGD On Two Layer Neural Nets. (arXiv:2210.11452v1 [cs.LG])
    In this note we demonstrate provable convergence of SGD to the global minima of appropriately regularized $\ell_2$-empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates, if they are using adequately smooth and bounded activations like sigmoid and tanh. We build on the results in [1] and leverage a constant amount of Frobenius norm regularization on the weights, along with sampling of the initial weights from an appropriate distribution. We also give a continuous time SGD convergence result that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of loss functions on constant sized neural nets which are "Villani Functions".
    Scaling Instruction-Finetuned Language Models. (arXiv:2210.11416v1 [cs.LG])
    Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
    Finding the smallest or largest element of a tensor from its low-rank factors. (arXiv:2210.11413v1 [eess.SP])
    We consider the problem of finding the smallest or largest entry of a tensor of order $N$ that is specified via its rank decomposition. Stated in a different way, we are given $N$ sets of $R$-dimensional vectors and we wish to select one vector from each set such that the sum of the entries of the Hadamard product of the selected vectors is minimized or maximized. This is a fundamental tensor problem with numerous applications in embedding similarity search, recommender systems, graph mining, multivariate probability, and statistics. We show that this discrete optimization problem is NP-hard for any tensor rank higher than one, but also provide an equivalent continuous problem reformulation which is amenable to disciplined non-convex optimization. We propose a suite of gradient-based approximation algorithms whose performance in preliminary experiments appears to be promising.
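    To make the problem concrete, the following self-contained example evaluates entries directly from the factors and brute-forces the minimizer on a small random instance; the exponential cost of this enumeration in the order $N$ is exactly why relaxations are needed.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
R, dims = 3, (4, 5, 6)                 # order-3 tensor of rank 3
factors = [rng.standard_normal((d, R)) for d in dims]

def entry(idx):
    """T[i1, ..., iN]: pick one row per factor, take the Hadamard product
    of the selected R-dimensional vectors and sum its entries."""
    h = np.ones(R)
    for A, i in zip(factors, idx):
        h = h * A[i]
    return h.sum()

# Brute force over all index tuples -- exponential in the order N.
best = min(product(*[range(d) for d in dims]), key=entry)
T = np.einsum('ir,jr,kr->ijk', *factors)   # explicit reconstruction as a check
assert np.unravel_index(T.argmin(), dims) == best
```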
    How Infinitely Wide Neural Networks Can Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v4 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters promotes multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v2 [stat.ML] UPDATED)
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is an effective approach to defend against such examples. It is formulated as a min-max problem, searching for the best solution when the training data is corrupted by worst-case attacks. For linear regression problems, adversarial training can be formulated as a convex problem. We use this reformulation to make two technical contributions: First, we formulate the training problem as an instance of robust regression to reveal its connection to parameter-shrinking methods, specifically that $\ell_\infty$-adversarial training produces sparse solutions. Second, we study adversarial training in the overparameterized regime, i.e. when there are more parameters than data. We prove that adversarial training with small disturbances gives the solution with the minimum norm that interpolates the training data. Ridge regression and lasso approximate such interpolating solutions as their regularization parameter vanishes. By contrast, for adversarial training, the transition into the interpolation regime is abrupt and happens at non-zero values of the disturbance. This result is proved and illustrated with numerical examples.
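    The convex reformulation rests on a closed-form inner maximization: for $\ell_\infty$ perturbations of the inputs, $\max_{\|\delta\|_\infty \le \varepsilon} \big(y - (x+\delta)^\top w\big)^2 = \big(|y - x^\top w| + \varepsilon \|w\|_1\big)^2$, which makes the $\ell_1$-shrinkage effect explicit. Below is an illustrative subgradient-descent sketch of this objective, not the authors' solver:

```python
import numpy as np

def adversarial_risk(w, X, y, eps):
    """Closed-form l_inf adversarial squared loss, averaged over the data."""
    return np.mean((np.abs(y - X @ w) + eps * np.abs(w).sum()) ** 2)

def adv_train(X, y, eps, lr=1e-2, iters=5000):
    """Minimize the adversarial risk by plain subgradient descent (sketch)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        r = y - X @ w
        a = np.abs(r) + eps * np.abs(w).sum()
        # subgradient of mean(a^2) with respect to w
        g = (2 * a * (-np.sign(r))) @ X / n + 2 * eps * a.mean() * np.sign(w)
        w -= lr * g
    return w
```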
    Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees. (arXiv:2210.11327v1 [cs.LG])
    Real-world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, its ability to generalize out of distribution. Also, each example might contribute differently towards learning. This motivates studying the role of data instances with respect to their contribution to model performance. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which Decision Tree ensembles are still the state-of-the-art in terms of performance. We show results on detecting noisy labels in order to remove them, improving models' metrics on synthetic and real datasets, as well as on a production dataset. Our method achieved the best results overall when compared with confident learning and heuristics.
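    A hedged sketch of the idea using scikit-learn: track the predicted probability of the true label across boosting stages via `staged_predict_proba`. The specific confidence/variability metrics below are in the spirit of training-dynamics "data maps", not necessarily the paper's exact definitions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

def training_dynamics_metrics(X, y, **gbdt_kwargs):
    """Per-example training dynamics of a GBDT: mean and std of the true-class
    probability across boosting stages. Chronically low-confidence examples
    are noisy-label candidates. Assumes y holds integer class ids 0..K-1."""
    model = GradientBoostingClassifier(**gbdt_kwargs).fit(X, y)
    stages = np.stack([proba[np.arange(len(y)), y]
                       for proba in model.staged_predict_proba(X)])
    return stages.mean(axis=0), stages.std(axis=0)   # confidence, variability

# e.g. flag the lowest-confidence decile for inspection or removal:
# conf, var = training_dynamics_metrics(X_train, y_train, n_estimators=200)
# suspects = np.argsort(conf)[: len(conf) // 10]
```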
    Structure-based drug design with geometric deep learning. (arXiv:2210.11250v1 [physics.chem-ph])
    Structure-based drug design uses three-dimensional geometric information of macromolecules, such as proteins or nucleic acids, to identify suitable ligands. Geometric deep learning, an emerging concept of neural-network-based machine learning, has been applied to macromolecular structures. This review provides an overview of the recent applications of geometric deep learning in bioorganic and medicinal chemistry, highlighting its potential for structure-based drug discovery and design. Emphasis is placed on molecular property prediction, ligand binding site and pose prediction, and structure-based de novo molecular design. The current challenges and opportunities are highlighted, and a forecast of the future of geometric deep learning for drug discovery is presented.
    Dialogue-adaptive Language Model Pre-training From Quality Estimation. (arXiv:2009.04984v2 [cs.CL] UPDATED)
    Pre-trained language models (PrLMs) have achieved great success on a wide range of natural language processing tasks by virtue of the universal language representation ability obtained by self-supervised learning on a large corpus. These models are pre-trained on standard plain texts with general language model (LM) training objectives, which would be insufficient to model dialogue-exclusive attributes like specificity and informativeness reflected in these tasks that are not explicitly captured by the pre-trained universal language representations. In this work, we propose dialogue-adaptive pre-training objectives (DAPO) derived from quality estimation to simulate dialogue-specific features, namely coherence, specificity, and informativeness. As the foundation for model pre-training, we synthesize a new dialogue corpus and build our training set with two unsupervised methods: 1) coherence-oriented context corruption, including utterance ordering, insertion, and replacement, to help the model capture the coherence inside the dialogue contexts; and 2) specificity-oriented automatic rescoring, which encourages the model to measure the quality of the synthesized data for dialogue-adaptive pre-training by considering specificity and informativeness. Experimental results on widely used open-domain response selection and quality estimation benchmarks show that DAPO significantly improves the baseline models and achieves state-of-the-art performance on the MuTual leaderboard, verifying the effectiveness of estimating quality evaluation factors into pre-training.
    Hierarchical Deep Learning with Generative Adversarial Network for Automatic Cardiac Diagnosis from ECG Signals. (arXiv:2210.11408v1 [eess.SP])
    Cardiac disease is the leading cause of death in the US. Accurate heart disease detection is of critical importance for timely medical treatment to save patients' lives. Routine use of electrocardiogram (ECG) is the most common method for physicians to assess the electrical activities of the heart and detect possible abnormal cardiac conditions. Fully utilizing the ECG data for reliable heart disease detection depends on developing effective analytical models. In this paper, we propose a two-level hierarchical deep learning framework with Generative Adversarial Network (GAN) for automatic diagnosis of ECG signals. The first-level model is composed of a Memory-Augmented Deep auto-Encoder with GAN (MadeGAN), which aims to differentiate abnormal signals from normal ECGs for anomaly detection. The second-level learning aims at robust multi-class classification for different arrhythmias identification, which is achieved by integrating the transfer learning technique to transfer knowledge from the first-level learning with the multi-branching architecture to handle the data-scarcity and class-imbalance issues. We evaluate the performance of the proposed framework using real-world medical data from the MIT-BIH arrhythmia database. Experimental results show that our proposed model outperforms existing methods that are commonly used in current practice.
    Multi-Head Cross-Attentional PPG and Motion Signal Fusion for Heart Rate Estimation. (arXiv:2210.11415v1 [eess.SP])
    Nowadays, Heart Rate (HR) monitoring is a key feature of almost all wrist-worn devices exploiting photoplethysmography (PPG) sensors. However, arm movements affect the performance of PPG-based HR tracking. This issue is usually addressed by fusing the PPG signal with data produced by inertial measurement units. Thus, deep learning algorithms have been proposed, but they are considered too complex to deploy on wearable devices and lack the explainability of results. In this work, we present a new deep learning model, PULSE, which exploits temporal convolutions and multi-head cross-attention to improve sensor fusion's effectiveness and achieve a step towards explainability. We evaluate the performance of PULSE on three publicly available datasets, reducing the mean absolute error by 7.56% on the most extensive available dataset, PPG-DaLiA. Finally, we demonstrate the explainability of PULSE and the benefits of applying attention modules to PPG and motion data.
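    A minimal sketch of the fusion pattern described above: PPG tokens attend over accelerometer tokens via multi-head cross-attention, so motion context can explain away movement artefacts. The temporal-conv encoders, dimensions and regression head are assumptions, not the PULSE architecture itself.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Cross-attention fusion of a PPG stream and a 3-axis accelerometer
    stream for HR regression (illustrative sketch)."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.ppg_enc = nn.Conv1d(1, dim, kernel_size=7, padding=3)  # 1 PPG channel
        self.imu_enc = nn.Conv1d(3, dim, kernel_size=7, padding=3)  # 3-axis accel
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 1)                               # HR estimate

    def forward(self, ppg, imu):                # ppg: (B, 1, T), imu: (B, 3, T)
        q = self.ppg_enc(ppg).transpose(1, 2)   # queries from PPG, (B, T, dim)
        kv = self.imu_enc(imu).transpose(1, 2)  # keys/values from motion
        fused, _ = self.attn(q, kv, kv)
        return self.head(fused.mean(dim=1))     # temporal pooling -> HR
```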
    Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning. (arXiv:2102.08581v3 [cs.LG] UPDATED)
    In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it into the RL agent often interferes with RL training and degenerates sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering the training environments regardless of generalization by adaptively deciding which augmentation (if any) to use during training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular the benefit of postponing the distillation to the end of RL training.
    Topological structure of complex predictions. (arXiv:2207.14358v3 [cs.LG] UPDATED)
    Complex prediction models such as deep learning are the output from fitting machine learning, neural networks, or AI models to a set of training data. These are now standard tools in science. A key challenge with the current generation of models is that they are highly parameterized, which makes describing and interpreting the prediction strategies difficult. We use topological data analysis to transform these complex prediction models into pictures representing a topological view. The result is a map of the predictions that enables inspection. The methods scale up to large datasets across different domains and enable us to detect labeling errors in training data, understand generalization in image classification, and inspect predictions of likely pathogenic mutations in the BRCA1 gene.
    A survey on Self Supervised learning approaches for improving Multimodal representation learning. (arXiv:2210.11024v1 [cs.LG])
    Recently, self-supervised learning has seen explosive growth and use in a variety of machine learning tasks because of its ability to avoid the cost of annotating large-scale datasets. This paper gives an overview of the best self-supervised learning approaches for multimodal learning. The presented approaches have been aggregated through an extensive study of the literature and tackle the application of self-supervised learning in different ways. The approaches discussed are cross-modal generation, cross-modal pretraining, cyclic translation, and generating unimodal labels in a self-supervised fashion.
    Interpretable Machine Learning for Detection and Classification of Ransomware Families Based on API Calls. (arXiv:2210.11235v1 [cs.CR])
    Ransomware has appeared as one of the major global threats in recent days. The alarmingly increasing rate of ransomware attacks and new ransomware variants intrigues researchers to constantly examine the distinguishing traits of ransomware and refine their detection strategies. An Application Programming Interface (API) is a way for one program to collaborate with another; API calls are the medium by which they communicate. Ransomware uses this strategy to interact with the OS and makes a significantly higher number of calls in different sequences to request actions. This research work utilizes the frequencies of different API calls to detect and classify ransomware families. First, a WebCrawler is developed to automate collecting the Windows Portable Executable (PE) files of 15 different ransomware families. By extracting the frequencies of 68 different API calls, we develop our dataset in the first phase of the two-phase feature engineering process. After selecting the most significant features in the second phase of the feature engineering process, we deploy six Supervised Machine Learning models: Naive Bayes, Logistic Regression, Random Forest, Stochastic Gradient Descent, K-Nearest Neighbor, and Support Vector Machine. Then the performances of all the classifiers are compared to select the best model. The results reveal that Logistic Regression can efficiently classify ransomware into their corresponding families, securing 99.15% accuracy. Finally, instead of relying on the black-box characteristic of the Machine Learning models, we present the interpretability of our best-performing model using SHAP values to ascertain the transparency and trustworthiness of the model's predictions.
    MixMask: Revisiting Masked Siamese Self-supervised Learning in Asymmetric Distance. (arXiv:2210.11456v1 [cs.CV])
    Recent advances in self-supervised learning integrate Masked Modeling and Siamese Networks into a single framework to fully reap the advantages of both techniques. However, the erasing-based masking scheme in previous masked image modeling was not originally designed for siamese networks. Existing approaches simply inherit the default loss design from previous siamese networks and ignore the information loss and distance change after employing the masking operation in these frameworks. In this paper, we propose a filling-based masking strategy called MixMask to prevent the information loss caused by the randomly erased areas of an image in the vanilla masking method. We further introduce a dynamic loss function design with soft distance to adapt the integrated architecture and avoid mismatches between the transformed input and the objective in Masked Siamese ConvNets (MSCN). The dynamic loss distance is calculated according to the proposed mix-masking scheme. Extensive experiments are conducted on the CIFAR-100, Tiny-ImageNet and ImageNet-1K datasets. The results demonstrate that the proposed framework can achieve better accuracy on linear probing, semi-supervised and supervised finetuning, outperforming the state-of-the-art MSCN by a significant margin. We also show its superiority on the downstream tasks of object detection and segmentation. Our source code is available at https://github.com/LightnessOfBeing/MixMask.
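    The filling-based masking itself is simple to sketch: rather than zeroing out patches, the masked regions of one image are filled with the corresponding regions of another, so no pixel information is discarded. The patch size and the 0.5 Bernoulli rate below are illustrative assumptions.

```python
import torch

def mixmask(x1, x2, patch=32):
    """Fill a random block mask of x1 with the corresponding blocks of x2.
    Assumes H and W are divisible by `patch`."""
    B, C, H, W = x1.shape
    m = (torch.rand(B, 1, H // patch, W // patch, device=x1.device) > 0.5).float()
    m = m.repeat_interleave(patch, dim=2).repeat_interleave(patch, dim=3)
    return m * x1 + (1.0 - m) * x2   # complementary regions from the two images
```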
    On Feature Learning in the Presence of Spurious Correlations. (arXiv:2210.11369v1 [cs.LG])
    Deep classifiers are known to rely on spurious features -- patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learned by standard empirical risk minimization (ERM) and specialized group robustness training. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is greatly affected by the design decisions beyond the training method, such as the model architecture and pre-training strategy. On the other hand, we find that strong regularization is not necessary for learning high quality feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies, respectively.
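    The DFR-style evaluation protocol used here is simple to state in code. A hedged sketch, assuming access to frozen penultimate-layer features and group labels on a held-out split where the spurious correlation is broken; the function names and the balancing heuristic are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dfr_probe(embed_fn, X_heldout, y_heldout, groups):
    """Retrain only the last layer on a (label, group)-balanced held-out
    split, on top of frozen backbone features."""
    Z = embed_fn(X_heldout)                      # frozen penultimate features
    y, g = np.asarray(y_heldout), np.asarray(groups)
    cells = [(yy, gg) for yy in np.unique(y) for gg in np.unique(g)]
    masks = {c: (y == c[0]) & (g == c[1]) for c in cells}
    n_min = min(int(m.sum()) for m in masks.values() if m.any())
    idx = np.concatenate([np.flatnonzero(m)[:n_min]
                          for m in masks.values() if m.any()])
    return LogisticRegression(max_iter=2000).fit(Z[idx], y[idx])
```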
    Play It Back: Iterative Attention for Audio Recognition. (arXiv:2210.11328v1 [cs.SD])
    A key function of auditory cognition is the association of characteristic sounds with their corresponding semantics over time. Humans attempting to discriminate between fine-grained audio categories, often replay the same discriminative sounds to increase their prediction confidence. We propose an end-to-end attention-based architecture that through selective repetition attends over the most discriminative sounds across the audio sequence. Our model initially uses the full audio sequence and iteratively refines the temporal segments replayed based on slot attention. At each playback, the selected segments are replayed using a smaller hop length which represents higher resolution features within these segments. We show that our method can consistently achieve state-of-the-art performance across three audio-classification benchmarks: AudioSet, VGG-Sound, and EPIC-KITCHENS-100.
    Safe Policy Improvement in Constrained Markov Decision Processes. (arXiv:2210.11259v1 [cs.LG])
    The automatic synthesis of a policy through reinforcement learning (RL) from a given set of formal requirements depends on the construction of a reward signal and consists of the iterative application of many policy-improvement steps. The synthesis algorithm has to balance target, safety, and comfort requirements in a single objective and to guarantee that the policy improvement does not increase the number of safety-requirements violations, especially for safety-critical applications. In this work, we present a solution to the synthesis problem by solving its two main challenges: reward-shaping from a set of formal requirements and safe policy update. For the former, we propose an automatic reward-shaping procedure, defining a scalar reward signal compliant with the task specification. For the latter, we introduce an algorithm ensuring that the policy is improved in a safe fashion with high-confidence guarantees. We also discuss the adoption of a model-based RL algorithm to efficiently use the collected data and train a model-free agent on the predicted trajectories, where the safety violation does not have the same impact as in the real world. Finally, we demonstrate in standard control benchmarks that the resulting learning procedure is effective and robust even under heavy perturbations of the hyperparameters.
    Solving Reasoning Tasks with a Slot Transformer. (arXiv:2210.11394v1 [cs.LG])
    The ability to carve the world into useful abstractions in order to reason about time and space is a crucial component of intelligence. In order to perceive and act effectively using our senses, we must parse and compress large amounts of information for further downstream reasoning to take place, allowing increasingly complex concepts to emerge. If there is any hope to scale representation learning methods to work with real-world scenes and temporal dynamics, then there must be a way to learn accurate, concise, and composable abstractions across time. We present the Slot Transformer, an architecture that leverages slot attention, transformers and iterative variational inference on video scene data to infer such representations. We evaluate the Slot Transformer on the CLEVRER, Kinetics-600 and CATER datasets and demonstrate that the approach allows us to develop robust modeling and reasoning around complex behaviours, with scores on these datasets that compare favourably to existing baselines. Finally, we evaluate the effectiveness of key components of the architecture, the model's representational capacity and its ability to predict from incomplete input.
    On Representations of Mean-Field Variational Inference. (arXiv:2210.11385v1 [stat.ML])
    The mean field variational inference (MFVI) formulation restricts the general Bayesian inference problem to the subspace of product measures. We present a framework to analyze MFVI algorithms, which is inspired by a similar development for general variational Bayesian formulations. Our approach enables the MFVI problem to be represented in three different manners: a gradient flow on Wasserstein space, a system of Fokker-Planck-like equations and a diffusion process. Rigorous guarantees are established to show that a time-discretized implementation of the coordinate ascent variational inference algorithm in the product Wasserstein space of measures yields a gradient flow in the limit. A similar result is obtained for their associated densities, with the limit being given by a quasi-linear partial differential equation. A popular class of practical algorithms falls in this framework, which provides tools to establish convergence. We hope this framework could be used to guarantee convergence of algorithms in a variety of approaches, old and new, to solve variational inference problems.
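    For context, the coordinate ascent update whose time-discretized limit the paper analyzes can be stated as follows (standard CAVI; the notation here is assumed, not taken from the paper):

```latex
% Coordinate ascent variational inference (CAVI) over a product measure
% q(x) = \prod_i q_i(x_i): each factor is updated in turn via
q_i^{(t+1)}(x_i) \;\propto\; \exp\!\Big( \mathbb{E}_{\prod_{j\neq i} q_j^{(t)}}\big[\log p(x_1,\dots,x_d)\big] \Big)
```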
    MBTI Personality Prediction for Fictional Characters Using Movie Scripts. (arXiv:2210.10994v1 [cs.AI])
    An NLP model that understands stories should be able to understand the characters in them. To support the development of neural models for this purpose, we construct a benchmark, Story2Personality. The task is to predict a movie character's MBTI or Big 5 personality types based on the narratives of the character. Experiments show that our task is challenging for existing text classification models, as none is able to largely outperform random guesses. We further propose a multi-view model for personality prediction using both verbal and non-verbal descriptions, which improves over using only verbal descriptions. The uniqueness and challenges of our dataset call for the development of narrative comprehension techniques from the perspective of understanding characters.
    Accurate Extrinsic Prediction of Physical Systems Using Transformers. (arXiv:2210.11269v1 [cs.LG])
    Accurate high-altitude wind forecasting is important for air traffic control, and the large volume of data available for this task makes deep neural network-based models a possibility. However, special methods are required because the data is measured only sparsely: along the main aircraft trajectories, arranged sparsely in space along the main air corridors. Several deep learning approaches have been proposed, and in this work, we show that Transformers can fit this data efficiently and are able to extrapolate coherently from a context set. We show this through an extensive comparison of Transformers to numerous existing deep learning-based baselines in the literature. Besides high-altitude wind forecasting, we compare competing models on other dynamical physical systems, namely those modelled by partial differential equations, in particular the Poisson equation and the Darcy Flow equation. For these experiments, in the case where the data is arranged non-regularly in space, Transformers outperform all the other evaluated methods. We also compare them in a more standard setup where the data is arranged on a grid and show that Transformers are competitive with state-of-the-art methods, even though they do not require regular spacing. The code and datasets of the different experiments will be made publicly available at publication time.
    Standardized Medical Image Classification across Medical Disciplines. (arXiv:2210.11091v1 [eess.IV])
    AUCMEDI is a Python-based framework for medical image classification. In this paper, we evaluate the capabilities of AUCMEDI by applying it to multiple datasets, chosen specifically to cover a variety of medical disciplines and imaging modalities. We designed a simple pipeline using Jupyter notebooks and applied it to all datasets. Results show that AUCMEDI was able to train a model with accurate classification capabilities for each dataset: averaged AUCs per dataset range between 0.82 and 1.0, and averaged F1 scores range between 0.61 and 1.0. With its high adaptability and strong performance, AUCMEDI proves to be a powerful instrument for building widely applicable neural networks. The notebooks serve as application examples for AUCMEDI.
    Transcending Scaling Laws with 0.1% Extra Compute. (arXiv:2210.11399v1 [cs.CL])
    Scaling language models improves performance but comes with significant computational costs. This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute. The key idea is to continue training a state-of-the-art large language model (e.g., PaLM) for a few more steps with UL2's mixture-of-denoisers objective. We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics. In this paper, we continue training PaLM with UL2R, introducing a new set of models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B scale, we show an approximately 2x computational savings rate where U-PaLM achieves the same performance as the final PaLM 540B model at around half its computational budget (i.e., saving $\sim$4.4 million TPUv4 hours). We further show that this improved scaling curve leads to 'emergent abilities' on challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM on some tasks or demonstrates better quality at much smaller scale (62B as opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many few-shot setups, including English NLP tasks (e.g., commonsense reasoning, question answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide qualitative examples showing the new capabilities of U-PaLM for single and multi-span infilling.
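    A toy sketch of the span-corruption style of denoising objective behind UL2's mixture of denoisers (span lengths, corruption rates, and sentinel tokens below are illustrative, not the UL2R settings):

```python
# Mask contiguous spans and train the model to reconstruct them from
# sentinel tokens; UL2 mixes several such denoisers with different settings.
import random

def corrupt(tokens, span_len=3, rate=0.15, seed=0):
    rng = random.Random(seed)
    n_spans = max(1, int(len(tokens) * rate / span_len))
    # Overlap handling omitted for brevity in this sketch.
    starts = sorted(rng.sample(range(len(tokens) - span_len), n_spans))
    inp, targets = list(tokens), []
    for i, s in enumerate(starts):
        targets.append((f"<extra_{i}>", tokens[s:s + span_len]))
        inp[s:s + span_len] = [f"<extra_{i}>"] + [None] * (span_len - 1)
    return [t for t in inp if t is not None], targets

tokens = "the quick brown fox jumps over the lazy dog again and again".split()
print(corrupt(tokens))
```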
    Reproducibility of the Methods in Medical Imaging with Deep Learning. (arXiv:2210.11146v1 [cs.LG])
    Concerns about the reproducibility of deep learning research are more prominent than ever, with no clear solution in sight. The relevance of machine learning research can only be improved if we also employ empirical rigor that incorporates reproducibility guidelines, especially so in the medical imaging field. The Medical Imaging with Deep Learning (MIDL) conference has made advancements in this direction by advocating open access, and recently also recommending that authors make their code public - both aspects being adopted by the majority of the conference submissions. This helps the reproducibility of the methods; however, there is currently little or no support for further evaluation of this supplementary material, leaving it vulnerable to poor quality, which affects the impact of the entire submission. We have evaluated all accepted full paper submissions to MIDL between 2018 and 2022 using established, but slightly adjusted, guidelines on reproducibility and the quality of the public repositories. The evaluations show that publishing repositories and using public datasets are becoming more popular, which helps traceability, but the quality of the repositories has not improved over the years, leaving room for improvement in every aspect of designing repositories. Merely 22% of all submissions contain a repository that was deemed repeatable by our evaluations. From the commonly encountered issues during the evaluations, we propose a set of guidelines for machine learning-related research for medical imaging applications, adjusted specifically for future submissions to MIDL.
    Context-driven Visual Object Recognition based on Knowledge Graphs. (arXiv:2210.11233v1 [cs.AI])
    Current deep learning methods for object recognition are purely data-driven and require a large number of training samples to achieve good results. Due to their sole dependence on image data, these methods tend to fail when confronted with new environments where even small deviations occur. Human perception, however, has proven to be significantly more robust to such distribution shifts, an ability assumed to rest on extensive incorporation of contextual knowledge. Context can be based either on object co-occurrences in a scene or on memory of experience. In accordance with the human visual cortex, which uses context to form different object representations for a seen image, we propose an approach that enhances deep learning methods with external contextual knowledge encoded in a knowledge graph. To this end, we extract different contextual views from a generic knowledge graph, transform the views into vector space and infuse them into a DNN. We conduct a series of experiments to investigate the impact of different contextual views on the learned object representations for the same image dataset. The experimental results provide evidence that the contextual views influence the image representations in the DNN differently and therefore lead to different predictions for the same images. We also show that context helps to strengthen the robustness of object recognition models for out-of-distribution images, as usually occur in transfer learning tasks or real-world scenarios.
    Dynamic selection of p-norm in linear adaptive filtering via online kernel-based reinforcement learning. (arXiv:2210.11317v1 [eess.SP])
    This study addresses the problem of dynamically selecting, at each time instance, the ``optimal'' p-norm to combat outliers in linear adaptive filtering without any knowledge of the potentially time-varying probability distribution function of the outliers. To this end, an online and data-driven framework is designed via kernel-based reinforcement learning (KBRL). Novel Bellman mappings on reproducing kernel Hilbert spaces (RKHSs) are introduced that need no knowledge of the transition probabilities of Markov decision processes and are nonexpansive with respect to the underlying Hilbertian norm. An approximate policy-iteration framework is finally offered via the introduction of a finite-dimensional affine superset of the fixed-point set of the proposed Bellman mappings. The well-known ``curse of dimensionality'' in RKHSs is addressed by building a basis of vectors via an approximate linear dependency criterion. Numerical tests on synthetic data demonstrate that the proposed framework always selects the ``optimal'' p-norm for the outlier scenario at hand, outperforming at the same time several non-RL and KBRL schemes.
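    For concreteness, a single stochastic-gradient update of the p-norm (least mean p-power) adaptive filter among which such a framework would select (a textbook update, not the paper's KBRL code; all constants are illustrative):

```python
# Stochastic gradient step for min_w E|d - w.x|^p; p < 2 downweights
# large errors and is therefore robust to heavy-tailed outliers.
import numpy as np

def lmp_step(w, x, d, p=1.5, mu=0.01):
    e = d - w @ x
    return w + mu * p * np.abs(e) ** (p - 1) * np.sign(e) * x

rng = np.random.default_rng(0)
w_true, w = rng.normal(size=8), np.zeros(8)
for _ in range(5000):
    x = rng.normal(size=8)
    d = w_true @ x + 0.1 * rng.standard_t(df=3)   # heavy-tailed "outliers"
    w = lmp_step(w, x, d, p=1.2)
print(np.linalg.norm(w - w_true))                 # small after adaptation
```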
    Hypernetworks in Meta-Reinforcement Learning. (arXiv:2210.11348v1 [cs.LG])
    Training a reinforcement learning (RL) agent on a real-world robotics task remains generally impractical due to sample inefficiency. Multi-task RL and meta-RL aim to improve sample efficiency by generalizing over a distribution of related tasks. However, doing so is difficult in practice: In multi-task RL, state of the art methods often fail to outperform a degenerate solution that simply learns each task separately. Hypernetworks are a promising path forward since they replicate the separate policies of the degenerate solution while also allowing for generalization across tasks, and are applicable to meta-RL. However, evidence from supervised learning suggests hypernetwork performance is highly sensitive to the initialization. In this paper, we 1) show that hypernetwork initialization is also a critical factor in meta-RL, and that naive initializations yield poor performance; 2) propose a novel hypernetwork initialization scheme that matches or exceeds the performance of a state-of-the-art approach proposed for supervised settings, as well as being simpler and more general; and 3) use this method to show that hypernetworks can improve performance in meta-RL by evaluating on multiple simulated robotics benchmarks.
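    A minimal sketch of the hypernetwork idea: a network that maps a task embedding to the weights of a per-task policy head (dimensions and architecture are made up; the initialization scheme, the paper's focus, is omitted here):

```python
# A hypernetwork generates the parameters of a linear policy head from a
# task embedding, so one model can represent per-task policies.
import torch
import torch.nn as nn

class HyperPolicy(nn.Module):
    def __init__(self, task_dim=8, obs_dim=16, act_dim=4, hidden=64):
        super().__init__()
        self.act_dim, self.obs_dim = act_dim, obs_dim
        n_params = act_dim * obs_dim + act_dim      # weights + bias
        self.hyper = nn.Sequential(
            nn.Linear(task_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_params)
        )

    def forward(self, task_emb, obs):
        theta = self.hyper(task_emb)                # generate policy weights
        W = theta[: self.act_dim * self.obs_dim].view(self.act_dim, self.obs_dim)
        b = theta[self.act_dim * self.obs_dim :]
        return obs @ W.T + b                        # per-task policy logits

policy = HyperPolicy()
print(policy(torch.randn(8), torch.randn(5, 16)).shape)  # -> (5, 4)
```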
    Krylov-Bellman boosting: Super-linear policy evaluation in general state spaces. (arXiv:2210.11377v1 [stat.ML])
    We present and analyze the Krylov-Bellman Boosting (KBB) algorithm for policy evaluation in general state spaces. It alternates between fitting the Bellman residual using non-parametric regression (as in boosting), and estimating the value function via the least-squares temporal difference (LSTD) procedure applied with a feature set that grows adaptively over time. By exploiting the connection to Krylov methods, we equip this method with two attractive guarantees. First, we provide a general convergence bound that allows for separate estimation errors in residual fitting and LSTD computation. Consistent with our numerical experiments, this bound shows that convergence rates depend on the restricted spectral structure, and are typically super-linear. Second, by combining this meta-result with sample-size dependent guarantees for residual fitting and LSTD computation, we obtain concrete statistical guarantees that depend on the sample size along with the complexity of the function class used to fit the residuals. We illustrate the behavior of the KBB algorithm for various types of policy evaluation problems, and typically find large reductions in sample complexity relative to the standard approach of fitted value iteration.
    Graph Neural Networks with Trainable Adjacency Matrices for Fault Diagnosis on Multivariate Sensor Data. (arXiv:2210.11164v1 [cs.AI])
    Timely detection of anomalies in chemical technological processes, together with early identification of the cause of a fault, significantly reduces production costs in industrial factories. Data on the state of the technological process and the operation of production equipment are received from a large number of different sensors. To better predict the behavior of the process and equipment, it is necessary not only to consider the behavior of the signals from each sensor separately, but also to take into account their correlations and hidden relationships with each other. Graph-based data representation helps with this: graph nodes can represent data from the different sensors, and edges can capture the influence of these data on each other. In this work, we study the applicability of graph neural networks to the problem of fault diagnosis in a chemical process. We propose to construct the graph during the training of the graph neural network, which makes it possible to train models on data where the dependencies between the sensors are not known in advance. We consider several methods for obtaining adjacency matrices and study their quality, and we also propose using multiple adjacency matrices in one model. We show state-of-the-art performance on the fault diagnosis task with the Tennessee Eastman Process dataset, where the proposed graph neural networks outperform recurrent neural networks.
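    A minimal sketch of one way to make an adjacency matrix trainable (in the spirit of learned-adjacency models such as Graph WaveNet; this parameterization is an assumption, not necessarily the paper's):

```python
# Learn node embeddings whose inner products define a differentiable
# adjacency matrix, trained end-to-end with the rest of the GNN.
import torch
import torch.nn as nn

class LearnedAdjacency(nn.Module):
    def __init__(self, n_sensors, emb_dim=16):
        super().__init__()
        self.src = nn.Parameter(torch.randn(n_sensors, emb_dim))
        self.dst = nn.Parameter(torch.randn(n_sensors, emb_dim))

    def forward(self):
        # ReLU keeps edges sparse-ish; row softmax normalizes neighborhoods.
        logits = torch.relu(self.src @ self.dst.T)
        return torch.softmax(logits, dim=1)

adj = LearnedAdjacency(n_sensors=52)   # e.g., Tennessee Eastman variable count
A = adj()                              # (52, 52), updated by backprop
x = torch.randn(52, 8)                 # per-sensor features
print((A @ x).shape)                   # one step of message passing
```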
    Knowledge Graph Enhanced Relation Extraction Datasets. (arXiv:2210.11231v1 [cs.LG])
    Knowledge-enhanced methods that take advantage of auxiliary knowledge graphs have recently emerged in relation extraction, and they surpass traditional text-based relation extraction methods. However, there are currently no unified public benchmarks that involve both evidence sentences and knowledge graphs for knowledge-enhanced relation extraction. To address this issue, we propose KGRED, a knowledge graph enhanced relation extraction dataset with the following features: (1) the benchmarks are based on widely-used distantly supervised relation extraction datasets; (2) we refine these existing datasets to improve the data quality, and we construct auxiliary knowledge graphs for them through entity linking to support knowledge-enhanced relation extraction tasks; (3) with the new benchmarks we curate, we build baselines in two popular relation extraction settings, sentence-level and bag-level relation extraction, and compare the latest knowledge-enhanced relation extraction methods. KGRED provides high-quality relation extraction datasets with auxiliary knowledge graphs for evaluating the performance of knowledge-enhanced relation extraction methods. Meanwhile, our experiments on KGRED reveal the influence of knowledge graph information on relation extraction tasks.
    Network Synthetic Interventions: A Framework for Panel Data with Network Interference. (arXiv:2210.11355v1 [econ.EM])
    We propose a generalization of the synthetic controls and synthetic interventions methodology to incorporate network interference. We consider the estimation of unit-specific treatment effects from panel data where there are spillover effects across units and in the presence of unobserved confounding. Key to our approach is a novel latent factor model that takes into account network interference and generalizes the factor models typically used in panel data settings. We propose an estimator, "network synthetic interventions", and show that it consistently estimates the mean outcomes for a unit under an arbitrary sequence of treatments for itself and its neighborhood, given certain observation patterns hold in the data. We corroborate our theoretical findings with simulations.
    Breaking Bad: A Dataset for Geometric Fracture and Reassembly. (arXiv:2210.11463v1 [cs.CV])
    We introduce Breaking Bad, a large-scale dataset of fractured objects. Our dataset consists of over one million fractured objects simulated from ten thousand base models. The fracture simulation is powered by a recent physically based algorithm that efficiently generates a variety of fracture modes of an object. Existing shape assembly datasets decompose objects according to semantically meaningful parts, effectively modeling the construction process. In contrast, Breaking Bad models the destruction process of how a geometric object naturally breaks into fragments. Our dataset serves as a benchmark that enables the study of fractured object reassembly and presents new challenges for geometric shape understanding. We analyze our dataset with several geometry measurements and benchmark three state-of-the-art shape assembly deep learning methods under various settings. Extensive experimental results demonstrate the difficulty of our dataset, calling on future research in model designs specifically for the geometric shape assembly task. We host our dataset at https://breaking-bad-dataset.github.io/.
    A Magnetic Framelet-Based Convolutional Neural Network for Directed Graphs. (arXiv:2210.10993v1 [cs.LG])
    Spectral Graph Convolutional Networks (spectral GCNNs), a powerful tool for analyzing and processing graph data, typically apply frequency filtering via the Fourier transform to obtain representations with selective information. Although research shows that spectral GCNNs can be enhanced by framelet-based filtering, the vast majority of such research considers only undirected graphs. In this paper, we introduce Framelet-MagNet, a magnetic framelet-based spectral GCNN for directed graphs (digraphs). The model applies the framelet transform to digraph signals to form a more sophisticated representation for filtering. Digraph framelets are constructed with the complex-valued magnetic Laplacian, simultaneously leading to signal processing in both the real and complex domains. We empirically validate the predictive power of Framelet-MagNet over a range of state-of-the-art models in node classification, link prediction, and denoising.
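    For concreteness, a numpy sketch of the (normalized) magnetic Laplacian that such digraph framelets are built on; the charge parameter q and the toy graph below are illustrative:

```python
# The magnetic Laplacian: a complex-valued Hermitian matrix whose phase
# encodes edge direction, enabling spectral methods on digraphs.
import numpy as np

def magnetic_laplacian(A, q=0.25):
    A_s = 0.5 * (A + A.T)                          # symmetrized edge weights
    theta = 2 * np.pi * q * (A - A.T)              # phase encodes direction
    d = np.maximum(A_s.sum(axis=1), 1e-12)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    H = (D_inv_sqrt @ A_s @ D_inv_sqrt) * np.exp(1j * theta)
    return np.eye(len(A)) - H                      # Hermitian by construction

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]], dtype=float)             # a directed 3-cycle
L = magnetic_laplacian(A)
print(np.allclose(L, L.conj().T))                  # True: safe for spectral use
```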
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v1 [cs.LG])
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.
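    As a toy illustration of inference-time granularity control (the hierarchy, scores, and threshold rule below are made up; the paper's contribution is computing full operating characteristic curves over such thresholds):

```python
# Descend the class hierarchy while the best child is confident enough;
# otherwise stop early and emit a coarser, safer prediction.
children = {                        # hypothetical toy hierarchy
    "root": ["animal", "vehicle"],
    "animal": ["dog", "cat"],
    "vehicle": ["car", "bicycle"],
}
scores = {"animal": 0.9, "vehicle": 0.1, "dog": 0.55, "cat": 0.35,
          "car": 0.05, "bicycle": 0.05}

def predict(threshold):
    node = "root"
    while node in children:
        best = max(children[node], key=scores.get)
        if scores[best] < threshold:
            break                   # stay coarse when uncertain
        node = best
    return node

print(predict(0.8), predict(0.5))   # -> 'animal' (coarse), 'dog' (fine)
```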
    Toward Multiple Specialty Learners for Explaining GNNs via Online Knowledge Distillation. (arXiv:2210.11094v1 [cs.LG])
    Graph Neural Networks (GNNs) have become increasingly ubiquitous in numerous applications and systems, necessitating explanations of their predictions, especially when making critical decisions. However, explaining GNNs is challenging due to the complexity of graph data and model execution. Despite additional computational costs, post-hoc explanation approaches have been widely adopted due to the generality of their architectures. Intrinsically interpretable models provide instant explanations but are usually model-specific, which can only explain particular GNNs. Therefore, we propose a novel GNN explanation framework named SCALE, which is general and fast for explaining predictions. SCALE trains multiple specialty learners to explain GNNs since constructing one powerful explainer to examine attributions of interactions in input graphs is complicated. In training, a black-box GNN model guides learners based on an online knowledge distillation paradigm. In the explanation phase, explanations of predictions are provided by multiple explainers corresponding to trained learners. Specifically, edge masking and random walk with restart procedures are executed to provide structural explanations for graph-level and node-level predictions, respectively. A feature attribution module provides overall summaries and instance-level feature contributions. We compare SCALE with state-of-the-art baselines via quantitative and qualitative experiments to prove its explanation correctness and execution performance. We also conduct a series of ablation studies to understand the strengths and weaknesses of the proposed framework.
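    For concreteness, a minimal random walk with restart, the classical procedure the node-level explainer builds on (toy graph; not SCALE's implementation):

```python
# Random walk with restart (RWR): iterate a restart-biased walk until the
# stationary relevance vector for the seed node emerges.
import numpy as np

def rwr(A, seed, alpha=0.15, iters=100):
    P = A / A.sum(axis=0, keepdims=True)        # column-stochastic transitions
    e = np.zeros(len(A)); e[seed] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = (1 - alpha) * P @ r + alpha * e     # walk, restart at the seed
    return r                                    # relevance of nodes to seed

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
print(rwr(A, seed=0).round(3))
```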
    Removing grid structure in angle-resolved photoemission spectra via deep learning method. (arXiv:2210.11200v1 [cond-mat.mtrl-sci])
    Spectroscopic data often contain unwanted extrinsic signals. For example, in an ARPES experiment, a wire mesh is typically placed in front of the CCD to block stray photo-electrons, but it can imprint a grid-like structure on the spectra during quick measurement mode. In the past, this structure was often removed using mathematical Fourier filtering, erasing the periodic structure. However, this method can lead to information loss and vacancies in the spectra because the grid structure is not strictly linearly superimposed. Here, we propose a deep learning method to effectively overcome this problem. Our method takes advantage of the self-correlation information within the spectra themselves and can greatly improve the quality of the spectra while removing the grid structure and noise simultaneously. It has the potential to be extended to all spectroscopic measurements to eliminate other extrinsic signals and enhance spectral quality based solely on the self-correlation of the spectra.
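    For context, a toy version of the classical Fourier filtering baseline the authors improve on, which assumes the grid is linearly superimposed (fully synthetic data; illustrative only):

```python
# Notch out the periodic mesh signature in frequency space; this is the
# baseline that fails when the grid is not strictly additive.
import numpy as np

rng = np.random.default_rng(0)
clean = rng.normal(size=(128, 128))                  # stand-in ARPES intensity
xx = np.arange(128)[None, :] * np.ones((128, 1))     # column coordinate
measured = clean + 5.0 * np.cos(2 * np.pi * xx / 8)  # additive grid artifact

F = np.fft.fft2(measured)
mag = np.abs(F)
notch = mag > 10 * np.median(mag)                    # crude periodic-peak detector
notch[0, 0] = False                                  # never remove the DC term
F[notch] = 0.0
restored = np.fft.ifft2(F).real
print(float(np.abs(restored - clean).mean()))        # small residual error
```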
    Learning Rationalizable Equilibria in Multiplayer Games. (arXiv:2210.11402v1 [cs.LG])
    A natural goal in multiagent learning besides finding equilibria is to learn rationalizable behavior, where players learn to avoid iteratively dominated actions. However, even in the basic setting of multiplayer general-sum games, existing algorithms require a number of samples exponential in the number of players to learn rationalizable equilibria under bandit feedback. This paper develops the first line of efficient algorithms for learning rationalizable Coarse Correlated Equilibria (CCE) and Correlated Equilibria (CE) whose sample complexities are polynomial in all problem parameters including the number of players. To achieve this result, we also develop a new efficient algorithm for the simpler task of finding one rationalizable action profile (not necessarily an equilibrium), whose sample complexity substantially improves over the best existing results of Wu et al. (2021). Our algorithms incorporate several novel techniques to guarantee rationalizability and no (swap-)regret simultaneously, including a correlated exploration scheme and adaptive learning rates, which may be of independent interest. We complement our results with a sample complexity lower bound showing the sharpness of our guarantees.
    $r-$Adaptive Deep Learning Method for Solving Partial Differential Equations. (arXiv:2210.10900v1 [math.NA])
    We introduce an $r-$adaptive algorithm to solve Partial Differential Equations using a Deep Neural Network. The proposed method restricts to tensor product meshes and optimizes the boundary node locations in one dimension, from which we build two- or three-dimensional meshes. The method allows the definition of fixed interfaces to design conforming meshes, and enables changes in the topology, i.e., some nodes can jump across fixed interfaces. The method simultaneously optimizes the node locations and the PDE solution values over the resulting mesh. To numerically illustrate the performance of our proposed $r-$adaptive method, we apply it in combination with a collocation method, a Least Squares Method, and a Deep Ritz Method. We focus on the latter to solve one- and two-dimensional problems whose solutions are smooth, singular, and/or exhibit strong gradients.
    Emerging Threats in Deep Learning-Based Autonomous Driving: A Comprehensive Survey. (arXiv:2210.11237v1 [cs.CR])
    Since the 2004 DARPA Grand Challenge, autonomous driving technology has witnessed nearly two decades of rapid development. Particularly in recent years, with new sensors and deep learning technologies extending into the autonomous field, autonomous driving has continued to make breakthroughs, and many carmakers and high-tech giants have dedicated themselves to research and system development for autonomous driving. However, as the foundation of autonomous driving, deep learning faces many new security risks. The academic community has proposed deep learning countermeasures against adversarial examples and AI backdoors, and has introduced them into the autonomous driving field for verification. Deep learning security matters to autonomous driving system security, and in turn to personal safety, an issue that deserves attention and research. This paper provides a summary of the concepts, developments and recent research in deep learning security technologies for autonomous driving. First, we briefly introduce the deep learning framework and pipeline in the autonomous driving system, including the deep learning technologies and algorithms commonly used in this field. We then focus on the potential security threats to the deep learning-based autonomous driving system in each functional layer in turn. We review the development of deep learning attack technologies against autonomous driving, survey the state-of-the-art algorithms, and reveal the potential risks. Finally, we provide an outlook on deep learning security in the autonomous driving field and propose recommendations for building a safe and trustworthy autonomous driving system.
    A note on diffusion limits for stochastic gradient descent. (arXiv:2210.11257v1 [cs.LG])
    In the machine learning literature, stochastic gradient descent has recently been widely discussed for its purported implicit regularization properties. Much of the theory that attempts to clarify the role of noise in stochastic gradient algorithms approximates stochastic gradient descent by a stochastic differential equation with Gaussian noise. We provide a novel rigorous theoretical justification for this practice that showcases how the Gaussianity of the noise arises naturally.
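    For reference, the Gaussian-noise SDE approximation in question, written with an assumed step size $\eta$ and minibatch gradient-noise covariance $\Sigma$:

```latex
% Diffusion approximation of SGD: a drift along the negative gradient plus
% state-dependent Gaussian noise scaled by the square root of the step size.
d\theta_t \;=\; -\nabla L(\theta_t)\, dt \;+\; \sqrt{\eta}\, \Sigma(\theta_t)^{1/2}\, dW_t
```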
    Scalable Bayesian Transformed Gaussian Processes. (arXiv:2210.10973v1 [cs.LG])
    The Bayesian transformed Gaussian process (BTG) model, proposed by Kedem and Oliveira, is a fully Bayesian counterpart to the warped Gaussian process (WGP) and marginalizes out a joint prior over input warping and kernel hyperparameters. This fully Bayesian treatment of hyperparameters often provides more accurate regression estimates and superior uncertainty propagation, but is prohibitively expensive. The BTG posterior predictive distribution, itself estimated through high-dimensional integration, must be inverted in order to perform model prediction. To make the Bayesian approach practical and comparable in speed to maximum-likelihood estimation (MLE), we propose principled and fast techniques for computing with BTG. Our framework uses doubly sparse quadrature rules, tight quantile bounds, and rank-one matrix algebra to enable both fast model prediction and model selection. These scalable methods allow us to regress over higher-dimensional datasets and apply BTG with layered transformations that greatly improve its expressibility. We demonstrate that BTG achieves superior empirical performance over MLE-based models.
    Attacking Motion Estimation with Adversarial Snow. (arXiv:2210.11242v1 [cs.CV])
    Current adversarial attacks for motion estimation (optical flow) optimize small per-pixel perturbations, which are unlikely to appear in the real world. In contrast, we exploit a real-world weather phenomenon for a novel attack with adversarially optimized snow. At the core of our attack is a differentiable renderer that consistently integrates photorealistic snowflakes with realistic motion into the 3D scene. Through optimization we obtain adversarial snow that significantly impacts the optical flow while being indistinguishable from ordinary snow. Surprisingly, the impact of our novel attack is largest on methods that previously showed a high robustness to small L_p perturbations.
    Independence Testing-Based Approach to Causal Discovery under Measurement Error and Linear Non-Gaussian Models. (arXiv:2210.11021v1 [cs.LG])
    Causal discovery aims to recover causal structures generating the observational data. Despite its success in certain problems, in many real-world scenarios the observed variables are not the target variables of interest, but the imperfect measures of the target variables. Causal discovery under measurement error aims to recover the causal graph among unobserved target variables from observations made with measurement error. We consider a specific formulation of the problem, where the unobserved target variables follow a linear non-Gaussian acyclic model, and the measurement process follows the random measurement error model. Existing methods on this formulation rely on non-scalable over-complete independent component analysis (OICA). In this work, we propose the Transformed Independent Noise (TIN) condition, which checks for independence between a specific linear transformation of some measured variables and certain other measured variables. By leveraging the non-Gaussianity and higher-order statistics of data, TIN is informative about the graph structure among the unobserved target variables. By utilizing TIN, the ordered group decomposition of the causal model is identifiable. In other words, we could achieve what once required OICA to achieve by only conducting independence tests. Experimental results on both synthetic and real-world data demonstrate the effectiveness and reliability of our method.
    Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty. (arXiv:2210.11289v1 [cs.LG])
    We introduce a modified version of the excess risk, which can be used to obtain tighter, fast-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for $[-1, 1]$-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around $0$. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.
    Entire Space Counterfactual Learning: Tuning, Analytical Properties and Industrial Applications. (arXiv:2210.11039v1 [cs.LG])
    As a basic research problem for building effective recommender systems, post-click conversion rate (CVR) estimation has long been plagued by sample selection bias and data sparsity issues. To address the data sparsity issue, prevalent methods based on entire space multi-task model leverage the sequential pattern of user actions, i.e. exposure $\rightarrow$ click $\rightarrow$ conversion to construct auxiliary learning tasks. However, they still fall short of guaranteeing the unbiasedness of CVR estimates. This paper theoretically demonstrates two defects of these entire space multi-task models: (1) inherent estimation bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) potential independence priority (PIP) for CTCVR estimation, where the causality from click to conversion might be overlooked. This paper further proposes a principled method named entire space counterfactual multi-task model (ESCM$^2$), which employs a counterfactual risk minimizer to handle both IEB and PIP issues at once. To demonstrate the effectiveness of the proposed method, this paper explores its parameter tuning in practice, derives its analytic properties, and showcases its effectiveness in industrial CVR estimation, where ESCM$^2$ can effectively alleviate the intrinsic IEB and PIP issues and outperform baseline models.
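    For context, the sequential decomposition over the user-action funnel that entire-space models exploit (standard notation, assumed here):

```latex
% Entire-space decomposition over the exposure -> click -> conversion funnel:
\underbrace{p(\mathrm{click}, \mathrm{conversion} \mid x)}_{\mathrm{CTCVR}}
  \;=\; \underbrace{p(\mathrm{click} \mid x)}_{\mathrm{CTR}}
  \,\times\, \underbrace{p(\mathrm{conversion} \mid \mathrm{click}, x)}_{\mathrm{CVR}}
```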
    Neural Estimation of Submodular Functions with Applications to Differentiable Subset Selection. (arXiv:2210.11033v1 [cs.LG])
    Submodular functions and variants, through their ability to characterize diversity and coverage, have emerged as a key tool for data selection and summarization. Many recent approaches to learn submodular functions suffer from limited expressiveness. In this work, we propose FLEXSUBNET, a family of flexible neural models for both monotone and non-monotone submodular functions. To fit a latent submodular function from (set, value) observations, FLEXSUBNET applies a concave function on modular functions in a recursive manner. We do not draw the concave function from a restricted family, but rather learn from data using a highly expressive neural network that implements a differentiable quadrature procedure. Such an expressive neural model for concave functions may be of independent interest. Next, we extend this setup to provide a novel characterization of monotone $\alpha$-submodular functions, a recently introduced notion of approximate submodular functions. We then use this characterization to design a novel neural model for such functions. Finally, we consider learning submodular set functions under distant supervision in the form of (perimeter-set, high-value-subset) pairs. This yields a novel subset selection method based on an order-invariant, yet greedy sampler built around the above neural set functions. Our experiments on synthetic and real data show that FLEXSUBNET outperforms several baselines.
    Impact of signal-to-noise ratio and bandwidth on graph Laplacian spectrum from high-dimensional noisy point cloud. (arXiv:2011.10725v4 [math.ST] UPDATED)
    We systematically study the spectrum of a kernel-based graph Laplacian (GL) constructed from high-dimensional and noisy random point clouds in the nonnull setup. The problem is motivated by studying the model when the clean signal is sampled from a manifold that is embedded in a low-dimensional Euclidean subspace and corrupted by high-dimensional noise. We quantify how the signal and noise interact over different regions of signal-to-noise ratio (SNR), and report the resulting peculiar spectral behavior of GL. In addition, we explore the impact of the chosen kernel bandwidth on the spectrum of GL over different regions of SNR, which leads to an adaptive choice of kernel bandwidth that coincides with common practice on real data. This result paves the way to a theoretical understanding of how practitioners apply GL when the dataset is noisy.
    High-Order Pooling for Graph Neural Networks with Tensor Decomposition. (arXiv:2205.11691v2 [cs.LG] UPDATED)
    Graph Neural Networks (GNNs) are attracting growing attention due to their effectiveness and flexibility in modeling a variety of graph-structured data. Existing GNN architectures usually adopt simple pooling operations (e.g., sum, average, max) when aggregating messages from a local neighborhood for updating node representations, or when pooling node representations from the entire graph to compute the graph representation. Though simple and effective, these linear operations do not model high-order non-linear interactions among nodes. We propose the Tensorized Graph Neural Network (tGNN), a highly expressive GNN architecture relying on tensor decomposition to model high-order non-linear node interactions. tGNN leverages the symmetric CP decomposition to efficiently parameterize permutation-invariant multilinear maps for modeling node interactions. Theoretical and empirical analysis on both node and graph classification tasks show the superiority of tGNN over competitive baselines. In particular, tGNN achieves the most solid results on two OGB node classification datasets and one OGB graph classification dataset.
    A Comprehensive Benchmark for COVID-19 Predictive Modeling Using Electronic Health Records in Intensive Care: Choosing the Best Model for COVID-19 Prognosis. (arXiv:2209.07805v2 [cs.LG] UPDATED)
    The COVID-19 pandemic has posed a heavy burden on healthcare systems worldwide and caused huge social disruption and economic loss. Many deep learning models have been proposed to conduct clinical predictive tasks such as mortality prediction for COVID-19 patients in intensive care units using Electronic Health Record (EHR) data. Despite their initial success in certain clinical applications, there is currently a lack of benchmarking results to achieve a fair comparison so that we can select the optimal model for clinical use. Furthermore, there is a discrepancy between the formulation of traditional prediction tasks and real-world clinical practice in intensive care. To fill these gaps, we propose two clinical prediction tasks, outcome-specific length-of-stay prediction and early mortality prediction, for COVID-19 patients in intensive care units. The two tasks are adapted from the naive length-of-stay and mortality prediction tasks to accommodate the clinical practice for COVID-19 patients. We propose fair, detailed, open-source data-preprocessing pipelines and evaluate 17 state-of-the-art predictive models on the two tasks, including 5 machine learning models, 6 basic deep learning models and 6 deep learning predictive models specifically designed for EHR data. We provide benchmarking results using data from two real-world COVID-19 EHR datasets, one publicly available without needing any inquiry and another accessible on request. We deploy all experiment results and models on an online platform, where clinicians and researchers can also upload their data and get quick prediction results using our trained models. We hope our efforts can further facilitate deep learning and machine learning research for COVID-19 predictive modeling.
    Learning Preferences for Interactive Autonomy. (arXiv:2210.10899v1 [cs.RO])
    When robots enter everyday human environments, they need to understand their tasks and how they should perform those tasks. To encode these, reward functions, which specify the objective of a robot, are employed. However, designing reward functions can be extremely challenging for complex tasks and environments. A promising approach is to learn reward functions from humans. Recently, several robot learning works embrace this approach and leverage human demonstrations to learn the reward functions. Known as inverse reinforcement learning, this approach relies on a fundamental assumption: humans can provide near-optimal demonstrations to the robot. Unfortunately, this is rarely the case: human demonstrations to the robot are often suboptimal due to various reasons, e.g., difficulty of teleoperation, robot having high degrees of freedom, or humans' cognitive limitations. This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities. Specifically, we study how reward functions can be learned using comparative feedback, in which the human user compares multiple robot trajectories instead of (or in addition to) providing demonstrations. To this end, we first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, scaled comparisons; and describe how a robot can use these various forms of human feedback to infer a reward function, which may be parametric or non-parametric. Next, we propose active learning techniques to enable the robot to ask for comparison feedback that optimizes for the expected information that will be gained from that user feedback. Finally, we demonstrate the applicability of our methods in a wide variety of domains, ranging from autonomous driving simulations to home robotics, from standard reinforcement learning benchmarks to lower-body exoskeletons.
    Comparing Machine Learning Techniques for Alfalfa Biomass Yield Prediction. (arXiv:2210.11226v1 [cs.LG])
    The alfalfa crop is globally important as livestock feed, so highly efficient planting and harvesting could benefit many industries, especially as the global climate changes and traditional methods become less accurate. Recent work using machine learning (ML) to predict yields for alfalfa and other crops has shown promise. Previous efforts used remote sensing, weather, planting, and soil data to train machine learning models for yield prediction. However, while remote sensing works well, the models require large amounts of data and cannot make predictions until the harvesting season begins. Using weather and planting data from alfalfa variety trials in Kentucky and Georgia, our previous work compared feature selection techniques to find the best technique and best feature set. In this work, we trained a variety of machine learning models, using cross validation for hyperparameter optimization, to predict biomass yields, and we showed better accuracy than similar work that employed more complex techniques. Our best individual model was a random forest with a mean absolute error of 0.081 tons/acre and R{$^2$} of 0.941. Next, we expanded this dataset to include Wisconsin and Mississippi, and we repeated our experiments, obtaining a higher best R{$^2$} of 0.982 with a regression tree. We then isolated our testing datasets by state to explore this problem's eligibility for domain adaptation (DA), as we trained on multiple source states and tested on one target state. This Trivial DA (TDA) approach leaves plenty of room for improvement through exploring more complex DA techniques in forthcoming work.
    Semi-supervised object detection based on single-stage detector for thighbone fracture localization. (arXiv:2210.10998v1 [eess.IV])
    The thighbone is the largest bone supporting the lower body; if a thighbone fracture is not treated in time, it can lead to a lifelong inability to walk. Correct diagnosis of thighbone disease is very important in orthopedic medicine. Deep learning is promoting the development of fracture detection technology, but existing computer-aided diagnosis (CAD) methods based on deep learning rely on a large number of manually labeled data, and labeling these data costs a lot of time and energy. Therefore, we develop an object detection method that requires only a limited quantity of labeled images and apply it to thighbone fracture localization. In this work, we build a semi-supervised object detection (SSOD) framework based on a single-stage detector, which includes three modules: an adaptive difficult sample oriented (ADSO) module, Fusion Box, and a deformable expand encoder (Dex encoder). The ADSO module uses weighted classification scores as the criterion for evaluating label reliability, Fusion Box is designed to merge similar pseudo boxes into a reliable box for box regression, and Dex encoder is proposed to enhance the adaptability of image augmentation. Experiments are conducted on a thighbone fracture dataset comprising 3484 training and 358 testing thigh fracture images. The experimental results show that the proposed method achieves state-of-the-art AP in thighbone fracture detection at different labeled data rates, i.e. 1%, 5% and 10%. In addition, using full data with knowledge distillation, our method achieves 86.2% AP50 and 52.6% AP75.
    Private Algorithms with Private Predictions. (arXiv:2210.11222v1 [cs.CR])
    When applying differential privacy to sensitive data, a common way of getting improved performance is to use external information such as other sensitive data, public data, or human priors. We propose to use the algorithms with predictions framework -- previously applied largely to improve time complexity or competitive ratios -- as a powerful way of designing and analyzing privacy-preserving methods that can take advantage of such external information to improve utility. For four important tasks -- quantile release, its extension to multiple quantiles, covariance estimation, and data release -- we construct prediction-dependent differentially private methods whose utility scales with natural measures of prediction quality. The analyses enjoy several advantages, including minimal assumptions about the data, natural ways of adding robustness to noisy predictions, and novel "meta" algorithms that can learn predictions from other (potentially sensitive) data. Overall, our results demonstrate how to enable differentially private algorithms to make use of and learn noisy predictions, which holds great promise for improving utility while preserving privacy across a variety of tasks.
    Maximum Common Subgraph Guided Graph Retrieval: Late and Early Interaction Networks. (arXiv:2210.11020v1 [cs.LG])
    The graph retrieval problem is to search in a large corpus of graphs for ones that are most similar to a query graph. A common consideration for scoring similarity is the maximum common subgraph (MCS) between the query and corpus graphs, usually counting the number of common edges (i.e., MCES). In some applications, it is also desirable that the common subgraph be connected, i.e., the maximum common connected subgraph (MCCS). Finding exact MCES and MCCS is intractable, but may be unnecessary if ranking corpus graphs by relevance is the goal. We design fast and trainable neural functions that approximate MCES and MCCS well. Late interaction methods compute dense representations for the query and corpus graph separately, and compare these representations using simple similarity functions at the last stage, leading to highly scalable systems. Early interaction methods combine information from both graphs right from the input stages, are usually considerably more accurate, but slower. We propose both late and early interaction neural MCES and MCCS formulations. They are both based on a continuous relaxation of a node alignment matrix between query and corpus nodes. For MCCS, we propose a novel differentiable network for estimating the size of the largest connected common subgraph. Extensive experiments with seven data sets show that our proposals are superior among late interaction models in terms of both accuracy and speed. Our early interaction models provide accuracy competitive with the state of the art, at substantially greater speeds.
    Frequency of Interest-based Noise Attenuation Method to Improve Anomaly Detection Performance. (arXiv:2210.11068v1 [cs.LG])
    Accurately extracting driving events is the way to maximize computational efficiency and anomaly detection performance in the tire friction noise-based anomaly detection task. This study proposes a concise and highly practical method for improving the precision of event extraction that is otherwise hindered by extraneous noise, such as wind noise, which is difficult to characterize clearly due to its randomness. The core of the proposed method is to identify the road friction sound within the frequency band of interest and to remove components with the opposite characteristics using several frequency filters. Our method maximizes the precision of driving event extraction while improving anomaly detection performance by an average of 8.506%. We therefore conclude that our method is a practical solution for road surface anomaly detection in outdoor edge computing environments.
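    A minimal sketch of the band-selective filtering idea, assuming the friction band of interest is known (the band edges, sampling rate, and filter order below are made up for illustration):

```python
# Keep the frequency band of interest and attenuate out-of-band noise
# (e.g., broadband wind noise) with a zero-phase bandpass filter.
import numpy as np
from scipy.signal import butter, sosfiltfilt

fs = 16000                                    # sampling rate (Hz), assumed
sos = butter(4, [300, 2000], btype="bandpass", fs=fs, output="sos")

t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 800 * t)          # in-band, friction-like tone
noise = np.random.default_rng(0).normal(scale=0.5, size=fs)
filtered = sosfiltfilt(sos, signal + noise)   # out-of-band noise suppressed
print(np.corrcoef(filtered, signal)[0, 1])    # close to 1
```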
    PAC-Bayesian Learning of Optimization Algorithms. (arXiv:2210.11113v1 [cs.LG])
    We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility to find a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.
    Deep Learning-Derived Optimal Aviation Strategies to Control Pandemics. (arXiv:2210.10888v1 [cs.LG])
    The COVID-19 pandemic has affected countries across the world, demanding drastic public health policies to mitigate the spread of infection, leading to economic crisis as a collateral damage. In this work, we investigated the impact of human mobility (described via international commercial flights) on COVID-19 infection dynamics at the global scale. For this, we developed a graph neural network-based framework referred to as Dynamic Connectivity GraphSAGE (DCSAGE), which operates over spatiotemporal graphs and is well-suited for dynamically changing adjacency information. To obtain insights on the relative impact of different geographical locations, due to their associated air traffic, on the evolution of the pandemic, we conducted local sensitivity analysis on our model through node perturbation experiments. From our analyses, we identified Western Europe, North America, and Middle East as the leading geographical locations fueling the pandemic, attributed to the enormity of air traffic originating or transiting through these regions. We used these observations to identify tangible air traffic reduction strategies that can have a high impact on controlling the pandemic, with minimal interference to human mobility. Our work provides a robust deep learning-based tool to study global pandemics and is of key relevance to policy makers to take informed decisions regarding air traffic restrictions during future outbreaks.
    Does Decentralized Learning with Non-IID Unlabeled Data Benefit from Self Supervision?. (arXiv:2210.10947v1 [cs.LG])
    Decentralized learning has been advocated and widely deployed to make efficient use of distributed datasets, with an extensive focus on supervised learning (SL) problems. Unfortunately, the majority of real-world data are unlabeled and can be highly heterogeneous across sources. In this work, we carefully study decentralized learning with unlabeled data through the lens of self-supervised learning (SSL), specifically contrastive visual representation learning. We study the effectiveness of a range of contrastive learning algorithms under decentralized learning settings, on relatively large-scale datasets including ImageNet-100, MS-COCO, and a new real-world robotic warehouse dataset. Our experiments show that the decentralized SSL (Dec-SSL) approach is robust to the heterogeneity of decentralized datasets, and learns useful representation for object classification, detection, and segmentation tasks. This robustness makes it possible to significantly reduce communication and reduce the participation ratio of data sources with only minimal drops in performance. Interestingly, using the same amount of data, the representation learned by Dec-SSL can not only perform on par with that learned by centralized SSL which requires communication and excessive data storage costs, but also sometimes outperform representations extracted from decentralized SL which requires extra knowledge about the data labels. Finally, we provide theoretical insights into understanding why data heterogeneity is less of a concern for Dec-SSL objectives, and introduce feature alignment and clustering techniques to develop a new Dec-SSL algorithm that further improves the performance, in the face of highly non-IID data. Our study presents positive evidence to embrace unlabeled data in decentralized learning, and we hope to provide new insights into whether and why decentralized SSL is effective.
    A baseline revisited: Pushing the limits of multi-segment models for context-aware translation. (arXiv:2210.10906v1 [cs.CL])
    This paper addresses the task of contextual translation using multi-segment models. Specifically, we show that increasing model capacity further pushes the limits of this approach and that deeper models are better suited to capturing context dependencies. Furthermore, improvements observed with larger models can be transferred to smaller models using knowledge distillation. Our experiments show that this approach achieves competitive performance across several languages and benchmarks, without additional language-specific tuning or task-specific architectures.
    Mathematical Justification of Hard Negative Mining via Isometric Approximation Theorem. (arXiv:2210.11173v1 [cs.LG])
    In deep metric learning, the Triplet Loss has emerged as a popular method to learn many computer vision and natural language processing tasks such as facial recognition, object detection, and visual-semantic embeddings. One issue that plagues the Triplet Loss is network collapse, an undesirable phenomenon where the network projects the embeddings of all data onto a single point. Researchers predominantly solve this problem by using triplet mining strategies. While hard negative mining is the most effective of these strategies, existing formulations lack strong theoretical justification for their empirical success. In this paper, we utilize the mathematical theory of isometric approximation to show an equivalence between the Triplet Loss sampled by hard negative mining and an optimization problem that minimizes a Hausdorff-like distance between the neural network and its ideal counterpart function. This provides the theoretical justifications for hard negative mining's empirical efficacy. In addition, our novel application of the isometric approximation theorem provides the groundwork for future forms of hard negative mining that avoid network collapse. Our theory can also be extended to analyze other Euclidean space-based metric learning methods like Ladder Loss or Contrastive Learning.
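    For concreteness, a standard batch-hard negative mining variant of the Triplet Loss of the kind the theory addresses (a common formulation, not the paper's code):

```python
# Batch-hard mining: for each anchor, take the farthest same-label positive
# and the closest different-label negative in the batch.
import torch

def batch_hard_triplet(emb, labels, margin=0.2):
    d = torch.cdist(emb, emb)                         # (B, B) pairwise distances
    same = labels[:, None] == labels[None, :]
    eye = torch.eye(len(emb), dtype=torch.bool)
    hardest_pos = d.masked_fill(~same | eye, -1e9).max(dim=1).values
    hardest_neg = d.masked_fill(same, 1e9).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()

emb = torch.nn.functional.normalize(torch.randn(32, 64), dim=1)
labels = torch.randint(0, 8, (32,))
print(batch_hard_triplet(emb, labels))
```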
    Uncertainty Disentanglement with Non-stationary Heteroscedastic Gaussian Processes for Active Learning. (arXiv:2210.10964v1 [cs.LG])
    Gaussian processes are Bayesian non-parametric models used in many areas. In this work, we propose a Non-stationary Heteroscedastic Gaussian process model which can be learned with gradient-based techniques. We demonstrate the interpretability of the proposed model by separating the overall uncertainty into aleatoric (irreducible) and epistemic (model) uncertainty. We illustrate the usability of derived epistemic uncertainty on active learning problems. We demonstrate the efficacy of our model with various ablations on multiple datasets.
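    A toy sketch of the aleatoric/epistemic split with a heteroscedastic GP (the noise function is fixed here for illustration, whereas the proposed model learns it; the kernel and data are made up):

```python
# GP regression with input-dependent noise: the posterior variance gives
# epistemic (model) uncertainty, the noise function aleatoric uncertainty.
import numpy as np

def rbf(a, b, ell=0.5):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell**2)

noise = lambda x: 0.05 + 0.3 * (x > 0)            # heteroscedastic aleatoric
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, 40)
y = np.sin(2 * X) + rng.normal(size=40) * np.sqrt(noise(X))

Xs = np.linspace(-3, 3, 7)                        # test points
K = rbf(X, X) + np.diag(noise(X))
Ks, Kss = rbf(Xs, X), rbf(Xs, Xs)
mean = Ks @ np.linalg.solve(K, y)
epistemic = np.diag(Kss - Ks @ np.linalg.solve(K, Ks.T))
aleatoric = noise(Xs)
print(np.c_[Xs, epistemic, aleatoric].round(3))   # epistemic grows off-data
```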
    Task Phasing: Automated Curriculum Learning from Demonstrations. (arXiv:2210.10999v1 [cs.LG])
    Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on 3 sparse reward domains demonstrate that our task phasing approaches outperform state-of-the-art approaches with respect to their asymptotic performance.
    Enhancing Out-of-Distribution Detection in Natural Language Understanding via Implicit Layer Ensemble. (arXiv:2210.11034v1 [cs.CL])
    Out-of-distribution (OOD) detection aims to discern outliers from the intended data distribution, which is crucial to maintaining high reliability and a good user experience. Most recent studies in OOD detection use the information from a single representation residing in the penultimate layer to determine whether the input is anomalous. Although such a method is straightforward, the potential of the diverse information in the intermediate layers is overlooked. In this paper, we propose a novel framework based on contrastive learning that encourages intermediate features to learn layer-specialized representations and assembles them implicitly into a single representation to absorb the rich information in the pre-trained language model. Extensive experiments on various intent classification and OOD datasets demonstrate that our approach is significantly more effective than previous methods.
    Vertical Federated Linear Contextual Bandits. (arXiv:2210.11050v1 [cs.LG])
    In this paper, we investigate a novel problem of building contextual bandits in the vertical federated setting, i.e., where contextual information is vertically distributed over different departments. This problem remains largely unexplored in the research community. To this end, we carefully design a customized encryption scheme named the orthogonal matrix-based mask mechanism (O3M) for encrypting local contextual information while avoiding expensive conventional cryptographic techniques. We further apply the mechanism to two commonly used bandit algorithms, LinUCB and LinTS, and instantiate two practical protocols for online recommendation under the vertical federated setting. The proposed protocols can perfectly recover the service quality of centralized bandit algorithms while achieving satisfactory runtime efficiency, which we prove and analyze theoretically in this paper. By conducting extensive experiments on both synthetic and real-world datasets, we show the superiority of the proposed method in terms of privacy protection and recommendation performance.
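    The algebraic property that makes an orthogonal mask attractive for linear bandits can be checked in a few lines: masking contexts with a shared secret orthogonal matrix preserves inner products, so ridge-regression statistics and UCB scores computed on masked contexts match those computed on the raw contexts. The sketch below illustrates this property only; it is not the paper's O3M protocol.
```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))  # secret orthogonal mask

x, y = rng.normal(size=d), rng.normal(size=d)
xm, ym = Q @ x, Q @ y                         # masked contexts sent to the server

assert np.allclose(x @ y, xm @ ym)            # inner products are preserved
# Hence Gram matrices and ridge estimates built from masked contexts are
# rotations of the originals, leaving LinUCB/LinTS scores unchanged.
```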
    Trust Region Policy Optimization with Optimal Transport Discrepancies: Duality and Algorithm for Continuous Actions. (arXiv:2210.11137v1 [cs.LG])
    Policy Optimization (PO) algorithms have been proven particularly suited to handle the high-dimensionality of real-world continuous control tasks. In this context, Trust Region Policy Optimization methods represent a popular approach to stabilize the policy updates. These usually rely on the Kullback-Leibler (KL) divergence to limit the change in the policy. The Wasserstein distance represents a natural alternative, in place of the KL divergence, to define trust regions or to regularize the objective function. However, state-of-the-art works either resort to its approximations or do not provide an algorithm for continuous state-action spaces, reducing the applicability of the method. In this paper, we explore optimal transport discrepancies (which include the Wasserstein distance) to define trust regions, and we propose a novel algorithm - Optimal Transport Trust Region Policy Optimization (OT-TRPO) - for continuous state-action spaces. We circumvent the infinite-dimensional optimization problem for PO by providing a one-dimensional dual reformulation for which strong duality holds. We then analytically derive the optimal policy update given the solution of the dual problem. This way, we bypass the computation of optimal transport costs and of optimal transport maps, which we implicitly characterize by solving the dual formulation. Finally, we provide an experimental evaluation of our approach across various control tasks. Our results show that optimal transport discrepancies can offer an advantage over state-of-the-art approaches.
    Topology Optimization via Machine Learning and Deep Learning: A Review. (arXiv:2210.10782v1 [cs.LG])
    Topology optimization (TO) is a method of deriving an optimal design that satisfies given load and boundary conditions within a design domain. This method enables effective design without an initial design, but its use has been limited by high computational costs. At the same time, machine learning (ML) methodology, including deep learning, has made great progress in the 21st century, and accordingly, many studies have been conducted to enable effective and rapid optimization by applying ML to TO. This study therefore reviews and analyzes previous research on ML-based TO (MLTO). Two different perspectives are used to review the studies: (1) the TO perspective and (2) the ML perspective. The TO perspective addresses "why" to use ML for TO, while the ML perspective addresses "how" to apply ML to TO. In addition, the limitations of current MLTO research and future research directions are examined.
    Spectral Subspace Dictionary Learning. (arXiv:2210.10855v1 [cs.LG])
    \textit{Dictionary learning}, the problem of recovering a sparsely used matrix $\mathbf{D} \in \mathbb{R}^{M \times K}$ and $N$ independent $K \times 1$ $s$-sparse vectors $\mathbf{X} \in \mathbb{R}^{K \times N}$ from samples of the form $\mathbf{Y} = \mathbf{D}\mathbf{X}$, is of increasing importance to applications in signal processing and data science. Early papers on provable dictionary learning identified that one can detect whether two samples $\mathbf{y}_i, \mathbf{y}_j$ share a common dictionary element by testing if their absolute inner product (correlation) exceeds a certain threshold: $|\left\langle \mathbf{y}_i, \mathbf{y}_j \right\rangle| > \tau$. These correlation-based methods work well when sparsity is small, but suffer from declining performance when sparsity grows faster than $\sqrt{M}$; as a result, such methods were abandoned in the search for dictionary learning algorithms when sparsity is nearly linear in $M$. In this paper, we revisit correlation-based dictionary learning. Instead of seeking to recover individual dictionary atoms, we employ a spectral method to recover the subspace spanned by the dictionary atoms in the support of each sample. This approach circumvents the primary challenge encountered by previous correlation methods, namely that when sharing information between two samples it is difficult to tell \textit{which} dictionary element the two samples share. We prove that under a suitable random model the resulting algorithm recovers dictionaries in polynomial time for sparsity linear in $M$ up to log factors. Our results improve on the best known methods by achieving a decaying error bound in dimension $M$; the best previously known results for the overcomplete ($K > M$) setting achieve polynomial time in the linear sparsity regime only for constant error bounds. Numerical simulations confirm our results.
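    A simplified sketch of the two ingredients described above, the correlation test and a spectral subspace estimate, might look as follows (the paper's actual algorithm, thresholds, and guarantees differ):
```python
import numpy as np

def correlated_subspace(Y, y0, tau, s):
    """Estimate the span of atoms shared with a reference sample y0.

    Y: (M, N) sample matrix; y0: (M,) reference sample;
    tau: correlation threshold; s: sparsity (subspace dimension)."""
    corr = np.abs(Y.T @ y0)              # |<y_i, y0>| for all samples
    S = Y[:, corr > tau]                 # samples passing the correlation test
    M2 = S @ S.T / max(S.shape[1], 1)    # empirical second moment
    _, eigvecs = np.linalg.eigh(M2)
    return eigvecs[:, -s:]               # top-s eigenvectors span the estimate
```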
    Gradient Backpropagation based Feature Attribution to Enable Explainable-AI on the Edge. (arXiv:2210.10922v1 [cs.AR])
    There has been a recent surge in the field of Explainable AI (XAI), which tackles the problem of providing insights into the behavior of black-box machine learning models. Within this field, \textit{feature attribution} encompasses methods which assign relevance scores to input features and visualize them as a heatmap. Designing flexible accelerators for multiple such algorithms is challenging since the hardware mapping of these algorithms has not been studied yet. In this work, we first analyze the dataflow of gradient backpropagation based feature attribution algorithms to determine the resource overhead required over inference, and optimize the gradient computation to minimize the memory overhead. Second, we develop a High-Level Synthesis (HLS) based configurable FPGA design that is targeted at edge devices and supports three feature attribution algorithms. Tile-based computation is employed to maximally use on-chip resources while adhering to the resource constraints. Representative CNNs are trained on the CIFAR-10 dataset and implemented on multiple Xilinx FPGAs using 16-bit fixed-point precision, demonstrating the flexibility of our library. Finally, through efficient reuse of allocated hardware resources, our design methodology demonstrates a pathway to repurpose inference accelerators to support feature attribution with minimal overhead, thereby enabling real-time XAI on the edge.
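    For reference, one of the simplest gradient-backpropagation attribution algorithms of the kind such an accelerator targets is vanilla saliency, sketched below in software; the FPGA dataflow itself is not reproduced here.
```python
import torch

def saliency_map(model, image, target_class):
    """Relevance of each pixel = |d(class score)/d(pixel)|, via backprop."""
    image = image.clone().requires_grad_(True)      # image: (C, H, W)
    score = model(image.unsqueeze(0))[0, target_class]
    score.backward()                                # gradient backpropagation
    return image.grad.abs().max(dim=0).values       # (H, W) heatmap
```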
    A Pareto-optimal compositional energy-based model for sampling and optimization of protein sequences. (arXiv:2210.10838v1 [cs.LG])
    Deep generative models have emerged as a popular machine learning-based approach for inverse design problems in the life sciences. However, these problems often require sampling new designs that satisfy multiple properties of interest in addition to learning the data distribution. This multi-objective optimization becomes more challenging when properties are independent or orthogonal to each other. In this work, we propose a Pareto-compositional energy-based model (pcEBM), a framework that uses multiple gradient descent for sampling new designs that adhere to various constraints in optimizing distinct properties. We demonstrate its ability to learn non-convex Pareto fronts and generate sequences that simultaneously satisfy multiple desired properties across a series of real-world antibody design tasks.
    An Optimization-Based Supervised Learning Algorithm for PXRD Phase Fraction Estimation. (arXiv:2210.10867v1 [cs.LG])
    In powder diffraction data analysis, phase identification is the process of determining the crystalline phases in a sample using its characteristic Bragg peaks. For multiphasic spectra, we must also determine the relative weight fraction of each phase in the sample. Machine Learning algorithms (e.g., Artificial Neural Networks) have been applied to perform such difficult tasks in powder diffraction analysis, but typically require a significant number of training samples for acceptable performance. We have developed an approach that performs well even with a small number of training samples. We apply a fixed-point iteration algorithm on the labelled training samples to estimate monophasic spectra. Then, given an unknown sample spectrum, we again use a fixed-point iteration algorithm to determine the weighted combination of monophase spectra that best approximates the unknown sample spectrum. These weights are the desired phase fractions for the sample. We compare our approach with several traditional Machine Learning algorithms.
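    The abstract does not spell out the iteration, but one standard fixed-point scheme of this flavor is the multiplicative update for non-negative least squares, sketched below with hypothetical names: it expresses an unknown spectrum as a non-negative weighted combination of monophase spectra.
```python
import numpy as np

def phase_fractions(A, y, iters=500, eps=1e-12):
    """A: (n_channels, n_phases) monophase spectra; y: unknown spectrum."""
    w = np.full(A.shape[1], 1.0 / A.shape[1])   # uniform initial fractions
    for _ in range(iters):
        w *= (A.T @ y) / (A.T @ (A @ w) + eps)  # fixed point keeps w >= 0
    return w / w.sum()                          # normalize to weight fractions
```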
    DOT-VAE: Disentangling One Factor at a Time. (arXiv:2210.10920v1 [cs.LG])
    As we enter an era of machine learning characterized by an overabundance of data, the discovery, organization, and interpretation of data in an \textit{unsupervised} manner becomes a critical need. One promising approach to this endeavour is the problem of \textit{Disentanglement}, which aims to learn the underlying generative latent factors of the data, called the factors of variation, and encode them in disjoint latent representations. Recent advances have made efforts to solve this problem for synthetic datasets generated by a fixed set of independent factors of variation. Here, we propose to extend this to real-world datasets with a countable number of factors of variation. We propose a novel framework that augments the latent space of a Variational Autoencoder with a disentangled space and is trained using a Wake-Sleep-inspired two-step algorithm for unsupervised disentanglement. Our network learns to disentangle interpretable, independent factors from the data ``one at a time'', and encodes them in different dimensions of the disentangled latent space, while making no prior assumptions about the number of factors or their joint distribution. We demonstrate its quantitative and qualitative effectiveness by evaluating the latent representations learned on two synthetic benchmark datasets, DSprites and 3DShapes, and on a real dataset, CelebA.
    IDM-Follower: A Model-Informed Deep Learning Method for Long-Sequence Car-Following Trajectory Prediction. (arXiv:2210.10965v1 [eess.SY])
    Model-based and learning-based methods are the two major types of methodologies for modeling car-following behavior. Model-based methods describe car-following behavior with explicit mathematical equations, while learning-based methods focus on learning a mapping between inputs and outputs. Both types of methods have advantages and weaknesses. Meanwhile, most car-following models are generative and only consider the speed, position, and acceleration of the last time step as inputs. To address these issues, this study proposes a novel framework called IDM-Follower that can generate a sequence of following-vehicle trajectories via a recurrent autoencoder informed by a physical car-following model, the Intelligent Driver Model (IDM). We implement a novel structure with two independent encoders and a self-attention decoder that sequentially predicts the following trajectories. A loss function that combines the discrepancies between predictions and labeled data with discrepancies from model-based predictions is used to update the neural network parameters. Numerical experiments with multiple settings on simulation and NGSIM datasets show that IDM-Follower can improve prediction performance compared to model-based or learning-based methods alone. Analysis of different noise levels also shows good robustness of the model.
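    For readers unfamiliar with the physical model informing the framework, the IDM acceleration law is reproduced below; the parameter values are illustrative defaults, not values fitted in the paper.
```python
import numpy as np

def idm_acceleration(v, dv, s, v0=30.0, T=1.5, a=1.0, b=2.0, s0=2.0, delta=4):
    """v: ego speed, dv: approach rate (v - v_leader), s: bumper-to-bumper gap."""
    s_star = s0 + v * T + v * dv / (2.0 * np.sqrt(a * b))  # desired dynamic gap
    return a * (1.0 - (v / v0) ** delta - (s_star / s) ** 2)
```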
    Black Box Model Explanations and the Human Interpretability Expectations -- An Analysis in the Context of Homicide Prediction. (arXiv:2210.10849v1 [cs.LG])
    Strategies based on Explainable Artificial Intelligence (XAI) have promoted better human interpretability of the results of black box machine learning models. The XAI measures currently in use (Ciu, Dalex, Eli5, Lofo, Shap, and Skater) provide various forms of explanations, including global rankings of attribute relevance. Current research points to the need for further studies on how these explanations meet the interpretability expectations of human experts, and how they can be used to make the model even more transparent, while taking into account specific complexities of the model and dataset being analyzed as well as important human factors of sensitive real-world contexts/problems. Intending to shed light on the explanations generated by XAI measures and their interpretability, this research addresses a real-world classification problem related to homicide prediction, duly endorsed by the scientific community: we replicated its proposed black box model, used six different XAI measures to generate explanations, and asked six different human experts to generate what this research refers to as Interpretability Expectations (IE). The results were computed by means of comparative analysis and identification of relationships among all the attribute ranks produced: ~49% concordance was found between attributes indicated by XAI measures and by human experts, ~41% were indicated exclusively by XAI measures, and ~10% exclusively by human experts. The results allow for answering: "Do the different XAI measures generate similar explanations for the proposed problem?", "Are the interpretability expectations generated by different human experts similar?", "Do the explanations generated by XAI measures meet the interpretability expectations of human experts?" and "Can interpretability explanations and expectations work together?", all concerning the context of homicide prediction.
    Learning Multi-Objective Curricula for Robotic Policy Learning. (arXiv:2110.03032v3 [cs.LG] UPDATED)
    Various automatic curriculum learning (ACL) methods have been proposed to improve the sample efficiency and final performance of deep reinforcement learning (DRL). They are designed to control how a DRL agent collects data, inspired by how humans gradually adapt their learning processes to their capabilities. For example, ACL can be used for subgoal generation, reward shaping, environment generation, or initial state generation. However, prior work only considers curriculum learning following one of these predefined paradigms. It is unclear which of the paradigms are complementary, and how their combination can be learned from interactions with the environment. Therefore, in this paper, we propose a unified automatic curriculum learning framework to create multi-objective but coherent curricula that are generated by a set of parametric curriculum modules. Each curriculum module is instantiated as a neural network and is responsible for generating a particular curriculum. In order to coordinate those potentially conflicting modules in a unified parameter space, we propose a multi-task hyper-net learning framework that uses a single hyper-net to parameterize all of the curriculum modules. In addition to the existing hand-designed curriculum paradigms, we further design a flexible memory mechanism to learn an abstract curriculum, which may otherwise be difficult to design manually. We evaluate our method on a series of robotic manipulation tasks and demonstrate its superiority over other state-of-the-art ACL methods in terms of sample efficiency and final performance.
    Graph Regularized Probabilistic Matrix Factorization for Drug-Drug Interactions Prediction. (arXiv:2210.10784v1 [q-bio.QM])
    Co-administration of two or more drugs simultaneously can result in adverse drug reactions. Identifying drug-drug interactions (DDIs) is necessary, especially for drug development and for repurposing old drugs. DDI prediction can be viewed as a matrix completion task, for which matrix factorization (MF) appears as a suitable solution. This paper presents a novel Graph Regularized Probabilistic Matrix Factorization (GRPMF) method, which incorporates expert knowledge through a novel graph-based regularization strategy within an MF framework. An efficient and sound optimization algorithm is proposed to solve the resulting non-convex problem in an alternating fashion. The performance of the proposed method is evaluated on the DrugBank dataset, and comparisons are provided against state-of-the-art techniques. The results demonstrate the superior performance of GRPMF compared to its counterparts.
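    Schematically, graph regularization in an MF framework adds a Laplacian penalty on the factor matrix; a minimal alternating gradient step (an illustration only, the paper's GRPMF algorithm is probabilistic and more involved) might look like:
```python
import numpy as np

def grad_step(X, mask, U, V, L, lam=0.1, lr=0.01):
    """X: interaction matrix, mask: observed entries, L: graph Laplacian
    built from expert drug-similarity knowledge."""
    R = mask * (U @ V.T - X)           # residual on observed interactions
    U -= lr * (R @ V + lam * (L @ U))  # Laplacian pulls similar drugs together
    V -= lr * (R.T @ U)                # (Gauss-Seidel style: uses updated U)
    return U, V
```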
    Supervised Contrastive Learning with TPE-based Bayesian Optimization of Tabular Data for Imbalanced Learning. (arXiv:2210.10824v1 [cs.LG])
    Class imbalance has a detrimental effect on the predictive performance of most supervised learning algorithms, as the imbalanced distribution can lead to a bias preferring the majority class. To solve this problem, we propose a Supervised Contrastive Learning (SCL) method with a Bayesian optimization technique based on the Tree-structured Parzen Estimator (TPE) for imbalanced tabular datasets. Compared with supervised learning, contrastive learning can avoid "label bias" by extracting the information hidden in data. Based on the contrastive loss, SCL can exploit label information to address the insufficient data augmentation of tabular data, and is thus used in the proposed SCL-TPE method to learn a discriminative representation of the data. Additionally, as the temperature hyper-parameter has a decisive influence on SCL performance and is difficult to tune, TPE-based Bayesian optimization is introduced to automatically select the best temperature. Experiments are conducted on both binary and multi-class imbalanced tabular datasets. The results show that TPE outperforms other hyper-parameter optimization (HPO) methods such as grid search, random search, and genetic algorithms. More importantly, the proposed SCL-TPE method achieves much-improved performance compared with state-of-the-art methods.
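    As an illustration of TPE-based temperature selection, the sketch below uses the hyperopt library (a common TPE implementation; the paper does not prescribe one), with a hypothetical train_and_score function returning the validation loss for a given temperature:
```python
from hyperopt import fmin, tpe, hp

def objective(temperature):
    # train_and_score is assumed: trains SCL with this temperature and
    # returns a validation loss (lower is better).
    return train_and_score(temperature)

best = fmin(fn=objective,
            space=hp.loguniform('temperature', -5, 1),  # roughly [0.007, 2.7]
            algo=tpe.suggest,
            max_evals=50)
print(best)  # e.g. {'temperature': 0.12}
```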
    Autoencoded sparse Bayesian in-IRT factorization, calibration, and amortized inference for the Work Disability Functional Assessment Battery. (arXiv:2210.10952v1 [stat.ME])
    The Work Disability Functional Assessment Battery (WD-FAB) is a multidimensional item response theory (IRT) instrument designed for assessing work-related mental and physical function based on responses to an item bank. In prior iterations it was developed using traditional means -- linear factorization, followed by statistical testing for item selection, and finally, calibration of disjoint unidimensional IRT models. As a result, the WD-FAB, like many other IRT instruments, is a posthoc model. In this manuscript, we derive an interpretable probabilistic autoencoder architecture that embeds as the decoder a Bayesian hierarchical model for self-consistently performing the following simultaneous tasks: scale factorization, item selection, parameter identification, and response scoring. This method obviates the linear factorization and null hypothesis statistical tests that are usually required for developing multidimensional IRT models, so that partitioning is consistent with the ultimate nonlinear factor model. We use the method on WD-FAB item responses and compare the resulting item discriminations to those obtained using the traditional method.
    Causally-guided Regularization of Graph Attention Improves Generalizability. (arXiv:2210.10946v1 [cs.LG])
    Graph attention networks learn to weigh the importance of a node's neighbors for prediction; however, the inferred attentions are vulnerable to spurious correlations and connectivity in the training data, hampering the generalizability of the model. We introduce CAR, a general-purpose regularization framework for graph attention networks. Embodying a causal inference approach, CAR aligns the attention mechanism with the causal effects of active interventions on graph connectivity in a scalable manner. CAR is compatible with a variety of graph attention architectures, and we show that it systematically improves generalizability on various node classification tasks. Our ablation studies indicate that CAR hones in on the aspects of graph structure most pertinent to the prediction (e.g., homophily), and does so more effectively than alternative approaches. Finally, we also show that CAR enhances the interpretability of attention weights by accentuating node-neighbor relations that point to causal hypotheses. For social media network-sized graphs, a CAR-guided graph rewiring approach could allow us to combine the scalability of graph convolutional methods with the higher performance of graph attention.
    Generalization Properties of Decision Trees on Real-valued and Categorical Features. (arXiv:2210.10781v1 [stat.ML])
    We revisit binary decision trees from the perspective of partitions of the data. We introduce the notion of a partitioning function, and we relate it to the growth function and to the VC dimension. We consider three types of features: real-valued, categorical ordinal, and categorical nominal, each with different split rules. For each feature type, we upper bound the partitioning function of the class of decision stumps before extending the bounds to the class of general decision trees (of any fixed structure) using a recursive approach. Using these new results, we are able to find the exact VC dimension of decision stumps on examples of $\ell$ real-valued features, which is given by the largest integer $d$ such that $2\ell \ge \binom{d}{\lfloor\frac{d}{2}\rfloor}$. Furthermore, we show that the VC dimension of a binary tree structure with $L_T$ leaves on examples of $\ell$ real-valued features is in $O(L_T \log(L_T\ell))$. Finally, we elaborate a pruning algorithm based on these results that performs better than the cost-complexity and reduced-error pruning algorithms on a number of datasets, with the advantage that no cross-validation is required.
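    The stated formula is easy to evaluate directly; the snippet below returns the VC dimension of decision stumps for a given number $\ell$ of real-valued features:
```python
from math import comb

def stump_vc_dim(ell):
    """Largest d such that 2*ell >= C(d, floor(d/2))."""
    d = 1
    while 2 * ell >= comb(d + 1, (d + 1) // 2):
        d += 1
    return d

print([stump_vc_dim(ell) for ell in (1, 2, 5, 100)])  # -> [2, 3, 5, 9]
```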
    On Learning Fairness and Accuracy on Multiple Subgroups. (arXiv:2210.10837v1 [stat.ML])
    We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criterion of group sufficiency. We focus on the scenario where the data contain multiple or even many subgroups, each with a limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups by formulating the problem as a bilevel objective. Specifically, the subgroup-specific predictors are learned in the lower level using a small amount of data and the fair predictor. In the upper level, the fair predictor is updated to be close to all subgroup-specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and the generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests consistently improved fair predictions, as well as accuracy comparable to the baselines.
    Self-learning locally-optimal hypertuning using maximum entropy, and comparison of machine learning approaches for estimating fatigue life in composite materials. (arXiv:2210.10783v1 [cs.LG])
    Applications of Structural Health Monitoring (SHM) combined with Machine Learning (ML) techniques enhance real-time performance tracking and increase structural integrity awareness of civil, aerospace, and automotive infrastructures. This SHM-ML synergy has gained popularity in recent years thanks to the anticipation of maintenance needs enabled by emerging ML algorithms and their ability to handle large quantities of data and account for its influence on the problem. In this paper, we develop a novel nearest-neighbors-like ML algorithm based on the principle of maximum entropy to predict fatigue damage (Palmgren-Miner index) in composite materials by processing the signals of Lamb waves -- a non-destructive SHM technique -- together with other meaningful features such as layup parameters and stiffness matrices calculated from Classical Laminate Theory (CLT). The full data analysis cycle is applied to a dataset of delamination experiments in composites. The predictions achieve a good level of accuracy, similar to other ML algorithms such as Neural Networks or Gradient-Boosted Trees, and computation times are of the same order of magnitude. The key advantages of our proposal are: (1) the automatic determination of all the parameters involved in the prediction, so no hyperparameters have to be set beforehand, which saves the time devoted to hypertuning the model and also represents an advantage for autonomous, self-supervised SHM; and (2) no training is required, which, in an \textit{online learning} context where streams of data are fed continuously to the model, avoids repeated training -- essential for reliable real-time, continuous monitoring.
    MMRNet: Improving Reliability for Multimodal Computer Vision for Bin Picking via Multimodal Redundancy. (arXiv:2210.10842v1 [cs.CV])
    Recently, there has been tremendous interest in Industry 4.0 infrastructure to address labor shortages in global supply chains. Deploying artificial intelligence-enabled robotic bin picking systems in the real world has become particularly important for reducing labor demands and costs while increasing efficiency. Such systems may be used to automate bin picking, but may also cause expensive damage during an abnormal event such as a sensor failure. As such, reliability becomes a critical factor for translating artificial intelligence research into real-world applications and products. In this paper, we propose a reliable vision system with MultiModal Redundancy (MMRNet) for tackling object detection and segmentation for robotic bin picking using data from different modalities. This is the first system that introduces the concept of multimodal redundancy to combat sensor failure issues during deployment. In particular, we realize the multimodal redundancy framework with a gate fusion module and dynamic ensemble learning. Finally, we present a new label-free multimodal consistency (MC) score that utilizes the output from all modalities to measure the overall system output reliability and uncertainty. Through experiments, we demonstrate that in the event of a missing modality, our system provides much more reliable performance compared to baseline models. We also demonstrate that our MC score is a more powerful reliability indicator for outputs during inference time, where model-generated confidence scores are often over-confident.
    FedRecover: Recovering from Poisoning Attacks in Federated Learning using Historical Information. (arXiv:2210.10936v1 [cs.CR])
    Federated learning is vulnerable to poisoning attacks in which malicious clients poison the global model via sending malicious model updates to the server. Existing defenses focus on preventing a small number of malicious clients from poisoning the global model via robust federated learning methods and detecting malicious clients when there are a large number of them. However, it is still an open challenge how to recover the global model from poisoning attacks after the malicious clients are detected. A naive solution is to remove the detected malicious clients and train a new global model from scratch, which incurs a large cost that may be intolerable for resource-constrained clients such as smartphones and IoT devices. In this work, we propose FedRecover, which can recover an accurate global model from poisoning attacks with small cost for the clients. Our key idea is that the server estimates the clients' model updates instead of asking the clients to compute and communicate them during the recovery process. In particular, the server stores the global models and clients' model updates in each round, when training the poisoned global model. During the recovery process, the server estimates a client's model update in each round using its stored historical information. Moreover, we further optimize FedRecover to recover a more accurate global model using warm-up, periodic correction, abnormality fixing, and final tuning strategies, in which the server asks the clients to compute and communicate their exact model updates. Theoretically, we show that the global model recovered by FedRecover is close to or the same as that recovered by train-from-scratch under some assumptions. Empirically, our evaluation on four datasets, three federated learning methods, as well as untargeted and targeted poisoning attacks (e.g., backdoor attacks) shows that FedRecover is both accurate and efficient.
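    The core estimation step can be sketched abstractly as follows, treating models and updates as flat vectors; approx_hvp stands in for the L-BFGS-based Hessian-vector-product approximation described in the paper and is only a placeholder here.
```python
def estimate_client_update(hist_update, hist_model, cur_model, approx_hvp):
    """First-order correction of a stored update (Cauchy mean value theorem):
    g(w_cur) ~= g(w_hist) + H @ (w_cur - w_hist), with H approximated cheaply."""
    return hist_update + approx_hvp(cur_model - hist_model)
```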
    On Tilted Losses in Machine Learning: Theory and Applications. (arXiv:2109.06141v2 [cs.LG] UPDATED)
    Exponential tilting is a technique commonly used in fields such as statistics, probability, information theory, and optimization to create parametric distribution shifts. Despite its prevalence in related fields, tilting has not seen widespread use in machine learning. In this work, we aim to bridge this gap by exploring the use of tilting in risk minimization. We study a simple extension to ERM -- tilted empirical risk minimization (TERM) -- which uses exponential tilting to flexibly tune the impact of individual losses. The resulting framework has several useful properties: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to the tail probability of losses. Our work makes rigorous connections between TERM and related objectives, such as Value-at-Risk, Conditional Value-at-Risk, and distributionally robust optimization (DRO). We develop batch and stochastic first-order optimization methods for solving TERM, provide convergence guarantees for the solvers, and show that the framework can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications in machine learning, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. Despite the straightforward modification TERM makes to traditional ERM objectives, we find that the framework can consistently outperform ERM and deliver competitive performance with state-of-the-art, problem-specific approaches.
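    The TERM objective itself is compact enough to state in code: it is a log-sum-exp reweighting of per-sample losses that recovers ERM as the tilt parameter t approaches zero.
```python
import math
import torch

def tilted_loss(losses, t):
    """TERM: (1/t) * log(mean(exp(t * losses))). t > 0 magnifies high losses
    (fairness / worst-case emphasis); t < 0 suppresses them (outlier robustness)."""
    if abs(t) < 1e-8:
        return losses.mean()  # t -> 0 recovers standard ERM
    return (torch.logsumexp(t * losses, dim=0) - math.log(len(losses))) / t
```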
    VTC: Improving Video-Text Retrieval with User Comments. (arXiv:2210.10820v1 [cs.CV])
    Multi-modal retrieval is an important problem for many applications, such as recommendation and search. Current benchmarks and even datasets are often manually constructed and consist of mostly clean samples where all modalities are well-correlated with the content. Thus, the current video-text retrieval literature largely focuses on video titles or audio transcripts, while ignoring user comments, since users often tend to discuss topics only vaguely related to the video. Despite the ubiquity of user comments online, there are currently no multi-modal representation learning datasets that include comments. In this paper, we a) introduce a new dataset of videos, titles and comments; b) present an attention-based mechanism that allows the model to learn from sometimes irrelevant data such as comments; and c) show that by using comments, our method is able to learn better, more contextualised representations for image, video and audio. Project page: https://unitaryai.github.io/vtc-paper.
    Learning to Invert: Simple Adaptive Attacks for Gradient Inversion in Federated Learning. (arXiv:2210.10880v1 [cs.LG])
    Gradient inversion attack enables recovery of training samples from model updates in federated learning (FL) and constitutes a serious threat to data privacy. To mitigate this vulnerability, prior work proposed both principled defenses based on differential privacy, as well as heuristic defenses based on gradient compression as countermeasures. These defenses have so far been very effective, in particular those based on gradient compression that allow the model to maintain high accuracy while greatly reducing the attack's effectiveness. In this work, we argue that such findings do not accurately reflect the privacy risk in FL, and show that existing defenses can be broken by a simple adaptive attack that trains a model using auxiliary data to learn how to invert gradients on both vision and language tasks.
    Routine Usage of AI-based Chest X-ray Reading Support in a Multi-site Medical Supply Center. (arXiv:2210.10779v1 [eess.IV])
    1. Research question: How can we establish AI support for reading chest X-rays in clinical routine, and which benefits emerge for clinicians and radiologists? Can it provide 24/7 support for practicing clinicians? 2. Findings: We installed an AI solution for chest X-rays in a given structure (MVZ Uhlenbrock & Partner, Germany) and demonstrate its practicability, performance, and benefits across 10 connected clinical sites. 3. Meaning: A commercially available AI solution for the evaluation of chest X-ray images is able to help radiologists and clinical colleagues 24/7 in a complex environment. The system performs robustly, supporting radiologists and clinical colleagues in their important decisions in practices and hospitals, regardless of the user and the X-ray system type producing the image data.
    Discovering Many Diverse Solutions with Bayesian Optimization. (arXiv:2210.10953v1 [cs.LG])
    Bayesian optimization (BO) is a popular approach for sample-efficient optimization of black-box objective functions. While BO has been successfully applied to a wide range of scientific applications, traditional approaches to single-objective BO only seek to find a single best solution. This can be a significant limitation in situations where solutions may later turn out to be intractable. For example, a designed molecule may turn out to violate constraints that can only be reasonably evaluated after the optimization process has concluded. To address this issue, we propose Rank-Ordered Bayesian Optimization with Trust-regions (ROBOT) which aims to find a portfolio of high-performing solutions that are diverse according to a user-specified diversity metric. We evaluate ROBOT on several real-world applications and show that it can discover large sets of high-performing diverse solutions while requiring few additional function evaluations compared to finding a single best solution.
    Palm up: Playing in the Latent Manifold for Unsupervised Pretraining. (arXiv:2210.10913v1 [cs.LG])
    Large and diverse datasets have been the cornerstones of many impressive advancements in artificial intelligence. Intelligent creatures, however, learn by interacting with the environment, which changes the input sensory signals and the state of the environment. In this work, we aim to bring the best of both worlds and propose an algorithm that exhibits exploratory behavior while utilizing large, diverse datasets. Our key idea is to leverage deep generative models that are pretrained on static datasets and introduce a dynamic model in the latent space. The transition dynamics simply mix an action with a randomly sampled latent and apply an exponential moving average for temporal persistence; the resulting latent is decoded to an image using the pretrained generator. We then employ an unsupervised reinforcement learning algorithm to explore in this environment and perform unsupervised representation learning on the collected data. We further leverage the temporal information of this data to pair data points as a natural supervision signal for representation learning. Our experiments suggest that the learned representations can be successfully transferred to downstream tasks in both vision and reinforcement learning domains.
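    One step of the described latent "environment" can be sketched as follows, with the mixing weight, EMA coefficient, and a frozen generator callable as assumed ingredients:
```python
import numpy as np

def latent_step(z, action, generator, alpha=0.9, mix=0.5, rng=np.random):
    noise = rng.standard_normal(z.shape)
    z_target = mix * action + (1.0 - mix) * noise  # mix action with random latent
    z_next = alpha * z + (1.0 - alpha) * z_target  # EMA for temporal persistence
    obs = generator(z_next)                        # decode with frozen generator
    return z_next, obs
```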
    FairEGM: Fair Link Prediction and Recommendation via Emulated Graph Modification. (arXiv:2201.11596v2 [cs.LG] UPDATED)
    As machine learning becomes more widely adopted across domains, it is critical that researchers and ML engineers think about the inherent biases in the data that may be perpetuated by the model. Recently, many studies have shown that such biases are also inherited by Graph Neural Network (GNN) models if the input graph is biased, potentially to the disadvantage of underserved and underrepresented communities. In this work, we aim to mitigate the bias learned by GNNs by jointly optimizing two different loss functions: one for the task of link prediction and one for the task of demographic parity. We further implement three different techniques inspired by graph modification approaches: the Global Fairness Optimization (GFO), Constrained Fairness Optimization (CFO), and Fair Edge Weighting (FEW) models. These techniques mimic the effects of changing underlying graph structures within the GNN and offer a greater degree of interpretability over more integrated neural network methods. Our proposed models emulate microscopic or macroscopic edits to the input graph while training GNNs and learn node embeddings that are both accurate and fair under the context of link recommendations. We demonstrate the effectiveness of our approach on four real world datasets and show that we can improve the recommendation fairness by several factors at negligible cost to link prediction accuracy.
    LOSTIN: Logic Optimization via Spatio-Temporal Information with Hybrid Graph Models. (arXiv:2201.08455v2 [cs.LG] UPDATED)
    Despite the stride made by machine learning (ML) based performance modeling, two major concerns that may impede production-ready ML applications in EDA are stringent accuracy requirements and generalization capability. To this end, we propose hybrid graph neural network (GNN) based approaches towards highly accurate quality-of-result (QoR) estimations with great generalization capability, specifically targeting logic synthesis optimization. The key idea is to simultaneously leverage spatio-temporal information from hardware designs and logic synthesis flows to forecast performance (i.e., delay/area) of various synthesis flows on different designs. The structural characteristics inside hardware designs are distilled and represented by GNNs; the temporal knowledge (i.e., relative ordering of logic transformations) in synthesis flows can be imposed on hardware designs by combining a virtually added supernode or a sequence processing model with conventional GNN models. Evaluation on 3.3 million data points shows that the testing mean absolute percentage error (MAPE) on designs seen and unseen during training are no more than 1.2% and 3.1%, respectively, which are 7-15X lower than existing studies.
    Learning and Retrieval from Prior Data for Skill-based Imitation Learning. (arXiv:2210.11435v1 [cs.LG])
    Imitation learning offers a promising path for robots to learn general-purpose behaviors, but traditionally has exhibited limited scalability due to high data supervision requirements and brittle generalization. Inspired by recent advances in multi-task imitation learning, we investigate the use of prior data from previous tasks to facilitate learning novel tasks in a robust, data-efficient manner. To make effective use of the prior data, the robot must internalize knowledge from past experiences and contextualize this knowledge in novel tasks. To that end, we develop a skill-based imitation learning framework that extracts temporally extended sensorimotor skills from prior data and subsequently learns a policy for the target task that invokes these learned skills. We identify several key design choices that significantly improve performance on novel tasks, namely representation learning objectives to enable more predictable skill representations and a retrieval-based data augmentation mechanism to increase the scope of supervision for policy training. On a collection of simulated and real-world manipulation domains, we demonstrate that our method significantly outperforms existing imitation learning and offline reinforcement learning approaches. Videos and code are available at https://ut-austin-rpl.github.io/sailor
    Causal Effect Estimation with Global Probabilistic Forecasting: A Case Study of the Impact of Covid-19 Lockdowns on Energy Demand. (arXiv:2209.08885v2 [cs.LG] UPDATED)
    The electricity industry is heavily implementing smart grid technologies to improve reliability, availability, security, and efficiency. This implementation needs technological advancements, the development of standards and regulations, as well as testing and planning. Smart grid load forecasting and management are critical for reducing demand volatility and improving the market mechanism that connects generators, distributors, and retailers. During policy implementations or external interventions, it is necessary to analyse the uncertainty of their impact on the electricity demand to enable a more accurate response of the system to fluctuating demand. This paper analyses the uncertainties of external intervention impacts on electricity demand. It implements a framework that combines probabilistic and global forecasting models using a deep learning approach to estimate the causal impact distribution of an intervention. The causal effect is assessed by predicting the counterfactual distribution outcome for the affected instances and then contrasting it to the real outcomes. We consider the impact of Covid-19 lockdowns on energy usage as a case study to evaluate the non-uniform effect of this intervention on the electricity demand distribution. We show that during the initial lockdowns in Australia and some European countries, there was often a more significant decrease in the troughs than in the peaks, while the mean remained almost unaffected.
    Application of artificial neural network to determine the thickness profile of thin film. (arXiv:2210.11421v1 [cs.LG])
    In this paper, we introduce a novel artificial neural network (ANN) based scheme to estimate the thickness of thin films deposited on a given substrate. Here we consider the visible interference pattern between a plane wave and a diverging wave reflected from the thin film surface, which records the thickness information of the thin film. We assume a uniform thickness profile of the film; however, the thickness increases as the deposition takes place. We extract the intensity data along a line through the center of the interference pattern. We train our network using a number of such line profiles corresponding to known thicknesses. The performance of the trained network is then tested by estimating the thickness of unknown surfaces. The numerical simulation results show that the proposed technique can be highly useful for automated, real-time measurement of thickness during deposition.
    Multirate Training of Neural Networks. (arXiv:2106.10771v3 [cs.LG] UPDATED)
    We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales, where slow parts are updated less frequently. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization. Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.
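    A minimal sketch of the scheme, assuming a hypothetical naming convention that partitions parameters into "fast" (updated every step) and "slow" (updated every k steps) groups:
```python
import torch

def multirate_train(model, loader, loss_fn, k=5, lr=1e-3):
    named = list(model.named_parameters())
    fast = [p for n, p in named if n.startswith('head')]      # assumed fast part
    slow = [p for n, p in named if not n.startswith('head')]  # rest is slow
    opt_fast = torch.optim.SGD(fast, lr=lr)
    opt_slow = torch.optim.SGD(slow, lr=lr)
    for step, (x, y) in enumerate(loader):
        loss = loss_fn(model(x), y)
        opt_fast.zero_grad(); opt_slow.zero_grad()
        loss.backward()
        opt_fast.step()
        if step % k == 0:   # slow parameters are updated less frequently
            opt_slow.step()
```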
    Surgical Fine-Tuning Improves Adaptation to Distribution Shifts. (arXiv:2210.11466v1 [cs.LG])
    A common approach to transfer learning under distribution shift is to fine-tune the last few layers of a pre-trained model, preserving learned features while also adapting to the new task. This paper shows that in such settings, selectively fine-tuning a subset of layers (which we term surgical fine-tuning) matches or outperforms commonly used fine-tuning approaches. Moreover, the type of distribution shift influences which subset is more effective to tune: for example, for image corruptions, fine-tuning only the first few layers works best. We validate our findings systematically across seven real-world data tasks spanning three types of distribution shifts. Theoretically, we prove that for two-layer neural networks in an idealized setting, first-layer tuning can outperform fine-tuning all layers. Intuitively, fine-tuning more parameters on a small target dataset can cause information learned during pre-training to be forgotten, and the relevant information depends on the type of shift.
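    In practice, surgical fine-tuning amounts to freezing all parameters and unfreezing only the chosen subset. A sketch with torchvision's ResNet-50, unfreezing the stem and first stage (a choice suited to input-level corruption shifts, per the paper's observation):
```python
import torch
from torchvision.models import resnet50

model = resnet50(weights='IMAGENET1K_V1')
for p in model.parameters():          # freeze everything
    p.requires_grad = False
for block in (model.conv1, model.bn1, model.layer1):  # surgical choice
    for p in block.parameters():
        p.requires_grad = True

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```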
    Physics-Informed Graph Learning. (arXiv:2202.10679v2 [cs.LG] UPDATED)
    The expeditious development of graph learning in recent years has found innumerable applications in several diversified fields. Among the main associated challenges are the volume and complexity of graph data; graph learning models often cannot efficiently learn graph information. To remedy this inefficacy, physics-informed graph learning (PIGL) is emerging. PIGL incorporates physics rules while performing graph learning, which has enormous benefits. This paper presents a systematic review of PIGL methods. We begin by introducing a unified framework of graph learning models, and then examine existing PIGL methods in relation to the unified framework. We also discuss several future challenges for PIGL. This survey paper is expected to stimulate innovative research and development activities pertaining to PIGL.
    Learning-Based Data Storage [Vision] (Technical Report). (arXiv:2206.05778v2 [cs.DB] UPDATED)
    Deep neural network (DNN) and its variants have been extensively used for a wide spectrum of real applications such as image classification, face/speech recognition, fraud detection, and so on. In addition to many important machine learning tasks, as artificial networks emulating the way brain cells function, DNNs also show the capability of storing non-linear relationships between input and output data, which exhibits the potential of storing data via DNNs. We envision a new paradigm of data storage, "DNN-as-a-Database", where data are encoded in well-trained machine learning models. Compared with conventional data storage that directly records data in raw formats, learning-based structures (e.g., DNN) can implicitly encode data pairs of inputs and outputs and compute/materialize actual output data of different resolutions only if input data are provided. This new paradigm can greatly enhance the data security by allowing flexible data privacy settings on different levels, achieve low space consumption and fast computation with the acceleration of new hardware (e.g., Diffractive Neural Network and AI chips), and can be generalized to distributed DNN-based storage/computing. In this paper, we propose this novel concept of learning-based data storage, which utilizes a learning structure called learning-based memory unit (LMU), to store, organize, and retrieve data. As a case study, we use DNNs as the engine in the LMU, and study the data capacity and accuracy of the DNN-based data storage. Our preliminary experimental results show the feasibility of the learning-based data storage by achieving high (100%) accuracy of the DNN storage. We explore and design effective solutions to utilize the DNN-based data storage to manage and query relational tables. We discuss how to generalize our solutions to other data types (e.g., graphs) and environments such as distributed DNN storage/computing.
    Asymptotic Analysis of Conditioned Stochastic Gradient Descent. (arXiv:2006.02745v4 [math.ST] UPDATED)
    In this paper, we investigate a general class of stochastic gradient descent (SGD) algorithms, called conditioned SGD, based on a preconditioning of the gradient direction. Using a discrete-time approach with martingale tools, we establish the weak convergence of the rescaled sequence of iterates for a broad class of conditioning matrices including stochastic first-order and second-order methods. Almost sure convergence results, which may be of independent interest, are also presented. When the conditioning matrix is an estimate of the inverse Hessian, the algorithm is proved to be asymptotically optimal. For the sake of completeness, we provide a practical procedure to achieve this minimum variance.
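    In standard notation (the paper's exact notation may differ), the iteration under study takes the form
```latex
% Conditioned SGD: a preconditioning matrix C_k rescales the stochastic
% gradient g_{k+1}; C_k = I recovers vanilla SGD, while C_k approximating
% the inverse Hessian yields the asymptotically optimal variance.
\[
  \theta_{k+1} \;=\; \theta_k \;-\; \gamma_{k+1}\, C_k\, g_{k+1},
  \qquad
  \mathbb{E}\!\left[g_{k+1} \mid \mathcal{F}_k\right] = \nabla f(\theta_k).
\]
```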
    How Does a Deep Learning Model Architecture Impact Its Privacy?. (arXiv:2210.11049v1 [cs.CR])
    As a booming research area in the past decade, deep learning technologies have been driven by big data collected and processed on an unprecedented scale. However, the sensitive information in the collected training data raises privacy concerns. Recent research indicated that deep learning models are vulnerable to various privacy attacks, including membership inference attacks, attribute inference attacks, and gradient inversion attacks. It is noteworthy that the performance of the attacks varies from model to model. In this paper, we conduct empirical analyses to answer a fundamental question: Does model architecture affect model privacy? We investigate several representative model architectures from CNNs to Transformers, and show that Transformers are generally more vulnerable to privacy attacks than CNNs. We further demonstrate that the micro design of activation layers, stem layers, and bias parameters, are the major reasons why CNNs are more resilient to privacy attacks than Transformers. We also find that the presence of attention modules is another reason why Transformers are more vulnerable to privacy attacks. We hope our discovery can shed some new light on how to defend against the investigated privacy attacks and help the community build privacy-friendly model architectures.
    A lower confidence sequence for the changing mean of non-negative right heavy-tailed observations with bounded mean. (arXiv:2210.11133v1 [stat.ML])
    A confidence sequence (CS) is an anytime-valid sequential inference primitive which produces an adapted sequence of sets for a predictable parameter sequence with a time-uniform coverage guarantee. This work constructs a non-parametric, non-asymptotic lower CS for the running average conditional expectation whose slack converges to zero, given non-negative right heavy-tailed observations with bounded mean. Specifically, when the variance is finite, the approach dominates the empirical Bernstein supermartingale of Howard et al.; with infinite variance, it can adapt to a known or unknown $(1 + \delta)$-th moment bound; and it can be efficiently approximated using a sublinear number of sufficient statistics. In certain cases this lower CS can be converted into a closed-interval CS whose width converges to zero, e.g., for any bounded realization, or for post contextual-bandit inference with bounded rewards and unbounded importance weights. A reference implementation and example simulations demonstrate the technique.
    Hypothesis Testing using Causal and Causal Variational Generative Models. (arXiv:2210.11275v1 [cs.LG])
    Hypothesis testing and the usage of expert knowledge, or causal priors, has not been well explored in the context of generative models. We propose a novel set of generative architectures, Causal Gen and Causal Variational Gen, that can utilize nonparametric structural causal knowledge combined with a deep learning functional approximation. We show how, using a deliberate (non-random) split of training and testing data, these models can generalize better to similar, but out-of-distribution data points, than non-causal generative models and prediction models such as Variational autoencoders and Fully Connected Neural Networks. We explore using this generalization error as a proxy for causal model hypothesis testing. We further show how dropout can be used to learn functional relationships of structural models that are difficult to learn with traditional methods. We validate our methods on a synthetic pendulum dataset, as well as a trauma surgery ground level fall dataset.
    SHARKS: Smart Hacking Approaches for RisK Scanning in Internet-of-Things and Cyber-Physical Systems based on Machine Learning. (arXiv:2101.02780v2 [cs.CR] UPDATED)
    Cyber-physical systems (CPS) and Internet-of-Things (IoT) devices are increasingly being deployed across multiple functionalities, ranging from healthcare devices and wearables to critical infrastructures, e.g., nuclear power plants, autonomous vehicles, smart cities, and smart homes. These devices are inherently not secure across their comprehensive software, hardware, and network stacks, thus presenting a large attack surface that can be exploited by hackers. In this article, we present an innovative technique for detecting unknown system vulnerabilities, managing these vulnerabilities, and improving incident response when such vulnerabilities are exploited. The novelty of this approach lies in extracting intelligence from known real-world CPS/IoT attacks, representing them in the form of regular expressions, and employing machine learning (ML) techniques on this ensemble of regular expressions to generate new attack vectors and security vulnerabilities. Our results show that 10 new attack vectors and 122 new vulnerability exploits can be successfully generated that have the potential to exploit a CPS or an IoT ecosystem. The ML methodology achieves an accuracy of 97.4% and enables us to predict these attacks efficiently with an 87.2% reduction in the search space. We demonstrate the application of our method to the hacking of the in-vehicle network of a connected car. To defend against the known attacks and possible novel exploits, we discuss a defense-in-depth mechanism for various classes of attacks and the classification of data targeted by such attacks. This defense mechanism optimizes the cost of security measures based on the sensitivity of the protected resource, thus incentivizing its adoption in real-world CPS/IoT by cybersecurity practitioners.
    Machine Learning for $K$-adaptability in Two-stage Robust Optimization. (arXiv:2210.11152v1 [math.OC])
    Two-stage robust optimization problems constitute one of the hardest optimization problem classes. One of the solution approaches to this class of problems is $K$-adaptability. This approach simultaneously seeks the best partitioning of the uncertainty set of scenarios into $K$ subsets, and optimizes decisions corresponding to each of these subsets. In the general case, it is solved using the $K$-adaptability branch-and-bound algorithm, which requires exploration of exponentially growing solution trees. To accelerate finding high-quality solutions in such trees, we propose a machine learning-based node selection strategy. In particular, we construct a feature engineering scheme based on general two-stage robust optimization insights that allows us to train our machine learning tool on a database of resolved B\&B trees, and to apply it as-is to problems of different sizes and/or types. We experimentally show that our learned node selection strategy outperforms a vanilla, random node selection strategy when tested on problems of the same type as the training problems, even when the $K$-value or the problem size differs from the training ones.
    Multi-hypothesis 3D human pose estimation metrics favor miscalibrated distributions. (arXiv:2210.11179v1 [cs.CV])
    Due to depth ambiguities and occlusions, lifting 2D poses to 3D is a highly ill-posed problem. Well-calibrated distributions of possible poses can make these ambiguities explicit and preserve the resulting uncertainty for downstream tasks. This study shows that previous attempts, which account for these ambiguities via multiple hypothesis generation, produce miscalibrated distributions. We identify that miscalibration can be attributed to the use of sample-based metrics such as minMPJPE. In a series of simulations, we show that minimizing minMPJPE, as commonly done, should converge to the correct mean prediction. However, it fails to correctly capture the uncertainty, thus resulting in a miscalibrated distribution. To mitigate this problem, we propose an accurate and well-calibrated model called Conditional Graph Normalizing Flow (cGNF). Our model is structured such that a single cGNF can estimate both conditional and marginal densities within the same model, effectively solving a zero-shot density estimation problem. We evaluate cGNF on the Human 3.6M dataset and show that cGNF provides a well-calibrated distribution estimate while being close to state-of-the-art in terms of overall minMPJPE. Furthermore, cGNF outperforms previous methods on occluded joints while remaining well-calibrated.
    ESPNN: Deep Neural Network on the IAEA stopping power database. Atomic targets. (arXiv:2210.10950v1 [physics.atm-clus])
    The International Atomic Energy Agency (IAEA) stopping power database is a highly valued public resource compiling most of the experimental measurements published over nearly a century. The database -- accessible to the global scientific community -- is continuously updated and has been extensively employed in theoretical and experimental research for more than thirty years. This work aims to employ machine learning algorithms on the 2021 IAEA database to predict accurate electronic stopping power cross sections for any ion and target combination in a wide range of incident energies. Unsupervised machine learning methods are applied to clean the database in an automated manner. These techniques purge the data by removing suspicious outliers and old isolated values. A large portion of the remaining data is used to train a deep neural network, while the rest is set aside, constituting the test set. The present work considers collisional systems only with atomic targets. The first version of the electronic stopping power neural network code (espnn), openly available to users, is shown to yield predicted values in excellent agreement with the experimental results of the test set.
    An out-of-distribution discriminator based on Bayesian neural network epistemic uncertainty. (arXiv:2210.10780v1 [cs.LG])
    Neural networks have revolutionized the field of machine learning with increased predictive capability. In addition to improving the predictions of neural networks, there is a simultaneous demand for reliable uncertainty quantification on estimates made by machine learning methods such as neural networks. Bayesian neural networks (BNNs) are an important type of neural network with built-in capability for quantifying uncertainty. This paper discusses aleatoric and epistemic uncertainty in BNNs and how they can be calculated. With an example dataset of images where the goal is to identify the amplitude of an event in the image, it is shown that epistemic uncertainty tends to be lower in images which are well-represented in the training dataset and higher in images which are not well-represented. An algorithm for out-of-distribution (OoD) detection with BNN epistemic uncertainty is introduced along with various experiments demonstrating factors influencing the OoD detection capability in a BNN. The OoD detection capability with epistemic uncertainty is shown to be comparable to the OoD detection in the discriminator network of a generative adversarial network (GAN) with comparable network architecture.
    Synthetic Blip Effects: Generalizing Synthetic Controls for the Dynamic Treatment Regime. (arXiv:2210.11003v1 [econ.EM])
    We propose a generalization of the synthetic control and synthetic interventions methodology to the dynamic treatment regime. We consider the estimation of unit-specific treatment effects from panel data collected via a dynamic treatment regime and in the presence of unobserved confounding. That is, each unit receives multiple treatments sequentially, based on an adaptive policy, which depends on a latent endogenously time-varying confounding state of the treated unit. Under a low-rank latent factor model assumption and a technical overlap assumption we propose an identification strategy for any unit-specific mean outcome under any sequence of interventions. The latent factor model we propose admits linear time-varying and time-invariant dynamical systems as special cases. Our approach can be seen as an identification strategy for structural nested mean models under a low-rank latent factor assumption on the blip effects. Our method, which we term "synthetic blip effects", is a backwards induction process, where the blip effect of a treatment at each period and for a target unit is recursively expressed as linear combinations of blip effects of a carefully chosen group of other units that received the designated treatment. Our work avoids the combinatorial explosion in the number of units that would be required by a vanilla application of prior synthetic control and synthetic intervention methods in such dynamic treatment regime settings.
    Backdoor Attack and Defense in Federated Generative Adversarial Network-based Medical Image Synthesis. (arXiv:2210.10886v1 [cs.CV])
    Deep Learning-based image synthesis techniques have been applied in healthcare research for generating medical images to support open research and augment medical datasets. Training generative adversarial networks (GANs) usually requires large amounts of training data. Federated learning (FL) provides a way of training a central model using distributed data while keeping raw data locally. However, given that the FL server cannot access the raw data, it is vulnerable to backdoor attacks, in which an adversary poisons the training data. Most backdoor attack strategies focus on classification models and centralized domains. It is still an open question whether existing backdoor attacks can affect GAN training and, if so, how to defend against the attack in the FL setting. In this work, we investigate the overlooked issue of backdoor attacks in federated GANs (FedGANs). The success of this attack is subsequently determined to be the result of some local discriminators overfitting the poisoned data and corrupting the local GAN equilibrium, which then further contaminates other clients when averaging the generator's parameters and yields high generator loss. Therefore, we propose FedDetect, an efficient and effective way of defending against the backdoor attack in the FL setting, which allows the server to detect clients' adversarial behavior based on their losses and block the malicious clients. Our extensive experiments on two medical datasets with different modalities demonstrate that the backdoor attack on FedGANs can result in synthetic images with low fidelity. After detecting and suppressing the detected malicious clients using the proposed defense strategy, we show that FedGANs can synthesize high-quality medical datasets (with labels) for data augmentation to improve classification models' performance.
    Analyzing the Robustness of Decentralized Horizontal and Vertical Federated Learning Architectures in a Non-IID Scenario. (arXiv:2210.11061v1 [cs.LG])
    Federated learning (FL) allows participants to collaboratively train machine and deep learning models while protecting data privacy. However, the FL paradigm still presents drawbacks affecting its trustworthiness, since malicious participants could launch adversarial attacks against the training process. Related work has studied the robustness of horizontal FL scenarios under different attacks. However, there is a lack of work evaluating the robustness of decentralized vertical FL and comparing it with horizontal FL architectures affected by adversarial attacks. Thus, this work proposes three decentralized FL architectures, one for horizontal and two for vertical scenarios, namely HoriChain, VertiChain, and VertiComb. These architectures present different neural networks and training protocols suitable for horizontal and vertical scenarios. Then, a decentralized, privacy-preserving, and federated use case with non-IID data to classify handwritten digits is deployed to evaluate the performance of the three architectures. Finally, a set of experiments computes and compares the robustness of the proposed architectures under two adversarial attacks: data poisoning based on image watermarks, and gradient poisoning. The experiments show that even though particular configurations of both attacks can destroy the classification performance of the architectures, HoriChain is the most robust one.
    Robotic Table Wiping via Reinforcement Learning and Whole-body Trajectory Optimization. (arXiv:2210.10865v1 [cs.RO])
    We propose a framework to enable multipurpose assistive mobile robots to autonomously wipe tables to clean spills and crumbs. This problem is challenging, as it requires planning wiping actions while reasoning over uncertain latent dynamics of crumbs and spills captured via high-dimensional visual observations. Simultaneously, we must guarantee constraint satisfaction to enable safe deployment in unstructured cluttered environments. To tackle this problem, we first propose a stochastic differential equation to model crumbs and spill dynamics and absorption with a robot wiper. Using this model, we train a vision-based policy for planning wiping actions in simulation using reinforcement learning (RL). To enable zero-shot sim-to-real deployment, we dovetail the RL policy with a whole-body trajectory optimization framework to compute base and arm joint trajectories that execute the desired wiping motions while guaranteeing constraint satisfaction. We extensively validate our approach in simulation and on hardware. Video: https://youtu.be/inORKP4F3EI
    Optimization on Manifolds via Graph Gaussian Processes. (arXiv:2210.10962v1 [stat.ML])
    This paper integrates manifold learning techniques within a \emph{Gaussian process upper confidence bound} algorithm to optimize an objective function on a manifold. Our approach is motivated by applications where a full representation of the manifold is not available and querying the objective is expensive. We rely on a point cloud of manifold samples to define a graph Gaussian process surrogate model for the objective. Query points are sequentially chosen using the posterior distribution of the surrogate model given all previous queries. We establish regret bounds in terms of the number of queries and the size of the point cloud. Several numerical examples complement the theory and illustrate the performance of our method.
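    For readers who want to experiment, here is a self-contained sketch of the overall recipe under strong simplifying assumptions: a k-NN graph Laplacian on the point cloud, a Matern-like graph kernel $(L + \tau I)^{-1}$, and a plain GP-UCB loop. The kernel choice, $\tau$, $\beta$, and the toy objective are illustrative, not the paper's.

        import numpy as np

        rng = np.random.default_rng(1)

        # Point cloud sampled from a 1-D manifold (a circle) in R^2.
        theta = rng.uniform(0.0, 2.0 * np.pi, 300)
        cloud = np.c_[np.cos(theta), np.sin(theta)]
        f = np.sin(3.0 * theta)                    # objective, seen only via queries

        # k-NN graph and its combinatorial Laplacian.
        d2 = ((cloud[:, None, :] - cloud[None, :, :]) ** 2).sum(-1)
        nbrs = np.argsort(d2, axis=1)[:, 1:9]      # 8 nearest neighbours
        W = np.zeros_like(d2)
        for i, js in enumerate(nbrs):
            W[i, js] = 1.0
            W[js, i] = 1.0
        L = np.diag(W.sum(1)) - W

        # Graph GP prior: covariance (L + tau*I)^{-1}, a Matern-like graph kernel.
        tau = 0.1
        K = np.linalg.inv(L + tau * np.eye(len(cloud)))

        noise, beta = 1e-2, 2.0
        queried, ys = [], []
        for step in range(25):
            if queried:
                Kqq = K[np.ix_(queried, queried)] + noise * np.eye(len(queried))
                Kq = K[:, queried]
                mu = Kq @ np.linalg.solve(Kqq, np.array(ys))
                var = np.diag(K) - np.einsum('ij,ji->i', Kq, np.linalg.solve(Kqq, Kq.T))
            else:
                mu, var = np.zeros(len(cloud)), np.diag(K).copy()
            i = int(np.argmax(mu + beta * np.sqrt(np.maximum(var, 0.0))))
            queried.append(i)
            ys.append(f[i] + noise * rng.standard_normal())

        print("best queried value:", max(ys), "| true max:", f.max())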
    Robust Imitation via Mirror Descent Inverse Reinforcement Learning. (arXiv:2210.11201v1 [cs.LG])
    Recently, adversarial imitation learning has provided a scalable approach to reward acquisition for inverse reinforcement learning (IRL) problems. However, estimated reward signals often become uncertain and fail to train a reliable statistical model, since existing methods tend to solve hard optimization problems directly. Inspired by a first-order optimization method called mirror descent, this paper proposes to predict a sequence of reward functions, which are iterative solutions for a constrained convex problem. IRL solutions derived by mirror descent are tolerant to the uncertainty incurred by target density estimation, since the amount of reward learning is regulated with respect to local geometric constraints. We prove that the proposed mirror descent update rule ensures robust minimization of a Bregman divergence, with a rigorous regret bound of $\mathcal{O}(1/T)$ for step sizes $\{\eta_t\}_{t=1}^{T}$. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods in an extensive suite of benchmarks.
    G-Augment: Searching For The Meta-Structure Of Data Augmentation Policies For ASR. (arXiv:2210.10879v1 [cs.LG])
    Data augmentation is a ubiquitous technique used to provide robustness to automatic speech recognition (ASR) training. However, even as so much of the ASR training process has become automated and more "end-to-end", the data augmentation policy (what augmentation functions to use, and how to apply them) remains hand-crafted. We present G-Augment (Graph-Augment), a technique to define the augmentation space as directed acyclic graphs (DAGs) and search over this space to optimize the augmentation policy itself. We show that given the same computational budget, policies produced by G-Augment are able to perform better than SpecAugment policies obtained by random search on fine-tuning tasks on CHiME-6 and AMI. G-Augment is also able to establish a new state-of-the-art ASR performance on the CHiME-6 evaluation set (30.7% WER). We further demonstrate that G-Augment policies show better transfer properties across warm-start to cold-start training and model size compared to random-searched SpecAugment policies.
    Robust One-Shot Singing Voice Conversion. (arXiv:2210.11096v1 [cs.SD])
    Many existing works on singing voice conversion (SVC) require clean recordings of the target singer's voice for training. However, it is often difficult to collect them in advance, and singing voices are often distorted by reverb and accompaniment music. In this work, we propose robust one-shot SVC (ROSVC), which performs any-to-any SVC robustly even on such distorted singing voices, using less than 10s of a reference voice. To this end, we propose a two-stage training method called Robustify. In the first stage, a novel one-shot SVC model based on a generative adversarial network is trained on clean data to ensure high-quality conversion. In the second stage, enhancement modules are introduced to the encoders of the model to improve the robustness against distortions in the feature space. Experimental results show that the proposed method outperforms one-shot SVC baselines for both seen and unseen singers and greatly improves the robustness against the distortions.
    Scene Text Recognition with Semantics. (arXiv:2210.10836v1 [cs.CV])
    Scene Text Recognition (STR) models have achieved high performance in recent years on benchmark datasets where text images are presented with minimal noise. Traditional STR recognition pipelines take a cropped image as sole input and attempt to identify the characters present. This infrastructure can fail in instances where the input image is noisy or the text is partially obscured. This paper proposes using semantic information from the greater scene to contextualise predictions. We generate semantic vectors using object tags and fuse this information into a transformer-based architecture. The results demonstrate that our multimodal approach yields higher performance than traditional benchmark models, particularly on noisy instances.
  • Open

    On Tilted Losses in Machine Learning: Theory and Applications. (arXiv:2109.06141v2 [cs.LG] UPDATED)
    Exponential tilting is a technique commonly used in fields such as statistics, probability, information theory, and optimization to create parametric distribution shifts. Despite its prevalence in related fields, tilting has not seen widespread use in machine learning. In this work, we aim to bridge this gap by exploring the use of tilting in risk minimization. We study a simple extension to ERM -- tilted empirical risk minimization (TERM) -- which uses exponential tilting to flexibly tune the impact of individual losses. The resulting framework has several useful properties: We show that TERM can increase or decrease the influence of outliers, respectively, to enable fairness or robustness; has variance-reduction properties that can benefit generalization; and can be viewed as a smooth approximation to the tail probability of losses. Our work makes rigorous connections between TERM and related objectives, such as Value-at-Risk, Conditional Value-at-Risk, and distributionally robust optimization (DRO). We develop batch and stochastic first-order optimization methods for solving TERM, provide convergence guarantees for the solvers, and show that the framework can be efficiently solved relative to common alternatives. Finally, we demonstrate that TERM can be used for a multitude of applications in machine learning, such as enforcing fairness between subgroups, mitigating the effect of outliers, and handling class imbalance. Despite the straightforward modification TERM makes to traditional ERM objectives, we find that the framework can consistently outperform ERM and deliver competitive performance with state-of-the-art, problem-specific approaches.
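    The TERM objective itself is one line. A minimal sketch follows, with a small outlier example; the softmax-style re-weighting of individual losses is implicit in the log-sum-exp:

        import numpy as np
        from scipy.special import logsumexp

        def tilted_risk(losses, t):
            """(1/t) * log(mean(exp(t * losses))): t > 0 magnifies high losses
            (fairness flavor), t < 0 suppresses them (outlier robustness),
            and t -> 0 recovers the ordinary empirical risk."""
            losses = np.asarray(losses, dtype=float)
            if t == 0.0:
                return losses.mean()
            return (logsumexp(t * losses) - np.log(len(losses))) / t

        losses = [0.1, 0.2, 0.15, 5.0]             # one outlying loss
        for t in (-5.0, 0.0, 5.0):
            print(t, round(tilted_risk(losses, t), 3))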
    Sliced Gromov-Wasserstein. (arXiv:1905.10124v4 [stat.ML] UPDATED)
    Recently used in various machine learning contexts, the Gromov-Wasserstein distance (GW) allows for comparing distributions whose supports do not necessarily lie in the same metric space. However, this Optimal Transport (OT) distance requires solving a complex non-convex quadratic program, which is usually very costly in both time and memory. Contrary to GW, the Wasserstein distance (W) enjoys several properties (e.g. duality) that permit large-scale optimization. Among those, the solution of W on the real line, which only requires sorting discrete samples in 1D, allows defining the Sliced Wasserstein (SW) distance. This paper proposes a new divergence based on GW akin to SW. We first derive a closed form for GW when dealing with 1D distributions, based on a new result for the related quadratic assignment problem. We then define a novel OT discrepancy that can deal with large-scale distributions via a slicing approach, and we show how it relates to the GW distance while being $O(n\log(n))$ to compute. We illustrate the behavior of this so-called Sliced Gromov-Wasserstein (SGW) discrepancy in experiments where we demonstrate its ability to tackle similar problems as GW while being several orders of magnitude faster to compute.
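    A sketch of the resulting discrepancy, under uniform weights and equal sample sizes: pad the lower-dimensional cloud with zeros (the paper's padding operator), project both clouds on a shared random direction, sort, and use the paper's key fact that 1D GW is attained by either the increasing or the decreasing sorted alignment. For brevity each 1D cost is evaluated naively in $O(n^2)$ here; the paper's algebraic closed form brings this down to $O(n\log(n))$.

        import numpy as np

        def gw_cost_1d(x, y):
            """Quadratic Gromov-Wasserstein cost of the alignment x[i] <-> y[i]."""
            dx = np.abs(x[:, None] - x[None, :])
            dy = np.abs(y[:, None] - y[None, :])
            return ((dx - dy) ** 2).mean()

        def sliced_gw(X, Y, n_proj=50, seed=0):
            """Monte-Carlo sliced GW between equal-size uniform point clouds."""
            rng = np.random.default_rng(seed)
            d = max(X.shape[1], Y.shape[1])
            Xp = np.pad(X, ((0, 0), (0, d - X.shape[1])))
            Yp = np.pad(Y, ((0, 0), (0, d - Y.shape[1])))
            total = 0.0
            for _ in range(n_proj):
                u = rng.standard_normal(d)
                u /= np.linalg.norm(u)
                x, y = np.sort(Xp @ u), np.sort(Yp @ u)
                # 1-D GW is attained by the increasing or the decreasing
                # sorted alignment -- the paper's closed-form result.
                total += min(gw_cost_1d(x, y), gw_cost_1d(x, y[::-1]))
            return total / n_proj

        rng = np.random.default_rng(2)
        X = rng.standard_normal((64, 2))
        Y = np.c_[X, np.zeros(64)]                 # same cloud embedded in R^3
        print(sliced_gw(X, X), sliced_gw(X, Y))    # both 0: spaces differ, shape matches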
    Ranking with multiple types of pairwise comparisons. (arXiv:2206.13580v2 [stat.ML] UPDATED)
    The task of ranking individuals or teams, based on a set of comparisons between pairs, arises in various contexts, including sporting competitions and the analysis of dominance hierarchies among animals and humans. Given data on which competitors beat which others, the challenge is to rank the competitors from best to worst. Here we study the problem of computing rankings when there are multiple, potentially conflicting modes of comparison, such as multiple types of dominance behaviors among animals. We assume that we do not know a priori what information each behavior conveys about the ranking, or even whether they convey any information at all. Nonetheless we show that it is possible to compute a ranking in this situation and present a fast method for doing so, based on a combination of an expectation-maximization algorithm and a modified Bradley-Terry model. We give a selection of example applications to both animal and human competition.
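    The single-comparison-type core of such models is the classical Bradley-Terry fit; the paper wraps a modification of it in an expectation-maximization loop to weight the different comparison types. A sketch of the core alone, via Hunter's (2004) MM update, on made-up win counts:

        import numpy as np

        def bradley_terry(wins, n_iter=200):
            """Hunter's (2004) MM iteration for Bradley-Terry strengths;
            wins[i, j] = number of times i beat j."""
            games = wins + wins.T                  # total pairings per pair
            w = wins.sum(axis=1)                   # total wins per player
            pi = np.ones(wins.shape[0])
            for _ in range(n_iter):
                pi = w / (games / (pi[:, None] + pi[None, :])).sum(axis=1)
                pi /= pi.sum()
            return pi

        wins = np.array([[0, 7, 4],                # hypothetical single-type counts
                         [3, 0, 6],
                         [1, 2, 0]])
        print(bradley_terry(wins))                 # normalized strengths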
    BELIEF in Dependence: Leveraging Atomic Linearity in Data Bits for Rethinking Generalized Linear Models. (arXiv:2210.10852v1 [math.ST])
    Two linearly uncorrelated binary variables must also be independent, because non-linear dependence cannot manifest with only two possible states. This inherent linearity is the atom of dependency constituting any complex form of relationship. Inspired by this observation, we develop a framework called binary expansion linear effect (BELIEF) for assessing and understanding arbitrary relationships with a binary outcome. Models from the BELIEF framework are easily interpretable because they describe the association of binary variables in the language of linear models, yielding convenient theoretical insight and striking parallels with the Gaussian world. In particular, an algebraic structure on the predictors with nonzero slopes governs conditional independence properties. With BELIEF, one may study generalized linear models (GLMs) through transparent linear models, providing insight into how modeling is affected by the choice of link. For example, setting a GLM interaction coefficient to zero does not necessarily lead to the kind of no-interaction model assumption as understood under its linear model counterpart. Furthermore, for a binary response, maximum likelihood estimation for GLMs paradoxically fails under complete separation, when the data are most discriminative, whereas BELIEF estimation automatically reveals the perfect predictor in the data that is responsible for complete separation. We explore these phenomena and provide a host of related theoretical results. We also provide preliminary empirical demonstration and verification of some theoretical results.
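    The opening fact is easy to verify directly: for binary $X, Y$, every cell of the joint pmf deviates from the product of the marginals by exactly $\pm\mathrm{Cov}(X,Y)$, so zero covariance forces independence. A quick numerical check on random joint distributions:

        import numpy as np
        from itertools import product

        rng = np.random.default_rng(3)
        for _ in range(5):
            p = rng.dirichlet(np.ones(4)).reshape(2, 2)    # joint pmf on {0,1}^2
            px, py = p.sum(axis=1), p.sum(axis=0)
            cov = p[1, 1] - px[1] * py[1]                  # Cov(X, Y) for binary X, Y
            dev = max(abs(p[x, y] - px[x] * py[y]) for x, y in product((0, 1), repeat=2))
            print(f"cov = {cov:+.4f}   max |joint - product| = {dev:.4f}")
        # the two columns always agree in magnitude: zero covariance <=> independence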
    Generalization Properties of Decision Trees on Real-valued and Categorical Features. (arXiv:2210.10781v1 [stat.ML])
    We revisit binary decision trees from the perspective of partitions of the data. We introduce the notion of partitioning function, and we relate it to the growth function and to the VC dimension. We consider three types of features: real-valued, categorical ordinal and categorical nominal, with different split rules for each. For each feature type, we upper bound the partitioning function of the class of decision stumps before extending the bounds to the class of general decision tree (of any fixed structure) using a recursive approach. Using these new results, we are able to find the exact VC dimension of decision stumps on examples of $\ell$ real-valued features, which is given by the largest integer $d$ such that $2\ell \ge \binom{d}{\lfloor\frac{d}{2}\rfloor}$. Furthermore, we show that the VC dimension of a binary tree structure with $L_T$ leaves on examples of $\ell$ real-valued features is in $O(L_T \log(L_T\ell))$. Finally, we elaborate a pruning algorithm based on these results that performs better than the cost-complexity and reduced-error pruning algorithms on a number of data sets, with the advantage that no cross-validation is required.
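    The stump VC dimension formula is straightforward to evaluate; a small sketch computing the largest such $d$ for a given number of real-valued features $\ell$:

        from math import comb

        def stump_vc_dim(l):
            """Largest d with 2*l >= C(d, floor(d/2)): the exact VC dimension
            of decision stumps on examples with l real-valued features."""
            d = 1
            while 2 * l >= comb(d + 1, (d + 1) // 2):
                d += 1
            return d

        for l in (1, 2, 3, 5, 10, 100):
            print(l, stump_vc_dim(l))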
    PAC-Bayesian Learning of Optimization Algorithms. (arXiv:2210.11113v1 [cs.LG])
    We apply the PAC-Bayes theory to the setting of learning-to-optimize. To the best of our knowledge, we present the first framework to learn optimization algorithms with provable generalization guarantees (PAC-bounds) and an explicit trade-off between a high probability of convergence and a high convergence speed. Even in the limit case, where convergence is guaranteed, our learned optimization algorithms provably outperform related algorithms based on a (deterministic) worst-case analysis. Our results rely on PAC-Bayes bounds for general, unbounded loss-functions based on exponential families. By generalizing existing ideas, we reformulate the learning procedure into a one-dimensional minimization problem and study the possibility of finding a global minimum, which enables the algorithmic realization of the learning procedure. As a proof-of-concept, we learn hyperparameters of standard optimization algorithms to empirically underline our theory.
    Improving Data Quality with Training Dynamics of Gradient Boosting Decision Trees. (arXiv:2210.11327v1 [cs.LG])
    Real-world datasets contain incorrectly labeled instances that hamper the performance of the model and, in particular, its ability to generalize out of distribution. Moreover, each example might contribute differently towards learning. This motivates studies to better understand the role of data instances with respect to their contribution to model performance. In this paper we propose a method based on metrics computed from the training dynamics of Gradient Boosting Decision Trees (GBDTs) to assess the behavior of each training example. We focus on datasets containing mostly tabular or structured data, for which ensembles of decision trees are still the state-of-the-art in terms of performance. We show results on detecting noisy labels in order to remove them, improving models' metrics on synthetic and real datasets, as well as on a production dataset. Our method achieved the best results overall when compared with confident learning and heuristic baselines.
    Multirate Training of Neural Networks. (arXiv:2106.10771v3 [cs.LG] UPDATED)
    We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained on different time scales, where slow parts are updated less frequently. By choosing appropriate partitionings we can obtain substantial computational speed-up for transfer learning tasks. We show for applications in vision and NLP that we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting models. We analyze the convergence properties of our multirate scheme and draw a comparison with vanilla SGD. We also discuss splitting choices for the neural network parameters which could enhance generalization performance when neural networks are trained from scratch. A multirate approach can be used to learn different features present in the data and as a form of regularization. Our paper unlocks the potential of using multirate techniques for neural network training and provides several starting points for future work in this area.
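    A toy sketch of the partitioned update schedule on a least-squares model; the partition, the every-k schedule, and the compensating step size for the slow block are illustrative choices, not the paper's prescription:

        import numpy as np

        rng = np.random.default_rng(4)
        X = rng.standard_normal((256, 20))
        w_true = rng.standard_normal(20)
        y = X @ w_true

        w = np.zeros(20)
        fast = np.arange(10)                       # e.g. the layers being fine-tuned
        slow = np.arange(10, 20)                   # e.g. a mostly-frozen backbone
        lr, k = 1e-2, 4                            # slow block updated every k-th step

        for step in range(2000):
            grad = 2.0 * X.T @ (X @ w - y) / len(y)
            w[fast] -= lr * grad[fast]
            if step % k == 0:                      # fewer, larger slow updates
                w[slow] -= lr * k * grad[slow]

        print("parameter error:", np.linalg.norm(w - w_true))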
    Some distance bounds of branching processes and their diffusion limits. (arXiv:1005.3758v1 [math.PR] CROSS LISTED)
    We compute exact values, and in some cases bounds, of "distances" - in the sense of (transforms of) power divergences and relative entropy - between two discrete-time Galton-Watson branching processes with immigration (GWI) for which both the offspring and the immigration are arbitrarily Poisson-distributed (leading to arbitrary types of criticality). Implications for the asymptotic distinguishability behaviour, in terms of contiguity and entire separation of the involved GWI, are given too. Furthermore, we determine the corresponding limit quantities for the context in which the two GWI converge to Feller-type branching diffusion processes, as the time-lags between observations tend to zero. Some applications to (static random environment like) Bayesian decision making and Neyman-Pearson testing are presented as well.
    ML4C: Seeing Causality Through Latent Vicinity. (arXiv:2110.00637v2 [cs.LG] UPDATED)
    Supervised Causal Learning (SCL) aims to learn causal relations from observational data by accessing previously seen datasets associated with ground truth causal relations. This paper presents a first attempt at addressing a fundamental question: what are the benefits from supervision, and how does it benefit? Observing that SCL is no better than random guessing if the learning target is non-identifiable a priori, we propose a two-phase paradigm for SCL by explicitly considering structure identifiability. Following this paradigm, we tackle the problem of SCL on discrete data and propose ML4C. The core of ML4C is a binary classifier with a novel learning target: it classifies whether an Unshielded Triple (UT) is a v-structure or not. Specifically, starting from an input dataset with the corresponding skeleton provided, ML4C orients each UT once it is classified as a v-structure. These v-structures are together used to construct the final output. To address the fundamental question of SCL, we propose a principled method for ML4C featurization: we exploit the vicinity of a given UT (i.e., the neighbors of the UT in the skeleton), and derive features by considering the conditional dependencies and structural entanglement within the vicinity. We further prove that ML4C is asymptotically correct. Last but not least, thorough experiments conducted on benchmark datasets demonstrate that ML4C remarkably outperforms other state-of-the-art algorithms in terms of accuracy, reliability, robustness and tolerance. In summary, ML4C shows promising results on validating the effectiveness of supervision for causal learning. Our codes are publicly available at https://github.com/microsoft/ML4C.
    Modeling Randomly Walking Volatility with Chained Gamma Distributions. (arXiv:2207.01151v3 [q-fin.CP] UPDATED)
    Volatility clustering is a common phenomenon in financial time series. Typically, linear models can be used to describe the temporal autocorrelation of the (logarithmic) variance of returns. Considering the difficulty of estimating this model, we construct a Dynamic Bayesian Network, which utilizes the conjugate prior relations normal-gamma and gamma-gamma, so that its posterior form locally remains unchanged at each node. This makes it possible to find approximate solutions quickly using variational methods. Furthermore, we ensure that the volatility expressed by the model is an independent incremental process after inserting dummy gamma nodes between adjacent time steps. We have found that this model has two advantages: 1) it can be proved to express heavier tails than Gaussians, i.e., to have positive excess kurtosis, compared to popular linear models; 2) if variational inference (VI) is used for state estimation, it runs much faster than Monte Carlo (MC) methods, since the calculation of the posterior uses only basic arithmetic operations, and its convergence process is deterministic. We tested the model, named Gam-Chain, using recent Crypto, Nasdaq, and Forex records of varying resolutions. The results show that: 1) when using MC, this model can achieve state estimation results comparable to those of the regular lognormal chain; 2) when using only VI, this model can obtain accuracy that is slightly worse than MC, but still acceptable in practice; 3) using only VI, the running time of Gam-Chain can in general be reduced to below 5% of that of the lognormal chain via MC.
    Krylov-Bellman boosting: Super-linear policy evaluation in general state spaces. (arXiv:2210.11377v1 [stat.ML])
    We present and analyze the Krylov-Bellman Boosting (KBB) algorithm for policy evaluation in general state spaces. It alternates between fitting the Bellman residual using non-parametric regression (as in boosting), and estimating the value function via the least-squares temporal difference (LSTD) procedure applied with a feature set that grows adaptively over time. By exploiting the connection to Krylov methods, we equip this method with two attractive guarantees. First, we provide a general convergence bound that allows for separate estimation errors in residual fitting and LSTD computation. Consistent with our numerical experiments, this bound shows that convergence rates depend on the restricted spectral structure, and are typically super-linear. Second, by combining this meta-result with sample-size dependent guarantees for residual fitting and LSTD computation, we obtain concrete statistical guarantees that depend on the sample size along with the complexity of the function class used to fit the residuals. We illustrate the behavior of the KBB algorithm for various types of policy evaluation problems, and typically find large reductions in sample complexity relative to the standard approach of fitted value iteration.
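    In the tabular, known-model limit the algorithm reduces to a transparent Krylov iteration, since fitting the Bellman residual exactly just appends it as a new feature. A sketch under those simplifying assumptions (the paper's actual algorithm replaces the exact residual with nonparametric regression from samples):

        import numpy as np

        rng = np.random.default_rng(5)
        n, gamma = 50, 0.9
        P = rng.dirichlet(np.ones(n), size=n)      # the policy's transition matrix
        r = rng.standard_normal(n)
        v_star = np.linalg.solve(np.eye(n) - gamma * P, r)

        def lstd(Phi):
            """LSTD: solve the projected Bellman equation in span(Phi)."""
            A = Phi.T @ (Phi - gamma * P @ Phi)
            w = np.linalg.lstsq(A, Phi.T @ r, rcond=None)[0]
            return Phi @ w

        Phi = r[:, None].copy()                    # first feature: the reward itself
        v = lstd(Phi)
        for it in range(8):
            residual = r + gamma * P @ v - v       # Bellman residual
            Phi = np.c_[Phi, residual]             # exact "fit": append as a feature
            v = lstd(Phi)
            print(it, np.linalg.norm(v - v_star))  # error typically decays super-linearly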
    PAC-learning is Undecidable. (arXiv:1808.06324v3 [cs.LG] UPDATED)
    The problem of attempting to learn the mapping between data and labels is the crux of any machine learning task. It is, therefore, of interest to the machine learning community, on practical as well as theoretical grounds, to consider the existence of a test or criterion for deciding the feasibility of attempting to learn. We investigate the existence of such a criterion in the setting of PAC-learning, basing the feasibility solely on whether the mapping to be learnt lends itself to approximation by a given class of hypothesis functions. We show that no such criterion exists, exposing a fundamental limitation in the decidability of learning. In other words, we prove that testing for PAC-learnability is undecidable in the Turing sense. We also briefly discuss some of the probable implications of this result to the current practice of machine learning.
    Rashomon Capacity: A Metric for Predictive Multiplicity in Classification. (arXiv:2206.01295v2 [cs.LG] UPDATED)
    Predictive multiplicity occurs when classification models with statistically indistinguishable performances assign conflicting predictions to individual samples. When used for decision-making in applications of consequence (e.g., lending, education, criminal justice), models developed without regard for predictive multiplicity may result in unjustified and arbitrary decisions for specific individuals. We introduce a new metric, called Rashomon Capacity, to measure predictive multiplicity in probabilistic classification. Prior metrics for predictive multiplicity focus on classifiers that output thresholded (i.e., 0-1) predicted classes. In contrast, Rashomon Capacity applies to probabilistic classifiers, capturing more nuanced score variations for individual samples. We provide a rigorous derivation for Rashomon Capacity, argue its intuitive appeal, and demonstrate how to estimate it in practice. We show that Rashomon Capacity yields principled strategies for disclosing conflicting models to stakeholders. Our numerical experiments illustrate how Rashomon Capacity captures predictive multiplicity in various datasets and learning models, including neural networks. The tools introduced in this paper can help data scientists measure and report predictive multiplicity prior to model deployment.
    Global Convergence of SGD On Two Layer Neural Nets. (arXiv:2210.11452v1 [cs.LG])
    In this note we demonstrate provable convergence of SGD to the global minima of appropriately regularized $\ell_2$-empirical risk of depth $2$ nets -- for arbitrary data and with any number of gates, if they use adequately smooth and bounded activations like sigmoid and tanh. We build on the results in [1] and leverage a constant amount of Frobenius norm regularization on the weights, along with sampling of the initial weights from an appropriate distribution. We also give a continuous-time SGD convergence result that also applies to smooth unbounded activations like SoftPlus. Our key idea is to show the existence of loss functions on constant-sized neural nets which are "Villani functions".
    Vertical Federated Linear Contextual Bandits. (arXiv:2210.11050v1 [cs.LG])
    In this paper, we investigate a novel problem of building contextual bandits in the vertical federated setting, i.e., contextual information is vertically distributed over different departments. This problem remains largely unexplored in the research community. To this end, we carefully design a customized encryption scheme named the orthogonal matrix-based mask mechanism (O3M) for encrypting local contextual information while avoiding expensive conventional cryptographic techniques. We further apply the mechanism to two commonly-used bandit algorithms, LinUCB and LinTS, and instantiate two practical protocols for online recommendation under the vertical federated setting. The proposed protocols can perfectly recover the service quality of centralized bandit algorithms while achieving satisfactory runtime efficiency, which is theoretically proved and analyzed in this paper. By conducting extensive experiments on both synthetic and real-world datasets, we show the superiority of the proposed method in terms of privacy protection and recommendation performance.
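    The central property an orthogonal mask needs is that it preserves inner products, so Gram-matrix-based statistics (as in LinUCB or ridge regression) are unchanged by masking. A quick check of that property alone (the full O3M protocol involves more than this):

        import numpy as np

        rng = np.random.default_rng(6)
        d, n = 16, 200
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # secret orthogonal mask
        X = rng.standard_normal((n, d))                    # raw local contexts
        X_masked = X @ Q.T                                 # what leaves the department

        # Inner products, hence Gram matrices, survive the mask exactly.
        print(np.allclose(X_masked @ X_masked.T, X @ X.T))

        # Consequently a ridge fit on masked features makes identical predictions.
        y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)
        lam = 1.0
        w = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        wm = np.linalg.solve(X_masked.T @ X_masked + lam * np.eye(d), X_masked.T @ y)
        print(np.allclose(X @ w, X_masked @ wm))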
    Surprises in adversarially-trained linear regression. (arXiv:2205.12695v2 [stat.ML] UPDATED)
    State-of-the-art machine learning models can be vulnerable to very small input perturbations that are adversarially constructed. Adversarial training is an effective approach to defend against such examples. It is formulated as a min-max problem, searching for the best solution when the training data were corrupted by the worst-case attacks. For linear regression problems, adversarial training can be formulated as a convex problem. We use this reformulation to make two technical contributions: first, we formulate the training problem as an instance of robust regression to reveal its connection to parameter-shrinking methods, specifically that $\ell_\infty$-adversarial training produces sparse solutions. Second, we study adversarial training in the overparameterized regime, i.e. when there are more parameters than data. We prove that adversarial training with small disturbances gives the solution with the minimum norm that interpolates the training data. Ridge regression and lasso approximate such interpolating solutions as their regularization parameter vanishes. By contrast, for adversarial training, the transition into the interpolation regime is abrupt and occurs at non-zero values of disturbance. This result is proved and illustrated with numerical examples.
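    The convex reformulation for $\ell_\infty$ attacks rests on the standard identity $\max_{\|\delta\|_\infty \le \epsilon} (y - (x+\delta)^\top w)^2 = (|y - x^\top w| + \epsilon\|w\|_1)^2$, whose $\ell_1$ term explains the sparsity. A numerical check (all names illustrative):

        import numpy as np

        rng = np.random.default_rng(7)
        d, eps = 5, 0.3
        w = rng.standard_normal(d)
        x, y = rng.standard_normal(d), rng.standard_normal()

        # Closed form: max over ||delta||_inf <= eps of (y - (x + delta)^T w)^2.
        closed = (abs(y - x @ w) + eps * np.abs(w).sum()) ** 2

        # The maximizing delta pushes every coordinate against the residual sign.
        delta = -eps * np.sign(w) * np.sign(y - x @ w)
        print(np.isclose((y - (x + delta) @ w) ** 2, closed))   # True

        # So eps-adversarial training is the convex problem
        #   min_w  mean_i (|y_i - x_i^T w| + eps * ||w||_1)^2,
        # whose l1 penalty is what produces sparse solutions.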
    Universal Inference. (arXiv:1912.11436v4 [math.ST] UPDATED)
    We propose a general method for constructing hypothesis tests and confidence sets that have finite sample guarantees without regularity conditions. We refer to such procedures as "universal." The method is very simple and is based on a modified version of the usual likelihood ratio statistic, that we call "the split likelihood ratio test" (split LRT). The method is especially appealing for irregular statistical models. Canonical examples include mixture models and models that arise in shape-constrained inference. Constructing tests and confidence sets for such models is notoriously difficult. Typical inference methods, like the likelihood ratio test, are not useful in these cases because they have intractable limiting distributions. In contrast, the method we suggest works for any parametric model and also for some nonparametric models. The split LRT can also be used with profile likelihoods to deal with nuisance parameters, and it can also be run sequentially to yield anytime-valid $p$-values and confidence sequences.
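    A minimal instance of the split LRT, for testing a Gaussian mean with known variance: fit the alternative on one half of the data, evaluate the likelihood ratio on the other half, and reject when it exceeds $1/\alpha$ (valid in finite samples by Markov's inequality). Details such as the 50/50 split are illustrative:

        import numpy as np
        from scipy.stats import norm

        def split_lrt_reject(x, mu0, alpha=0.05, seed=0):
            """Universal test of H0: mu = mu0 for N(mu, 1) data."""
            rng = np.random.default_rng(seed)
            idx = rng.permutation(len(x))
            d1, d0 = x[idx[: len(x) // 2]], x[idx[len(x) // 2:]]
            mu_hat = d1.mean()                     # alternative fitted on D1 only
            log_T = norm.logpdf(d0, mu_hat).sum() - norm.logpdf(d0, mu0).sum()
            return log_T >= np.log(1.0 / alpha)    # reject iff T >= 1/alpha

        rng = np.random.default_rng(8)
        x = rng.normal(0.3, 1.0, size=200)
        print(split_lrt_reject(x, mu0=0.0), split_lrt_reject(x, mu0=0.3))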
    A Note on the Efficient Evaluation of PAC-Bayes Bounds. (arXiv:2209.05188v2 [cs.LG] UPDATED)
    When utilising PAC-Bayes theory for risk certification, it is usually necessary to estimate and bound the Gibbs risk of the PAC-Bayes posterior. Many works in the literature employ a method for this which requires a large number of passes of the dataset, incurring high computational cost. This manuscript presents a very general alternative which makes computational savings on the order of the dataset size.
    Hierarchical classification at multiple operating points. (arXiv:2210.10929v1 [cs.LG])
    Many classification problems consider classes that form a hierarchy. Classifiers that are aware of this hierarchy may be able to make confident predictions at a coarse level despite being uncertain at the fine-grained level. While it is generally possible to vary the granularity of predictions using a threshold at inference time, most contemporary work considers only leaf-node prediction, and almost no prior work has compared methods at multiple operating points. We present an efficient algorithm to produce operating characteristic curves for any method that assigns a score to every class in the hierarchy. Applying this technique to evaluate existing methods reveals that top-down classifiers are dominated by a naive flat softmax classifier across the entire operating range. We further propose two novel loss functions and show that a soft variant of the structured hinge loss is able to significantly outperform the flat baseline. Finally, we investigate the poor accuracy of top-down classifiers and demonstrate that they perform relatively well on unseen classes. Code is available online at https://github.com/jvlmdr/hiercls.
    On Representations of Mean-Field Variational Inference. (arXiv:2210.11385v1 [stat.ML])
    The mean field variational inference (MFVI) formulation restricts the general Bayesian inference problem to the subspace of product measures. We present a framework to analyze MFVI algorithms, which is inspired by a similar development for general variational Bayesian formulations. Our approach enables the MFVI problem to be represented in three different manners: a gradient flow on Wasserstein space, a system of Fokker-Planck-like equations and a diffusion process. Rigorous guarantees are established to show that a time-discretized implementation of the coordinate ascent variational inference algorithm in the product Wasserstein space of measures yields a gradient flow in the limit. A similar result is obtained for their associated densities, with the limit being given by a quasi-linear partial differential equation. A popular class of practical algorithms falls in this framework, which provides tools to establish convergence. We hope this framework could be used to guarantee convergence of algorithms in a variety of approaches, old and new, to solve variational inference problems.
    On Margins and Generalisation for Voting Classifiers. (arXiv:2206.04607v2 [cs.LG] UPDATED)
    We study the generalisation properties of majority voting on finite ensembles of classifiers, proving margin-based generalisation bounds via the PAC-Bayes theory. These provide state-of-the-art guarantees on a number of classification tasks. Our central results leverage the Dirichlet posteriors studied recently by Zantedeschi et al. [2021] for training voting classifiers; in contrast to that work our bounds apply to non-randomised votes via the use of margins. Our contributions add perspective to the debate on the "margins theory" proposed by Schapire et al. [1998] for the generalisation of ensemble classifiers.
    DisC-VC: Disentangled and F0-Controllable Neural Voice Conversion. (arXiv:2210.11059v1 [eess.AS])
    Voice conversion is the task of converting a non-linguistic feature of a given utterance. Since the naturalness of speech strongly depends on its pitch pattern, in some applications it would be desirable to keep the original rise/fall pitch pattern while changing the speaker identity. Some of the existing methods address this problem by either using a source-filter model or developing a neural network that takes an F0 pattern as input to the model. Although the latter approach can achieve relatively high sound quality compared to the former, its training process does not account for the discrepancy between the target and generated F0 patterns. In this paper, we propose a new variational-autoencoder-based voice conversion model accompanied by an auxiliary network, which ensures that the conversion result correctly reflects the specified F0/timbre information. We show the effectiveness of the proposed method by objective and subjective evaluations.
    How Does a Deep Learning Model Architecture Impact Its Privacy?. (arXiv:2210.11049v1 [cs.CR])
    As a booming research area in the past decade, deep learning technologies have been driven by big data collected and processed on an unprecedented scale. However, the sensitive information in the collected training data raises privacy concerns. Recent research indicated that deep learning models are vulnerable to various privacy attacks, including membership inference attacks, attribute inference attacks, and gradient inversion attacks. It is noteworthy that the performance of the attacks varies from model to model. In this paper, we conduct empirical analyses to answer a fundamental question: Does model architecture affect model privacy? We investigate several representative model architectures from CNNs to Transformers, and show that Transformers are generally more vulnerable to privacy attacks than CNNs. We further demonstrate that the micro design of activation layers, stem layers, and bias parameters, are the major reasons why CNNs are more resilient to privacy attacks than Transformers. We also find that the presence of attention modules is another reason why Transformers are more vulnerable to privacy attacks. We hope our discovery can shed some new light on how to defend against the investigated privacy attacks and help the community build privacy-friendly model architectures.
    On Feature Learning in the Presence of Spurious Correlations. (arXiv:2210.11369v1 [cs.LG])
    Deep classifiers are known to rely on spurious features -- patterns which are correlated with the target on the training data but not inherently relevant to the learning problem, such as the image backgrounds when classifying the foregrounds. In this paper we evaluate the amount of information about the core (non-spurious) features that can be decoded from the representations learned by standard empirical risk minimization (ERM) and specialized group robustness training. Following recent work on Deep Feature Reweighting (DFR), we evaluate the feature representations by re-training the last layer of the model on a held-out set where the spurious correlation is broken. On multiple vision and NLP problems, we show that the features learned by simple ERM are highly competitive with the features learned by specialized group robustness methods targeted at reducing the effect of spurious correlations. Moreover, we show that the quality of learned feature representations is greatly affected by the design decisions beyond the training method, such as the model architecture and pre-training strategy. On the other hand, we find that strong regularization is not necessary for learning high quality feature representations. Finally, using insights from our analysis, we significantly improve upon the best results reported in the literature on the popular Waterbirds, CelebA hair color prediction and WILDS-FMOW problems, achieving 97%, 92% and 50% worst-group accuracies, respectively.
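    The DFR recipe the evaluation builds on is compact enough to sketch end to end on synthetic features (all data and names here are invented for illustration): train a linear "last layer" on data where a spurious feature is highly correlated with the label, re-fit it on a small held-out set where that correlation is broken, and compare on the worst group.

        import numpy as np

        rng = np.random.default_rng(9)

        def make_data(n, corr):
            """Features [core, spurious]; spurious agrees with the +-1 label
            with probability `corr`."""
            y = rng.integers(0, 2, n) * 2 - 1
            core = y + 0.8 * rng.standard_normal(n)
            agree = np.where(rng.random(n) < corr, 1.0, -1.0)
            spur = y * agree + 0.2 * rng.standard_normal(n)
            return np.c_[core, spur], y

        def fit_linear(X, y, lam=1e-2):
            return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

        Xtr, ytr = make_data(5000, corr=0.95)      # biased training data
        w_erm = fit_linear(Xtr, ytr)               # "last layer" trained with ERM

        Xho, yho = make_data(500, corr=0.5)        # held-out set: correlation broken
        w_dfr = fit_linear(Xho, yho)               # DFR: re-fit the last layer only

        Xte, yte = make_data(5000, corr=0.0)       # worst group: spurious flipped
        for name, w in (("ERM", w_erm), ("DFR", w_dfr)):
            print(name, round(float((np.sign(Xte @ w) == yte).mean()), 3))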
    Revisiting Le Cam's Equation: Exact Minimax Rates over Convex Density Classes. (arXiv:2210.11436v1 [math.ST])
    We study the classical problem of deriving minimax rates for density estimation over convex density classes. Building on the pioneering work of Le Cam (1973), Birgé (1983, 1986), Wong and Shen (1995), and Yang and Barron (1999), we determine the exact (up to constants) minimax rate over any convex density class. This work thus extends these known results by demonstrating that the local metric entropy of the density class always captures the minimax optimal rates under such settings. Our bounds provide a unifying perspective across both parametric and nonparametric convex density classes, under weaker assumptions on the richness of the density class than previously considered. Our proposed 'multistage sieve' MLE applies to any such convex density class. We apply our risk bounds to rederive known minimax rates, including those for bounded total variation and Hölder density classes. We further illustrate the utility of the result by deriving upper bounds for less studied classes, e.g., convex mixtures of densities.
    Learning Preferences for Interactive Autonomy. (arXiv:2210.10899v1 [cs.RO])
    When robots enter everyday human environments, they need to understand their tasks and how they should perform those tasks. To encode these, reward functions, which specify the objective of a robot, are employed. However, designing reward functions can be extremely challenging for complex tasks and environments. A promising approach is to learn reward functions from humans. Recently, several robot learning works embrace this approach and leverage human demonstrations to learn the reward functions. Known as inverse reinforcement learning, this approach relies on a fundamental assumption: humans can provide near-optimal demonstrations to the robot. Unfortunately, this is rarely the case: human demonstrations to the robot are often suboptimal due to various reasons, e.g., difficulty of teleoperation, robot having high degrees of freedom, or humans' cognitive limitations. This thesis is an attempt towards learning reward functions from human users by using other, more reliable data modalities. Specifically, we study how reward functions can be learned using comparative feedback, in which the human user compares multiple robot trajectories instead of (or in addition to) providing demonstrations. To this end, we first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, scaled comparisons; and describe how a robot can use these various forms of human feedback to infer a reward function, which may be parametric or non-parametric. Next, we propose active learning techniques to enable the robot to ask for comparison feedback that optimizes for the expected information that will be gained from that user feedback. Finally, we demonstrate the applicability of our methods in a wide variety of domains, ranging from autonomous driving simulations to home robotics, from standard reinforcement learning benchmarks to lower-body exoskeletons.
    Independence Testing-Based Approach to Causal Discovery under Measurement Error and Linear Non-Gaussian Models. (arXiv:2210.11021v1 [cs.LG])
    Causal discovery aims to recover causal structures generating the observational data. Despite its success in certain problems, in many real-world scenarios the observed variables are not the target variables of interest, but the imperfect measures of the target variables. Causal discovery under measurement error aims to recover the causal graph among unobserved target variables from observations made with measurement error. We consider a specific formulation of the problem, where the unobserved target variables follow a linear non-Gaussian acyclic model, and the measurement process follows the random measurement error model. Existing methods on this formulation rely on non-scalable over-complete independent component analysis (OICA). In this work, we propose the Transformed Independent Noise (TIN) condition, which checks for independence between a specific linear transformation of some measured variables and certain other measured variables. By leveraging the non-Gaussianity and higher-order statistics of data, TIN is informative about the graph structure among the unobserved target variables. By utilizing TIN, the ordered group decomposition of the causal model is identifiable. In other words, we could achieve what once required OICA to achieve by only conducting independence tests. Experimental results on both synthetic and real-world data demonstrate the effectiveness and reliability of our method.
    Entire Space Counterfactual Learning: Tuning, Analytical Properties and Industrial Applications. (arXiv:2210.11039v1 [cs.LG])
    As a basic research problem for building effective recommender systems, post-click conversion rate (CVR) estimation has long been plagued by sample selection bias and data sparsity issues. To address the data sparsity issue, prevalent methods based on the entire-space multi-task model leverage the sequential pattern of user actions, i.e. exposure $\rightarrow$ click $\rightarrow$ conversion, to construct auxiliary learning tasks. However, they still fall short of guaranteeing the unbiasedness of CVR estimates. This paper theoretically demonstrates two defects of these entire-space multi-task models: (1) inherent estimation bias (IEB) for CVR estimation, where the CVR estimate is inherently higher than the ground truth; (2) potential independence priority (PIP) for CTCVR estimation, where the causality from click to conversion might be overlooked. This paper further proposes a principled method named the entire space counterfactual multi-task model (ESCM$^2$), which employs a counterfactual risk minimizer to handle both the IEB and PIP issues at once. To demonstrate the effectiveness of the proposed method, this paper explores its parameter tuning in practice, derives its analytic properties, and showcases its effectiveness in industrial CVR estimation, where ESCM$^2$ can effectively alleviate the intrinsic IEB and PIP issues and outperform baseline models.
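    ESCM$^2$'s exact objective is not reproduced here, but the counterfactual ingredient it builds on can be sketched: an inverse-propensity-weighted CVR loss over all exposed items, which removes the "train only on clicked samples" selection bias. All names and the toy data are illustrative:

        import numpy as np

        def ips_cvr_risk(cvr_pred, click, conversion, ctr_pred, eps=1e-6):
            """Inverse-propensity-weighted CVR log-loss over all exposed items.
            Conversions are observed only on clicked impressions, so each clicked
            sample is re-weighted by 1 / P(click) to stand in for the full space."""
            w = click / np.clip(ctr_pred, eps, 1.0)
            ll = conversion * np.log(cvr_pred) + (1 - conversion) * np.log(1 - cvr_pred)
            return -(w * ll).mean()

        rng = np.random.default_rng(10)
        n = 10_000
        ctr = rng.uniform(0.05, 0.5, n)            # true click propensities
        click = rng.binomial(1, ctr)
        conversion = click * rng.binomial(1, 0.1, n)   # observed only if clicked

        grid = np.linspace(0.01, 0.5, 50)
        risks = [ips_cvr_risk(np.full(n, c), click, conversion, ctr) for c in grid]
        print("risk-minimizing constant CVR:", grid[int(np.argmin(risks))])  # ~0.1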
    On Learning Fairness and Accuracy on Multiple Subgroups. (arXiv:2210.10837v1 [stat.ML])
    We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criteria of group sufficiency. We focus on the scenario where the data contain multiple or even many subgroups, each with a limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups by formulating it as a bilevel objective. Specifically, the subgroup-specific predictors are learned in the lower level through a small amount of data and the fair predictor. In the upper level, the fair predictor is updated to be close to all subgroup-specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and the generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests consistently improved fair predictions, as well as accuracy comparable to the baselines.
    Learning Rationalizable Equilibria in Multiplayer Games. (arXiv:2210.11402v1 [cs.LG])
    A natural goal in multiagent learning besides finding equilibria is to learn rationalizable behavior, where players learn to avoid iteratively dominated actions. However, even in the basic setting of multiplayer general-sum games, existing algorithms require a number of samples exponential in the number of players to learn rationalizable equilibria under bandit feedback. This paper develops the first line of efficient algorithms for learning rationalizable Coarse Correlated Equilibria (CCE) and Correlated Equilibria (CE) whose sample complexities are polynomial in all problem parameters including the number of players. To achieve this result, we also develop a new efficient algorithm for the simpler task of finding one rationalizable action profile (not necessarily an equilibrium), whose sample complexity substantially improves over the best existing results of Wu et al. (2021). Our algorithms incorporate several novel techniques to guarantee rationalizability and no (swap-)regret simultaneously, including a correlated exploration scheme and adaptive learning rates, which may be of independent interest. We complement our results with a sample complexity lower bound showing the sharpness of our guarantees.
    Bagging in overparameterized learning: Risk characterization and risk monotonization. (arXiv:2210.11445v1 [math.ST])
    Bagging is a commonly used ensemble technique in statistics and machine learning to improve the performance of prediction procedures. In this paper, we study the prediction risk of variants of bagged predictors in the proportional asymptotics regime, in which the ratio of the number of features to the number of observations converges to a constant. Specifically, we propose a general strategy to analyze prediction risk under squared error loss of bagged predictors using classical results on simple random sampling. Specializing the strategy, we derive the exact asymptotic risk of the bagged ridge and ridgeless predictors with an arbitrary number of bags under a well-specified linear model with arbitrary feature covariance matrices and signal vectors. Furthermore, we prescribe a generic cross-validation procedure to select the optimal subsample size for bagging and discuss its utility to mitigate the non-monotonic behavior of the limiting risk in the sample size (i.e., double or multiple descents). In demonstrating the proposed procedure for bagged ridge and ridgeless predictors, we thoroughly investigate oracle properties of the optimal subsample size, and provide an in-depth comparison between different bagging variants.
    Sparse high-dimensional linear regression with a partitioned empirical Bayes ECM algorithm. (arXiv:2209.08139v3 [stat.ME] UPDATED)
    Bayesian variable selection methods are powerful techniques for fitting and inferring on sparse high-dimensional linear regression models. However, many are computationally intensive or require restrictive prior distributions on model parameters. Likelihood-based penalization methods are more computationally friendly, but resource-intensive refitting techniques are needed for inference. In this paper, we propose an efficient and powerful Bayesian approach for sparse high-dimensional linear regression. Minimal prior assumptions on the parameters are required through the use of plug-in empirical Bayes estimates of hyperparameters. Efficient maximum a posteriori probability (MAP) estimation is completed through the use of a partitioned and extended expectation conditional maximization (ECM) algorithm. The result is a PaRtitiOned empirical Bayes Ecm (PROBE) algorithm applied to sparse high-dimensional linear regression. We propose methods to estimate credible and prediction intervals for predictions of future values. We compare the empirical properties of predictions and our predictive inference to comparable approaches with numerous simulation studies and an analysis of a cancer cell line drug response study. The proposed approach is implemented in the R package probe.
    Optimization on Manifolds via Graph Gaussian Processes. (arXiv:2210.10962v1 [stat.ML])
    This paper integrates manifold learning techniques within a \emph{Gaussian process upper confidence bound} algorithm to optimize an objective function on a manifold. Our approach is motivated by applications where a full representation of the manifold is not available and querying the objective is expensive. We rely on a point cloud of manifold samples to define a graph Gaussian process surrogate model for the objective. Query points are sequentially chosen using the posterior distribution of the surrogate model given all previous queries. We establish regret bounds in terms of the number of queries and the size of the point cloud. Several numerical examples complement the theory and illustrate the performance of our method.
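    As a rough illustration of the acquisition step, here is a generic GP-UCB rule over a point cloud, not the paper's graph-specific construction; the kernel matrix K (which would be derived from the graph in the paper's setting), the noise level, and the function names are assumptions.

        import numpy as np

        def gp_posterior(K, queried, y, noise=1e-2):
            # Posterior mean/std at every point of the cloud, given the
            # indices already queried and their observed objective values y.
            Kqq = K[np.ix_(queried, queried)] + noise * np.eye(len(queried))
            Ksq = K[:, queried]
            mu = Ksq @ np.linalg.solve(Kqq, y)
            var = np.diag(K) - np.einsum(
                "ij,ji->i", Ksq, np.linalg.solve(Kqq, Ksq.T))
            return mu, np.sqrt(np.clip(var, 0.0, None))

        def next_query(mu, sigma, beta=2.0):
            # Upper confidence bound: trade off high posterior mean against
            # high posterior uncertainty when choosing the next query point.
            return int(np.argmax(mu + np.sqrt(beta) * sigma))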
    A lower confidence sequence for the changing mean of non-negative right heavy-tailed observations with bounded mean. (arXiv:2210.11133v1 [stat.ML])
    A confidence sequence (CS) is an anytime-valid sequential inference primitive which produces an adapted sequence of sets for a predictable parameter sequence with a time-uniform coverage guarantee. This work constructs a non-parametric non-asymptotic lower CS for the running average conditional expectation whose slack converges to zero given non-negative right heavy-tailed observations with bounded mean. Specifically, when the variance is finite the approach dominates the empirical Bernstein supermartingale of Howard et al.; with infinite variance, it can adapt to a known or unknown $(1 + \delta)$-th moment bound; and it can be efficiently approximated using a sublinear number of sufficient statistics. In certain cases this lower CS can be converted into a closed-interval CS whose width converges to zero, e.g., for any bounded realization, or for post contextual-bandit inference with bounded rewards and unbounded importance weights. A reference implementation and example simulations demonstrate the technique.
    How Infinitely Wide Neural Networks Can Benefit from Multi-task Learning -- an Exact Macroscopic Characterization. (arXiv:2112.15577v4 [cs.LG] UPDATED)
    In practice, multi-task learning (through learning features shared among tasks) is an essential property of deep neural networks (NNs). While infinite-width limits of NNs can provide good intuition for their generalization behavior, the well-known infinite-width limits of NNs in the literature (e.g., neural tangent kernels) assume specific settings in which wide ReLU-NNs behave like shallow Gaussian Processes with a fixed kernel. Consequently, in such settings, these NNs lose their ability to benefit from multi-task learning in the infinite-width limit. In contrast, we prove that optimizing wide ReLU neural networks with at least one hidden layer using L2-regularization on the parameters promotes multi-task learning due to representation-learning - also in the limiting regime where the network width tends to infinity. We present an exact quantitative characterization of this infinite width limit in an appropriate function space that neatly describes multi-task learning.
    Learning to Reason with Neural Networks: Generalization, Unseen Data and Boolean Measures. (arXiv:2205.13647v2 [cs.LG] UPDATED)
    This paper considers the Pointer Value Retrieval (PVR) benchmark introduced in [ZRKB21], where a 'reasoning' function acts on a string of digits to produce the label. More generally, the paper considers the learning of logical functions with gradient descent (GD) on neural networks. It is first shown that in order to learn logical functions with gradient descent on symmetric neural networks, the generalization error can be lower-bounded in terms of the noise-stability of the target function, supporting a conjecture made in [ZRKB21]. It is then shown that in the distribution shift setting, when the data withholding corresponds to freezing a single feature (referred to as canonical holdout), the generalization error of gradient descent admits a tight characterization in terms of the Boolean influence for several relevant architectures. This is shown on linear models and supported experimentally on other models such as MLPs and Transformers. In particular, this puts forward the hypothesis that for such architectures and for learning logical functions such as PVR functions, GD tends to have an implicit bias towards low-degree representations, which in turn gives the Boolean influence for the generalization error under quadratic loss.
    Private Algorithms with Private Predictions. (arXiv:2210.11222v1 [cs.CR])
    When applying differential privacy to sensitive data, a common way of getting improved performance is to use external information such as other sensitive data, public data, or human priors. We propose to use the algorithms with predictions framework -- previously applied largely to improve time complexity or competitive ratios -- as a powerful way of designing and analyzing privacy-preserving methods that can take advantage of such external information to improve utility. For four important tasks -- quantile release, its extension to multiple quantiles, covariance estimation, and data release -- we construct prediction-dependent differentially private methods whose utility scales with natural measures of prediction quality. The analyses enjoy several advantages, including minimal assumptions about the data, natural ways of adding robustness to noisy predictions, and novel "meta" algorithms that can learn predictions from other (potentially sensitive) data. Overall, our results demonstrate how to enable differentially private algorithms to make use of and learn noisy predictions, which holds great promise for improving utility while preserving privacy across a variety of tasks.
    Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty. (arXiv:2210.11289v1 [cs.LG])
    We introduce a modified version of the excess risk, which can be used to obtain tighter, fast-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for $[-1, 1]$-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around $0$. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.

  • Open

    Cursed -Darkness Falls Visuals by AI Manifest
    Halloween special 👻 submitted by /u/Available_Tadpole829 [link] [comments]  ( 114 min )
    Tech Talk: Choosing the Right Infrastructure for Production ML
    Join the Modzy team next Thursday October 27 at 12:30PM EDT for a tech talk on Choosing the Right Infrastructure for Production ML! Finding the right combination of infrastructure to support production AI at scale can be time-consuming and costly. In addition to identifying what kind of hardware can best support your production needs, picking the right deployment paradigm can save you thousands in cloud compute costs. This tech talk will walk you through how to identify the right combination of infrastructure to support your team’s needs for running inferences in production, at scale. If you can't join live, the recording will be posted in the archives channel. Join the Discord Server. submitted by /u/modzykirsten [link] [comments]  ( 115 min )
    Conversation with a "LaMDA" on character.ai
    submitted by /u/KarneyHatch [link] [comments]  ( 122 min )
    Stable Diffusion Robots Model! Create Cool Cyborgs For Free!
    submitted by /u/PuppetHere [link] [comments]  ( 114 min )
    Check out AI Concepts, a shop on VSUAL
    submitted by /u/Enuminous [link] [comments]  ( 114 min )
    Tool recommendations query: I’m trying to do my channel’s art/sound with as much artificial intelligence as possible.
    submitted by /u/brackmayhall [link] [comments]  ( 115 min )
    AI Dream 102 - Dark Vivid 3D PSyCHO TRIP Teaser
    submitted by /u/LordPewPew777 [link] [comments]  ( 115 min )
    Why is AI fairness important in telecom personalization?
    submitted by /u/United_Engineer_198 [link] [comments]  ( 115 min )
    Inferkit playing skyrim
    I played Skyrim a lot, but I can't remember all that stuff. The truth is, in Skyrim I learned how to cheat, how to die as a woman. I learned how to pick locks. I learned how to escape a prison, I learned how to walk in my underwear. I learned how to spell "Daedric". I learned a lot. But I don't think I learned how to fix myself. When I close my eyes, I see dragons. I hear battle cries. I smell smoke. I hear archers, I smell crossbows. I hear a choir of Amjadics. I hear a roar and a scream, the sound of arms on armor. It's cool in here. I need to get out. Notes: I just said everything that should be said here. Maybe the prologue should be a side quest. submitted by /u/Amick010502 [link] [comments]  ( 115 min )
    Google generates 3D views from a 2D image
    submitted by /u/Peaking_AI [link] [comments]  ( 115 min )
    Are there any AI art websites or software that are completely free and do not have censorship?
    I tried using Nightcafe and I was fine with the whole daily credits stuff, but it doesn't seem to do any sexually-related pictures. I tried making a good picture out of this thing (picture below) and it gave me a bunch of random stuff (even though I set the noise to 20 percent). And it's not the first time that happens, I've literally NEVER seen that thing make ANYTHING hot. Almost like it can't. https://preview.redd.it/vaffory14zu91.jpg?width=680&format=pjpg&auto=webp&s=732551e1595d5717590699cfd5faff80caa35f10 Also, some combinations of words are prohibited. Like, I wanted to make a male portrait younger, so I wrote "young boy", but it was blocked. Guess it was afraid I'm a pedophile. But even if I was, it's none of its f*cking business. So basically I'm looking for an AI art generator that is: free, with no trial or anything, and doesn't censor hot stuff or "potential pedophiles". Are there any? Thanks in advance! submitted by /u/windowtwink [link] [comments]  ( 120 min )
    controversial ai research?
    Hi! I'm a philosophy student and I need to write a blog post about whether or not there should be restrictions on scientific research. I decided to go with AI research, since it's a very relevant subject. I just need a somewhat controversial article about research that's been done or is being done on artificial intelligence. I've been looking online but I can't seem to find anything fitting. Any help is welcome! (English is not my first language, so if anything is unclear, let me know) submitted by /u/miemkwienn [link] [comments]  ( 117 min )
    I know you guys don't really enjoy cleverbot codes so much. However, my kickstarter did just launch. I encourage you to read it all, including the FAQ, look at everything I offer, and share/back my game if you agree that this will work.
    submitted by /u/GlendInc [link] [comments]  ( 123 min )
    AI that generates executive summaries for long articles/copies
    Hello! Wondering if anyone has recommendations for AI that summarizes copy? Currently researching good AI tools that could summarize long texts/articles into concise, yet informative summaries (more like executive summaries). Have played around with Jasper's text summarizer, however it only condenses the article down to 1-2 sentences. This is not good enough if people want to get information in a shorter time. Thank you in advance for the help! submitted by /u/sissiwang42 [link] [comments]  ( 127 min )
    Prof. Jussi Tohka Convolutional neural networks for segmentation of rodent brain MRI. Neuroimaging in pre-clinical
    submitted by /u/rottoneuro [link] [comments]  ( 114 min )
    Once Upon A.I. - a movie made by A.I. based on the words and ideas of Harari
    submitted by /u/vonderdeckentok [link] [comments]  ( 118 min )
    Writing Poetry with GPT-3
    Hey! Recently a friend and I started writing poetry on Medium using GPT-3 (we're also generating images for the poems using DALL-E 2). Our goal is to show people poetry through the lens of GPT-3. We want to demonstrate the creative power of AI in an unconventional field and showcase the unique insights GPT-3 has on philosophical topics. Please check us out and tell us what you think if this sounds interesting to you, we would really appreciate it! We're called Poetry by an AI (on Medium). submitted by /u/No_Transition_3704 [link] [comments]  ( 115 min )
    Will AI tools replace human copywriters?
    Hi all! My team and I have checked the 5 most popular AI writing apps to see whether they can do their job as professionally as human copywriters. We’re so proud to share the results with you 😊 Copywriters are an essential part of any marketing team. However good your product is, no one will know about it without promotion. And promotion means copy, whether for Facebook ads or blog posts. So the need for copywriters lingers. Hell, if it didn’t, I would be out of a job 😅 However, hiring a professional copywriter costs at least $50k-$60k annually. This made us think: is there a cheaper way to produce copy? Enter artificial intelligence (AI) copywriting software. The market is replete with options that can “create human-like copy instantly” — or so the app developers assure us. Is that really so? We decided to find out. submitted by /u/selzy_marketing [link] [comments]  ( 119 min )
    Don’t know the name of the program
    A few weeks ago I was shown a demo of an AI art generator program that plugs into Adobe Photoshop. You can draw in a super rough base, and the program then extrapolates from that and a text description. But for the life of me I can’t remember what the program is called. All I can seem to remember is that the name has something to do with an alpaca. (I know it sounds weird but I’m serious.) Does anyone know the name of this program? I really want to get it but can’t find it. submitted by /u/Zummerz [link] [comments]  ( 115 min )
    [Infographic]
    Check out this data-backed blog on trusting conversational commerce to achieve Holiday Sales goals: https://www.haptik.ai/blog/trust-conversational-commerce-achieve-holiday-sales-goals submitted by /u/Haptik-AI [link] [comments]  ( 115 min )
    Art generated by dalle based on my townscaper build
    submitted by /u/latunda-fortnite [link] [comments]  ( 114 min )
    Four Horsemen of AI Project Failure and How to Deal with Them
    submitted by /u/Witer1945 [link] [comments]  ( 124 min )
    New product: AI powered form builder (fillout.com)
    Fillout is the first AI-powered form builder. Just describe your form and our AI will instantly generate a form for you to customize. We're live on Product Hunt and would appreciate your support and feedback! https://www.producthunt.com/posts/fillout-com Check out the demo video on that post to see how the AI works and let me know if you have any questions - I'll be around :) submitted by /u/dominicwhyte42 [link] [comments]  ( 115 min )
  • Open

    [D] AAAI 2023 Reviews
    Looks like AAAI has sent out an email saying reviews are up on CMT. Anybody else only have two reviews? I was expecting to have four, but they said in the email that 69% of papers have 4 or more and 96% have three or more. Guess I'm in the 4% with fewer than three! submitted by /u/CauseRevolutionary59 [link] [comments]  ( 125 min )
    [D] Help finding image recognition network for multiple classes in same image
    Question: image classification network that can detect multiple classes in the same image. I am trying to find a network that can detect multiple different classes in the same image, or even multiple objects of the same class in a single image. For instance, instead of detecting whether either a "hat" or an "umbrella" is in an image, one that can detect if both are in the image, or if there are 2 "hats". I'm trying to find one that doesn't use bounding box regression, because where the objects are isn't really important for my project. Sorry if this is a dumb question, I am very new to this. submitted by /u/Political_Target [link] [comments]  ( 130 min )
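    One standard answer to a question like the above is multi-label classification: give the network one independent sigmoid output per class, so "hat" and "umbrella" can both fire on the same image. Counting two hats would instead need a count-regression head or a detector. A minimal, hedged sketch (the backbone choice and sizes are illustrative assumptions):

        import torch
        import torch.nn as nn
        from torchvision import models

        num_classes = 20
        model = models.resnet18(weights=None)  # any backbone works here
        model.fc = nn.Linear(model.fc.in_features, num_classes)

        # Multi-hot targets, e.g. [1, 0, 1, ...] when classes 0 and 2 are
        # both present; BCEWithLogitsLoss treats each class independently.
        criterion = nn.BCEWithLogitsLoss()

        images = torch.randn(4, 3, 224, 224)
        targets = torch.randint(0, 2, (4, num_classes)).float()
        loss = criterion(model(images), targets)
        loss.backward()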
    [D] Query : Working on Bladder cancer type detection model using RCNN
    Hi, I’m currently working on detecting cancer type, along with testing for cancer, using CT images. The problem I’m facing is a huge dataset (nearly 1.5 lakh, i.e., 150,000, image sets) that I need to train my model on. Would you advise me to manually classify a few samples myself and supervise the model, or is there any unsupervised learning model I can refer to, to train on such a huge dataset? submitted by /u/Zeitreisender2050 [link] [comments]  ( 126 min )
    [D] Is it worth paying a data sourcing company to crowdsource a bespoke dataset?
    My ML team is looking to buy/source a dataset of videos of people performing certain niche tasks to train a business-critical model. From our research, it seems like Scale AI, Toloka, Appen, Defined AI, and Clickworker offer solutions in that space. Has anyone used any of these before and would recommend (or recommend avoiding) them? Are we better off just running the crowdsourcing of the data in-house? submitted by /u/quantifiedvagabond [link] [comments]  ( 128 min )
    [P] Virtual space that transforms based on EEG data (project)
    Hi everyone, I am an architecture student and I’m doing my master project on computational architecture. The abstract of what I’m trying to achieve is a live transformable space in a 3D program (Unity 3D) that changes based on data from an EEG device (change based on your feeling or what you think, etc). Though my knowledge of computation is limited, I’m looking for already-made algorithms (maybe ready neural networks to train) to put together for the project, rather than reinventing the wheel. I’m pretty early on in my research and I wanted to ask what fields I should look into and, in general, anything that you think might be helpful. This video sparked the whole idea. I was interested in moving his project into 3D space. I’m not sure how AI applies here; to me it seems like just a visualisation of the data from the EEG device. How do you think it is made? https://www.youtube.com/watch?v=Y865Nxa5sW4 In this simpler version it is shown that the data from the device feeds into a flocking algorithm (which I’ve seen easily created in Unity before). https://www.youtube.com/watch?v=wkL_LihsRVw&list=LL&index=3 Thanks in advance everyone! submitted by /u/Chris1Snea [link] [comments]  ( 128 min )
    [D] DDPM vs Score Matching
    I’m reading up on diffusion models, and these seem to be the two dominant approaches. They are also equivalent under some parameterizations of the formulations. However, the more recent papers, for example stable diffusion, seem to use the DDPM-type formulation more often, and by this I mean they learn the noise rather than the score. Is this observation true? And if it is, what are some reasons? I’ve never implemented a model like this myself, so I don’t know how difficult or practical they are. Perhaps all the issues listed in the score matching papers (manifold hypothesis, low data density regions, inaccurate score estimation) make it really difficult to work with, or is there something more fundamental? Thanks in advance! submitted by /u/WallabyDue2778 [link] [comments]  ( 131 min )
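    For concreteness, the DDPM-style "learn the noise" objective the post refers to fits in a few lines; this is a generic sketch (epsilon-parameterization), where the model is assumed to take (x_t, t), and alphas_bar is the cumulative-product noise schedule on the same device as the data. Recall that the predicted noise is, up to a scale factor, an estimate of the (negative) score, which is why the two views can be made equivalent.

        import torch
        import torch.nn.functional as F

        def ddpm_loss(model, x0, alphas_bar):
            # Sample a random timestep per example, diffuse the data, and
            # train the network to predict the injected Gaussian noise.
            b = x0.shape[0]
            t = torch.randint(0, len(alphas_bar), (b,), device=x0.device)
            a_bar = alphas_bar[t].view(b, *([1] * (x0.dim() - 1)))
            eps = torch.randn_like(x0)
            x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * eps
            return F.mse_loss(model(x_t, t), eps)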
    [R] Deep models that take distributions as inputs
    I need to apply deep neural networks to a scenario where the network input is a noisy discrete value, and I have an estimate of its distribution. For example, the network takes the number of people currently in a building as input. I have a probability distribution over the possible values instead of having the exact number of people. I'm interested in any work that enables neural networks to read a distribution as input (might be continuous or discrete, uni-modal or multi-modal) and process it efficiently submitted by /u/fedetask [link] [comments]  ( 130 min )
    [P] Introducing a new Python package for easily using mini-imagenet and tiny-imagenet: MLclf.
    Just introducing a new Python package for easily using mini-imagenet and tiny-imagenet: MLclf. Training the full ImageNet dataset (1k classes) needs substantial computational resources, so it is usually hard to quickly check your model on a local or personal computer. The mini-imagenet (100 classes) and tiny-imagenet (200 classes) datasets are far friendlier on a local or personal computer, but their formats are not friendly for the classical or traditional classification task; e.g., the original raw mini-imagenet data is divided into training/validation/testing sets for the few-shot or meta-learning task. You can easily use the Python package MLclf to download and transform the mini-imagenet and tiny-imagenet data for the traditional image classification task or the meta-learning task. Just use:

        pip install MLclf

        from MLclf import MLclf
        MLclf.miniimagenet_download(Download=True)
        MLclf.tinyimagenet_download(Download=True)

    You can also see for more details: https://github.com/tiger2017/MLclf submitted by /u/supercxman [link] [comments]  ( 127 min )
    [P] libtensorflow_cc: Pre-built TensorFlow C++ API
    I'm happy to share our latest repository for making the TensorFlow C++ API more accessible! We now provide a pre-built library and a Docker image for easy installation and usage of the TensorFlow C++ API at https://github.com/ika-rwth-aachen/libtensorflow_cc. In order to, e.g., run TensorFlow models from C++ source code, one usually needs to build the C++ API in the form of the libtensorflow_cc.so library from source. There is no official release of the library, and the build from source is only sparsely documented. We try to remedy this situation by providing two main components: (1) the pre-built libtensorflow_cc.so, including accompanying headers, as a one-command-install deb package; (2) a pre-built Docker image based on the official TensorFlow Docker image, with both TensorFlow Python and TensorFlow C++ installed. Try it out yourself by running the example application:

        git clone https://github.com/ika-rwth-aachen/libtensorflow_cc.git && \
        cd libtensorflow_cc && \
        docker run --rm \
          --volume $(pwd)/example:/example \
          --workdir /example \
          rwthika/tensorflow-cc:latest \
          ./build-and-run.sh

    While we currently only support x86_64 machines running Ubuntu, this could easily be extended to other OSes and platforms in the future. With a few exceptions, all TensorFlow versions from 2.0.0 through 2.9.2 are available, with 2.10.0 coming soon. If you want to use the TensorFlow C++ API to load, inspect, and run saved models and frozen graphs in C++, we suggest that you also check out our helper library tensorflow_cpp. Looking forward to hearing some feedback from you, thanks! Lennart submitted by /u/lennart-reiher-ika [link] [comments]  ( 129 min )
    [D] Discussion Panel for FOSS Instruct
    Hey all! My name is Louis Castricato. I lead CarperAI, a large FOSS group that recently released a library for doing distributed RLHF. We just announced a project today during Scale's TransformX conference to reimplement Instruct GPT, make all the datasets available as MIT, and release our checkpoints/models. I'm super interested in the democratization of large scale RLHF, as I feel it's a relatively unexplored space in the open source community. To that end, we'd love to get the subreddit and community more involved in our task selection process for our instruct model. We'll be hosting a panel on this in a few weeks, so I'm curious r/machinelearning, what kinds of tasks would you love to see an instruct model tuned on if you had infinite resources? Here is our instruct announcement: https://carper.ai/instruct-gpt-announcement/ And a link to our discussion panel on the CarperAI discord: https://discord.gg/cCR3xEAt?event=1029746950305751141 Excited to hear your thoughts! submitted by /u/FerretDude [link] [comments]  ( 126 min )
    [R] Learnable strides in convolutional neural networks
    Notebook/tutorial implementing DiffStride, the ICLR outstanding-paper recipient: https://www.kaggle.com/code/infocusp/diffstride-explainable-notebook submitted by /u/optimistdit [link] [comments]  ( 126 min )
    Inter-camera conversion [D]
    I'll keep it simple: suppose you have images taken from camera A, and other images taken from different cameras (B, C, D, and so on). Can you train a model such that when you input an image from cameras B, C, D..., it "changes" it into an image that seems to have been taken by camera A? Any kind of approaches are fine; this question is just out of curiosity. Let's discuss in the comments! submitted by /u/baby_yoda_fangay [link] [comments]  ( 127 min )
    [R] Hardware to train language representation model on half a billion text documents
    Hi everyone, I'd like to train a language representation model (say BERT or derivatives) on half a billion text documents. Each document is not particularly long (one to two pages), but the data cannot be moved to the cloud. I've never developed any models at this scale and was wondering if you could recommend an appropriate hardware setup for this project - perhaps going from "absolute dream configuration" to "expensive but more realistic". Much appreciated. submitted by /u/medcode [link] [comments]  ( 128 min )
    [D] How to measure inter-rater reliability for detection tasks?
    I am working on a biomedical imaging project where a group of doctors annotated short time lapses in which a given event of interest occurred in a video. I would like to measure the inter-rater reliability, and the first thing that comes to mind is to use Cohen's kappa, converting the time-lapse annotations into frame-wise annotations. However, given that this is a detection task, the problem is highly unbalanced and therefore the Cohen's kappa score will be biased. What options do we have? submitted by /u/ManuelRios18 [link] [comments]  ( 127 min )
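    For the frame-wise route described above, scikit-learn already provides Cohen's kappa; the snippet below is purely illustrative (the rater labels are made up). For a rare-event task it is often complemented by event-level agreement within a tolerance window, or by a chance-corrected alternative such as Krippendorff's alpha.

        from sklearn.metrics import cohen_kappa_score

        # Frame-wise labels from two raters (1 = event frame, 0 = background).
        rater_a = [0, 0, 1, 1, 0, 0, 0, 1, 0, 0]
        rater_b = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

        # Kappa corrects for chance agreement, but with heavy class imbalance
        # it can behave unintuitively (the "kappa paradox"), matching the
        # concern raised in the post above.
        print(cohen_kappa_score(rater_a, rater_b))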
    [D] Question about 5-minute poster videos for NeurIPS
    Hi, First time to NeurIPS with a poster, and I noticed we need to submit 5-minute videos presenting our research. I was just wondering if anyone has done this before. I assume this is a 5-minute presentation with a few slides, but I've seen a few people show their poster instead and discuss it. I wanted to check if anyone had links to good examples of these 5-minute poster presentation videos. Or any tips/advice. Thanks! submitted by /u/VirtualHat [link] [comments]  ( 125 min )
    [Research] Scholars Program
    Hi everyone. We recently launched our call for the Scholars Program, an 8-month full-time entry point into machine learning research, pursuing curiosity-driven research with access to large-scale engineering resources and mentorship. We have intentionally structured the program to be paid and remote-first so we can support talent all across the world. Our deadline is coming up on November 7th, and as of now, our pool of candidates is still indexing heavily on the US and Canada. In addition to the impressive candidates in these countries, I would welcome any help sharing this opportunity with ML communities around the world. More details below for anyone interested: The Cohere For AI Scholars Program supports the next generation of rising stars as they embark on their research journey by providing an alternative point of entry into NLP research. Scholars will have access to a large-scale experimental framework and work alongside some of the best researchers and engineering expertise in the world. Participation is full-time, remote-first, and paid. For more details, check out our blog post announcing the Scholars Program launch. Applications are open until November 7, 2022, and can be found here. submitted by /u/ml_magic_ [link] [comments]  ( 138 min )
  • Open

    PI-ARS: Accelerating Evolution-Learned Visual-Locomotion with Predictive Information Representations
    Posted by Wenhao Yu, Research Scientist, Robotics at Google, and Kuang-Huei Lee, Research Engineer, Google Research, Brain team Evolution strategy (ES) is a family of optimization techniques inspired by the ideas of natural selection: a population of candidate solutions are usually evolved over generations to better adapt to an optimization objective. ES has been applied to a variety of challenging decision making problems, such as legged locomotion, quadcopter control, and even power system control. Compared to gradient-based reinforcement learning (RL) methods like proximal policy optimization (PPO) and soft actor-critic (SAC), ES has several advantages. First, ES directly explores in the space of controller parameters, while gradient-based methods often explore within a limited actio…  ( 25 min )
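    The basic ES/ARS update that PI-ARS builds on fits in a few lines; the following is a hedged, generic sketch (antithetic sampling, reward-weighted perturbations), not Google's PI-ARS implementation, and all names and hyperparameters are assumptions.

        import numpy as np

        def ars_step(theta, reward_fn, sigma=0.02, lr=0.01, pop=32, rng=None):
            # Evaluate antithetic perturbation pairs and move the parameters
            # along the reward-weighted average of the perturbation directions.
            rng = rng or np.random.default_rng()
            deltas = rng.standard_normal((pop, theta.size))
            gains = np.array([
                reward_fn(theta + sigma * d) - reward_fn(theta - sigma * d)
                for d in deltas
            ])
            grad = (gains[:, None] * deltas).mean(axis=0) / (2.0 * sigma)
            return theta + lr * grad

    PI-ARS keeps this black-box update but feeds the policy a compact representation learned with a predictive-information objective, which is what makes it tractable for visual locomotion.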
    MUSIQ: Assessing Image Aesthetic and Technical Quality with Multi-scale Transformers
    Posted by Junjie Ke, Senior Software Engineer, and Feng Yang, Senior Staff Software Engineer, Google Research Understanding the aesthetic and technical quality of images is important for providing a better user visual experience. Image quality assessment (IQA) uses models to build a bridge between an image and a user's subjective perception of its quality. In the deep learning era, many IQA approaches, such as NIMA, have achieved success by leveraging the power of convolutional neural networks (CNNs). However, CNN-based IQA models are often constrained by the fixed-size input requirement in batch training, i.e., the input images need to be resized or cropped to a fixed size shape. This preprocessing is problematic for IQA because images can have very different aspect ratios and resolutions…  ( 26 min )
  • Open

    Create synthetic data for computer vision pipelines on AWS
    Collecting and annotating image data is one of the most resource-intensive tasks on any computer vision project. It can take months at a time to fully collect, analyze, and experiment with image streams at the level you need in order to compete in the current marketplace. Even after you’ve successfully collected data, you still have […]  ( 22 min )
    Enable CI/CD of multi-Region Amazon SageMaker endpoints
    Amazon SageMaker and SageMaker inference endpoints provide a capability of training and deploying your AI and machine learning (ML) workloads. With inference endpoints, you can deploy your models for real-time or batch inference. The endpoints support various types of ML models hosted using AWS Deep Learning Containers or your own containers with custom AI/ML algorithms. […]  ( 9 min )
  • Open

    Exploration via Elliptical Episodic Bonuses
    Paper: https://arxiv.org/abs/2210.05805 Code: https://github.com/facebookresearch/e3b Website: https://e3bagent.github.io/ Abstract: In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments. submitted by /u/viv1a [link] [comments]  ( 117 min )
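    The elliptical episodic bonus itself is compact: given an embedding phi(s), the bonus is b(s) = phi(s)^T C^{-1} phi(s), with C a regularized covariance of the embeddings seen so far in the current episode. Below is a small sketch using rank-1 inverse updates via Sherman-Morrison; details (and names) differ from the released code.

        import numpy as np

        class EllipticalBonus:
            def __init__(self, dim, lam=0.1):
                self.dim, self.lam = dim, lam
                self.reset()

            def reset(self):
                # Reset at every episode boundary: C starts as lam * I.
                self.C_inv = np.eye(self.dim) / self.lam

            def bonus(self, phi):
                # b = phi^T C^{-1} phi: large for embeddings unlike anything
                # seen so far this episode, small for revisited regions.
                b = float(phi @ self.C_inv @ phi)
                # Sherman-Morrison rank-1 update of C^{-1} after adding
                # phi phi^T to the covariance.
                v = self.C_inv @ phi
                self.C_inv -= np.outer(v, v) / (1.0 + b)
                return b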
    "Computational noise in reward-guided learning drives behavioral variability in volatile environments", Findling et al 2018
    submitted by /u/gwern [link] [comments]  ( 115 min )
    Breakout using Policy Gradient
    I want to solve Breakout using REINFORCE as my policy gradient algorithm and just need a little clarity. State and feature vector representation: for the value network, I'm planning to go with the same DeepMind representation (84x84x4) as the network input and 1 output node for the state value; for the policy network, I was planning on (84x84x4) as the input and a softmax output of size (n_actions). I'm not able to understand how to call the update step using any framework (tf/torch): what should my loss function be, and with what terms should it be called? How do I accommodate the delta term in my networks? How do I add the "ln" term in the policy network? Do I do something like tf.log(action_preference) on my action outputs and then compute gradients? https://preview.redd.it/bcnykxx27zu91.png?width=1207&format=png&auto=webp&s=ed1199ddd92bf0f36c8b1e83d021185be1645b99 submitted by /u/anuraagshankar [link] [comments]  ( 118 min )
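    On the loss-function question above: in PyTorch the "ln" term is just a gathered log-softmax, multiplied by the return (minus the value baseline, which plays the role of the delta term) and negated so that minimizing the loss performs gradient ascent on expected return. A minimal, illustrative sketch, with all names assumed:

        import torch

        def reinforce_loss(policy_logits, actions, returns, values=None):
            # log pi(a_t | s_t): gather the log-probability of each action taken.
            log_probs = torch.log_softmax(policy_logits, dim=-1)
            chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
            # Baseline-subtracted return; detach so the policy loss does not
            # backpropagate into the value network.
            adv = returns - (values.detach() if values is not None else 0.0)
            # Negate: minimizing this loss ascends the policy-gradient objective.
            return -(adv * chosen).mean()

    The value network would be trained separately, e.g. with a mean-squared error between its predictions and the observed returns.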
    State-of-the-art RL Libraries and Tutorials
    I am currently working on a custom reinforcement learning environment using the Gym library. When trying to train an agent using deep reinforcement learning with keras-rl2, I was getting strange errors. After googling for a few days to find the cause, I found a statement that keras-rl2 is deprecated and should no longer be used. After further research, I found a post saying that for the neural network part of reinforcement learning, PyTorch is mainly used nowadays rather than Keras. Now I am very confused about what today's state-of-the-art tools for reinforcement learning are. Thanks a lot for your advice! Moreover, I would be very grateful for some links to tutorials or YouTube videos on their usage. submitted by /u/un_defined2020 [link] [comments]  ( 112 min )
    Is there a reason why reinforcement learning models use rewards instead of punishments?
    I've only taken an intro to psychology course and know little about AI/ML. What I know is that rewards and punishments are used in reinforcing behavior, which I assume is replicated in ML reinforcement learning models. submitted by /u/Ok-Joke-4110 [link] [comments]  ( 118 min )
  • Open

    AI artist showcase — RetroManni.tez & Vincentfunart.tez
    This week vincentfunart and retromanni discuss how they collaborate to create some amazing art  ( 6 min )
    Can Artificial Intelligence Be at Par With or Even Surpass Human Intelligence?
    Introduction Continue reading on Becoming Human: Artificial Intelligence Magazine »  ( 10 min )
    Deepfake Audio with Wav2Lip
    Step-by-step walkthrough on lip-syncing with Wav2Lip  ( 10 min )
  • Open

    Deep learning with light
    A new method uses optics to accelerate machine-learning computations on smart speakers and other low-power connected devices.  ( 8 min )
  • Open

    How eccentricity matters
    I wrote last week that the eccentricities of planet orbits in our solar system do not affect the shape of the orbit very much. Here’s a plot of all the orbits, shifted to have the same center and scaled to have the same minor axis. However, the planet orbits do not have a common center. […] How eccentricity matters first appeared on John D. Cook.  ( 5 min )
    Directrix of a conic
    The most common way to define an ellipse geometrically is as the set of points whose distances to two foci sum to a constant. There is another way, however, to define an ellipse that generalizes to include the two other conic sections, parabolas and hyperbolas. You can define a conic section as the set of […] Directrix of a conic first appeared on John D. Cook.  ( 5 min )
    Latus rectum of an ellipse
    Ellipses have been studied for over two thousand years, and so some of the terminology is ancient and sounds odd to modern ears. One such term is latus rectum. What is the latus rectum, and does it have anything to do with anatomy? This post defines the latus rectum for an ellipse. See the next […] Latus rectum of an ellipse first appeared on John D. Cook.  ( 5 min )
    Inequalities for inequality: Gini coefficient lower bounds
    The Gini coefficient, a.k.a. Gini index, of a set of numbers is the average of all differences divided by twice the mean. Specifically, let $x = (x_1, x_2, \ldots, x_n)$. Then the Gini coefficient of $x$ is defined to be $G = \frac{\sum_{i=1}^n \sum_{j=1}^n |x_i - x_j|}{2 n^2 \mu}$, where $\mu$ is the mean of the set. The Gini coefficient is often used in economics to measure inequalities in wealth. […] Inequalities for inequality: Gini coefficient lower bounds first appeared on John D. Cook.  ( 5 min )
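    Under that definition, a direct computation is a few lines (an illustrative snippet, not from the original post):

        import numpy as np

        def gini(x):
            # Mean absolute difference over all ordered pairs, divided by 2 * mean.
            x = np.asarray(x, dtype=float)
            mad = np.abs(x[:, None] - x[None, :]).mean()
            return mad / (2.0 * x.mean())

        print(gini([1, 1, 1, 1]))   # 0.0  -> perfect equality
        print(gini([0, 0, 0, 10]))  # 0.75 -> highly concentrated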
  • Open

    Get in Touch With New Mobile Gaming Controls on GeForce NOW
    GeForce NOW expands touch control support to 13 more games this GFN Thursday. That means it’s easier than ever to take PC gaming on the go using mobile devices and tablets. The new “Mobile Touch Controls” row in the GeForce NOW app is the easiest way for members to find which games put the action Read article > The post Get in Touch With New Mobile Gaming Controls on GeForce NOW appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Interactive Machine Learning Live Course with Dr. Kirk Borne
    Sponsored Post     Apply now to join Dr. Kirk Borne’s live interactive course, starting on November 28.  Explore Machine Learning Live with hands-on labs and real world applications with Dr. Kirk Borne, ex-NASA Scientist and former Principal Data Scientist at Booz Allen Hamilton. He was also a professor of Astrophysics and Computational Science at […] The post Interactive Machine Learning Live Course with Dr. Kirk Borne appeared first on Machine Learning Mastery.  ( 10 min )
  • Open

    Learning Globally Smooth Functions on Manifolds. (arXiv:2210.00301v2 [cs.LG] UPDATED)
    Smoothness and low dimensional structures play central roles in improving generalization and stability in learning and statistics. The combination of these properties has led to many advances in semi-supervised learning, generative modeling, and control of dynamical systems. However, learning smooth functions is generally challenging, except in simple cases such as learning linear or kernel models. Typical methods are either too conservative, relying on crude upper bounds such as spectral normalization, too lax, penalizing smoothness on average, or too computationally intensive, requiring the solution of large-scale semi-definite programs. These issues are only exacerbated when trying to simultaneously exploit low dimensionality using, e.g., manifolds. This work proposes to overcome these obstacles by combining techniques from semi-infinite constrained learning and manifold regularization. To do so, it shows that, under typical conditions, the problem of learning a Lipschitz continuous function on a manifold is equivalent to a dynamically weighted manifold regularization problem. This observation leads to a practical algorithm based on a weighted Laplacian penalty whose weights are adapted using stochastic gradient techniques. We prove that, under mild conditions, this method estimates the Lipschitz constant of the solution, learning a globally smooth solution as a byproduct. Numerical examples illustrate the advantages of using this method to impose global smoothness on manifolds as opposed to imposing smoothness on average.  ( 3 min )
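    The weighted Laplacian penalty at the heart of the method has a simple form; the sketch below is a generic graph-smoothness regularizer under assumed tensor shapes, with the per-edge weights standing in for the quantities the paper adapts during training.

        import torch

        def weighted_laplacian_penalty(preds, edge_index, edge_weights):
            # Penalize squared differences of predictions across graph edges:
            #   sum_ij w_ij * (f(x_i) - f(x_j))^2
            # Larger weights on an edge demand more local smoothness there.
            src, dst = edge_index  # edge_index: (2, E) tensor of node indices
            return (edge_weights * (preds[src] - preds[dst]).pow(2)).sum()

    During training this term would be added to the task loss, with the edge weights updated by stochastic gradient steps as the abstract describes.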
    A Deep Top-Down Approach to Hierarchically Coherent Probabilistic Forecasting. (arXiv:2204.10414v2 [cs.LG] UPDATED)
    Probabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications -- the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present a probabilistic top-down approach to hierarchical forecasting that uses a novel attention-based RNN model to learn the distribution of the proportions according to which each parent prediction is split among its children nodes at any point in time. These probabilistic proportions are then coupled with an independent univariate probabilistic forecasting model for the root time series. The resulting forecasts are naturally coherent, and provide probabilistic predictions over all time series in the hierarchy. We experiment on several public datasets and demonstrate significant improvements up to 27% on most datasets compared to state-of-the-art probabilistic hierarchical models. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to traditional bottom-up modeling.
    Neural Basis Models for Interpretability. (arXiv:2205.14120v4 [cs.LG] UPDATED)
    Due to the widespread use of complex machine learning models in real-world applications, it is becoming critical to explain model predictions. However, these models are typically black-box deep neural networks, explained post-hoc via methods with known faithfulness limitations. Generalized Additive Models (GAMs) are an inherently interpretable class of models that address this limitation by learning a non-linear shape function for each feature separately, followed by a linear model on top. However, these models are typically difficult to train, require numerous parameters, and are difficult to scale. We propose an entirely new subfamily of GAMs that utilizes basis decomposition of shape functions. A small number of basis functions are shared among all features, and are learned jointly for a given task, thus making our model scale much better to large-scale data with high-dimensional features, especially when features are sparse. We propose an architecture denoted as the Neural Basis Model (NBM) which uses a single neural network to learn these bases. On a variety of tabular and image datasets, we demonstrate that for interpretable machine learning, NBMs are the state-of-the-art in accuracy, model size, and throughput, and can easily model all higher-order feature interactions. Source code is available at https://github.com/facebookresearch/nbm-spam.
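    The basis decomposition is easy to picture in code: one small network, shared by all features, maps each scalar feature value to B basis values, and each feature learns only its own mixing coefficients. A hedged sketch, where the class name, sizes, and layer choices are assumptions rather than the released architecture:

        import torch
        import torch.nn as nn

        class NeuralBasisModel(nn.Module):
            def __init__(self, n_features, n_bases=16, hidden=64):
                super().__init__()
                # Single shared network: scalar feature value -> n_bases outputs.
                self.bases = nn.Sequential(
                    nn.Linear(1, hidden), nn.ReLU(), nn.Linear(hidden, n_bases))
                # Per-feature mixing coefficients define each shape function.
                self.coefs = nn.Parameter(0.01 * torch.randn(n_features, n_bases))
                self.bias = nn.Parameter(torch.zeros(1))

            def forward(self, x):                      # x: (batch, n_features)
                h = self.bases(x.unsqueeze(-1))        # (batch, n_features, n_bases)
                shape_fns = (h * self.coefs).sum(-1)   # per-feature contributions
                return shape_fns.sum(-1) + self.bias   # additive prediction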
    On Efficient Approximate Queries over Machine Learning Models. (arXiv:2206.02845v3 [cs.DB] UPDATED)
    The question of answering queries over ML predictions has been gaining attention in the database community. This question is challenging because the cost of finding high quality answers corresponds to invoking an oracle such as a human expert or an expensive deep neural network model on every single item in the DB and then applying the query. We develop a novel unified framework for approximate query answering by leveraging a proxy to minimize the oracle usage of finding high quality answers for both Precision-Target (PT) and Recall-Target (RT) queries. Our framework uses a judicious combination of invoking the expensive oracle on data samples and applying the cheap proxy on the objects in the DB. It relies on two assumptions. Under the Proxy Quality assumption, proxy quality can be quantified in a probabilistic manner w.r.t. the oracle. This allows us to develop two algorithms: PQA that efficiently finds high quality answers with high probability and no oracle calls, and PQE, a heuristic extension that achieves empirically good performance with a small number of oracle calls. Alternatively, under the Core Set Closure assumption, we develop two algorithms: CSC that efficiently returns high quality answers with high probability and minimal oracle usage, and CSE, which extends it to more general settings. Our extensive experiments on five real-world datasets on both query types, PT and RT, demonstrate that our algorithms outperform the state-of-the-art and achieve high result quality with provable statistical guarantees.
    Face Pasting Attack. (arXiv:2210.09153v2 [cs.CV] UPDATED)
    Cujo AI and Adversa AI hosted the MLSec face recognition challenge. The goal was to attack a black-box face recognition model with targeted attacks. The model returned the confidence of the target class and a stealthiness score. For an attack to be considered successful, the target class has to have the highest confidence among all classes and the stealthiness has to be at least 0.5. In our approach we paste the face of a target into a source image. By utilizing position, scaling, rotation, and transparency attributes we reached 3rd place. Our approach took approximately 200 queries per attack for the final highest score, and a minimum of about 7.7 queries for a successful attack. The code is available at https://github.com/bunni90/FacePastingAttack.
    Symbol Guided Hindsight Priors for Reward Learning from Human Preferences. (arXiv:2210.09151v2 [cs.LG] UPDATED)
    Specifying rewards for reinforcement learned (RL) agents is challenging. Preference-based RL (PbRL) mitigates these challenges by inferring a reward from feedback over sets of trajectories. However, the effectiveness of PbRL is limited by the amount of feedback needed to reliably recover the structure of the target reward. We present the PRIor Over Rewards (PRIOR) framework, which incorporates priors about the structure of the reward function and the preference feedback into the reward learning process. Imposing these priors as soft constraints on the reward learning objective reduces the amount of feedback required by half and improves overall reward recovery. Additionally, we demonstrate that using an abstract state space for the computation of the priors further improves the reward learning and the agent's performance.
    Multi-block-Single-probe Variance Reduced Estimator for Coupled Compositional Optimization. (arXiv:2207.08540v2 [cs.LG] UPDATED)
    Variance reduction techniques such as SPIDER/SARAH/STORM have been extensively studied to improve the convergence rates of stochastic non-convex optimization, which usually maintain and update a sequence of estimators for a single function across iterations. What if we need to track multiple functional mappings across iterations but only with access to stochastic samples of $\mathcal{O}(1)$ functional mappings at each iteration? There is an important application in solving an emerging family of coupled compositional optimization problems in the form of $\sum_{i=1}^m f_i(g_i(\mathbf{w}))$, where $g_i$ is accessible through a stochastic oracle. The key issue is to track and estimate a sequence of $\mathbf g(\mathbf{w})=(g_1(\mathbf{w}), \ldots, g_m(\mathbf{w}))$ across iterations, where $\mathbf g(\mathbf{w})$ has $m$ blocks and it is only allowed to probe $\mathcal{O}(1)$ blocks to attain their stochastic values and Jacobians. To improve the complexity for solving these problems, we propose a novel stochastic method named Multi-block-Single-probe Variance Reduced (MSVR) estimator to track the sequence of $\mathbf g(\mathbf{w})$. It is inspired by STORM but introduces a customized error correction term to alleviate the noise not only in stochastic samples for the selected blocks but also in those blocks that are not sampled. With the help of the MSVR estimator, we develop several algorithms for solving the aforementioned compositional problems with improved complexities across a spectrum of settings with non-convex/convex/strongly convex/Polyak-{\L}ojasiewicz (PL) objectives. Our results improve upon prior ones in several aspects, including the order of sample complexities and dependence on the strong convexity parameter. Empirical studies on multi-task deep AUC maximization demonstrate the better performance of using the new estimator.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v2 [cs.LG] UPDATED)
    Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning (one critical step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given data-space partition, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can improve their performance significantly. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.
    Foundation Transformers. (arXiv:2210.06423v2 [cs.LG] UPDATED)
    A big convergence of model architectures across language, vision, speech, and multimodal is emerging. However, under the same name "Transformers", the above areas use different implementations for better performance, e.g., Post-LayerNorm for BERT, and Pre-LayerNorm for GPT and vision Transformers. We call for the development of Foundation Transformer for true general-purpose modeling, which serves as a go-to architecture for various tasks and modalities with guaranteed training stability. In this work, we introduce a Transformer variant, named Magneto, to fulfill the goal. Specifically, we propose Sub-LayerNorm for good expressivity, and the initialization strategy theoretically derived from DeepNet for stable scaling up. Extensive experiments demonstrate its superior performance and better stability than the de facto Transformer variants designed for various applications, including language modeling (i.e., BERT, and GPT), machine translation, vision pretraining (i.e., BEiT), speech recognition, and multimodal pretraining (i.e., BEiT-3).
    Empirical Phase Diagram for Three-layer Neural Networks with Infinite Width. (arXiv:2205.12101v2 [cs.LG] UPDATED)
    Substantial work indicates that the dynamics of neural networks (NNs) is closely related to their initialization of parameters. Inspired by the phase diagram for two-layer ReLU NNs with infinite width (Luo et al., 2021), we make a step towards drawing a phase diagram for three-layer ReLU NNs with infinite width. First, we derive a normalized gradient flow for three-layer ReLU NNs and obtain two key independent quantities to distinguish different dynamical regimes for common initialization methods. With carefully designed experiments and a large computation cost, for both synthetic datasets and real datasets, we find that the dynamics of each layer also could be divided into a linear regime and a condensed regime, separated by a critical regime. The criterion is the relative change of input weights (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) as the width approaches infinity during the training, which tends to $0$, $+\infty$ and $O(1)$, respectively. In addition, we also demonstrate that different layers can lie in different dynamical regimes in a training process within a deep NN. In the condensed regime, we also observe the condensation of weights in isolated orientations with low complexity. Through experiments under the three-layer condition, our phase diagram suggests a complicated dynamical picture consisting of three possible regimes, together with their mixture, for deep NNs, and provides guidance for studying deep NNs in different initialization regimes, which reveals the possibility of completely different dynamics emerging within a deep NN for its different layers.
    TarGF: Learning Target Gradient Field for Object Rearrangement. (arXiv:2209.00853v2 [cs.LG] UPDATED)
    Object Rearrangement is to move objects from an initial state to a goal state. Here, we focus on a more practical setting in object rearrangement, i.e., rearranging objects from shuffled layouts to a normative target distribution without explicit goal specification. However, it remains challenging for AI agents, as it is hard to describe the target distribution (goal specification) for reward engineering or collect expert trajectories as demonstrations. Hence, it is infeasible to directly employ reinforcement learning or imitation learning algorithms to address the task. This paper aims to search for a policy only with a set of examples from a target distribution instead of a handcrafted reward function. We employ the score-matching objective to train a Target Gradient Field (TarGF), indicating a direction on each object to increase the likelihood of the target distribution. For object rearrangement, the TarGF can be used in two ways: 1) For model-based planning, we can cast the target gradient into a reference control and output actions with a distributed path planner; 2) For model-free reinforcement learning, the TarGF is not only used for estimating the likelihood-change as a reward but also provides suggested actions in residual policy learning. Experimental results in ball and room rearrangement demonstrate that our method significantly outperforms the state-of-the-art methods in the quality of the terminal state, the efficiency of the control process, and scalability.
    FlowGNN: A Dataflow Architecture for Real-Time Workload-Agnostic Graph Neural Network Inference. (arXiv:2204.13103v2 [cs.DC] UPDATED)
    Graph neural networks (GNNs) have recently exploded in popularity thanks to their broad applicability to graph-related problems such as quantum chemistry, drug discovery, and high energy physics. However, meeting the demand for novel GNN models and fast inference simultaneously is challenging because of the gap between developing efficient accelerators and the rapid creation of new GNN models. Prior art focuses on accelerating specific classes of GNNs, such as Graph Convolutional Networks (GCN), but lacks the generality to support a wide range of existing or new GNN models. Furthermore, most works rely on graph pre-processing to exploit data locality, making them unsuitable for real-time applications. To address these limitations, we propose a generic dataflow architecture for GNN acceleration, named FlowGNN, which generalizes to the majority of message-passing GNNs. The contributions are three-fold. First, we propose a novel and scalable dataflow architecture that supports a wide range of GNN models with the message-passing mechanism. The architecture features a configurable dataflow optimized for the simultaneous computation of node embeddings, edge embeddings, and message passing, which is generally applicable to all such models. We also propose a rich library of model-specific components. Second, we deliver ultra-fast real-time GNN inference without any graph pre-processing, making the architecture agnostic to dynamically changing graph structures. Third, we verify our architecture on the Xilinx Alveo U50 FPGA board and measure the on-board end-to-end performance. We achieve speed-ups of up to 24-254x against CPU (6226R) and 1.3-477x against GPU (A6000) (with batch sizes 1 through 1024); we also outperform the SOTA GNN accelerator I-GCN with a 1.26x speedup and 1.55x better energy efficiency over four datasets. Our implementation code and on-board measurements are publicly available on GitHub.
    A Unified Convergence Theorem for Stochastic Optimization Methods. (arXiv:2206.03907v2 [math.OC] UPDATED)
    In this work, we provide a fundamental unified convergence theorem that can be used to derive expected and almost sure convergence results for a series of stochastic optimization methods. Our unified theorem only requires verifying several representative conditions and is not tailored to any specific algorithm. As a direct application, we recover expected and almost sure convergence results of the stochastic gradient method (SGD) and random reshuffling (RR) under more general settings. Moreover, we establish new expected and almost sure convergence results for the stochastic proximal gradient method (prox-SGD) and stochastic model-based methods (SMM) for nonsmooth nonconvex optimization problems. These applications reveal that our unified theorem provides a plugin-type convergence analysis and strong convergence guarantees for a wide class of stochastic optimization methods.
    BackdoorBench: A Comprehensive Benchmark of Backdoor Learning. (arXiv:2206.12654v2 [cs.LG] UPDATED)
    Backdoor learning is an emerging and vital topic for studying the vulnerability of deep neural networks (DNNs). Many pioneering backdoor attack and defense methods are being proposed, successively or concurrently, in a rapid arms race. However, we find that the evaluations of new methods are often insufficiently thorough to verify their claims and actual performance, mainly due to the rapid development, diverse settings, and the difficulty of implementation and reproducibility. Without thorough evaluations and comparisons, it is hard to track the current progress and design the future development roadmap of the literature. To alleviate this dilemma, we build a comprehensive benchmark of backdoor learning called BackdoorBench. It consists of an extensible modular codebase (currently including implementations of 8 state-of-the-art (SOTA) attacks and 9 SOTA defense algorithms) and a standardized protocol of complete backdoor learning. We also provide comprehensive evaluations of every pair of the 8 attacks against the 9 defenses, with 5 poisoning ratios, based on 5 models and 4 datasets, for 8,000 pairs of evaluations in total. We present abundant analysis of these 8,000 evaluations from different perspectives, studying the effects of different factors in backdoor learning. All codes and evaluations of BackdoorBench are publicly available at \url{https://backdoorbench.github.io}.
    Privacy and Transparency in Graph Machine Learning: A Unified Perspective. (arXiv:2207.10896v2 [cs.LG] UPDATED)
    Graph Machine Learning (GraphML), whereby classical machine learning is generalized to irregular graph domains, has enjoyed a recent renaissance, leading to a dizzying array of models and their applications in several domains. With its growing applicability to sensitive domains and regulations by governmental agencies for trustworthy AI systems, researchers have started looking into the issues of transparency and privacy of graph learning. However, these topics have been mainly investigated independently. In this position paper, we provide a unified perspective on the interplay of privacy and transparency in GraphML. In particular, we describe the challenges and possible research directions for a formal investigation of privacy-transparency tradeoffs in GraphML.
    Hyperparameter Sensitivity in Deep Outlier Detection: Analysis and a Scalable Hyper-Ensemble Solution. (arXiv:2206.07647v2 [cs.LG] UPDATED)
    The outlier detection (OD) literature exhibits numerous algorithms, as it applies to diverse domains. However, given a new detection task, it is unclear how to choose an algorithm to use, or how to set its hyperparameter(s) (HPs) in unsupervised settings. HP tuning is an ever-growing problem with the arrival of many new detectors based on deep learning, which usually come with a long list of HPs. Surprisingly, the issue of model selection in the outlier mining literature has been "the elephant in the room"; it is a significant factor in unlocking the utmost potential of deep methods, yet little is said or done to systematically tackle it. In the first part of this paper, we conduct the first large-scale analysis of the HP sensitivity of deep OD methods, and through more than 35,000 trained models, quantitatively demonstrate that model selection is inevitable. Next, we design a HP-robust and scalable deep hyper-ensemble model called ROBOD that assembles models with varying HP configurations, bypassing the choice paralysis. Importantly, we introduce novel strategies to speed up ensemble training, such as parameter sharing, batch/simultaneous training, and data subsampling, which allow us to train fewer models with fewer parameters. Extensive experiments on both image and tabular datasets show that ROBOD achieves and retains robust, state-of-the-art detection performance compared to its modern counterparts, while taking only $2$-$10$\% of the time of the naive hyper-ensemble with independent training.
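    As an illustration of the hyper-ensemble idea, the sketch below trains detectors under several hyperparameter configurations and averages their standardized outlier scores, sidestepping the choice of a single HP value. The detector (scikit-learn's IsolationForest) and the HP grid are stand-ins, not the paper's deep models.

        # Hyper-ensemble over an HP grid: average standardized outlier scores.
        import numpy as np
        from sklearn.ensemble import IsolationForest

        def hyper_ensemble_scores(X: np.ndarray, n_estimators_grid=(50, 100, 200)) -> np.ndarray:
            all_scores = []
            for n in n_estimators_grid:
                det = IsolationForest(n_estimators=n, random_state=0).fit(X)
                s = -det.score_samples(X)                # higher = more anomalous
                s = (s - s.mean()) / (s.std() + 1e-12)   # standardize before combining
                all_scores.append(s)
            return np.mean(all_scores, axis=0)

        X = np.random.randn(500, 8)
        scores = hyper_ensemble_scores(X)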
    MCP: Self-supervised Pre-training for Personalized Chatbots with Multi-level Contrastive Sampling. (arXiv:2210.08753v2 [cs.CL] UPDATED)
    Personalized chatbots focus on endowing chatbots with a consistent personality to behave like real users and further act as personal assistants. Previous studies have explored generating implicit user profiles from the user's dialogue history for building personalized chatbots. However, these studies only use the response generation loss to train the entire model, which makes them prone to the problem of data sparsity. Besides, they overemphasize the quality of the final generated response while ignoring the correlations and fusion within the user's dialogue history, leading to rough data representations and performance degradation. To tackle these problems, we propose MCP, a self-supervised learning framework for capturing better representations from users' dialogue history for personalized chatbots. Specifically, we apply contrastive sampling methods to leverage the supervised signals hidden in user dialogue history, and generate pre-training samples to enhance the model. We design three pre-training tasks based on three types of contrastive pairs from user dialogue history, namely response pairs, sequence augmentation pairs, and user pairs. We pre-train the utterance encoder and the history encoder towards the contrastive objectives and use these pre-trained encoders to generate user profiles during personalized response generation. Experimental results on two real-world datasets show a significant improvement of our proposed model MCP over existing methods.
    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v2 [cs.LG] UPDATED)
    We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between privacy and the excess population loss, using an algorithm with time complexity linear in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables practitioners to bring their own base optimization algorithm and use it as a black box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [22] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach lets us sidestep the requirement that the base algorithm have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time complexity. To the best of our knowledge, these are the first near-linear time algorithms with near-optimal guarantees on the population duality gap for smooth DP-SMO when the objective is (strongly-)convex--(strongly-)concave. Additionally, based on our flexible framework, we enrich the family of near-linear time algorithms for smooth DP-SCO with the near-optimal privacy-loss trade-off.
    SAICL: Student Modelling with Interaction-level Auxiliary Contrastive Tasks for Knowledge Tracing and Dropout Prediction. (arXiv:2210.09012v2 [cs.CY] UPDATED)
    Knowledge tracing and dropout prediction are crucial for online education, whether to estimate students' knowledge states or to prevent dropout. While traditional systems interacting with students suffer from data sparsity and overfitting, recent sample-level contrastive learning helps alleviate these issues. One major limitation of sample-level approaches is that they regard students' behavior interaction sequences as a bundle, so they often fail to encode temporal contexts and track their dynamic changes, making it hard to find optimal representations for knowledge tracing and dropout prediction. To exploit the temporal context within the sequence, this study introduces a novel student modeling framework, SAICL: \textbf{s}tudent modeling with \textbf{a}uxiliary \textbf{i}nteraction-level \textbf{c}ontrastive \textbf{l}earning. In detail, SAICL can utilize both proposed self-supervised/supervised interaction-level contrastive objectives: MilCPC (\textbf{M}ulti-\textbf{I}nteraction-\textbf{L}evel \textbf{C}ontrastive \textbf{P}redictive \textbf{C}oding) and SupCPC (\textbf{Sup}ervised \textbf{C}ontrastive \textbf{P}redictive \textbf{C}oding). While previous sample-level contrastive methods for student modeling are highly dependent on data augmentation methods, SAICL requires no data augmentation while showing better performance in both self-supervised and supervised settings. By combining cross-entropy with contrastive objectives, the proposed SAICL achieves knowledge tracing and dropout prediction performance comparable to other state-of-the-art models without compromising inference costs.
    Breaking the Curse of Dimensionality in Multiagent State Space: A Unified Agent Permutation Framework. (arXiv:2203.05285v2 [cs.LG] UPDATED)
    The state space in Multiagent Reinforcement Learning (MARL) grows exponentially with the agent number. Such a curse of dimensionality results in poor scalability and low sample efficiency, inhibiting MARL for decades. To break this curse, we propose a unified agent permutation framework that exploits the permutation invariance (PI) and permutation equivariance (PE) inductive biases to reduce the multiagent state space. Our insight is that permuting the order of entities in the factored multiagent state space does not change the information. Specifically, we propose two novel implementations: a Dynamic Permutation Network (DPN) and a Hyper Policy Network (HPN). The core idea is to build separate entity-wise PI input and PE output network modules to connect the entity-factored state space and action space in an end-to-end way. DPN achieves such connections by two separate module selection networks, which consistently assign the same input module to the same input entity (guarantee PI) and assign the same output module to the same entity-related output (guarantee PE). To enhance the representation capability, HPN replaces the module selection networks of DPN with hypernetworks to directly generate the corresponding module weights. Extensive experiments in SMAC, Google Research Football and MPE validate that the proposed methods significantly boost the performance and the learning efficiency of existing MARL algorithms. Remarkably, in SMAC, we achieve 100% win rates in almost all hard and super-hard scenarios (never achieved before).
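    The PI/PE inductive biases can be illustrated with a Deep Sets-style sketch: per-entity features are sum-pooled into a permutation-invariant summary, and per-entity outputs conditioned on that summary are permutation-equivariant. This is only an illustration of the biases; the paper's DPN/HPN modules are more elaborate.

        # Deep Sets-style PI input / PE output modules (illustrative only).
        import torch
        import torch.nn as nn

        class PIPEPolicy(nn.Module):
            def __init__(self, ent_dim: int, hidden: int, n_actions_per_ent: int):
                super().__init__()
                self.phi = nn.Sequential(nn.Linear(ent_dim, hidden), nn.ReLU())
                self.rho = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU())
                self.out = nn.Linear(2 * hidden, n_actions_per_ent)

            def forward(self, entities: torch.Tensor) -> torch.Tensor:
                # entities: (batch, n_entities, ent_dim)
                h = self.phi(entities)                    # per-entity features
                pooled = self.rho(h.sum(dim=1))           # sum-pooling => PI summary
                pooled = pooled.unsqueeze(1).expand_as(h)
                # per-entity outputs conditioned on the PI summary => PE overall
                return self.out(torch.cat([h, pooled], dim=-1))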
    Well-definedness of Physical Law Learning: The Uniqueness Problem. (arXiv:2210.08342v2 [cs.LG] UPDATED)
    Physical law learning is the ambitious attempt to automate the derivation of governing equations with the use of machine learning techniques. However, the current literature focuses solely on the development of methods to achieve this goal, and a theoretical foundation is at present missing. This paper shall thus serve as a first step towards building a comprehensive theoretical framework for learning physical laws, aiming to provide reliability for the corresponding algorithms. One key problem is that the governing equations might not be uniquely determined by the given data. We study this problem in the common situation where a physical law is described by an ordinary or partial differential equation. For various classes of differential equations, we provide both necessary and sufficient conditions for a function from a given function class to uniquely determine the differential equation governing the phenomenon. We then use our results to devise numerical algorithms that determine whether a function solves a differential equation uniquely. Finally, we provide extensive numerical experiments showing that our algorithms, in combination with common approaches for learning physical laws, indeed make it possible to guarantee that a unique governing differential equation is learnt, without assuming any knowledge about the function, thereby ensuring reliability.
    Super-resolution GANs of randomly-seeded fields. (arXiv:2202.11701v2 [physics.flu-dyn] UPDATED)
    Reconstruction of field quantities from sparse measurements is a problem arising in a broad spectrum of applications. This task is particularly challenging when the mapping between the sparse measurements and the field quantities is performed in an unsupervised manner. Further complexity is added by moving sensors and/or random on-off status. Under such conditions, the most straightforward solution is to interpolate the scattered data onto a regular grid. However, the spatial resolution achieved with this approach is ultimately limited by the mean spacing between the sparse measurements. In this work, we propose a super-resolution generative adversarial network (GAN) framework to estimate field quantities from random sparse sensors without needing any full-field high-resolution training data. The algorithm exploits random sampling to provide incomplete views of the high-resolution underlying distributions. It is hereby referred to as RAndomly-SEEDed super-resolution GAN (RaSeedGAN). The proposed technique is tested on synthetic databases of fluid flow simulations, ocean surface temperature distribution measurements, and particle image velocimetry data of a zero-pressure-gradient turbulent boundary layer. The results show excellent performance even in cases with high sparsity or high levels of noise. To our knowledge, this is the first GAN algorithm for full-field high-resolution estimation from randomly seeded fields with no need for full-field high-resolution representations.
    Identifiability of deep generative models without auxiliary information. (arXiv:2206.10044v2 [cs.LG] UPDATED)
    We prove identifiability of a broad class of deep latent variable models that (a) have universal approximation capabilities and (b) are the decoders of variational autoencoders that are commonly used in practice. Unlike existing work, our analysis does not require weak supervision, auxiliary information, or conditioning in the latent space. Specifically, we show that for a broad class of generative (i.e. unsupervised) models with universal approximation capabilities, the side information $u$ is not necessary: We prove identifiability of the entire generative model where we do not observe $u$ and only observe the data $x$. The models we consider match autoencoder architectures used in practice that leverage mixture priors in the latent space and ReLU/leaky-ReLU activations in the encoder, such as VaDE and MFC-VAE. Our main result is an identifiability hierarchy that significantly generalizes previous work and exposes how different assumptions lead to different "strengths" of identifiability, and includes certain "vanilla" VAEs with isotropic Gaussian priors as a special case. For example, our weakest result establishes (unsupervised) identifiability up to an affine transformation, and thus partially resolves an open problem regarding model identifiability raised in prior work. These theoretical results are augmented with experiments on both simulated and real data.
    Lethal Dose Conjecture on Data Poisoning. (arXiv:2208.03309v3 [cs.LG] UPDATED)
    Data poisoning considers an adversary that distorts the training set of machine learning algorithms for malicious purposes. In this work, we bring to light one conjecture regarding the fundamentals of data poisoning, which we call the Lethal Dose Conjecture. The conjecture states: If $n$ clean training samples are needed for accurate predictions, then in a size-$N$ training set, only $\Theta(N/n)$ poisoned samples can be tolerated while ensuring accuracy. Theoretically, we verify this conjecture in multiple cases. We also offer a more general perspective on this conjecture through distribution discrimination. Deep Partition Aggregation (DPA) and its extension, Finite Aggregation (FA), are recent approaches for provable defenses against data poisoning, where they predict through the majority vote of many base models trained on different subsets of the training set using a given learner. The conjecture implies that both DPA and FA are (asymptotically) optimal -- if we have the most data-efficient learner, they can turn it into one of the most robust defenses against data poisoning. This outlines a practical approach to developing stronger defenses against poisoning via finding data-efficient learners. Empirically, as a proof of concept, we show that by simply using different data augmentations for base learners, we can respectively double and triple the certified robustness of DPA on CIFAR-10 and GTSRB without sacrificing accuracy.
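    A minimal sketch of the DPA scheme the conjecture speaks to: partition the training set into disjoint subsets, train one base learner per partition, and predict by majority vote, so a poisoned sample can only corrupt the partitions it lands in. The base learner below is an arbitrary scikit-learn stand-in, and labels are assumed to be non-negative integers.

        # Deep Partition Aggregation: disjoint partitions + majority vote.
        import numpy as np
        from sklearn.linear_model import LogisticRegression

        def dpa_fit(X, y, k_partitions=5):
            models = []
            for i in range(k_partitions):
                idx = np.arange(len(X)) % k_partitions == i  # disjoint partitions
                models.append(LogisticRegression(max_iter=1000).fit(X[idx], y[idx]))
            return models

        def dpa_predict(models, X):
            votes = np.stack([m.predict(X) for m in models])  # (k, n_samples)
            # majority vote per sample; a poisoned point corrupts at most the
            # partitions it falls into, which yields the robustness bound
            return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)

        # usage on toy data
        X = np.random.randn(300, 4); y = (X[:, 0] > 0).astype(int)
        preds = dpa_predict(dpa_fit(X, y), X)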
    A new hope for network model generalization. (arXiv:2207.05843v2 [cs.NI] UPDATED)
    Generalizing machine learning (ML) models for network traffic dynamics tends to be considered a lost cause. Hence, for every new task, we design new models and train them on model-specific datasets closely mimicking the deployment environments. Yet, an ML architecture called _Transformer_ has enabled previously unimaginable generalization in other domains. Nowadays, one can download a model pre-trained on massive datasets and only fine-tune it for a specific task and context with comparatively little time and data. These fine-tuned models are now state-of-the-art for many benchmarks. We believe this progress could translate to networking and propose a Network Traffic Transformer (NTT), a transformer adapted to learn network dynamics from packet traces. Our initial results are promising: NTT seems able to generalize to new prediction tasks and environments. This study suggests there is still hope for generalization, though it calls for a lot of future research.
    Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms. (arXiv:2202.13001v6 [cs.LG] UPDATED)
    We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting) or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of a large number of tasks and a small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
    Representational Ethical Model Calibration. (arXiv:2207.12043v2 [cs.LG] UPDATED)
    Equity is widely held to be fundamental to the ethics of healthcare. In the context of clinical decision-making, it rests on the comparative fidelity of the intelligence -- evidence-based or intuitive -- guiding the management of each individual patient. Though brought to recent attention by the individuating power of contemporary machine learning, such epistemic equity arises in the context of any decision guidance, whether traditional or innovative. Yet no general framework for its quantification, let alone assurance, currently exists. Here we formulate epistemic equity in terms of model fidelity evaluated over learnt multi-dimensional representations of identity crafted to maximise the captured diversity of the population, introducing a comprehensive framework for Representational Ethical Model Calibration. We demonstrate use of the framework on large-scale multimodal data from UK Biobank to derive diverse representations of the population, quantify model performance, and institute responsive remediation. We offer our approach as a principled solution to quantifying and assuring epistemic equity in healthcare, with applications across the research, clinical, and regulatory domains.
    Geodesic Multi-Modal Mixup for Robust Fine-Tuning. (arXiv:2203.03897v2 [cs.CV] UPDATED)
    Pre-trained large-scale models provide a transferable embedding and show promising performance on diverse downstream tasks. However, the learned embeddings have not been analyzed well, and the transferability to cross-modal tasks can be improved. This paper provides a perspective for understanding multi-modal embeddings in terms of uniformity and alignment. We find that the representation learned by multi-modal learning models such as CLIP has two separated embedding spaces, one for the heterogeneous data of each modality, with little alignment. Moreover, there are unexplored large intermediate areas between the two modalities with low uniformity. As a result, the lack of alignment and uniformity may restrict the robustness and transferability of the representation for downstream tasks. To this end, we provide a new end-to-end fine-tuning method for robust representation that encourages better uniformity and alignment scores. First, we propose a \textit{Geodesic Multi-Modal Mixup} that mixes the representations of image and text to generate hard negative samples on the hyperspherical embedding space. Second, we fine-tune the multi-modal model on hard negative samples as well as normal negative and positive samples with a contrastive loss. Through extensive experiments on retrieval, classification, and structure-awareness tasks, we demonstrate that our geodesic multi-modal Mixup learns a robust representation and provides improved performance on various downstream tasks.
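    One natural reading of mixing embeddings along the hypersphere is spherical interpolation between unit-norm vectors, sketched below; the paper's exact formulation of the geodesic Mixup and of the hard-negative construction may differ.

        # Spherical interpolation of unit-norm embeddings as hard negatives.
        import torch
        import torch.nn.functional as F

        def geodesic_mixup(u: torch.Tensor, v: torch.Tensor, lam: float) -> torch.Tensor:
            u, v = F.normalize(u, dim=-1), F.normalize(v, dim=-1)
            # angle between the two embeddings, clamped for numerical safety
            omega = torch.acos((u * v).sum(-1, keepdim=True).clamp(-1 + 1e-7, 1 - 1e-7))
            so = torch.sin(omega)
            mixed = (torch.sin((1 - lam) * omega) / so) * u + (torch.sin(lam * omega) / so) * v
            return F.normalize(mixed, dim=-1)

        img_emb, txt_emb = torch.randn(8, 512), torch.randn(8, 512)
        hard_negatives = geodesic_mixup(img_emb, txt_emb, lam=0.5)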
    Open-Ended Knowledge Tracing. (arXiv:2203.03716v3 [cs.CY] UPDATED)
    In education applications, knowledge tracing refers to the problem of estimating students' time-varying concept/skill mastery level from their past responses to questions and predicting their future performance. One key limitation of most existing knowledge tracing methods is that they treat student responses to questions as binary-valued, i.e., whether they are correct or incorrect. Response correctness analysis/prediction ignores important information on student knowledge contained in the exact content of the responses, especially for open-ended questions. In this paper, we conduct the first exploration into open-ended knowledge tracing (OKT) by studying the new task of predicting students' exact open-ended responses to questions. Our work is grounded in the domain of computer science education with programming questions. We develop an initial solution to the OKT problem, a student knowledge-guided code generation approach, that combines program synthesis methods using language models with student knowledge tracing methods. We also conduct a series of quantitative and qualitative experiments on a real-world student code dataset to validate OKT and demonstrate its promise in educational applications.
    Seeking Commonness and Inconsistencies: A Jointly Smoothed Approach to Multi-view Subspace Clustering. (arXiv:2203.08060v2 [cs.CV] UPDATED)
    Multi-view subspace clustering aims to discover the hidden subspace structures from multiple views for robust clustering, and has been attracting considerable attention in recent years. Despite significant progress, most of the previous multi-view subspace clustering algorithms are still faced with two limitations. First, they usually focus on the consistency (or commonness) of multiple views, yet often lack the ability to capture the cross-view inconsistencies in subspace representations. Second, many of them overlook the local structures of multiple views and cannot jointly leverage multiple local structures to enhance the subspace representation learning. To address these two limitations, in this paper, we propose a jointly smoothed multi-view subspace clustering (JSMC) approach. Specifically, we simultaneously incorporate the cross-view commonness and inconsistencies into the subspace representation learning. The view-consensus grouping effect is presented to jointly exploit the local structures of multiple views to regularize the view-commonness representation, which is further associated with the low-rank constraint via the nuclear norm to strengthen its cluster structure. Thus the cross-view commonness and inconsistencies, the view-consensus grouping effect, and the low-rank representation are seamlessly incorporated into a unified objective function, upon which an alternating optimization algorithm is performed to achieve a robust subspace representation for clustering. Experimental results on a variety of real-world multi-view datasets confirm the superiority of our approach.
    ImDrug: A Benchmark for Deep Imbalanced Learning in AI-aided Drug Discovery. (arXiv:2209.07921v2 [cs.LG] UPDATED)
    The last decade has witnessed a prosperous development of computational methods and dataset curation for AI-aided drug discovery (AIDD). However, real-world pharmaceutical datasets often exhibit highly imbalanced distribution, which is overlooked by the current literature but may severely compromise the fairness and generalization of machine learning applications. Motivated by this observation, we introduce ImDrug, a comprehensive benchmark with an open-source Python library which consists of 4 imbalance settings, 11 AI-ready datasets, 54 learning tasks and 16 baseline algorithms tailored for imbalanced learning. It provides an accessible and customizable testbed for problems and solutions spanning a broad spectrum of the drug discovery pipeline such as molecular modeling, drug-target interaction and retrosynthesis. We conduct extensive empirical studies with novel evaluation metrics, to demonstrate that the existing algorithms fall short of solving medicinal and pharmaceutical challenges in the data imbalance scenario. We believe that ImDrug opens up avenues for future research and development, on real-world challenges at the intersection of AIDD and deep imbalanced learning.
    Dr. Neurosymbolic, or: How I Learned to Stop Worrying and Accept Statistics. (arXiv:2209.04049v3 [cs.AI] UPDATED)
    The symbolic AI community is increasingly trying to embrace machine learning in neuro-symbolic architectures, yet is still struggling due to cultural barriers. To break the barrier, this rather opinionated personal memo attempts to explain and rectify the conventions in Statistics, Machine Learning, and Deep Learning from the viewpoint of outsiders. It provides a step-by-step protocol for designing a machine learning system that satisfies a minimum theoretical guarantee necessary for being taken seriously by the symbolic AI community, i.e., it discusses "in what condition we can stop worrying and accept statistical machine learning." Some highlights: Most textbooks are written for those who plan to specialize in Stat/ML/DL and are expected to accept the jargon. This memo is for experienced symbolic researchers who hear a lot of buzz but are still uncertain and skeptical. Information on Stat/ML/DL is currently too scattered or too noisy to invest in. This memo prioritizes compactness and pays special attention to concepts that resonate well with symbolic paradigms. I hope this memo offers time savings. It prioritizes general mathematical modeling and does not discuss any specific function approximator, such as neural networks (NNs), SVMs, decision trees, etc. It is open to corrections. Consider this memo as something similar to a blog post taking the form of a paper on arXiv.
    Influence Estimation for Generative Adversarial Networks. (arXiv:2101.08367v3 [stat.ML] UPDATED)
    Identifying harmful instances, whose absence from a training dataset improves model performance, is important for building better machine learning models. Although previous studies have succeeded in estimating harmful instances under supervised settings, they cannot be trivially extended to generative adversarial networks (GANs). This is because previous approaches require that (1) the absence of a training instance directly affects the loss value and that (2) the change in the loss directly measures the harmfulness of the instance for the performance of a model. In GAN training, however, neither requirement is satisfied: (1) the generator's loss is not directly affected by the training instances, as they are not part of the generator's training steps, and (2) the values of GAN's losses normally do not capture the generative performance of a model. To this end, (1) we propose an influence estimation method that uses the Jacobian of the gradient of the generator's loss with respect to the discriminator's parameters (and vice versa) to trace how the absence of an instance in the discriminator's training affects the generator's parameters, and (2) we propose a novel evaluation scheme in which we assess the harmfulness of each training instance based on how a GAN evaluation metric (e.g., inception score) is expected to change due to the removal of the instance. We experimentally verified that our influence estimation method correctly inferred the changes in GAN evaluation metrics. Further, we demonstrated that the removal of the identified harmful instances effectively improved the model's generative performance with respect to various GAN evaluation metrics.
    Generalization Properties of hyper-RKHS and its Applications. (arXiv:1809.09910v4 [cs.LG] UPDATED)
    This paper generalizes regularized regression problems to a hyper-reproducing kernel Hilbert space (hyper-RKHS), illustrates its utility for kernel learning and out-of-sample extensions, and proves asymptotic convergence results for the introduced regression models from an approximation-theory viewpoint. Algorithmically, we consider two regularized regression models with bivariate forms in this space, namely kernel ridge regression (KRR) and support vector regression (SVR) endowed with the hyper-RKHS, and further combine divide-and-conquer with Nystr\"{o}m approximation for scalability in large-sample cases. This framework is general: the underlying kernel is learned from a broad class and can be positive definite or not, which adapts to various requirements in kernel learning. Theoretically, we study the convergence behavior of regularized regression algorithms in the hyper-RKHS and derive the learning rates, which goes beyond the classical analysis on RKHS due to the non-trivial dependence among pairwise samples and the characterisation of the hyper-RKHS. Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves satisfactory performance on classification tasks.
    D-DARTS: Distributed Differentiable Architecture Search. (arXiv:2108.09306v3 [cs.LG] UPDATED)
    Differentiable ARchiTecture Search (DARTS) is one of the most trending Neural Architecture Search (NAS) methods. It drastically reduces search cost by resorting to weight-sharing. However, it also dramatically reduces the search space, thus excluding potential promising architectures. In this article, we propose D-DARTS, a solution that addresses this problem by nesting neural networks at the cell level instead of using weight-sharing to produce more diversified and specialized architectures. Moreover, we introduce a novel algorithm that can derive deeper architectures from a few trained cells, increasing performance and saving computation time. In addition, we also present an alternative search space (DARTOpti) in which we optimize existing handcrafted architectures (e.g., ResNet) rather than starting from scratch. This approach is accompanied by a novel metric that measures the distance between architectures inside our custom search space. Our solution reaches competitive performance on multiple computer vision tasks.
    Temporal Representation Learning on Monocular Videos for 3D Human Pose Estimation. (arXiv:2012.01511v4 [cs.CV] UPDATED)
    In this paper we propose an unsupervised feature extraction method to capture temporal information in monocular videos, where we detect and encode the subject of interest in each frame and leverage contrastive self-supervised (CSS) learning to extract rich latent vectors. Instead of simply treating the latent features of nearby frames as positive pairs and those of temporally distant ones as negative pairs, as in other CSS approaches, we explicitly disentangle each latent vector into a time-variant component and a time-invariant one. We then show that applying a contrastive loss only to the time-variant features, and encouraging a gradual transition on them between nearby and distant frames while also reconstructing the input, extracts rich temporal features well suited to human pose estimation. Our approach reduces error by about 50% compared to the standard CSS strategies, outperforms other unsupervised single-view methods, and matches the performance of multi-view techniques. When 2D pose is available, our approach can extract even richer latent features and improve the 3D pose estimation accuracy, outperforming other state-of-the-art weakly supervised methods.
    Learning Two-Step Hybrid Policy for Graph-Based Interpretable Reinforcement Learning. (arXiv:2201.08520v2 [cs.LG] UPDATED)
    We present a two-step hybrid reinforcement learning (RL) policy that is designed to generate interpretable and robust hierarchical policies on the RL problem with graph-based input. Unlike prior deep reinforcement learning policies parameterized by an end-to-end black-box graph neural network, our approach disentangles the decision-making process into two steps. The first step is a simplified classification problem that maps the graph input to an action group where all actions share a similar semantic meaning. The second step implements a sophisticated rule-miner that conducts explicit one-hop reasoning over the graph and identifies decisive edges in the graph input without the necessity of heavy domain knowledge. This two-step hybrid policy presents human-friendly interpretations and achieves better performance in terms of generalization and robustness. Extensive experimental studies on four levels of complex text-based games have demonstrated the superiority of the proposed method compared to the state-of-the-art.
    Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation. (arXiv:2202.06881v2 [cs.LG] UPDATED)
    We propose Active Surrogate Estimators (ASEs), a new method for label-efficient model evaluation. Evaluating model performance is a challenging and important problem when labels are expensive. ASEs address this active testing problem using a surrogate-based estimation approach that interpolates the errors of points with unknown labels, rather than forming a Monte Carlo estimator. ASEs actively learn the underlying surrogate, and we propose a novel acquisition strategy, XWED, that tailors this learning to the final estimation task. We find that ASEs offer greater label-efficiency than the current state-of-the-art when applied to challenging model evaluation problems for deep neural networks.
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v1 [cs.LG])
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These manifest as changes to the underlying data generating mechanisms, and thereby result in distribution shifts across environments. Attributing performance changes to specific shifts, such as covariate or concept shifts, is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition (or a set) of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.
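    The attribution step can be illustrated generically: given a value function v(S) returning model performance when only the mechanisms in coalition S are shifted to the target environment, each shift's Shapley value follows by exact enumeration (feasible for a handful of mechanisms). The value function and the two mechanisms below are hypothetical.

        # Exact Shapley values over a small set of distribution-shift mechanisms.
        from itertools import combinations
        from math import factorial

        def shapley_values(players, v):
            n = len(players)
            phi = {p: 0.0 for p in players}
            for p in players:
                others = [q for q in players if q != p]
                for r in range(n):
                    for S in combinations(others, r):
                        # standard Shapley coalition weight |S|!(n-|S|-1)!/n!
                        w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                        phi[p] += w * (v(frozenset(S) | {p}) - v(frozenset(S)))
            return phi

        # hypothetical performance under each shifted coalition
        perf = {frozenset(): 0.90, frozenset({"cov"}): 0.84,
                frozenset({"con"}): 0.88, frozenset({"cov", "con"}): 0.80}
        print(shapley_values(["cov", "con"], lambda S: perf[frozenset(S)]))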
    Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Too Much Accuracy. (arXiv:1906.06784v7 [stat.ML] UPDATED)
    Adversarial robustness has become a central goal in deep learning, in both theory and practice. However, successful methods for improving adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how adversarial robustness affects real-world systems (i.e., many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation-based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated Adversarial Training we retain adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in standard error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide a mathematical analysis of Interpolated Adversarial Training to confirm its efficiency and demonstrate its advantages in terms of robustness and generalization.
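    A minimal sketch of the recipe, assuming a single-step FGSM attack and standard Mixup with one-hot targets (both are illustrative choices, not necessarily the paper's exact configuration): generate adversarial examples, then train on interpolations of the clean batch and of the adversarial batch.

        # Interpolation-based training on both clean and adversarial batches.
        import torch
        import torch.nn.functional as F

        def fgsm(model, x, y, eps=8 / 255):
            x = x.clone().requires_grad_(True)
            loss = F.cross_entropy(model(x), y)
            grad, = torch.autograd.grad(loss, x)
            return (x + eps * grad.sign()).detach()

        def interpolated_adv_loss(model, x, y, num_classes, alpha=1.0):
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            perm = torch.randperm(x.size(0))
            x_adv = fgsm(model, x, y)
            y1 = F.one_hot(y, num_classes).float()
            y_mix = lam * y1 + (1 - lam) * y1[perm]   # mixed soft targets
            loss = 0.0
            for xs in (x, x_adv):                     # clean and adversarial branches
                x_mix = lam * xs + (1 - lam) * xs[perm]
                logp = F.log_softmax(model(x_mix), dim=-1)
                loss = loss + -(y_mix * logp).sum(-1).mean()
            return loss / 2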
    Fantômas: Evaluating Reversibility of Face Anonymizations Using a General Deep Learning Attacker. (arXiv:2210.10651v1 [cs.CR])
    Biometric data is a rich source of information that can be used to identify individuals and infer private information about them. To mitigate this privacy risk, anonymization techniques employ transformations on clear data to obfuscate sensitive information, all while retaining some utility of the data. Albeit published with impressive claims, these techniques are sometimes not evaluated with convincing methodology. We are hence interested in the extent to which recently suggested anonymization techniques for obfuscating facial images are effective. More specifically, we test how easily they can be automatically reverted, to estimate the privacy they can provide. Our approach is agnostic to the anonymization technique, as we learn a machine learning model on the clear and corresponding anonymized data. We find that 10 out of 14 tested face anonymization techniques are at least partially reversible, and six of them are at least highly reversible.
    Optimal Algorithms for Stochastic Multi-Level Compositional Optimization. (arXiv:2202.07530v4 [cs.LG] UPDATED)
    In this paper, we investigate the problem of stochastic multi-level compositional optimization, where the objective function is a composition of multiple smooth but possibly non-convex functions. Existing methods for solving this problem either suffer from sub-optimal sample complexities or need a huge batch size. To address these limitations, we propose a Stochastic Multi-level Variance Reduction method (SMVR), which achieves the optimal sample complexity of $\mathcal{O}\left(1 / \epsilon^{3}\right)$ to find an $\epsilon$-stationary point for non-convex objectives. Furthermore, when the objective function satisfies the convexity or Polyak-{\L}ojasiewicz (PL) condition, we propose a stage-wise variant of SMVR and improve the sample complexity to $\mathcal{O}\left(1 / \epsilon^{2}\right)$ for convex functions or $\mathcal{O}\left(1 /\left(\mu\epsilon\right)\right)$ for non-convex functions satisfying the $\mu$-PL condition. The latter result implies the same complexity for $\mu$-strongly convex functions. To make use of adaptive learning rates, we also develop Adaptive SMVR, which achieves the same complexities but converges faster in practice. All our complexities match the lower bounds not only in terms of $\epsilon$ but also in terms of $\mu$ (for PL or strongly convex functions), without using a large batch size in each iteration.
    Why Should Adversarial Perturbations be Imperceptible? Rethink the Research Paradigm in Adversarial NLP. (arXiv:2210.10683v1 [cs.CL])
    Textual adversarial samples play important roles in multiple subfields of NLP research, including security, evaluation, explainability, and data augmentation. However, most work mixes all these roles, obscuring the problem definitions and research goals of the security role that aims to reveal the practical concerns of NLP models. In this paper, we rethink the research paradigm of textual adversarial samples in security scenarios. We discuss the deficiencies in previous work and propose our suggestions that the research on the Security-oriented adversarial NLP (SoadNLP) should: (1) evaluate their methods on security tasks to demonstrate the real-world concerns; (2) consider real-world attackers' goals, instead of developing impractical methods. To this end, we first collect, process, and release a security datasets collection Advbench. Then, we reformalize the task and adjust the emphasis on different goals in SoadNLP. Next, we propose a simple method based on heuristic rules that can easily fulfill the actual adversarial goals to simulate real-world attack methods. We conduct experiments on both the attack and the defense sides on Advbench. Experimental results show that our method has higher practical value, indicating that the research paradigm in SoadNLP may start from our new benchmark. All the code and data of Advbench can be obtained at \url{https://github.com/thunlp/Advbench}.
    UAV-assisted Online Machine Learning over Multi-Tiered Networks: A Hierarchical Nested Personalized Federated Learning Approach. (arXiv:2106.15734v5 [cs.LG] UPDATED)
    We investigate training machine learning (ML) models across a set of geo-distributed, resource-constrained clusters of devices through unmanned aerial vehicles (UAV) swarms. The presence of time-varying data heterogeneity and computational resource inadequacy among device clusters motivate four key parts of our methodology: (i) stratified UAV swarms of leader, worker, and coordinator UAVs, (ii) hierarchical nested personalized federated learning (HN-PFL), a distributed ML framework for personalized model training across the worker-leader-core network hierarchy, (iii) cooperative UAV resource pooling to address computational inadequacy of devices by conducting model training among the UAV swarms, and (iv) model/concept drift to model time-varying data distributions. In doing so, we consider both micro (i.e., UAV-level) and macro (i.e., swarm-level) system design. At the micro-level, we propose network-aware HN-PFL, where we distributively orchestrate UAVs inside swarms to optimize energy consumption and ML model performance with performance guarantees. At the macro-level, we focus on swarm trajectory and learning duration design, which we formulate as a sequential decision making problem tackled via deep reinforcement learning. Our simulations demonstrate the improvements achieved by our methodology in terms of ML performance, network resource savings, and swarm trajectory efficiency.
    A Systematic Study of Bias Amplification. (arXiv:2201.11706v2 [cs.LG] UPDATED)
    Recent research suggests that predictions made by machine-learning models can amplify biases present in the training data. When a model amplifies bias, it makes certain predictions at a higher rate for some groups than expected based on training-data statistics. Mitigating such bias amplification requires a deep understanding of the mechanics in modern machine learning that give rise to that amplification. We perform the first systematic, controlled study into when and how bias amplification occurs. To enable this study, we design a simple image-classification problem in which we can tightly control (synthetic) biases. Our study of this problem reveals that the strength of bias amplification is correlated with measures such as model accuracy, model capacity, model overconfidence, and amount of training data. We also find that bias amplification can vary greatly during training. Finally, we find that bias amplification may depend on the difficulty of the classification task relative to the difficulty of recognizing group membership: bias amplification appears to occur primarily when it is easier to recognize group membership than class membership. Our results suggest best practices for training machine-learning models that we hope will help pave the way for the development of better mitigation strategies. Code can be found at https://github.com/facebookresearch/cv_bias_amplification.
    Anomaly Detection Requires Better Representations. (arXiv:2210.10773v1 [cs.LG])
    Anomaly detection seeks to identify unusual phenomena, a central task in science and industry. The task is inherently unsupervised as anomalies are unexpected and unknown during training. Recent advances in self-supervised representation learning have directly driven improvements in anomaly detection. In this position paper, we first explain how self-supervised representations can be easily used to achieve state-of-the-art performance in commonly reported anomaly detection benchmarks. We then argue that tackling the next generation of anomaly detection tasks requires new technical and conceptual improvements in representation learning.
    What Makes a "Good" Data Augmentation in Knowledge Distillation -- A Statistical Perspective. (arXiv:2012.02909v2 [cs.CV] UPDATED)
    Knowledge distillation (KD) is a general neural network training approach that uses a teacher to guide a student. Existing works mainly study KD from the network output side (e.g., trying to design a better KD loss function), while few have attempted to understand it from the input side. Especially, its interplay with data augmentation (DA) has not been well understood. In this paper, we ask: Why do some DA schemes (e.g., CutMix) inherently perform much better than others in KD? What makes a "good" DA in KD? Our investigation from a statistical perspective suggests that a good DA scheme should reduce the variance of the teacher's mean probability, which will eventually lead to a lower generalization gap for the student. Besides the theoretical understanding, we also introduce a new entropy-based data-mixing DA scheme to enhance CutMix. Extensive empirical studies support our claims and demonstrate how we can harvest considerable performance gains simply by using a better DA scheme in knowledge distillation.
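    The statistic suggested above can be estimated directly: draw a stochastic augmentation several times, record the teacher's mean probability of the true class per draw, and take the standard deviation across draws; lower values would indicate a "better" DA for KD under this criterion. The teacher and augmentation below are placeholders, and the exact estimator in the paper may differ.

        # Estimate the spread of the teacher's mean true-class probability
        # across repeated draws of a stochastic augmentation.
        import torch

        @torch.no_grad()
        def teacher_prob_std(teacher, augment, x, y, n_draws=20):
            per_draw_means = []
            for _ in range(n_draws):
                p = torch.softmax(teacher(augment(x)), dim=-1)
                true_class_p = p.gather(1, y.unsqueeze(1)).squeeze(1)
                per_draw_means.append(true_class_p.mean())   # mean over the batch
            # variability of the teacher's mean probability across DA draws
            return torch.stack(per_draw_means).std().item()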
    RSC: Accelerating Graph Neural Networks Training via Randomized Sparse Computations. (arXiv:2210.10737v1 [cs.LG])
    The training of graph neural networks (GNNs) is extremely time consuming because sparse graph-based operations are hard to accelerate in hardware. Prior art explores trading off computational precision to reduce the time complexity via sampling-based approximation. Based on this idea, previous works successfully accelerate dense-matrix-based operations (e.g., convolution and linear layers) with negligible accuracy drop. However, unlike dense matrices, sparse matrices are stored in irregular data formats in which each row/column may have a different number of non-zero entries. Thus, compared to the dense counterpart, approximating sparse operations poses two unique challenges: (1) we cannot directly control the efficiency of an approximated sparse operation, since the computation is only executed on non-zero entries; (2) sub-sampling sparse matrices is much more inefficient due to the irregular data format. To address these issues, our key idea is to control the accuracy-efficiency trade-off by optimizing the allocation of computation resources layer-wise and epoch-wise. Specifically, for the first challenge, we customize the computation resources assigned to different sparse operations while keeping the total resources below a certain budget. For the second challenge, we cache previously sampled sparse matrices to reduce the epoch-wise sampling overhead. Finally, we propose a switching mechanism to improve the generalization of GNNs trained with approximated operations. To this end, we propose Randomized Sparse Computation (RSC), which for the first time demonstrates the potential of training GNNs with approximated operations. In practice, RSC can achieve up to an $11.6\times$ speedup for a single sparse operation and a $1.6\times$ end-to-end wall-clock time speedup with negligible accuracy drop.
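    The primitive behind such sampling-based approximation can be sketched with norm-proportional column/row sampling of a matrix product (the classical randomized matrix-multiplication estimator); the layer-wise budget allocation, caching, and switching mechanism described above are omitted.

        # Approximate A @ B by sampling k columns of A / rows of B with
        # probabilities proportional to their norm products (unbiased estimator).
        import numpy as np

        def sampled_matmul(A: np.ndarray, B: np.ndarray, k: int) -> np.ndarray:
            norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
            p = norms / norms.sum()
            idx = np.random.choice(A.shape[1], size=k, replace=True, p=p)
            scale = 1.0 / (k * p[idx])                 # unbiasedness correction
            return (A[:, idx] * scale) @ B[idx, :]

        A, B = np.random.randn(64, 256), np.random.randn(256, 32)
        approx = sampled_matmul(A, B, k=64)  # trade accuracy for fewer multiplies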
    Provably Safe Reinforcement Learning via Action Projection using Reachability Analysis and Polynomial Zonotopes. (arXiv:2210.10691v1 [cs.RO])
    While reinforcement learning produces very promising results for many applications, its main disadvantage is the lack of safety guarantees, which prevents its use in safety-critical systems. In this work, we address this issue with a safety shield for nonlinear continuous systems that solve reach-avoid tasks. Our safety shield prevents a reinforcement learning agent from applying potentially unsafe actions by projecting each proposed action to the closest safe action. This approach is called action projection and is implemented via mixed-integer optimization. The safety constraints for action projection are obtained by applying parameterized reachability analysis using polynomial zonotopes, which makes it possible to accurately capture the nonlinear effects of the actions on the system. In contrast to other state-of-the-art approaches to action projection, our safety shield can efficiently handle input constraints and dynamic obstacles, eases the incorporation of the spatial robot dimensions into the safety constraints, guarantees robust safety despite process noise and measurement errors, and is well suited for high-dimensional systems, as we demonstrate on several challenging benchmark systems.
    A Framework for Undergraduate Data Collection Strategies for Student Support Recommendation Systems in Higher Education. (arXiv:2210.10657v1 [cs.IR])
    Understanding which student support strategies mitigate dropout and improve student retention is an important part of modern higher-education research. One of the largest challenges institutions of higher learning currently face is the scalability of student support. Part of this is due to the shortage of staff addressing the needs of students, and the referral pathways that must follow in order to provide timely student support. This is further complicated by the difficulty of these referrals, especially as students are often faced with a combination of administrative, academic, social, and socio-economic challenges. A possible solution to this problem is a combination of student outcome predictions and algorithmic recommender systems within the context of higher education. While much effort and detail has gone into explaining algorithmic decision making in this context, there is still a need to develop data collection strategies. Therefore, the purpose of this paper is to outline a data collection framework specific to recommender systems within this context, in order to reduce collection biases, understand student characteristics, and find an ideal way to infer optimal influences on the student journey. If confirmation biases, challenges in data sparsity, and the type of information to collect from students are not addressed, they will have detrimental effects on attempts to assess and evaluate the effects of these systems within higher education.
    Two-level Data Augmentation for Calibrated Multi-view Detection. (arXiv:2210.10756v1 [cs.CV])
    Data augmentation has proven useful for improving model generalization and performance. Yet, while it is commonly applied in computer vision applications, it is rarely used in multi-view systems. Indeed, geometric data augmentation can break the alignment among views. This is problematic, since multi-view data tend to be scarce and expensive to annotate. In this work we propose to solve this issue by introducing a new multi-view data augmentation pipeline that preserves the alignment among views. In addition to traditional augmentation of the input image, we also propose a second level of augmentation applied directly at the scene level. When combined with our simple multi-view detection model, our two-level augmentation pipeline outperforms all existing baselines by a significant margin on the two main multi-view multi-person detection datasets, WILDTRACK and MultiviewX.
    Autoregressive Generative Modeling with Noise Conditional Maximum Likelihood Estimation. (arXiv:2210.10715v1 [cs.LG])
    We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional likelihood of the data under the model, we maximize a family of \textit{noise conditional} likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical \textit{covariate shift} problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Frechet Inception distance (FID) -- from 37.50 to 12.09 on the CIFAR-10 dataset.
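    A minimal sketch of the training objective, assuming a model exposing a `log_prob(x, sigma)` interface (an assumption for illustration): perturb each example with a noise level drawn from a continuum and maximize the noise conditional likelihood.

        # Noise conditional MLE: maximize likelihoods of data perturbed by a
        # continuum of noise levels, with the level fed in as conditioning.
        import torch

        def noise_conditional_nll(model, x, sigma_min=0.01, sigma_max=1.0):
            # sample one noise level per example from a log-uniform continuum
            u = torch.rand(x.size(0), 1)
            sigma = sigma_min * (sigma_max / sigma_min) ** u
            x_noisy = x + sigma * torch.randn_like(x)
            # model.log_prob(x, sigma) is an assumed conditional interface
            return -model.log_prob(x_noisy, sigma).mean()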
    POGD: Gradient Descent with New Stochastic Rules. (arXiv:2210.10654v1 [cs.LG])
    We introduce Particle Optimized Gradient Descent (POGD), an algorithm based on gradient descent that integrates the particle swarm optimization (PSO) principle into its iterations. Experiments indicate that the algorithm has an adaptive learning ability. The experiments in this paper mainly focus on the training speed needed to reach a target value and on the ability to avoid local minima. They are conducted with convolutional neural network (CNN) image classification on the MNIST and CIFAR-10 datasets.
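    A one-particle sketch of how a PSO term can enter a gradient update (our reading of the idea, not the paper's exact rule; the quadratic objective is purely for the demo):

        import numpy as np

        def pogd(f, grad, x0, steps=200, lr=0.1, inertia=0.7, c1=0.5):
            """PSO-flavoured gradient descent for a single particle: the
            velocity mixes inertia, stochastic attraction to the best
            point seen so far, and a plain gradient step."""
            rng = np.random.default_rng(0)
            x = np.asarray(x0, dtype=float)
            v = np.zeros_like(x)
            best_x, best_f = x.copy(), f(x)
            for _ in range(steps):
                r1 = rng.random(x.shape)  # stochastic PSO weight
                v = inertia * v + c1 * r1 * (best_x - x) - lr * grad(x)
                x = x + v
                if f(x) < best_f:
                    best_f, best_x = f(x), x.copy()
            return x

        # Demo on f(x) = ||x||^2 (gradient 2x); converges near the origin.
        print(pogd(lambda x: float(x @ x), lambda x: 2 * x, [3.0, -2.0]))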
    Dynamic Privacy Budget Allocation Improves Data Efficiency of Differentially Private Gradient Descent. (arXiv:2101.07413v3 [cs.LG] UPDATED)
    Protecting privacy in learning while maintaining model performance has become increasingly critical in many applications that involve sensitive data. Private Gradient Descent (PGD) is a commonly used private learning framework, which adds noise to gradients according to the Differential Privacy protocol. Recent studies show that \emph{dynamic privacy schedules} of decreasing noise magnitudes can improve the loss at the final iteration, yet theoretical understanding of the effectiveness of such schedules and of their connections to optimization algorithms remains limited. In this paper, we provide a comprehensive analysis of noise influence in dynamic privacy schedules to answer these critical questions. We first present a dynamic noise schedule minimizing the utility upper bound of PGD, and show how the noise influence from each optimization step collectively impacts the utility of the final model. Our study also reveals how the impact of dynamic noise influence changes when momentum is used. We empirically show that the connection holds for general non-convex losses, and that the influence is greatly affected by the loss curvature.
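    As a concrete illustration of a dynamic schedule, a sketch of noisy clipped gradient descent with geometrically decaying noise (the decay shape and constants are placeholders, not the paper's optimal schedule, and real DP-SGD clips per-example gradients):

        import numpy as np

        def dp_gd(grad, theta0, T=100, lr=0.05, clip=1.0,
                  sigma_start=2.0, sigma_end=0.5):
            rng = np.random.default_rng(0)
            theta = np.asarray(theta0, dtype=float)
            for t in range(T):
                # Dynamic privacy schedule: noise magnitude decays over time.
                sigma = sigma_start * (sigma_end / sigma_start) ** (t / (T - 1))
                g = grad(theta)
                g = g / max(1.0, np.linalg.norm(g) / clip)      # clip the gradient
                g = g + rng.normal(0.0, sigma * clip, g.shape)  # add DP noise
                theta = theta - lr * g
            return theta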
    Binary Orthogonal Non-negative Matrix Factorization. (arXiv:2210.10660v1 [cs.LG])
    We propose a method for computing a binary orthogonal non-negative matrix factorization (BONMF) for clustering and classification. The method is tested on several representative real-world data sets. The numerical results confirm that the method has improved accuracy compared to related techniques. The proposed method is fast in training and classification, and space-efficient.
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v5 [cs.LG] UPDATED)
    Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging, let alone a good one. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.
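    A minimal sketch of the primal-dual recipe (the objective and constraint below are hypothetical stand-ins; the paper works with the empirical dual of the full constrained statistical problem):

        import torch

        def objective(model, x, y):
            return ((model(x) - y) ** 2).mean()      # hypothetical loss

        def constraint(model, x):
            return model(x).abs().mean()             # hypothetical statistic

        def primal_dual_step(model, opt, lam, x, y, level=0.5, eta=0.01):
            """Descend the Lagrangian in the weights, ascend in lam >= 0."""
            con = constraint(model, x)
            lagrangian = objective(model, x, y) + lam * (con - level)
            opt.zero_grad()
            lagrangian.backward()
            opt.step()
            with torch.no_grad():                    # projected dual ascent
                lam = torch.clamp(lam + eta * (con.detach() - level), min=0.0)
            return lam

        model = torch.nn.Linear(4, 1)
        opt = torch.optim.SGD(model.parameters(), lr=0.1)
        lam, (x, y) = torch.tensor(0.0), (torch.randn(32, 4), torch.randn(32, 1))
        for _ in range(100):
            lam = primal_dual_step(model, opt, lam, x, y)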
    Using Interventions to Improve Out-of-Distribution Generalization of Text-Matching Recommendation Systems. (arXiv:2210.10636v1 [cs.IR])
    Given a user's input text, text-matching recommender systems output relevant items by comparing the input text to available items' descriptions, such as product-to-product recommendation on e-commerce platforms. As users' interests and item inventory are expected to change, it is important for a text-matching system to generalize to data shifts, a task known as out-of-distribution (OOD) generalization. However, we find that the popular approach of fine-tuning a large, base language model on paired item relevance data (e.g., user clicks) can be counter-productive for OOD generalization. For a product recommendation task, fine-tuning obtains worse accuracy than the base model when recommending items in a new category or for a future time period. To explain this generalization failure, we consider an intervention-based importance metric, which shows that a fine-tuned model captures spurious correlations and fails to learn the causal features that determine the relevance between any two text inputs. Moreover, standard methods for causal regularization do not apply in this setting, because unlike in images, there exist no universally spurious features in a text-matching task (the same token may be spurious or causal depending on the text it is being matched to). For OOD generalization on text inputs, therefore, we highlight a different goal: avoiding high importance scores for certain features. We do so using an intervention-based regularizer that constrains the causal effect of any token on the model's relevance score to be similar to the base model. Results on an Amazon product dataset and three question recommendation datasets show that our proposed regularizer improves generalization for both in-distribution and OOD evaluation, especially in difficult scenarios when the base model is not accurate.
    Interpolation Consistency Training for Semi-Supervised Learning. (arXiv:1903.03825v5 [stat.ML] UPDATED)
    We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points which reduces overfitting to labeled points under high confidence values.
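    The core loss is compact enough to sketch (simplified: the published method also ramps up the consistency weight w over training and maintains the teacher as an exponential moving average of the student):

        import torch
        import torch.nn.functional as F

        def ict_loss(student, teacher, x_lab, y_lab, x_unlab, alpha=0.75, w=1.0):
            sup = F.cross_entropy(student(x_lab), y_lab)       # supervised term
            lam = torch.distributions.Beta(alpha, alpha).sample().item()
            idx = torch.randperm(x_unlab.size(0))
            x_mix = lam * x_unlab + (1 - lam) * x_unlab[idx]   # mixup of inputs
            with torch.no_grad():                              # interpolated targets
                p = F.softmax(teacher(x_unlab), dim=1)
                target = lam * p + (1 - lam) * p[idx]
            cons = F.mse_loss(F.softmax(student(x_mix), dim=1), target)
            return sup + w * cons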
    Extending Graph Transformers with Quantum Computed Aggregation. (arXiv:2210.10610v1 [quant-ph])
    Recently, efforts have been made in the community to design new Graph Neural Networks (GNNs), as the limitations of Message Passing Neural Networks became more apparent. This led to the appearance of Graph Transformers using global graph features such as Laplacian Eigenmaps. In our paper, we introduce a GNN architecture where the aggregation weights are computed using the long-range correlations of a quantum system. These correlations are generated by translating the graph topology into the interactions of a set of qubits in a quantum computer. This work was inspired by the recent development of quantum processing units, which enable the computation of a new family of global graph features that would otherwise be out of reach for classical hardware. We give some theoretical insights about the potential benefits of this approach and benchmark our algorithm on standard datasets. Although not suited to all datasets, our model performs comparably to standard GNN architectures and points to a promising future for quantum-enhanced GNNs.
    Robust Regression with Highly Corrupted Data via Physics Informed Neural Networks. (arXiv:2210.10646v1 [cs.LG])
    Physics-informed neural networks (PINNs) have been proposed to solve two main classes of problems: data-driven solution and data-driven discovery of partial differential equations. Both tasks become prohibitive when the data are highly corrupted, for example due to sensor failures. We propose the Least Absolute Deviation based PINN (LAD-PINN) to reconstruct the solution and recover unknown parameters in PDEs, even if spurious data or outliers corrupt a large percentage of the observations. To further improve the accuracy of recovering hidden physics, the two-stage Median Absolute Deviation based PINN (MAD-PINN) is proposed, in which LAD-PINN is employed as an outlier detector followed by MAD screening out the highly corrupted data. The vanilla PINN or its variants can then be applied to exploit the remaining normal data. Through several examples, including Poisson's equation, the wave equation, and steady or unsteady Navier-Stokes equations, we illustrate the generalizability, accuracy and efficiency of the proposed algorithms for recovering governing equations from noisy and highly corrupted measurement data.
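    A sketch of the LAD idea on a toy 1D Poisson problem (boundary terms omitted; the network size, optimizer, and source term are arbitrary choices for illustration):

        import torch

        net = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                                  torch.nn.Linear(64, 1))
        opt = torch.optim.Adam(net.parameters(), lr=1e-3)

        x_obs = torch.rand(50, 1)                    # hypothetical observations
        u_obs = torch.sin(torch.pi * x_obs) + 0.01 * torch.randn(50, 1)
        u_obs[:5] += 5.0                             # a few gross outliers

        for _ in range(2000):
            x = torch.rand(128, 1, requires_grad=True)
            u = net(x)
            du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
            d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
            f = -torch.pi ** 2 * torch.sin(torch.pi * x)    # PDE source term
            pde_loss = ((d2u - f) ** 2).mean()              # physics residual
            data_loss = (net(x_obs) - u_obs).abs().mean()   # L1 data term (LAD)
            loss = pde_loss + data_loss
            opt.zero_grad(); loss.backward(); opt.step()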
    Simulated Contextual Bandits for Personalization Tasks from Recommendation Datasets. (arXiv:2210.10631v1 [cs.IR])
    We propose a method for generating simulated contextual bandit environments for personalization tasks from recommendation datasets like MovieLens, Netflix, Last.fm, Million Song, etc. This allows for personalization environments to be developed based on real-life data to reflect the nuanced nature of real-world user interactions. The obtained environments can be used to develop methods for solving personalization tasks, algorithm benchmarking, model simulation, and more. We demonstrate our approach with numerical examples on MovieLens and IMDb datasets.
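    A minimal sketch of how such an environment might be assembled from a ratings matrix (the low-rank imputation and the context definition are our assumptions for illustration):

        import numpy as np

        class RatingsBanditEnv:
            """Contexts are latent user vectors, arms are items, and the
            reward is the (imputed) rating of the chosen item."""
            def __init__(self, ratings, k=10, seed=0):
                self.rng = np.random.default_rng(seed)
                u, s, vt = np.linalg.svd(np.nan_to_num(ratings),
                                         full_matrices=False)
                k = min(k, len(s))
                self.users = u[:, :k] * s[:k]            # contexts X_t
                self.full = self.users @ vt[:k]          # imputed reward table

            def reset(self):
                self.user = self.rng.integers(len(self.users))
                return self.users[self.user]

            def step(self, action):
                return self.full[self.user, action]      # reward R_t

        env = RatingsBanditEnv(np.random.rand(100, 50))
        ctx = env.reset()
        reward = env.step(action=3)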
    Digital Human Interactive Recommendation Decision-Making Based on Reinforcement Learning. (arXiv:2210.10638v1 [cs.IR])
    Digital human recommendation systems have been developed to help customers find their favorite products and are playing an active role in various recommendation contexts. Catching and learning a customer's preferences at the right time and meeting their exact requirements are crucial in digital human recommendation. We design a novel, practical digital human interactive recommendation agent framework based on reinforcement learning that improves the efficiency of interactive recommendation decision-making by leveraging both digital human features and the strengths of reinforcement learning. The proposed framework learns dynamically from immediate interactions between the digital human and customers through state-of-the-art reinforcement learning algorithms, combined with multimodal and graph embeddings, to improve the accuracy of personalization and thus enable the digital human agent to actively catch a customer's attention in a timely manner. Experiments on real business data show that this framework can provide better personalized customer engagement and better customer experiences.
    Towards Accurate Subgraph Similarity Computation via Neural Graph Pruning. (arXiv:2210.10643v1 [cs.LG])
    Subgraph similarity search, one of the core problems in graph search, concerns whether a target graph approximately contains a query graph. The problem is recently touched by neural methods. However, current neural methods do not consider pruning the target graph, though pruning is critically important in traditional calculations of subgraph similarities. One obstacle to applying pruning in neural methods is the discrete property of pruning. In this work, we convert graph pruning to a problem of node relabeling and then relax it to a differentiable problem. Based on this idea, we further design a novel neural network to approximate a type of subgraph distance: the subgraph edit distance (SED). In particular, we construct the pruning component using a neural structure, and the entire model can be optimized end-to-end. In the design of the model, we propose an attention mechanism to leverage the information about the query graph and guide the pruning of the target graph. Moreover, we develop a multi-head pruning strategy such that the model can better explore multiple ways of pruning the target graph. The proposed model establishes new state-of-the-art results across seven benchmark datasets. Extensive analysis of the model indicates that the proposed model can reasonably prune the target graph for SED computation. The implementation of our algorithm is released at our Github repo: https://github.com/tufts-ml/Prune4SED.
    Review of the state of the art in autonomous artificial intelligence. (arXiv:2210.10659v1 [cs.AI])
    This article presents a new design for autonomous artificial intelligence (AI), based on state-of-the-art algorithms, and describes a new autonomous AI system called AutoAI. The methodology is used to assemble the design founded on self-improved algorithms that use new and emerging sources of data (NEFD). The objective of the article is to conceptualise the design of a novel AutoAI algorithm. The conceptual approach is used to advance towards building new and improved algorithms. The article integrates and consolidates the findings from existing literature and advances the AutoAI design into (1) using new and emerging sources of data for teaching and training AI algorithms and (2) enabling AI algorithms to use automated tools for training new and improved algorithms. This approach goes beyond the state-of-the-art in AI algorithms and suggests a design that enables autonomous algorithms to self-optimise and self-adapt and, on a higher level, be capable of self-procreation.
    Multi-Modal Recommendation System with Auxiliary Information. (arXiv:2210.10652v1 [cs.IR])
    Context-aware recommendation systems improve upon classical recommender systems by including, in the modelling, a user's behaviour. Research into context-aware recommendation systems has previously only considered the sequential ordering of items as contextual information. However, there is a wealth of unexploited additional multi-modal information available in auxiliary knowledge related to items. This study extends the existing research by evaluating a multi-modal recommendation system that exploits the inclusion of comprehensive auxiliary knowledge related to an item. The empirical results explore extracting vector representations (embeddings) from unstructured and structured data using data2vec. The fused embeddings are then used to train several state-of-the-art transformer architectures for sequential user-item representations. The analysis of the experimental results shows a statistically significant improvement in prediction accuracy, which confirms the effectiveness of including auxiliary information in a context-aware recommendation system. We report a 4% and 11% increase in the NDCG score for long and short user sequence datasets, respectively.
    Anytime-valid off-policy inference for contextual bandits. (arXiv:2210.10768v1 [stat.ME])
    Contextual bandits are a modern staple tool for active sequential experimentation in the tech industry. They involve online learning algorithms that adaptively (over time) learn policies to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes many unnecessary assumptions made in past work, significantly improving on them theoretically and empirically. Our methods remain valid in very general settings, and can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may itself be changing (due to learning), and even if the context distributions are drifting over time. More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire CDF of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) make only nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of the reward and weight distributions. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.
    Neural Information Squeezer for Causal Emergence. (arXiv:2201.10154v2 [cs.LG] UPDATED)
    The classic studies of causal emergence have revealed that in some Markovian dynamical systems, far stronger causal connections can be found in higher-level descriptions than in lower-level ones of the same systems if we coarse-grain the system states in an appropriate way. However, identifying this emergent causality from data remains a hard, unsolved problem because the correct coarse-graining strategy cannot easily be found. This paper proposes a general machine learning framework called Neural Information Squeezer to automatically extract the effective coarse-graining strategy and the macro-state dynamics, as well as identify causal emergence directly from time series data. By decomposing a coarse-graining operation into two processes, information conversion and information dropping out, we can not only exactly control the width of the information channel but also analytically derive some important properties, including the exact expression of the effective information of a macro-dynamics. We also show how our framework can extract the dynamics at different levels and identify causal emergence from data in several example systems.
    A Regularity Theory for Static Schr\"odinger Equations on $\mathbb{R}^d$ in Spectral Barron Spaces. (arXiv:2201.10072v2 [math.AP] UPDATED)
    Spectral Barron spaces have received considerable interest recently, as they form the natural function space for the approximation theory of two-layer neural networks with a dimension-free convergence rate. In this paper we study the regularity of solutions to the whole-space static Schr\"odinger equation in spectral Barron spaces. We prove that if the source of the equation lies in the spectral Barron space $\mathcal{B}^s(\mathbb{R}^d)$ and the potential function, admitting a non-negative lower bound, decomposes as a positive constant plus a function in $\mathcal{B}^s(\mathbb{R}^d)$, then the solution lies in the spectral Barron space $\mathcal{B}^{s+2}(\mathbb{R}^d)$.
    Towards Principled Disentanglement for Domain Generalization. (arXiv:2111.13839v4 [cs.LG] UPDATED)
    A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data, in part due to spurious correlations. To tackle this challenge, we first formalize the OOD generalization problem as constrained optimization, called Disentanglement-constrained Domain Generalization (DDG). We relax this non-trivial constrained optimization problem to a tractable form with finite-dimensional parameterization and empirical approximation. Then a theoretical analysis of the extent to which the above transformations deviate from the original problem is provided. Based on the transformation, we propose a primal-dual algorithm for joint representation disentanglement and domain generalization. In contrast to traditional approaches based on domain adversarial training and domain labels, DDG jointly learns semantic and variation encoders for disentanglement, enabling flexible manipulation and augmentation of training data. DDG aims to learn intrinsic representations of semantic concepts that are invariant to nuisance factors and generalizable across domains. Comprehensive experiments on popular benchmarks show that DDG can achieve competitive OOD performance and uncover interpretable salient structures within data.
    A Graph-based Methodology for the Sensorless Estimation of Road Traffic Profiles. (arXiv:2201.04968v2 [cs.LG] UPDATED)
    Traffic forecasting models rely on data that needs to be sensed, processed, and stored. This requires the deployment and maintenance of traffic sensing infrastructure, often leading to unaffordable monetary costs. The lack of sensed locations can be complemented with synthetic data simulations that further lower the economic investment needed for traffic monitoring. One of the most common data generative approaches consists of producing real-like traffic patterns, according to data distributions from analogous roads. The process of detecting roads with similar traffic is the key point of these systems. However, without collecting data at the target location no flow metrics can be employed for this similarity-based search. We present a method to discover locations among those with available traffic data by inspecting topological features. These features are extracted from domain-specific knowledge as numerical representations (embeddings) to compare different locations and eventually find roads with analogous daily traffic profiles based on the similarity between embeddings. The performance of this novel selection system is examined and compared to simpler traffic estimation approaches. After finding a similar source of data, a generative method is used to synthesize traffic profiles. Depending on the resemblance of the traffic behavior at the sensed road, the generation method can be fed with data from one road only. Several generation approaches are analyzed in terms of the precision of the synthesized samples. Above all, this work intends to stimulate further research efforts towards enhancing the quality of synthetic traffic samples and thereby, reducing the need for sensing infrastructure.
    Transformers Learn Shortcuts to Automata. (arXiv:2210.10749v1 [cs.LG])
    Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? We investigate this question in the setting of learning automata, discrete dynamical systems naturally suited to recurrent modeling and expressing algorithmic tasks. Our theoretical results completely characterize shortcut solutions, whereby a shallow Transformer with only $o(T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$. By representing automata using the algebraic structure of their underlying transformation semigroups, we obtain $O(\log T)$-depth simulators for all automata and $O(1)$-depth simulators for all automata whose associated groups are solvable. Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We further investigate the brittleness of these solutions and propose potential mitigations.
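    The logarithmic-depth construction can be mimicked classically with a parallel-prefix (doubling) scan over transition matrices; a sketch of the shortcut principle, not the paper's Transformer construction:

        import numpy as np

        def prefix_states(transitions, q0):
            """States after every prefix of the input, computed in
            O(log T) rounds; the inner loop is parallelizable, which is
            what a shallow non-recurrent model can exploit."""
            mats, shift = list(transitions), 1
            n = len(mats)
            while shift < n:
                new = list(mats)
                for t in range(shift, n):
                    new[t] = mats[t - shift] @ mats[t]  # compose windows
                mats, shift = new, 2 * shift
            return [q0 @ m for m in mats]

        # Two-state parity automaton: each 'flip' symbol toggles the state.
        flip, stay = np.array([[0, 1], [1, 0]]), np.eye(2, dtype=int)
        word = [flip, stay, flip, flip]
        print(prefix_states(word, np.array([1, 0])))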
    Fast Differentiable Matrix Square Root and Inverse Square Root. (arXiv:2201.12543v2 [cs.CV] UPDATED)
    Computing the matrix square root and its inverse in a differentiable manner is important in a variety of computer vision tasks. Previous methods either adopt the Singular Value Decomposition (SVD) to explicitly factorize the matrix or use the Newton-Schulz iteration (NS iteration) to derive the approximate solution. However, both methods are not computationally efficient enough in either the forward pass or the backward pass. In this paper, we propose two more efficient variants to compute the differentiable matrix square root and the inverse square root. For the forward propagation, one method is to use Matrix Taylor Polynomial (MTP), and the other method is to use Matrix Pad\'e Approximants (MPA). The backward gradient is computed by iteratively solving the continuous-time Lyapunov equation using the matrix sign function. A series of numerical tests show that both methods yield considerable speed-up compared with the SVD or the NS iteration. Moreover, we validate the effectiveness of our methods in several real-world applications, including de-correlated batch normalization, second-order vision transformer, global covariance pooling for large-scale and fine-grained recognition, attentive covariance pooling for video recognition, and neural style transfer. The experimental results demonstrate that our methods can also achieve competitive and even slightly better performances. The Pytorch implementation is available at https://github.com/KingJamesSong/FastDifferentiableMatSqrt
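    For reference, the classical Newton-Schulz baseline the paper improves upon is easy to sketch (the paper's own MTP/MPA forward pass and Lyapunov-equation backward pass are more involved):

        import torch

        def sqrt_newton_schulz(A, iters=15):
            """Differentiable matrix square root of an SPD matrix via the
            coupled Newton-Schulz iteration (normalized for convergence)."""
            norm = A.norm()
            Y = A / norm
            Z = torch.eye(A.shape[0], dtype=A.dtype)
            I3 = 3.0 * torch.eye(A.shape[0], dtype=A.dtype)
            for _ in range(iters):
                T = 0.5 * (I3 - Z @ Y)
                Y, Z = Y @ T, T @ Z     # Y -> sqrt(A/norm), Z -> its inverse
            return Y * norm.sqrt()

        A = torch.tensor([[4.0, 0.0], [0.0, 9.0]])
        print(sqrt_newton_schulz(A))    # approx [[2, 0], [0, 3]]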
    Adapting to Mixing Time in Stochastic Optimization with Markovian Data. (arXiv:2202.04428v2 [cs.LG] UPDATED)
    We consider stochastic optimization problems where data is drawn from a Markov chain. Existing methods for this setting crucially rely on knowing the mixing time of the chain, which in real-world applications is usually unknown. We propose the first optimization method that does not require the knowledge of the mixing time, yet obtains the optimal asymptotic convergence rate when applied to convex problems. We further show that our approach can be extended to: (i) finding stationary points in non-convex optimization with Markovian data, and (ii) obtaining better dependence on the mixing time in temporal difference (TD) learning; in both cases, our method is completely oblivious to the mixing time. Our method relies on a novel combination of multi-level Monte Carlo (MLMC) gradient estimation together with an adaptive learning method.
    Generative Adversarial User Privacy in Lossy Single-Server Information Retrieval. (arXiv:2012.03902v3 [cs.LG] UPDATED)
    We propose to extend the concept of private information retrieval by allowing for distortion in the retrieval process and relaxing the perfect privacy requirement at the same time. In particular, we study the trade-off between download rate, distortion, and user privacy leakage, and show that in the limit of large file sizes this trade-off can be captured via a novel information-theoretical formulation for datasets with a known distribution. Moreover, for scenarios where the statistics of the dataset are unknown, we propose a new deep learning framework by leveraging a generative adversarial network approach, which allows the user to learn efficient schemes from the data itself. We evaluate the performance of the scheme on a synthetic Gaussian dataset as well as on the MNIST, CIFAR-10, and LSUN datasets. For the MNIST, CIFAR-10, and LSUN datasets, the data-driven approach significantly outperforms a nonlearning-based scheme which combines source coding with the download of multiple files.
    Towards Understanding the Condensation of Neural Networks at Initial Training. (arXiv:2105.11686v6 [cs.LG] UPDATED)
    Empirical works show that for ReLU neural networks (NNs) with small initialization, input weights of hidden neurons (the input weight of a hidden neuron consists of the weight from its input layer to the hidden neuron and its bias term) condense on isolated orientations. The condensation dynamics implies that the training implicitly regularizes a NN towards one with a much smaller effective size. In this work, we illustrate the formation of the condensation in multi-layer fully connected NNs and show that the maximal number of condensed orientations in the initial training stage is twice the multiplicity of the activation function, where "multiplicity" indicates the multiple roots of activation function at origin. Our theoretical analysis confirms experiments for two cases, one is for the activation function of multiplicity one with arbitrary dimension input, which contains many common activation functions, and the other is for the layer with one-dimensional input and arbitrary multiplicity. This work makes a step towards understanding how small initialization leads NNs to condensation at the initial training stage.
    On the Feasibility of Cross-Task Transfer with Model-Based Reinforcement Learning. (arXiv:2210.10763v1 [cs.LG])
    Reinforcement Learning (RL) algorithms can solve challenging control problems directly from image observations, but they often require millions of environment interactions to do so. Recently, model-based RL algorithms have greatly improved sample-efficiency by concurrently learning an internal model of the world and supplementing real environment interactions with imagined rollouts for policy improvement. However, learning an effective model of the world from scratch is challenging, in stark contrast to humans, who rely heavily on world understanding and visual cues for learning new skills. In this work, we investigate whether internal models learned by modern model-based RL algorithms can be leveraged to solve new, distinctly different tasks faster. We propose Model-Based Cross-Task Transfer (XTRA), a framework for sample-efficient online RL with scalable pretraining and finetuning of learned world models. By offline multi-task pretraining and online cross-task finetuning, we achieve substantial improvements on the Atari100k benchmark over a baseline trained from scratch; we improve the mean performance of the model-based algorithm EfficientZero by 23%, and by as much as 71% in some instances. Project page: https://nicklashansen.github.io/xtra.
    RecipeMind: Guiding Ingredient Choices from Food Pairing to Recipe Completion using Cascaded Set Transformer. (arXiv:2210.10628v1 [cs.IR])
    We propose a computational approach for recipe ideation, a downstream task that helps users select and gather ingredients for creating dishes. To perform this task, we developed RecipeMind, a food affinity score prediction model that quantifies the suitability of adding an ingredient to a set of other ingredients. We constructed a large-scale dataset containing ingredient co-occurrence based scores to train and evaluate RecipeMind on food affinity score prediction. Deployed in recipe ideation, RecipeMind helps the user expand an initial set of ingredients by suggesting additional ingredients. Experiments and qualitative analysis show RecipeMind's potential in fulfilling its assistive role in the cuisine domain.
    SML:Enhance the Network Smoothness with Skip Meta Logit for CTR Prediction. (arXiv:2210.10725v1 [cs.IR])
    In light of the smoothness property brought by skip connections in ResNet, this paper proposes the Skip Logit, which introduces the skip connection mechanism to arbitrary DNN dimensions while retaining ResNet-like properties. Meta Tanh Normalization (MTN) is designed to learn variance information and stabilize the training process. With these delicate designs, our Skip Meta Logit (SML) brings incremental boosts to the performance of extensive state-of-the-art CTR prediction models on two real-world datasets. In the meantime, we prove that the optimization landscape of arbitrarily deep skip logit networks has no spurious local optima. Finally, SML can be easily added to building blocks and has delivered offline accuracy and online business metric gains on app-ads learning-to-rank systems at TikTok.
    Geometric Deep Learning for the Assessment of Thrombosis Risk in the Left Atrial Appendage. (arXiv:2210.10563v1 [cs.LG])
    The assessment of left atrial appendage (LAA) thrombogenesis has experienced major advances with the adoption of patient-specific computational fluid dynamics (CFD) simulations. Nonetheless, due to the vast computational resources and long execution times required by fluid dynamics solvers, there is an ever-growing body of work aiming to develop surrogate models of fluid flow simulations based on neural networks. The present study builds on this foundation by developing a deep learning (DL) framework capable of predicting the endothelial cell activation potential (ECAP), linked to the risk of thrombosis, solely from the patient-specific LAA geometry. To this end, we leveraged recent advancements in Geometric DL, which seamlessly extend the unparalleled potential of convolutional neural networks (CNNs) to non-Euclidean data such as meshes. The model was trained with a dataset combining 202 synthetic and 54 real LAA geometries, predicting the ECAP distributions instantaneously, with an average mean absolute error of 0.563. Moreover, the resulting framework manages to predict the anatomical features related to higher ECAP values even when trained exclusively on synthetic cases.
    OpenEarthMap: A Benchmark Dataset for Global High-Resolution Land Cover Mapping. (arXiv:2210.10732v1 [cs.CV])
    We introduce OpenEarthMap, a benchmark dataset, for global high-resolution land cover mapping. OpenEarthMap consists of 2.2 million segments of 5000 aerial and satellite images covering 97 regions from 44 countries across 6 continents, with manually annotated 8-class land cover labels at a 0.25--0.5m ground sampling distance. Semantic segmentation models trained on the OpenEarthMap generalize worldwide and can be used as off-the-shelf models in a variety of applications. We evaluate the performance of state-of-the-art methods for unsupervised domain adaptation and present challenging problem settings suitable for further technical development. We also investigate lightweight models using automated neural architecture search for limited computational resources and fast mapping. The dataset is available at https://open-earth-map.org.
    HyperMiner: Topic Taxonomy Mining with Hyperbolic Embedding. (arXiv:2210.10625v1 [cs.IR])
    Embedded topic models are able to learn interpretable topics even with large and heavy-tailed vocabularies. However, they generally hold the Euclidean embedding space assumption, leading to a basic limitation in capturing hierarchical relations. To this end, we present a novel framework that introduces hyperbolic embeddings to represent words and topics. With the tree-likeness property of hyperbolic space, the underlying semantic hierarchy among words and topics can be better exploited to mine more interpretable topics. Furthermore, due to the superiority of hyperbolic geometry in representing hierarchical data, tree-structure knowledge can also be naturally injected to guide the learning of a topic hierarchy. Therefore, we further develop a regularization term based on the idea of contrastive learning to inject prior structural knowledge efficiently. Experiments on both topic taxonomy discovery and document representation demonstrate that the proposed framework achieves improved performance against existing embedded topic models.
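    The tree-likeness argument rests on hyperbolic distances growing rapidly toward the boundary; a sketch using the standard Poincare-ball distance (the paper's exact parametrization may differ):

        import torch

        def poincare_distance(u, v, eps=1e-6):
            """Geodesic distance between points in the unit Poincare ball."""
            sq = torch.sum((u - v) ** 2, dim=-1)
            uu = torch.clamp(1 - torch.sum(u * u, dim=-1), min=eps)
            vv = torch.clamp(1 - torch.sum(v * v, dim=-1), min=eps)
            return torch.acosh(1 + 2 * sq / (uu * vv))

        # Points near the boundary are far apart, which lets tree-like
        # hierarchies embed with low distortion.
        u, v = torch.tensor([0.0, 0.0]), torch.tensor([0.9, 0.0])
        print(poincare_distance(u, v))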
    Deep Multi-Representation Model for Click-Through Rate Prediction. (arXiv:2210.10664v1 [cs.IR])
    Click-Through Rate (CTR) prediction is a crucial task in recommender systems, and it has gained considerable attention in the past few years. Recent research emphasizes obtaining meaningful and powerful representations by mining low- and high-order feature interactions using various components such as Deep Neural Networks (DNNs), CrossNets, or transformer blocks. In this work, we propose the Deep Multi-Representation model (DeepMR) that jointly trains a mixture of two powerful feature representation learning components, namely DNNs and multi-head self-attentions. Furthermore, DeepMR integrates the novel residual with zero initialization (ReZero) connections into the DNN and the multi-head self-attention components for learning superior input representations. Experiments on three real-world datasets show that the proposed model significantly outperforms all state-of-the-art models in the task of click-through rate prediction.
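    The ReZero connection itself is simple to sketch: each residual branch is scaled by a learnable scalar initialized at zero, so every block starts as the identity (how DeepMR wires these into its components is the paper's contribution):

        import torch

        class ReZeroBlock(torch.nn.Module):
            """y = x + alpha * f(x), with alpha initialized to zero."""
            def __init__(self, f):
                super().__init__()
                self.f = f
                self.alpha = torch.nn.Parameter(torch.zeros(1))

            def forward(self, x):
                return x + self.alpha * self.f(x)

        block = ReZeroBlock(torch.nn.Sequential(
            torch.nn.Linear(16, 16), torch.nn.ReLU(), torch.nn.Linear(16, 16)))
        print(block(torch.randn(4, 16)).shape)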
    Adversarial De-confounding in Individualised Treatment Effects Estimation. (arXiv:2210.10530v1 [cs.LG])
    Observational studies have recently received significant attention from the machine learning community due to the increasingly available non-experimental observational data and the limitations of the experimental studies, such as considerable cost, impracticality, small and less representative sample sizes, etc. In observational studies, de-confounding is a fundamental problem of individualised treatment effects (ITE) estimation. This paper proposes disentangled representations with adversarial training to selectively balance the confounders in the binary treatment setting for the ITE estimation. The adversarial training of treatment policy selectively encourages treatment-agnostic balanced representations for the confounders and helps to estimate the ITE in the observational studies via counterfactual inference. Empirical results on synthetic and real-world datasets, with varying degrees of confounding, prove that our proposed approach improves the state-of-the-art methods in achieving lower error in the ITE estimation.
    Active Learning for Imbalanced Civil Infrastructure Data. (arXiv:2210.10586v1 [cs.CV])
    Aging civil infrastructures are closely monitored by engineers for damage and critical defects. As the manual inspection of such large structures is costly and time-consuming, we are working towards fully automating the visual inspections to support the prioritization of maintenance activities. To that end we combine recent advances in drone technology and deep learning. Unfortunately, annotation costs are incredibly high as our proprietary civil engineering dataset must be annotated by highly trained engineers. Active learning is, therefore, a valuable tool to optimize the trade-off between model performance and annotation costs. Our use-case differs from the classical active learning setting as our dataset suffers from heavy class imbalance and consists of a much larger already labeled data pool than other active learning research. We present a novel method capable of operating in this challenging setting by replacing the traditional active learning acquisition function with an auxiliary binary discriminator. We experimentally show that our novel method outperforms the best-performing traditional active learning method (BALD) by 5% and 38% accuracy on CIFAR-10 and our proprietary dataset respectively.
    Scaling Laws for Reward Model Overoptimization. (arXiv:2210.10760v1 [cs.LG])
    In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
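    The best-of-n side of the setup is easy to sketch with a toy scalar stand-in (our illustration, not the paper's experiment: the proxy is the gold reward plus fixed per-candidate noise, so optimizing it harder widens the proxy-gold gap):

        import numpy as np

        def best_of_n_gap(sample, proxy_reward, gold_reward, n=16, trials=200):
            proxy_scores, gold_scores = [], []
            for _ in range(trials):
                cands = [sample() for _ in range(n)]
                best = max(cands, key=proxy_reward)    # optimize the proxy
                proxy_scores.append(proxy_reward(best))
                gold_scores.append(gold_reward(best))  # what actually matters
            return np.mean(proxy_scores), np.mean(gold_scores)

        rng = np.random.default_rng(0)
        gold = lambda x: -abs(x - 1.0)
        noise = {}
        proxy = lambda x: gold(x) + noise.setdefault(x, rng.normal(0, 0.5))
        print(best_of_n_gap(lambda: rng.normal(0, 2), proxy, gold))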
    Variational Model Perturbation for Source-Free Domain Adaptation. (arXiv:2210.10378v1 [cs.LG])
    We aim for source-free domain adaptation, where the task is to deploy a model pre-trained on source domains to target domains. The challenges stem from the distribution shift from the source to the target domain, coupled with the unavailability of any source data and labeled target data for optimization. Rather than fine-tuning the model by updating the parameters, we propose to perturb the source model to achieve adaptation to target domains. We introduce perturbations into the model parameters by variational Bayesian inference in a probabilistic framework. By doing so, we can effectively adapt the model to the target domain while largely preserving the discriminative ability. Importantly, we demonstrate the theoretical connection to learning Bayesian neural networks, which proves the generalizability of the perturbed model to target domains. To enable more efficient optimization, we further employ a parameter sharing strategy, which substantially reduces the learnable parameters compared to a fully Bayesian neural network. Our model perturbation provides a new probabilistic way for domain adaptation which enables efficient adaptation to target domains while maximally preserving knowledge in source models. Experiments on several source-free benchmarks under three different evaluation settings verify the effectiveness of the proposed variational model perturbation for source-free domain adaptation.
    The phase unwrapping of under-sampled interferograms using radial basis function neural networks. (arXiv:2210.10541v1 [physics.plasm-ph])
    Interferometry can measure the shape or the material density of a system that could not be measured otherwise by recording the difference between the phase change of a signal and a reference phase. This difference is always between $-\pi$ and $\pi$, while it is the absolute phase that is required to get a true measurement. There is a long history of methods designed to accurately recover this phase from the phase "wrapped" inside $]-\pi,\pi]$. However, noise and under-sampling limit the effectiveness of most techniques and require highly sophisticated algorithms that can process imperfect measurements. Ultimately, successfully analysing an interferogram amounts to pattern recognition, a task at which radial basis function neural networks truly excel. The proposed neural network is designed to unwrap the phase of two-dimensional interferograms where aliasing, stemming from under-resolved regions, and noise levels are significant. The neural network can be trained in parallel and in three stages, using gradient-based supervised learning. Parallelism allows it to handle relatively large data sets, but requires a supplemental step to synchronize the fully unwrapped phase across the different networks.
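    The wrapping problem itself is two lines of NumPy; classical itinerary-based unwrapping works only while sampling is adequate, which is exactly the regime the network is meant to go beyond:

        import numpy as np

        x = np.linspace(0, 4 * np.pi, 200)
        true_phase = 1.5 * x
        wrapped = np.angle(np.exp(1j * true_phase))  # what is recorded
        recovered = np.unwrap(wrapped)               # classical unwrapping

        print(np.allclose(recovered, true_phase))    # True: well-sampled
        # With under-sampling (phase jumps above pi between samples),
        # np.unwrap fails, motivating learned pattern-recognition methods.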
    Weakly Supervised Learning for Analyzing Political Campaigns on Facebook. (arXiv:2210.10669v1 [cs.CL])
    Social media platforms are currently the main channel for political messaging, allowing politicians to target specific demographics and adapt based on their reactions. However, making this communication transparent is challenging, as the messaging is tightly coupled with its intended audience and often echoed by multiple stakeholders interested in advancing specific policies. Our goal in this paper is to take a first step towards understanding these highly decentralized settings. We propose a weakly supervised approach to identify the stance and issue of political ads on Facebook and analyze how political campaigns target demographics by location, gender, or age. Furthermore, we analyze the temporal dynamics of political ads around election polls.
    Margin Optimal Classification Trees. (arXiv:2210.10567v1 [math.OC])
    In recent years there has been growing attention to interpretable machine learning models which can give explanatory insights on their behavior. Thanks to their interpretability, decision trees have been intensively studied for classification tasks, and due to the remarkable advances in mixed-integer programming (MIP), various approaches have been proposed to formulate the problem of training an Optimal Classification Tree (OCT) as a MIP model. We present a novel mixed-integer quadratic formulation for the OCT problem, which exploits the generalization capabilities of Support Vector Machines for binary classification. Our model, denoted as Margin Optimal Classification Tree (MARGOT), encompasses the use of maximum margin multivariate hyperplanes nested in a binary tree structure. To enhance the interpretability of our approach, we analyse two alternative versions of MARGOT, which include feature selection constraints inducing local sparsity of the hyperplanes. First, MARGOT has been tested on non-linearly separable synthetic datasets in 2-dimensional feature space to provide a graphical representation of the maximum margin approach. Finally, the proposed models have been tested on benchmark datasets from the UCI repository. The MARGOT formulation turns out to be easier to solve than other OCT approaches, and the generated tree better generalizes on new observations. The two interpretable versions are effective in selecting the most relevant features and maintaining good prediction quality.
    When to Ask for Help: Proactive Interventions in Autonomous Reinforcement Learning. (arXiv:2210.10765v1 [cs.LG])
    A long-term goal of reinforcement learning is to design agents that can autonomously interact and learn in the world. A critical challenge to such autonomy is the presence of irreversible states which require external assistance to recover from, such as when a robot arm has pushed an object off of a table. While standard agents require constant monitoring to decide when to intervene, we aim to design proactive agents that can request human intervention only when needed. To this end, we propose an algorithm that efficiently learns to detect and avoid states that are irreversible, and proactively asks for help in case the agent does enter them. On a suite of continuous control environments with unknown irreversible states, we find that our algorithm exhibits better sample- and intervention-efficiency compared to existing methods. Our code is publicly available at https://sites.google.com/view/proactive-interventions
    Targeted Adversarial Self-Supervised Learning. (arXiv:2210.10482v1 [cs.LG])
    Recently, unsupervised adversarial training (AT) has been extensively studied to attain robustness with models trained on unlabeled data. To this end, previous studies have applied existing supervised adversarial training techniques to self-supervised learning (SSL) frameworks. However, all have resorted to untargeted adversarial learning, as it is unclear how to obtain targeted adversarial examples in the SSL setting, which lacks label information. In this paper, we propose a novel targeted adversarial training method for SSL frameworks. Specifically, we propose a target selection algorithm for adversarial SSL frameworks; it is designed to select the most confusing sample for each given instance based on similarity and entropy, and to perturb the given instance toward the selected target sample. Our method significantly enhances the robustness of an SSL model without requiring large batches of images or additional models, unlike existing works aimed at the same goal. Moreover, our method is readily applicable to general SSL frameworks that use only positive pairs. We validate our method on benchmark datasets, on which it obtains superior robust accuracies, outperforming existing unsupervised adversarial training methods.
    Robustness of Demonstration-based Learning Under Limited Data Scenario. (arXiv:2210.10693v1 [cs.CL])
    Demonstration-based learning has shown great potential in stimulating pretrained language models' ability under the limited-data scenario. Simply augmenting the input with some demonstrations can significantly improve performance on few-shot NER. However, why such demonstrations are beneficial for the learning process remains unclear, since there is no explicit alignment between the demonstrations and the predictions. In this paper, we design pathological demonstrations by gradually removing intuitively useful information from the standard ones to take a deep dive into the robustness of demonstration-based sequence labeling, and we show that (1) demonstrations composed of random tokens still make the model a better few-shot learner; (2) the length of random demonstrations and the relevance of random tokens are the main factors affecting the performance; (3) demonstrations increase the confidence of model predictions on captured superficial patterns. We have publicly released our code at https://github.com/SALT-NLP/RobustDemo.
    Convexity Certificates from Hessians. (arXiv:2210.10430v1 [math.OC])
    The Hessian of a differentiable convex function is positive semidefinite. Therefore, checking the Hessian of a given function is a natural approach to certify convexity. However, implementing this approach is not straightforward since it requires a representation of the Hessian that allows its analysis. Here, we implement this approach for a class of functions that is rich enough to support classical machine learning. For this class of functions, it was recently shown how to compute computational graphs of their Hessians. We show how to check these graphs for positive semidefiniteness. We compare our implementation of the Hessian approach with the well-established disciplined convex programming (DCP) approach and prove that the Hessian approach is at least as powerful as the DCP approach for differentiable functions. Furthermore, we show for a state-of-the-art implementation of the DCP approach that, for differentiable functions, the Hessian approach is actually more powerful. That is, it can certify the convexity of a larger class of differentiable functions.
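    A numerical spot-check conveys the idea (this samples points and tests Hessian eigenvalues, a stand-in for the paper's symbolic analysis of the Hessian's computational graph; absence of a violation is not a proof):

        import torch

        def numerically_convex(f, dim, n_probes=100, tol=1e-8):
            for _ in range(n_probes):
                x = torch.randn(dim, dtype=torch.double)
                H = torch.autograd.functional.hessian(f, x)
                if torch.linalg.eigvalsh(H).min() < -tol:
                    return False          # certificate of non-convexity
            return True                   # no violation found

        # log-sum-exp is convex; the coordinatewise cube is not.
        print(numerically_convex(lambda x: torch.logsumexp(x, 0), dim=3))  # True
        print(numerically_convex(lambda x: (x ** 3).sum(), dim=3))         # False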
    Spectroscopic data de-noising via training-set-free deep learning method. (arXiv:2210.10494v1 [cond-mat.mtrl-sci])
    De-noising plays a crucial role in the post-processing of spectra. Machine learning-based methods show good performance in extracting intrinsic information from noisy data, but often require a high-quality training set that is typically inaccessible in real experimental measurements. Here, using spectra from angle-resolved photoemission spectroscopy (ARPES) as an example, we develop a de-noising method for extracting intrinsic spectral information without the need for a training set. This is possible because our method leverages the self-correlation information of the spectra themselves. It preserves the intrinsic energy band features and thus facilitates further analysis and processing. Moreover, since our method, unlike previous ones, is not limited by specific properties of a training set, it may well be extended to other fields and application scenarios where obtaining high-quality multidimensional training data is challenging.
    Deep neural network expressivity for optimal stopping problems. (arXiv:2210.10443v1 [math.PR])
    This article studies deep neural network expression rates for optimal stopping problems of discrete-time Markov processes on high-dimensional state spaces. A general framework is established in which the value function and continuation value of an optimal stopping problem can be approximated with error at most $\varepsilon$ by a deep ReLU neural network of size at most $\kappa d^{\mathfrak{q}} \varepsilon^{-\mathfrak{r}}$. The constants $\kappa,\mathfrak{q},\mathfrak{r} \geq 0$ do not depend on the dimension $d$ of the state space or the approximation accuracy $\varepsilon$. This proves that deep neural networks do not suffer from the curse of dimensionality when employed to solve optimal stopping problems. The framework covers, for example, exponential L\'evy models, discrete diffusion processes and their running minima and maxima. These results mathematically justify the use of deep neural networks for numerically solving optimal stopping problems and pricing American options in high dimensions.
    Using deep convolutional neural networks to classify poisonous and edible mushrooms found in China. (arXiv:2210.10351v1 [cs.CV])
    Because of their abundance of amino acids, polysaccharides, and many other nutrients that benefit human beings, mushrooms are deservedly popular as dietary cuisine both worldwide and in China. However, people who eat poisonous fungi by mistake may suffer from nausea, vomiting, mental disorders, acute anemia, or even death. Each year in China, around 8,000 people become sick and 70 die as a result of eating toxic mushrooms by mistake. Of the thousands of known mushroom species, only around 900 are edible, so without specialized knowledge the probability of eating a toxic mushroom by mistake is very high. Most people believe that bright colour is the defining characteristic of poisonous mushrooms, yet some poisonous species do not fit this trait. To prevent people from eating poisonous mushrooms, we propose to use deep learning methods to indicate whether a mushroom is toxic by analyzing hundreds of smartphone pictures of edible and toxic mushrooms. We crowdsource a mushroom image dataset that contains 250 images of poisonous mushrooms and 200 images of edible mushrooms. The Convolutional Neural Network (CNN) is a specialized type of artificial neural network that uses the convolution operation in place of general matrix multiplication in at least one of its layers; it can produce relatively precise results from large numbers of images and is thus well suited for our task. The experimental results demonstrate that the proposed model has high credibility and can provide a decision-making basis for the selection of edible fungi, so as to reduce the morbidity and mortality caused by eating poisonous mushrooms. We also open-source our hand-collected mushroom image dataset so that peer researchers can deploy their own models to advance poisonous mushroom identification.
    On Representing Mixed-Integer Linear Programs by Graph Neural Networks. (arXiv:2210.10759v1 [cs.LG])
    While mixed-integer linear programming (MILP) is NP-hard in general, practical MILP solvers have achieved a roughly 100-fold speedup in the past twenty years. Still, many classes of MILPs quickly become unsolvable as their sizes increase, motivating researchers to seek new acceleration techniques. Deep learning has produced strong empirical results here, many of them obtained by applying graph neural networks (GNNs) to decisions in various stages of the MILP solution process. This work identifies a fundamental limitation: there exist feasible and infeasible MILPs that all GNNs treat identically, indicating that GNNs lack the power to express general MILPs. We then show that, by restricting the MILPs to unfoldable ones or by adding random features, there exist GNNs that can reliably predict MILP feasibility, optimal objective values, and optimal solutions up to prescribed precision. We conducted small-scale numerical experiments to validate our theoretical findings.
    Language Model Decomposition: Quantifying the Dependency and Correlation of Language Models. (arXiv:2210.10289v1 [cs.CL])
    Pre-trained language models (LMs), such as BERT (Devlin et al., 2018) and its variants, have led to significant improvements on various NLP tasks in past years. However, a theoretical framework for studying their relationships is still missing. In this paper, we fill this gap by investigating the linear dependency between pre-trained LMs. The linear dependency of LMs is defined analogously to the linear dependency of vectors. We propose Language Model Decomposition (LMD) to represent an LM using a linear combination of other LMs as a basis, and we derive the closed-form solution. A goodness-of-fit metric for LMD, similar to the coefficient of determination, is defined and used to measure the linear dependency of a set of LMs. In experiments, we find that BERT and eleven BERT-like LMs are 91% linearly dependent. This observation suggests that current state-of-the-art (SOTA) LMs are highly "correlated". To further advance the SOTA, we need more diverse and novel LMs that are less dependent on existing ones.
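    A simplified sketch of the decomposition idea on raw embedding matrices (the paper defines LMD and its goodness-of-fit over the models themselves; the least-squares fit and R^2 here are our reduction for illustration):

        import numpy as np

        def lmd_goodness_of_fit(target, basis):
            """Fit target (v, d) as a linear map of stacked basis
            embeddings (all over a shared vocabulary of size v)."""
            X = np.concatenate(basis, axis=1)
            coef, *_ = np.linalg.lstsq(X, target, rcond=None)
            residual = target - X @ coef
            ss_res = np.sum(residual ** 2)
            ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
            return 1.0 - ss_res / ss_tot   # 1.0 = fully linearly dependent

        rng = np.random.default_rng(0)
        b1, b2 = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 64))
        target = b1 @ rng.normal(size=(64, 32))        # lies in the span of b1
        print(lmd_goodness_of_fit(target, [b1, b2]))   # approx 1.0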
    Museformer: Transformer with Fine- and Coarse-Grained Attention for Music Generation. (arXiv:2210.10349v1 [cs.SD])
    Symbolic music generation aims to generate music scores automatically. A recent trend is to use Transformer or its variants in music generation, which is, however, suboptimal, because full attention cannot efficiently model the typically long music sequences (e.g., over 10,000 tokens), and existing models fall short in generating musical repetition structures. In this paper, we propose Museformer, a Transformer with a novel fine- and coarse-grained attention for music generation. Specifically, with the fine-grained attention, a token of a specific bar directly attends to all the tokens of the bars that are most relevant to music structures (e.g., the previous 1st, 2nd, 4th and 8th bars, selected via similarity statistics); with the coarse-grained attention, a token attends only to the summarization of other bars rather than to each of their tokens, so as to reduce the computational cost. The advantages are two-fold. First, it can capture both music-structure-related correlations via the fine-grained attention and other contextual information via the coarse-grained attention. Second, it is efficient and can model music sequences over 3x longer than its full-attention counterpart. Both objective and subjective experimental results demonstrate its ability to generate long music sequences with high quality and better structure.
    Graph sampling for node embedding. (arXiv:2210.10520v1 [stat.ML])
    Node embedding is a central topic in graph representation learning. Computational efficiency and scalability can be challenging for any method that requires full-graph operations. We propose sampling approaches to node embedding, with or without explicit modelling of the feature vector, which aim to extract useful information from both the eigenvectors related to the graph Laplacian and the given values associated with the graph.
    Attaining Class-level Forgetting in Pretrained Model using Few Samples. (arXiv:2210.10670v1 [cs.CV])
    In order to address real-world problems, deep learning models are jointly trained on many classes. However, some classes may later become restricted due to privacy/ethical concerns, and the knowledge of those restricted classes has to be removed from models that were trained on them. The available data may also be limited for the same reasons, making re-training infeasible. We propose a novel approach to address this problem without affecting the model's prediction power for the remaining classes. Our approach identifies the model parameters that are highly relevant to the restricted classes and removes the knowledge of the restricted classes from them using the limited available training data. Our approach is significantly faster and performs similarly to a model re-trained on the complete data of the remaining classes.
    The Future of Consumer Edge-AI Computing. (arXiv:2210.10514v1 [cs.LG])
    Deep Learning has proliferated dramatically across consumer devices in less than a decade, but it has largely been powered by hardware acceleration within isolated devices. Nonetheless, there are clear signals that the next decade of consumer intelligence will require levels of resources, a mixing of modalities, and a collaboration of devices that demand a significant pivot beyond hardware alone. To accomplish this, we believe a new Edge-AI paradigm will be necessary for this transition to happen in a sustainable manner, without infringing on user privacy or hurting quality of experience.
    Irregularly-Sampled Time Series Modeling with Spline Networks. (arXiv:2210.10630v1 [cs.LG])
    Observations made in continuous time are often irregular and contain missing values across channels. One approach to handling missing data is to impute it using splines, fitting piecewise polynomials to the observed values. We propose using the splines themselves as input to a neural network: we apply transformations to the interpolating function directly, instead of sampling points on a grid. To do so, we design layers that operate on splines and are analogous to their discrete counterparts. This allows us to represent an irregular sequence compactly and use this representation in downstream tasks such as classification and forecasting. Our model offers competitive performance compared to existing methods in terms of both accuracy and computational efficiency.
    Training set cleansing of backdoor poisoning by self-supervised representation learning. (arXiv:2210.10272v1 [cs.LG])
    A backdoor or Trojan attack is an important type of data poisoning attack against deep neural network (DNN) classifiers, wherein the training dataset is poisoned with a small number of samples that each possess the backdoor pattern (usually a pattern that is either imperceptible or innocuous) and are mislabeled to the attacker's target class. When trained on a backdoor-poisoned dataset, a DNN behaves normally on most benign test samples but predicts the target class when a test sample contains the backdoor pattern (i.e., a backdoor trigger). Here we focus on image classification and show that supervised training may build a stronger association between the backdoor pattern and the target class than between normal features and the true class of origin. By contrast, self-supervised representation learning ignores the labels and learns a feature embedding based on the images' semantic content, yielding similar embeddings for samples of the same class. Using a feature embedding found by self-supervised representation learning, we develop a data cleansing method that combines sample filtering and re-labeling. Experiments on the CIFAR-10 benchmark show that our method achieves state-of-the-art performance in mitigating backdoor attacks.
    Quick Graph Conversion for Robust Recommendation. (arXiv:2210.10321v1 [cs.IR])
    Implicit feedback plays a major role in recommender systems, but its high noise severely limits its effectiveness. To denoise implicit feedback, efforts have been devoted to graph data augmentation (GDA) methods. Although the bi-level optimization underlying GDA theoretically guarantees better recommendation performance, it also leads to expensive time costs and severe space explosion: bi-level optimization involves repeated traversal of all positive and negative instances after each optimization of the recommendation model. In this paper, we propose a new denoising paradigm, Quick Graph Conversion (QGrace), which transforms the original interaction graph into a purified (for positive instances) and densified (for negative instances) interest graph during recommendation model training. In QGrace, we leverage a gradient matching scheme based on elaborated generative models to carry out the conversion and generation of the interest graph, elegantly overcoming the high time and space costs. To enable recommendation models to run on interest graphs that lack implicit feedback data, we provide a fine-grained objective function from the perspective of alignment and uniformity. Experimental results on three benchmark datasets demonstrate that QGrace outperforms state-of-the-art GDA methods and recommendation models in effectiveness and robustness.
    Estimating the coverage in 3d reconstructions of the colon from colonoscopy videos. (arXiv:2210.10459v1 [cs.CV])
    Colonoscopy is the most common procedure for early detection and removal of polyps, a critical component of colorectal cancer prevention. Insufficient visual coverage of the colon surface during the procedure often results in missed polyps. To mitigate this issue, reconstructing the 3D surfaces of the colon in order to visualize the missing regions has been proposed. However, robustly estimating the local and global coverage from such a reconstruction has not been thoroughly investigated until now. In this work, we present a new method to estimate the coverage from a reconstructed colon pointcloud. Our method splits a reconstructed colon into segments and estimates the coverage of each segment by estimating the area of the missing surfaces. We achieve a mean absolute coverage error of 3-6% on colon segments generated from synthetic colonoscopy data and real colonography CT scans. In addition, we show good qualitative results on colon segments reconstructed from real colonoscopy videos.
    A Segment-Wise Gaussian Process-Based Ground Segmentation With Local Smoothness Estimation. (arXiv:2210.10515v1 [cs.LG])
    Both in terrestrial and extraterrestrial environments, a precise and informative model of the ground and the surface ahead is crucial for navigation and obstacle avoidance. The ground surface is not always flat; it may be sloped, bumpy, and rough, especially in off-road terrestrial scenes. In bumpy and rough scenes, the functional relationship of the surface-related features may vary across different areas of the ground, as the structure of the ground surface may change suddenly and the measured point cloud of the ground is not smooth. Thus, ground-related features must be obtained from local estimates or even point estimates. To tackle this problem, we propose a segment-wise GP-based ground segmentation method with local smoothness estimation. This method extends our previous method, in which a realistic measurement of the length-scale values was provided for the covariance kernel in each line segment to give a precise estimate of the ground for sloped terrains. In this extension, the length-scale value is estimated locally for each data point, which makes the method much more precise in rough scenes while remaining computationally simple and more robust to under-segmentation, sparsity, and under-representability. The segment-wise task estimates a partial continuous model of the ground for each radial range segment. Simulation results show the effectiveness of the proposed method in giving a continuous and precise estimate of the ground surface in rough and bumpy scenes while being fast enough for real-world applications.
    A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design. (arXiv:2210.10278v1 [cs.LG])
    We study reserve price optimization in multi-phase second-price auctions, where the seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, our setting involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially nontruthful bidders who aim to manipulate the seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is unknown, nonlinear, and cannot even be directly observed from the environment. We propose a mechanism addressing all three challenges. To address the first, we use a combination of a new technique named "buffer periods" and inspiration from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the $\underline{\rm C}$ontextual-$\underline{\rm L}$SVI-$\underline{\rm U}$CB-$\underline{\rm B}$uffer (CLUB) algorithm, which achieves $\tilde{\mathcal{O}}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{\mathcal{O}}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown, with no assumptions on bidders' truthfulness.
    Hybrid Neural Autoencoders for Stimulus Encoding in Visual and Other Sensory Neuroprostheses. (arXiv:2205.13623v2 [cs.LG] UPDATED)
    Sensory neuroprostheses are emerging as a promising technology to restore lost sensory function or augment human capabilities. However, sensations elicited by current devices often appear artificial and distorted. Although current models can predict the neural or perceptual response to an electrical stimulus, an optimal stimulation strategy solves the inverse problem: what is the required stimulus to produce a desired response? Here, we frame this as an end-to-end optimization problem, where a deep neural network stimulus encoder is trained to invert a known and fixed forward model that approximates the underlying biological system. As a proof of concept, we demonstrate the effectiveness of this Hybrid Neural Autoencoder (HNA) in visual neuroprostheses. We find that HNA produces high-fidelity patient-specific stimuli representing handwritten digits and segmented images of everyday objects, and significantly outperforms conventional encoding strategies across all simulated patients. Overall, this is an important step towards addressing the long-standing challenge of restoring high-quality vision to people living with incurable blindness, and it may prove a promising solution for a variety of neuroprosthetic technologies.
    Stability of Entropic Wasserstein Barycenters and application to random geometric graphs. (arXiv:2210.10535v1 [cs.CG])
    As interest in graph data has grown in recent years, the computation of various geometric tools has become essential. In some areas, such as mesh processing, these tools often rely on the computation of geodesics and shortest paths in discretized manifolds. A recent example of such a tool is the computation of Wasserstein barycenters (WB), a very general notion of barycenter derived from the theory of Optimal Transport, and their entropic-regularized variant. In this paper, we examine how WBs on discretized meshes relate to the geometry of the underlying manifold. We first provide a generic stability result with respect to the input cost matrices. We then apply this result to random geometric graphs on manifolds, whose shortest paths converge to geodesics, thereby proving the consistency of WBs computed on discretized shapes.
    Rethinking Sharpness-Aware Minimization as Variational Inference. (arXiv:2210.10452v1 [stat.ML])
    Sharpness-aware minimization (SAM) aims to improve the generalisation of gradient-based learning by seeking out flat minima. In this work, we establish connections between SAM and Mean-Field Variational Inference (MFVI) of neural network parameters. We show that both these methods have interpretations as optimizing notions of flatness, and when using the reparametrisation trick, they both boil down to calculating the gradient at a perturbed version of the current mean parameter. This thinking motivates our study of algorithms that combine or interpolate between SAM and MFVI. We evaluate the proposed variational algorithms on several benchmark datasets, and compare their performance to variants of SAM. Taking a broader perspective, our work suggests that SAM-like updates can be used as a drop-in replacement for the reparametrisation trick.
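    A toy sketch of the shared computational core noted above -- both the SAM step and a reparametrised MFVI step evaluate the gradient at a perturbed copy of the current (mean) parameter; the quadratic loss and hyperparameters are placeholders.

        import numpy as np

        def loss_grad(w):                       # toy quadratic loss and its gradient
            return 0.5 * (w ** 2).sum(), w

        def sam_step(w, lr=0.1, rho=0.05):
            _, g = loss_grad(w)
            eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascend to the local worst case
            _, g_pert = loss_grad(w + eps)               # gradient at the perturbed point
            return w - lr * g_pert

        def mfvi_step(mean, logstd, lr=0.1, rng=np.random.default_rng(0)):
            eps = np.exp(logstd) * rng.standard_normal(mean.shape)  # reparametrisation
            _, g_pert = loss_grad(mean + eps)            # gradient at the perturbed mean
            return mean - lr * g_pert

        w = sam_step(np.array([1.0, -2.0]))
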
    Enhanced vectors for top-k document retrieval in Question Answering. (arXiv:2210.10584v1 [cs.IR])
    Modern applications, especially information-retrieval web apps built around "search", are gradually moving towards "answering" modules. Conversational chatbots, which have proved more engaging to users, use Question Answering at their core. Since precise answering is computationally expensive, several approaches have been developed to prefetch the most relevant documents/passages from the database that contain the answer. We propose a different approach that retrieves the evidence documents efficiently and accurately, ensuring that the relevant document for a given user query is not missed. We do so by assigning each document (or passage, in our case) a unique identifier and using it to create dense vectors that can be efficiently indexed. More precisely, we use the identifier to predict randomly sampled context-window words of the relevant question corresponding to the passage, along with the words of the passage itself. This naturally embeds the passage identifier into the vector space in such a way that the embedding is close to the question without compromising the information content. This approach enables efficient creation of real-time query vectors in ~4 milliseconds.
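    The mechanism is reminiscent of paragraph vectors, so a rough sketch with gensim's Doc2Vec is given below: each passage's identifier serves as its tag, and words from the paired question are appended to the passage's words. The pairing and all hyperparameters are our assumptions for illustration.

        from gensim.models.doc2vec import Doc2Vec, TaggedDocument

        passages = ["the eiffel tower is in paris", "water boils at 100 degrees"]
        questions = ["where is the eiffel tower", "when does water boil"]
        # Tag = passage identifier; words = passage words plus question context words.
        docs = [TaggedDocument(words=(p + " " + q).split(), tags=[f"psg_{i}"])
                for i, (p, q) in enumerate(zip(passages, questions))]
        model = Doc2Vec(docs, vector_size=32, min_count=1, epochs=40)
        qvec = model.infer_vector("where is the eiffel tower".split())  # query vector
        best = model.dv.most_similar([qvec], topn=1)                    # nearest passage id
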
    PoeticTTS -- Controllable Poetry Reading for Literary Studies. (arXiv:2207.05549v2 [eess.AS] UPDATED)
    Speech synthesis for poetry is challenging due to the specific intonation patterns inherent to poetic speech. In this work, we propose an approach to synthesise poems with almost human-like naturalness, in order to enable literary scholars to systematically examine hypotheses on the interplay between text, spoken realisation, and the listener's perception of poems. To meet these special requirements for literary studies, we resynthesise poems by cloning prosodic values from a human reference recitation, and afterwards make use of fine-grained prosody control to manipulate the synthetic speech in a human-in-the-loop setting to alter the recitation with respect to specific phenomena. We find that finetuning our TTS model on poetry captures poetic intonation patterns to a large extent, which is beneficial for prosody cloning and manipulation, and we verify the success of our approach both in an objective evaluation and in human studies.
    Scalable Interpretability via Polynomials. (arXiv:2205.14108v4 [cs.LG] UPDATED)
    Generalized Additive Models (GAMs) have quickly become the leading choice for inherently-interpretable machine learning. However, unlike uninterpretable methods such as DNNs, they lack expressive power and easy scalability, and are hence not a feasible alternative for real-world tasks. We present a new class of GAMs that use tensor rank decompositions of polynomials to learn powerful, {\em inherently-interpretable} models. Our approach, titled Scalable Polynomial Additive Models (SPAM), is effortlessly scalable and models {\em all} higher-order feature interactions without a combinatorial parameter explosion. SPAM outperforms all current interpretable approaches, and matches DNN/XGBoost performance on a series of real-world benchmarks with up to hundreds of thousands of features. We demonstrate by human subject evaluations that SPAMs are demonstrably more interpretable in practice, and are hence an effortless replacement for DNNs for creating interpretable and high-performance systems suitable for large-scale machine learning. Source code is available at https://github.com/facebookresearch/nbm-spam.
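    For intuition, a sketch of the rank-decomposed degree-2 case: the pairwise-interaction tensor is stored in rank-R factored form, so all second-order interactions are modeled without O(d^2) parameters. The symmetric parameterization here is our simplification, not necessarily the paper's exact construction.

        import numpy as np

        class Spam2:
            """Degree-2 additive polynomial with a rank-R interaction term."""
            def __init__(self, d, rank, rng=np.random.default_rng(0)):
                self.b = 0.0
                self.w = rng.standard_normal(d) * 0.01           # linear term
                self.U = rng.standard_normal((rank, d)) * 0.01   # factored quadratic term

            def predict(self, X):
                # f(x) = b + w.x + sum_r (u_r.x)^2, i.e. x^T (U^T U) x in low rank.
                return self.b + X @ self.w + ((X @ self.U.T) ** 2).sum(axis=1)

        model = Spam2(d=5, rank=2)
        preds = model.predict(np.ones((3, 5)))
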
    Group Fairness in Prediction-Based Decision Making: From Moral Assessment to Implementation. (arXiv:2210.10456v1 [cs.LG])
    Ensuring the fairness of prediction-based decision making relies on statistical group fairness criteria. Which of these criteria is the morally most appropriate one depends on the context, and its choice requires an ethical analysis. In this paper, we present a step-by-step procedure integrating three elements: (a) a framework for the moral assessment of what fairness means in a given context, based on the recently proposed general principle of "fair equality of chances" (FEC); (b) a mapping of the assessment's results to established statistical group fairness criteria; and (c) a method for integrating the thus-defined fairness into optimal decision making. As a second contribution, we present new applications of the FEC principle and show that, with this extension, the FEC framework covers all types of group fairness criteria: independence, separation, and sufficiency. Third, we introduce an extended version of the FEC principle, which additionally allows accounting for morally irrelevant elements of the fairness assessment and links to well-known relaxations of the fairness criteria. This paper presents a framework to develop fair decision systems in a conceptually sound way, combining the moral and the computational elements of fair prediction-based decision-making in an integrated approach. Data and code to reproduce our results are available at https://github.com/joebaumann/fair-prediction-based-decision-making.
    Exclusive Supermask Subnetwork Training for Continual Learning. (arXiv:2210.10209v1 [cs.CV])
    Continual Learning (CL) methods mainly focus on avoiding catastrophic forgetting and learning representations that are transferable to new tasks. Recently, Wortsman et al. (2020) proposed a CL method, SupSup, which uses a randomly initialized, fixed base network (model) and finds a supermask for each new task that selectively keeps or removes each weight to produce a subnetwork. Forgetting is prevented because the network weights are never updated. However, although there is no forgetting, the performance of the supermask is sub-optimal because the fixed weights restrict its representational power, and no knowledge is accumulated or transferred inside the model when new tasks are learned. Hence, we propose ExSSNeT (Exclusive Supermask SubNEtwork Training), which performs exclusive and non-overlapping subnetwork weight training. This avoids conflicting updates to the shared weights by subsequent tasks, improving performance while still preventing forgetting. Furthermore, we propose a novel KNN-based Knowledge Transfer (KKT) module that dynamically initializes a new task's mask based on previous tasks to improve knowledge transfer. We demonstrate that ExSSNeT outperforms SupSup and other strong previous methods on both text classification and vision tasks while preventing forgetting. Moreover, ExSSNeT is particularly advantageous for sparse masks that activate 2-10% of the model parameters, resulting in an average improvement of 8.3% over SupSup. Additionally, ExSSNeT scales to a large number of tasks (100), and our KKT module helps to learn new tasks faster while improving overall performance. Our code is available at https://github.com/prateeky2806/exessnet
    Attribution and Obfuscation of Neural Text Authorship: A Data Mining Perspective. (arXiv:2210.10488v1 [cs.CL])
    Two interlocking research questions of growing interest and importance in privacy research are Authorship Attribution (AA) and Authorship Obfuscation (AO). Given an artifact, especially a text t in question, an AA solution aims to accurately attribute t to its true author out of many candidate authors while an AO solution aims to modify t to hide its true authorship. Traditionally, the notion of authorship and its accompanying privacy concern is only toward human authors. However, in recent years, due to the explosive advancements in Neural Text Generation (NTG) techniques in NLP, capable of synthesizing human-quality open-ended texts (so-called "neural texts"), one has to now consider authorships by humans, machines, or their combination. Due to the implications and potential threats of neural texts when used maliciously, it has become critical to understand the limitations of traditional AA/AO solutions and develop novel AA/AO solutions in dealing with neural texts. In this survey, therefore, we make a comprehensive review of recent literature on the attribution and obfuscation of neural text authorship from a Data Mining perspective, and share our view on their limitations and promising research directions.
    Deep-based quality assessment of medical images through domain adaptation. (arXiv:2210.10533v1 [eess.IV])
    Predicting the quality of multimedia content is often needed in different fields. In some applications, quality metrics are crucial and high-impact, as they can affect decision making such as diagnosis from medical multimedia. In this paper, we focus on such applications by proposing an efficient and shallow model for predicting the quality of medical images, without reference, from a small amount of annotated data. Our model is based on convolutional self-attention, which models complex representations from relevant local characteristics of images and slides over the image to interpolate a global quality score. We also apply domain adaptation learning in unsupervised and semi-supervised manners. The proposed model is evaluated on a dataset composed of several images and their corresponding subjective scores. The obtained results show the efficiency of the proposed method, as well as the relevance of applying domain adaptation to generalize over different multimedia domains for the downstream task of perceptual quality prediction. (Funded by the TIC-ART project, Regional fund, Region Centre-Val de Loire.)
    Discovering Limitations of Image Quality Assessments with Noised Deep Learning Image Sets. (arXiv:2210.10249v1 [cs.CV])
    Image quality is important, and it can affect overall performance in image processing and computer vision, as well as for numerous other reasons. Image quality assessment (IQA) is consequently a vital task in different applications, from aerial photography interpretation to object detection to medical image analysis. In previous research, the BRISQUE algorithm and the PSNR algorithm were evaluated with high-resolution (512*384 pixels per image) but relatively small image sets (4,744 images). However, scientists have not evaluated IQA algorithms on low-resolution (32*32 pixels per image), multi-perturbation, big image sets (for example, 60,000 different images, not counting their perturbations). This study explores these two IQA algorithms through experimental investigation. We first chose two deep learning image sets, CIFAR-10 and MNIST. Then, we added 68 perturbations that add noise to the images in specific sequences and noise intensities. In addition, we tracked the performance outputs of the two IQA algorithms with singly and multiply noised images. After quantitatively analyzing the experimental results, we report the limitations of the two IQAs on these noised CIFAR-10 and MNIST image sets. We also explain three potential root causes for the performance degradation. These findings point out weaknesses of the two IQA algorithms. The research results provide guidance to scientists and engineers developing accurate, robust IQA algorithms. In addition to supporting future scientific research and industrial projects, all source code is shared at: https://github.com/caperock/imagequality
    On the Power of Pre-training for Generalization in RL: Provable Benefits and Hardness. (arXiv:2210.10464v1 [cs.LG])
    Generalization in Reinforcement Learning (RL) aims to train an agent that generalizes to the target environment. This paper studies RL generalization from a theoretical perspective: how much can we expect pre-training over training environments to help? When interaction with the target environment is not allowed, we certify that the best we can obtain is a near-optimal policy in an average sense, and we design an algorithm that achieves this goal. Furthermore, when the agent is allowed to interact with the target environment, we give a surprising result showing that, asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, in the non-asymptotic regime, we design an efficient algorithm and prove a distribution-based regret bound in the target environment that is independent of the state-action space.
    Propagating Variational Model Uncertainty for Bioacoustic Call Label Smoothing. (arXiv:2210.10526v1 [cs.LG])
    We focus on using the predictive uncertainty signal calculated by Bayesian neural networks to guide learning in the very task the model is being trained on. Rather than opting for costly Monte Carlo sampling of weights, we propagate the approximate hidden variance in an end-to-end manner throughout a variational Bayesian adaptation of a ResNet with attention and squeeze-and-excitation blocks, in order to identify data samples that should contribute less to the loss calculation. We thus propose uncertainty-aware, data-specific label smoothing, where the smoothing probability depends on this epistemic uncertainty. We show that, through the explicit use of the epistemic uncertainty in the loss calculation, the variational model achieves improved predictive and calibration performance. This core machine learning methodology is exemplified on wildlife call detection from audio recordings made via passive acoustic monitoring equipment in the animals' natural habitats, with the future goal of automating large-scale annotation in a trustworthy manner.
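    The smoothing rule can be pictured as follows: each sample spreads a label mass over the other classes that grows with its epistemic uncertainty, so uncertain samples contribute softer targets. The mapping from variance to smoothing mass below is an illustrative assumption.

        import numpy as np

        def smoothed_targets(y, n_classes, epistemic_var, eps_max=0.2):
            # y: (N,) integer labels; epistemic_var: (N,) per-sample uncertainty.
            u = epistemic_var / (epistemic_var.max() + 1e-12)    # normalise to [0, 1]
            eps = eps_max * u                                    # per-sample smoothing mass
            T = np.repeat((eps / (n_classes - 1))[:, None], n_classes, axis=1)
            T[np.arange(len(y)), y] = 1.0 - eps                  # rows still sum to 1
            return T

        T = smoothed_targets(np.array([0, 2]), 3, np.array([0.1, 0.9]))
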
    Causal Structure Learning with Recommendation System. (arXiv:2210.10256v1 [cs.IR])
    A fundamental challenge of recommendation systems (RS) is understanding the causal dynamics underlying users' decision making. Most existing literature addresses this problem by using causal structures inferred from domain knowledge. However, there are numerous phenomena where domain knowledge is insufficient, and the causal mechanisms must be learnt from the feedback data. Discovering the causal mechanism from RS feedback data is both novel and challenging, since the RS itself is a source of intervention that can influence both users' exposure and their willingness to interact. For this reason, most existing solutions become inappropriate, since they require data collected free from any RS. In this paper, we first formulate the underlying causal mechanism as a causal structural model and describe a general causal structure learning framework grounded in the real-world working mechanism of RS. The essence of our approach is to acknowledge the unknown nature of RS intervention. We then derive the learning objective from our framework and propose an augmented Lagrangian solver for efficient optimization. We conduct both simulation and real-world experiments to demonstrate how our approach compares favorably to existing solutions, together with empirical analysis from sensitivity and ablation studies.
    Variational methods for simulation-based inference. (arXiv:2203.04176v3 [stat.ML] UPDATED)
    We present Sequential Neural Variational Inference (SNVI), an approach to perform Bayesian inference in models with intractable likelihoods. SNVI combines likelihood-estimation (or likelihood-ratio-estimation) with variational inference to achieve a scalable simulation-based inference approach. SNVI maintains the flexibility of likelihood(-ratio) estimation to allow arbitrary proposals for simulations, while simultaneously providing a functional estimate of the posterior distribution without requiring MCMC sampling. We present several variants of SNVI and demonstrate that they are substantially more computationally efficient than previous algorithms, without loss of accuracy on benchmark tasks. We apply SNVI to a neuroscience model of the pyloric network in the crab and demonstrate that it can infer the posterior distribution with one order of magnitude fewer simulations than previously reported. SNVI vastly reduces the computational cost of simulation-based inference while maintaining accuracy and flexibility, making it possible to tackle problems that were previously inaccessible.
    Robust Offline Reinforcement Learning with Gradient Penalty and Constraint Relaxation. (arXiv:2210.10469v1 [cs.LG])
    A promising paradigm for offline reinforcement learning (RL) is to constrain the learned policy to stay close to the dataset behaviors, known as policy-constraint offline RL. However, existing works heavily rely on the purity of the data, exhibiting performance degradation or even catastrophic failure when learning from contaminated datasets containing impure trajectories of diverse levels (e.g., expert level, medium level), even though such contaminated offline data logs are common in the real world. To mitigate this, we first introduce a gradient penalty over the learned value function to tackle exploding Q-functions. We then relax the closeness constraint on non-optimal actions with critic-weighted constraint relaxation. Experimental results show that the proposed techniques effectively tame the non-optimal trajectories for policy-constraint offline RL methods, evaluated on a set of contaminated D4RL Mujoco and Adroit datasets.
    LAMP: Extracting Text from Gradients with Language Model Priors. (arXiv:2202.08827v2 [cs.LG] UPDATED)
    Recent work shows that sensitive user data can be reconstructed from gradient updates, breaking the key privacy promise of federated learning. While success was demonstrated primarily on image data, these methods do not directly transfer to other domains such as text. In this work, we propose LAMP, a novel attack tailored to textual data, that successfully reconstructs original text from gradients. Our attack is based on two key insights: (i) modeling prior text probability with an auxiliary language model, guiding the search towards more natural text, and (ii) alternating continuous and discrete optimization, which minimizes reconstruction loss on embeddings, while avoiding local minima by applying discrete text transformations. Our experiments demonstrate that LAMP is significantly more effective than prior work: it reconstructs 5x more bigrams and 23% longer subsequences on average. Moreover, we are the first to recover inputs from batch sizes larger than 1 for textual models. These findings indicate that gradient updates of models operating on textual data leak more information than previously thought.
    SKID: Self-Supervised Learning for Knee Injury Diagnosis from MRI Data. (arXiv:2104.10481v4 [cs.CV] UPDATED)
    In medical image analysis, the cost of acquiring high-quality data and having it annotated by experts is a barrier in many applications. Most techniques are based on a supervised learning framework and need a large amount of annotated data to achieve satisfactory performance. As an alternative, in this paper we propose a self-supervised learning (SSL) approach to learn spatial anatomical representations from the frames of magnetic resonance (MR) video clips for the diagnosis of knee medical conditions. The pretext model learns meaningful spatial context-invariant representations. The downstream task in our paper is class-imbalanced multi-label classification. Different experiments show that the features learnt by the pretext model provide competitive performance in the downstream task. Moreover, the results show the efficiency and reliability of the proposed pretext model in learning representations of minority classes without applying any strategy against dataset imbalance. To the best of our knowledge, this is the first work to show the effectiveness and reliability of self-supervised learning algorithms for class-imbalanced multi-label classification on MR videos. The code for evaluation of the proposed work is available at https://github.com/sadimanna/skid.
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v4 [cs.LG] UPDATED)
    Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model to satisfy the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES), which selects the point gaining the most information about the underlying function, as estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decisions made by BES are explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.
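    The sample-and-weight step can be sketched as follows, here with an assumed RBF kernel, hard box bounds, and simple rejection in place of the paper's weighting scheme:

        import numpy as np

        def rbf(a, b, ls=0.5):
            return np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)

        def bounded_posterior_samples(Xtr, ytr, Xte, lo, hi, n=200, noise=1e-6,
                                      rng=np.random.default_rng(0)):
            K = rbf(Xtr, Xtr) + noise * np.eye(len(Xtr))
            Ks, Kss = rbf(Xtr, Xte), rbf(Xte, Xte)
            mu = Ks.T @ np.linalg.solve(K, ytr)                 # GP posterior mean
            cov = Kss - Ks.T @ np.linalg.solve(K, Ks)           # GP posterior covariance
            f = rng.multivariate_normal(mu, cov + 1e-8 * np.eye(len(Xte)), size=n)
            keep = ((f >= lo) & (f <= hi)).all(axis=1)          # discard bound violators
            return f[keep]

        samples = bounded_posterior_samples(np.array([0.0, 1.0]), np.array([0.2, 0.4]),
                                            np.linspace(0, 1, 20), lo=-1.0, hi=1.0)
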
    CLUTR: Curriculum Learning via Unsupervised Task Representation Learning. (arXiv:2210.10243v1 [cs.LG])
    Reinforcement Learning (RL) algorithms are often known for sample inefficiency and poor generalization. Recently, Unsupervised Environment Design (UED) emerged as a new paradigm for zero-shot generalization by simultaneously learning a task distribution and agent policies on the sampled tasks. This is a non-stationary process in which the task distribution evolves along with the agent policies, creating instability over time. While past works demonstrated the potential of such approaches, sampling effectively from the task space remains an open challenge, bottlenecking these approaches. To this end, we introduce CLUTR: a novel curriculum learning algorithm that decouples task representation and curriculum learning into a two-stage optimization. It first trains a recurrent variational autoencoder on randomly generated tasks to learn a latent task manifold. Next, a teacher agent creates a curriculum by maximizing a minimax REGRET-based objective on a set of latent tasks sampled from this manifold. By keeping the task manifold fixed, we show that CLUTR successfully overcomes the non-stationarity problem and improves stability. Our experimental results show CLUTR outperforms PAIRED, a principled and popular UED method, in terms of generalization and sample efficiency in the challenging CarRacing and navigation environments: it shows an 18x improvement on the F1 CarRacing benchmark. CLUTR also performs comparably to the non-UED state-of-the-art for CarRacing, outperforming it in nine of the 20 tracks. CLUTR also achieves a 33% higher solved rate than PAIRED on a set of 18 out-of-distribution navigation tasks.
    Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization. (arXiv:2210.10199v1 [cs.LG])
    Optimizing expensive-to-evaluate black-box functions of discrete (and potentially continuous) design parameters is a ubiquitous problem in scientific and engineering applications. Bayesian optimization (BO) is a popular, sample-efficient method that leverages a probabilistic surrogate model and an acquisition function (AF) to select promising designs to evaluate. However, maximizing the AF over mixed or high-cardinality discrete search spaces is challenging: standard gradient-based methods cannot be used directly, and evaluating the AF at every point in the search space would be computationally prohibitive. To address this issue, we propose using probabilistic reparameterization (PR). Instead of directly optimizing the AF over the search space containing discrete parameters, we maximize the expectation of the AF over a probability distribution defined by continuous parameters. We prove that under suitable reparameterizations, the BO policy that maximizes the probabilistic objective is the same as that which maximizes the AF, and therefore, PR enjoys the same regret bounds as the original BO policy using the underlying AF. Moreover, our approach provably converges to a stationary point of the probabilistic objective under gradient ascent using scalable, unbiased estimators of both the probabilistic objective and its gradient. Therefore, as the number of starting points and gradient steps increases, our approach will recover a maximizer of the AF (an often-neglected requisite for commonly used BO regret bounds). We validate our approach empirically and demonstrate state-of-the-art optimization performance on a wide range of real-world applications. PR is complementary to (and benefits) recent work and naturally generalizes to settings with multiple objectives and black-box constraints.
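    A toy sketch of the idea for binary parameters: rather than maximizing a discrete acquisition alpha(z) directly, run gradient ascent on E_{z~Bernoulli(p)}[alpha(z)] using an unbiased score-function estimator (one possible choice of the unbiased gradient estimators mentioned above; the acquisition here is a stand-in).

        import numpy as np

        def acq(z):                                  # stand-in acquisition function
            return -np.abs(z - np.array([1, 0, 1, 1])).sum()

        def pr_ascent(d=4, steps=300, lr=0.05, mc=64, rng=np.random.default_rng(0)):
            theta = np.zeros(d)                      # logits of the Bernoulli probabilities
            for _ in range(steps):
                p = 1.0 / (1.0 + np.exp(-theta))
                z = (rng.random((mc, d)) < p).astype(float)
                a = np.array([acq(zi) for zi in z])
                # Score-function (REINFORCE) gradient of E[acq] w.r.t. the logits.
                grad = ((a - a.mean())[:, None] * (z - p)).mean(axis=0)
                theta += lr * grad
            return (1.0 / (1.0 + np.exp(-theta)) > 0.5).astype(int)

        print(pr_ascent())   # should recover the maximizer [1, 0, 1, 1]
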
    Motion-Based Weak Supervision for Video Parsing with Application to Colonoscopy. (arXiv:2210.10594v1 [cs.CV])
    We propose a two-stage unsupervised approach for parsing videos into phases. We use motion cues to divide the video into coarse segments. Noisy segment labels are then used to weakly supervise an appearance-based classifier. We show the effectiveness of the method for phase detection in colonoscopy videos.
    Non-iterative optimization of pseudo-labeling thresholds for training object detection models from multiple datasets. (arXiv:2210.10221v1 [cs.CV])
    We propose a non-iterative method to optimize pseudo-labeling thresholds for learning object detection from a collection of low-cost datasets, each of which is annotated for only a subset of all the object classes. A popular approach to this problem is first to train teacher models and then to use their confident predictions as pseudo ground-truth labels when training a student model. To obtain the best result, however, thresholds for prediction confidence must be adjusted, a process that typically involves iterative search and repeated training of student models and is therefore time-consuming. We instead develop a method to optimize the thresholds without iterative optimization, by maximizing the $F_\beta$-score on a validation dataset; this score measures the quality of pseudo labels and can be computed without training a student model. We experimentally demonstrate that our proposed method achieves an mAP comparable to that of grid search on the COCO and VOC datasets.
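    In pseudocode, the selection reduces to a single sweep over candidate thresholds on a labeled validation set, with no student training in the loop; the detection-matching bookkeeping below is simplified to a per-prediction correctness flag.

        import numpy as np

        def best_threshold(conf, is_correct, beta=1.0,
                           grid=np.linspace(0.05, 0.95, 19)):
            # conf: (N,) teacher confidences on validation images;
            # is_correct: (N,) bool, whether each detection matches a ground-truth box.
            best_t, best_f = None, -1.0
            for t in grid:
                keep = conf >= t
                tp = int((keep & is_correct).sum())
                fp = int((keep & ~is_correct).sum())
                fn = int((~keep & is_correct).sum())
                prec = tp / max(tp + fp, 1)
                rec = tp / max(tp + fn, 1)
                f = (1 + beta ** 2) * prec * rec / max(beta ** 2 * prec + rec, 1e-12)
                if f > best_f:
                    best_t, best_f = t, f
            return best_t, best_f
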
    On the Adversarial Robustness of Mixture of Experts. (arXiv:2210.10253v1 [cs.LG])
    Adversarial robustness is a key desirable property of neural networks. It has been empirically shown to be affected by their sizes, with larger networks being typically more robust. Recently, Bubeck and Sellke proved a lower bound on the Lipschitz constant of functions that fit the training data in terms of their number of parameters. This raises an interesting open question: do -- and can -- functions with more parameters, but not necessarily more computational cost, have better robustness? We study this question for sparse Mixture of Experts models (MoEs), which make it possible to scale up model size for a roughly constant computational cost. We theoretically show that under certain conditions on the routing and the structure of the data, MoEs can have significantly smaller Lipschitz constants than their dense counterparts. However, the robustness of MoEs can suffer when the highest-weighted experts for an input implement sufficiently different functions. We next empirically evaluate the robustness of MoEs on ImageNet using adversarial attacks and show that they are indeed more robust than dense models with the same computational cost. We make key observations on the robustness of MoEs to the choice of experts, highlighting the redundancy of experts in models trained in practice.
    Optimisation & Generalisation in Networks of Neurons. (arXiv:2210.10101v1 [cs.NE])
    The goal of this thesis is to develop the optimisation and generalisation theoretic foundations of learning in artificial neural networks. On optimisation, a new theoretical framework is proposed for deriving architecture-dependent first-order optimisation algorithms. The approach works by combining a "functional majorisation" of the loss function with "architectural perturbation bounds" that encode an explicit dependence on neural architecture. The framework yields optimisation methods that transfer hyperparameters across learning problems. On generalisation, a new correspondence is proposed between ensembles of networks and individual networks. It is argued that, as network width and normalised margin are taken large, the space of networks that interpolate a particular training set concentrates on an aggregated Bayesian method known as a "Bayes point machine". This correspondence provides a route for transferring PAC-Bayesian generalisation theorems over to individual networks. More broadly, the correspondence presents a fresh perspective on the role of regularisation in networks with vastly more parameters than data.
    Intra-Source Style Augmentation for Improved Domain Generalization. (arXiv:2210.10175v1 [cs.CV])
    The generalization with respect to domain shifts, as they frequently appear in applications such as autonomous driving, is one of the remaining big challenges for deep learning models. Therefore, we propose an intra-source style augmentation (ISSA) method to improve domain generalization in semantic segmentation. Our method is based on a novel masked noise encoder for StyleGAN2 inversion. The model learns to faithfully reconstruct the image, preserving its semantic layout through noise prediction. Random masking of the estimated noise enables the style mixing capability of our model, i.e., it allows altering the global appearance without affecting the semantic layout of an image. Using the proposed masked noise encoder to randomize style and content combinations in the training set, ISSA effectively increases the diversity of training data and reduces spurious correlation. As a result, we achieve up to $12.4\%$ mIoU improvements on driving-scene semantic segmentation under different types of data shifts, i.e., changing geographic locations, adverse weather conditions, and day to night. ISSA is model-agnostic and straightforwardly applicable to CNNs and Transformers. It is also complementary to other domain generalization techniques, e.g., it improves the recent state-of-the-art solution RobustNet by $3\%$ mIoU in Cityscapes to Dark Zürich.
    Proximal Learning With Opponent-Learning Awareness. (arXiv:2210.10125v1 [cs.LG])
    Learning With Opponent-Learning Awareness (LOLA) (Foerster et al. [2018a]) is a multi-agent reinforcement learning algorithm that typically learns reciprocity-based cooperation in partially competitive environments. However, LOLA often fails to learn such behaviour on more complex policy spaces parameterized by neural networks, partly because the update rule is sensitive to the policy parameterization. This problem is especially pronounced in the opponent modeling setting, where the opponent's policy is unknown and must be inferred from observations; in such settings, LOLA is ill-specified because behaviorally equivalent opponent policies can result in non-equivalent updates. To address this shortcoming, we reinterpret LOLA as approximating a proximal operator, and then derive a new algorithm, proximal LOLA (POLA), which uses the proximal formulation directly. Unlike LOLA, the POLA updates are parameterization invariant, in the sense that when the proximal objective has a unique optimum, behaviorally equivalent policies result in behaviorally equivalent updates. We then present practical approximations to the ideal POLA update, which we evaluate in several partially competitive environments with function approximation and opponent modeling. This empirically demonstrates that POLA achieves reciprocity-based cooperation more reliably than LOLA.
    AUC-based Selective Classification. (arXiv:2210.10703v1 [cs.LG])
    Selective classification (or classification with a reject option) pairs a classifier with a selection function to determine whether or not a prediction should be accepted. This framework trades off coverage (the probability of accepting a prediction) against predictive performance, typically measured by distributive loss functions. In many application scenarios, such as credit scoring, performance is instead measured by ranking metrics, such as the Area Under the ROC Curve (AUC). We propose a model-agnostic approach to associate a selection function with a given probabilistic binary classifier, specifically targeted at optimizing the AUC. We provide both theoretical justifications and a novel algorithm, called $AUCross$, to achieve this goal. Experiments show that $AUCross$ succeeds in trading off coverage for AUC, improving over existing selective classification methods targeted at optimizing accuracy.
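    For intuition only (this is not the $AUCross$ algorithm itself), even a naive selective scheme illustrates the coverage/AUC tradeoff: abstain on the band of scores closest to 0.5 and measure AUC on the accepted subset.

        import numpy as np
        from sklearn.metrics import roc_auc_score

        def auc_at_coverage(y_true, scores, coverage):
            ambiguity = np.abs(scores - 0.5)                 # distance from indecision
            keep = ambiguity >= np.quantile(ambiguity, 1.0 - coverage)
            return roc_auc_score(y_true[keep], scores[keep]) # assumes both classes remain

        rng = np.random.default_rng(0)
        y = rng.integers(0, 2, 1000)
        scores = np.clip(0.5 * y + 0.25 + 0.2 * rng.standard_normal(1000), 0, 1)
        for c in (1.0, 0.8, 0.6):
            print(c, round(auc_at_coverage(y, scores, c), 3))
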
    Rethinking Prototypical Contrastive Learning through Alignment, Uniformity and Correlation. (arXiv:2210.10194v1 [cs.CV])
    Contrastive self-supervised learning (CSL) with a prototypical regularization has been introduced to learn meaningful representations for downstream tasks that require strong semantic information. However, optimizing CSL with a loss that performs the prototypical regularization aggressively, e.g., the ProtoNCE loss, can cause the "coagulation" of examples in the embedding space: the intra-prototype diversity of samples collapses to trivial solutions in exchange for prototypes that are well separated from one another. Motivated by previous works, we propose to mitigate this phenomenon by learning Prototypical representations through Alignment, Uniformity and Correlation (PAUC). Specifically, the ordinary ProtoNCE loss is revised with: (1) an alignment loss that pulls embeddings from positive prototypes together; (2) a uniformity loss that distributes the prototypical-level features uniformly; (3) a correlation loss that increases the diversity and discriminability of prototypical-level features. We conduct extensive experiments on various benchmarks, where the results demonstrate the effectiveness of our method in improving the quality of prototypical contrastive representations. In particular, in classification downstream tasks with linear probes, our proposed method outperforms the state-of-the-art instance-wise and prototypical contrastive learning methods on the ImageNet-100 dataset by 2.96% and on the ImageNet-1K dataset by 2.46%, under the same settings of batch size and epochs.
    Tempo: Accelerating Transformer-Based Model Training through Memory Footprint Reduction. (arXiv:2210.10246v1 [cs.LG])
    Training deep learning models can be computationally expensive. Prior works have shown that increasing the batch size can potentially lead to better overall throughput. However, the batch size is frequently limited by the accelerator memory capacity due to the activations/feature maps stored for the training backward pass, as larger batch sizes require larger feature maps to be stored. Transformer-based models, which have recently seen a surge in popularity due to their good performance and applicability to a variety of tasks, have a similar problem. To remedy this issue, we propose Tempo, a new approach to efficiently use accelerator (e.g., GPU) memory resources for training Transformer-based models. Our approach provides drop-in replacements for the GELU, LayerNorm, and Attention layers, reducing the memory usage and ultimately leading to more efficient training. We implement Tempo and evaluate the throughput, memory usage, and accuracy/loss on the BERT Large pre-training task. We demonstrate that Tempo enables up to 2x higher batch sizes and 16% higher training throughput over the state-of-the-art baseline. We also evaluate Tempo on GPT2 and RoBERTa models, showing 19% and 26% speedup over the baseline.
    Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation. (arXiv:2210.10195v1 [cs.LG])
    Curriculum Reinforcement Learning (CRL) aims to create a sequence of tasks, starting from easy ones and gradually learning towards difficult tasks. In this work, we focus on the idea of framing CRL as interpolations between a source (auxiliary) and a target task distribution. Although existing studies have shown the great potential of this idea, it remains unclear how to formally quantify and generate the movement between task distributions. Inspired by the insights from gradual domain adaptation in semi-supervised learning, we create a natural curriculum by breaking down the potentially large task distributional shift in CRL into smaller shifts. We propose GRADIENT, which formulates CRL as an optimal transport problem with a tailored distance metric between tasks. Specifically, we generate a sequence of task distributions as a geodesic interpolation (i.e., Wasserstein barycenter) between the source and target distributions. Different from many existing methods, our algorithm considers a task-dependent contextual distance metric and is capable of handling nonparametric distributions in both continuous and discrete context settings. In addition, we theoretically show that GRADIENT enables smooth transfer between subsequent stages in the curriculum under certain conditions. We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance.
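    In one dimension the geodesic interpolation has a particularly simple form, which the sketch below uses for an assumed scalar task parameter: sorting gives the optimal coupling between two empirical distributions, and the curriculum stages are displacement interpolations between matched samples.

        import numpy as np

        def w2_interpolation(source, target, alphas):
            # source, target: (N,) sampled scalar task parameters (equal sizes assumed).
            s, t = np.sort(source), np.sort(target)     # optimal 1-D coupling
            return [(1 - a) * s + a * t for a in alphas]

        easy = np.random.default_rng(0).normal(0.0, 0.1, 256)   # e.g. low obstacle density
        hard = np.random.default_rng(1).normal(1.0, 0.3, 256)
        curriculum = w2_interpolation(easy, hard, alphas=np.linspace(0, 1, 6))
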
    Active Image Indexing. (arXiv:2210.10620v1 [cs.IR])
    Image copy detection and retrieval from large databases leverage two components. First, a neural network maps an image to a vector representation that is relatively robust to various transformations of the image. Second, an efficient but approximate similarity search algorithm trades scalability (size and speed) against quality of the search, thereby introducing a source of error. This paper improves the robustness of image copy detection with active indexing, which optimizes the interplay of these two components. We reduce the quantization loss of a given image representation by making imperceptible changes to the image before its release. The loss is back-propagated through the deep neural network back to the image, under perceptual constraints. These modifications make the image more retrievable. Our experiments show that the retrieval and copy detection of activated images is significantly improved. For instance, activation improves Recall1@1 by $+40\%$ under various image transformations, for several popular indexing structures based on product quantization and locality-sensitive hashing.
    DIAMBRA Arena: a New Reinforcement Learning Platform for Research and Experimentation. (arXiv:2210.10595v1 [cs.LG])
    The recent advances in reinforcement learning have led to effective methods able to obtain above human-level performance in very complex environments. However, once solved, these environments become less valuable, and new challenges with different or more complex scenarios are needed to support research advances. This work presents DIAMBRA Arena, a new platform for reinforcement learning research and experimentation, featuring a collection of high-quality environments exposing a Python API fully compliant with the OpenAI Gym standard. They are episodic tasks with discrete actions and observations composed of raw pixels plus additional numerical values, all supporting both single-player and two-player modes, enabling work on standard reinforcement learning, competitive multi-agent settings, human-agent competition, self-play, human-in-the-loop training, and imitation learning. Software capabilities are demonstrated by successfully training multiple deep reinforcement learning agents with proximal policy optimization, obtaining human-like behavior. The results confirm the utility of DIAMBRA Arena as a reinforcement learning research tool, providing environments designed to study some of the most challenging topics in the field.
    Differentiable Self-Adaptive Learning Rate. (arXiv:2210.10290v1 [cs.LG])
    Learning rate adaptation is a popular topic in machine learning. Plain gradient descent trains a neural network with a fixed learning rate; learning rate adaptation instead adjusts the step size during training to accelerate convergence. Well-known examples include Momentum, Adam, and Hypergradient. Hypergradient is the most distinctive of these: it adapts the learning rate by computing the derivative of the cost function with respect to the learning rate and applying gradient descent to the learning rate itself. However, Hypergradient is still not perfect. In practice, it frequently fails to decrease the training loss after adaptation, and there is evidence that it is ill-suited to large datasets under minibatch training. Worse, although it can drive the training loss to a very small value, it often fails to reach good accuracy on the validation set. To address these problems, we propose a novel adaptation algorithm in which the learning rate is parameter-specific and internally structured. We conduct extensive experiments on multiple network models and datasets, comparing against various benchmark optimizers. The results show that our algorithm achieves faster and higher-quality convergence than state-of-the-art optimizers.
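    For reference, the Hypergradient update criticized above fits in a few lines (a toy sketch with a single global learning rate and a quadratic loss): the learning rate is nudged by the dot product of consecutive gradients.

        import numpy as np

        def loss_grad(w):
            return 0.5 * (w ** 2).sum(), w          # toy quadratic loss

        w = np.array([5.0, -3.0])
        alpha, beta = 0.01, 1e-4                    # learning rate and hyper learning rate
        g_prev = np.zeros_like(w)
        for _ in range(100):
            _, g = loss_grad(w)
            alpha += beta * (g @ g_prev)            # hypergradient step on alpha itself
            w -= alpha * g
            g_prev = g
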
    Detecting and analyzing missing citations to published scientific entities. (arXiv:2210.10073v1 [cs.DL])
    Proper citation is of great importance in academic writing, for it enables knowledge accumulation and maintains academic integrity. However, citing properly is not an easy task. For published scientific entities, the ever-growing body of academic publications and over-familiarity with terms easily lead to missing citations. To deal with this, we design a dedicated method, Citation Recommendation for Published Scientific Entity (CRPSE), based on the co-occurrences between published scientific entities and in-text citations in the same sentences from previous researchers. Experimental outcomes show the effectiveness of our method in recommending the source papers for published scientific entities. We further conduct a statistical analysis of missing citations among papers published in prestigious computer science conferences in 2020. Among the 12,278 papers collected, 475 published scientific entities of computer science and mathematics were found to have missing citations, and many entities mentioned without citation turn out to be well-accepted research results. On a median basis, the papers proposing these entities were published 8 years earlier, which can be considered the time frame for a published scientific entity to develop into a well-accepted concept. For published scientific entities, we appeal for accurate and full citation of their source papers as required by academic standards.
    A Catch-22 of Reservoir Computing. (arXiv:2210.10211v1 [cs.LG])
    Reservoir Computing (RC) is a simple and efficient model-free framework for data-driven predictions of nonlinear dynamical systems. Recently, Next Generation Reservoir Computing (NGRC) has emerged as an especially attractive variant of RC. By shifting the nonlinearity from the reservoir to the readout layer, NGRC requires less data and has fewer hyperparameters to optimize, making it suitable for challenging tasks such as predicting basins of attraction. Here, using paradigmatic multistable systems including magnetic pendulums and coupled Kuramoto oscillators, we show that the performance of NGRC models can be extremely sensitive to the choice of readout nonlinearity. In particular, by incorporating the exact nonlinearity from the original equations, NGRC trained on a single trajectory can predict pseudo-fractal basins with almost perfect accuracy. However, even a small uncertainty in the exact nonlinearity can completely break NGRC, rendering the prediction accuracy no better than chance. This creates a catch-22 for NGRC since it may not be able to make useful predictions unless a key part of the system being predicted (i.e., its nonlinearity) is already known. Our results highlight the challenges faced by data-driven methods in learning complex dynamical systems.
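    To make the sensitivity concrete: an NGRC model is just a linear readout, fit by ridge regression, over a library of delayed states and hand-picked nonlinear features, so everything hinges on the library containing the right nonlinearity. A minimal sketch on a toy chaotic map (the map and the surrogate feature are illustrative assumptions, not the systems studied in the paper):

        import numpy as np

        # toy chaotic map whose true nonlinearity is a sine: x_{t+1} = sin(pi * x_t)
        x = np.empty(2000)
        x[0] = 0.2
        for t in range(1999):
            x[t + 1] = np.sin(np.pi * x[t])

        def fit_and_score(nonlin, lam=1e-8):
            # NGRC-style readout: ridge regression on [1, x_t, g(x_t)] features
            F = np.stack([np.ones_like(x[:-1]), x[:-1], nonlin(x[:-1])], axis=1)
            y = x[1:]
            W = np.linalg.solve(F.T @ F + lam * np.eye(3), F.T @ y)
            return np.sqrt(np.mean((F @ W - y) ** 2))

        print("exact nonlinearity :", fit_and_score(lambda v: np.sin(np.pi * v)))
        print("quadratic surrogate:", fit_and_score(lambda v: v * (1.0 - v)))

    The exact feature drives the one-step error to numerical zero, while the surrogate leaves a residual that compounds under iteration, which is how basin predictions break when the nonlinearity is mismatched.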
    Towards Explaining Distribution Shifts. (arXiv:2210.10275v1 [cs.LG])
    A distribution shift can have fundamental consequences such as signaling a change in the operating environment or significantly reducing the accuracy of downstream models. Thus, understanding distribution shifts is critical for examining and hopefully mitigating the effect of such a shift. Most prior work has focused on merely detecting if a shift has occurred and assumes any detected shift can be understood and handled appropriately by a human operator. We hope to aid in these manual mitigation tasks by explaining the distribution shift using interpretable transportation maps from the original distribution to the shifted one. We derive our interpretable mappings from a relaxation of the optimal transport problem, where the candidate mappings are restricted to a set of interpretable mappings. We then use quintessential examples of distribution shift in simulated and real-world cases to showcase how our explanatory mappings provide a better balance between detail and interpretability than the de facto standard mean shift explanation by both visual inspection and our PercentExplained metric.
    Output Feedback Tube MPC-Guided Data Augmentation for Robust, Efficient Sensorimotor Policy Learning. (arXiv:2210.10127v1 [cs.RO])
    Imitation learning (IL) can generate computationally efficient sensorimotor policies from demonstrations provided by computationally expensive model-based sensing and control algorithms. However, commonly employed IL methods are often data-inefficient, requiring the collection of a large number of demonstrations and producing policies with limited robustness to uncertainties. In this work, we combine IL with an output feedback robust tube model predictive controller (RTMPC) to co-generate demonstrations and a data augmentation strategy to efficiently learn neural network-based sensorimotor policies. Thanks to the augmented data, we reduce the computation time and the number of demonstrations needed by IL, while providing robustness to sensing and process uncertainty. We tailor our approach to the task of learning a trajectory tracking visuomotor policy for an aerial robot, leveraging a 3D mesh of the environment as part of the data augmentation process. We numerically demonstrate that our method can learn a robust visuomotor policy from a single demonstration -- a two-order-of-magnitude improvement in demonstration efficiency compared to existing IL methods.
    Multi-Objective Recommender Systems: Survey and Challenges. (arXiv:2210.10309v1 [cs.IR])
    Recommender systems can be characterized as software solutions that provide users convenient access to relevant content. Traditionally, recommender systems research predominantly focuses on developing machine learning algorithms that aim to predict which content is relevant for individual users. In real-world applications, however, optimizing the accuracy of such relevance predictions as a single objective in many cases is not sufficient. Instead, multiple and often competing objectives have to be considered, leading to a need for more research in multi-objective recommender systems. We can differentiate between several types of such competing goals, including (i) competing recommendation quality objectives at the individual and aggregate level, (ii) competing objectives of different involved stakeholders, (iii) long-term vs. short-term objectives, (iv) objectives at the user interface level, and (v) system level objectives. In this paper we review these types of multi-objective recommendation settings and outline open challenges in this area.
    Transferable Unlearnable Examples. (arXiv:2210.10114v1 [cs.LG])
    With more people publishing their personal data online, unauthorized data usage has become a serious concern. Unlearnable strategies have been introduced to prevent third parties from training on the data without permission: they add perturbations to users' data before publishing, aiming to invalidate models trained on the perturbed published dataset. These perturbations are generated for a specific training setting and a target dataset, however, and their unlearnable effects significantly decrease when used in other training settings and datasets. To tackle this issue, we propose a novel unlearnable strategy based on Classwise Separability Discriminant (CSD), which aims to better transfer the unlearnable effects to other training settings and datasets by enhancing the linear separability. Extensive experiments demonstrate the transferability of the proposed unlearnable examples across training settings and datasets.
    Canary in a Coalmine: Better Membership Inference with Ensembled Adversarial Queries. (arXiv:2210.10750v1 [cs.LG])
    As industrial applications are increasingly automated by machine learning models, enforcing personal data ownership and intellectual property rights requires tracing training data back to their rightful owners. Membership inference algorithms approach this problem by using statistical techniques to discern whether a target sample was included in a model's training set. However, existing methods only utilize the unaltered target sample or simple augmentations of the target to compute statistics. Such a sparse sampling of the model's behavior carries little information, leading to poor inference capabilities. In this work, we use adversarial tools to directly optimize for queries that are discriminative and diverse. Our improvements achieve significantly more accurate membership inference than existing methods, especially in offline scenarios and in the low false-positive regime which is critical in legal settings. Code is available at https://github.com/YuxinWenRick/canary-in-a-coalmine.
    Machine Learning for a Sustainable Energy Future. (arXiv:2210.10391v1 [cond-mat.mtrl-sci])
    Transitioning from fossil fuels to renewable energy sources is a critical global challenge; it demands advances at the levels of materials, devices, and systems for the efficient harvesting, storage, conversion, and management of renewable energy. Researchers globally have begun incorporating machine learning (ML) techniques with the aim of accelerating these advances. ML technologies leverage statistical trends in data to build models for the prediction of material properties, the generation of candidate structures, and the optimization of processes, among other uses; as a result, they can be incorporated into discovery and development pipelines to accelerate progress. Here we review recent advances in ML-driven energy research, outline current and future challenges, and describe what is required moving forward to best leverage ML techniques. To start, we give an overview of key ML concepts. We then introduce a set of key performance indicators to help compare the benefits of different ML-accelerated workflows for energy research. We discuss and evaluate the latest advances in applying ML to the development of energy harvesting (photovoltaics), storage (batteries), conversion (electrocatalysis), and management (smart grids). Finally, we offer an outlook of potential research areas in the energy field that stand to further benefit from the application of ML.
    CLEAR: Causal Explanations from Attention in Neural Recommenders. (arXiv:2210.10621v1 [cs.IR])
    We present CLEAR, a method for learning session-specific causal graphs, in the possible presence of latent confounders, from attention in pre-trained attention-based recommenders. These causal graphs describe user behavior, within the context captured by attention, and can provide a counterfactual explanation for a recommendation. In essence, these causal graphs allow answering "why" questions uniquely for any specific session. Using empirical evaluations we show that, compared to naively using attention weights to explain input-output relations, counterfactual explanations found by CLEAR are shorter and an alternative recommendation is ranked higher in the original top-k recommendations.
    Controlling Travel Path of Original Cobra. (arXiv:2210.10655v1 [cs.LG])
    In this paper we propose a kernel-based COBRA, a direct approximation of the original COBRA aggregation method. We propose a novel tuning procedure for the original COBRA parameters based on this kernel approximation. We show that our proposed algorithm provides much better accuracy than other COBRA variants and is faster than the usual grid-search-tuned COBRA. We illustrate our proposed methodology against existing COBRA variants on two datasets.
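    As a reminder of the mechanism being approximated: COBRA aggregates base learners by averaging the training responses of the points whose base-machine predictions lie close to the base-machine predictions at the query point; a kernelized version replaces that hard proximity rule with smooth kernel weights. A minimal sketch of the idea (the Gaussian kernel and bandwidth are illustrative assumptions, not the paper's tuned choices):

        import numpy as np

        def kernel_cobra(R_train, y_train, r_query, h=0.5):
            """R_train: (n, M) base-machine predictions on the n training points;
            r_query: (M,) base-machine predictions at the query point."""
            d2 = np.sum((R_train - r_query) ** 2, axis=1)
            w = np.exp(-d2 / (2.0 * h ** 2))       # smooth proximity weights
            return np.sum(w * y_train) / np.sum(w)

        # toy usage with two stand-in base machines
        rng = np.random.default_rng(1)
        X = rng.uniform(-2, 2, size=(200, 1))
        y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
        R = np.stack([0.8 * X[:, 0], np.sin(X[:, 0])], axis=1)
        print(kernel_cobra(R, y, r_query=np.array([0.8, np.sin(1.0)])))

    The smooth weights suggest why tuning the bandwidth can be easier and cheaper than running a full grid search over the original COBRA's hard-threshold parameters.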
    Towards Practical Explainability with Cluster Descriptors. (arXiv:2210.10662v1 [cs.LG])
    With the rapid development of machine learning, improving its explainability has become a crucial research goal. We study the problem of making clusters more explainable by investigating cluster descriptors. Given are a set of objects $S$, a clustering $\pi$ of these objects, and a set of tags $T$ that did not participate in the clustering algorithm; each object in $S$ is associated with a subset of $T$. The goal is to find a representative set of tags for each cluster, referred to as the cluster descriptors, with the constraints that the descriptors are pairwise disjoint and that the total size of all the descriptors is minimized. In general, this problem is NP-hard. We propose a novel explainability model that strengthens previous models so that tags which do not contribute to explainability and do not sufficiently distinguish between clusters are not added to the optimal descriptors. The proposed model is formulated as a quadratic unconstrained binary optimization (QUBO) problem, which makes it suitable for solving on modern optimization hardware accelerators. We experimentally demonstrate how the proposed explainability model can be solved on specialized hardware for accelerating combinatorial optimization, the Fujitsu Digital Annealer, and present use cases on real-life Twitter and PubMed datasets.
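    To illustrate the QUBO encoding at toy scale (this is a generic constraint encoding of the flavor described, not the paper's exact objective): with binary variables $x_{c,t}$ indicating that tag $t$ belongs to the descriptor of cluster $c$, the total descriptor size is a linear term, disjointness becomes a quadratic penalty on assigning one tag to two clusters, and a soft coverage term rewards giving each cluster at least one tag:

        import numpy as np
        from itertools import product

        C, T, P = 2, 3, 10.0                     # clusters, tags, penalty weight
        idx = lambda c, t: c * T + t
        n = C * T
        Q = np.zeros((n, n))

        for c in range(C):
            for t in range(T):
                Q[idx(c, t), idx(c, t)] += 1.0   # objective: total descriptor size
        for t in range(T):                       # disjointness: one cluster per tag
            Q[idx(0, t), idx(1, t)] += P
        for c in range(C):                       # soft coverage: P*(1 - sum_t x)^2,
            for t in range(T):                   # expanded using x^2 = x
                Q[idx(c, t), idx(c, t)] -= P
                for t2 in range(t + 1, T):
                    Q[idx(c, t), idx(c, t2)] += 2.0 * P

        # brute force stands in for the annealer at this toy size
        best = min(product([0, 1], repeat=n),
                   key=lambda v: np.array(v) @ Q @ np.array(v))
        print(np.array(best).reshape(C, T))      # one distinct tag per cluster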
    Predicting Oxide Glass Properties with Low Complexity Neural Network and Physical and Chemical Descriptors. (arXiv:2210.10507v1 [cond-mat.mtrl-sci])
    Due to their disordered structure, glasses present a unique challenge for predicting composition-property relationships. Recently, several attempts have been made to predict glass properties using machine learning techniques. However, these techniques have limitations, namely: (i) predictions are limited to the components that are present in the original dataset, and (ii) predictions towards the extreme values of the properties, important regions for new materials discovery, are not very reliable due to the sparse datapoints in this region. To address these challenges, here we present a low complexity neural network (LCNN) that provides improved performance in predicting the properties of oxide glasses. In addition, we combine the LCNN with physical and chemical descriptors that allow the development of universal models that can provide predictions for components beyond the training set. By training on a large dataset (~50,000) of glass compositions, we show that the LCNN outperforms state-of-the-art algorithms such as XGBoost. In addition, we interpret the LCNN models using Shapley additive explanations to gain insights into the role played by the descriptors in governing the property. Finally, we demonstrate the universality of the LCNN models by predicting the properties of glasses with new components that were not present in the original training set. Altogether, the present approach provides a promising direction towards the accelerated discovery of novel glass compositions.
    Multi-view Gait Recognition based on Siamese Vision Transformer. (arXiv:2210.10421v1 [cs.CV])
    While the Vision Transformer has been used in gait recognition, its application to multi-view gait recognition is still limited. Different views significantly affect the extraction of gait contour characteristics and the identification accuracy. To address this, this paper proposes a Siamese Mobile Vision Transformer (SMViT). This model not only focuses on the local characteristics of the human gait space but also considers long-distance attention associations, allowing it to extract multi-dimensional gait features. In addition, it describes how different perspectives affect gait characteristics and generates reliable perspective feature relationship factors. The average recognition rate of SMViT on the CASIA-B dataset reaches 96.4%. The experimental results show that SMViT attains state-of-the-art performance compared to advanced gait recognition models such as GaitGAN, Multi_view GAN and PoseGait.
    EGG-GAE: scalable graph neural networks for tabular data imputation. (arXiv:2210.10446v1 [cs.LG])
    Missing data imputation (MDI) is crucial when dealing with tabular datasets across various domains. Autoencoders can be trained to reconstruct missing values, and graph autoencoders (GAE) can additionally consider similar patterns in the dataset when imputing new values for a given instance. However, previously proposed GAEs suffer from scalability issues, requiring the user to define a similarity metric among patterns to build the graph connectivity beforehand. In this paper, we leverage recent progress in latent graph imputation to propose a novel EdGe Generation Graph AutoEncoder (EGG-GAE) for missing data imputation that overcomes these two drawbacks. EGG-GAE works on randomly sampled mini-batches of the input data (hence scaling to larger datasets), and it automatically infers the best connectivity across the mini-batch for each architecture layer. We also experiment with several extensions, including an ensemble strategy for inference and the inclusion of what we call prototype nodes, obtaining significant improvements, both in terms of imputation error and final downstream accuracy, across multiple benchmarks and baselines.
    Hierarchical Multi-Interest Co-Network For Coarse-Grained Ranking. (arXiv:2210.10547v1 [cs.IR])
    In this era of information explosion, personalized recommendation systems make it convenient for users to obtain the information they are interested in. To deal with billions of users and items, large-scale online recommendation services usually consist of three stages: candidate generation, coarse-grained ranking, and fine-grained ranking. The success of each stage depends on whether the model accurately captures the interests of users, which are usually hidden in users' behavior data. Previous research shows that users' interests are diverse, and one vector is not sufficient to capture users' different preferences. Therefore, many methods use multiple vectors to encode users' interests. However, there are two unsolved problems: (1) The similarity of different vectors in existing methods is too high, with too much redundant information. Consequently, the interests of users are not fully represented. (2) Existing methods model the long-term and short-term behaviors together, ignoring the differences between them. This paper proposes a Hierarchical Multi-Interest Co-Network (HCN) to capture users' diverse interests in the coarse-grained ranking stage. Specifically, we design a hierarchical multi-interest extraction layer to update users' diverse interest centers iteratively. The multiple embedded vectors obtained in this way contain more information and better represent the interests of users in various aspects. Furthermore, we develop a Co-Interest Network to integrate users' long-term and short-term interests. Experiments on several real-world datasets and one large-scale industrial dataset show that HCN effectively outperforms the state-of-the-art methods. We deploy HCN into a large-scale real-world E-commerce system and achieve an extra 2.5\% improvement in GMV (Gross Merchandise Value).
    The Devil in Linear Transformer. (arXiv:2210.10340v1 [cs.CL])
    Linear transformers aim to reduce the quadratic space-time complexity of vanilla transformers. However, they usually suffer from degraded performance on various tasks and corpora. In this paper, we examine existing kernel-based linear transformers and identify two key issues that lead to such performance gaps: 1) unbounded gradients in the attention computation adversely impact the convergence of linear transformer models; 2) attention dilution, which trivially distributes attention scores over long sequences while neglecting neighbouring structures. To address these issues, we first identify that the scaling of attention matrices is the devil behind the unbounded gradients, and that this scaling turns out to be unnecessary in linear attention, as we show theoretically and empirically. To this end, we propose a new linear attention that replaces the scaling operation with a normalization to stabilize gradients. For the issue of attention dilution, we leverage a diagonal attention to confine attention to only neighbouring tokens in early layers. Benefiting from the stable gradients and improved attention, our new linear transformer model, transNormer, demonstrates superior performance on text classification and language modeling tasks, as well as on the challenging Long-Range Arena benchmark, surpassing vanilla transformer and existing linear variants by a clear margin while being significantly more space-time efficient. The code is available at https://github.com/OpenNLPLab/Transnormer .
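    To see what replacing the scaling with a normalization amounts to: kernel-based linear attention computes $\phi(Q)(\phi(K)^\top V)$ and classically divides each row by $\phi(Q)(\phi(K)^\top \mathbf{1})$, the denominator blamed here for unbounded gradients. A sketch of the alternative (the feature map and the use of an RMS norm are illustrative assumptions, not the paper's exact architecture):

        import numpy as np

        def phi(x):                       # common positive feature map: elu(x) + 1
            return np.where(x > 0, x + 1.0, np.exp(x))

        def rms_norm(x, eps=1e-6):
            return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

        def linear_attention(Q, K, V, scaled=True):
            Qf, Kf = phi(Q), phi(K)
            out = Qf @ (Kf.T @ V)         # O(n d^2) instead of O(n^2 d)
            if scaled:                    # classic row-wise scaling (the "devil")
                z = Qf @ Kf.sum(axis=0, keepdims=True).T
                return out / z
            return rms_norm(out)          # normalization in place of scaling

        rng = np.random.default_rng(0)
        Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
        print(linear_attention(Q, K, V, scaled=False).shape)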
    Multi-view Tracking Using Weakly Supervised Human Motion Prediction. (arXiv:2210.10771v1 [cs.CV])
    Multi-view approaches to people-tracking have the potential to better handle occlusions than single-view ones in crowded scenes. They often rely on the tracking-by-detection paradigm, which involves detecting people first and then connecting the detections. In this paper, we argue that an even more effective approach is to predict people's motion over time and to infer people's presence in individual frames from these predictions. This enables the enforcement of consistency both over time and across views of a single temporal frame. We validate our approach on the PETS2009 and WILDTRACK datasets and demonstrate that it outperforms state-of-the-art methods.
    Explainable Slot Type Attentions to Improve Joint Intent Detection and Slot Filling. (arXiv:2210.10227v1 [cs.LG])
    Joint intent detection and slot filling is a key research topic in natural language understanding (NLU). Existing joint intent and slot filling systems analyze and compute features collectively for all slot types, and importantly, have no way to explain the slot filling model decisions. In this work, we propose a novel approach that: (i) learns to generate additional slot type specific features in order to improve accuracy and (ii) provides explanations for slot filling decisions for the first time in a joint NLU model. We perform an additional constrained supervision using a set of binary classifiers for the slot type specific feature learning, thus ensuring appropriate attention weights are learned in the process to explain slot filling decisions for utterances. Our model is inherently explainable and does not need any post-hoc processing. We evaluate our approach on two widely used datasets and show accuracy improvements. Moreover, a detailed analysis is also provided for the exclusive slot explainability.
    A new activation for neural networks and its approximation. (arXiv:2210.10264v1 [cs.LG])
    Deep learning with deep neural networks (DNNs) has recently attracted tremendous attention from various fields of science and technology. Activation functions for a DNN define the output of a neuron given an input or set of inputs. They are essential and inevitable for learning non-linear transformations and performing diverse computations among successive neuron layers. Thus, the design of activation functions is still an important topic in deep learning research. Meanwhile, theoretical studies on the approximation ability of DNNs with activation functions have been carried out within the last few years. In this paper, we propose a new activation function, named "DLU", and investigate its approximation ability for functions with various smoothness and structures. Our theoretical results show that DLU networks can achieve approximation performance competitive with rational and ReLU networks, with some additional advantages. Numerical experiments comparing DLU with existing activations (ReLU, Leaky ReLU, and ELU) illustrate the good practical performance of DLU.
    DALLE-2 is Seeing Double: Flaws in Word-to-Concept Mapping in Text2Image Models. (arXiv:2210.10606v1 [cs.CL])
    We study the way DALLE-2 maps symbols (words) in the prompt to their references (entities or properties of entities in the generated image). We show that, in stark contrast to the way humans process language, DALLE-2 does not follow the constraint that each word has a single role in the interpretation, and sometimes re-uses the same symbol for different purposes. We collect a set of stimuli that reflect the phenomenon: we show that DALLE-2 depicts both senses of nouns with multiple senses at once; and that a given word can modify the properties of two distinct entities in the image, or can be depicted as one object while also modifying the properties of another object, creating a semantic leakage of properties between entities. Taken together, our study highlights the differences between DALLE-2 and human language processing and opens an avenue for future study on the inductive biases of text-to-image models.
    Semi-Supervised Adversarial Discriminative Domain Adaptation. (arXiv:2109.13016v2 [cs.CV] UPDATED)
    Domain adaptation is a potential method for training a powerful deep neural network that can handle the absence of labeled data. More precisely, domain adaptation addresses the limitation known as dataset bias or domain shift, which occurs when the training and testing datasets are extremely different. Adversarial adaptation methods have become popular among domain adaptation methods. Relying on the idea of GANs, adversarial domain adaptation tries to minimize the distribution discrepancy between the training and testing datasets based on an adversarial objective. However, some conventional adversarial domain adaptation methods cannot handle large domain shifts between two datasets, or their generalization ability is insufficient. In this paper, we propose an improved adversarial domain adaptation method called Semi-Supervised Adversarial Discriminative Domain Adaptation (SADDA), which can overcome these limitations of other domain adaptation methods. We also show that SADDA performs better than other adversarial adaptation methods and illustrate the promise of our method on digit classification and emotion recognition problems.
    AnalogVNN: A fully modular framework for modeling and optimizing photonic neural networks. (arXiv:2210.10048v1 [cs.LG])
    In this paper, we present AnalogVNN, a simulation framework built on PyTorch which can simulate the effects of optoelectronic noise, limited precision, and signal normalization present in photonic neural network accelerators. We use this framework to train and optimize linear and convolutional neural networks with up to 9 layers and ~1.7 million parameters, while gaining insights into how normalization, activation function, reduced precision, and noise influence accuracy in analog photonic neural networks. By following the same layer structure design present in PyTorch, the AnalogVNN framework allows users to convert most digital neural network models to their analog counterparts with just a few lines of code, taking full advantage of the open-source optimization, deep learning, and GPU acceleration libraries available through PyTorch.
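    As a flavor of what such a conversion involves (this is a generic PyTorch pattern, not AnalogVNN's actual API): analog behavior can be emulated by wrapping a digital layer so that its output passes through reduced-precision rounding and additive Gaussian noise:

        import torch
        import torch.nn as nn

        class AnalogWrapper(nn.Module):
            """Illustrative analog-effects wrapper: reduced precision plus read noise."""
            def __init__(self, layer, bits=4, noise_std=0.01):
                super().__init__()
                self.layer = layer
                self.levels = 2 ** bits
                self.noise_std = noise_std

            def forward(self, x):
                y = self.layer(x)
                # limited precision; training through round() needs a
                # straight-through estimator, since rounding has zero gradient
                y = torch.round(y * self.levels) / self.levels
                return y + self.noise_std * torch.randn_like(y)  # optoelectronic noise

        model = nn.Sequential(AnalogWrapper(nn.Linear(16, 8)), nn.ReLU(),
                              AnalogWrapper(nn.Linear(8, 2)))
        print(model(torch.randn(4, 16)).shape)   # torch.Size([4, 2])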
    Combination of Raman spectroscopy and chemometrics: A review of recent studies published in the Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy Journal. (arXiv:2210.10051v1 [q-bio.QM])
    Raman spectroscopy is a promising technique used for noninvasive analysis of samples in various fields of application due to its ability for fingerprint probing of samples at the molecular level. Chemometrics methods are widely used nowadays for better understanding of the recorded spectral fingerprints of samples and differences in their chemical composition. This review considers a number of manuscripts published in the Spectrochimica Acta, Part A: Molecular and Biomolecular Spectroscopy Journal that presented findings regarding the application of Raman spectroscopy in combination with chemometrics to study samples and their changes caused by different factors. In the 57 reviewed manuscripts, we analyzed the application of chemometrics algorithms, statistical modeling parameters, the utilization of cross-validation, sample sizes, as well as the performance of the proposed classification and regression models. We summarized the best strategies for creating classification models and highlighted some common drawbacks in the application of chemometrics techniques. According to our estimations, about 70% of the papers are likely to contain unsupported or invalid data due to insufficient description of the utilized methods or drawbacks of the proposed classification models. These drawbacks include: (1) insufficient experimental sample size for classification/regression to achieve significant and reliable results; (2) lack of cross-validation (or a test set) for verification of the classifier/regression performance; (3) incorrect division of the spectral data into the training and the test/validation sets; and (4) improper selection of the number of principal components (PCs) used to reduce the dimension of the analyzed spectral data.
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v1 [stat.ML])
    We propose a goodness-of-fit measure for probability densities modelling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalised densities. Existing KSDs require the model to be defined on a fixed-dimension space. As our major contributions, we extend the KSD to the variable dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalised, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.
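    For reference, the fixed-dimension KSD being generalized here compares a sample from $q$ against a model density $p$ using only the score $s_p = \nabla \log p$, which is why $p$ need not be normalised:
    $$\mathrm{KSD}^2(p, q) = \mathbb{E}_{x, x' \sim q}\big[u_p(x, x')\big], \qquad u_p(x, x') = s_p(x)^\top k(x, x')\, s_p(x') + s_p(x)^\top \nabla_{x'} k(x, x') + \nabla_x k(x, x')^\top s_p(x') + \operatorname{tr}\big(\nabla_x \nabla_{x'} k(x, x')\big),$$
    with $k$ a reproducing kernel. The contribution here is to choose Stein operators so that an analogous quantity remains well-defined when $x$ and $x'$ may have different dimensions.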
    DyTed: Disentangling Temporal Invariance and Fluctuations in Dynamic Graph Representation Learning. (arXiv:2210.10592v1 [cs.SI])
    Unsupervised representation learning for dynamic graphs has attracted a lot of research attention in recent years. Compared with static graphs, dynamic graphs jointly reflect both the temporally invariant (stable) characteristics of nodes and their dynamically fluctuating preferences that change over time. However, existing dynamic graph representation learning methods generally confound these two types of information into a shared representation space, which may lead to poor explanation, less robustness, and limited ability when applied to different downstream tasks. Taking the real dynamic graphs of daily capital transactions on Tencent as an example, the learned representation of the state-of-the-art method achieves only 32% accuracy in predicting temporally invariant characteristics of users such as annual income. In this paper, we introduce a novel temporal invariance-fluctuation disentangled representation learning framework for dynamic graphs, namely DyTed. In particular, we propose a temporal-invariant representation generator and a dynamic-fluctuation representation generator with carefully designed pretext tasks to identify the two types of representations in dynamic graphs. To further enhance the disentanglement, we propose a disentanglement-aware discriminator under an adversarial learning framework. Extensive experiments on Tencent and five commonly used public datasets demonstrate that the different parts of our disentangled representation achieve state-of-the-art performance on various downstream tasks and are more robust against noise, and that DyTed is a general framework which can further improve existing methods.  ( 3 min )
    Auditing YouTube's Recommendation Algorithm for Misinformation Filter Bubbles. (arXiv:2210.10085v1 [cs.IR])
    In this paper, we present results of an auditing study performed over YouTube aimed at investigating how fast a user can get into a misinformation filter bubble, but also what it takes to "burst the bubble", i.e., revert the bubble enclosure. We employ a sock puppet audit methodology, in which pre-programmed agents (acting as YouTube users) delve into misinformation filter bubbles by watching misinformation promoting content. Then they try to burst the bubbles and reach more balanced recommendations by watching misinformation debunking content. We record search results, home page results, and recommendations for the watched videos. Overall, we recorded 17,405 unique videos, out of which we manually annotated 2,914 for the presence of misinformation. The labeled data was used to train a machine learning model classifying videos into three classes (promoting, debunking, neutral) with an accuracy of 0.82. We use the trained model to classify the remaining videos that would not be feasible to annotate manually. Using both the manually and automatically annotated data, we observe the misinformation bubble dynamics for a range of audited topics. Our key finding is that even though filter bubbles do not appear in some situations, when they do, it is possible to burst them by watching misinformation debunking content (although it manifests differently from topic to topic). We also observe a sudden decrease of the misinformation filter bubble effect when misinformation debunking videos are watched after misinformation promoting videos, suggesting a strong contextuality of recommendations. Finally, when comparing our results with a previous similar study, we do not observe significant improvements in the overall quantity of recommended misinformation content.  ( 3 min )
    Graph Attention Networks Unveil Determinants of Intra- and Inter-city Health Disparity. (arXiv:2210.10142v1 [cs.LG])
    Understanding the determinants underlying variations in urban health status is important for informing urban design and planning, as well as public health policies. Multiple heterogeneous urban features could modulate the prevalence of diseases across different neighborhoods in cities and across different cities. This study examines heterogeneous features related to socio-demographics, population activity, mobility, and the built environment and their non-linear interactions to examine intra- and inter-city disparity in prevalence of four disease types: obesity, diabetes, cancer, and heart disease. Features related to population activity, mobility, and facility density are obtained from large-scale anonymized mobility data. These features are used in training and testing graph attention network (GAT) models to capture non-linear feature interactions as well as spatial interdependence among neighborhoods. We tested the models in five U.S. cities across the four disease types. The results show that the GAT model can predict the health status of people in neighborhoods based on the top five determinant features. The findings unveil that population activity and built-environment features along with socio-demographic features differentiate the health status of neighborhoods to such a great extent that a GAT model could predict the health status using these features with high accuracy. The results also show that the model trained on one city can predict health status in another city with high accuracy, allowing us to quantify the inter-city similarity and discrepancy in health status. The model and findings provide novel approaches and insights for urban designers, planners, and public health officials to better understand and improve health disparities in cities by considering the significant determinant features and their interactions.  ( 3 min )
    Nonparametric Quantile Regression: Non-Crossing Constraints and Conformal Prediction. (arXiv:2210.10161v1 [stat.ML])
    We propose a nonparametric quantile regression method using deep neural networks with a rectified linear unit penalty function to avoid quantile crossing. This penalty function is computationally feasible for enforcing non-crossing constraints in multi-dimensional nonparametric quantile regression. We establish non-asymptotic upper bounds for the excess risk of the proposed nonparametric quantile regression function estimators. Our error bounds achieve the optimal minimax rate of convergence for the Hölder class, and the prefactors of the error bounds depend polynomially on the dimension of the predictor, instead of exponentially. Based on the proposed non-crossing penalized deep quantile regression, we construct conformal prediction intervals that are fully adaptive to heterogeneity. The proposed prediction interval is shown to have good properties in terms of validity and accuracy under reasonable conditions. We also derive non-asymptotic upper bounds for the difference of the lengths between the proposed non-crossing conformal prediction interval and the theoretically oracle prediction interval. Numerical experiments including simulation studies and a real data example are conducted to demonstrate the effectiveness of the proposed method.  ( 2 min )
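    Concretely, for quantile levels $\tau_1 < \dots < \tau_K$ estimated jointly by networks $\hat f_{\tau_k}$, a ReLU-type non-crossing penalty of the kind described takes the form (the exact weighting is a detail of the paper)
    $$\min_{\hat f} \sum_{k=1}^{K} \sum_{i=1}^{n} \rho_{\tau_k}\big(y_i - \hat f_{\tau_k}(x_i)\big) + \lambda \sum_{k=1}^{K-1} \sum_{i=1}^{n} \max\big(0,\; \hat f_{\tau_k}(x_i) - \hat f_{\tau_{k+1}}(x_i)\big),$$
    where $\rho_\tau(u) = u(\tau - \mathbf{1}\{u < 0\})$ is the check loss; the penalty vanishes exactly when the estimated quantile curves do not cross on the training inputs.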
    Fast Approximation of the Generalized Sliced-Wasserstein Distance. (arXiv:2210.10268v1 [stat.ML])
    Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of probability distributions. Similar to the sliced Wasserstein distance, the generalized sliced Wasserstein distance is defined as an expectation over random projections which can be approximated by the Monte Carlo method. However, the complexity of that approximation can be expensive in high-dimensional settings. To that end, we propose to form deterministic and fast approximations of the generalized sliced Wasserstein distance by using the concentration of random projections when the defining functions are polynomial functions, circular functions, and neural-network-type functions. Our approximations hinge upon an important result that one-dimensional projections of a high-dimensional random vector are approximately Gaussian.  ( 2 min )
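    For reference, with a defining function $g_\theta$ (choosing $g_\theta(x) = \langle \theta, x \rangle$ recovers the ordinary sliced distance), the generalized sliced Wasserstein distance and its Monte Carlo estimate are
    $$\mathrm{GSW}_p(\mu, \nu) = \Big( \int_{\Theta} W_p^p\big((g_\theta)_{\#}\mu, (g_\theta)_{\#}\nu\big)\, d\sigma(\theta) \Big)^{1/p} \approx \Big( \frac{1}{L} \sum_{l=1}^{L} W_p^p\big((g_{\theta_l})_{\#}\mu, (g_{\theta_l})_{\#}\nu\big) \Big)^{1/p};$$
    the deterministic approximations proposed here replace the $L$ random projections, whose number must grow in high dimensions, with closed-form surrogates derived from the near-Gaussianity of one-dimensional projections.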
    Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection. (arXiv:2210.10487v1 [cs.LG])
    Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called the contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage the outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.  ( 2 min )
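    The role of the contamination factor is easy to state in code: it is the quantile at which anomaly scores are cut. A minimal sketch of moving from a point estimate to a posterior-based threshold (the Beta posterior and the synthetic scores are stand-ins for the paper's estimated distribution and real detector outputs):

        import numpy as np

        rng = np.random.default_rng(0)
        scores = rng.normal(size=1000)      # anomaly scores from some detector
        scores[:30] += 4.0                  # inject a few genuinely anomalous points

        # thresholding with a point estimate of the contamination factor gamma
        gamma_hat = 0.03
        flags = scores > np.quantile(scores, 1.0 - gamma_hat)

        # with a posterior over gamma (a stand-in Beta distribution here),
        # threshold at the posterior mean, as the abstract suggests
        gamma_post = rng.beta(3.0, 97.0, size=2000)     # mean roughly 0.03
        thr = np.quantile(scores, 1.0 - gamma_post.mean())
        print(flags.sum(), (scores > thr).sum())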
    Few-shot Transferable Robust Representation Learning via Bilevel Attacks. (arXiv:2210.10485v1 [cs.LG])
    Existing adversarial learning methods for enhancing the robustness of deep neural networks assume the availability of a large amount of data from which we can generate adversarial examples. However, in an adversarial meta-learning setting, the model needs to train with only a few adversarial examples to learn a robust model for unseen tasks, which is a very difficult goal to achieve. Further, learning transferable robust representations for unseen domains is a difficult problem even with a large amount of data. To tackle such a challenge, we propose a novel adversarial self-supervised meta-learning framework with bilevel attacks which aims to learn robust representations that can generalize across tasks and domains. Specifically, in the inner loop, we update the parameters of the given encoder by taking inner gradient steps using two different sets of augmented samples, and generate adversarial examples for each view by maximizing the instance classification loss. Then, in the outer loop, we meta-learn the encoder parameter to maximize the agreement between the two adversarial examples, which enables it to learn robust representations. We experimentally validate the effectiveness of our approach on unseen domain adaptation tasks, on which it achieves impressive performance. Specifically, our method significantly outperforms the state-of-the-art meta-adversarial learning methods on few-shot learning tasks, as well as self-supervised learning baselines in standard learning settings with large-scale datasets.  ( 3 min )
    Uncertainty in Extreme Multi-label Classification. (arXiv:2210.10160v1 [cs.LG])
    Uncertainty quantification is one of the most crucial tasks to obtain trustworthy and reliable machine learning models for decision making. However, most research in this domain has only focused on problems with small label spaces and ignored eXtreme Multi-label Classification (XMC), which is an essential task in the era of big data for web-scale machine learning applications. Moreover, enormous label spaces could also lead to noisy retrieval results and intractable computational challenges for uncertainty quantification. In this paper, we aim to investigate general uncertainty quantification approaches for tree-based XMC models with a probabilistic ensemble-based framework. In particular, we analyze label-level and instance-level uncertainty in XMC, and propose a general approximation framework based on beam search to efficiently estimate the uncertainty with a theoretical guarantee under long-tail XMC predictions. Empirical studies on six large-scale real-world datasets show that our framework not only outperforms single models in predictive performance, but also can serve as strong uncertainty-based baselines for label misclassification and out-of-distribution detection, with significant speedup. Besides, our framework can further yield better state-of-the-art results based on deep XMC models with uncertainty quantification.  ( 2 min )
    Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval. (arXiv:2210.10486v1 [cs.CV])
    Representation learning for sketch-based image retrieval has mostly been tackled by learning embeddings that discard modality-specific information. As instances from different modalities can often provide complementary information describing the underlying concept, we propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding them. Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities. We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation. Such encoders can then be applied to downstream tasks like cross-modal retrieval. We demonstrate the expressive capacity of the learned representations by performing a wide range of experiments and achieving state-of-the-art results on three fine-grained sketch-based image retrieval benchmarks: Shoe-V2, Chair-V2 and Sketchy. Implementation is available at https://github.com/abhrac/xmodal-vit.  ( 2 min )
    TEFL: Turbo Explainable Federated Learning for 6G Trustworthy Zero-Touch Network Slicing. (arXiv:2210.10147v1 [cs.IT])
    Sixth-generation (6G) networks anticipate intelligently supporting a massive number of coexisting and heterogeneous slices associated with various vertical use cases. Such a context urges the adoption of artificial intelligence (AI)-driven zero-touch management and orchestration (MANO) of the end-to-end (E2E) slices under stringent service level agreements (SLAs). Specifically, the trustworthiness of the AI black-boxes in real deployment can be achieved by explainable AI (XAI) tools to build transparency between the interacting actors in the slicing ecosystem, such as tenants, infrastructure providers and operators. Inspired by the turbo principle, this paper presents a novel iterative explainable federated learning (FL) approach where a constrained resource allocation model and an \emph{explainer} exchange -- in a closed loop (CL) fashion -- soft attributions of the features as well as inference predictions to achieve a transparent and SLA-aware zero-touch service management (ZSM) of 6G network slices at RAN-Edge setup under non-independent identically distributed (non-IID) datasets. In particular, we quantitatively validate the faithfulness of the explanations via the so-called attribution-based \emph{confidence metric} that is included as a constraint in the run-time FL optimization task. In this respect, Integrated-Gradient (IG) as well as Input $\times$ Gradient and SHAP are used to generate the attributions for the turbo explainable FL (TEFL), wherefore simulation results under different methods confirm its superiority over an unconstrained Integrated-Gradient \emph{post-hoc} FL baseline.  ( 3 min )
    Adaptive Neural Network Ensemble Using Frequency Distribution. (arXiv:2210.10360v1 [cs.LG])
    Neural network (NN) ensembles can reduce the large prediction variance of NNs and improve prediction accuracy. For highly nonlinear problems with an insufficient dataset, the prediction accuracy of NN models becomes unstable, resulting in a decrease in the accuracy of ensembles. Therefore, this study proposes a frequency distribution-based ensemble that identifies core prediction values, which are expected to be concentrated near the true prediction value. The frequency distribution-based ensemble identifies core prediction values supported by multiple ensemble members through statistical analysis of the frequency distribution of the prediction values obtained at a given prediction point. The frequency distribution-based ensemble can improve predictive performance by excluding prediction values with low accuracy and coping with the uncertainty of the most frequent value. An adaptive sampling strategy that sequentially adds samples based on the core prediction variance, calculated as the variance of the core prediction values, is proposed to improve the predictive performance of the frequency distribution-based ensemble efficiently. Results of various case studies show that the prediction accuracy of the frequency distribution-based ensemble is higher than that of Kriging and other existing ensemble methods. In addition, the proposed adaptive sampling strategy effectively improves the predictive performance of the frequency distribution-based ensemble compared with the previously developed space-filling and prediction variance-based strategies.  ( 2 min )
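    Read operationally, the procedure at a given prediction point is: collect the ensemble members' predictions, histogram them, keep the values falling in the most populated bin as the core prediction values, and average them; the variance of those core values then drives the adaptive sampling. A sketch under that reading (the bin count and the synthetic predictions are illustrative assumptions):

        import numpy as np

        def core_prediction(preds, n_bins=10):
            """preds: 1D array of ensemble-member predictions at one point.
            Returns mean and variance of the values in the modal histogram bin."""
            counts, edges = np.histogram(preds, bins=n_bins)
            b = np.argmax(counts)
            core = preds[(preds >= edges[b]) & (preds <= edges[b + 1])]
            return core.mean(), core.var()

        rng = np.random.default_rng(0)
        # 16 members agree near 2.0; 4 stragglers sit near 5.0
        preds = np.concatenate([rng.normal(2.0, 0.05, 16), rng.normal(5.0, 0.1, 4)])
        mean, var = core_prediction(preds)   # stragglers are excluded
        print(mean, var)                     # var ranks candidate sample points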
    Granger causal inference on DAGs identifies genomic loci regulating transcription. (arXiv:2210.10168v1 [cs.LG])
    When a dynamical system can be modeled as a sequence of observations, Granger causality is a powerful approach for detecting predictive interactions between its variables. However, traditional Granger causal inference has limited utility in domains where the dynamics need to be represented as directed acyclic graphs (DAGs) rather than as a linear sequence, such as with cell differentiation trajectories. Here, we present GrID-Net, a framework based on graph neural networks with lagged message passing for Granger causal inference on DAG-structured systems. Our motivating application is the analysis of single-cell multimodal data to identify genomic loci that mediate the regulation of specific genes. To our knowledge, GrID-Net is the first single-cell analysis tool that accounts for the temporal lag between a genomic locus becoming accessible and its downstream effect on a target gene's expression. We applied GrID-Net on multimodal single-cell assays that profile chromatin accessibility (ATAC-seq) and gene expression (RNA-seq) in the same cell and show that it dramatically outperforms existing methods for inferring regulatory locus-gene links, achieving up to 71% greater agreement with independent population genetics-based estimates. By extending Granger causality to DAG-structured dynamical systems, our work unlocks new domains for causal analyses and, more specifically, opens a path towards elucidating gene regulatory interactions relevant to cellular differentiation and complex human diseases at unprecedented scale and resolution.  ( 3 min )
    Explainable bilevel optimization: an application to the Helsinki deblur challenge. (arXiv:2210.10050v1 [eess.IV])
    In this paper we present a bilevel optimization scheme for the solution of a general image deblurring problem, in which a parametric variational-like approach is encapsulated within a machine learning scheme to provide a high quality reconstructed image with automatically learned parameters. The ingredients of the variational lower level and the machine learning upper one are specifically chosen for the Helsinki Deblur Challenge 2021, in which sequences of letters are asked to be recovered from out-of-focus photographs with increasing levels of blur. Our proposed procedure for the reconstructed image consists in a fixed number of FISTA iterations applied to the minimization of an edge preserving and binarization enforcing regularized least-squares functional. The parameters defining the variational model and the optimization steps, which, unlike most deep learning approaches, all have a precise and interpretable meaning, are learned via either a similarity index or a support vector machine strategy. Numerical experiments on the test images provided by the challenge authors show significant gains with respect to a standard variational approach and performances comparable with those of some of the proposed deep learning based algorithms which require the optimization of millions of parameters.  ( 2 min )
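    For readers unfamiliar with the inner solver: FISTA alternates a gradient step on the least-squares term with the proximal operator of the regularizer, plus a Nesterov momentum step. A generic skeleton is sketched below, with soft-thresholding as a stand-in prox, since the challenge model's edge-preserving and binarization-enforcing regularizer has its own proximal map:

        import numpy as np

        def fista(A, b, prox, lam, n_iter=200):
            """Minimize 0.5*||Ax - b||^2 + lam*R(x), given the prox of R."""
            L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
            x = np.zeros(A.shape[1])
            y, t = x.copy(), 1.0
            for _ in range(n_iter):
                g = A.T @ (A @ y - b)           # gradient of the data-fit term
                x_new = prox(y - g / L, lam / L)
                t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
                y = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum
                x, t = x_new, t_new
            return x

        soft = lambda v, s: np.sign(v) * np.maximum(np.abs(v) - s, 0.0)
        A = np.random.default_rng(0).normal(size=(40, 60))
        b = A @ (np.arange(60) < 5).astype(float)
        print(np.round(fista(A, b, soft, lam=0.1)[:8], 2))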
    High-Dimensional Performance Modeling via Tensor Completion. (arXiv:2210.10184v1 [cs.PF])
    Performance tuning, software/hardware co-design, and job scheduling are among the many tasks that rely on models to predict application performance. We propose and evaluate low rank tensor decomposition for modeling application performance. We use tensors to represent regular grids that discretize the input and configuration domain of an application. Application execution times mapped within grid-cells are averaged and represented by tensor elements. We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors. We then employ tensor completion to optimize a CP decomposition given a sparse set of observed runtimes. We consider alternative piecewise/grid-based (P/G) and supervised learning models for six applications and demonstrate that P/G models are significantly more accurate relative to model size. Among P/G models, CP decomposition of regular grids (CPR) offers higher accuracy and memory-efficiency, faster optimization, and superior extensibility via user-selected loss functions and domain partitioning. CPR models achieve a 2.18x geometric mean decrease in mean prediction error relative to the most accurate alternative models of size $\le$10 kilobytes.  ( 2 min )
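    The core computation is easy to sketch: fit rank-$r$ CP factors to only the observed grid cells by minimizing a masked squared error, then read predictions for unobserved cells off the low-rank reconstruction. A self-contained gradient-descent version for a 3-way tensor (libraries such as TensorLy offer masked CP fitting; the sizes, rank, and learning rate here are illustrative):

        import numpy as np

        rng = np.random.default_rng(0)
        I, J, K, r = 10, 10, 10, 3
        # ground-truth low-rank tensor and a sparse observation mask
        U0, V0, W0 = (rng.normal(size=(n, r)) for n in (I, J, K))
        T = np.einsum('ir,jr,kr->ijk', U0, V0, W0)
        mask = rng.random((I, J, K)) < 0.3

        U, V, W = (0.1 * rng.normal(size=(n, r)) for n in (I, J, K))
        lr = 0.01
        for _ in range(3000):
            R = mask * (np.einsum('ir,jr,kr->ijk', U, V, W) - T)  # masked residual
            U -= lr * np.einsum('ijk,jr,kr->ir', R, V, W)
            V -= lr * np.einsum('ijk,ir,kr->jr', R, U, W)
            W -= lr * np.einsum('ijk,ir,jr->kr', R, U, V)

        pred = np.einsum('ir,jr,kr->ijk', U, V, W)
        held_out = ~mask
        rel = np.linalg.norm(held_out * (pred - T)) / np.linalg.norm(held_out * T)
        print("relative error on unobserved cells:", rel)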
    Comparing Spectroscopy Measurements in the Prediction of in Vitro Dissolution Profile using Artificial Neural Networks. (arXiv:2210.10292v1 [cs.LG])
    Dissolution testing is part of the target product quality that is essential in approving new products in the pharmaceutical industry. The prediction of the dissolution profile based on spectroscopic data is an alternative to the current destructive and time-consuming method. Raman and near-infrared (NIR) spectroscopies are two fast and complementary methods that provide information on the tablets' physical and chemical properties and can help predict their dissolution profiles. This work aims to compare the information collected by these spectroscopy methods to support the decision of which measurements should be used so that the accuracy requirement of the industry is met. Artificial neural network models were created, in which the spectroscopy data and the measured compression curves were used as an input individually and in different combinations in order to estimate the dissolution profiles. Results showed that using only the NIR transmission method along with the compression force data or the Raman and NIR reflection methods, the dissolution profile was estimated within the acceptance limits of the f2 similarity factor. Adding further spectroscopy measurements increased the prediction accuracy.  ( 3 min )
    A scan-specific unsupervised method for parallel MRI reconstruction via implicit neural representation. (arXiv:2210.10439v1 [eess.IV])
    Parallel imaging is a widely-used technique to accelerate magnetic resonance imaging (MRI). However, current methods still perform poorly in reconstructing artifact-free MRI images from highly undersampled k-space data. Recently, implicit neural representation (INR) has emerged as a new deep learning paradigm for learning the internal continuity of an object. In this study, we adopted INR to parallel MRI reconstruction. The MRI image was modeled as a continuous function of spatial coordinates. This function was parameterized by a neural network and learned directly from the measured k-space itself without additional fully sampled high-quality training data. Benefitting from the powerful continuous representations provided by INR, the proposed method outperforms existing methods by suppressing the aliasing artifacts and noise, especially at higher acceleration rates and smaller sizes of the auto-calibration signals. The high-quality results and scanning specificity make the proposed method hold the potential for further accelerating the data acquisition of parallel MRI.  ( 2 min )
    Self-supervised Heterogeneous Graph Pre-training Based on Structural Clustering. (arXiv:2210.10462v1 [cs.LG])
    Recent self-supervised pre-training methods on Heterogeneous Information Networks (HINs) have shown promising competitiveness over traditional semi-supervised Heterogeneous Graph Neural Networks (HGNNs). Unfortunately, their performance heavily depends on careful customization of various strategies for generating high-quality positive examples and negative examples, which notably limits their flexibility and generalization ability. In this work, we present SHGP, a novel Self-supervised Heterogeneous Graph Pre-training approach, which does not need to generate any positive examples or negative examples. It consists of two modules that share the same attention-aggregation scheme. In each iteration, the Att-LPA module produces pseudo-labels through structural clustering, which serve as the self-supervision signals to guide the Att-HGNN module to learn object embeddings and attention coefficients. The two modules can effectively utilize and enhance each other, promoting the model to learn discriminative embeddings. Extensive experiments on four real-world datasets demonstrate the superior effectiveness of SHGP against state-of-the-art unsupervised baselines and even semi-supervised baselines. We release our source code at: https://github.com/kepsail/SHGP.  ( 2 min )
    Optimizing Hierarchical Image VAEs for Sample Quality. (arXiv:2210.10205v1 [cs.LG])
    While hierarchical variational autoencoders (VAEs) have achieved great density estimation on image modeling tasks, samples from their prior tend to look less convincing than models with similar log-likelihood. We attribute this to learned representations that over-emphasize compressing imperceptible details of the image. To address this, we introduce a KL-reweighting strategy to control the amount of information in each latent group, and employ a Gaussian output layer to reduce sharpness in the learning objective. To trade off image diversity for fidelity, we additionally introduce a classifier-free guidance strategy for hierarchical VAEs. We demonstrate the effectiveness of these techniques in our experiments. Code is available at https://github.com/tcl9876/visual-vae.  ( 2 min )
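    One way to write the kind of reweighted objective described, for latent groups $z_1, \dots, z_G$ with per-group weights $\lambda_i$ controlling how much information each group may carry (the actual weighting schedule is the paper's design choice), is
    $$\mathcal{L} = \mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big] - \sum_{i=1}^{G} \lambda_i\, \mathrm{KL}\big(q(z_i \mid z_{<i}, x)\,\big\|\,p(z_i \mid z_{<i})\big).$$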
    Inference in conditioned dynamics through causality restoration. (arXiv:2210.10179v1 [physics.data-an])
    Computing observables from conditioned dynamics is typically computationally hard because, although obtaining independent samples efficiently from the unconditioned dynamics is usually feasible, most of the samples must generally be discarded (in a form of importance sampling) because they do not satisfy the imposed conditions. Sampling directly from the conditioned distribution is non-trivial, as conditioning breaks the causal properties of the dynamics that ultimately render the sampling procedure efficient. One standard way of achieving this is through a Metropolis Monte-Carlo procedure, but this procedure is normally slow, and a very large number of Monte-Carlo steps is needed to obtain a small number of statistically independent samples. In this work, we propose an alternative method to produce independent samples from a conditioned distribution. The method learns the parameters of a generalized dynamical model that optimally describes the conditioned distribution in a variational sense. The outcome is an effective, unconditioned, dynamical model from which one can trivially obtain independent samples, effectively restoring the causality of the conditioned distribution. The consequences are twofold: on the one hand, it allows us to efficiently compute observables from the conditioned dynamics by simply averaging over independent samples; on the other hand, the method gives an effective unconditioned distribution that is easier to interpret. The method is flexible and can be applied to virtually any dynamics. We discuss an important application of the method, namely the problem of epidemic risk assessment from (imperfect) clinical tests, for a large family of time-continuous epidemic models endowed with a Gillespie-like sampler. We show that the method compares favorably against the state of the art, including the soft-margin approach and mean-field methods.  ( 3 min )
    Gaussian-Bernoulli RBMs Without Tears. (arXiv:2210.10318v1 [cs.LG])
    We revisit the challenging problem of training Gaussian-Bernoulli restricted Boltzmann machines (GRBMs), introducing two innovations. We propose a novel Gibbs-Langevin sampling algorithm that outperforms existing methods like Gibbs sampling. We propose a modified contrastive divergence (CD) algorithm so that one can generate images with GRBMs starting from noise. This enables direct comparison of GRBMs with deep generative models, improving evaluation protocols in the RBM literature. Moreover, we show that modified CD and gradient clipping are enough to robustly train GRBMs with large learning rates, thus removing the necessity of various tricks in the literature. Experiments on Gaussian Mixtures, MNIST, FashionMNIST, and CelebA show GRBMs can generate good samples, despite their single-hidden-layer architecture. Our code is released at: \url{https://github.com/lrjconan/GRBM}.  ( 2 min )
  • Open

    Bring Your Own Algorithm for Optimal Differentially Private Stochastic Minimax Optimization. (arXiv:2206.00363v2 [cs.LG] UPDATED)
We study differentially private (DP) algorithms for smooth stochastic minimax optimization, with stochastic minimization as a byproduct. The holy grail of these settings is to guarantee the optimal trade-off between the privacy and the excess population loss, using an algorithm with a linear time-complexity in the number of training samples. We provide a general framework for solving differentially private stochastic minimax optimization (DP-SMO) problems, which enables practitioners to bring their own base optimization algorithm and use it as a black-box to obtain the near-optimal privacy-loss trade-off. Our framework is inspired by the recently proposed Phased-ERM method [22] for nonsmooth differentially private stochastic convex optimization (DP-SCO), which exploits the stability of the empirical risk minimization (ERM) for the privacy guarantee. The flexibility of our approach enables us to sidestep the requirement that the base algorithm needs to have bounded sensitivity, and allows the use of sophisticated variance-reduced accelerated methods to achieve near-linear time-complexity. To the best of our knowledge, these are the first near-linear time algorithms with near-optimal guarantees on the population duality gap for smooth DP-SMO, when the objective is (strongly-)convex--(strongly-)concave. Additionally, based on our flexible framework, we enrich the family of near-linear time algorithms for smooth DP-SCO with the near-optimal privacy-loss trade-off.  ( 3 min )
    Predictions of Electromotive Force of Magnetic Shape Memory Alloy (MSMA) Using Constitutive Model and Generalized Regression Neural Network. (arXiv:2206.03701v2 [cond-mat.mtrl-sci] UPDATED)
Ferromagnetic shape memory alloys (MSMAs), such as Ni-Mn-Ga single crystals, can exhibit the shape memory effect due to an applied magnetic field at room temperature. Under a variable magnetic field and a constant bias stress loading, MSMAs have been used for actuation applications. This work introduces a new feature to the existing macroscale magneto-mechanical model for Ni-Mn-Ga single crystals. The model includes the fact that the magnetic easy axis in the two variants is not exactly perpendicular, as observed by D'Silva et al. This offset helps explain some of the power harvesting capabilities of MSMAs. Model predictions are compared to experimental data collected on a Ni-Mn-Ga single crystal. The experiments include both stress-controlled loading with a constant bias magnetic field (which mimics power harvesting or sensing) and field-controlled loading with a constant bias compressive stress (which mimics actuation). Each type of test was performed at several different load levels, and the applied field was measured without the MSMA specimen present so that demagnetization does not affect the experimentally measured field, as suggested by Eberle et al. Results show decent agreement between model predictions and experimental data. Although the model predicts the experimental results decently, it does not capture all the features of the experimental data. To capture these features, a generalized regression neural network (GRNN) was finally trained on the experimental data (stress, strain, magnetic field, and emf) so that it can make noticeably better predictions.  ( 3 min )
    Autoregressive Generative Modeling with Noise Conditional Maximum Likelihood Estimation. (arXiv:2210.10715v1 [cs.LG])
    We introduce a simple modification to the standard maximum likelihood estimation (MLE) framework. Rather than maximizing a single unconditional likelihood of the data under the model, we maximize a family of \textit{noise conditional} likelihoods consisting of the data perturbed by a continuum of noise levels. We find that models trained this way are more robust to noise, obtain higher test likelihoods, and generate higher quality images. They can also be sampled from via a novel score-based sampling scheme which combats the classical \textit{covariate shift} problem that occurs during sample generation in autoregressive models. Applying this augmentation to autoregressive image models, we obtain 3.32 bits per dimension on the ImageNet 64x64 dataset, and substantially improve the quality of generated samples in terms of the Frechet Inception distance (FID) -- from 37.50 to 12.09 on the CIFAR-10 dataset.  ( 2 min )
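As a sketch of the modified objective, one training step might look like the following; the `model(x_noisy, sigma)` API returning a per-example negative log-likelihood conditioned on the noise level is an assumption, not the paper's actual interface:

```python
import torch

def noise_conditional_nll(model, x, sigma_min=0.01, sigma_max=1.0):
    """One step of the noise-conditional MLE objective (sketch).

    model(x_noisy, sigma) is assumed to return per-example negative
    log-likelihoods conditioned on the noise level (hypothetical API).
    """
    # One noise level per example, drawn from a continuum of levels.
    sigma = torch.empty(x.shape[0], device=x.device).uniform_(sigma_min, sigma_max)
    x_noisy = x + sigma.view(-1, *([1] * (x.dim() - 1))) * torch.randn_like(x)
    return model(x_noisy, sigma).mean()
```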
    Identifiability of deep generative models without auxiliary information. (arXiv:2206.10044v2 [cs.LG] UPDATED)
    We prove identifiability of a broad class of deep latent variable models that (a) have universal approximation capabilities and (b) are the decoders of variational autoencoders that are commonly used in practice. Unlike existing work, our analysis does not require weak supervision, auxiliary information, or conditioning in the latent space. Specifically, we show that for a broad class of generative (i.e. unsupervised) models with universal approximation capabilities, the side information $u$ is not necessary: We prove identifiability of the entire generative model where we do not observe $u$ and only observe the data $x$. The models we consider match autoencoder architectures used in practice that leverage mixture priors in the latent space and ReLU/leaky-ReLU activations in the encoder, such as VaDE and MFC-VAE. Our main result is an identifiability hierarchy that significantly generalizes previous work and exposes how different assumptions lead to different "strengths" of identifiability, and includes certain "vanilla" VAEs with isotropic Gaussian priors as a special case. For example, our weakest result establishes (unsupervised) identifiability up to an affine transformation, and thus partially resolves an open problem regarding model identifiability raised in prior work. These theoretical results are augmented with experiments on both simulated and real data.  ( 3 min )
    Anytime-valid off-policy inference for contextual bandits. (arXiv:2210.10768v1 [stat.ME])
Contextual bandits are a modern staple tool for active sequential experimentation in the tech industry. They involve online learning algorithms that adaptively (over time) learn policies to map observed contexts $X_t$ to actions $A_t$ in an attempt to maximize stochastic rewards $R_t$. This adaptivity raises interesting but hard statistical inference questions, especially counterfactual ones: for example, it is often of interest to estimate the properties of a hypothetical policy that is different from the logging policy that was used to collect the data -- a problem known as "off-policy evaluation" (OPE). Using modern martingale techniques, we present a comprehensive framework for OPE inference that relaxes many unnecessary assumptions made in past work, significantly improving on them theoretically and empirically. Our methods remain valid in very general settings, and can be employed while the original experiment is still running (that is, not necessarily post-hoc), when the logging policy may be itself changing (due to learning), and even if the context distributions are drifting over time. More concretely, we derive confidence sequences for various functionals of interest in OPE. These include doubly robust ones for time-varying off-policy mean reward values, but also confidence bands for the entire CDF of the off-policy reward distribution. All of our methods (a) are valid at arbitrary stopping times, (b) only make nonparametric assumptions, (c) do not require known bounds on the maximal importance weights, and (d) adapt to the empirical variance of the reward and weight distributions. In summary, our methods enable anytime-valid off-policy inference using adaptively collected contextual bandit data.  ( 3 min )
    Gaussian Process Sampling and Optimization with Approximate Upper and Lower Bounds. (arXiv:2110.12087v4 [cs.LG] UPDATED)
Many functions have approximately-known upper and/or lower bounds, potentially aiding the modeling of such functions. In this paper, we introduce Gaussian process models for functions where such bounds are (approximately) known. More specifically, we propose the first use of such bounds to improve Gaussian process (GP) posterior sampling and Bayesian optimization (BO). That is, we transform a GP model satisfying the given bounds, and then sample and weight functions from its posterior. To further exploit these bounds in BO settings, we present bounded entropy search (BES) to select the point gaining the most information about the underlying function, estimated by the GP samples, while satisfying the output constraints. We characterize the sample variance bounds and show that the decision made by BES is explainable. Our proposed approach is conceptually straightforward and can be used as a plug-in extension to existing methods for GP posterior sampling and Bayesian optimization.  ( 2 min )
    Koopman-based spectral clustering of directed and time-evolving graphs. (arXiv:2204.02951v2 [math.DS] UPDATED)
    While spectral clustering algorithms for undirected graphs are well established and have been successfully applied to unsupervised machine learning problems ranging from image segmentation and genome sequencing to signal processing and social network analysis, clustering directed graphs remains notoriously difficult. Two of the main challenges are that the eigenvalues and eigenvectors of graph Laplacians associated with directed graphs are in general complex-valued and that there is no universally accepted definition of clusters in directed graphs. We first exploit relationships between the graph Laplacian and transfer operators and in particular between clusters in undirected graphs and metastable sets in stochastic dynamical systems and then use a generalization of the notion of metastability to derive clustering algorithms for directed and time-evolving graphs. The resulting clusters can be interpreted as coherent sets, which play an important role in the analysis of transport and mixing processes in fluid flows.  ( 2 min )
    Interpolation Consistency Training for Semi-Supervised Learning. (arXiv:1903.03825v5 [stat.ML] UPDATED)
We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets. Our theoretical analysis shows that ICT corresponds to a certain type of data-adaptive regularization with unlabeled points which reduces overfitting to labeled points under high confidence values.  ( 2 min )
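As a rough illustration of the consistency term, here is a PyTorch sketch; the mean-teacher (EMA) target network and the MSE consistency objective follow common ICT setups, but the exact details are assumptions:

```python
import torch
import torch.nn.functional as F

def ict_consistency_loss(model, ema_model, x_u1, x_u2, alpha=1.0):
    """Consistency term of ICT on two unlabeled batches (sketch).

    Encourages model(mix(u1, u2)) to match mix(ema(u1), ema(u2)),
    with targets from a mean-teacher (EMA) copy of the model.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    with torch.no_grad():
        p1 = torch.softmax(ema_model(x_u1), dim=1)
        p2 = torch.softmax(ema_model(x_u2), dim=1)
        target = lam * p1 + (1 - lam) * p2
    pred = torch.softmax(model(lam * x_u1 + (1 - lam) * x_u2), dim=1)
    return F.mse_loss(pred, target)
```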
    Active Surrogate Estimators: An Active Learning Approach to Label-Efficient Model Evaluation. (arXiv:2202.06881v2 [cs.LG] UPDATED)
    We propose Active Surrogate Estimators (ASEs), a new method for label-efficient model evaluation. Evaluating model performance is a challenging and important problem when labels are expensive. ASEs address this active testing problem using a surrogate-based estimation approach that interpolates the errors of points with unknown labels, rather than forming a Monte Carlo estimator. ASEs actively learn the underlying surrogate, and we propose a novel acquisition strategy, XWED, that tailors this learning to the final estimation task. We find that ASEs offer greater label-efficiency than the current state-of-the-art when applied to challenging model evaluation problems for deep neural networks.  ( 2 min )
    Constrained Learning with Non-Convex Losses. (arXiv:2103.05134v5 [cs.LG] UPDATED)
Though learning has become a core component of modern information processing, there is now ample evidence that it can lead to biased, unsafe, and prejudiced systems. The need to impose requirements on learning is therefore paramount, especially as it reaches critical applications in social, industrial, and medical domains. However, the non-convexity of most modern statistical problems is only exacerbated by the introduction of constraints. Whereas good unconstrained solutions can often be learned using empirical risk minimization, even obtaining a model that satisfies statistical constraints can be challenging; obtaining a good one is harder still. In this paper, we overcome this issue by learning in the empirical dual domain, where constrained statistical learning problems become unconstrained and deterministic. We analyze the generalization properties of this approach by bounding the empirical duality gap -- i.e., the difference between our approximate, tractable solution and the solution of the original (non-convex) statistical problem -- and provide a practical constrained learning algorithm. These results establish a constrained counterpart to classical learning theory, enabling the explicit use of constraints in learning. We illustrate this theory and algorithm in rate-constrained learning applications arising in fairness and adversarial robustness.  ( 3 min )
    Adapting to Mixing Time in Stochastic Optimization with Markovian Data. (arXiv:2202.04428v2 [cs.LG] UPDATED)
    We consider stochastic optimization problems where data is drawn from a Markov chain. Existing methods for this setting crucially rely on knowing the mixing time of the chain, which in real-world applications is usually unknown. We propose the first optimization method that does not require the knowledge of the mixing time, yet obtains the optimal asymptotic convergence rate when applied to convex problems. We further show that our approach can be extended to: (i) finding stationary points in non-convex optimization with Markovian data, and (ii) obtaining better dependence on the mixing time in temporal difference (TD) learning; in both cases, our method is completely oblivious to the mixing time. Our method relies on a novel combination of multi-level Monte Carlo (MLMC) gradient estimation together with an adaptive learning method.  ( 2 min )
    Transformers Learn Shortcuts to Automata. (arXiv:2210.10749v1 [cs.LG])
    Algorithmic reasoning requires capabilities which are most naturally understood through recurrent models of computation, like the Turing machine. However, Transformer models, while lacking recurrence, are able to perform such reasoning using far fewer layers than the number of reasoning steps. This raises the question: what solutions are these shallow and non-recurrent models finding? We investigate this question in the setting of learning automata, discrete dynamical systems naturally suited to recurrent modeling and expressing algorithmic tasks. Our theoretical results completely characterize shortcut solutions, whereby a shallow Transformer with only $o(T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$. By representing automata using the algebraic structure of their underlying transformation semigroups, we obtain $O(\log T)$-depth simulators for all automata and $O(1)$-depth simulators for all automata whose associated groups are solvable. Empirically, we perform synthetic experiments by training Transformers to simulate a wide variety of automata, and show that shortcut solutions can be learned via standard training. We further investigate the brittleness of these solutions and propose potential mitigations.  ( 2 min )
    Lethal Dose Conjecture on Data Poisoning. (arXiv:2208.03309v3 [cs.LG] UPDATED)
Data poisoning considers an adversary that distorts the training set of machine learning algorithms for malicious purposes. In this work, we bring to light one conjecture regarding the fundamentals of data poisoning, which we call the Lethal Dose Conjecture. The conjecture states: If $n$ clean training samples are needed for accurate predictions, then in a size-$N$ training set, only $\Theta(N/n)$ poisoned samples can be tolerated while ensuring accuracy. Theoretically, we verify this conjecture in multiple cases. We also offer a more general perspective of this conjecture through distribution discrimination. Deep Partition Aggregation (DPA) and its extension, Finite Aggregation (FA), are recent approaches for provable defenses against data poisoning, where they predict through the majority vote of many base models trained on different subsets of the training set using a given learner. The conjecture implies that both DPA and FA are (asymptotically) optimal -- if we have the most data-efficient learner, they can turn it into one of the most robust defenses against data poisoning. This outlines a practical approach to developing stronger defenses against poisoning via finding data-efficient learners. Empirically, as a proof of concept, we show that by simply using different data augmentations for base learners, we can respectively double and triple the certified robustness of DPA on CIFAR-10 and GTSRB without sacrificing accuracy.  ( 3 min )
    A Flexible Approach for Normal Approximation of Geometric and Topological Statistics. (arXiv:2210.10744v1 [math.PR])
    We derive normal approximation results for a class of stabilizing functionals of binomial or Poisson point process, that are not necessarily expressible as sums of certain score functions. Our approach is based on a flexible notion of the add-one cost operator, which helps one to deal with the second-order cost operator via suitably appropriate first-order operators. We combine this flexible notion with the theory of strong stabilization to establish our results. We illustrate the applicability of our results by establishing normal approximation results for certain geometric and topological statistics arising frequently in practice. Several existing results also emerge as special cases of our approach.  ( 2 min )
    Interpolated Adversarial Training: Achieving Robust Neural Networks without Sacrificing Too Much Accuracy. (arXiv:1906.06784v7 [stat.ML] UPDATED)
Adversarial robustness has become a central goal in deep learning, in both theory and practice. However, successful methods to improve adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how adversarial robustness affects real-world systems (i.e. many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated adversarial training we retain the adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%. Moreover, we provide mathematical analysis of Interpolated Adversarial Training to confirm its efficiency and demonstrate its advantages in terms of robustness and generalization.
    Unsupervised Space Partitioning for Nearest Neighbor Search. (arXiv:2206.08091v2 [cs.LG] UPDATED)
    Approximate Nearest Neighbor Search (ANNS) in high dimensional spaces is crucial for many real-life applications (e.g., e-commerce, web, multimedia, etc.) dealing with an abundance of data. This paper proposes an end-to-end learning framework that couples the partitioning (one critical step of ANNS) and learning-to-search steps using a custom loss function. A key advantage of our proposed solution is that it does not require any expensive pre-processing of the dataset, which is one of the critical limitations of the state-of-the-art approach. We achieve the above edge by formulating a multi-objective custom loss function that does not need ground truth labels to quantify the quality of a given data-space partition, making it entirely unsupervised. We also propose an ensembling technique by adding varying input weights to the loss function to train an ensemble of models to enhance the search quality. On several standard benchmarks for ANNS, we show that our method beats the state-of-the-art space partitioning method and the ubiquitous K-means clustering method while using fewer parameters and shorter offline training times. We also show that incorporating our space-partitioning strategy into state-of-the-art ANNS techniques such as ScaNN can improve their performance significantly. Finally, we present our unsupervised partitioning approach as a promising alternative to many widely used clustering methods, such as K-means clustering and DBSCAN.
    Variational methods for simulation-based inference. (arXiv:2203.04176v3 [stat.ML] UPDATED)
    We present Sequential Neural Variational Inference (SNVI), an approach to perform Bayesian inference in models with intractable likelihoods. SNVI combines likelihood-estimation (or likelihood-ratio-estimation) with variational inference to achieve a scalable simulation-based inference approach. SNVI maintains the flexibility of likelihood(-ratio) estimation to allow arbitrary proposals for simulations, while simultaneously providing a functional estimate of the posterior distribution without requiring MCMC sampling. We present several variants of SNVI and demonstrate that they are substantially more computationally efficient than previous algorithms, without loss of accuracy on benchmark tasks. We apply SNVI to a neuroscience model of the pyloric network in the crab and demonstrate that it can infer the posterior distribution with one order of magnitude fewer simulations than previously reported. SNVI vastly reduces the computational cost of simulation-based inference while maintaining accuracy and flexibility, making it possible to tackle problems that were previously inaccessible.
    Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms. (arXiv:2202.13001v6 [cs.LG] UPDATED)
We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of a large number of tasks and a small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
    Generalization Properties of hyper-RKHS and its Applications. (arXiv:1809.09910v4 [cs.LG] UPDATED)
This paper generalizes regularized regression problems in a hyper-reproducing kernel Hilbert space (hyper-RKHS), illustrates its utility for kernel learning and out-of-sample extensions, and proves asymptotic convergence results for the introduced regression models in an approximation theory view. Algorithmically, we consider two regularized regression models with bivariate forms in this space, including kernel ridge regression (KRR) and support vector regression (SVR) endowed with hyper-RKHS, and further combine divide-and-conquer with Nystr\"{o}m approximation for scalability in large sample cases. This framework is general: the underlying kernel is learned from a broad class, and can be positive definite or not, which adapts to various requirements in kernel learning. Theoretically, we study the convergence behavior of regularized regression algorithms in hyper-RKHS and derive the learning rates, which go beyond the classical analysis on RKHS due to the non-trivial independence of pairwise samples and the characterisation of hyper-RKHS. Experimentally, results on several benchmarks suggest that the employed framework is able to learn a general kernel function from an arbitrary similarity matrix, and thus achieves satisfactory performance on classification tasks.
    Graph sampling for node embedding. (arXiv:2210.10520v1 [stat.ML])
Node embedding is a central topic in graph representation learning. Computational efficiency and scalability can be challenging to any method that requires full-graph operations. We propose sampling approaches to node embedding, with or without explicit modelling of the feature vector, which aim to extract useful information from both the eigenvectors related to the graph Laplacian and the given values associated with the graph.  ( 2 min )
    Influence Estimation for Generative Adversarial Networks. (arXiv:2101.08367v3 [stat.ML] UPDATED)
Identifying harmful instances, whose absence in a training dataset improves model performance, is important for building better machine learning models. Although previous studies have succeeded in estimating harmful instances under supervised settings, they cannot be trivially extended to generative adversarial networks (GANs). This is because previous approaches require that (1) the absence of a training instance directly affects the loss value and that (2) the change in the loss directly measures the harmfulness of the instance for the performance of a model. In GAN training, however, neither of the requirements is satisfied. This is because, (1) the generator's loss is not directly affected by the training instances as they are not part of the generator's training steps, and (2) the values of GAN's losses normally do not capture the generative performance of a model. To this end, (1) we propose an influence estimation method that uses the Jacobian of the gradient of the generator's loss with respect to the discriminator's parameters (and vice versa) to trace how the absence of an instance in the discriminator's training affects the generator's parameters, and (2) we propose a novel evaluation scheme, in which we assess the harmfulness of each training instance on the basis of how a GAN evaluation metric (e.g., Inception Score) is expected to change due to the removal of the instance. We experimentally verified that our influence estimation method correctly inferred the changes in GAN evaluation metrics. Further, we demonstrated that the removal of the identified harmful instances effectively improved the model's generative performance with respect to various GAN evaluation metrics.  ( 3 min )
    Scaling Laws for Reward Model Overoptimization. (arXiv:2210.10760v1 [cs.LG])
    In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart's law. This effect has been frequently observed, but not carefully measured due to the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed "gold-standard" reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-$n$ sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.  ( 2 min )
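For intuition, here is a minimal sketch of the best-of-$n$ procedure the paper optimizes against; `policy_sample` and `proxy_reward` are hypothetical callables standing in for the policy and the learned proxy reward model:

```python
def best_of_n(policy_sample, proxy_reward, prompt, n=16):
    """Best-of-n sampling against a proxy reward model (sketch).

    The gold reward model's score of the returned winner is what the
    scaling laws above track as n (optimization pressure) grows.
    """
    candidates = [policy_sample(prompt) for _ in range(n)]
    scores = [proxy_reward(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best]
```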
    Estimating the Contamination Factor's Distribution in Unsupervised Anomaly Detection. (arXiv:2210.10487v1 [cs.LG])
Anomaly detection methods identify examples that do not follow the expected behaviour, typically in an unsupervised fashion, by assigning real-valued anomaly scores to the examples based on various heuristics. These scores need to be transformed into actual predictions by thresholding, so that the proportion of examples marked as anomalies equals the expected proportion of anomalies, called contamination factor. Unfortunately, there are no good methods for estimating the contamination factor itself. We address this need from a Bayesian perspective, introducing a method for estimating the posterior distribution of the contamination factor of a given unlabeled dataset. We leverage the outputs of several anomaly detectors as a representation that already captures the basic notion of anomalousness and estimate the contamination using a specific mixture formulation. Empirically on 22 datasets, we show that the estimated distribution is well-calibrated and that setting the threshold using the posterior mean improves the anomaly detectors' performance over several alternative methods. All code is publicly available for full reproducibility.  ( 2 min )
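To show where a contamination estimate enters the pipeline, a minimal sketch of the thresholding step, assuming higher scores mean more anomalous and using a point estimate (e.g., the posterior mean) of the contamination factor:

```python
import numpy as np

def flag_anomalies(scores, contamination):
    """Turn real-valued anomaly scores into binary predictions so that the
    flagged fraction matches a contamination estimate (sketch)."""
    threshold = np.quantile(scores, 1.0 - contamination)
    return scores > threshold
```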
    "Why did the Model Fail?": Attributing Model Performance Changes to Distribution Shifts. (arXiv:2210.10769v1 [cs.LG])
    Performance of machine learning models may differ between training and deployment for many reasons. For instance, model performance can change between environments due to changes in data quality, observing a different population than the one in training, or changes in the relationship between labels and features. These manifest as changes to the underlying data generating mechanisms, and thereby result in distribution shifts across environments. Attributing performance changes to specific shifts, such as covariate or concept shifts, is critical for identifying sources of model failures, and for taking mitigating actions that ensure robust models. In this work, we introduce the problem of attributing performance differences between environments to shifts in the underlying data generating mechanisms. We formulate the problem as a cooperative game and derive an importance weighting method for computing the value of a coalition (or a set) of distributions. The contribution of each distribution to the total performance change is then quantified as its Shapley value. We demonstrate the correctness and utility of our method on two synthetic datasets and two real-world case studies, showing its effectiveness in attributing performance changes to a wide range of distribution shifts.  ( 2 min )
    Towards Accurate Subgraph Similarity Computation via Neural Graph Pruning. (arXiv:2210.10643v1 [cs.LG])
Subgraph similarity search, one of the core problems in graph search, concerns whether a target graph approximately contains a query graph. The problem has recently been approached with neural methods. However, current neural methods do not consider pruning the target graph, though pruning is critically important in traditional calculations of subgraph similarities. One obstacle to applying pruning in neural methods is the discrete property of pruning. In this work, we convert graph pruning to a problem of node relabeling and then relax it to a differentiable problem. Based on this idea, we further design a novel neural network to approximate a type of subgraph distance: the subgraph edit distance (SED). In particular, we construct the pruning component using a neural structure, and the entire model can be optimized end-to-end. In the design of the model, we propose an attention mechanism to leverage the information about the query graph and guide the pruning of the target graph. Moreover, we develop a multi-head pruning strategy such that the model can better explore multiple ways of pruning the target graph. The proposed model establishes new state-of-the-art results across seven benchmark datasets. Extensive analysis of the model indicates that the proposed model can reasonably prune the target graph for SED computation. The implementation of our algorithm is released at our Github repo: https://github.com/tufts-ml/Prune4SED.  ( 3 min )
    Extending Graph Transformers with Quantum Computed Aggregation. (arXiv:2210.10610v1 [quant-ph])
Recently, efforts have been made in the community to design new Graph Neural Networks (GNNs), as limitations of Message Passing Neural Networks became more apparent. This led to the appearance of Graph Transformers using global graph features such as Laplacian Eigenmaps. In our paper, we introduce a GNN architecture where the aggregation weights are computed using the long-range correlations of a quantum system. These correlations are generated by translating the graph topology into the interactions of a set of qubits in a quantum computer. This work was inspired by the recent development of quantum processing units, which enables the computation of a new family of global graph features that would otherwise be out of reach for classical hardware. We give some theoretical insights about the potential benefits of this approach, and benchmark our algorithm on standard datasets. Although not suited to every dataset, our model performs similarly to standard GNN architectures and paves a promising path for quantum-enhanced GNNs.  ( 2 min )
    A kernel Stein test of goodness of fit for sequential models. (arXiv:2210.10741v1 [stat.ML])
    We propose a goodness-of-fit measure for probability densities modelling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalised densities. Existing KSDs require the model to be defined on a fixed-dimension space. As our major contributions, we extend the KSD to the variable dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalised, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.  ( 2 min )
    A Deep Top-Down Approach to Hierarchically Coherent Probabilistic Forecasting. (arXiv:2204.10414v2 [cs.LG] UPDATED)
    Probabilistic, hierarchically coherent forecasting is a key problem in many practical forecasting applications -- the goal is to obtain coherent probabilistic predictions for a large number of time series arranged in a pre-specified tree hierarchy. In this paper, we present a probabilistic top-down approach to hierarchical forecasting that uses a novel attention-based RNN model to learn the distribution of the proportions according to which each parent prediction is split among its children nodes at any point in time. These probabilistic proportions are then coupled with an independent univariate probabilistic forecasting model for the root time series. The resulting forecasts are naturally coherent, and provide probabilistic predictions over all time series in the hierarchy. We experiment on several public datasets and demonstrate significant improvements up to 27% on most datasets compared to state-of-the-art probabilistic hierarchical models. Finally, we also provide theoretical justification for the superiority of our top-down approach compared to traditional bottom-up modeling.  ( 2 min )
    CLEAR: Causal Explanations from Attention in Neural Recommenders. (arXiv:2210.10621v1 [cs.IR])
    We present CLEAR, a method for learning session-specific causal graphs, in the possible presence of latent confounders, from attention in pre-trained attention-based recommenders. These causal graphs describe user behavior, within the context captured by attention, and can provide a counterfactual explanation for a recommendation. In essence, these causal graphs allow answering "why" questions uniquely for any specific session. Using empirical evaluations we show that, compared to naively using attention weights to explain input-output relations, counterfactual explanations found by CLEAR are shorter and an alternative recommendation is ranked higher in the original top-k recommendations.  ( 2 min )
    p$^3$VAE: a physics-integrated generative model. Application to the semantic segmentation of optical remote sensing images. (arXiv:2210.10418v1 [cs.CV])
The combination of machine learning models with physical models is a recent research path to learn robust data representations. In this paper, we introduce p$^3$VAE, a generative model that integrates a perfect physical model which partially explains the true underlying factors of variation in the data. To fully leverage our hybrid design, we propose a semi-supervised optimization procedure and an inference scheme that comes along with meaningful uncertainty estimates. We apply p$^3$VAE to the semantic segmentation of high-resolution hyperspectral remote sensing images. Our experiments on a simulated data set demonstrated the benefits of our hybrid model against conventional machine learning models in terms of extrapolation capabilities and interpretability. In particular, we show that p$^3$VAE naturally has high disentanglement capabilities. Our code and data have been made publicly available at https://github.com/Romain3Ch216/p3VAE.  ( 2 min )
    Multivariate outlier explanations using Shapley values and Mahalanobis distances. (arXiv:2210.10063v1 [stat.ME])
    For the purpose of explaining multivariate outlyingness, it is shown that the squared Mahalanobis distance of an observation can be decomposed into outlyingness contributions originating from single variables. The decomposition is obtained using the Shapley value, a well-known concept from game theory that became popular in the context of Explainable AI. In addition to outlier explanation, this concept also relates to the recent formulation of cellwise outlyingness, where Shapley values can be employed to obtain variable contributions for outlying observations with respect to their "expected" position given the multivariate data structure. In combination with squared Mahalanobis distances, Shapley values can be calculated at a low numerical cost, making them even more attractive for outlier interpretation. Simulations and real-world data examples demonstrate the usefulness of these concepts.  ( 2 min )
    A Reinforcement Learning Approach in Multi-Phase Second-Price Auction Design. (arXiv:2210.10278v1 [cs.LG])
We study reserve price optimization in multi-phase second price auctions, where the seller's prior actions affect the bidders' later valuations through a Markov Decision Process (MDP). Compared to the bandit setting in existing works, our setting involves three challenges. First, from the seller's perspective, we need to efficiently explore the environment in the presence of potentially nontruthful bidders who aim to manipulate the seller's policy. Second, we want to minimize the seller's revenue regret when the market noise distribution is unknown. Third, the seller's per-step revenue is unknown, nonlinear, and cannot even be directly observed from the environment. We propose a mechanism addressing all three challenges. To address the first challenge, we use a combination of a new technique named "buffer periods" and inspirations from Reinforcement Learning (RL) with low switching cost to limit bidders' surplus from untruthful bidding, thereby incentivizing approximately truthful bidding. The second one is tackled by a novel algorithm that removes the need for pure exploration when the market noise distribution is unknown. The third challenge is resolved by an extension of LSVI-UCB, where we use the auction's underlying structure to control the uncertainty of the revenue function. The three techniques culminate in the $\underline{\rm C}$ontextual-$\underline{\rm L}$SVI-$\underline{\rm U}$CB-$\underline{\rm B}$uffer (CLUB) algorithm which achieves $\tilde{ \mathcal{O}}(H^{5/2}\sqrt{K})$ revenue regret when the market noise is known and $\tilde{ \mathcal{O}}(H^{3}\sqrt{K})$ revenue regret when the noise is unknown with no assumptions on bidders' truthfulness.  ( 3 min )
    Bayesian Emulation for Computer Models with Multiple Partial Discontinuities. (arXiv:2210.10468v1 [stat.ME])
Computer models are widely used across a range of scientific disciplines to describe various complex physical systems; however, to perform full uncertainty quantification we often need to employ emulators. An emulator is a fast statistical construct that mimics the slow to evaluate computer model, and greatly aids the vastly more computationally intensive uncertainty quantification calculations that an important scientific analysis often requires. We examine the problem of emulating computer models that possess multiple, partial discontinuities occurring at known non-linear locations. We introduce the TENSE framework, based on carefully designed correlation structures that respect the discontinuities while enabling full exploitation of any smoothness/continuity elsewhere. This leads to a single emulator object that can be updated by all runs simultaneously, and also used for efficient design. This approach avoids having to split the input space into multiple subregions. We apply the TENSE framework to the TNO Challenge II, emulating the OLYMPUS reservoir model, which possesses multiple such discontinuities.  ( 2 min )
    Nonparametric Quantile Regression: Non-Crossing Constraints and Conformal Prediction. (arXiv:2210.10161v1 [stat.ML])
We propose a nonparametric quantile regression method using deep neural networks with a rectified linear unit penalty function to avoid quantile crossing. This penalty function is computationally feasible for enforcing non-crossing constraints in multi-dimensional nonparametric quantile regression. We establish non-asymptotic upper bounds for the excess risk of the proposed nonparametric quantile regression function estimators. Our error bounds achieve optimal minimax rate of convergence for the Hölder class, and the prefactors of the error bounds depend polynomially on the dimension of the predictor, instead of exponentially. Based on the proposed non-crossing penalized deep quantile regression, we construct conformal prediction intervals that are fully adaptive to heterogeneity. The proposed prediction interval is shown to have good properties in terms of validity and accuracy under reasonable conditions. We also derive non-asymptotic upper bounds for the difference of the lengths between the proposed non-crossing conformal prediction interval and the theoretically oracle prediction interval. Numerical experiments including simulation studies and a real data example are conducted to demonstrate the effectiveness of the proposed method.  ( 2 min )
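To illustrate how a ReLU penalty discourages quantile crossing, here is a PyTorch sketch combining the standard pinball loss with such a penalty; the network architecture and penalty weight `lam` are assumptions:

```python
import torch

def pinball_loss(pred, y, tau):
    # Standard check (quantile) loss for level tau.
    diff = y - pred
    return torch.mean(torch.maximum(tau * diff, (tau - 1.0) * diff))

def non_crossing_loss(preds, y, taus, lam=1.0):
    """Multi-quantile objective with a ReLU anti-crossing penalty (sketch).

    preds: [batch, len(taus)] network outputs, one column per quantile level,
    with taus sorted increasingly; the penalty is positive exactly when a
    lower quantile estimate exceeds the next higher one.
    """
    fit = sum(pinball_loss(preds[:, j], y, tau) for j, tau in enumerate(taus))
    crossing = torch.relu(preds[:, :-1] - preds[:, 1:])
    return fit + lam * crossing.mean()
```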
    Distributional Adaptive Soft Regression Trees. (arXiv:2210.10389v1 [stat.ME])
Random forests are an ensemble method relevant for many problems, such as regression or classification. They are popular due to their good predictive performance (compared to, e.g., decision trees) requiring only minimal tuning of hyperparameters. They are built via aggregation of multiple regression trees during training and are usually calculated recursively using hard splitting rules. Recently regression forests have been incorporated into the framework of distributional regression, a nowadays popular regression approach aiming at estimating complete conditional distributions rather than relating the mean of an output variable to input features only - as done classically. This article proposes a new type of distributional regression tree using a multivariate soft split rule. One great advantage of the soft split is that smooth high-dimensional functions can be estimated with only one tree while the complexity of the function is controlled adaptively by information criteria. Moreover, the search for the optimal split variable is obsolete. We show by means of extensive simulation studies that the algorithm has excellent properties and outperforms various benchmark methods, especially in the presence of complex non-linear feature interactions. Finally, we illustrate the usefulness of our approach with an example on probabilistic forecasts for the Sun's activity.  ( 2 min )
    Rethinking Sharpness-Aware Minimization as Variational Inference. (arXiv:2210.10452v1 [stat.ML])
    Sharpness-aware minimization (SAM) aims to improve the generalisation of gradient-based learning by seeking out flat minima. In this work, we establish connections between SAM and Mean-Field Variational Inference (MFVI) of neural network parameters. We show that both these methods have interpretations as optimizing notions of flatness, and when using the reparametrisation trick, they both boil down to calculating the gradient at a perturbed version of the current mean parameter. This thinking motivates our study of algorithms that combine or interpolate between SAM and MFVI. We evaluate the proposed variational algorithms on several benchmark datasets, and compare their performance to variants of SAM. Taking a broader perspective, our work suggests that SAM-like updates can be used as a drop-in replacement for the reparametrisation trick.  ( 2 min )
    Fast Approximation of the Generalized Sliced-Wasserstein Distance. (arXiv:2210.10268v1 [stat.ML])
Generalized sliced Wasserstein distance is a variant of sliced Wasserstein distance that exploits the power of non-linear projection through a given defining function to better capture the complex structures of the probability distributions. Similar to sliced Wasserstein distance, generalized sliced Wasserstein is defined as an expectation over random projections which can be approximated by the Monte Carlo method. However, the complexity of that approximation can be expensive in high-dimensional settings. To that end, we propose to form deterministic and fast approximations of the generalized sliced Wasserstein distance by using the concentration of random projections when the defining functions are polynomial, circular, or neural-network-type functions. Our approximations hinge upon an important result that one-dimensional projections of a high-dimensional random vector are approximately Gaussian.  ( 2 min )
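For reference, here is the plain Monte Carlo estimator of the (linear) sliced Wasserstein-2 distance that such deterministic approximations would replace; the generalized variant substitutes a non-linear defining function for the linear projection. A sketch, assuming two equally sized samples:

```python
import torch

def sliced_wasserstein2(x, y, n_projections=128):
    """Monte Carlo estimate of the (linear) sliced Wasserstein-2 distance
    between two equally sized samples x, y of shape [n, d] (sketch)."""
    theta = torch.randn(x.shape[1], n_projections)
    theta = theta / theta.norm(dim=0, keepdim=True)   # random unit directions
    xp, _ = torch.sort(x @ theta, dim=0)              # 1D projections, sorted
    yp, _ = torch.sort(y @ theta, dim=0)
    # Sorted 1D samples give the one-dimensional W2 in closed form.
    return ((xp - yp) ** 2).mean().sqrt()
```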
    Bayesian Optimization over Discrete and Mixed Spaces via Probabilistic Reparameterization. (arXiv:2210.10199v1 [cs.LG])
Optimizing expensive-to-evaluate black-box functions of discrete (and potentially continuous) design parameters is a ubiquitous problem in scientific and engineering applications. Bayesian optimization (BO) is a popular, sample-efficient method that leverages a probabilistic surrogate model and an acquisition function (AF) to select promising designs to evaluate. However, maximizing the AF over mixed or high-cardinality discrete search spaces is challenging: standard gradient-based methods cannot be used directly, and evaluating the AF at every point in the search space would be computationally prohibitive. To address this issue, we propose using probabilistic reparameterization (PR). Instead of directly optimizing the AF over the search space containing discrete parameters, we maximize the expectation of the AF over a probability distribution defined by continuous parameters. We prove that under suitable reparameterizations, the BO policy that maximizes the probabilistic objective is the same as that which maximizes the AF, and therefore, PR enjoys the same regret bounds as the original BO policy using the underlying AF. Moreover, our approach provably converges to a stationary point of the probabilistic objective under gradient ascent using scalable, unbiased estimators of both the probabilistic objective and its gradient. Therefore, as the number of starting points and gradient steps increase, our approach will recover a maximizer of the AF (an often-neglected requisite for commonly used BO regret bounds). We validate our approach empirically and demonstrate state-of-the-art optimization performance on a wide range of real-world applications. PR is complementary to (and benefits) recent work and naturally generalizes to settings with multiple objectives and black-box constraints.  ( 3 min )

  • Open

    How Tarteel Uses AI to Help Arabic Learners Perfect Their Pronunciation
There are some 1.8 billion Muslims, but only around 16% of them speak Arabic, the language of the Quran. This is in part because many Muslims struggle to find qualified instructors to give them feedback on their Quran recitation. Enter today's guest and his company Tarteel.  ( 4 min )
  • Open

    [R] State of the art audio classification
I'm evaluating different architectures for audio classification and I'm a bit lost. I found a couple of survey papers, but those are two years old and I'm looking for something newer. Have there been any major advances in this field? submitted by /u/the_javi_himself  ( 123 min )
    [R] Anyone trained donut on SROIE dataset ?
Hello, has anyone tried training Donut (https://arxiv.org/abs/2111.15664) on the SROIE dataset? I'm curious how well it can perform. Also, for those specialized in document understanding: I saw that papers like LayoutLM report metrics higher than 0.95 for information extraction. Are those exact-match metrics or classification metrics? Thanks! submitted by /u/Meddhouib10  ( 124 min )
    [D] How to augment "bad" samples in order to build and optimize algorithms?
We don't get enough “bad” samples from our customers, because it's challenging, almost impossible, to capture defects (bad samples) in production. Without bad samples, we cannot create or optimize algorithms. Any suggestions? submitted by /u/alovna88  ( 123 min )
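One common starting point when real defects are scarce is to synthesize them from good samples; the blotch-pasting below is a crude, hypothetical example of such an augmentation, not a recommendation specific to this poster's domain:

```python
import numpy as np

def synthesize_defect(image, rng=None):
    """Paste a random dark blotch onto a good sample: a crude, hypothetical
    stand-in for a real defect, useful only for bootstrapping a detector."""
    rng = rng or np.random.default_rng()
    out = image.copy()
    h, w = out.shape[:2]
    bh = int(rng.integers(max(1, h // 10), max(2, h // 4)))
    bw = int(rng.integers(max(1, w // 10), max(2, w // 4)))
    y = int(rng.integers(0, h - bh))
    x = int(rng.integers(0, w - bw))
    patch = out[y:y + bh, x:x + bw]
    out[y:y + bh, x:x + bw] = (patch * 0.2).astype(out.dtype)  # darken it
    return out
```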
    [D] Get decision rules for trained Isolation Forest outlier detection model
I'm wondering if it's possible to translate a trained Isolation Forest outlier detection model into written rules (think if-else statements). This would be beneficial for model explainability. I think it's possible since it's just an ensemble model of decision trees. You would just get the rules from the top splits that determine which samples are outliers, depending on the contamination threshold. I assume some rules might overlap, so that some rules become redundant. Am I wrong? link to isolation forest sklearn: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html submitted by /u/mrwafflezzz  ( 125 min )
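A fitted IsolationForest is indeed just an ensemble of randomized trees, and each tree's splits can be dumped as nested if/else rules with scikit-learn's `export_text`; a minimal sketch on made-up data (aggregating rules across the ensemble and tying them to the contamination threshold is the remaining work):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.tree import export_text

X = np.random.RandomState(0).randn(200, 3)
forest = IsolationForest(random_state=0).fit(X)

# Each base estimator is a randomized regression tree; print its splits
# as nested if/else rules. Deduplicating overlapping rules across trees
# is left to the reader, as the poster suspects.
print(export_text(forest.estimators_[0], feature_names=["f0", "f1", "f2"]))
```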
    [R] Is there an equivalent issue of robustness in using DNNs for regression problems?
I’m reading through the literature on robustness (and adversarial attacks/defences) and all the case studies are for classification problems or, more rarely, NLP applications. I was wondering if there was a reason for this, or if there is a genuine gap in the literature, which I suspect is unlikely. By regression I mean applications like using a DNN as a universal approximator for solving physics problems. submitted by /u/Evening-Apartment-70  ( 123 min )
    [D] Call for questions for Andrej Karpathy from Lex Fridman
Hi, my name is Lex Fridman. I host a podcast. I'm talking to Andrej Karpathy on it soon. To me, Andrej is one of the best researchers and educators in the history of the machine learning field. If you have questions/topic suggestions you'd like us to discuss, including technical and philosophical ones, please let me know. submitted by /u/lexfridman  ( 144 min )
    [p] New project: AI powered form builder (fillout.com)
Fillout is the first AI-powered form builder. Just describe your form and our AI will instantly generate a form for you to customize. We're live on Product Hunt and would appreciate your support and feedback! https://www.producthunt.com/posts/fillout-com Check out the demo video on that post to see how the AI works and let me know if you have any questions - I'll be around :) submitted by /u/dominicwhyte42  ( 124 min )
    [P] Training anime diffusion model from scratch with limited compute
Blog post: This uses a nice dataset by /u/gwern, and trains a decent latent-diffusion-based model from scratch with several orders of magnitude less compute than Stable Diffusion (anime-only, but it is better than Stable Diffusion on anime pictures). FWIW, a self-hosted demo (I tried to restrict it to produce only safe samples). It runs via a Gradio proxy, so it is flaky and unstable at times, but works after some retries. I'll keep it running for the next several days. submitted by /u/enryu42  ( 124 min )
    [R] MIT releases all slides for efficient ML course
I thought this comment in a different thread was worth highlighting: https://www.reddit.com/r/MachineLearning/comments/y7r4i2/comment/iswsr45/ MIT is releasing all the slides for its new EfficientML course if you want a gentler introduction: https://efficientml.ai/ submitted by /u/That_Violinist_18  ( 122 min )
    [D] Solving energy minimization problems using neural networks
I am trying to solve a problem where a given input vector x must be transformed into an optimal output vector y. Both vectors are of the same length. The optimal transforming function y = F(x) is unknown. I can, however, measure how well a given output y’ matches some desired properties given x. More formally, there is a known and differentiable loss function L = G(x, y’). This loss function differs from the usual loss functions used in neural networks (e.g. MSE) in that it does not require any labels to compute the loss: it can compute the loss and its derivative knowing only the input x and the output y’ of the neural network. Further, I have many different input vectors X = {x_1, x_2, …, x_n} and they all have their corresponding optimal output vectors Y = {y_1, y_2,…  ( 127 min )
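Since G is known and differentiable, one can backpropagate through it directly and train the network as an amortized optimizer, with no labels at all; a minimal PyTorch sketch, where the quadratic `G` and the layer sizes are purely hypothetical stand-ins:

```python
import torch
import torch.nn as nn

def G(x, y):
    # Hypothetical stand-in for the known differentiable, label-free loss
    # L = G(x, y'); any differentiable property-matching score works here.
    return ((y - x.roll(1, dims=1)) ** 2).sum(dim=1)

net = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(1000):
    x = torch.randn(32, 16)   # a batch of input vectors
    y = net(x)                # candidate outputs y' = F_theta(x)
    loss = G(x, y).mean()     # no labels: G scores (x, y') directly
    opt.zero_grad()
    loss.backward()
    opt.step()
```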
    [D] Python's library to multivariate time series forecasting: Sktime, modeltime, darts.
Hello, I am just starting a new project on time series forecasting and am considering which library might be best to use. In a previous project I used sktime, but recently I found modeltime and darts as well. So my question is: have you used any of these libraries (maybe more than one), and could you tell me why you like them or why not? Thanks in advance for all answers. submitted by /u/popcornn1  ( 125 min )
    [D] Imagic Stable Diffusion training in 11 GB VRAM with diffusers and colab link.
Text-Based Real Image Editing Code: https://github.com/ShivamShrirao/diffusers/tree/main/examples/imagic Colab: https://colab.research.google.com/github/ShivamShrirao/diffusers/blob/main/examples/imagic/Imagic_Stable_Diffusion.ipynb Still need to play around and tune the parameters a bit; it may not work as-is on every subject. Hopefully everyone can try it out now. Input image: a photo of Barack Obama smiling with a big grin. submitted by /u/0x00groot  ( 124 min )
    [R] Research Job & Internship in NLP and Graph based Learning
About us We are looking for two interns (6 months) to work with our research group. It's a paid internship. Our mission is to help life sciences industries and hospitals use machine learning methods to conduct faster and safer clinical development and regulatory programs, and to address a diverse set of research problems in the clinical and biomedical domain. Our overarching goal is to use machine learning to improve every aspect of the scientific effort. We work with many different clients, such as Gilead and Pfizer, and data types, including very large clinical trial data, graph data, NLP records, and tabular data. We have excellent computational resources, both of our own and shared within the company. As a machine learning research group, we develop new methods and …  ( 124 min )
    On-Device Training Under 256KB Memory [R]
Training a machine-learning model on an intelligent edge device allows it to adapt to new data and make better predictions. For instance, training a model on a smart keyboard could enable the keyboard to continually learn from the user's writing. However, the training process requires so much memory that it is typically done using powerful computers at a data center, before the model is deployed on a device. This is more costly and raises privacy issues, since user data must be sent to a central server. To address this problem, researchers at MIT and the MIT-IBM Watson AI Lab developed a new technique that enables on-device training using less than a quarter of a megabyte of memory (256KB). They have released a paper on how they achieved on-device training under 256KB of memory. They tested their framework by training a computer vision model to detect people in images. After only 10 minutes of training, it learned to complete the task successfully. Their method was able to train a model more than 20 times faster than other approaches. I am interested to see if someone takes the approaches described in the paper to reduce the memory footprint of large machine learning models, thereby reducing hardware requirements and making these models more accessible. I have made a video on the research paper. Do check out the video: https://youtu.be/sQFeh9X2L3k submitted by /u/Sea-Photo5230 [link] [comments]  ( 124 min )
[P] What algorithm/library do y'all think would be good for this?
In autoimmune disease, a lot of people have food sensitivities, with a lot of variance as to what they are. I wanted to make an app that helps identify potential food triggers from food journaling, release it for free, and have it as a portfolio project. What algorithm would I need where the input is food journal info and symptom severity, and the output is potential correlations? Would this be clustering? If so, can someone point out a library that is good for this (a simple baseline is sketched below)? submitted by /u/happyhornetsfan [link] [comments]  ( 126 min )
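The described input/output (food flags in, correlations out) maps most directly to association analysis rather than clustering. A minimal pandas baseline, on entirely made-up journal data, might look like this:

```python
import pandas as pd

# Hypothetical food journal: one row per day, 0/1 flags for foods eaten,
# plus a self-reported symptom severity score.
journal = pd.DataFrame({
    "dairy":    [1, 0, 1, 1, 0, 1],
    "gluten":   [0, 1, 1, 0, 0, 1],
    "nuts":     [0, 0, 1, 0, 1, 0],
    "severity": [7, 2, 8, 6, 1, 9],
})

# Point-biserial correlation between each food flag and severity;
# foods could also be lagged, since symptoms often follow intake by a day or two.
foods = journal.drop(columns="severity")
corr = foods.apply(lambda col: col.corr(journal["severity"]))
print(corr.sort_values(ascending=False))
```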
    [Project] DoorDash Eng Blog: Augmenting fuzzy matching with human review to maximize precision and recall
    https://doordash.engineering/2022/10/18/augmenting-fuzzy-matching-with-human-review-to-maximize-precision-and-recall/ I recently solved a business problem at DoorDash: we needed a way to onboard new advertisers (brands that sell products at convenience/grocery stores) at scale, without having to manually identify all their products that are available on DoorDash. We used a fuzzy-matching classifier with a human in the loop. The part I find the most interesting is how we were able to take an out-of-the-box fuzzy matching algorithm and – with relatively little technical effort – tweak it to improve precision and recall. See table toward the end of the piece. Comment away, either here or on the post itself! submitted by /u/chikinn [link] [comments]  ( 125 min )
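For readers unfamiliar with the pattern, here is a generic sketch of fuzzy matching with a human-review band (this is not DoorDash's actual code; the thresholds and the rapidfuzz scorer are illustrative choices):

```python
from rapidfuzz import fuzz

# Generic pattern: auto-accept confident matches, auto-reject clear misses,
# and route the ambiguous middle band to human review. Thresholds are
# illustrative; tune them against labeled data to trade precision for recall.
AUTO_ACCEPT, AUTO_REJECT = 90, 60

def route(candidate: str, catalog_item: str) -> str:
    score = fuzz.token_sort_ratio(candidate, catalog_item)  # 0-100 similarity
    if score >= AUTO_ACCEPT:
        return "match"
    if score <= AUTO_REJECT:
        return "no_match"
    return "human_review"

print(route("Coca Cola 12oz can", "Coca-Cola Can 12 oz"))  # likely "match"
```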
  • Open

    The Future of Artificial Intelligence
    submitted by /u/Norbrah [link] [comments]  ( 118 min )
    It's all GPT-3 under the hood, but a great example of how AI writing assistants become widely used
    submitted by /u/17syllogisms [link] [comments]  ( 115 min )
    Anti-Work and Artificial Intelligence; All Help Appreciated!
    TL;DR: I am about to be employed part-time at a company that requires me to write enticing blurbs about articles I source for CEOs to post on their LinkedIn page. I am required to manage around 15 clients, who require 9 blurbs a week each. After some simple math, that equates to around 108-162 hours a week; the company limits my hours to 25 a week. I am looking to utilize AI to create an efficient process that makes this requirement actually achievable. Do you have any advice? Hello all! A little about myself: I am a 22-year-old who recently reenrolled in college after a two-year break due to the COVID pandemic. As a means of paying for rent, food, and other utilities during this time, I've begun to search for remote part-time positions and am in the midst of the interview process with o…  ( 128 min )
    Do Companies need a Chief AI-Ethics Officer?
    submitted by /u/Philo167 [link] [comments]  ( 118 min )
    Using AI to Translate Speech For a Primarily Oral Language
    submitted by /u/magenta_placenta [link] [comments]  ( 114 min )
    Best RPA software for your $ ?
    I've seen UiPath come up quite a bit but I'm wondering what else is out there. Any companies you guys know of that are reliable and affordable? submitted by /u/nattescott [link] [comments]  ( 119 min )
    The Netherlands Has Deployed NATO’s First Killer Robot Ground Vehicles
    submitted by /u/Black_RL [link] [comments]  ( 122 min )
    Is A* better than previous path finding algorithms?
'Pathfinding', or planning a route to a destination that bypasses obstacles, is a classic problem in AI. In this regard, the A* search algorithm has garnered much attention for its ability to handle complex paths. However, according to Stack Overflow, there are situations in which A* may not be the best algorithm for a given problem, and there are a number of parameters to assess when deciding what constitutes the best algorithm for finding a solution. https://analyticsindiamag.com/is-a-better-than-previous-path-finding-algorithms/ submitted by /u/analyticsindiam [link] [comments]  ( 117 min )
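For reference, the textbook A* algorithm fits in a few lines. This sketch runs on a small occupancy grid with a Manhattan-distance heuristic:

```python
import heapq

def astar(grid, start, goal):
    """A* on a 2D grid of 0 (free) / 1 (wall), Manhattan-distance heuristic."""
    rows, cols = len(grid), len(grid[0])
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    frontier = [(h(start), 0, start, [start])]  # (f = g + h, g, node, path)
    seen = set()
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        r, c = node
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nxt = (nr, nc)
                heapq.heappush(frontier, (g + 1 + h(nxt), g + 1, nxt, path + [nxt]))
    return None  # no path exists

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))  # routes around the wall via the right column
```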
    An open-source image database that unlocks the power of AI for ocean exploration
    A new collaborative effort between MBARI and other research institutions is leveraging the power of artificial intelligence and machine learning to accelerate efforts to study the ocean. In order to manage impacts from climate change and other threats, researchers urgently need to learn more about the ocean's inhabitants, ecosystems, and processes. As scientists and engineers develop advanced robotics that can visualize marine life and environments to monitor changes in the ocean's health, they face a fundamental problem: The collection of images, video, and other visual data vastly exceeds researchers' capacity for analysis. FathomNet is an open-source image database that uses state-of-the-art data processing algorithms to help process the backlog of visual data. Using artificial intell…  ( 146 min )
Create images based on other images
    Hello, Do you know a tool to create alternative images based on a series of images? Thanks submitted by /u/lxdprod [link] [comments]  ( 118 min )
    AI Dream 55 - AI 5 months ago feels like ages ago
    submitted by /u/LordPewPew777 [link] [comments]  ( 116 min )
    Stability AI uses latent diffusion models to allow users to create art in Stable Diffusion | The digital media firm wants to make AI art "accessible to all."
    submitted by /u/Tao_Dragon [link] [comments]  ( 118 min )
    "AI Artwork" Prompt: The World in a year
    submitted by /u/rotated12 [link] [comments]  ( 117 min )
    @Drawthistweet was suspended on twitter. Any similar AIs to it?
Their site: https://drawtt.com/ and their Twitter: https://twitter.com/DrawThisTweet. Which others can you recommend? submitted by /u/Ox0K3n [link] [comments]  ( 115 min )
As of now, would an AI be able to understand a movie and produce a summary or even a critique of it?
    submitted by /u/Constellation29 [link] [comments]  ( 113 min )
    I Let AI Control My Actions For One Day (GPT-3)
    submitted by /u/allaboutai-kris [link] [comments]  ( 116 min )
    Ubisoft AI to generate animation
    submitted by /u/Dazzling_Swordfish14 [link] [comments]  ( 116 min )
  • Open

    Detect fraudulent transactions using machine learning with Amazon SageMaker
    Businesses can lose billions of dollars each year due to malicious users and fraudulent transactions. As more and more business operations move online, fraud and abuses in online systems are also on the rise. To combat online fraud, many businesses have been using rule-based fraud detection systems. However, traditional fraud detection systems rely on a […]  ( 12 min )
    Implement RStudio on your AWS environment and access your data lake using AWS Lake Formation permissions
R is a popular analytic programming language used by data scientists and analysts to perform data processing, conduct statistical analyses, create data visualizations, and build machine learning (ML) models. RStudio, the integrated development environment for R, provides open-source tools and enterprise-ready professional software for teams to develop and share their work across their organization. Building, […]  ( 9 min )
    Design patterns for serial inference on Amazon SageMaker
    As machine learning (ML) goes mainstream and gains wider adoption, ML-powered applications are becoming increasingly common to solve a range of complex business problems. The solution to these complex business problems often requires using multiple ML models. These models can be sequentially combined to perform various tasks, such as preprocessing, data transformation, model selection, inference […]  ( 11 min )
  • Open

    Do Modern ImageNet Classifiers Accurately Predict Perceptual Similarity?
    Posted by Manoj Kumar, Research Engineer, and Ekin Dogus Cubuk, Research Scientist, Google Research The task of determining the similarity between images is an open problem in computer vision and is crucial for evaluating the realism of machine-generated images. Though there are a number of straightforward methods of estimating image similarity (e.g., low-level metrics that measure pixel differences, such as FSIM and SSIM), in many cases, the measured similarity differences do not match the differences perceived by a person. However, more recent work has demonstrated that intermediate representations of neural network classifiers, such as AlexNet, VGG and SqueezeNet trained on ImageNet, exhibit perceptual similarity as an emergent property. That is, Euclidean distances between encoded rep…  ( 26 min )
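As a concrete point of reference, a low-level metric like SSIM can be computed in a few lines with scikit-image (the arrays below stand in for real images); perceptual metrics instead compare distances between network embeddings, as the post describes:

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

# Two grayscale "images" in [0, 1]; with real data these would be, e.g.,
# a reference image and a machine-generated one.
rng = np.random.default_rng(0)
a = rng.random((64, 64))
b = np.clip(a + 0.1 * rng.random((64, 64)), 0, 1)  # slightly perturbed copy

# SSIM is a pixel-statistics metric; a high value need not imply high
# *perceptual* similarity, which is what motivates deep-feature metrics.
score = ssim(a, b, data_range=1.0)
print(f"SSIM: {score:.3f}")
```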
  • Open

    Inferencing the Transformer Model
    We have seen how to train the Transformer model on a dataset of English and German sentence pairs and how to plot the training and validation loss curves to diagnose the model’s learning performance and decide at which epoch to run inference on the trained model. We are now ready to run inference on the […] The post Inferencing the Transformer Model appeared first on Machine Learning Mastery.  ( 21 min )
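As a generic illustration of what such an inference step involves, a greedy decoding loop for an encoder-decoder Transformer looks roughly like this (a sketch, not the tutorial's exact code; `encode`/`decode` and the token ids are placeholder names):

```python
import torch

def greedy_decode(model, src_ids, bos_id, eos_id, max_len=50):
    """Generic greedy inference loop for a seq2seq Transformer (sketch).
    `model` is assumed to expose encode(src) and decode(tgt, memory) -> logits."""
    model.eval()
    with torch.no_grad():
        memory = model.encode(src_ids)             # encode the source once
        tgt = torch.tensor([[bos_id]])             # start with <bos>
        for _ in range(max_len):
            logits = model.decode(tgt, memory)     # (1, t, vocab)
            next_id = logits[0, -1].argmax().item()  # most probable next token
            tgt = torch.cat([tgt, torch.tensor([[next_id]])], dim=1)
            if next_id == eos_id:                  # stop at <eos>
                break
    return tgt.squeeze(0).tolist()
```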
  • Open

    What are sota hyperparameter optimization methods?
Hello everyone! What do you think are the most promising approaches to tuning hyperparameters for training environments using standard algorithms like PPO, SAC, or similar? Many papers that I come across still conduct trial-and-error or run grid searches. As far as I know, CleanRL implements TPE. Also, I'm wondering if F-Race is worth considering. submitted by /u/LilHairdy [link] [comments]  ( 118 min )
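For what it's worth, TPE is the default sampler in Optuna, so a minimal study looks like this (`train_and_evaluate` is a placeholder for an actual PPO/SAC training run; the search ranges are illustrative):

```python
import optuna

def train_and_evaluate(lr, gamma, clip_range):
    # Placeholder for a real training run that returns mean episodic return.
    # Here: a fake quadratic score so the example runs end-to-end.
    return -(lr - 3e-4) ** 2 - (gamma - 0.99) ** 2 - (clip_range - 0.2) ** 2

def objective(trial):
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    gamma = trial.suggest_float("gamma", 0.95, 0.9999)
    clip = trial.suggest_float("clip_range", 0.1, 0.4)
    return train_and_evaluate(lr, gamma, clip)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)
```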
    What is the best way to learn standard deviation of a Gaussian policy?
Gaussian policies are in prevalent use in DRL due to their simplicity and efficiency. However, since the action space is bounded in a finite interval in both simulations and most real applications, the standard deviation must be constrained so as not to produce extreme values when combined with a neural network. For instance, PPO learns a global std with a safe initial value (normally 1); SAC outputs it with a fully connected layer but clamps its value to a tolerable range. If we don't consider other distributions with finite support, e.g. a Beta policy, a direct way to control the instability is to use a safer mapping from the final layer's output to a real value with a designed function. In practice, one mostly learns a log std and then maps it back with an exponential function. A more practical way is to use the Softplus function, which is less aggressive than the exponential. Are there any other methods, new mappings, or techniques to achieve better numerical stability? This problem does not seem to get much attention, yet it can be sensitive to the network architecture design: without the Tanh activation function in the earlier layers, PPO cannot work, and without enforcing the action bound, SAC cannot work with ReLU activation. I would appreciate any ideas on this problem. submitted by /u/OutOfCharm [link] [comments]  ( 118 min )
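To make the two mappings discussed above concrete, here is a small PyTorch sketch (the clamp bounds are common SAC-style choices, not universal constants):

```python
import torch
import torch.nn.functional as F

LOG_STD_MIN, LOG_STD_MAX = -20.0, 2.0  # common SAC-style clamp range

def std_from_raw(raw: torch.Tensor, method: str = "softplus") -> torch.Tensor:
    """Map an unconstrained network output to a positive std."""
    if method == "exp":
        # learn a log std, clamp it, then exponentiate (as in many codebases)
        return torch.exp(torch.clamp(raw, LOG_STD_MIN, LOG_STD_MAX))
    if method == "softplus":
        # softplus grows linearly for large inputs, so it is less explosive
        # than exp; the epsilon keeps the std strictly positive
        return F.softplus(raw) + 1e-5
    raise ValueError(method)

raw = torch.randn(4)
print(std_from_raw(raw, "exp"), std_from_raw(raw, "softplus"))
```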
  • Open

    Mitigating Covertly Unsafe Text within Natural Language Systems. (arXiv:2210.09306v1 [cs.AI] CROSS LISTED)
    An increasingly prevalent problem for intelligent technologies is text safety, as uncontrolled systems may generate recommendations to their users that lead to injury or life-threatening consequences. However, the degree of explicitness of a generated statement that can cause physical harm varies. In this paper, we distinguish types of text that can lead to physical harm and establish one particularly underexplored category: covertly unsafe text. Then, we further break down this category with respect to the system's information and discuss solutions to mitigate the generation of text in each of these subcategories. Ultimately, our work defines the problem of covertly unsafe language that causes physical harm and argues that this subtle yet dangerous issue needs to be prioritized by stakeholders and regulators. We highlight mitigation strategies to inspire future researchers to tackle this challenging problem and help improve safety within smart systems.  ( 2 min )
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v2 [cs.LG] UPDATED)
    Inferring causal structure poses a combinatorial search problem that typically involves evaluating structures with a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize causal structure learning. Rather than searching over structures, we train a variational inference model to predict the causal structure from observational or interventional data. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Instead, our inference model acquires domain-specific inductive biases for causal discovery solely from data generated by a simulator. The architecture of our inference model emulates permutation invariances that are crucial for statistical efficiency in structure learning, which facilitates generalization to significantly larger problem instances than seen during training. On synthetic data and semisynthetic gene expression data, our models exhibit robust generalization capabilities when subject to substantial distribution shifts and significantly outperform existing algorithms, especially in the challenging genomics domain. Our code and models are publicly available at: https://github.com/larslorch/avici.  ( 2 min )
    Anticipating Performativity by Predicting from Predictions. (arXiv:2208.07331v2 [stat.ML] UPDATED)
    Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they aim to predict. Understanding the causal effect of these predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes. This can make the causal effects of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability, and despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal relationship between covariates, predictions and outcomes can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically we show that given our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can indeed find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.  ( 3 min )
    Deep Learning-Enabled Semantic Communication Systems with Task-Unaware Transmitter and Dynamic Data. (arXiv:2205.00271v3 [cs.IT] UPDATED)
Existing deep learning-enabled semantic communication systems often rely on shared background knowledge between the transmitter and receiver that includes empirical data and their associated semantic information. In practice, the semantic information is defined by the pragmatic task of the receiver and cannot be known to the transmitter. The actual observable data at the transmitter can also have a distribution that differs from the empirical data in the shared background knowledge library. To address these practical issues, this paper proposes a new neural network-based semantic communication system for image transmission, where the transmitter is unaware of the task and the data environment is dynamic. The system consists of two main parts, namely the semantic coding (SC) network and the data adaptation (DA) network. The SC network learns how to extract and transmit the semantic information using a receiver-leading training process. By using the domain adaptation technique from transfer learning, the DA network learns how to convert the observed data into a form similar to the empirical data that the SC network can process without retraining. Numerical experiments show that the proposed method can adapt to observable datasets while keeping high performance in terms of both data recovery and task execution.  ( 3 min )
GlanceNets: Interpretable, Leak-proof Concept-based Models. (arXiv:2205.15612v2 [cs.LG] UPDATED)
    There is growing interest in concept-based models (CBMs) that combine high-performance and interpretability by acquiring and reasoning with a vocabulary of high-level concepts. A key requirement is that the concepts be interpretable. Existing CBMs tackle this desideratum using a variety of heuristics based on unclear notions of interpretability, and fail to acquire concepts with the intended semantics. We address this by providing a clear definition of interpretability in terms of alignment between the model's representation and an underlying data generation process, and introduce GlanceNets, a new CBM that exploits techniques from disentangled representation learning and open-set recognition to achieve alignment, thus improving the interpretability of the learned concepts. We show that GlanceNets, paired with concept-level supervision, achieve better alignment than state-of-the-art approaches while preventing spurious information from unintendedly leaking into the learned concepts.  ( 2 min )
    Data-Efficient Augmentation for Training Neural Networks. (arXiv:2210.08363v2 [cs.LG] UPDATED)
Data augmentation is essential to achieve state-of-the-art performance in many deep learning applications. However, the most effective augmentation techniques become computationally prohibitive for even medium-sized datasets. To address this, we propose a rigorous technique to select subsets of data points that when augmented, closely capture the training dynamics of full data augmentation. We first show that data augmentation, modeled as additive perturbations, improves learning and generalization by relatively enlarging and perturbing the smaller singular values of the network Jacobian, while preserving its prominent directions. This prevents overfitting and enhances learning of the harder-to-learn information. Then, we propose a framework to iteratively extract small subsets of training data that when augmented, closely capture the alignment of the fully augmented Jacobian with labels/residuals. We prove that stochastic gradient descent applied to the augmented subsets found by our approach has similar training dynamics to that of fully augmented data. Our experiments demonstrate that our method achieves 6.3x speedup on CIFAR10 and 2.2x speedup on SVHN, and outperforms the baselines by up to 10% across various subset sizes. Similarly, on TinyImageNet and ImageNet, our method beats the baselines by up to 8%, while achieving up to 3.3x speedup across various subset sizes. Finally, training on and augmenting 50% subsets using our method on a version of CIFAR10 corrupted with label noise even outperforms using the full dataset.  ( 3 min )
    Approximation of Functionals by Neural Network without Curse of Dimensionality. (arXiv:2205.14421v4 [math.NA] UPDATED)
    In this paper, we establish a neural network to approximate functionals, which are maps from infinite dimensional spaces to finite dimensional spaces. The approximation error of the neural network is $O(1/\sqrt{m})$ where $m$ is the size of networks, which overcomes the curse of dimensionality. The key idea of the approximation is to define a Barron spectral space of functionals.
    Code Translation with Compiler Representations. (arXiv:2207.03578v3 [cs.PL] UPDATED)
In this paper, we leverage low-level compiler intermediate representations (IR) to improve code translation. Traditional transpilers rely on syntactic information and handcrafted rules, which limits their applicability and produces unnatural-looking code. Applying neural machine translation (NMT) approaches to code has successfully broadened the set of programs on which one can get a natural-looking translation. However, they treat the code as sequences of text tokens, and still do not differentiate well enough between similar pieces of code which have different semantics in different languages. The consequence is low-quality translation, reducing the practicality of NMT, and stressing the need for an approach significantly increasing its accuracy. Here we propose to augment code translation with IRs, specifically LLVM IR, with results on the C++, Java, Rust, and Go languages. Our method improves upon the state of the art for unsupervised code translation, increasing the number of correct translations by 11% on average, and up to 79% for the Java -> Rust pair with greedy decoding. With beam search, it increases the number of correct translations by 5.5% on average. We extend previous test sets for code translation by adding hundreds of Go and Rust functions. Additionally, we train models with high performance on the problem of IR decompilation, generating programming source code from IR, and study using IRs as an intermediary pivot for translation.
    Hidden State Variability of Pretrained Language Models Can Guide Computation Reduction for Transfer Learning. (arXiv:2210.10041v1 [cs.CL])
While transferring a pretrained language model, common approaches conventionally attach their task-specific classifiers to the top layer and adapt all the pretrained layers. We investigate whether one could make a task-specific selection on which subset of the layers to adapt and where to place the classifier. The goal is to reduce the computation cost of transfer learning methods (e.g. fine-tuning or adapter-tuning) without sacrificing its performance. We propose to select layers based on the variability of their hidden states given a task-specific corpus. We say a layer is already "well-specialized" in a task if the within-class variability of its hidden states is low relative to the between-class variability. Our variability metric is cheap to compute and doesn't need any training or hyperparameter tuning. It is robust to data imbalance and data scarcity. Extensive experiments on the GLUE benchmark demonstrate that selecting layers based on our metric can yield significantly stronger performance than using the same number of top layers and often match the performance of fine-tuning or adapter-tuning the entire language model.
    Leveraging Cluster Analysis to Understand Educational Game Player Experiences and Support Design. (arXiv:2210.09911v1 [cs.HC])
    The ability for an educational game designer to understand their audience's play styles and resulting experience is an essential tool for improving their game's design. As a game is subjected to large-scale player testing, the designers require inexpensive, automated methods for categorizing patterns of player-game interactions. In this paper we present a simple, reusable process using best practices for data clustering, feasible for use within a small educational game studio. We utilize the method to analyze a real-time strategy game, processing game telemetry data to determine categories of players based on their in-game actions, the feedback they received, and their progress through the game. An interpretive analysis of these clusters results in actionable insights for the game's designers.
    DOPE: Doubly Optimistic and Pessimistic Exploration for Safe Reinforcement Learning. (arXiv:2112.00885v3 [cs.LG] UPDATED)
    Safe reinforcement learning is extremely challenging--not only must the agent explore an unknown environment, it must do so while ensuring no safety constraint violations. We formulate this safe reinforcement learning (RL) problem using the framework of a finite-horizon Constrained Markov Decision Process (CMDP) with an unknown transition probability function, where we model the safety requirements as constraints on the expected cumulative costs that must be satisfied during all episodes of learning. We propose a model-based safe RL algorithm that we call Doubly Optimistic and Pessimistic Exploration (DOPE), and show that it achieves an objective regret $\tilde{O}(|\mathcal{S}|\sqrt{|\mathcal{A}| K})$ without violating the safety constraints during learning, where $|\mathcal{S}|$ is the number of states, $|\mathcal{A}|$ is the number of actions, and $K$ is the number of learning episodes. Our key idea is to combine a reward bonus for exploration (optimism) with a conservative constraint (pessimism), in addition to the standard optimistic model-based exploration. DOPE is not only able to improve the objective regret bound, but also shows a significant empirical performance improvement as compared to earlier optimism-pessimism approaches.
    GROW: A Row-Stationary Sparse-Dense GEMM Accelerator for Memory-Efficient Graph Convolutional Neural Networks. (arXiv:2203.00158v3 [cs.AR] UPDATED)
Graph convolutional neural networks (GCNs) have emerged as a key technology in various application domains where the input data is relational. A unique property of GCNs is that their two primary execution stages, aggregation and combination, exhibit drastically different dataflows. Consequently, prior GCN accelerators tackle this research space by casting the aggregation and combination stages as a series of sparse-dense matrix multiplications. However, prior work frequently suffers from inefficient data movements, leaving significant performance on the table. We present GROW, a GCN accelerator based on Gustavson's algorithm to architect a row-wise product-based sparse-dense GEMM accelerator. GROW co-designs the software/hardware that strikes a balance in locality and parallelism for GCNs, achieving significant energy-efficiency improvements vs. state-of-the-art GCN accelerators.
    Bootstrapped Transformer for Offline Reinforcement Learning. (arXiv:2206.08569v2 [cs.LG] UPDATED)
    Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets utilized in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which could be harmful to training sequence generation models yet has not drawn enough attention in the previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost the sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and beat other strong baseline methods. We also analyze the generated pseudo data and the revealed characteristics may shed some light on offline RL training. The codes are available at https://seqml.github.io/bootorl.
    Importance Weighting Correction of Regularized Least-Squares for Covariate and Target Shifts. (arXiv:2210.09709v1 [stat.ML])
In many real-world problems, the training data and test data have different distributions. This situation is commonly referred to as a dataset shift. The most common settings for dataset shift considered in the literature are covariate shift and target shift. Importance weighting (IW) correction is a universal method for correcting the bias present in learning scenarios under dataset shift. The question one may ask is: does IW correction work equally well for different dataset shift scenarios? By investigating the generalization properties of weighted kernel ridge regression (W-KRR) under covariate and target shifts, we show that the answer is negative, except when IW is bounded and the model is well-specified. In the latter cases, minimax optimal rates are achieved by importance-weighted kernel ridge regression (IW-KRR) in both covariate and target shift scenarios. Slightly relaxing the boundedness condition on the IW, we show that IW-KRR still achieves the optimal rates under target shift while leading to slower rates for covariate shift. In the case of model misspecification, we show that the performance of W-KRR under covariate shift could be substantially increased by designing an alternative reweighting function. The distinction between misspecified and well-specified scenarios does not seem to be crucial in learning problems under target shift.
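For context, the W-KRR estimator analyzed here is a standard object. A generic NumPy sketch (RBF kernel, λ, and the importance weights are illustrative; this is not the paper's code) solves the weighted normal equations directly:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def weighted_krr_fit(X, y, w, lam=1e-2):
    # Minimizes sum_i w_i (f(x_i) - y_i)^2 + lam * ||f||_H^2;
    # the first-order condition gives (W K + lam I) alpha = W y.
    K = rbf_kernel(X, X)
    W = np.diag(w)
    return np.linalg.solve(W @ K + lam * np.eye(len(y)), W @ y)

# Toy usage: under covariate shift the weights w_i = p_test(x_i) / p_train(x_i)
# would come from a density-ratio estimate; here they are made up.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
w = np.exp(-X[:, 0] ** 2)  # placeholder importance weights
alpha = weighted_krr_fit(X, y, w)
X_test = np.linspace(-2, 2, 5)[:, None]
print(rbf_kernel(X_test, X) @ alpha)  # predictions at test points
```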
    MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos. (arXiv:2210.09887v1 [cs.CV])
Convolutional neural network inference on video input is computationally expensive and has high memory bandwidth requirements. Recently, researchers managed to reduce the cost of processing upcoming frames by only processing pixels that changed significantly. Using sparse convolutions, the sparsity of frame differences can be translated to speedups on current inference devices. However, previous work relied on static cameras. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a CNN framework that supports moving cameras and variable resolution input. We propose a spherical buffer which enables seamless fusion of newly unveiled regions and previously processed regions - without increasing the memory footprint. Our evaluations show that we outperform previous work significantly by explicitly adding support for moving camera input.
    STay-ON-the-Ridge: Guaranteed Convergence to Local Minimax Equilibrium in Nonconvex-Nonconcave Games. (arXiv:2210.09769v1 [cs.LG])
Min-max optimization problems involving nonconvex-nonconcave objectives have found important applications in adversarial training and other multi-agent learning settings. Yet, no known gradient descent-based method is guaranteed to converge to (even local notions of) min-max equilibrium in the nonconvex-nonconcave setting. For all known methods, there exist relatively simple objectives for which they cycle or exhibit other undesirable behavior different from converging to a point, let alone to some game-theoretically meaningful one [Flokas et al., 2019; Hsieh et al., 2021]. The only known convergence guarantees hold under the strong assumption that the initialization is very close to a local min-max equilibrium [Wang et al., 2019]. Moreover, the afore-described challenges are not just theoretical curiosities. All known methods are unstable in practice, even in simple settings. We propose the first method that is guaranteed to converge to a local min-max equilibrium for smooth nonconvex-nonconcave objectives. Our method is second-order and provably escapes limit cycles as long as it is initialized at an easy-to-find initial point. Both the definition of our method and its convergence analysis are motivated by the topological nature of the problem. In particular, our method is not designed to decrease some potential function, such as the distance of its iterate from the set of local min-max equilibria or the projected gradient of the objective, but is designed to satisfy a topological property that guarantees the avoidance of cycles and implies its convergence.
    Private Non-Convex Federated Learning Without a Trusted Server. (arXiv:2203.06735v2 [cs.LG] UPDATED)
    We study federated learning (FL)--especially cross-silo FL--with non-convex loss functions and data from people who do not trust the server or other silos. In this setting, each silo (e.g. hospital) must protect the privacy of each person's data (e.g. patient's medical record), even if the server or other silos act as adversarial eavesdroppers. To that end, we consider inter-silo record-level (ISRL) differential privacy (DP), which requires silo $i$'s communications to satisfy record/item-level DP. We give novel ISRL-DP algorithms for FL with heterogeneous (non-i.i.d.) silo data and two classes of Lipschitz continuous loss functions: First, we consider losses satisfying the Proximal Polyak-Lojasiewicz (PL) inequality, which is an extension of the classical PL condition to the constrained setting. Prior works only considered unconstrained private optimization with Lipschitz PL loss, which rules out most interesting PL losses such as strongly convex problems and linear/logistic regression. However, by analyzing the proximal PL scenario, we permit these losses and others (e.g. LASSO, some neural nets) which are Lipschitz on a restricted parameter domain. Our algorithms nearly attain the optimal strongly convex, homogeneous (i.i.d.) rate for ISRL-DP FL without assuming convexity or i.i.d. data. Second, we give the first private algorithms for non-convex non-smooth loss functions. Our utility bounds even improve on the state-of-the-art bounds for smooth losses. We complement our upper bounds with lower bounds. Additionally, we provide shuffle DP (SDP) algorithms that improve over the state-of-the-art central DP algorithms under more practical trust assumptions. Numerical experiments show that our algorithm has better accuracy than baselines for most privacy levels.
    Early Diagnosis of Retinal Blood Vessel Damage via Deep Learning-Powered Collective Intelligence Models. (arXiv:2210.09449v1 [eess.IV])
Early diagnosis of retinal diseases such as diabetic retinopathy has had the attention of many researchers. Deep learning through the introduction of convolutional neural networks has become a prominent solution for image-related tasks such as classification and segmentation. Most tasks in image classification are handled by deep CNNs pretrained and evaluated on the ImageNet dataset. However, these models do not always translate to the best result on other datasets. Devising a neural network manually from scratch based on heuristics may not lead to an optimal model as there are numerous hyperparameters in play. In this paper, we use two nature-inspired swarm algorithms: particle swarm optimization (PSO) and ant colony optimization (ACO) to obtain TDCN models to perform classification of fundus images into severity classes. The power of swarm algorithms is used to search for various combinations of convolutional, pooling, and normalization layers to provide the best model for the task. It is observed that TDCN-PSO outperforms ImageNet-pretrained models and the existing literature, while TDCN-ACO achieves faster architecture search. The best TDCN model achieves an accuracy of 90.3%, AUC ROC of 0.956, and a Cohen kappa score of 0.967. The results were compared with the previous studies to show that the proposed TDCN models exhibit superior performance.
    SQ Lower Bounds for Learning Single Neurons with Massart Noise. (arXiv:2210.09949v1 [cs.LG])
    We study the problem of PAC learning a single neuron in the presence of Massart noise. Specifically, for a known activation function $f: \mathbb{R} \to \mathbb{R}$, the learner is given access to labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$, where the marginal distribution of $\mathbf{x}$ is arbitrary and the corresponding label $y$ is a Massart corruption of $f(\langle \mathbf{w}, \mathbf{x} \rangle)$. The goal of the learner is to output a hypothesis $h: \mathbb{R}^d \to \mathbb{R}$ with small squared loss. For a range of activation functions, including ReLUs, we establish super-polynomial Statistical Query (SQ) lower bounds for this learning problem. In more detail, we prove that no efficient SQ algorithm can approximate the optimal error within any constant factor. Our main technical contribution is a novel SQ-hard construction for learning $\{ \pm 1\}$-weight Massart halfspaces on the Boolean hypercube that is interesting on its own right.
    Unpacking Reward Shaping: Understanding the Benefits of Reward Engineering on Sample Complexity. (arXiv:2210.09579v1 [cs.LG])
Reinforcement learning provides an automated framework for learning behaviors from high-level reward specifications, but in practice the choice of reward function can be crucial for good results -- while in principle the reward only needs to specify what the task is, in reality practitioners often need to design more detailed rewards that provide the agent with some hints about how the task should be completed. The idea of this type of "reward shaping" has often been discussed in the literature, and is often a critical part of practical applications, but there is relatively little formal characterization of how the choice of reward shaping can yield benefits in sample complexity. In this work, we build on the framework of novelty-based exploration to provide a simple scheme for incorporating shaped rewards into RL along with an analysis tool to show that particular choices of reward shaping provably improve sample efficiency. We characterize the class of problems where these gains are expected to be significant and show how this can be connected to practical algorithms in the literature. We confirm that these results hold in practice in an experimental evaluation, providing an insight into the mechanisms through which reward shaping can significantly improve the complexity of reinforcement learning while retaining asymptotic performance.
    Deep Bidirectional Language-Knowledge Graph Pretraining. (arXiv:2210.09338v1 [cs.CL])
    Pretraining a language model (LM) on text has been shown to help various downstream NLP tasks. Recent works show that a knowledge graph (KG) can complement text data, offering structured background knowledge that provides a useful scaffold for reasoning. However, these works are not pretrained to learn a deep fusion of the two modalities at scale, limiting the potential to acquire fully joint representations of text and KG. Here we propose DRAGON (Deep Bidirectional Language-Knowledge Graph Pretraining), a self-supervised approach to pretraining a deeply joint language-knowledge foundation model from text and KG at scale. Specifically, our model takes pairs of text segments and relevant KG subgraphs as input and bidirectionally fuses information from both modalities. We pretrain this model by unifying two self-supervised reasoning tasks, masked language modeling and KG link prediction. DRAGON outperforms existing LM and LM+KG models on diverse downstream tasks including question answering across general and biomedical domains, with +5% absolute gain on average. In particular, DRAGON achieves notable performance on complex reasoning about language and knowledge (+10% on questions involving long contexts or multi-step reasoning) and low-resource QA (+8% on OBQA and RiddleSense), and new state-of-the-art results on various BioNLP tasks. Our code and trained models are available at https://github.com/michiyasunaga/dragon.
    A Novel Feature Representation for Malware Classification. (arXiv:2210.09580v1 [cs.CR])
    In this study we have presented a novel feature representation for malicious programs that can be used for malware classification. We have shown how to construct the features in a bottom-up approach, and analyzed the overlap of malicious and benign programs in terms of their components. We have shown that our method of analysis offers an increase in feature resolution that is descriptive of data movement in comparison to tf-idf features.
    Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion. (arXiv:2210.10044v1 [cs.RO])
An attached arm can significantly increase the applicability of legged robots to several mobile manipulation tasks that are not possible for the wheeled or tracked counterparts. The standard hierarchical control pipeline for such legged manipulators is to decouple the controller into that of manipulation and locomotion. However, this is ineffective. It requires immense engineering to support coordination between the arm and legs, and error can propagate across modules, causing non-smooth, unnatural motions. It is also biologically implausible given the evidence for strong motor synergies across limbs. In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing exploiting the causal dependency in the action space to overcome local minima during training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups. Videos are at https://maniploco.github.io
    LobsDICE: Offline Learning from Observation via Stationary Distribution Correction Estimation. (arXiv:2202.13536v2 [cs.LG] UPDATED)
    We consider the problem of learning from observation (LfO), in which the agent aims to mimic the expert's behavior from the state-only demonstrations by experts. We additionally assume that the agent cannot interact with the environment but has access to the action-labeled transition data collected by some agents with unknown qualities. This offline setting for LfO is appealing in many real-world scenarios where the ground-truth expert actions are inaccessible and the arbitrary environment interactions are costly or risky. In this paper, we present LobsDICE, an offline LfO algorithm that learns to imitate the expert policy via optimization in the space of stationary distributions. Our algorithm solves a single convex minimization problem, which minimizes the divergence between the two state-transition distributions induced by the expert and the agent policy. Through an extensive set of offline LfO tasks, we show that LobsDICE outperforms strong baseline methods.
    Single-level Adversarial Data Synthesis based on Neural Tangent Kernels. (arXiv:2204.04090v6 [cs.LG] UPDATED)
    Generative adversarial networks (GANs) have achieved impressive performance in data synthesis and have driven the development of many applications. However, GANs are known to be hard to train due to their bilevel objective, which leads to the problems of convergence, mode collapse, and gradient vanishing. In this paper, we propose a new generative model called the generative adversarial NTK (GA-NTK) that has a single-level objective. The GA-NTK keeps the spirit of adversarial learning (which helps generate plausible data) while avoiding the training difficulties of GANs. This is done by modeling the discriminator as a Gaussian process with a neural tangent kernel (NTK-GP) whose training dynamics can be completely described by a closed-form formula. We analyze the convergence behavior of GA-NTK trained by gradient descent and give some sufficient conditions for convergence. We also conduct extensive experiments to study the advantages and limitations of GA-NTK and propose some techniques that make GA-NTK more practical.
    Scale-Agnostic Super-Resolution in MRI using Feature-Based Coordinate Networks. (arXiv:2210.08676v2 [cs.CV] UPDATED)
    We propose using a coordinate network decoder for the task of super-resolution in MRI. The continuous signal representation of coordinate networks enables this approach to be scale-agnostic, i.e. one can train over a continuous range of scales and subsequently query at arbitrary resolutions. Due to the difficulty of performing super-resolution on inherently noisy data, we analyze network behavior under multiple denoising strategies. Lastly we compare this method to a standard convolutional decoder using both quantitative metrics and a radiologist study implemented in Voxel, our newly developed tool for web-based evaluation of medical images.
    Automatic Differentiation of Programs with Discrete Randomness. (arXiv:2210.08572v2 [cs.LG] UPDATED)
    Automatic differentiation (AD), a technique for constructing new programs which compute the derivative of an original program, has become ubiquitous throughout scientific computing and deep learning due to the improved performance afforded by gradient-based optimization. However, AD systems have been restricted to the subset of programs that have a continuous dependence on parameters. Programs that have discrete stochastic behaviors governed by distribution parameters, such as flipping a coin with probability $p$ of being heads, pose a challenge to these systems because the connection between the result (heads vs tails) and the parameters ($p$) is fundamentally discrete. In this paper we develop a new reparameterization-based methodology that allows for generating programs whose expectation is the derivative of the expectation of the original program. We showcase how this method gives an unbiased and low-variance estimator which is as automated as traditional AD mechanisms. We demonstrate unbiased forward-mode AD of discrete-time Markov chains, agent-based models such as Conway's Game of Life, and unbiased reverse-mode AD of a particle filter. Our code is available at https://github.com/gaurav-arya/StochasticAD.jl.
    Fine-mixing: Mitigating Backdoors in Fine-tuned Language Models. (arXiv:2210.09545v1 [cs.CL])
    Deep Neural Networks (DNNs) are known to be vulnerable to backdoor attacks. In Natural Language Processing (NLP), DNNs are often backdoored during the fine-tuning process of a large-scale Pre-trained Language Model (PLM) with poisoned samples. Although the clean weights of PLMs are readily available, existing methods have ignored this information in defending NLP models against backdoor attacks. In this work, we take the first step to exploit the pre-trained (unfine-tuned) weights to mitigate backdoors in fine-tuned language models. Specifically, we leverage the clean pre-trained weights via two complementary techniques: (1) a two-step Fine-mixing technique, which first mixes the backdoored weights (fine-tuned on poisoned data) with the pre-trained weights, then fine-tunes the mixed weights on a small subset of clean data; (2) an Embedding Purification (E-PUR) technique, which mitigates potential backdoors existing in the word embeddings. We compare Fine-mixing with typical backdoor mitigation methods on three single-sentence sentiment classification tasks and two sentence-pair classification tasks and show that it outperforms the baselines by a considerable margin in all scenarios. We also show that our E-PUR method can benefit existing mitigation methods. Our work establishes a simple but strong baseline defense for secure fine-tuned NLP models against backdoor attacks.
    Understanding COVID-19 Vaccine Campaign on Facebook using Minimal Supervision. (arXiv:2210.10031v1 [cs.CL])
In the age of social media, where billions of internet users share information and opinions, the negative impact of pandemics is not limited to the physical world. It provokes a surge of incomplete, biased, and incorrect information, also known as an infodemic. This global infodemic jeopardizes measures to control the pandemic by creating panic, vaccine hesitancy, and fragmented social response. Platforms like Facebook allow advertisers to adapt their messaging to target different demographics and help alleviate or exacerbate the infodemic problem depending on their content. In this paper, we propose a minimally supervised multi-task learning framework for understanding messaging on Facebook related to the COVID-19 vaccine by identifying ad themes and moral foundations. Furthermore, we perform a more nuanced thematic analysis of messaging tactics of vaccine campaigns on social media so that policymakers can make better decisions on pandemic control.
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v2 [cs.LG] UPDATED)
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantee for these methods on learning invariant mechanisms is lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria on judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests, and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.
    Factored Adaptation for Non-Stationary Reinforcement Learning. (arXiv:2203.16582v2 [cs.LG] UPDATED)
Dealing with non-stationarity in environments (e.g., in the transition dynamics) and objectives (e.g., in the reward functions) is a challenging problem that is crucial in real-world applications of reinforcement learning (RL). While most current approaches model the changes as a single shared embedding vector, we leverage insights from the recent causality literature to model non-stationarity in terms of individual latent change factors, and causal graphs across different environments. In particular, we propose Factored Adaptation for Non-Stationary RL (FANS-RL), a factored adaptation approach that learns jointly both the causal structure in terms of a factored MDP, and a factored representation of the individual time-varying change factors. We prove that under standard assumptions, we can completely recover the causal graph representing the factored transition and reward function, as well as a partial structure between the individual change factors and the state components. Through our general framework, we can consider general non-stationary scenarios with different function types and changing frequency, including changes across episodes and within episodes. Experimental results demonstrate that FANS-RL outperforms existing approaches in terms of return, compactness of the latent state representation, and robustness to varying degrees of non-stationarity.
    Robust Reinforcement Learning using Offline Data. (arXiv:2208.05129v2 [cs.LG] UPDATED)
    The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v2 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail both on a theoretical examination of a toy example and on a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
    CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection. (arXiv:2210.09267v2 [cs.CV] UPDATED)
    Robust 3D object detection is critical for safe autonomous driving. Camera and radar sensors are synergistic as they capture complementary information and work well under different environmental conditions. Fusing camera and radar data is challenging, however, as each of the sensors lacks information along a perpendicular axis, that is, depth is unknown to camera and elevation is unknown to radar. We propose the camera-radar matching network CramNet, an efficient approach to fuse the sensor readings from camera and radar in a joint 3D space. To leverage radar range measurements for better camera depth predictions, we propose a novel ray-constrained cross-attention mechanism that resolves the ambiguity in the geometric correspondences between camera features and radar features. Our method supports training with sensor modality dropout, which leads to robust 3D object detection, even when a camera or radar sensor suddenly malfunctions on a vehicle. We demonstrate the effectiveness of our fusion approach through extensive experiments on the RADIATE dataset, one of the few large-scale datasets that provide radar radio frequency imagery. A camera-only variant of our method achieves competitive performance in monocular 3D object detection on the Waymo Open Dataset.
    Nonmyopic Multiclass Active Search with Diminishing Returns for Diverse Discovery. (arXiv:2202.03593v2 [cs.LG] UPDATED)
    Active search is a setting in adaptive experimental design where we aim to uncover members of rare, valuable class(es) subject to a budget constraint. An important consideration in this problem is diversity among the discovered targets -- in many applications, diverse discoveries offer more insight and may be preferable in downstream tasks. However, most existing active search policies either assume that all targets belong to a common positive class or encourage diversity via simple heuristics. We present a novel formulation of active search with multiple target classes, characterized by a utility function chosen from a flexible family whose members encourage diversity via a diminishing returns mechanism. We then study this problem under the Bayesian lens and prove a hardness result for approximating the optimal policy for arbitrary positive, increasing, and concave utility functions. Finally, we design an efficient, nonmyopic approximation to the optimal policy for this class of utilities and demonstrate its superior empirical performance in a variety of settings, including drug discovery.
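    One concrete member of such a flexible utility family (our illustrative choice of concave function, not necessarily the paper's) applies a square root to the per-class discovery counts, so the tenth hit in one class is worth far less than the first hit in a new class:

        import numpy as np

        # A concave g applied per class yields diminishing returns, so
        # discoveries spread across classes score higher than a pile-up
        # in one class. The choice g = sqrt is an assumption for illustration.
        def diverse_utility(class_counts, g=np.sqrt):
            """Utility of discoveries; concave g => diminishing returns."""
            return sum(g(c) for c in class_counts)

        print(diverse_utility([10, 0, 0, 0, 0]))  # ~3.16: all hits in one class
        print(diverse_utility([2, 2, 2, 2, 2]))   # ~7.07: same total, spread out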
    Unsupervised Cross-Task Generalization via Retrieval Augmentation. (arXiv:2204.07937v2 [cs.CL] UPDATED)
    Humans can perform unseen tasks by recalling relevant skills acquired previously and then generalizing them to the target tasks, even if there is no supervision at all. In this paper, we aim to improve this kind of cross-task generalization ability of massive multi-task language models, such as T0 and FLAN, in an unsupervised setting. We propose a retrieval-augmentation method named ReCross that takes a few unlabelled examples as queries to retrieve a small subset of upstream data and uses them to update the multi-task model for better generalization. ReCross is a straightforward yet effective retrieval method that combines both efficient dense retrieval and effective pair-wise reranking. Our results and analysis show that it significantly outperforms both non-retrieval methods and other baseline methods.
    Shallow and Deep Nonparametric Convolutions for Gaussian Processes. (arXiv:2206.08972v2 [stat.ML] UPDATED)
    A key challenge in the practical application of Gaussian processes (GPs) is selecting a proper covariance function. The moving average, or process convolutions, construction of GPs allows some additional flexibility, but still requires choosing a proper smoothing kernel, which is non-trivial. Previous approaches have built covariance functions by using GP priors over the smoothing kernel, and by extension the covariance, as a way to bypass the need to specify it in advance. However, such models have been limited in several ways: they are restricted to single-dimensional inputs, e.g. time; they only allow modelling of single outputs; and they do not scale to large datasets, since inference is not straightforward. In this paper, we introduce a nonparametric process convolution formulation for GPs that alleviates these weaknesses by using a functional sampling approach based on Matheron's rule to perform fast sampling using interdomain inducing variables. Furthermore, we propose a composition of these nonparametric convolutions that serves as an alternative to classic deep GP models, and allows the covariance functions of the intermediate layers to be inferred from the data. We test the performance of our model on benchmarks for single output GPs, multiple output GPs and deep GPs and find that our approach can provide improvements over standard GP models, particularly for larger datasets.
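    The pathwise sampling step based on Matheron's rule can be written in a few lines for vanilla GP regression; the sketch below uses an RBF kernel and exact solves rather than the paper's interdomain inducing variables, purely to show the update rule f* | y = f* + K_{*n}(K_{nn} + s^2 I)^{-1}(y - f_n - eps).

        import numpy as np

        def rbf(x, z, ell=0.5):
            return np.exp(-0.5 * (x[:, None] - z[None, :]) ** 2 / ell ** 2)

        rng = np.random.default_rng(2)
        xn = np.linspace(0, 1, 8)
        y = np.sin(6 * xn) + 0.1 * rng.standard_normal(8)   # toy observations
        xs = np.linspace(0, 1, 100)                          # test locations
        s2 = 0.01                                            # noise variance

        # One joint prior sample over test and training locations.
        xall = np.concatenate([xs, xn])
        K = rbf(xall, xall) + 1e-6 * np.eye(xall.size)       # jitter for stability
        f = np.linalg.cholesky(K) @ rng.standard_normal(xall.size)
        fs, fn = f[:100], f[100:]
        eps = np.sqrt(s2) * rng.standard_normal(8)           # sampled noise

        # Matheron's rule: update the prior sample with the data residual.
        alpha = np.linalg.solve(rbf(xn, xn) + s2 * np.eye(8), y - fn - eps)
        post_sample = fs + rbf(xs, xn) @ alpha               # pathwise posterior sample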
    Learning to simulate realistic limit order book markets from data as a World Agent. (arXiv:2210.09897v1 [q-fin.TR])
    Multi-agent market simulators usually require careful calibration to emulate real markets, which includes the number and the type of agents. Poorly calibrated simulators can lead to misleading conclusions, potentially causing severe loss when employed by investment banks, hedge funds, and traders to study and evaluate trading strategies. In this paper, we propose a world model simulator that accurately emulates a limit order book market -- it requires no agent calibration but rather learns the simulated market behavior directly from historical data. Traditional approaches fall short of learning and calibrating the trader population, as historical labeled data with details on each individual trader's strategy is not publicly available. Our approach instead learns a unique "world" agent from historical data, intended to emulate the overall trader population without the need to make assumptions about individual market agent strategies. We implement our world agent simulator models as a Conditional Generative Adversarial Network (CGAN), as well as a mixture of parametric distributions, and we compare our models against previous work. Qualitatively and quantitatively, we show that the proposed approaches consistently outperform previous work, providing more realism and responsiveness.
    Scaling Laws Under the Microscope: Predicting Transformer Performance from Small Scale Experiments. (arXiv:2202.06387v2 [cs.CL] UPDATED)
    Neural scaling laws define a predictable relationship between a model's parameter count and its performance after training in the form of a power law. However, most research to date has not explicitly investigated whether scaling laws can be used to accelerate model development. In this work, we perform such an empirical investigation, starting from models with as few as 10K parameters, and evaluate downstream performance across 9 language understanding tasks. We find that scaling laws emerge at finetuning time in some NLP tasks, and that they can also be exploited for debugging convergence when training large models. Moreover, for tasks where scaling laws exist, they can be used to predict the performance of larger models, which enables effective model selection. However, revealing scaling laws requires careful hyperparameter tuning and multiple runs for the purpose of uncertainty estimation, which incurs additional overhead, partially offsetting the computational benefits.
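    A hedged sketch of the extrapolation step: fit a saturating power law to losses from small models and predict a larger one. All numbers below are made up for illustration.

        import numpy as np
        from scipy.optimize import curve_fit

        # Fit L(N) = a * N^{-b} + c to small-scale runs, then extrapolate.
        def power_law(n, a, b, c):
            return a * n ** (-b) + c

        n_params = np.array([1e4, 3e4, 1e5, 3e5, 1e6])   # hypothetical model sizes
        loss     = np.array([3.2, 2.9, 2.55, 2.3, 2.1])  # hypothetical eval losses

        (a, b, c), _ = curve_fit(power_law, n_params, loss, p0=(10.0, 0.2, 1.0))
        print(f"predicted loss at 1e8 params: {power_law(1e8, a, b, c):.3f}")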
    Clustering Categorical Data: Soft Rounding k-modes. (arXiv:2210.09640v1 [cs.LG])
    Over the last three decades, researchers have intensively explored various clustering tools for categorical data analysis. Despite the proposal of various clustering algorithms, the classical k-modes algorithm remains a popular choice for unsupervised learning of categorical data. Surprisingly, our first insight is that in a natural generative block model, the k-modes algorithm performs poorly for a large range of parameters. We remedy this issue by proposing a soft rounding variant of the k-modes algorithm (SoftModes) and theoretically prove that our variant addresses the drawbacks of the k-modes algorithm in the generative model. Finally, we empirically verify that SoftModes performs well on both synthetic and real-world datasets.
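    The sketch below shows plain k-modes with one possible softened mode update: sampling each mode attribute from the empirical category frequencies instead of hard-rounding to the majority. This sampling rule is our own guess at the flavor of soft rounding; consult the paper for the actual SoftModes update.

        import numpy as np

        def soft_kmodes(X, k, iters=20, temp=1.0, seed=0):
            """k-modes with a softened (sampled) mode update; temp -> 0 recovers argmax."""
            rng = np.random.default_rng(seed)
            n, d = X.shape
            modes = X[rng.choice(n, k, replace=False)].copy()
            for _ in range(iters):
                dist = (X[:, None, :] != modes[None, :, :]).sum(axis=2)  # Hamming distance
                labels = dist.argmin(axis=1)
                for j in range(k):
                    members = X[labels == j]
                    if len(members) == 0:
                        continue
                    for a in range(d):
                        vals, counts = np.unique(members[:, a], return_counts=True)
                        p = counts ** (1.0 / temp)
                        modes[j, a] = rng.choice(vals, p=p / p.sum())    # soft, not argmax
            return labels, modes

        X = np.random.default_rng(3).integers(0, 4, size=(200, 6))        # toy categorical data
        labels, modes = soft_kmodes(X, k=3)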
    Agglomerative Hierarchical Clustering with Dynamic Time Warping for Household Load Curve Clustering. (arXiv:2210.09523v1 [cs.LG])
    Energy companies often implement various demand response (DR) programs to better match electricity demand and supply by offering the consumers incentives to reduce their demand during critical periods. Classifying clients according to their consumption patterns enables targeting specific groups of consumers for DR. Traditional clustering algorithms use standard distance measures to quantify the dissimilarity between two points, and the results produced by algorithms such as K-means, K-medoids, and Gaussian Mixture Models depend on the clustering parameters or initial clusters. In contrast, our methodology uses a shape-based approach that combines Agglomerative Hierarchical Clustering (AHC) with Dynamic Time Warping (DTW) to classify residential households' daily load curves based on their consumption patterns. While DTW seeks the optimal alignment between two load curves, AHC provides realistic initial cluster centers. In this paper, we compare the results with other clustering algorithms such as K-means, K-medoids, and GMM using different distance measures, and we show that AHC using DTW outperformed the other clustering algorithms while needing fewer clusters.
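    The DTW dissimilarity at the core of this pipeline is the classic dynamic program below; the resulting pairwise distances can then be fed to agglomerative clustering (e.g., via scipy's linkage on a condensed distance matrix).

        import numpy as np

        def dtw(a, b):
            """Standard DTW recurrence between two 1-D series."""
            n, m = len(a), len(b)
            D = np.full((n + 1, m + 1), np.inf)
            D[0, 0] = 0.0
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    cost = abs(a[i - 1] - b[j - 1])
                    D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
            return D[n, m]

        load_a = np.sin(np.linspace(0, 2 * np.pi, 24))         # toy 24-hour load curve
        load_b = np.sin(np.linspace(0, 2 * np.pi, 24) + 0.5)   # same shape, time-shifted
        print(dtw(load_a, load_b))  # small: DTW aligns the shift that Euclidean distance penalizes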
    A Hybrid System of Sound Event Detection Transformer and Frame-wise Model for DCASE 2022 Task 4. (arXiv:2210.09529v1 [cs.SD])
    In this paper, we describe in detail our system for DCASE 2022 Task 4. The system combines two considerably different models: an end-to-end Sound Event Detection Transformer (SEDT) and a frame-wise model, the Metric Learning and Focal Loss CNN (MLFL-CNN). The former is an event-wise model which learns event-level representations and predicts sound event categories and boundaries directly, while the latter is based on the widely adopted frame-classification scheme, under which each frame is classified into event categories and event boundaries are obtained by post-processing such as thresholding and smoothing. For SEDT, self-supervised pre-training using unlabeled data is applied, and semi-supervised learning is adopted by using an online teacher, which is updated from the student model using the Exponential Moving Average (EMA) strategy and generates reliable pseudo labels for weakly-labeled and unlabeled data. For the frame-wise model, the ICT-TOSHIBA system of DCASE 2021 Task 4 is used. Experimental results show that the hybrid system considerably outperforms either individual model and achieves PSDS1 of 0.420 and PSDS2 of 0.783 on the validation set without external data. The code is available at https://github.com/965694547/Hybrid-system-of-frame-wise-model-and-SEDT.
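    The EMA teacher update behind the mean-teacher scheme is compact enough to quote; here is a generic PyTorch sketch, where teacher and student are placeholder modules of identical architecture.

        import torch

        @torch.no_grad()
        def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, decay=0.999):
            """Teacher weights track an exponential moving average of the student's."""
            for t_param, s_param in zip(teacher.parameters(), student.parameters()):
                t_param.mul_(decay).add_(s_param, alpha=1.0 - decay)

        # After each optimizer step on the student:
        #   ema_update(teacher, student)
        # The teacher then produces pseudo labels for weakly-labeled and unlabeled clips.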
    Making Split Learning Resilient to Label Leakage by Potential Energy Loss. (arXiv:2210.09617v1 [cs.CR])
    As a practical privacy-preserving learning method, split learning has drawn much attention in academia and industry. However, its security is constantly being questioned since the intermediate results are shared during training and inference. In this paper, we focus on the privacy leakage problem caused by the trained split model, i.e., the attacker can use a few labeled samples to fine-tune the bottom model and obtain quite good performance. To prevent this kind of privacy leakage, we propose the potential energy loss, which makes the output of the bottom model follow a more `complicated' distribution by pushing outputs of the same class towards the decision boundary. As a result, the adversary suffers a large generalization error when fine-tuning the bottom model with only a few leaked labeled samples. Experimental results show that our method significantly lowers the attacker's fine-tuning accuracy, making the split model more resilient to label leakage.
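    One plausible reading of a potential-energy-style objective is an inverse-distance repulsion among same-class outputs, which spreads each class toward the decision boundary; the paper's exact formulation may differ, so treat the sketch below as a hedged illustration of the mechanism only.

        import torch

        def potential_energy_loss(z, y, eps=1e-6):
            """Coulomb-like repulsion between same-class outputs of the bottom model."""
            loss = z.new_zeros(())
            for c in y.unique():
                zc = z[y == c]
                if zc.shape[0] < 2:
                    continue
                d = torch.cdist(zc, zc) + eps                 # pairwise distances
                mask = ~torch.eye(zc.shape[0], dtype=torch.bool, device=z.device)
                loss = loss + (1.0 / d[mask]).mean()          # inverse-distance potential
            return loss

        z = torch.randn(32, 16, requires_grad=True)           # stand-in bottom-model outputs
        y = torch.randint(0, 4, (32,))
        potential_energy_loss(z, y).backward()                # added to the task loss in training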
    A Human-ML Collaboration Framework for Improving Video Content Reviews. (arXiv:2210.09500v1 [cs.LG])
    We deal with the problem of localized in-video taxonomic human annotation in the video content moderation domain, where the goal is to identify video segments that violate granular policies, e.g., community guidelines on an online video platform. High-quality human labeling is critical for enforcement in content moderation. This is challenging due to the problem of information overload: raters need to apply a large taxonomy of granular policy violations with ambiguous definitions to relatively long videos, within a limited review duration. Our key contribution is a novel human-machine learning (ML) collaboration framework aimed at maximizing the quality and efficiency of human decisions in this setting: human labels are used to train segment-level models, whose predictions are displayed as "hints" to human raters, indicating probable regions of the video with specific policy violations. The human-verified/corrected segment labels can help refine the model further, creating a human-ML positive feedback loop. Experiments show improved human video moderation decision quality and improved efficiency through more granular annotations submitted within a similar review duration, which enables a 5-8% AUC improvement in the hint generation models.
    Hybrid Bayesian network discovery with latent variables by scoring multiple interventions. (arXiv:2112.10574v2 [cs.LG] UPDATED)
    In Bayesian Networks (BNs), the direction of edges is crucial for causal reasoning and inference. However, Markov equivalence class considerations mean it is not always possible to establish edge orientations, which is why many BN structure learning algorithms cannot orient all edges from purely observational data. Moreover, latent confounders can lead to false positive edges. Relatively few methods have been proposed to address these issues. In this work, we present the hybrid mFGS-BS (majority rule and Fast Greedy equivalence Search with Bayesian Scoring) algorithm for structure learning from discrete data that involves an observational data set and one or more interventional data sets. The algorithm assumes causal insufficiency in the presence of latent variables and produces a Partial Ancestral Graph (PAG). Structure learning relies on a hybrid approach and a novel Bayesian scoring paradigm that calculates the posterior probability of each directed edge being added to the learnt graph. Experimental results based on well-known networks of up to 109 variables and 10k sample size show that mFGS-BS improves structure learning accuracy relative to the state-of-the-art and it is computationally efficient.
    ODEs learn to walk: ODE-Net based data-driven modeling for crowd dynamics. (arXiv:2210.09602v1 [cs.LG])
    Predicting the behaviors of pedestrian crowds is of critical importance for a variety of real-world problems. Data-driven modeling, which aims to learn mathematical models from observed data, is a promising tool for constructing models that can make accurate predictions of such systems. In this work, we present a data-driven modeling approach based on the ODE-Net framework for constructing continuous-time models of crowd dynamics. We discuss some challenging issues in applying the ODE-Net method to such problems, which are primarily associated with the dimensionality of the underlying crowd system, and we propose to address these issues by incorporating the social-force concept into the ODE-Net framework. Finally, application examples are provided to demonstrate the performance of the proposed method.
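    As a minimal illustration of the social-force concept incorporated into the ODE-Net, the sketch below integrates a classic social-force model with forward Euler; the parameter values are illustrative and not the paper's learned dynamics.

        import numpy as np

        def social_force_step(pos, vel, goal, dt=0.05, tau=0.5, A=2.0, B=0.3, v0=1.3):
            """One Euler step: relax toward a desired velocity, repel from neighbors."""
            to_goal = goal - pos
            desired = v0 * to_goal / (np.linalg.norm(to_goal, axis=1, keepdims=True) + 1e-9)
            force = (desired - vel) / tau                       # goal-seeking relaxation
            diff = pos[:, None, :] - pos[None, :, :]
            dist = np.linalg.norm(diff, axis=2) + np.eye(len(pos))  # avoid self-division
            repel = A * np.exp(-dist / B)[:, :, None] * diff / dist[:, :, None]
            force += repel.sum(axis=1)                          # pairwise repulsion
            vel = vel + dt * force
            return pos + dt * vel, vel

        pos = np.array([[0.0, 0.0], [0.2, 0.1], [4.0, 0.0]])    # three pedestrians
        vel = np.zeros_like(pos)
        goal = np.array([[5.0, 0.0]] * 3)
        for _ in range(100):
            pos, vel = social_force_step(pos, vel, goal)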
    Private Stochastic Optimization in the Presence of Outliers: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses. (arXiv:2209.07403v2 [cs.LG] UPDATED)
    We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are (possibly) not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it is often unrealistic: in many practical problems, the loss function may not be uniformly Lipschitz. Even when the loss function is Lipschitz continuous, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work does not require the loss function to be uniformly Lipschitz. Instead, building on a recent line of work [WXDX20, KLZ22], we make the weaker assumption that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to the prior works, our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
    Small Transformers Compute Universal Metric Embeddings. (arXiv:2209.06788v2 [cs.LG] UPDATED)
    We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures with a transport metric (Delon and Desolneux 2020). We derive embedding guarantees for feature maps implemented by small neural networks called \emph{probabilistic transformers}. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-H\"{o}lder embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If $\mathcal{X}$'s geometry is sufficiently regular, we obtain stronger, bi-Lipschitz guarantees for all points in the dataset. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers can compute bi-H\"{o}lder embeddings with arbitrarily small distortion.
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v2 [cs.SI] UPDATED)
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.
    On the Importance of Architectures and Hyperparameters for Fairness in Face Recognition. (arXiv:2210.09943v1 [cs.CV])
    Face recognition systems are deployed across the world by government agencies and contractors for sensitive and impactful tasks, such as surveillance and database matching. Despite their widespread use, these systems are known to exhibit bias across a range of sociodemographic dimensions, such as gender and race. Nonetheless, an array of works proposing pre-processing, training, and post-processing methods have failed to close these gaps. Here, we take a very different approach to this problem, identifying that both architectures and hyperparameters of neural networks are instrumental in reducing bias. We first run a large-scale analysis of the impact of architectures and training hyperparameters on several common fairness metrics and show that the implicit convention of choosing high-accuracy architectures may be suboptimal for fairness. Motivated by our findings, we run the first neural architecture search for fairness, jointly with a search for hyperparameters. We output a suite of models which Pareto-dominate all other competitive architectures in terms of accuracy and fairness. Furthermore, we show that these models transfer well to other face recognition datasets with similar and distinct protected attributes. We release our code and raw result files so that researchers and practitioners can replace our fairness metrics with a bias measure of their choice.
    Diversity Preference-Aware Link Recommendation for Online Social Networks. (arXiv:2205.10689v2 [cs.LG] UPDATED)
    Link recommendation, which recommends links to connect unlinked online social network users, is a fundamental social network analytics problem with ample business implications. Existing link recommendation methods tend to recommend similar friends to a user but overlook the user's diversity preference, although social psychology theories suggest the criticality of diversity preference to link recommendation performance. In recommender systems, a field related to link recommendation, a number of diversification methods have been proposed to improve the diversity of recommended items. Nevertheless, diversity preference is distinct from diversity studied by diversification methods. To address these research gaps, we define and operationalize the concept of diversity preference for link recommendation and propose a new link recommendation problem: the diversity preference-aware link recommendation problem. We then analyze key properties of the new link recommendation problem and develop a novel link recommendation method to solve the problem. Using two large-scale online social network data sets, we conduct extensive empirical evaluations to demonstrate the superior performance of our method over representative diversification methods adapted for link recommendation as well as state-of-the-art link recommendation methods.
    Anti-Symmetric DGN: a stable architecture for Deep Graph Networks. (arXiv:2210.09789v1 [cs.LG])
    Deep Graph Networks (DGNs) currently dominate the research landscape of learning from graphs, due to their efficiency and ability to implement an adaptive message-passing scheme between the nodes. However, DGNs are typically limited in their ability to propagate and preserve long-term dependencies between nodes, i.e., they suffer from the over-squashing phenomenon. This reduces their effectiveness, since predictive problems may require capturing interactions at different, and possibly large, radii in order to be effectively solved. In this work, we present Anti-Symmetric Deep Graph Networks (A-DGNs), a framework for stable and non-dissipative DGN design, conceived through the lens of ordinary differential equations. We give theoretical proof that our method is stable and non-dissipative, leading to two key results: long-range information between nodes is preserved, and no gradient vanishing or explosion occurs in training. We empirically validate the proposed approach on several graph benchmarks, showing that A-DGN yields improved performance and enables effective learning even when dozens of layers are used.
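    A hedged sketch of an anti-symmetric layer: constraining the node-wise weight matrix to W - W^T - gamma*I keeps the real parts of its eigenvalues non-positive, and the layer becomes one explicit Euler step of the resulting ODE. The mean-over-neighbors aggregation and dense adjacency below are simplifying assumptions, not the paper's exact design.

        import torch

        class AntiSymmetricLayer(torch.nn.Module):
            def __init__(self, dim, gamma=0.1, eps=0.1):
                super().__init__()
                self.W = torch.nn.Parameter(torch.randn(dim, dim) * 0.1)
                self.V = torch.nn.Parameter(torch.randn(dim, dim) * 0.1)  # neighbor mixing
                self.gamma, self.eps = gamma, eps

            def forward(self, x, adj):
                # Anti-symmetric minus a small diagonal shift: stable, non-dissipative ODE.
                A = self.W - self.W.T - self.gamma * torch.eye(x.shape[1])
                agg = adj @ x @ self.V.T / adj.sum(1, keepdim=True).clamp(min=1)
                return x + self.eps * torch.tanh(x @ A.T + agg)   # one Euler step

        x = torch.randn(5, 8)                      # 5 nodes, 8 features
        adj = (torch.rand(5, 5) < 0.4).float()     # toy dense adjacency
        out = AntiSymmetricLayer(8)(x, adj)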
    Recovering Private Text in Federated Learning of Language Models. (arXiv:2205.08514v2 [cs.CL] UPDATED)
    Federated learning allows distributed users to collaboratively train a model while keeping each user's data private. Recently, a growing body of work has demonstrated that an eavesdropping attacker can effectively recover image data from gradients transmitted during federated learning. However, little progress has been made in recovering text data. In this paper, we present a novel attack method FILM for federated learning of language models (LMs). For the first time, we show the feasibility of recovering text from large batch sizes of up to 128 sentences. Unlike image-recovery methods that are optimized to match gradients, we take a distinct approach that first identifies a set of words from gradients and then directly reconstructs sentences based on beam search and a prior-based reordering strategy. We conduct the FILM attack on several large-scale datasets and show that it can successfully reconstruct single sentences with high fidelity for large batch sizes and even multiple sentences if applied iteratively. We evaluate three defense methods: gradient pruning, DPSGD, and a simple approach to freeze word embeddings that we propose. We show that both gradient pruning and DPSGD lead to a significant drop in utility. However, if we fine-tune a public pre-trained LM on private text without updating word embeddings, it can effectively defend the attack with minimal data utility loss. Together, we hope that our results can encourage the community to rethink the privacy concerns of LM training and its standard practices in the future.
    Differentiable Parsing and Visual Grounding of Verbal Instructions for Object Placement. (arXiv:2210.00215v2 [cs.RO] UPDATED)
    Grounding spatial relations in natural language for object placement is subject to ambiguity and compositionality issues. To address these issues, we introduce ParaGon, a PARsing And visual GrOuNding framework for language-conditioned object placement. It parses language instructions into relations between objects and grounds those objects in visual scenes. A particle-based GNN then conducts relational reasoning between grounded objects for placement generation. ParaGon encodes all of these procedures into neural networks for end-to-end training, which avoids annotating parsing and object reference grounding labels. Our approach inherently integrates parsing-based methods into a probabilistic, data-driven framework. It is data-efficient and generalizable for learning compositional instructions, robust to noisy language inputs, and adaptive to the uncertainty of ambiguous instructions.
    Generating Natural Language Proofs with Verifier-Guided Search. (arXiv:2205.12443v2 [cs.CL] UPDATED)
    Deductive reasoning over natural language is a challenging problem in NLP. In this work, we focus on proof generation: Given a hypothesis and a set of supporting facts, the model generates a proof tree indicating how to deduce the hypothesis from supporting facts. Compared to generating the entire proof in one shot, stepwise generation can better exploit the compositionality and generalize to longer proofs but has achieved limited success on real-world data. Existing stepwise methods struggle to generate proof steps that are both logically valid and relevant to the hypothesis. Instead, they tend to hallucinate invalid steps given the hypothesis. In this paper, we present a novel stepwise method, NLProofS (Natural Language Proof Search), which learns to generate relevant steps conditioning on the hypothesis. At the core of our approach, we train an independent verifier to check the validity of the proof steps to prevent hallucination. Instead of generating steps greedily, we search for proofs maximizing a global proof score judged by the verifier. NLProofS achieves state-of-the-art performance on EntailmentBank and RuleTaker. Specifically, it improves the correctness of predicted proofs from 27.7% to 33.3% in the distractor setting of EntailmentBank, demonstrating the effectiveness of NLProofS in generating challenging human-authored proofs.
    Extensible Proxy for Efficient NAS. (arXiv:2210.09459v1 [cs.LG])
    Neural Architecture Search (NAS) has become a de facto approach in the recent trend of AutoML to design deep neural networks (DNNs). Efficient or near-zero-cost NAS proxies have been proposed to address the demanding computational issues of NAS, where each candidate architecture network only requires one iteration of backpropagation. The values obtained from the proxies are treated as predictions of architecture performance on downstream tasks. However, two significant drawbacks hinder the extended usage of efficient NAS proxies. (1) Efficient proxies are not adaptive to various search spaces. (2) Efficient proxies are not extensible to multi-modality downstream tasks. Based on these observations, we design an Extensible proxy (Eproxy) that utilizes self-supervised, few-shot training (i.e., 10 iterations of backpropagation) and yields near-zero costs. The key component that makes Eproxy efficient is an untrainable convolution layer termed the barrier layer, which adds non-linearities to the optimization space so that Eproxy can discriminate the performance of architectures at an early stage. Furthermore, to make Eproxy adaptive to different downstream tasks/search spaces, we propose a Discrete Proxy Search (DPS) to find the optimized training settings for Eproxy with only a handful of benchmarked architectures on the target tasks. Our extensive experiments confirm the effectiveness of both Eproxy and Eproxy+DPS. Code is available at https://github.com/leeyeehoo/GenNAS-Zero.
    Degradation-invariant Enhancement of Fundus Images via Pyramid Constraint Network. (arXiv:2210.09606v1 [eess.IV])
    As an economical and efficient fundus imaging modality, retinal fundus images have been widely adopted in clinical fundus examination. Unfortunately, fundus images often suffer from quality degradation caused by imaging interferences, leading to misdiagnosis. Despite the impressive enhancement performance that state-of-the-art methods have achieved, challenges remain in clinical scenarios. To facilitate the clinical deployment of fundus image enhancement, this paper proposes the pyramid constraint to develop a degradation-invariant enhancement network (PCE-Net), which mitigates the demand for clinical data and stably enhances unknown data. First, high-quality images are randomly degraded to form sequences of low-quality ones sharing the same content (SeqLCs). Then individual low-quality images are decomposed into Laplacian pyramid features (LPF) as the multi-level input for the enhancement. Subsequently, a feature pyramid constraint (FPC) for the sequence is introduced to enforce the PCE-Net to learn a degradation-invariant model. Extensive experiments have been conducted under the evaluation metrics of enhancement and segmentation. The effectiveness of PCE-Net was demonstrated in comparison with state-of-the-art methods and in an ablation study. The source code of this study is publicly available at https://github.com/HeverLaw/PCENet-Image-Enhancement.
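    The Laplacian pyramid features (LPF) used as multi-level input can be built with standard image pyramids; the sketch below uses OpenCV, with a random array standing in for an actual fundus image.

        import numpy as np
        import cv2

        def laplacian_pyramid(img, levels=3):
            """Each level keeps the band-pass residual between Gaussian levels."""
            pyramid, current = [], img.astype(np.float32)
            for _ in range(levels):
                down = cv2.pyrDown(current)
                up = cv2.pyrUp(down, dstsize=(current.shape[1], current.shape[0]))
                pyramid.append(current - up)        # band-pass residual
                current = down
            pyramid.append(current)                  # low-frequency base
            return pyramid

        img = np.random.default_rng(6).random((256, 256, 3)).astype(np.float32)  # stand-in image
        lpf = laplacian_pyramid(img)                 # multi-level input for the network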
    Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization. (arXiv:1905.06635v2 [cs.LG] UPDATED)
    Solving for adversarial examples with projected gradient descent has been demonstrated to be highly effective in fooling neural-network-based classifiers. However, in the black-box setting, the attacker is limited to query access to the network, and solving for a successful adversarial example becomes much more difficult. To this end, recent methods aim at estimating the true gradient signal based on the input queries, but at the cost of excessive queries. We propose an efficient discrete surrogate to the optimization problem which does not require estimating the gradient and consequently becomes free of first-order update hyperparameters to tune. Our experiments on Cifar-10 and ImageNet show state-of-the-art black-box attack performance with a significant reduction in the required queries compared to a number of recently proposed methods. The source code is available at https://github.com/snu-mllab/parsimonious-blackbox-attack.
    The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models. (arXiv:2203.07259v3 [cs.CL] UPDATED)
    Transformer-based language models have become a key building block for natural language processing. While these models are extremely accurate, they can be too large and computationally intensive to run on standard deployments. A variety of compression methods, including distillation, quantization, structured and unstructured pruning are known to decrease model size and increase inference speed, with low accuracy loss. In this context, this paper's contributions are two-fold. We perform an in-depth study of the accuracy-compression trade-off for unstructured weight pruning of BERT models. We introduce Optimal BERT Surgeon (oBERT), an efficient and accurate weight pruning method based on approximate second-order information, which we show to yield state-of-the-art results in both stages of language tasks: pre-training and fine-tuning. Specifically, oBERT extends existing work on unstructured second-order pruning by allowing for pruning blocks of weights, and by being applicable at the BERT scale. Second, we investigate the impact of this pruning method when compounding compression approaches to obtain highly compressed but accurate models for deployment on edge devices. These models significantly push boundaries of the current state-of-the-art sparse BERT models with respect to all metrics: model size, inference speed and task accuracy. For example, relative to the dense BERT-base, we obtain 10x model size compression (in MB) with < 1% accuracy drop, 10x CPU-inference speedup with < 2% accuracy drop, and 29x CPU-inference speedup with < 7.5% accuracy drop. Our code, fully integrated with Transformers and SparseML, is available at https://github.com/neuralmagic/sparseml/tree/main/research/optimal_BERT_surgeon_oBERT.
    Online Agnostic Multiclass Boosting. (arXiv:2205.15113v2 [cs.LG] UPDATED)
    Boosting is a fundamental approach in machine learning that enjoys both strong theoretical and practical guarantees. At a high level, boosting algorithms cleverly aggregate weak learners to generate predictions with arbitrarily high accuracy. In this way, boosting algorithms convert weak learners into strong ones. Recently, Brukhim et al. extended boosting to the online agnostic binary classification setting. A key ingredient in their approach is a clean and simple reduction to online convex optimization, one that efficiently converts an arbitrary online convex optimizer to an agnostic online booster. In this work, we extend this reduction to multiclass problems and give the first boosting algorithm for online agnostic multiclass classification. Our reduction also enables the construction of algorithms for statistical agnostic, online realizable, and statistical realizable multiclass boosting.
    Domain-Specific Risk Minimization for Out-of-Distribution Generalization. (arXiv:2208.08661v3 [cs.LG] UPDATED)
    Recent domain generalization (DG) approaches typically use the hypothesis learned on source domains for inference on the unseen target domain. However, such a hypothesis can be arbitrarily far from the optimal one for the target domain, a gap termed the ``adaptivity gap''. Without exploiting the domain information from the unseen test samples, adaptivity gap estimation and minimization are intractable, which prevents us from robustifying a model to any unknown distribution. In this paper, we first establish a generalization bound that explicitly considers the adaptivity gap. Our bound motivates two strategies to reduce the gap: the first is ensembling multiple classifiers to enrich the hypothesis space, for which we propose effective gap estimation methods to guide the selection of a better hypothesis for the target. The other is minimizing the gap directly by adapting model parameters using online target samples. We thus propose \textbf{Domain-specific Risk Minimization (DRM)}. During training, DRM models the distributions of different source domains separately; for inference, DRM performs online model steering using the source hypotheses for each arriving target sample. Extensive experiments demonstrate the effectiveness of the proposed DRM for domain generalization, with the following advantages: 1) it significantly outperforms competitive baselines on different distributional shift settings; 2) it achieves either comparable or superior accuracies on all source domains compared to vanilla empirical risk minimization; 3) it remains simple and efficient during training; and 4) it is complementary to invariant learning approaches.
    Cyclical Variational Bayes Monte Carlo for Efficient Multi-Modal Posterior Distributions Evaluation. (arXiv:2202.11645v2 [stat.CO] UPDATED)
    Multimodal distributions of physics-based model parameters are often encountered in engineering, for instance due to changes in environmental conditions or the presence of certain types of damage and nonlinearity. In statistical model updating, for locally identifiable parameters, it can be anticipated that multi-modal posterior distributions will be found. The full characterization of these multi-modal distributions is important, as methodologies for structural condition monitoring are frequently based on the comparison of damaged and healthy models of the structure. Characterizing posterior multi-modal distributions using state-of-the-art sampling techniques would require a large number of simulations of expensive-to-run physics-based models. Therefore, when only a limited number of simulations can be run, as often occurs in engineering, traditional sampling techniques are not able to capture the multimodal distributions accurately. This can potentially lead to large numerical errors when assessing the performance of an engineering structure under uncertainty.
    Sampling and Update Frequencies in Proximal Variance-Reduced Stochastic Gradient Methods. (arXiv:2002.05545v3 [math.OC] UPDATED)
    Variance-reduced stochastic gradient methods have gained popularity in recent times. Several variants exist with different strategies for the storing and sampling of gradients and this work concerns the interactions between these two aspects. We present a general proximal variance-reduced gradient method and analyze it under strong convexity assumptions. Special cases of the algorithm include SAGA, L-SVRG and their proximal variants. Our analysis sheds light on epoch-length selection and the need to balance the convergence of the iterates with how often gradients are stored. The analysis improves on other convergence rates found in the literature and produces a new and faster converging sampling strategy for SAGA. Problem instances for which the predicted rates are the same as the practical rates are presented together with problems based on real world data.
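    SAGA, one of the special cases analyzed, stores one gradient per data point and corrects each stochastic step with the stored table; a minimal least-squares instance (our illustrative problem, not from the paper) looks like this:

        import numpy as np

        rng = np.random.default_rng(4)
        n, d, lam, step = 100, 10, 0.1, 0.01
        X = rng.standard_normal((n, d))
        y = X @ rng.standard_normal(d)                       # synthetic regression targets

        def grad_i(w, i):
            """Gradient of the i-th ridge-regularized least-squares term."""
            return (X[i] @ w - y[i]) * X[i] + lam * w

        w = np.zeros(d)
        table = np.array([grad_i(w, i) for i in range(n)])   # one stored gradient per point
        avg = table.mean(axis=0)
        for _ in range(20000):
            i = rng.integers(n)
            g = grad_i(w, i)
            w -= step * (g - table[i] + avg)                  # variance-reduced step
            avg += (g - table[i]) / n                         # keep the running average exact
            table[i] = g                                      # refresh the stored gradient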
    Out of Distribution Reasoning by Weakly-Supervised Disentangled Logic Variational Autoencoder. (arXiv:2210.09959v1 [cs.LG])
    Out-of-distribution (OOD) detection, i.e., finding test samples derived from a different distribution than the training set, as well as reasoning about such samples (OOD reasoning), are necessary to ensure the safety of results generated by machine learning models. Recently there have been promising results for OOD detection in the latent space of variational autoencoders (VAEs). However, without disentanglement, VAEs cannot perform OOD reasoning. Disentanglement ensures a one-to-many mapping between generative factors of OOD (e.g., rain in image data) and the latent variables to which they are encoded. Previous literature has focused on weakly-supervised disentanglement on simple datasets with known and independent generative factors; in practice, achieving full disentanglement through weak supervision is impossible for complex datasets, such as Carla, with unknown and abstract generative factors. As a result, we propose an OOD reasoning framework that learns a partially disentangled VAE to reason about complex datasets. Our framework consists of three steps: partitioning data based on observed generative factors, training a VAE as a logic tensor network that satisfies disentanglement rules, and run-time OOD reasoning. We evaluate our approach on the Carla dataset and compare the results against three state-of-the-art methods. We found that our framework outperformed these methods in terms of disentanglement and end-to-end OOD reasoning.
    EDLaaS: Fully Homomorphic Encryption Over Neural Network Graphs for Vision and Private Strawberry Yield Forecasting. (arXiv:2110.13638v2 [cs.LG] UPDATED)
    We present automatically parameterised Fully Homomorphic Encryption (FHE) for encrypted neural network inference and exemplify our inference over FHE-compatible neural networks with our own open-source framework and reproducible examples. We use the 4th generation Cheon, Kim, Kim and Song (CKKS) FHE scheme over fixed points provided by the Microsoft Simple Encrypted Arithmetic Library (MS-SEAL). We significantly enhance the usability and applicability of FHE in deep learning contexts, with a focus on the constituent graphs, traversal, and optimisation. We find that FHE is not a panacea for all privacy preserving machine learning (PPML) problems, and that certain limitations still remain, such as model training. However we also find that in certain contexts FHE is well suited for computing completely private predictions with neural networks. The ability to privately compute sensitive problems more easily, while lowering the barriers to entry, can allow otherwise too-sensitive fields to begin availing themselves of performant third-party neural networks. Lastly, we show how encrypted deep learning can be applied to a sensitive real-world problem in agri-food, i.e. strawberry yield forecasting, demonstrating competitive performance. We argue that adopting encrypted deep learning methods at scale could enable greater adoption of deep learning methodologies where privacy concerns exist, hence having a large positive potential impact within the agri-food sector and its journey to net zero.
    LiteTransformerSearch: Training-free Neural Architecture Search for Efficient Language Models. (arXiv:2203.02094v2 [cs.LG] UPDATED)
    The Transformer architecture is ubiquitously used as the building block of large-scale autoregressive language models. However, finding architectures with the optimal trade-off between task performance (perplexity) and hardware constraints like peak memory utilization and latency is non-trivial. This is exacerbated by the proliferation of various hardware. We leverage the somewhat surprising empirical observation that the number of decoder parameters in autoregressive Transformers has a high rank correlation with task performance, irrespective of the architecture topology. This observation organically induces a simple Neural Architecture Search (NAS) algorithm that uses decoder parameters as a proxy for perplexity without the need for any model training. The search phase of our training-free algorithm, dubbed Lightweight Transformer Search (LTS), can be run directly on target devices since it does not require GPUs. Using on-target-device measurements, LTS extracts the Pareto-frontier of perplexity versus any hardware performance cost. We evaluate LTS on diverse devices from ARM CPUs to NVIDIA GPUs and two popular autoregressive Transformer backbones: GPT-2 and Transformer-XL. Results show that the perplexity of the 16-layer GPT-2 and Transformer-XL can be matched with up to 1.5x and 2.5x faster runtime, respectively, and 1.2x and 2.0x lower peak memory utilization. When evaluated in zero and one-shot settings, LTS Pareto-frontier models achieve higher average accuracy compared to the 350M parameter OPT across 14 tasks, with up to 1.6x lower latency. LTS extracts the Pareto-frontier in under 3 hours while running on a commodity laptop. We effectively remove the carbon footprint of hundreds of GPU hours of training during search, offering a strong simple baseline for future NAS methods in autoregressive language modeling.
    A data-driven approach for the closure of RANS models by the divergence of the Reynolds Stress Tensor. (arXiv:2203.16944v3 [physics.flu-dyn] UPDATED)
    In the present paper, a new data-driven model is proposed to close the RANS equations and increase their accuracy. The divergence of the Reynolds Stress Tensor (RST) is obtained through a Neural Network (NN) whose architecture and input choice guarantee both Galilean invariance and invariance under coordinate-frame rotation. The former derives from the input choice of the NN, while the latter derives from expanding the divergence of the RST in a vector basis. This approach has been widely used for data-driven models of the anisotropic RST or the RST discrepancies, and it is here proposed for the divergence of the RST. Hence, a constitutive relation for the divergence of the RST in terms of mean quantities is proposed to obtain such an expansion. Moreover, once the proposed data-driven approach is trained, there is no need to run any classic turbulence model to close the equations. The well-known test cases of flow in a square duct and over periodic hills are used to show the advantages of the present method compared to standard turbulence models.
    The Tail Wagging the Dog: Dataset Construction Biases of Social Bias Benchmarks. (arXiv:2210.10040v1 [cs.CL])
    How reliably can we trust the scores obtained from social bias benchmarks as faithful indicators of problematic social biases in a given language model? In this work, we study this question by contrasting social biases with non-social biases stemming from choices made during dataset construction that might not even be discernible to the human eye. To do so, we empirically simulate various alternative constructions for a given benchmark based on innocuous modifications (such as paraphrasing or random-sampling) that maintain the essence of their social bias. On two well-known social bias benchmarks (Winogender and BiasNLI) we observe that these shallow modifications have a surprising effect on the resulting degree of bias across various models. We hope these troubling observations motivate more robust measures of social biases.
    Short-term Load Forecasting with Distributed Long Short-Term Memory. (arXiv:2208.01147v2 [cs.LG] UPDATED)
    With the deployment of smart meters, massive data on consumer behaviour can be collected by retailers. From the collected data, the retailers may obtain household profile information and implement demand response. While retailers prefer to acquire a model that is as accurate as possible across different customers, there are two major challenges. First, different retailers in the retail market do not share their consumers' electricity consumption data, as these data are regarded as their assets, which leads to the problem of data islands. Second, the electricity load data are highly heterogeneous, since different retailers may serve various consumers. To this end, a fully distributed short-term load forecasting framework based on a consensus algorithm and Long Short-Term Memory (LSTM) is proposed, which can protect the customers' privacy and satisfy the accurate load forecasting requirement. Specifically, a fully distributed learning framework is exploited for distributed training, and a consensus technique is applied to preserve privacy. Case studies show that the proposed method has accuracy comparable to centralised methods, while showing advantages in training speed and data privacy.
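    The consensus step at the heart of such a framework can be sketched independently of the LSTM: each node repeatedly averages its parameters with its neighbors' through a doubly stochastic mixing matrix, so all nodes agree on the network average without a central server. In the full method this would interleave with local gradient updates; only the consensus rounds are shown.

        import numpy as np

        # Doubly stochastic mixing matrix for a 3-node fully connected graph.
        W_mix = np.array([[0.6, 0.2, 0.2],
                          [0.2, 0.6, 0.2],
                          [0.2, 0.2, 0.6]])

        theta = np.random.default_rng(5).standard_normal((3, 4))  # 3 nodes, 4 local parameters

        for _ in range(50):              # repeated gossip rounds
            theta = W_mix @ theta        # each row mixes with its neighbors

        print(theta)                     # rows agree: every node holds the network average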
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v2 [cs.LG] UPDATED)
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase and the overall time and memory resources required by both training and evaluating them are significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. This allows us to provide approximation, stability, and descent guarantees. Moreover, our method automatically and dynamically adapts the ranks during training to achieve the desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.
    A Practical, Progressively-Expressive GNN. (arXiv:2210.09521v1 [cs.LG])
    Message passing neural networks (MPNNs) have become a dominant flavor of graph neural networks (GNNs) in recent years. Yet, MPNNs come with notable limitations; namely, they are at most as powerful as the 1-dimensional Weisfeiler-Leman (1-WL) test in distinguishing graphs in a graph isomorphism testing framework. To this end, researchers have drawn inspiration from the k-WL hierarchy to develop more expressive GNNs. However, current k-WL-equivalent GNNs are not practical for even small values of k, as k-WL becomes combinatorially more complex as k grows. At the same time, several works have found great empirical success in graph learning tasks without highly expressive models, implying that chasing expressiveness with a coarse-grained ruler of expressivity like k-WL is often unneeded in practical tasks. To truly understand the expressiveness-complexity tradeoff, one desires a more fine-grained ruler, which can more gradually increase expressiveness. Our work puts forth such a proposal: namely, we first propose the (k, c)(<=)-SETWL hierarchy with greatly reduced complexity from k-WL, achieved by moving from k-tuples of nodes to sets with <=k nodes defined over <=c connected components in the induced original graph. We show favorable theoretical results for this model in relation to k-WL, and concretize it via (k, c)(<=)-SETGNN, which is as expressive as (k, c)(<=)-SETWL. Our model is practical and progressively-expressive, increasing in power with k and c. We demonstrate effectiveness on several benchmark datasets, achieving several state-of-the-art results with runtime and memory usage applicable to practical graphs. We open source our implementation at https://github.com/LingxiaoShawn/KCSetGNN.
    Online Damage Recovery for Physical Robots with Hierarchical Quality-Diversity. (arXiv:2210.09918v1 [cs.RO])
    In real-world environments, robots need to be resilient to damage and robust to unforeseen scenarios. Quality-Diversity (QD) algorithms have been successfully used to make robots adapt to damage in seconds by leveraging a diverse set of learned skills. A high diversity of skills increases the chances that a robot succeeds in overcoming new situations, since there are more potential alternatives to solve a new task. However, finding and storing a large behavioural diversity of multiple skills often leads to an increase in computational complexity. Furthermore, robot planning in a large skill space is an additional challenge that arises with an increased number of skills. Hierarchical structures can help reduce this search and storage complexity by breaking down skills into primitive skills. In this paper, we introduce the Hierarchical Trial and Error algorithm, which uses a hierarchical behavioural repertoire to learn diverse skills and leverages them to make the robot adapt quickly in the physical world. We show that the hierarchical decomposition of skills enables the robot to learn more complex behaviours while keeping the learning of the repertoire tractable. Experiments with a hexapod robot show that, in the most challenging scenarios, our method solves a maze navigation task with 20% fewer actions in simulation and 43% fewer actions in the physical world than the best baselines, while having 78% fewer complete failures.
    Models and Mechanisms for Spatial Data Fairness. (arXiv:2204.01880v2 [cs.DB] UPDATED)
    Fairness in data-driven decision-making studies scenarios where individuals from certain population segments may be unfairly treated when being considered for loan or job applications, access to public resources, or other types of services. In location-based applications, decisions are based on individual whereabouts, which often correlate with sensitive attributes such as race, income, and education. While fairness has received significant attention recently, e.g., in machine learning, there is little focus on achieving fairness when dealing with location data. Due to their characteristics and specific type of processing algorithms, location data pose important fairness challenges. We introduce the concept of spatial data fairness to address the specific challenges of location data and spatial queries. We devise a novel building block to achieve fairness in the form of fair polynomials. Next, we propose two mechanisms based on fair polynomials that achieve individual spatial fairness, corresponding to two common location-based decision-making types: distance-based and zone-based. Extensive experimental results on real data show that the proposed mechanisms achieve spatial fairness without sacrificing utility.
    Learning Interface Conditions in Domain Decomposition Solvers. (arXiv:2205.09833v2 [cs.LG] UPDATED)
    Domain decomposition methods are widely used and effective in the approximation of solutions to partial differential equations. Yet the optimal construction of these methods requires tedious analysis and is often available only in simplified, structured-grid settings, limiting their use for more complex problems. In this work, we generalize optimized Schwarz domain decomposition methods to unstructured-grid problems, using Graph Convolutional Neural Networks (GCNNs) and unsupervised learning to learn optimal modifications at subdomain interfaces. A key ingredient in our approach is an improved loss function, enabling effective training on relatively small problems, but robust performance on arbitrarily large problems, with computational cost linear in problem size. The performance of the learned linear solvers is compared with both classical and optimized domain decomposition algorithms, for both structured- and unstructured-grid problems.
    Global Optimality and Finite Sample Analysis of Softmax Off-Policy Actor Critic under State Distribution Mismatch. (arXiv:2111.02997v2 [cs.LG] UPDATED)
    In this paper, we establish the global optimality and convergence rate of an off-policy actor critic algorithm in the tabular setting without using density ratio to correct the discrepancy between the state distribution of the behavior policy and that of the target policy. Our work goes beyond existing works on the optimality of policy gradient methods in that existing works use the exact policy gradient for updating the policy parameters while we use an approximate and stochastic update step. Our update step is not a gradient update because we do not use a density ratio to correct the state distribution, which aligns well with what practitioners do. Our update is approximate because we use a learned critic instead of the true value function. Our update is stochastic because at each step the update is done for only the current state action pair. Moreover, we remove several restrictive assumptions from existing works in our analysis. Central to our work is the finite sample analysis of a generic stochastic approximation algorithm with time-inhomogeneous update operators on time-inhomogeneous Markov chains, based on its uniform contraction properties.
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v2 [cs.LG] UPDATED)
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{chizat2018global, mei2018mean}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated unbounded nonlinearity of the training dynamics. This work establishes the first linear convergence result for vanilla two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel time-dependent estimate of the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.
    Classifying Turbulent Environments via Machine Learning. (arXiv:2201.00732v2 [physics.flu-dyn] UPDATED)
    The problem of classifying turbulent environments from partial observation is key for several theoretical and applied fields, from engineering to earth observation and astrophysics, e.g. to precondition the search for optimal control policies in different turbulent backgrounds, to predict the probability of rare events, and/or to infer physical parameters labelling different turbulent set-ups. To achieve this goal one can use different tools depending on the knowledge of the system and on the quality and quantity of the accessible data. In this context, we work in a model-free setup, completely blind to all dynamical laws but with a large quantity of good-quality data for training. As a prototype of complex flows with different attractors and different multi-scale statistical properties, we selected 10 turbulent 'ensembles' by changing the rotation frequency of the frame of reference of the 3d domain, and we suppose to have access only to a set of partial observations limited to the instantaneous kinetic energy distribution in a 2d plane, as is often the case in geophysics and astrophysics. We compare results obtained by a Machine Learning (ML) approach consisting of a state-of-the-art Deep Convolutional Neural Network (DCNN) against Bayesian inference, which exploits information on velocity and enstrophy moments. First, we discuss the superiority of the ML approach, presenting also results when varying the number of training data and the hyper-parameters. Second, we present an ablation study on the input data aimed at ranking the importance of the flow features used by the DCNN, helping to identify the main physical content exploited by the classifier. Finally, we discuss the main limitations of such data-driven methods and potential interesting applications.
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v2 [cs.LG] UPDATED)
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a 'bona fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called "Edge of Stability" (EoS), where the step size crosses the admissibility threshold that is inversely proportional to the Lipschitz constant of the gradient. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a 'Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterize a local condition involving third-order derivatives that stabilizes oscillations of GD above the EoS, and leverage this property in a teacher-student setting, under population loss. Finally, focusing on Matrix Factorization, we establish a non-asymptotic 'Local Implicit Bias' of GD above the EoS, whereby quasi-symmetric initializations converge to symmetric solutions -- where sharpness is minimal amongst all minimisers.
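    For readers unfamiliar with the stability threshold, a minimal numerical illustration (ours, not from the paper): on the quadratic loss $f(x) = (L/2)x^2$ the GD iterates satisfy $x_{t+1} = (1 - \eta L)x_t$, so they converge iff $\eta < 2/L$; convergence above this threshold therefore requires higher-order (e.g. third-derivative) structure absent from quadratics, which is exactly the regime studied here.

        import numpy as np

        L = 10.0   # curvature; the gradient of f(x) = (L/2) x^2 is L-Lipschitz

        def run_gd(eta, steps=50, x0=1.0):
            x = x0
            for _ in range(steps):
                x = x - eta * L * x    # GD on f: x_{t+1} = (1 - eta * L) * x_t
            return x

        for eta in (0.5 / L, 1.9 / L, 2.1 / L):
            print(f"eta*L = {eta * L:.1f}: |x_50| = {abs(run_gd(eta)):.3e}")
        # |1 - eta*L| < 1 iff eta < 2/L: below the threshold the iterates decay
        # (monotonically or with oscillations); just above it they oscillate
        # and blow up on a quadratic.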
    What You See is What You Get: Principled Deep Learning via Distributional Generalization. (arXiv:2204.03230v2 [cs.LG] UPDATED)
    Having similar behavior at training time and test time $-$ what we call a "What You See Is What You Get" (WYSIWYG) property $-$ is desirable in machine learning. Models trained with standard stochastic gradient descent (SGD), however, do not necessarily have this property, as their complex behaviors such as robustness or subgroup performance can differ drastically between training and test time. In contrast, we show that Differentially-Private (DP) training provably ensures the high-level WYSIWYG property, which we quantify using a notion of distributional generalization. Applying this connection, we introduce new conceptual tools for designing deep-learning methods by reducing generalization concerns to optimization ones: to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the training data. By applying this novel design principle, which bypasses "pathologies" of SGD, we construct simple algorithms that are competitive with SOTA in several distributional-robustness applications, significantly improve the privacy vs. disparate impact trade-off of DP-SGD, and mitigate robust overfitting in adversarial training. Finally, we also improve on theoretical bounds relating DP, stability, and distributional generalization.
    Discrete State-Action Abstraction via the Successor Representation. (arXiv:2206.03467v2 [cs.AI] UPDATED)
    While the difficulty of reinforcement learning problems is typically related to the complexity of their state spaces, abstraction proposes that solutions often lie in simpler underlying latent spaces. Prior works have focused on learning either a continuous or a dense abstraction, or require a human to provide one. Information-dense representations capture features that are irrelevant for solving tasks, and continuous spaces can struggle to represent discrete objects. In this work we automatically learn a sparse, discrete abstraction of the underlying environment. We do so using a simple end-to-end trainable model based on the successor representation and max-entropy regularization. We describe an algorithm to apply our model, named Discrete State-Action Abstraction (DSAA), which computes an action abstraction in the form of temporally extended actions, i.e., Options, to transition between discrete abstract states. Empirically, we demonstrate the effects of different exploration schemes on our resulting abstraction, and show that it is efficient for solving downstream tasks.
    Deep Black-Box Reinforcement Learning with Movement Primitives. (arXiv:2210.09622v1 [cs.LG])
    Episode-based reinforcement learning (ERL) algorithms treat reinforcement learning (RL) as a black-box optimization problem where we learn to select a parameter vector of a controller, often represented as a movement primitive, for a given task descriptor called a context. ERL offers several distinct benefits in comparison to step-based RL. It generates smooth control trajectories, can handle non-Markovian reward definitions, and the resulting exploration in parameter space is well suited for solving sparse-reward settings. Yet, the high dimensionality of the movement primitive parameters has so far hampered the effective use of deep RL methods. In this paper, we present a new algorithm for deep ERL. It is based on differentiable trust region layers, a successful on-policy deep RL algorithm. These layers allow us to specify trust regions for the policy update that are solved exactly for each state using convex optimization, which enables learning policies with the high precision required for ERL. We compare our ERL algorithm to state-of-the-art step-based algorithms on many complex simulated robotic control tasks. In doing so, we investigate different reward formulations - dense, sparse, and non-Markovian. While step-based algorithms perform well only on dense rewards, ERL performs favorably on sparse and non-Markovian rewards. Moreover, our results show that sparse and non-Markovian rewards are also often better suited to define the desired behavior, allowing us to obtain considerably higher-quality policies compared to step-based RL.
    Overview frequency principle/spectral bias in deep learning. (arXiv:2201.07395v2 [cs.LG] UPDATED)
    Understanding deep learning is an increasingly pressing need as it penetrates more and more of industry and science. In recent years, a research line based on Fourier analysis has shed light on this magical "black box" by showing a Frequency Principle (F-Principle, or spectral bias) in the training behavior of deep neural networks (DNNs): DNNs often fit functions from low to high frequency during training. The F-Principle was first demonstrated on one-dimensional synthetic data and subsequently verified on high-dimensional real datasets. A series of works have since strengthened the validity of the F-Principle. This low-frequency implicit bias reveals the strength of neural networks in learning low-frequency functions as well as their deficiency in learning high-frequency functions. Such understanding inspires the design of DNN-based algorithms for practical problems, explains experimental phenomena emerging in various scenarios, and further advances the study of deep learning from the frequency perspective. Although necessarily incomplete, we provide an overview of the F-Principle and propose some open problems for future research.
    Make Some Noise: Reliable and Efficient Single-Step Adversarial Training. (arXiv:2202.01181v3 [cs.LG] UPDATED)
    Recently, Wong et al. showed that adversarial training with single-step FGSM leads to a characteristic failure mode named Catastrophic Overfitting (CO), in which a model becomes suddenly vulnerable to multi-step attacks. Experimentally, they showed that simply adding a random perturbation prior to FGSM (RS-FGSM) could prevent CO. However, Andriushchenko and Flammarion observed that RS-FGSM still leads to CO for larger perturbations, and proposed a computationally expensive regularizer (GradAlign) to avoid it. In this work, we methodically revisit the role of noise and clipping in single-step adversarial training. Contrary to previous intuitions, we find that using stronger noise around the clean sample combined with not clipping is highly effective in avoiding CO for large perturbation radii. We then propose Noise-FGSM (N-FGSM) which, while providing the benefits of single-step adversarial training, does not suffer from CO. Empirical analyses on a large suite of experiments show that N-FGSM is able to match or surpass the performance of the previous state of the art, GradAlign, while achieving a 3x speed-up. Code can be found at https://github.com/pdejorge/N-FGSM
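    A hedged PyTorch sketch of the idea (our reading of the abstract, not the authors' code): sample a broad random perturbation around the clean input, take one FGSM step from there, and -- unlike RS-FGSM -- do not clip the result back to the epsilon ball. The noise scale k and step size alpha below are illustrative choices, not the paper's tuned values.

        import torch
        import torch.nn.functional as F

        def n_fgsm_example(model, x, y, eps=8/255, alpha=10/255, k=2.0):
            # Broad uniform noise around the clean sample (wider than the eps ball).
            delta = torch.empty_like(x).uniform_(-k * eps, k * eps)
            delta.requires_grad_(True)
            loss = F.cross_entropy(model(x + delta), y)
            grad = torch.autograd.grad(loss, delta)[0]
            # Single FGSM step; note the absence of any projection back to the
            # eps ball, which is the key departure from RS-FGSM.
            x_adv = x + delta.detach() + alpha * grad.sign()
            return x_adv.clamp(0.0, 1.0)   # keep a valid image range only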
    Towards Efficient and Effective Self-Supervised Learning of Visual Representations. (arXiv:2210.09866v1 [cs.CV])
    Self-supervision has emerged as a propitious method for visual representation learning after the recent paradigm shift from handcrafted pretext tasks to instance-similarity based approaches. Most state-of-the-art methods enforce similarity between various augmentations of a given image, while some methods additionally use contrastive approaches to explicitly ensure diverse representations. While these approaches have indeed shown a promising direction, they require a significantly larger number of training iterations than their supervised counterparts. In this work, we explore reasons for the slow convergence of these methods, and further propose to strengthen them using well-posed auxiliary tasks that converge significantly faster and are also useful for representation learning. The proposed method utilizes the task of rotation prediction to improve the efficiency of existing state-of-the-art methods. We demonstrate significant gains in performance using the proposed method on multiple datasets, specifically for lower training epochs.
    Provably Robust Detection of Out-of-distribution Data (almost) for free. (arXiv:2106.04260v2 [cs.LG] UPDATED)
    The application of machine learning in safety-critical systems requires a reliable assessment of uncertainty. However, deep neural networks are known to produce highly overconfident predictions on out-of-distribution (OOD) data. Even if trained to be non-confident on OOD data, one can still adversarially manipulate OOD data so that the classifier again assigns high confidence to the manipulated samples. We show that two previously published defenses can be broken by better adapted attacks, highlighting the importance of robustness guarantees around OOD data. Since the existing method for this task is hard to train and significantly limits accuracy, we construct a classifier that can simultaneously achieve provably adversarially robust OOD detection and high clean accuracy. Moreover, by slightly modifying the classifier's architecture our method provably avoids the asymptotic overconfidence problem of standard neural networks. We provide code for all our experiments.
    Learning Sparse Fixed-Structure Gaussian Bayesian Networks. (arXiv:2107.10450v3 [cs.DS] UPDATED)
    Gaussian Bayesian networks (a.k.a. linear Gaussian structural equation models) are widely used to model causal interactions among continuous variables. In this work, we study the problem of learning a fixed-structure Gaussian Bayesian network up to a bounded error in total variation distance. We analyze the commonly used node-wise least squares regression (LeastSquares) and prove that it has near-optimal sample complexity. We also study two new algorithms for the problem: BatchAvgLeastSquares, which takes the average of several batches of least squares solutions at each node, so that one can interpolate between the batch size and the number of batches, and which we show also has near-optimal sample complexity; and CauchyEst, which takes the median of solutions to several batches of linear systems at each node, and whose specialization to polytrees, CauchyEstTree, we show has near-optimal sample complexity. Experimentally, we show that for uncontaminated, realizable data, the LeastSquares algorithm performs best, but in the presence of contamination or DAG misspecification, CauchyEst/CauchyEstTree and BatchAvgLeastSquares respectively perform better.
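    A minimal sketch of the node-wise LeastSquares baseline analyzed above (our illustration, not the authors' code): given the known DAG structure, each node's linear coefficients are estimated by regressing its samples on its parents' samples. The batched variants replace the single solve below with averages (BatchAvgLeastSquares) or medians (CauchyEst) of batch-wise solutions.

        import numpy as np

        def fit_gaussian_bn(samples, parents):
            """samples: (n, d) data matrix; parents: dict node -> list of parents."""
            n, d = samples.shape
            coeffs, noise_vars = {}, {}
            for v in range(d):
                pa = parents.get(v, [])
                if not pa:                       # root node: no regression needed
                    coeffs[v] = np.zeros(0)
                    noise_vars[v] = samples[:, v].var()
                    continue
                X, y = samples[:, pa], samples[:, v]
                beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # node-wise OLS
                coeffs[v] = beta
                noise_vars[v] = np.mean((y - X @ beta) ** 2)   # residual variance
            return coeffs, noise_vars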
    Transferability Properties of Graph Neural Networks. (arXiv:2112.04629v3 [cs.LG] UPDATED)
    Graph neural networks (GNNs) are composed of layers consisting of graph convolutions and pointwise nonlinearities. Due to their invariance and stability properties, GNNs are provably successful at learning representations from data supported on moderate-scale graphs. However, they are difficult to train on large-scale graphs. In this paper, we study the problem of training GNNs on graphs of moderate size and transferring them to large-scale graphs. We use graph limits called graphons to define limit objects for graph filters and GNNs -- graphon filters and graphon neural networks (WNNs) -- which we interpret as generative models for graph filters and GNNs. We then show that graphon filters and WNNs can be approximated by graph filters and GNNs sampled from them on weighted and stochastic graphs. Because the error of these approximations can be upper bounded, by a triangle-inequality argument we can further bound the error of transferring a graph filter or a GNN across graphs. Our results show (i) that the transference error decreases with the graph size, and (ii) that graph filters have a transferability-discriminability tradeoff that in GNNs is alleviated by the scattering behavior of the nonlinearity. These findings are demonstrated empirically in a movie recommendation problem and in a decentralized control task.
    Bag of Tricks for Developing Diabetic Retinopathy Analysis Framework to Overcome Data Scarcity. (arXiv:2210.09558v1 [eess.IV])
    Recently, diabetic retinopathy (DR) screening utilizing ultra-wide optical coherence tomography angiography (UW-OCTA) has been used in clinical practice to detect signs of early DR. However, developing a deep learning-based DR analysis system using UW-OCTA images is not trivial due to the difficulty of data collection and the absence of public datasets. Under such realistic constraints, a model trained on small datasets may obtain sub-par performance. Therefore, to help ophthalmologists be less confused by models' incorrect decisions, the models should be robust even in data-scarce settings. To address the above practical challenges, we present a comprehensive empirical study for DR analysis tasks, including lesion segmentation, image quality assessment, and DR grading. For each task, we introduce a robust training scheme by leveraging ensemble learning, data augmentation, and semi-supervised learning. Furthermore, we propose reliable pseudo labeling that excludes uncertain pseudo-labels based on the model's confidence scores to reduce the negative effect of noisy pseudo-labels. By exploiting the proposed approaches, we achieved 1st place in the Diabetic Retinopathy Analysis Challenge.
    Nonlinear Invariant Risk Minimization: A Causal Approach. (arXiv:2102.12353v6 [cs.LG] UPDATED)
    Due to spurious correlations, machine learning systems often fail to generalize to environments whose distributions differ from the ones used at training time. Prior work addressing this, either explicitly or implicitly, attempted to find a data representation that has an invariant relationship with the target. This is done by leveraging a diverse set of training environments to reduce the effect of spurious features and build an invariant predictor. However, these methods have generalization guarantees only when both data representation and classifiers come from a linear model class. We propose invariant Causal Representation Learning (iCaRL), an approach that enables out-of-distribution (OOD) generalization in the nonlinear setting (i.e., nonlinear representations and nonlinear classifiers). It builds upon a practical and general assumption: the prior over the data representation (i.e., a set of latent variables encoding the data) given the target and the environment belongs to general exponential family distributions. Based on this, we show that it is possible to identify the data representation up to simple transformations. We also prove that all direct causes of the target can be fully discovered, which further enables us to obtain generalization guarantees in the nonlinear setting. Extensive experiments on both synthetic and real-world datasets show that our approach outperforms a variety of baseline methods. Finally, in the discussion, we further explore the aforementioned assumption and propose a more general hypothesis, called the Agnostic Hypothesis: there exists a set of hidden causal factors affecting both inputs and outcomes. The Agnostic Hypothesis can provide a unifying view of machine learning. More importantly, it can inspire a new direction to explore a general theory for identifying hidden causal factors, which is key to enabling the OOD generalization guarantees.
    Enabling Heterogeneous Domain Adaptation in Multi-inhabitants Smart Home Activity Learning. (arXiv:2210.09499v1 [cs.AI])
    Domain adaptation for sensor-based activity learning is of utmost importance in remote health monitoring research. However, many domain adaptation algorithms fail to adapt in the presence of target-domain heterogeneity (which is always present in reality), and the presence of multiple inhabitants dramatically hinders their generalizability, producing unsatisfactory results for semi-supervised and unseen activity learning tasks. We propose AEDA, a novel deep autoencoder-based model that enables semi-supervised domain adaptation in the presence of target-domain heterogeneity, and show how it can be incorporated into any homogeneous deep domain adaptation architecture for cross-domain activity learning. Experimental evaluation on 18 different heterogeneous and multi-inhabitant use cases across 8 different domains, created from 2 publicly available human activity datasets (wearable and ambient smart homes), shows that AEDA outperforms existing domain adaptation techniques (by up to 12.8% and 8.9% for ambient smart homes and wearables, respectively) for both seen and unseen activity learning in a heterogeneous setting.
    Rethinking Value Function Learning for Generalization in Reinforcement Learning. (arXiv:2210.09960v1 [cs.LG])
    We focus on the problem of training RL agents on multiple training environments to improve observational generalization performance. In prior methods, policy and value networks are separately optimized using a disjoint network architecture to avoid interference and obtain a more accurate value function. We identify that the value network in the multiple-environment setting is more challenging to optimize and prone to overfitting training data than in the conventional single-environment setting. In addition, we find that appropriate regularization of the value network is required for better training and test performance. To this end, we propose Delayed-Critic Policy Gradient (DCPG), which implicitly penalizes the value estimates by optimizing the value network less frequently with more training data than the policy network, which can be implemented using a shared network architecture. Furthermore, we introduce a simple self-supervised task that learns the forward and inverse dynamics of environments using a single discriminator, which can be jointly optimized with the value network. Our proposed algorithms significantly improve observational generalization performance and sample efficiency in the Procgen Benchmark.
    Learning Less Generalizable Patterns with an Asymmetrically Trained Double Classifier for Better Test-Time Adaptation. (arXiv:2210.09834v1 [cs.AI])
    Deep neural networks often fail to generalize outside of their training distribution, in particular when only a single data domain is available during training. While test-time adaptation has yielded encouraging results in this setting, we argue that, to reach further improvements, these approaches should be combined with training procedure modifications aiming to learn a more diverse set of patterns. Indeed, test-time adaptation methods usually have to rely on a limited representation because of the shortcut learning phenomenon: only a subset of the available predictive patterns is learned with standard training. In this paper, we first show that the combined use of existing training-time strategies and test-time batch normalization, a simple adaptation method, does not always improve upon test-time adaptation alone on the PACS benchmark. Furthermore, experiments on Office-Home show that very few training-time methods improve upon standard training, with or without test-time batch normalization. We therefore propose a novel approach that uses a pair of classifiers and a shortcut-pattern-avoidance loss to mitigate the shortcut learning behavior: the loss reduces the generalization ability of the secondary classifier by encouraging it to learn sample-specific patterns, while the primary classifier is trained normally, resulting in the learning of both the natural and the more complex, less generalizable features. Our experiments show that our method improves upon the state-of-the-art results on both benchmarks and benefits most from test-time batch normalization.
    Efficient Augmentation for Imbalanced Deep Learning. (arXiv:2207.06080v2 [cs.LG] UPDATED)
    Deep learning models tend to memorize training data, which hurts their ability to generalize to under-represented classes. We empirically study a convolutional neural network's internal representation of imbalanced image data and measure the generalization gap between a model's feature embeddings in the training and test sets, showing that the gap is wider for minority classes. This insight enables us to design an efficient three-phase CNN training framework for imbalanced data. The framework involves training the network end-to-end on imbalanced data to learn accurate feature embeddings, performing data augmentation in the learned embedded space to balance the train distribution, and fine-tuning the classifier head on the embedded balanced training data. We propose Expansive Over-Sampling (EOS) as a data augmentation technique to utilize in the training framework. EOS forms synthetic training instances as convex combinations between the minority class samples and their nearest enemies in the embedded space to reduce the generalization gap. The proposed framework improves the accuracy over leading cost-sensitive and resampling methods commonly used in imbalanced learning. Moreover, it is more computationally efficient than standard data pre-processing methods, such as SMOTE and GAN-based oversampling, as it requires fewer parameters and less training time.
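    A hedged sketch of Expansive Over-Sampling as described above (our illustration; the mixing range is an assumption we introduce): synthetic minority-class embeddings are formed as convex combinations between each sampled minority point and its nearest "enemy", i.e. the closest point of another class, in the learned embedding space.

        import numpy as np

        def eos_oversample(emb, labels, minority_class, n_new, rng=None):
            rng = np.random.default_rng(rng)
            minority = emb[labels == minority_class]
            enemies = emb[labels != minority_class]
            synth = []
            for _ in range(n_new):
                m = minority[rng.integers(len(minority))]
                # nearest enemy of this minority sample in embedding space
                e = enemies[np.argmin(np.linalg.norm(enemies - m, axis=1))]
                lam = rng.uniform(0.0, 0.5)    # stay closer to the minority point
                synth.append((1 - lam) * m + lam * e)
            return np.vstack(synth)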
    Pareto Manifold Learning: Tackling multiple tasks via ensembles of single-task models. (arXiv:2210.09759v1 [cs.LG])
    In Multi-Task Learning, tasks may compete and limit the performance achieved on each other rather than guiding the optimization trajectory to a common solution, superior to its single-task counterparts. There is often not a single solution that is optimal for all tasks, leading practitioners to balance tradeoffs between tasks' performance, and to resort to optimality in the Pareto sense. Current Multi-Task Learning methodologies either completely neglect this aspect of functional diversity, and produce one solution in the Pareto Front predefined by their optimization schemes, or produce diverse but discrete solutions, each requiring a separate training run. In this paper, we conjecture that there exist Pareto Subspaces, i.e., weight subspaces where multiple optimal functional solutions lie. We propose Pareto Manifold Learning, an ensembling method in weight space that is able to discover such a parameterization and produces a continuous Pareto Front in a single training run, allowing practitioners to modulate the performance on each task during inference on the fly. We validate the proposed method on a diverse set of multi-task learning benchmarks, ranging from image classification to tabular datasets and scene understanding, and show that Pareto Manifold Learning outperforms state-of-the-art algorithms.
    Predicting Winning Regions in Parity Games via Graph Neural Networks (Extended Abstract). (arXiv:2210.09924v1 [cs.GT])
    Solving parity games is a major building block for numerous applications in reactive program verification and synthesis. While there exist efficient approaches to solving parity games in practice, none of these have a polynomial worst-case runtime complexity. We present an incomplete approach to determining the winning regions of parity games via graph neural networks. Our evaluation on 900 randomly generated parity games shows that this approach is efficient in practice. It moreover correctly determines the winning regions of ~60% of the games in our data set and only incurs minor errors in the remaining ones.
    Vision Paper: Causal Inference for Interpretable and Robust Machine Learning in Mobility Analysis. (arXiv:2210.10010v1 [cs.LG])
    Artificial intelligence (AI) is revolutionizing many areas of our lives, ushering in a new era of technological advancement. In particular, the transportation sector would benefit from progress in AI and advance the development of intelligent transportation systems. Building intelligent transportation systems requires an intricate combination of artificial intelligence and mobility analysis. The past few years have seen rapid development in transportation applications using advanced deep neural networks. However, such deep neural networks are difficult to interpret and lack robustness, which slows the deployment of these AI-powered algorithms in practice. To improve their usability, increasing research efforts have been devoted to developing interpretable and robust machine learning methods, among which the causal inference approach has recently gained traction as it provides interpretable and actionable information. Moreover, most of these methods are developed for image or sequential data and do not satisfy the specific requirements of mobility data analysis. This vision paper emphasizes research challenges in deep learning-based mobility analysis that require interpretability and robustness, summarizes recent developments in using causal inference to improve the interpretability and robustness of machine learning methods, and highlights opportunities in developing causally-enabled machine learning models tailored for mobility analysis. This research direction will make AI in the transportation sector more interpretable and reliable, thus contributing to safer, more efficient, and more sustainable future transportation systems.
    On the Information Content of Predictions in Word Analogy Tests. (arXiv:2210.09972v1 [cs.CL])
    An approach is proposed to quantify, in bits of information, the actual relevance of analogies in analogy tests. The main component of this approach is a soft-accuracy estimator that also yields entropy estimates with compensated biases. Experimental results obtained with pre-trained GloVe 300-D vectors and two public analogy test sets show that proximity hints are much more relevant than analogies in analogy tests, from an information content perspective. Accordingly, a simple word embedding model is used to predict that analogies carry about one bit of information, which is experimentally corroborated.
    Revisit Systematic Generalization via Meaningful Learning. (arXiv:2003.06658v5 [cs.CL] UPDATED)
    Humans can systematically generalize to novel compositions of existing concepts. Recent studies argue that neural networks appear inherently ineffective in such cognitive capacity, leading to a pessimistic view and a lack of attention to optimistic results. We revisit this controversial topic from the perspective of meaningful learning, an exceptional capability of humans to learn novel concepts by connecting them with known ones. We reassess the compositional skills of sequence-to-sequence models conditioned on the semantic links between new and old concepts. Our observations suggest that models can successfully one-shot generalize to novel concepts and compositions through semantic linking, either inductively or deductively. We demonstrate that prior knowledge plays a key role as well. In addition to synthetic tests, we further conduct proof-of-concept experiments in machine translation and semantic parsing, showing the benefits of meaningful learning in applications. We hope our positive findings will encourage excavating modern neural networks' potential in systematic generalization through more advanced learning schemes.
    Linear Guardedness and its Implications. (arXiv:2210.10012v1 [cs.LG])
    Previous work on concept identification in neural representations has focused on linear concept subspaces and their neutralization. In this work, we formulate the notion of linear guardedness -- the inability to directly predict a given concept from the representation -- and study its implications. We show that, in the binary case, the neutralized concept cannot be recovered by an additional linear layer. However, we point out that -- contrary to what was implicitly argued in previous works -- multiclass softmax classifiers can be constructed that indirectly recover the concept. Thus, linear guardedness does not guarantee that linear classifiers do not utilize the neutralized concepts, shedding light on theoretical limitations of linear information removal methods.
    On Classification Thresholds for Graph Attention with Edge Features. (arXiv:2210.10014v1 [cs.LG])
    Recent years have seen the rise of graph neural networks for prediction tasks on graphs. One of the dominant architectures is graph attention, due to its ability to make predictions using weighted edge features, not only node features. In this paper we analyze, theoretically and empirically, graph attention networks and their ability to correctly label nodes in a classic classification task. More specifically, we study the performance of graph attention on the classic contextual stochastic block model (CSBM). In the CSBM the node and edge features are obtained from a mixture of Gaussians and the edges from a stochastic block model. We consider a general graph attention mechanism that takes random edge features as input to determine the attention coefficients. We study two cases. First, when the edge features are noisy, we prove that the majority of the attention coefficients are uniform up to a constant; this allows us to prove that graph attention with edge features is no better than simple graph convolution for achieving perfect node classification. Second, we prove that when the edge features are clean, graph attention can distinguish intra- from inter-class edges, which makes graph attention better than classic graph convolution.
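    To make the data model concrete, a minimal CSBM-with-edge-features generator (our sketch under stated assumptions; parameter values are illustrative): two equal-sized classes, edges drawn with intra-/inter-class probabilities p and q, node features from a two-component Gaussian mixture, and edge features that indicate the edge type, corrupted by Gaussian noise.

        import numpy as np

        def sample_csbm(n=200, p=0.5, q=0.1, mu=1.0, sigma=1.0,
                        edge_noise=0.5, rng=None):
            rng = np.random.default_rng(rng)
            y = np.repeat([0, 1], n // 2)          # two balanced node classes
            # node features: Gaussian mixture with class-dependent means +/- mu
            X = rng.normal(0, sigma, size=(n, 1)) + np.where(y == 0, -mu, mu)[:, None]
            edges, edge_feats = [], []
            for i in range(n):
                for j in range(i + 1, n):
                    intra = y[i] == y[j]
                    if rng.random() < (p if intra else q):
                        edges.append((i, j))
                        # clean edge feature = +/-1 by edge type, plus noise
                        edge_feats.append((1.0 if intra else -1.0)
                                          + rng.normal(0, edge_noise))
            return X, y, np.array(edges), np.array(edge_feats)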
    RAPO: An Adaptive Ranking Paradigm for Bilingual Lexicon Induction. (arXiv:2210.09926v1 [cs.CL])
    Bilingual lexicon induction induces word translations by aligning independently trained word embeddings in two languages. Existing approaches generally focus on minimizing the distances between words in the aligned pairs, while suffering from low discriminative capability to distinguish the relative orders between positive and negative candidates. In addition, the mapping function is globally shared by all words, whose performance might be hindered by deviations in the distributions of different languages. In this work, we propose a novel ranking-oriented induction model, RAPO, to learn a personalized mapping function for each word. RAPO is capable of enjoying the merits of the unique characteristics of a single word and the cross-language isomorphism simultaneously. Extensive experimental results on public datasets including both rich-resource and low-resource languages demonstrate the superiority of our proposal. Our code is publicly available at https://github.com/Jlfj345wf/RAPO
    Towards Understanding GD with Hard and Conjugate Pseudo-labels for Test-Time Adaptation. (arXiv:2210.10019v1 [cs.LG])
    We consider a setting in which a model needs to adapt to a new domain under distribution shifts, given that only unlabeled test samples from the new domain are accessible at test time. A common idea in most of the related works is constructing pseudo-labels for the unlabeled test samples and applying gradient descent (GD) to a loss function with the pseudo-labels. Recently, Goyal et al. (2022) proposed conjugate labels, a new kind of pseudo-label for self-training at test time. They empirically show that conjugate labels outperform other ways of pseudo-labeling on many domain adaptation benchmarks. However, provably showing that GD with conjugate labels learns a good classifier for test-time adaptation remains open. In this work, we aim at theoretically understanding GD with hard and conjugate labels for a binary classification problem. We show that for square loss, GD with conjugate labels converges to a solution that minimizes the testing 0-1 loss under a Gaussian model, while GD with hard pseudo-labels fails at this task. We also analyze them under different loss functions for the update. Our results shed light on understanding when and why GD with hard labels or conjugate labels works in test-time adaptation.
    Theoretical Guarantees for Permutation-Equivariant Quantum Neural Networks. (arXiv:2210.09974v1 [quant-ph])
    Despite the great promise of quantum machine learning models, there are several challenges one must overcome before unlocking their full potential. For instance, models based on quantum neural networks (QNNs) can suffer from excessive local minima and barren plateaus in their training landscapes. Recently, the nascent field of geometric quantum machine learning (GQML) has emerged as a potential solution to some of those issues. The key insight of GQML is that one should design architectures, such as equivariant QNNs, encoding the symmetries of the problem at hand. Here, we focus on problems with permutation symmetry (i.e., the group of symmetry $S_n$), and show how to build $S_n$-equivariant QNNs. We provide an analytical study of their performance, proving that they do not suffer from barren plateaus, quickly reach overparametrization, and can generalize well from small amounts of data. To verify our results, we perform numerical simulations for a graph state classification task. Our work provides the first theoretical guarantees for equivariant QNNs, thus indicating the extreme power and potential of GQML.
    Locally Smoothed Gaussian Process Regression. (arXiv:2210.09998v1 [stat.ML])
    We develop a novel framework to accelerate Gaussian process regression (GPR). In particular, we consider localization kernels at each data point to down-weight the contributions from other data points that are far away, and we derive the GPR model stemming from the application of such a localization operation. Through a set of experiments, we demonstrate the competitive performance of the proposed approach compared to full GPR, other localized models, and deep Gaussian processes. Crucially, these performances are obtained with considerable speedups compared to standard global GPR due to the sparsification effect on the Gram matrix induced by the localization operation.
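    One plausible instantiation of the sparsification effect (a covariance-tapering sketch under our own assumptions, not the paper's exact construction): multiply a base RBF kernel by a compactly supported localization weight, so that Gram matrix entries for far-apart points become exactly zero.

        import numpy as np

        def rbf_gram(X, ell=1.0):
            d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
            return np.exp(-0.5 * d2 / ell ** 2)

        def localized_gram(X, ell=1.0, radius=3.0):
            base = rbf_gram(X, ell)
            d = np.sqrt(np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
            taper = np.clip(1.0 - d / radius, 0.0, None) ** 2  # compact support
            return base * taper   # zero beyond `radius`, hence a sparse Gram matrix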
    Research of an optimization model for servicing a network of ATMs and information payment terminals. (arXiv:2210.09927v1 [cs.LG])
    The steadily high demand for cash contributes to the expansion of networks of bank payment terminals. To optimize the amount of cash in payment terminals, it is necessary to minimize the cost of servicing them and ensure that there are no excess funds in the network. The purpose of this work is to create a cash management system for a network of payment terminals. The article discusses the solution to the problem of determining the optimal amount of funds to be loaded into the terminals, and the effective frequency of collection, which allows additional income to be earned by investing the released funds. The paper presents the results of predicting daily cash withdrawals at ATMs using a triple exponential smoothing model, a recurrent neural network with long short-term memory, and a singular spectrum analysis model. These forecasting models allowed us to obtain a sufficient level of correct forecasts with good accuracy and completeness. The results of forecasting cash withdrawals were used to build a discrete optimal control model, which was used to develop an optimal schedule for adding funds to the payment terminals. It is shown that the efficiency and reliability of the proposed model is higher than that of the classical Baumol-Tobin inventory management model: when tested on the time series of three ATMs, the discrete optimal control model never exhausted funds and earned on average 30% more than the classical model.
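    A hedged sketch of the triple-exponential-smoothing (Holt-Winters) forecast mentioned above, using statsmodels on a synthetic stand-in series; the weekly (7-day) seasonality is our assumption for daily ATM withdrawal data, not a detail stated in the abstract.

        import numpy as np
        import pandas as pd
        from statsmodels.tsa.holtwinters import ExponentialSmoothing

        # synthetic stand-in for a daily withdrawal series with weekly seasonality
        rng = np.random.default_rng(0)
        days = pd.date_range("2022-01-01", periods=180, freq="D")
        series = pd.Series(1000 + 200 * np.sin(2 * np.pi * np.arange(180) / 7)
                           + rng.normal(0, 50, 180), index=days)

        model = ExponentialSmoothing(series, trend="add", seasonal="add",
                                     seasonal_periods=7).fit()
        forecast = model.forecast(14)   # two weeks of predicted daily withdrawals
        print(forecast.round(1))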
    Perceptual Grouping in Vision-Language Models. (arXiv:2210.09996v1 [cs.CV])
    Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models.
    Off-policy evaluation for learning-to-rank via interpolating the item-position model and the position-based model. (arXiv:2210.09512v1 [cs.LG])
    A critical need for industrial recommender systems is the ability to evaluate recommendation policies offline, before deploying them to production. Unfortunately, widely used off-policy evaluation methods either make strong assumptions about how users behave that can lead to excessive bias, or they make fewer assumptions and suffer from large variance. We tackle this problem by developing a new estimator that mitigates the problems of the two most popular off-policy estimators for rankings, namely the position-based model and the item-position model. In particular, the new estimator, called INTERPOL, addresses the bias of a potentially misspecified position-based model, while providing an adaptable bias-variance trade-off compared to the item-position model. We provide theoretical arguments as well as empirical results that highlight the performance of our novel estimation approach.
    Few-Shot Learning of Compact Models via Task-Specific Meta Distillation. (arXiv:2210.09922v1 [cs.LG])
    We consider a new problem of few-shot learning of compact models. Meta-learning is a popular approach for few-shot learning. Previous work in meta-learning typically assumes that the model architecture during meta-training is the same as the model architecture used for final deployment. In this paper, we challenge this basic assumption. For final deployment, we often need the model to be small. But small models usually do not have enough capacity to effectively adapt to new tasks. In the mean time, we often have access to the large dataset and extensive computing power during meta-training since meta-training is typically performed on a server. In this paper, we propose task-specific meta distillation that simultaneously learns two models in meta-learning: a large teacher model and a small student model. These two models are jointly learned during meta-training. Given a new task during meta-testing, the teacher model is first adapted to this task, then the adapted teacher model is used to guide the adaptation of the student model. The adapted student model is used for final deployment. We demonstrate the effectiveness of our approach in few-shot image classification using model-agnostic meta-learning (MAML). Our proposed method outperforms other alternatives on several benchmark datasets.
    Differentially Private Diffusion Models. (arXiv:2210.09929v1 [stat.ML])
    While modern machine learning models rely on increasingly large training datasets, data is often limited in privacy-sensitive domains. Generative models trained with differential privacy (DP) on sensitive data can sidestep this challenge, providing access to synthetic data instead. However, training DP generative models is highly challenging due to the noise injected into training to enforce DP. We propose to leverage diffusion models (DMs), an emerging class of deep generative models, and introduce Differentially Private Diffusion Models (DPDMs), which enforce privacy using differentially private stochastic gradient descent (DP-SGD). We motivate why DP-SGD is well suited for training DPDMs, and thoroughly investigate the DM parameterization and the sampling algorithm, which turn out to be crucial ingredients in DPDMs. Furthermore, we propose noise multiplicity, a simple yet powerful modification of the DM training objective tailored to the DP setting to boost performance. We validate our novel DPDMs on widely-used image generation benchmarks and achieve state-of-the-art (SOTA) performance by large margins. For example, on MNIST we improve the SOTA FID from 48.4 to 5.01 and downstream classification accuracy from 83.2% to 98.1% for the privacy setting DP-$(\varepsilon{=}10, \delta{=}10^{-5})$. Moreover, on standard benchmarks, classifiers trained on DPDM-generated synthetic data perform on par with task-specific DP-SGD-trained classifiers, which has not been demonstrated before for DP generative models. Project page and code: https://nv-tlabs.github.io/DPDM.
    MaSS: Multi-attribute Selective Suppression. (arXiv:2210.09904v1 [cs.LG])
    The recent rapid advances in machine learning technologies largely depend on the vast richness of data available today, in terms of both the quantity and the rich content contained within. For example, biometric data such as images and voices could reveal people's attributes like age, gender, sentiment, and origin, whereas location/motion data could be used to infer people's activity levels, transportation modes, and life habits. Along with the new services and applications enabled by such technological advances, various governmental policies are put in place to regulate such data usage and protect people's privacy and rights. As a result, data owners often opt for simple data obfuscation (e.g., blur people's faces in images) or withholding data altogether, which leads to severe data quality degradation and greatly limits the data's potential utility. Aiming for a sophisticated mechanism which gives data owners fine-grained control while retaining the maximal degree of data utility, we propose Multi-attribute Selective Suppression, or MaSS, a general framework for performing precisely targeted data surgery to simultaneously suppress any selected set of attributes while preserving the rest for downstream machine learning tasks. MaSS learns a data modifier through adversarial games between two sets of networks, where one is aimed at suppressing selected attributes, and the other ensures the retention of the rest of the attributes via general contrastive loss as well as explicit classification metrics. We carried out an extensive evaluation of our proposed method using multiple datasets from different domains including facial images, voice audio, and video clips, and obtained promising results in MaSS' generalizability and capability of suppressing targeted attributes without negatively affecting the data's usability in other downstream ML tasks.
    Global Explanation of Tree-Ensembles Models Based on Item Response Theory. (arXiv:2210.09933v1 [cs.LG])
    Explainable Artificial Intelligence (XAI) is aimed at studying and developing techniques to explain black-box models, that is, models that provide limited self-explanation of their predictions. In recent years, XAI researchers have been formalizing proposals and developing new measures to explain how these models make specific predictions. In previous studies, evidence has been found on how model (dataset and algorithm) complexity affects global explanations generated by the XAI measures Ciu, Dalex, Eli5, Lofo, Shap and Skater, suggesting that there is room for the development of a new XAI measure that builds on the complexity of the model. Thus, this research proposes a measure called eXirt (Explainable based on Item Response Theory), which is capable of explaining tree-ensemble models by using the properties of Item Response Theory (IRT). For this purpose, a benchmark was created using 40 different datasets and 2 different algorithms (Random Forest and Gradient Boosting), generating 6 different explainability ranks using known XAI measures along with 1 data purity rank and 1 rank of the eXirt measure, amounting to 8 global ranks for each model, i.e., 640 ranks altogether. The results show that eXirt displayed different ranks than the other measures, which demonstrates that the advocated methodology generates global explanations of tree-ensemble models that have not yet been explored, for both the harder-to-explain models and even the easier ones.
    Dense FixMatch: a simple semi-supervised learning method for pixel-wise prediction tasks. (arXiv:2210.09919v1 [cs.CV])
    We propose Dense FixMatch, a simple method for online semi-supervised learning of dense and structured prediction tasks combining pseudo-labeling and consistency regularization via strong data augmentation. We enable the application of FixMatch in semi-supervised learning problems beyond image classification by adding a matching operation on the pseudo-labels. This allows us to still use the full strength of data augmentation pipelines, including geometric transformations. We evaluate it on semi-supervised semantic segmentation on Cityscapes and Pascal VOC with different percentages of labeled data and ablate design choices and hyper-parameters. Dense FixMatch significantly improves results compared to supervised learning using only labeled data, approaching its performance with 1/4 of the labeled samples.
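    A hedged per-pixel sketch of the FixMatch-style consistency loss described above (our illustration, not the authors' code). Here logits_weak and logits_strong are (B, C, H, W) predictions for weakly/strongly augmented views that have already been spatially aligned (the role of the matching operation for geometric augmentations), and tau is the confidence threshold.

        import torch
        import torch.nn.functional as F

        def dense_fixmatch_loss(logits_weak, logits_strong, tau=0.95):
            with torch.no_grad():
                probs = logits_weak.softmax(dim=1)       # (B, C, H, W)
                conf, pseudo = probs.max(dim=1)          # per-pixel pseudo-labels
                mask = (conf >= tau).float()             # keep confident pixels only
            # per-pixel cross-entropy of the strong view against the pseudo-labels
            loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
            return (loss * mask).sum() / mask.sum().clamp(min=1.0)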
    Contextual bandits with concave rewards, and an application to fair ranking. (arXiv:2210.09957v1 [cs.LG])
    We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.
    It's a long way! Layer-wise Relevance Propagation for Echo State Networks applied to Earth System Variability. (arXiv:2210.09958v1 [cs.LG])
    Artificial neural networks (ANNs) are known to be powerful methods for many hard problems (e.g. image classification, speech recognition or time series prediction). However, these models tend to produce black-box results and are often difficult to interpret. Layer-wise relevance propagation (LRP) is a widely used technique for understanding how ANN models arrive at their conclusions and what a model has learned. Here, we focus on Echo State Networks (ESNs), a certain type of recurrent neural network also known as reservoir computing. ESNs are easy to train and only require a small number of trainable parameters, but they are still black-box models. We show how LRP can be applied to ESNs in order to open the black box. We also show how ESNs can be used not only for time series prediction but also for image classification: our ESN model serves as a detector for the El Niño-Southern Oscillation (ENSO) from sea surface temperature anomalies. ENSO is a well-known problem that has been extensively discussed before, but we use this simple problem to demonstrate how LRP can significantly enhance the explainability of ESNs.
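    For readers unfamiliar with ESNs, a minimal reservoir-computing sketch (standard textbook construction, not the paper's model): only the linear readout is trained, here by ridge regression on the collected reservoir states.

        import numpy as np

        rng = np.random.default_rng(0)
        n_in, n_res = 3, 200
        W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
        W = rng.normal(0, 1, (n_res, n_res))
        W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

        def run_reservoir(U):
            x = np.zeros(n_res)
            states = []
            for u in U:                         # fixed, untrained state update
                x = np.tanh(W_in @ u + W @ x)
                states.append(x.copy())
            return np.array(states)

        U = rng.normal(size=(500, n_in))        # toy input sequence
        y = U[:, 0]                             # toy target
        S = run_reservoir(U)
        # ridge-regression readout: the only trained parameters of the ESN
        W_out = np.linalg.solve(S.T @ S + 1e-2 * np.eye(n_res), S.T @ y)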
    RPM: Generalizable Behaviors for Multi-Agent Reinforcement Learning. (arXiv:2210.09646v1 [cs.MA])
    Despite recent advances in multi-agent reinforcement learning (MARL), MARL agents easily overfit the training environment and perform poorly in evaluation scenarios where other agents behave differently. Obtaining generalizable policies for MARL agents is thus necessary but challenging, mainly due to complex multi-agent interactions. In this work, we model the problem with Markov Games and propose a simple yet effective method, ranked policy memory (RPM), to collect diverse multi-agent trajectories for training MARL policies with good generalizability. The main idea of RPM is to maintain a look-up memory of policies. In particular, we try to acquire various levels of behaviors by saving policies ranked by the training episode return, i.e., the episode return of agents in the training environment; when an episode starts, the learning agent can then choose a policy from the RPM as the behavior policy. This self-play training framework leverages agents' past policies and guarantees the diversity of multi-agent interactions in the training data. We implement RPM on top of MARL algorithms and conduct extensive experiments on Melting Pot. We demonstrate that RPM enables MARL agents to interact with unseen agents in multi-agent generalization evaluation scenarios and complete given tasks, and it significantly boosts performance by up to 402% on average.
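    A simplified sketch of a ranked policy memory (our reading of the abstract; bucketing scheme and sampling rule are illustrative assumptions): checkpointed policies are bucketed by their training episode return, and at the start of each episode a behavior policy is sampled from a ranked bucket so that co-players exhibit diverse skill levels.

        import random
        from collections import defaultdict

        class RankedPolicyMemory:
            def __init__(self, num_ranks=10, max_return=500.0):
                self.num_ranks = num_ranks
                self.max_return = max_return
                self.buckets = defaultdict(list)  # rank -> list of policy snapshots

            def add(self, policy_params, episode_return):
                # map the episode return to a rank bucket and store the snapshot
                r = min(int(episode_return / self.max_return * self.num_ranks),
                        self.num_ranks - 1)
                self.buckets[r].append(policy_params)

            def sample(self):
                # choose a rank uniformly among non-empty buckets, then a policy;
                # assumes at least one policy has been added
                rank = random.choice([r for r in self.buckets if self.buckets[r]])
                return random.choice(self.buckets[rank])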
    Uncertainty estimation for out-of-distribution detection in computational histopathology. (arXiv:2210.09909v1 [cs.CV])
    In computational histopathology, algorithms now outperform humans on a range of tasks, but to date none are employed for automated diagnosis in the clinic. Before algorithms can be involved in such high-stakes decisions, they need to "know when they don't know", i.e., they need to estimate their predictive uncertainty. This allows them to defer potentially erroneous predictions to a human pathologist, thus increasing their safety. Here, we evaluate the predictive performance and calibration of several uncertainty estimation methods on clinical histopathology data. We show that a distance-aware uncertainty estimation method outperforms commonly used approaches, such as Monte Carlo dropout and deep ensembles. However, we observe a drop in predictive performance and calibration on novel samples across all uncertainty estimation methods tested. We also investigate the use of uncertainty thresholding to reject out-of-distribution samples for selective prediction. We demonstrate the limitations of this approach and suggest areas for future research.
    Measure-Theoretic Probability of Complex Co-occurrence and E-Integral. (arXiv:2210.09913v1 [stat.ML])
    Complex high-dimensional co-occurrence data are increasingly common, arising from complex systems of interacting physical, biological and social processes in discretely indexed modifiable areal units or continuously indexed locations of a study region for landscape-based mechanisms. Modeling, predicting and interpreting complex co-occurrences are general and fundamental problems of statistical and machine learning in a broad variety of real-world modern applications. We introduce probability and conditional probability of co-occurrence, defined in a general setting with set functions, to develop a rigorous measure-theoretic foundation for the inherent challenge of data sparseness, the main challenge in probabilistic modeling of and reasoning about co-occurrence in statistical inference. Based on the defined conditional probability of co-occurrence, we investigate the behavior of a class of natural integrals called E-integrals and present results on their properties. The paper offers a novel measure-theoretic framework in which the E-integral, as a basic measure-theoretic concept, can serve as the starting point for the expectation-functional approach to probability theory preferred by Whittle (1992) and Pollard (2001), addressing the challenge of co-occurrences emerging in modern high-dimensional co-occurrence data problems and opening the door to more sophisticated and interesting research in complex high-dimensional co-occurrence data science.
    Near Real-time CO$_2$ Emissions Based on Carbon Satellite And Artificial Intelligence. (arXiv:2210.09850v1 [physics.ao-ph])
    To limit global warming to pre-industrial levels, global governments, industry and academia are making aggressive efforts to reduce carbon emissions. The evaluation of anthropogenic carbon dioxide (CO$_2$) emissions, however, depends on self-reported information that is not always reliable. Society needs to develop an objective, independent, and generalized system for metering CO$_2$ emissions. Satellite CO$_2$ observation from space, which reports column-averaged regional CO$_2$ dry-air mole fractions, has gradually indicated its potential to build such a system. Nevertheless, estimating anthropogenic CO$_2$ emissions from CO$_2$-observing satellites is bottlenecked by the highly complicated physical characteristics of atmospheric activities. Here we provide the first method that combines advanced artificial intelligence (AI) techniques with carbon satellite monitoring to quantify anthropogenic CO$_2$ emissions. We propose an integrated AI-based pipeline that contains both a data retrieval algorithm and a two-step data-driven solution. First, the data retrieval algorithm generates effective datasets from multi-modal data, including carbon satellite measurements, information on carbon sources, and several environmental factors. Second, the two-step data-driven solution applies the powerful representational capacity of deep learning to quantify anthropogenic CO$_2$ emissions from satellite CO$_2$ observations together with the other factors. Our work demonstrates the potential of quantifying CO$_2$ emissions by combining deep learning algorithms and carbon satellite monitoring.
    Universal hidden monotonic trend estimation with contrastive learning. (arXiv:2210.09817v1 [cs.LG])
    In this paper, we describe a universal method for extracting the underlying monotonic trend factor from time series data. We propose an approach related to the Mann-Kendall test, a standard monotonic trend detection method, and call it contrastive trend estimation (CTE). We show that the CTE method identifies any hidden trend underlying temporal data while avoiding the standard assumptions used for monotonic trend identification. In particular, CTE can take any type of temporal data (vectors, images, graphs, time series, etc.) as input. Finally, we illustrate the usefulness of our CTE method through several experiments on different types of data and problems.
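    To make the idea concrete: the Mann-Kendall statistic counts time-concordant pairs, and a natural differentiable analogue is a pairwise ranking loss on time-ordered pairs. The sketch below is our illustrative instantiation of such a contrastive trend objective, not necessarily the paper's exact loss.

        import torch.nn.functional as F

        def contrastive_trend_loss(g, x_early, x_late):
            """Pairwise ranking loss: the later observation should score higher.

            g       : encoder mapping an observation to a scalar trend score;
                      it can be an MLP, CNN, or GNN to match the input type
            x_early : batch of observations at times t
            x_late  : batch of observations at later times s > t
            """
            # softplus(z) = log(1 + e^z) penalizes pairs scored in the wrong
            # order, a smooth analogue of counting discordant pairs.
            return F.softplus(g(x_early) - g(x_late)).mean()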
    Towards Proactive Information Retrieval in Noisy Text with Wikipedia Concepts. (arXiv:2210.09877v1 [cs.IR])
    Extracting useful information from the user history to clearly understand informational needs is a crucial feature of a proactive information retrieval system. For understanding information and relevance, Wikipedia can provide the background knowledge that an intelligent system needs. This work explores how exploiting the context of a query with Wikipedia concepts can improve proactive information retrieval on noisy text. We formulate two models that use entity linking to associate Wikipedia topics with the relevance model. Our experiments on a podcast segment retrieval task demonstrate that there is a clear signal of relevance in Wikipedia concepts and that a ranking model can improve precision by incorporating them. We also find that Wikifying the background context of a query can help disambiguate the meaning of the query, further aiding proactive information retrieval.
    Anchor-Changing Regularized Natural Policy Gradient for Multi-Objective Reinforcement Learning. (arXiv:2206.05357v2 [cs.LG] UPDATED)
    We study policy optimization for Markov decision processes (MDPs) with multiple reward value functions, which are to be jointly optimized according to given criteria such as proportional fairness (smooth concave scalarization), hard constraints (constrained MDP), and max-min trade-off. We propose an Anchor-changing Regularized Natural Policy Gradient (ARNPG) framework, which can systematically incorporate ideas from well-performing first-order methods into the design of policy optimization algorithms for multi-objective MDP problems. Theoretically, the designed algorithms based on the ARNPG framework achieve $\tilde{O}(1/T)$ global convergence with exact gradients. Empirically, the ARNPG-guided algorithms also demonstrate superior performance compared to some existing policy gradient-based approaches in both exact gradients and sample-based scenarios.
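    As a hedged sketch of what an anchor-regularized natural policy gradient step can look like (our notation, not necessarily the paper's exact update), each iteration solves

    $$\pi_{t+1} \;=\; \arg\max_{\pi}\; \big\langle \nabla_{\pi} V(\pi_t),\, \pi \big\rangle \;-\; \frac{1}{\eta}\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_t\right) \;-\; \tau\, D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_{\mathrm{anchor}}\right),$$

    where the anchor policy $\pi_{\mathrm{anchor}}$ is changed periodically; regularizing toward a moving anchor is the mechanism by which acceleration ideas from first-order optimization enter the policy update.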
    Average-Reward Off-Policy Policy Evaluation with Function Approximation. (arXiv:2101.02808v3 [cs.LG] UPDATED)
    We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks.
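    For reference, the two targets being estimated are linked by the standard average-reward Bellman equation (a textbook identity, not a contribution of the paper):

    $$\bar r_{\pi} + v_{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\big[ r + v_{\pi}(s') \big],$$

    where $\bar r_{\pi}$ is the reward rate and $v_{\pi}$ the differential value function; off-policy evaluation with function approximation must estimate both from data generated by a different behavior policy.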
    Generative models uncertainty estimation. (arXiv:2210.09767v1 [cs.LG])
    In recent years, fully parametric fast simulation methods based on generative models have been proposed for a variety of high-energy physics detectors. By their nature, the quality of data-driven models degrades in regions of the phase space where the data are sparse. Since machine-learning models are hard to analyse from physical principles, the commonly used testing procedures are performed in a data-driven way and cannot be used reliably in such regions. In our work, we propose three methods to estimate the uncertainty of generative models inside and outside of the training phase-space region, along with data-driven calibration techniques. A test of the proposed methods on the LHCb RICH fast simulation is also presented.
    Generalized Many-Body Dispersion Correction through Random-phase Approximation for Chemically Accurate Density Functional Theory. (arXiv:2210.09784v1 [physics.chem-ph])
    We extend our recently proposed Deep Learning-aided many-body dispersion (DNN-MBD) model to quadrupole polarizability (Q) terms using a generalized Random Phase Approximation (RPA) formalism, enabling the inclusion of van der Waals contributions beyond the dipole. The resulting DNN-MBDQ model relies only on ab initio-derived quantities, as the introduced quadrupole polarizabilities are recursively retrieved from dipole ones, in turn modelled via the Tkatchenko-Scheffler method. A transferable and efficient deep neural network (DNN) provides atom-in-molecule volumes, while a single range-separation parameter is used to couple the model to Density Functional Theory (DFT). Since it can be computed at negligible cost, the DNN-MBDQ approach can be coupled with DFT functionals such as PBE/PBE0 or B86bPBE (dispersionless). DNN-MBDQ-PBE/PBE0 reaches chemical accuracy, exhibiting superior accuracy compared to other dispersion-corrected models, especially at near-equilibrium ranges, where errors are lowered by nearly 25% compared to our dipole-only approach, while gains reach nearly 50% compared to other corrected schemes.
    Representation Power of Graph Convolutions : Neural Tangent Kernel Analysis. (arXiv:2210.09809v1 [cs.LG])
    The fundamental principle of Graph Neural Networks (GNNs) is to exploit the structural information of the data by aggregating the neighboring nodes using a graph convolution. Therefore, understanding its influence on the network performance is crucial. Convolutions based on the graph Laplacian have emerged as the dominant choice, with the symmetric normalization of the adjacency matrix $A$, defined as $D^{-1/2}AD^{-1/2}$, being the most widely adopted one, where $D$ is the degree matrix. However, some empirical studies show that row normalization $D^{-1}A$ outperforms it in node classification. Despite the widespread use of GNNs, there is no rigorous theoretical study of the representation power of these convolution operators that could explain this behavior. In this work, we analyze the influence of the graph convolutions theoretically using the Graph Neural Tangent Kernel in a semi-supervised node classification setting. Under a Degree Corrected Stochastic Block Model, we prove that: (i) row normalization preserves the underlying class structure better than other convolutions; (ii) performance degrades with network depth due to over-smoothing, but the loss in class information is the slowest in row normalization; (iii) skip connections retain the class information even at infinite depth, thereby eliminating over-smoothing. We finally validate our theoretical findings on real datasets.
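    To make the two operators concrete, the snippet below builds both normalizations from a dense adjacency matrix; adding self-loops is a common convention and an assumption here, since the abstract does not specify it.

        import numpy as np

        def sym_norm(A):
            """Symmetric normalization D^{-1/2} A D^{-1/2}."""
            A = A + np.eye(A.shape[0])  # self-loops, as commonly done in GCNs
            d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
            return d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]

        def row_norm(A):
            """Row normalization D^{-1} A: each node averages over its neighbors."""
            A = A + np.eye(A.shape[0])
            return A / A.sum(axis=1, keepdims=True)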
    Disentangling the Predictive Variance of Deep Ensembles through the Neural Tangent Kernel. (arXiv:2210.09818v1 [cs.LG])
    Identifying unfamiliar inputs, also known as out-of-distribution (OOD) detection, is a crucial property of any decision making process. A simple and empirically validated technique is based on deep ensembles where the variance of predictions over different neural networks acts as a substitute for input uncertainty. Nevertheless, a theoretical understanding of the inductive biases leading to the performance of deep ensemble's uncertainty estimation is missing. To improve our description of their behavior, we study deep ensembles with large layer widths operating in simplified linear training regimes, in which the functions trained with gradient descent can be described by the neural tangent kernel. We identify two sources of noise, each inducing a distinct inductive bias in the predictive variance at initialization. We further show theoretically and empirically that both noise sources affect the predictive variance of non-linear deep ensembles in toy models and realistic settings after training. Finally, we propose practical ways to eliminate part of these noise sources leading to significant changes and improved OOD detection in trained deep ensembles.
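    The baseline mechanism under study is easy to state in code: the disagreement of independently trained networks serves as the OOD score. The sketch below uses the summed per-class variance of softmax outputs, one common choice among several.

        import torch

        def ensemble_ood_score(models, x):
            """Predictive variance across ensemble members as an OOD score."""
            with torch.no_grad():
                probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
            return probs.var(dim=0).sum(dim=-1)  # high variance -> likely OOD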
    Eye-tracking based classification of Mandarin Chinese readers with and without dyslexia using neural sequence models. (arXiv:2210.09819v1 [cs.CL])
    Eye movements are known to reflect cognitive processes in reading, and psychological reading research has shown that eye gaze patterns differ between readers with and without dyslexia. In recent years, researchers have attempted to classify readers with dyslexia based on their eye movements using Support Vector Machines (SVMs). However, these approaches (i) are based on highly aggregated features averaged over all words read by a participant, thus disregarding the sequential nature of the eye movements, and (ii) do not consider the linguistic stimulus and its interaction with the reader's eye movements. In the present work, we propose two simple sequence models that process eye movements on the entire stimulus without the need to aggregate features across the sentence. Additionally, we incorporate the linguistic stimulus into the model in two ways -- contextualized word embeddings and manually extracted linguistic features. The models are evaluated on a Mandarin Chinese dataset containing eye movements from children with and without dyslexia. Our results show that (i) even for a logographic script such as Chinese, sequence models are able to classify dyslexia from eye gaze sequences, reaching state-of-the-art performance, and (ii) incorporating the linguistic stimulus does not help to improve classification performance.
    Analyzing the Robustness of PECNet. (arXiv:2210.09846v1 [cs.AI])
    We present a comprehensive robustness analysis of PECNet, a pedestrian trajectory prediction system for autonomous vehicles. A novel metric is introduced for dataset analysis and classification. Synthetic data augmentation techniques, ranging from Newtonian mechanics to simulations based on deep reinforcement learning, are used to improve and test the system. We observe a 9.5% improvement over state-of-the-art results on FDE, at the cost of some ADE. We introduce novel architectural changes using SIRENs to obtain higher-precision results and validate our robustness hypotheses. Additionally, we diagrammatically propose a novel multi-modal system for the same task.
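    For readers unfamiliar with SIRENs, the layer below is the standard sinusoidal-activation building block referred to above; how exactly PECNet is modified with it is not specified here, so this shows only the generic component.

        import torch
        import torch.nn as nn

        class SirenLayer(nn.Module):
            """Linear layer with sine activation (Sitzmann et al., 2020)."""
            def __init__(self, in_dim, out_dim, w0=30.0):
                super().__init__()
                self.linear = nn.Linear(in_dim, out_dim)
                self.w0 = w0  # frequency scale; 30 is the value suggested for SIRENs

            def forward(self, x):
                return torch.sin(self.w0 * self.linear(x))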
    Deformably-Scaled Transposed Convolution. (arXiv:2210.09446v1 [cs.CV])
    Transposed convolution is crucial for generating high-resolution outputs, yet has received little attention compared to convolution layers. In this work we revisit transposed convolution and introduce a novel layer that allows us to place information in the image selectively and choose the `stroke breadth' at which the image is synthesized, whilst incurring a small additional parameter cost. For this we introduce three ideas: firstly, we regress offsets to the positions where the transposed convolution results are placed; secondly, we broadcast the offset weight locations over a learnable neighborhood; and thirdly, we use a compact parametrization to share weights and restrict offsets. We show that simply substituting upsampling operators with our novel layer produces substantial improvements across tasks as diverse as instance segmentation, object detection, semantic segmentation, generative image modeling, and 3D magnetic resonance image enhancement, while outperforming all existing variants of transposed convolutions. Our novel layer can be used as a drop-in replacement for 2D and 3D upsampling operators, and the code will be publicly available.
    Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation. (arXiv:2210.09549v1 [cs.CV])
    Recently, a number of studies have shown that diffusion models perform remarkably well in text-to-image synthesis tasks, opening up new research opportunities in image generation. Google's Imagen follows this trend and outperforms DALLE2 as the best model for text-to-image generation. However, Imagen uses only a T5 language model for text processing, which does not guarantee that the semantic information of the text is learned. Furthermore, the Efficient UNet leveraged by Imagen is not the best choice for image processing. To address these issues, we propose Swinv2-Imagen, a novel text-to-image diffusion model based on a Hierarchical Vision Transformer and a Scene Graph incorporating a semantic layout. In the proposed model, the feature vectors of entities and relationships are extracted and fed into the diffusion model, effectively improving the quality of the generated images. On top of that, we also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which addresses the problems stemming from CNN convolution operations. Extensive experiments evaluate the performance of the proposed model on three real-world datasets, i.e., MSCOCO, CUB and MM-CelebA-HQ. The experimental results show that the proposed Swinv2-Imagen model outperforms several popular state-of-the-art methods.
    DAGAD: Data Augmentation for Graph Anomaly Detection. (arXiv:2210.09766v1 [cs.LG])
    Graph anomaly detection in this paper aims to distinguish abnormal nodes that behave differently from the benign ones, which account for the majority of graph-structured instances. Although the task is receiving increasing attention from both academia and industry, existing research still suffers from two critical issues when learning informative anomalous behavior from graph data. First, anomalies are usually hard to capture because of their subtle abnormal behavior and the shortage of background knowledge about them, which causes severe anomalous-sample scarcity. Second, the overwhelming majority of objects in real-world graphs are normal, introducing a class-imbalance problem as well. To bridge these gaps, this paper devises a novel Data Augmentation-based Graph Anomaly Detection (DAGAD) framework for attributed graphs, equipped with three specially designed modules: 1) an information fusion module employing graph neural network encoders to learn representations, 2) a graph data augmentation module that enriches the training set with generated samples, and 3) an imbalance-tailored learning module to discriminate between the distributions of the minority (anomalous) and majority (normal) classes. A series of experiments on three datasets shows that DAGAD outperforms ten state-of-the-art baseline detectors on various widely used metrics, and an extensive ablation study validates the strength of our proposed modules.
    Bagged $k$-Distance for Mode-Based Clustering Using the Probability of Localized Level Sets. (arXiv:2210.09786v1 [stat.ML])
    In this paper, we propose an ensemble learning algorithm named \textit{bagged $k$-distance for mode-based clustering} (\textit{BDMBC}) by putting forward a new measurement called the \textit{probability of localized level sets} (\textit{PLLS}), which enables us to find all clusters for varying densities with a global threshold. On the theoretical side, we show that with a properly chosen number of nearest neighbors $k_D$ in the bagged $k$-distance, the sub-sample size $s$, the bagging rounds $B$, and the number of nearest neighbors $k_L$ for the localized level sets, BDMBC can achieve optimal convergence rates for mode estimation. It turns out that with a relatively small $B$, the sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k_D$ can be reduced simultaneously. Moreover, we establish optimal convergence results for the level set estimation of the PLLS in terms of Hausdorff distance, which reveals that BDMBC can find localized level sets for varying densities and thus enjoys local adaptivity. On the practical side, we conduct numerical experiments to empirically verify the effectiveness of BDMBC for mode estimation and level set estimation, which demonstrates the promising accuracy and efficiency of our proposed algorithm.
    A Pilot Study on Teacher-Facing Real-Time Classroom Game Dashboards. (arXiv:2210.09427v1 [cs.HC])
    Educational games are an increasingly popular teaching tool in modern classrooms. However, the development of complementary tools for teachers facilitating classroom gameplay is lacking. We present the results of a participatory design process for a teacher-facing, real-time game data dashboard. This two-phase process included a workshop to elicit teachers' requirements for such a tool, and a pilot study of our dashboard prototype. We analyze post-gameplay survey and interview data to understand teachers' experiences with the tool in terms of evidence of co-design, feasibility, and effectiveness. Our results indicate the participatory design yielded a tool both useful for and usable by teachers within the context of a real class gameplay session. We advocate for the continued development of data-driven teacher tools to improve the effectiveness of games deployed in the classroom.
    DPIS: An Enhanced Mechanism for Differentially Private SGD with Importance Sampling. (arXiv:2210.09634v1 [cs.CR])
    Nowadays, differential privacy (DP) has become a well-accepted standard for privacy protection, and deep neural networks (DNN) have been immensely successful in machine learning. The combination of these two techniques, i.e., deep learning with differential privacy, promises the privacy-preserving release of high-utility models trained with sensitive data such as medical records. A classic mechanism for this purpose is DP-SGD, which is a differentially private version of the stochastic gradient descent (SGD) optimizer commonly used for DNN training. Subsequent approaches have improved various aspects of the model training process, including noise decay schedule, model architecture, feature engineering, and hyperparameter tuning. However, the core mechanism for enforcing DP in the SGD optimizer has remained unchanged since the original DP-SGD algorithm, and it has increasingly become a fundamental barrier limiting the performance of DP-compliant machine learning solutions. Motivated by this, we propose DPIS, a novel mechanism for differentially private SGD training that can be used as a drop-in replacement for the core optimizer of DP-SGD, with consistent and significant accuracy gains over the latter. The main idea is to employ importance sampling (IS) in each SGD iteration for mini-batch selection, which reduces both sampling variance and the amount of random noise injected into the gradients that is required to satisfy DP. Integrating IS into the complex mathematical machinery of DP-SGD is highly non-trivial. DPIS addresses the challenge through novel mechanism designs, fine-grained privacy analysis, efficiency enhancements, and an adaptive gradient clipping optimization. Extensive experiments on four benchmark datasets, namely MNIST, FMNIST, CIFAR-10 and IMDb, demonstrate the superior effectiveness of DPIS over existing solutions for deep learning with differential privacy.
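    For context, the unchanged core mechanism that DPIS replaces is the classic DP-SGD step: clip each per-example gradient to a fixed L2 norm, sum, and add calibrated Gaussian noise. The sketch below shows that baseline (DPIS's importance-sampling modification itself is not reproduced here).

        import torch

        def dp_sgd_step(params, per_example_grads, clip_norm, noise_mult, lr, batch_size):
            """One DP-SGD update: per-example clipping plus Gaussian noise."""
            clipped = []
            for g in per_example_grads:  # g: per-parameter gradients for one example
                norm = torch.norm(torch.cat([t.flatten() for t in g]))
                scale = torch.clamp(clip_norm / (norm + 1e-12), max=1.0)
                clipped.append([t * scale for t in g])
            for i, p in enumerate(params):
                grad_sum = sum(g[i] for g in clipped)
                noise = noise_mult * clip_norm * torch.randn_like(p)
                p.data -= lr * (grad_sum + noise) / batch_size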
    Virtual Reality via Object Poses and Active Learning: Realizing Telepresence Robots with Aerial Manipulation Capabilities. (arXiv:2210.09678v1 [cs.RO])
    This article presents a novel telepresence system for advancing aerial manipulation in dynamic and unstructured environments. The proposed system not only features a haptic device, but also a virtual reality (VR) interface that provides real-time 3D displays of the robot's workspace as well as haptic guidance to its remotely located operator. To realize this, multiple sensors, namely a LiDAR, cameras, and IMUs, are utilized. To process the acquired sensory data, pose estimation pipelines are devised for industrial objects of both known and unknown geometries. We further propose an active learning pipeline to increase the sample efficiency of a pipeline component that relies on object detection based on Deep Neural Networks (DNNs). All these algorithms jointly address various challenges encountered during the execution of perception tasks in industrial scenarios. In the experiments, exhaustive ablation studies are provided to validate the proposed pipelines. Methodologically, these results commonly suggest how an awareness of the algorithms' own failures and uncertainty ("introspection") can be used to tackle the encountered problems. Moreover, outdoor experiments are conducted to evaluate the effectiveness of the overall system in enhancing aerial manipulation capabilities. In particular, with flight campaigns over days and nights, from spring to winter, and with different users and locations, we demonstrate over 70 robust executions of pick-and-place, force application and peg-in-hole tasks with the DLR cable-Suspended Aerial Manipulator (SAM). As a result, we show the viability of the proposed system in future industrial applications.
    Hierarchical Model-Based Imitation Learning for Planning in Autonomous Driving. (arXiv:2210.09539v1 [cs.RO])
    We demonstrate the first large-scale application of model-based generative adversarial imitation learning (MGAIL) to the task of dense urban self-driving. We augment standard MGAIL using a hierarchical model to enable generalization to arbitrary goal routes, and measure performance using a closed-loop evaluation framework with simulated interactive agents. We train policies from expert trajectories collected from real vehicles driving over 100,000 miles in San Francisco, and demonstrate a steerable policy that can navigate robustly even in a zero-shot setting, generalizing to synthetic scenarios with novel goals that never occurred in real-world driving. We also demonstrate the importance of mixing closed-loop MGAIL losses with open-loop behavior cloning losses, and show our best policy approaches the performance of the expert. We evaluate our imitative model in both average and challenging scenarios, and show how it can serve as a useful prior to plan successful trajectories.
    ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design. (arXiv:2210.09573v1 [cs.LG])
    Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, and 86.0x over general computing platforms (CPUs, EdgeGPUs, and GPUs, respectively), and of up to 10.1x and 6.8x over the prior-art Transformer accelerators SpAtten and Sanger, under an attention sparsity of 90%.
    Transfer learning with affine model transformation. (arXiv:2210.09745v1 [stat.ML])
    Supervised transfer learning (TL) has received considerable attention because of its potential to boost the predictive power of machine learning in cases with limited data. In a conventional scenario, cross-domain differences are modeled and estimated using a given set of source models and samples from a target domain. For example, if there is a functional relationship between source and target domains, only domain-specific factors are additionally learned using target samples to shift the source models to the target. However, the general methodology for modeling and estimating such cross-domain shifts has been less studied. This study presents a TL framework that simultaneously and separately estimates domain shifts and domain-specific factors using given target samples. Assuming consistency and invertibility of the domain transformation functions, we derive an optimal family of functions to represent the cross-domain shift. The newly derived class of transformation functions takes the same form as invertible neural networks using affine coupling layers, which are widely used in generative deep learning. We show that the proposed method encompasses a wide range of existing methods, including the most common TL procedure based on feature extraction using neural networks. We also clarify the theoretical properties of the proposed method, such as the convergence rate of the generalization error, and demonstrate the practical benefits of separately modeling and estimating domain-specific factors through several case studies.
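    Concretely, an affine cross-domain shift of the kind described can be written (our notational assumption, mirroring affine coupling layers) as

    $$f_{\mathrm{target}}(x) \;=\; g_1(x) \;+\; g_2(x)\cdot f_{\mathrm{source}}(x),$$

    where $g_1$ and $g_2$ are the domain-specific factors estimated from target samples; the map is invertible in $f_{\mathrm{source}}(x)$ whenever $g_2(x) \neq 0$, matching the consistency and invertibility assumptions above.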
    FLECS-CGD: A Federated Learning Second-Order Framework via Compression and Sketching with Compressed Gradient Differences. (arXiv:2210.09626v1 [cs.LG])
    In the recent paper FLECS (Agafonov et al., "FLECS: A Federated Learning Second-Order Framework via Compression and Sketching"), the second-order framework FLECS was proposed for the federated learning problem. The method compresses sketched Hessians to keep communication costs low. However, the main bottleneck of FLECS is that gradients are communicated without compression. In this paper, we propose a modification of FLECS with compressed gradient differences, which we call FLECS-CGD (FLECS with Compressed Gradient Differences), and make it applicable to stochastic optimization. Convergence guarantees are provided in the strongly convex and nonconvex cases. Experiments show the practical benefit of the proposed approach.
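    The general recipe for compressed gradient differences (in the style of DIANA-type methods) is sketched below; whether FLECS-CGD uses top-$k$ or another compressor, and the exact server-side bookkeeping, are assumptions made here for illustration.

        import torch

        def top_k(v, k):
            """Keep the k largest-magnitude entries of v, zero out the rest."""
            out = torch.zeros_like(v)
            idx = v.abs().topk(k).indices
            out[idx] = v[idx]
            return out

        class GradDiffCompressor:
            """Worker-side state for compressed gradient differences."""
            def __init__(self, dim, k):
                self.h = torch.zeros(dim)  # running estimate mirrored on the server
                self.k = k

            def compress(self, grad):
                msg = top_k(grad - self.h, self.k)  # send only the sparse difference
                self.h += msg  # the server applies the same update to its copy
                return msg     # server's gradient estimate becomes h_old + msg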
    On effects of Knowledge Distillation on Transfer Learning. (arXiv:2210.09668v1 [cs.LG])
    Knowledge distillation is a popular machine learning technique that aims to transfer knowledge from a large 'teacher' network to a smaller 'student' network and improve the student's performance by training it to emulate the teacher. In recent years, there has been significant progress in novel distillation techniques that push performance frontiers across multiple problems and benchmarks. Most of the reported work focuses on achieving state-of-the-art results on the specific problem, and there has been a significant gap in understanding the process and how it behaves under certain training scenarios. Similarly, transfer learning (TL) is an effective technique for training neural networks faster on a limited dataset by reusing representations learned from a different but related problem. Despite its effectiveness and popularity, there has not been much exploration of knowledge distillation for transfer learning. In this thesis, we propose a machine learning architecture we call TL+KD that combines knowledge distillation with transfer learning; we then present a quantitative and qualitative comparison of TL+KD with TL alone in the domain of image classification. Through this work, we show that using guidance and knowledge from a larger teacher network during fine-tuning improves the student network's validation performance, e.g., its accuracy. We characterize the improvement using a variety of metrics beyond accuracy scores, and study the model's performance in scenarios such as input degradation.
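    The fine-tuning objective in such a TL+KD setup is typically the classic Hinton-style distillation loss shown below; the temperature and mixing weight are illustrative defaults rather than values from the thesis.

        import torch.nn.functional as F

        def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
            """Hard-label cross-entropy blended with soft-label KL distillation."""
            soft = F.kl_div(
                F.log_softmax(student_logits / T, dim=-1),
                F.softmax(teacher_logits / T, dim=-1),
                reduction="batchmean",
            ) * (T * T)  # T^2 keeps gradient magnitudes comparable across temperatures
            hard = F.cross_entropy(student_logits, labels)
            return alpha * soft + (1.0 - alpha) * hard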
    Not All Poisons are Created Equal: Robust Training against Data Poisoning. (arXiv:2210.09671v1 [cs.LG])
    Data poisoning causes misclassification of test time target examples by injecting maliciously crafted samples in the training data. Existing defenses are often effective only against a specific type of targeted attack, significantly degrade the generalization performance, or are prohibitive for standard deep learning pipelines. In this work, we propose an efficient defense mechanism that significantly reduces the success rate of various data poisoning attacks, and provides theoretical guarantees for the performance of the model. Targeted attacks work by adding bounded perturbations to a randomly selected subset of training data to match the targets' gradient or representation. We show that: (i) under bounded perturbations, only a limited number of poisons can be optimized to have a gradient close enough to that of the target to make the attack successful; (ii) such effective poisons move away from their original class and get isolated in the gradient space; (iii) dropping examples in low-density gradient regions during training can successfully eliminate the effective poisons and guarantees training dynamics similar to those of training on the full data. Our extensive experiments show that our method significantly decreases the success rate of state-of-the-art targeted attacks, including Gradient Matching and Bullseye Polytope, and easily scales to large datasets.
    Hidet: Task Mapping Programming Paradigm for Deep Learning Tensor Programs. (arXiv:2210.09603v1 [cs.LG])
    As deep learning models nowadays are widely adopted by both cloud services and edge devices, the latency of deep learning model inferences becomes crucial to provide efficient model serving. However, it is challenging to develop efficient tensor programs for deep learning operators due to the high complexity of modern accelerators (e.g., NVIDIA GPUs and Google TPUs) and the rapidly growing number of operators. Deep learning compilers, such as Apache TVM, adopt declarative scheduling primitives to lower the bar of developing tensor programs. However, we show that this approach is insufficient to cover state-of-the-art tensor program optimizations (e.g., double buffering). In this paper, we propose to embed the scheduling process into tensor programs and use dedicated mappings, called task mappings, to define the computation assignment and ordering directly in the tensor programs. This new approach greatly enriches the expressible optimizations by allowing developers to manipulate tensor programs at a much finer granularity (e.g., allowing program-statement-level optimizations). We call the proposed method the task-mapping-oriented programming paradigm. With the proposed paradigm, we implement a deep learning compiler, Hidet. Extensive experiments on modern convolution and transformer models show that Hidet outperforms the state-of-the-art DNN inference framework ONNX Runtime, as well as the compiler TVM equipped with the schedulers AutoTVM and Ansor, by up to 1.48x (1.22x on average) with its enriched optimizations. It also reduces the tuning time by 20x and 11x compared with AutoTVM and Ansor, respectively.
    WaGI : Wavelet-based GAN Inversion for Preserving High-frequency Image Details. (arXiv:2210.09655v1 [cs.CV])
    Recent GAN inversion models focus on preserving image-specific details through various methods, e.g., generator tuning or feature mixing. While these are helpful for preserving details compared to a naive low-rate latent inversion, they still fail to maintain high-frequency features precisely. In this paper, we point out that existing GAN inversion models have inherent limitations in both structural and training aspects, which preclude the delicate reconstruction of high-frequency features. In particular, we prove that the widely used loss term in GAN inversion, i.e., L2, is biased toward reconstructing mainly low-frequency features. To overcome this problem, we propose a novel GAN inversion model, coined WaGI, which can handle high-frequency features explicitly by using a novel wavelet-based loss term and a newly proposed wavelet fusion scheme. To the best of our knowledge, WaGI is the first attempt to interpret GAN inversion in the frequency domain. We demonstrate that WaGI shows outstanding results on both inversion and editing compared to the existing state-of-the-art GAN inversion models. In particular, WaGI robustly preserves high-frequency features of images even in the editing scenario. We will release our code with the pre-trained model after the review.
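    A wavelet-based loss of the kind described can be sketched as follows: decompose both images with a 2D discrete wavelet transform and penalize differences only in the detail (high-frequency) sub-bands. The Haar wavelet and plain L2 weighting are our assumptions; the paper's exact loss may differ.

        import numpy as np
        import pywt

        def high_freq_l2(img_a, img_b, wavelet="haar"):
            """L2 distance restricted to the high-frequency wavelet sub-bands."""
            _, (h_a, v_a, d_a) = pywt.dwt2(img_a, wavelet)  # drop approximation cA
            _, (h_b, v_b, d_b) = pywt.dwt2(img_b, wavelet)
            return sum(np.mean((x - y) ** 2)
                       for x, y in zip((h_a, v_a, d_a), (h_b, v_b, d_b)))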
    Understanding CNN Fragility When Learning With Imbalanced Data. (arXiv:2210.09465v1 [cs.CV])
    Convolutional neural networks (CNNs) have achieved impressive results on imbalanced image data, but they still have difficulty generalizing to minority classes, and their decisions are difficult to interpret. These problems are related because the method by which CNNs generalize to minority classes, which requires improvement, is wrapped in a black box. To demystify CNN decisions on imbalanced data, we focus on their latent features. Although CNNs embed the pattern knowledge learned from a training set in model parameters, the effect of this knowledge is contained in feature and classification embeddings (FE and CE). These embeddings can be extracted from a trained model and their global, class properties (e.g., frequency, magnitude and identity) can be analyzed. We find that important information regarding the ability of a neural network to generalize to minority classes resides in the class top-K CE and FE. We show that a CNN learns a limited number of class top-K CE per category, and that their number and magnitudes vary based on whether the same class is balanced or imbalanced. This calls into question whether a CNN has learned intrinsic class features, or merely frequently occurring ones that happen to exist in the sampled class distribution. We also hypothesize that latent class diversity is as important as the number of class examples, which has important implications for re-sampling and cost-sensitive methods. These methods generally focus on rebalancing model weights, class numbers and margins, instead of diversifying class latent features through augmentation. We also demonstrate that a CNN has difficulty generalizing to test data if the magnitudes of its top-K latent features do not match the training set. We use three popular image datasets and two cost-sensitive algorithms commonly employed in imbalanced learning for our experiments.
    An Improved Structured Mesh Generation Method Based on Physics-informed Neural Networks. (arXiv:2210.09546v1 [cs.GR])
    Mesh generation remains a key technology in many areas where numerical simulations are required. As numerical algorithms become more efficient and computers become more powerful, the percentage of time devoted to mesh generation becomes higher. In this paper, we present an improved structured mesh generation method. The method formulates the meshing problem as a global optimization problem related to a physics-informed neural network. The mesh is obtained by intelligently solving the physical boundary-constrained partial differential equations. To improve the prediction accuracy of the neural network, we also introduce a novel auxiliary line strategy and an efficient network model during meshing. The strategy first employs a priori auxiliary lines to provide ground truth data and then uses these data to construct a loss term to better constrain the convergence of the subsequent training. The experimental results indicate that the proposed method is effective and robust. It can accurately approximate the mapping (transformation) from the computational domain to the physical domain and enable fast high-quality structured mesh generation.
    Anisotropic Multi-Scale Graph Convolutional Network for Dense Shape Correspondence. (arXiv:2210.09466v1 [cs.CV])
    This paper studies 3D dense shape correspondence, a key shape analysis application in computer vision and graphics. We introduce a novel hybrid geometric deep learning-based model that learns geometrically meaningful and discretization-independent features with a U-Net model as the primary node feature extraction module, followed by a successive spectral-based graph convolutional network. To create a diverse set of filters, we use anisotropic wavelet basis filters, being sensitive to both different directions and band-passes. This filter set overcomes the over-smoothing behavior of conventional graph neural networks. To further improve the model's performance, we add a function that perturbs the feature maps in the last layer ahead of fully connected layers, forcing the network to learn more discriminative features overall. The resulting correspondence maps show state-of-the-art performance on the benchmark datasets based on average geodesic errors and superior robustness to discretization in 3D meshes. Our approach provides new insights and practical solutions to the dense shape correspondence research.
    An enhanced method of initial cluster center selection for K-means algorithm. (arXiv:2210.09507v1 [cs.LG])
    Clustering is one of the widely used techniques to find patterns in a dataset that can be applied in different applications or analyses. K-means, the most popular and simplest clustering algorithm, might get trapped in local minima if not properly initialized, and its standard initialization is random. In this paper, we propose a novel approach to improve initial cluster selection for the K-means algorithm. The approach is based on the fact that the initial centroids must be well separated from each other, since the final clusters are separated groups in feature space. The convex hull algorithm facilitates computing the first two centroids, and the remaining ones are selected according to their distance from the previously selected centers. To ensure the selection of one center per cluster, we use the nearest-neighbor technique. To check the robustness of our proposed algorithm, we consider several real-world datasets. We obtain clustering errors of only 7.33%, 7.90%, and 0% on the Iris, Letter, and Ruspini data, respectively, demonstrating better performance than existing systems. The results indicate that our proposed method outperforms the conventional K-means approach by accelerating the computation when the number of clusters is greater than 2.
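    One plausible reading of this initialization is sketched below: take the two most distant convex-hull vertices as the first centroids, then grow the set by farthest-point selection. The exact pairing rule and the nearest-neighbor safeguard in the paper may differ, so treat this as an illustration rather than the authors' code.

        import numpy as np
        from scipy.spatial import ConvexHull

        def init_centroids(X, k):
            """Pick k well-separated initial centroids for K-means."""
            hull_pts = X[ConvexHull(X).vertices]
            # First two centroids: the most distant pair of hull vertices.
            d = np.linalg.norm(hull_pts[:, None] - hull_pts[None, :], axis=-1)
            i, j = np.unravel_index(d.argmax(), d.shape)
            centers = [hull_pts[i], hull_pts[j]]
            # Remaining centroids: farthest point from the centers chosen so far.
            while len(centers) < k:
                dists = np.min([np.linalg.norm(X - c, axis=1) for c in centers],
                               axis=0)
                centers.append(X[dists.argmax()])
            return np.stack(centers)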
    Improving Adversarial Robustness by Contrastive Guided Diffusion Process. (arXiv:2210.09643v1 [cs.LG])
    Synthetic data generation has become an emerging tool to help improve the adversarial robustness in classification tasks since robust learning requires a significantly larger amount of training samples compared with standard classification tasks. Among various deep generative models, the diffusion model has been shown to produce high-quality synthetic images and has achieved good performance in improving the adversarial robustness. However, diffusion-type methods are typically slow in data generation as compared with other generative models. Although different acceleration techniques have been proposed recently, it is also of great importance to study how to improve the sample efficiency of generated data for the downstream task. In this paper, we first analyze the optimality condition of synthetic distribution for achieving non-trivial robust accuracy. We show that enhancing the distinguishability among the generated data is critical for improving adversarial robustness. Thus, we propose the Contrastive-Guided Diffusion Process (Contrastive-DP), which adopts the contrastive loss to guide the diffusion model in data generation. We verify our theoretical results using simulations and demonstrate the good performance of Contrastive-DP on image datasets.
    TFAD: A Decomposition Time Series Anomaly Detection Architecture with Time-Frequency Analysis. (arXiv:2210.09693v1 [cs.LG])
    Time series anomaly detection is a challenging problem due to complex temporal dependencies and limited labeled data. Although some algorithms, including both traditional and deep models, have been proposed, most of them mainly focus on time-domain modeling and do not fully utilize the information in the frequency domain of the time series data. In this paper, we propose a Time-Frequency analysis based time series Anomaly Detection model, or TFAD for short, to exploit both the time and frequency domains for performance improvement. Besides, we incorporate time series decomposition and data augmentation mechanisms into the designed time-frequency architecture to further boost performance and interpretability. Empirical studies on widely used benchmark datasets show that our approach obtains state-of-the-art performance in univariate and multivariate time series anomaly detection tasks. Code is provided at https://github.com/DAMO-DI-ML/CIKM22-TFAD.
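    To give a flavor of what exploiting the frequency domain means here, the toy featurizer below augments time-domain statistics of a sliding window with its normalized FFT magnitude spectrum; TFAD itself is a neural architecture, so this is intuition-level only.

        import numpy as np

        def time_freq_features(window):
            """Concatenate simple time- and frequency-domain features of a window."""
            spectrum = np.abs(np.fft.rfft(window))  # magnitude spectrum
            return np.concatenate([
                [window.mean(), window.std()],        # time-domain summary
                spectrum / (spectrum.sum() + 1e-12),  # normalized frequency profile
            ])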
    Split-KalmanNet: A Robust Model-Based Deep Learning Approach for SLAM. (arXiv:2210.09636v1 [eess.SP])
    Simultaneous localization and mapping (SLAM) is a method that constructs a map of an unknown environment and localizes the position of a moving agent on the map simultaneously. The extended Kalman filter (EKF) has been widely adopted as a low-complexity solution for online SLAM, which relies on a motion and measurement model of the moving agent. In practice, however, acquiring precise information about these models is very challenging, and the model mismatch effect causes severe performance loss in SLAM. In this paper, inspired by the recently proposed KalmanNet, we present a robust EKF algorithm using the power of deep learning for online SLAM, referred to as Split-KalmanNet. The key idea of Split-KalmanNet is to compute the Kalman gain using the Jacobian matrix of a measurement function and two recurrent neural networks (RNNs). The two RNNs independently learn the covariance matrices for a prior state estimate and the innovation from data. The proposed split structure in the computation of the Kalman gain makes it possible to compensate for state- and measurement-model mismatch effects independently. Numerical simulation results verify that Split-KalmanNet outperforms the traditional EKF and the state-of-the-art KalmanNet algorithm in various model mismatch scenarios.
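    For reference, the quantity being re-derived with learned covariances is the textbook (extended) Kalman gain

    $$K_t \;=\; P_{t \mid t-1} H_t^{\top} S_t^{-1}, \qquad S_t \;=\; H_t P_{t \mid t-1} H_t^{\top} + R_t,$$

    where $H_t$ is the Jacobian of the measurement function, $P_{t \mid t-1}$ the prior state-estimate covariance, and $S_t$ the innovation covariance; in Split-KalmanNet the two RNNs supply learned estimates of $P_{t \mid t-1}$ and $S_t$ in place of the model-derived ones.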
    Towards Fair Classification against Poisoning Attacks. (arXiv:2210.09503v1 [cs.LG])
    Fair classification aims to train classification models that achieve equality (of treatment or prediction quality) across different sensitive groups. However, fair classification can be at risk from poisoning attacks that deliberately insert malicious training samples to manipulate the trained classifiers' performance. In this work, we study a poisoning scenario where the attacker can insert a small fraction of samples into the training data, with arbitrary sensitive attributes as well as other predictive features. We demonstrate that fairly trained classifiers can be highly vulnerable to such poisoning attacks, suffering a much worse accuracy-fairness trade-off, even when we apply some of the most effective defenses (originally proposed for traditional classification tasks). As a countermeasure for fair classification tasks, we propose a general and theoretically guaranteed framework that adapts traditional defense methods to defend fair classification against poisoning attacks. Extensive experiments validate that the proposed defense framework obtains better robustness, in terms of both accuracy and fairness, than representative baseline methods.
    Adan: Adaptive Nesterov Momentum Algorithm for Faster Optimizing Deep Models. (arXiv:2208.06677v3 [cs.LG] UPDATED)
    Adaptive gradient algorithms borrow the moving-average idea of heavy-ball acceleration to estimate accurate first- and second-order moments of the gradient for accelerating convergence. However, Nesterov acceleration, which converges faster than heavy-ball acceleration in theory and in many empirical cases, is much less investigated under the adaptive-gradient setting. In this work, we propose the ADAptive Nesterov momentum algorithm, Adan for short, to effectively speed up the training of deep neural networks. Adan first reformulates the vanilla Nesterov acceleration to develop a new Nesterov momentum estimation (NME) method, which avoids the extra computation and memory overhead of computing the gradient at the extrapolation point. Then Adan adopts NME to estimate the first- and second-order moments of the gradient in adaptive gradient algorithms for convergence acceleration. Besides, we prove that Adan finds an $\epsilon$-approximate first-order stationary point within $O(\epsilon^{-3.5})$ stochastic gradient complexity on nonconvex stochastic problems (e.g., deep learning problems), matching the best-known lower bound. Extensive experimental results show that Adan surpasses the corresponding SoTA optimizers on both vision transformers (ViTs) and CNNs, and sets new SoTAs for many popular networks, e.g., ResNet, ConvNext, ViT, Swin, MAE, LSTM, Transformer-XL, and BERT. More surprisingly, Adan can use half the training cost (epochs) of SoTA optimizers to achieve higher or comparable performance on ViT, ResNet, etc., and also shows great tolerance to a large range of minibatch sizes, e.g., from 1k to 32k. We hope Adan can contribute to the development of deep learning by reducing training costs and relieving the engineering burden of trying different optimizers on various architectures. Code is released at https://github.com/sail-sg/Adan.
    No Pairs Left Behind: Improving Metric Learning with Regularized Triplet Objective. (arXiv:2210.09506v1 [cs.LG])
    We propose a novel formulation of the triplet objective function that improves metric learning without additional sample mining or overhead costs. Our approach aims to explicitly regularize the distance between the positive and negative samples in a triplet with respect to the anchor-negative distance. As an initial validation, we show that our method (called No Pairs Left Behind [NPLB]) improves upon the traditional and current state-of-the-art triplet objective formulations on standard benchmark datasets. To show the effectiveness and potentials of NPLB on real-world complex data, we evaluate our approach on a large-scale healthcare dataset (UK Biobank), demonstrating that the embeddings learned by our model significantly outperform all other current representations on tested downstream tasks. Additionally, we provide a new model-agnostic single-time health risk definition that, when used in tandem with the learned representations, achieves the most accurate prediction of subjects' future health complications. Our results indicate that NPLB is a simple, yet effective framework for improving existing deep metric learning models, showcasing the potential implications of metric learning in more complex applications, especially in the biological and healthcare domains.
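    Since the abstract specifies that the positive-negative distance is regularized with respect to the anchor-negative distance, one plausible instantiation is the sketch below; the exact regularizer form and its weight lam are our assumptions, not the paper's definition.

        import torch.nn.functional as F

        def nplb_style_triplet(anchor, positive, negative, margin=1.0, lam=0.1):
            """Triplet loss plus a distance-tying regularizer (hedged sketch)."""
            d_ap = F.pairwise_distance(anchor, positive)
            d_an = F.pairwise_distance(anchor, negative)
            d_pn = F.pairwise_distance(positive, negative)
            triplet = F.relu(d_ap - d_an + margin)  # standard triplet term
            reg = (d_pn - d_an).abs()               # assumed regularizer form
            return (triplet + lam * reg).mean()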
    A Mixing Time Lower Bound for a Simplified Version of BART. (arXiv:2210.09352v1 [stat.ML])
    Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, biostatistics, and causal inference. BART uses Markov Chain Monte Carlo (MCMC) to obtain approximate posterior samples over a parameterized space of sums of trees, but it has often been observed that the chains are slow to mix. In this paper, we provide the first lower bound on the mixing time for a simplified version of BART in which we reduce the sum to a single tree and use a subset of the possible moves for the MCMC proposal distribution. Our lower bound for the mixing time grows exponentially with the number of data points. Inspired by this new connection between the mixing time and the number of data points, we perform rigorous simulations on BART. We show qualitatively that BART's mixing time increases with the number of data points. The slow mixing time of the simplified BART suggests a large variation between different runs of the simplified BART algorithm, and a similar large variation is known for BART in the literature. This large variation could result in a lack of stability in the models, predictions, and posterior intervals obtained from the BART MCMC samples. Our lower bound and simulations suggest increasing the number of chains with the number of data points.
    Deepfake Text Detection: Limitations and Opportunities. (arXiv:2210.09421v1 [cs.CR])
    Recent advances in generative models for language have enabled the creation of convincing synthetic text or deepfake text. Prior work has demonstrated the potential for misuse of deepfake text to mislead content consumers. Therefore, deepfake text detection, the task of discriminating between human and machine-generated text, is becoming increasingly critical. Several defenses have been proposed for deepfake text detection. However, we lack a thorough understanding of their real-world applicability. In this paper, we collect deepfake text from 4 online services powered by Transformer-based tools to evaluate the generalization ability of the defenses on content in the wild. We develop several low-cost adversarial attacks, and investigate the robustness of existing defenses against an adaptive attacker. We find that many defenses show significant degradation in performance under our evaluation scenarios compared to their original claimed performance. Our evaluation shows that tapping into the semantic information in the text content is a promising approach for improving the robustness and generalization performance of deepfake text detection schemes.
    AMPNet: Attention as Message Passing for Graph Neural Networks. (arXiv:2210.09475v1 [cs.LG])
    Feature-level interactions between nodes can carry crucial information for understanding complex interactions in graph-structured data. Current interpretability techniques, however, are limited in their ability to capture feature-level interactions between different nodes. In this work, we propose AMPNet, a general Graph Neural Network (GNN) architecture for uncovering feature-level interactions between different spatial locations within graph-structured data. Our framework applies a multiheaded attention operation during message-passing to contextualize messages based on the feature interactions between different nodes. We evaluate AMPNet on several benchmark and real-world datasets, and develop a synthetic benchmark based on cyclic cellular automata to test the ability of our framework to recover cyclic patterns in node states based on feature-interactions. We also propose several methods for addressing the scalability of our architecture to large graphs, including subgraph sampling during training and node feature downsampling.
    SVLDL: Improved Speaker Age Estimation Using Selective Variance Label Distribution Learning. (arXiv:2210.09524v1 [cs.SD])
    Estimating a speaker's age from a single utterance is a classic and challenging topic. Although Label Distribution Learning (LDL) can represent adjacent, indistinguishable ages well, the uncertainty of the age estimate varies from utterance to utterance, i.e., the variance of the age distribution differs across speakers. To address this issue, we propose the selective variance label distribution learning (SVLDL) method to adapt the variance of different age distributions. Furthermore, the model uses WavLM as the speech feature extractor and adds an auxiliary gender-recognition task to further improve performance. Two tricks are applied to the loss function to enhance the robustness of the age estimation and improve the quality of the fitted age distribution. Extensive experiments show that the model achieves state-of-the-art performance on all aspects of the NIST SRE08-10 and a real-world dataset.
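    As background, standard LDL for age estimation converts each true age into a Gaussian distribution over age labels and fits the model with a KL objective; the per-utterance variance sigma in the sketch below is exactly the quantity SVLDL adapts (the age range and default sigma are our assumptions).

        import torch
        import torch.nn.functional as F

        def age_label_distribution(age, n_ages=101, sigma=2.0):
            """Gaussian soft-label distribution over ages 0..n_ages-1."""
            ages = torch.arange(n_ages, dtype=torch.float32)
            dist = torch.exp(-0.5 * ((ages - age) / sigma) ** 2)
            return dist / dist.sum()

        def ldl_loss(pred_logits, age, sigma=2.0):
            """KL divergence between predicted and target age distributions."""
            target = age_label_distribution(age, pred_logits.shape[-1], sigma)
            return F.kl_div(F.log_softmax(pred_logits, dim=-1), target,
                            reduction="sum")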
    A tradeoff between universality of equivariant models and learnability of symmetries. (arXiv:2210.09444v1 [stat.ML])
    We prove an impossibility result, which in the context of function learning says the following: under certain conditions, it is impossible to simultaneously learn symmetries and functions equivariant under them using an ansatz consisting of equivariant functions. To formalize this statement, we carefully study notions of approximation for groups and semigroups. We analyze certain families of neural networks for whether they satisfy the conditions of the impossibility result: what we call ``linearly equivariant'' networks, and group-convolutional networks. A lot can be said precisely about linearly equivariant networks, making them theoretically useful. On the practical side, our analysis of group-convolutional neural networks allows us to generalize the well-known ``convolution is all you need'' theorem to non-homogeneous spaces. We additionally find an important difference between group convolution and semigroup convolution.
    Planning for Sample Efficient Imitation Learning. (arXiv:2210.09598v1 [cs.LG])
    Imitation learning is a class of promising policy learning algorithms that is free from many practical issues with reinforcement learning, such as the reward design issue and the exploration hardness. However, current imitation algorithms struggle to achieve both high performance and high in-environment sample efficiency simultaneously. Behavioral Cloning (BC) does not need in-environment interactions, but it suffers from the covariate shift problem, which harms its performance. Adversarial Imitation Learning (AIL) turns imitation learning into a distribution matching problem. It can achieve better performance on some tasks, but it requires a large number of in-environment interactions. Inspired by the recent success of EfficientZero in RL, we propose EfficientImitate (EI), a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously. Our algorithmic contribution in this paper is two-fold. First, we extend AIL into the MCTS-based RL. Second, we show that the seemingly incompatible two classes of imitation algorithms (BC and AIL) can be naturally unified under our framework, enjoying the benefits of both. We benchmark our method not only on the state-based DeepMind Control Suite, but also on the image version, which many previous works find highly challenging. Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency. EI shows over a 4x gain in performance in the limited-sample setting on state-based and image-based tasks and can solve challenging problems like Humanoid, where previous methods fail with a small number of interactions. Our code is available at https://github.com/zhaohengyin/EfficientImitate.
    Neural network accelerator for quantum control. (arXiv:2208.02645v2 [quant-ph] UPDATED)
    Efficient quantum control is necessary for practical quantum computing implementations with current technologies. Conventional algorithms for determining optimal control parameters are computationally expensive, largely excluding them from use outside of simulation. Existing hardware solutions structured as lookup tables are imprecise and costly. By designing a machine learning model to approximate the results of traditional tools, a more efficient method can be produced. Such a model can then be synthesized into a hardware accelerator for use in quantum systems. In this study, we demonstrate a machine learning algorithm for predicting optimal pulse parameters. This algorithm is lightweight enough to fit on a low-resource FPGA and perform inference with a latency of 175 ns and a pipeline interval of 5 ns, with $>0.99$ gate fidelity. In the long term, such an accelerator could be used near quantum computing hardware where traditional computers cannot operate, enabling quantum control at a reasonable cost at low latencies without incurring large data bandwidths outside of the cryogenic environment.
    Joint rotational invariance and adversarial training of a dual-stream Transformer yields state of the art Brain-Score for Area V4. (arXiv:2203.06649v3 [q-bio.NC] UPDATED)
    Modern high-scoring models of vision in the Brain-Score competition do not stem from Vision Transformers. However, in this paper we provide evidence against the unexpected trend of Vision Transformers (ViTs) being not perceptually aligned with human visual representations, by showing how a dual-stream Transformer, a CrossViT à la Chen et al. (2021), under a joint rotationally-invariant and adversarial optimization procedure yields 2nd place in the aggregate Brain-Score 2022 competition (Schrimpf et al., 2020b) averaged across all visual categories, and at the time of the competition held 1st place for the highest explainable variance of area V4. In addition, our current Transformer-based model also achieves greater explainable variance for areas V4, IT and Behaviour than a biologically-inspired CNN (ResNet50) that integrates a frontal V1-like computation module (Dapello et al., 2020). To assess the contribution of the optimization scheme with respect to the CrossViT architecture, we perform several additional experiments on differently optimized CrossViTs regarding adversarial robustness, common corruption benchmarks, mid-ventral stimuli interpretation and feature inversion. Against our initial expectations, our family of results provides tentative support for an "All roads lead to Rome" argument enforced via a joint optimization rule even for non-biologically-motivated models of vision such as Vision Transformers. Code is available at https://github.com/williamberrios/BrainScore-Transformers
    TorchDIVA: An Extensible Computational Model of Speech Production built on an Open-Source Machine Learning Library. (arXiv:2210.09334v1 [eess.AS])
    The DIVA model is a computational model of speech motor control that combines a simulation of the brain regions responsible for speech production with a model of the human vocal tract. The model is currently implemented in Matlab Simulink; however, this is less than ideal as most of the development in speech technology research is done in Python. This means there is a wealth of machine learning tools which are freely available in the Python ecosystem that cannot be easily integrated with DIVA. We present TorchDIVA, a full rebuild of DIVA in Python using PyTorch tensors. DIVA source code was directly translated from Matlab to Python, and built-in Simulink signal blocks were implemented from scratch. After implementation, the accuracy of each module was evaluated via systematic block-by-block validation. The TorchDIVA model is shown to produce outputs that closely match those of the original DIVA model, with a negligible difference between the two. We additionally present an example of the extensibility of TorchDIVA as a research platform. Speech quality enhancement in TorchDIVA is achieved through an integration with an existing PyTorch generative vocoder called DiffWave. A modified DiffWave mel-spectrum upsampler was trained on human speech waveforms and conditioned on the TorchDIVA speech production. The results indicate improved speech quality metrics in the DiffWave-enhanced output as compared to the baseline. This enhancement would have been difficult or impossible to accomplish in the original Matlab implementation. This proof-of-concept demonstrates the value TorchDIVA will bring to the research community. Researchers can download the new implementation at: https://github.com/skinahan/DIVA_PyTorch
    A Transfer Learning Based Approach for Classification of COVID-19 and Pneumonia in CT Scan Imaging. (arXiv:2210.09403v1 [eess.IV])
    The world is still overwhelmed by the spread of the COVID-19 virus. With over 250 million infected cases as of November 2021, affecting 219 countries and territories, the world remains in the pandemic period. Detecting COVID-19 using deep learning methods on CT scan images can play a vital role in assisting medical professionals and decision authorities in controlling the spread of the disease and providing essential support for patients. Convolutional neural networks are widely used in the field of large-scale image recognition. The current method of diagnosing COVID-19, RT-PCR, is time-consuming and of limited availability worldwide. This research aims to propose a deep learning-based approach to classify COVID-19 pneumonia patients, bacterial pneumonia, viral pneumonia, and healthy (normal) cases. This paper uses deep transfer learning to classify the data via the Inception-ResNet-V2 neural network architecture. The proposed model has been intentionally simplified to reduce the implementation cost so that it can be easily deployed and used in different geographical areas, especially rural and developing regions.
    Random Orthogonalization for Federated Learning in Massive MIMO Systems. (arXiv:2210.09881v1 [cs.IT])
    We propose a novel communication design, termed random orthogonalization, for federated learning (FL) in a massive multiple-input multiple-output (MIMO) wireless system. The key novelty of random orthogonalization comes from the tight coupling of FL and two unique characteristics of massive MIMO -- channel hardening and favorable propagation. As a result, random orthogonalization can achieve natural over-the-air model aggregation without requiring transmitter side channel state information (CSI) for the uplink phase of FL, while significantly reducing the channel estimation overhead at the receiver. We extend this principle to the downlink communication phase and develop a simple but highly effective model broadcast method for FL. We also relax the massive MIMO assumption by proposing an enhanced random orthogonalization design for both uplink and downlink FL communications that does not rely on channel hardening or favorable propagation. Theoretical analyses with respect to both communication and machine learning performance are carried out. In particular, an explicit relationship among the convergence rate, the number of clients, and the number of antennas is established. Experimental results validate the effectiveness and efficiency of random orthogonalization for FL in massive MIMO.
    Automatic Emergency Dust-Free solution on-board International Space Station with Bi-GRU (AED-ISS). (arXiv:2210.08549v1 [stat.AP] CROSS LISTED)
    With rising attention to the issue of PM2.5 and PM0.3, particulate matter has become not only a potential threat to the environment and human health, but also a hazard to instruments onboard the International Space Station (ISS). Our team aims to relate various concentrations of particulate matter to magnetic fields, humidity, acceleration, temperature, pressure and CO2 concentration. Our goal is to establish an early warning system (EWS) that forecasts particulate levels and provides astronauts ample reaction time to protect their instruments in some experiments or increase the accuracy of their measurements; in addition, the constructed model can be further developed into a prototype of a remote-sensing smoke alarm for applications related to fires. In this article, we implement the Bi-GRU (Bidirectional Gated Recurrent Unit) algorithm, which collects data from the past 90 minutes and predicts the level of particulates larger than 2.5 micrometers per 0.1 liter for the next minute, which serves as an early warning.
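    A minimal PyTorch sketch of the kind of Bi-GRU forecaster described (feature count, hidden size, and window shape are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

class BiGRUForecaster(nn.Module):
    """Bidirectional GRU mapping 90 minutes of sensor readings to a
    one-step-ahead particulate-level prediction."""
    def __init__(self, n_features=7, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)  # 2x for forward + backward states

    def forward(self, x):                     # x: (batch, 90, n_features)
        out, _ = self.gru(x)
        return self.head(out[:, -1, :])       # predict from the last time step

model = BiGRUForecaster()
window = torch.randn(8, 90, 7)                # 8 samples, 90 minutes, 7 sensors
print(model(window).shape)                    # torch.Size([8, 1])
```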
    On Accelerated Perceptrons and Beyond. (arXiv:2210.09371v1 [cs.LG])
    The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function that correctly classifies $n$ linearly separable data points, assuming the classes are separated by some margin $\gamma > 0$. A foundational result is that the Perceptron converges after $O(1/\gamma^{2})$ iterations. Several recent works have managed to improve this rate by a quadratic factor, to $O(\sqrt{\log n}/\gamma)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) for the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) we provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) for the classical $p$-norm Perceptron problem, we provide an algorithm with an $O(\sqrt{(p-1)\log n}/\gamma)$ convergence rate, whereas existing algorithms suffer an $O((p-1)/\gamma^2)$ convergence rate.
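    For reference, the classical Perceptron that these accelerated methods improve upon is only a few lines; this is the textbook algorithm, not the paper's accelerated variant:

```python
import numpy as np

def perceptron(X, y, max_epochs=10_000):
    """Classical Rosenblatt Perceptron. For linearly separable data with
    margin gamma, the number of updates is bounded by O(1/gamma^2)."""
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:  # misclassified (or on the boundary)
                w += yi * xi        # standard Perceptron update
                mistakes += 1
        if mistakes == 0:
            return w                # converged: every point classified correctly
    return w

# Toy separable data with labels in {-1, +1}.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.where(X @ np.array([1.0, -2.0]) > 0, 1, -1)
w = perceptron(X, y)
print(np.all(y * (X @ w) > 0))      # True
```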
    How Would The Viewer Feel? Estimating Wellbeing From Video Scenarios. (arXiv:2210.10039v1 [cs.CV])
    In recent years, deep neural networks have demonstrated increasingly strong abilities to recognize objects and activities in videos. However, as video understanding becomes widely used in real-world applications, a key consideration is developing human-centric systems that understand not only the content of the video but also how it would affect the wellbeing and emotional state of viewers. To facilitate research in this setting, we introduce two large-scale datasets with over 60,000 videos manually annotated for emotional response and subjective wellbeing. The Video Cognitive Empathy (VCE) dataset contains annotations for distributions of fine-grained emotional responses, allowing models to gain a detailed understanding of affective states. The Video to Valence (V2V) dataset contains annotations of relative pleasantness between videos, which enables predicting a continuous spectrum of wellbeing. In experiments, we show how video models that are primarily trained to recognize actions and find contours of objects can be repurposed to understand human preferences and the emotional content of videos. Although there is room for improvement, predicting wellbeing and emotional response is on the horizon for state-of-the-art models. We hope our datasets can help foster further advances at the intersection of commonsense video understanding and human preference learning.
    Towards Generating Adversarial Examples on Mixed-type Data. (arXiv:2210.09405v1 [cs.LG])
    The existence of adversarial attacks (or adversarial examples) raises serious concerns about the safety of machine learning (ML) models. For many safety-critical ML tasks, such as financial forecasting, fraud detection, and anomaly detection, the data samples are usually mixed-type, containing both numerical and categorical features. However, how to generate adversarial examples for mixed-type data is still seldom studied. In this paper, we propose a novel attack algorithm, M-Attack, which can effectively generate adversarial examples for mixed-type data. Based on M-Attack, attackers can attempt to mislead the targeted classification model's prediction by only slightly perturbing both the numerical and categorical features in the given data samples. More importantly, by adding designed regularizations, our generated adversarial examples can evade potential detection models, which makes the attack truly insidious. Through extensive empirical studies, we validate the effectiveness and efficiency of our attack method and evaluate the robustness of existing classification models against our proposed attack. The experimental results highlight the feasibility of generating adversarial examples against machine learning models in real-world applications.
    Probabilistic Categorical Adversarial Attack & Adversarial Training. (arXiv:2210.09364v1 [cs.LG])
    The existence of adversarial examples raises serious concerns about applying Deep Neural Networks (DNNs) to safety-critical tasks. However, how to generate adversarial examples with categorical data is an important problem that lacks extensive exploration. Previously established methods leverage greedy search, which can be very time-consuming when conducting a successful attack. This also limits the development of adversarial training and potential defenses for categorical data. To tackle this problem, we propose the Probabilistic Categorical Adversarial Attack (PCAA), which transforms the discrete optimization problem into a continuous problem that can be solved efficiently by Projected Gradient Descent. In our paper, we theoretically analyze its optimality and time complexity to demonstrate its significant advantage over current greedy-based attacks. Moreover, based on our attack, we propose an efficient adversarial training framework. Through a comprehensive empirical study, we justify the effectiveness of our proposed attack and defense algorithms.
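    A hedged sketch of the core relaxation idea as described (optimize logits whose softmax defines a distribution over categories, then discretize); the details below, including the victim model and optimizer, are illustrative assumptions rather than the paper's exact algorithm:

```python
import torch
import torch.nn.functional as F

def probabilistic_categorical_attack(model, x_onehot, y, steps=50, lr=0.1):
    """Relax categorical features into probability vectors and run gradient
    ascent on the classification loss, then discretize back to one-hot.
    x_onehot: (n_features, n_categories) one-hot encoding of one sample."""
    logits = torch.log(x_onehot + 1e-6).clone().requires_grad_(True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        probs = F.softmax(logits, dim=-1)  # relaxed categorical features
        loss = -F.cross_entropy(model(probs.flatten().unsqueeze(0)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    # Discretize: pick the most probable category for each feature.
    return F.one_hot(logits.argmax(dim=-1), x_onehot.shape[-1]).float()

# Toy victim classifier over 3 categorical features with 4 categories each.
model = torch.nn.Linear(12, 2)
x = F.one_hot(torch.tensor([0, 2, 1]), 4).float()
adv = probabilistic_categorical_attack(model, x, torch.tensor([1]))
```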
    Federated Graph Machine Learning: A Survey of Concepts, Techniques, and Applications. (arXiv:2207.11812v2 [cs.LG] UPDATED)
    Graph machine learning has gained great attention in both academia and industry recently. Most of the graph machine learning models, such as Graph Neural Networks (GNNs), are trained over massive graph data. However, in many real-world scenarios, such as hospitalization prediction in healthcare systems, the graph data is usually stored at multiple data owners and cannot be directly accessed by any other parties due to privacy concerns and regulation restrictions. Federated Graph Machine Learning (FGML) is a promising solution to tackle this challenge by training graph machine learning models in a federated manner. In this survey, we conduct a comprehensive review of the literature in FGML. Specifically, we first provide a new taxonomy to divide the existing problems in FGML into two settings, namely, FL with structured data and structured FL. Then, we review the mainstream techniques in each setting and elaborate on how they address the challenges under FGML. In addition, we summarize the real-world applications of FGML from different domains and introduce open graph datasets and platforms adopted in FGML. Finally, we present several limitations in the existing studies with promising research directions in this field.
    Composite Spatial Monte Carlo Integration Based on Generalized Least Squares. (arXiv:2204.03248v3 [stat.CO] UPDATED)
    Although evaluating expectations on the Ising model is essential in various applications, it is mostly infeasible because of intractable multiple summations. Spatial Monte Carlo integration (SMCI) is a sampling-based approximation method that can provide high-accuracy estimates of such intractable expectations. To evaluate the expectation of a function of variables in a specific region (called the target region), SMCI considers a larger region containing the target region (called the sum region). In SMCI, the multiple summation for the variables in the sum region is executed exactly, and that in the outer region is evaluated by a sampling approximation such as standard Monte Carlo integration. The accuracy of the SMCI estimator is guaranteed to improve monotonically as the size of the sum region increases. However, a haphazard expansion of the sum region could cause a combinatorial explosion, so we seek to improve the accuracy without such an expansion. In this paper, based on the theory of generalized least squares (GLS), a new effective method is proposed that combines multiple SMCI estimators. The validity of the proposed method is demonstrated theoretically and numerically. The results indicate that the proposed method can be effective in the inverse Ising problem (or Boltzmann machine learning).
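    A generic numpy sketch of the GLS combination step (an editor's illustration, not the paper's SMCI construction): given several unbiased estimators of the same expectation with covariance matrix $C$, the minimum-variance unbiased combination uses weights $w = C^{-1}\mathbf{1} / (\mathbf{1}^\top C^{-1}\mathbf{1})$.

```python
import numpy as np

def gls_combine(estimates, cov):
    """Combine unbiased estimators of the same quantity by generalized least
    squares: w = C^{-1} 1 / (1^T C^{-1} 1) minimizes the variance of the
    combination subject to the weights summing to one."""
    ones = np.ones(len(estimates))
    cinv_ones = np.linalg.solve(cov, ones)
    w = cinv_ones / (ones @ cinv_ones)
    return w @ estimates, w

# Three noisy, correlated estimates of the same expectation (toy values).
theta = np.array([1.02, 0.97, 1.10])
C = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.02, 0.01],
              [0.00, 0.01, 0.09]])
combined, weights = gls_combine(theta, C)
print(combined, weights)  # the weights sum to 1
```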
    RibSeg v2: A Large-scale Benchmark for Rib Labeling and Anatomical Centerline Extraction. (arXiv:2210.09309v1 [eess.IV])
    Automatic rib labeling and anatomical centerline extraction are common prerequisites for various clinical applications. Prior studies either use in-house datasets that are inaccessible to communities, or focus on rib segmentation that neglects the clinical significance of rib labeling. To address these issues, we extend our prior dataset (RibSeg) on the binary rib segmentation task to a comprehensive benchmark, named RibSeg v2, with 660 CT scans (15,466 individual ribs in total) and annotations manually inspected by experts for rib labeling and anatomical centerline extraction. Based on RibSeg v2, we develop a pipeline including deep learning-based methods for rib labeling, and a skeletonization-based method for centerline extraction. To improve computational efficiency, we propose a sparse point cloud representation of CT scans and compare it with standard dense voxel grids. Moreover, we design and analyze evaluation metrics to address the key challenges of each task. Our dataset, code, and model are available online to facilitate open research at https://github.com/M3DV/RibSeg
    Sufficient Exploration for Convex Q-learning. (arXiv:2210.09409v1 [math.OC])
    In recent years there has been a collective research effort to find new formulations of reinforcement learning that are simultaneously more efficient and more amenable to analysis. This paper concerns one approach that builds on the linear programming (LP) formulation of optimal control of Manne. A primal version is called logistic Q-learning, and a dual variant is convex Q-learning. This paper focuses on the latter, while building bridges with the former. The main contributions follow: (i) The dual of convex Q-learning is not precisely Manne's LP or a version of logistic Q-learning, but has similar structure that reveals the need for regularization to avoid over-fitting. (ii) A sufficient condition is obtained for a bounded solution to the Q-learning LP. (iii) Simulation studies reveal numerical challenges when addressing sampled-data systems based on a continuous time model. The challenge is addressed using state-dependent sampling. The theory is illustrated with applications to examples from OpenAI Gym. It is shown that convex Q-learning is successful in cases where standard Q-learning diverges, such as the LQR problem.
    Robust Imitation of a Few Demonstrations with a Backwards Model. (arXiv:2210.09337v1 [cs.LG])
    Behavior cloning of expert demonstrations can speed up learning optimal policies in a more sample-efficient way over reinforcement learning. However, the policy cannot extrapolate well to unseen states outside of the demonstration data, creating covariate shift (agent drifting away from demonstrations) and compounding errors. In this work, we tackle this issue by extending the region of attraction around the demonstrations so that the agent can learn how to get back onto the demonstrated trajectories if it veers off-course. We train a generative backwards dynamics model and generate short imagined trajectories from states in the demonstrations. By imitating both demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto these paths. With optimal or near-optimal demonstrations, the learned policy will be both optimal and robust to deviations, with a wider region of attraction. On continuous control domains, we evaluate the robustness when starting from different initial states unseen in the demonstration data. While both our method and other imitation learning baselines can successfully solve the tasks for initial states in the training distribution, our method exhibits considerably more robustness to different initial states.
    Measures of Information Reflect Memorization. (arXiv:2210.09404v1 [cs.LG])
    Neural networks are known to exploit spurious artifacts (or shortcuts) that co-occur with a target label, exhibiting heuristic memorization. On the other hand, networks have been shown to memorize training examples, resulting in example-level memorization. These kinds of memorization impede generalization of networks beyond their training distributions. Detecting such memorization could be challenging, often requiring researchers to curate tailored test sets. In this work, we hypothesize -- and subsequently show -- that the diversity in the activation patterns of different neurons is reflective of model generalization and memorization. We quantify the diversity in the neural activations through information-theoretic measures and find support for our hypothesis on experiments spanning several natural language and vision tasks. Importantly, we discover that information organization points to the two forms of memorization, even for neural activations computed on unlabeled in-distribution examples. Lastly, we demonstrate the utility of our findings for the problem of model selection. The associated code and other resources for this work are available at https://linktr.ee/InformationMeasures.
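    One plausible instantiation of such a diversity measure (an editor's sketch; the paper's estimators may differ) is the per-neuron entropy of activations across examples:

```python
import numpy as np

def activation_entropy(activations, n_bins=30):
    """Per-neuron entropy of activation values across examples, a simple
    information-theoretic measure of activation diversity.
    activations: (n_examples, n_neurons) array."""
    entropies = []
    for neuron in activations.T:
        hist, _ = np.histogram(neuron, bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]  # avoid log(0)
        entropies.append(-(p * np.log2(p)).sum())
    return np.array(entropies)

acts = np.random.randn(1000, 64)  # placeholder network activations
print(activation_entropy(acts).mean())
```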
    Review Learning: Alleviating Catastrophic Forgetting with Generative Replay without Generator. (arXiv:2210.09394v1 [cs.AI])
    When a deep learning model is sequentially trained on different datasets, it forgets the knowledge acquired from previous data, a phenomenon known as catastrophic forgetting. This deteriorates the performance of the deep learning model on diverse datasets, which is critical in privacy-preserving deep learning (PPDL) applications based on transfer learning (TL). To overcome this, we propose review learning (RL), a generative-replay-based continual learning technique that does not require a separate generator. Data samples are generated from the memory stored within the synaptic weights of the deep learning model, and are used to review knowledge acquired from previous datasets. The performance of RL was validated through PPDL experiments. Simulations and real-world medical multi-institutional experiments were conducted using three types of binary classification electronic health record data. In the real-world experiments, the global area under the receiver operating characteristic curve was 0.710 for RL and 0.655 for TL. Thus, RL was highly effective in retaining previously learned knowledge.
    Tight Analysis of Extra-gradient and Optimistic Gradient Methods For Nonconvex Minimax Problems. (arXiv:2210.09382v1 [cs.LG])
    Despite the established convergence theory of Optimistic Gradient Descent Ascent (OGDA) and Extragradient (EG) methods for convex-concave minimax problems, little is known about the theoretical guarantees of these methods in nonconvex settings. To bridge this gap, for the first time, this paper establishes the convergence of OGDA and EG methods under the nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings by providing a unified analysis through the lens of single-call extra-gradient methods. We further establish lower bounds on the convergence of GDA/OGDA/EG, shedding light on the tightness of our analysis. We also conduct experiments supporting our theoretical results. We believe our results will advance the theoretical understanding of OGDA and EG methods for solving complicated nonconvex minimax real-world problems, e.g., Generative Adversarial Networks (GANs) or robust neural network training.
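    For concreteness, the structure of an EG update looks as follows on a toy bilinear saddle problem (an illustrative sketch, not the paper's experiments):

```python
import numpy as np

def extragradient(grad_x, grad_y, x, y, eta=0.3, iters=500):
    """Extragradient (EG) for min_x max_y f(x, y): take a gradient step to a
    midpoint, then update using the gradient evaluated at that midpoint."""
    for _ in range(iters):
        # Extrapolation (half) step.
        x_mid = x - eta * grad_x(x, y)
        y_mid = y + eta * grad_y(x, y)
        # Update step using the midpoint gradients.
        x = x - eta * grad_x(x_mid, y_mid)
        y = y + eta * grad_y(x_mid, y_mid)
    return x, y

# Bilinear saddle problem f(x, y) = x * y with saddle point at (0, 0).
gx = lambda x, y: y  # df/dx
gy = lambda x, y: x  # df/dy
x, y = extragradient(gx, gy, 1.0, 1.0)
print(x, y)          # both approach 0; plain GDA would cycle or diverge here
```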
    Online Convex Optimization with Unbounded Memory. (arXiv:2210.09903v1 [cs.LG])
    Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in some convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their chosen decision. However, in many of the motivating applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and existing generalizations thereof fail to capture this. In this work we introduce a generalization of the OCO framework, ``Online Convex Optimization with Unbounded Memory'', that captures long-term dependence on past decisions. We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on current losses. We prove an $O(\sqrt{H_1 T})$ policy regret bound and a stronger $O(\sqrt{H_p T})$ policy regret bound under mild additional assumptions. These bounds are optimal in terms of their dependence on the time horizon $T$. We show the broad applicability of our framework by using it to derive regret bounds, and to simplify existing regret bound derivations, for a variety of online learning problems including an online variant of performative prediction and online linear control.
    Attraction-Repulsion Spectrum in Neighbor Embeddings. (arXiv:2007.08902v4 [cs.LG] UPDATED)
    Neighbor embeddings are a family of methods for visualizing complex high-dimensional datasets using $k$NN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher $k$NN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimisation strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian Eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them.
    Distribution Compression in Near-linear Time. (arXiv:2111.07941v6 [stat.ML] UPDATED)
    In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.
    Non-stationary Transformers: Exploring the Stationarity in Time Series Forecasting. (arXiv:2205.14415v3 [cs.LG] UPDATED)
    Transformers have shown great power in time series forecasting due to their global-range modeling ability. However, their performance can degrade severely on non-stationary real-world data, in which the joint distribution changes over time. Previous studies primarily adopt stationarization to attenuate the non-stationarity of the original series for better predictability. But stationarized series, deprived of their inherent non-stationarity, can be less instructive for forecasting real-world bursty events. This problem, termed over-stationarization in this paper, leads Transformers to generate indistinguishable temporal attentions for different series and impedes the predictive capability of deep models. To tackle the dilemma between series predictability and model capability, we propose Non-stationary Transformers as a generic framework with two interdependent modules: Series Stationarization and De-stationary Attention. Concretely, Series Stationarization unifies the statistics of each input and converts the output with restored statistics for better predictability. To address the over-stationarization problem, De-stationary Attention is devised to recover the intrinsic non-stationary information into temporal dependencies by approximating distinguishable attentions learned from raw series. Our Non-stationary Transformers framework consistently boosts mainstream Transformers by a large margin, reducing MSE by 49.43% on Transformer, 47.34% on Informer, and 46.89% on Reformer, making them state-of-the-art in time series forecasting. Code is available at this repository: https://github.com/thuml/Nonstationary_Transformers.
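    The Series Stationarization half of the framework amounts to per-instance normalization with the statistics restored at the output; a minimal sketch (De-stationary Attention, the model-internal half, is not shown, and shapes are assumptions):

```python
import torch

def stationarize(x, eps=1e-5):
    """Per-instance normalization: remove each series' mean and scale so the
    model sees statistically uniform inputs. x: (batch, length) raw series."""
    mu = x.mean(dim=1, keepdim=True)
    sigma = x.std(dim=1, keepdim=True) + eps
    return (x - mu) / sigma, mu, sigma

def destationarize(y, mu, sigma):
    """Restore the stored statistics on the model output."""
    return y * sigma + mu

x = torch.randn(4, 96) * 10 + 50           # non-zero-mean toy series
x_norm, mu, sigma = stationarize(x)
# ... forecast with any model on x_norm, then map the output back:
y_hat = destationarize(x_norm, mu, sigma)  # identity here, for illustration
print(torch.allclose(y_hat, x, atol=1e-4))
```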
    Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning. (arXiv:2210.09452v1 [cs.CV])
    Learning representations for individual instances when only bag-level labels are available is a fundamental challenge in multiple instance learning (MIL). Recent works have shown promising results using contrastive self-supervised learning (CSSL), which learns to push apart representations corresponding to two different randomly-selected instances. Unfortunately, in real-world applications such as medical image classification, there is often class imbalance, so randomly-selected instances mostly belong to the same majority class, which precludes CSSL from learning inter-class differences. To address this issue, we propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from the bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of pseudo labels. We evaluate ItS2CLR on three medical datasets, showing that it improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in terms of both bag and instance level accuracy.
    CEIP: Combining Explicit and Implicit Priors for Reinforcement Learning with Demonstrations. (arXiv:2210.09496v1 [cs.LG])
    Although reinforcement learning has found widespread use in dense reward settings, training autonomous agents with sparse rewards remains challenging. To address this difficulty, prior work has shown promising results when using not only task-specific demonstrations but also task-agnostic albeit somewhat related demonstrations. In most cases, the available demonstrations are distilled into an implicit prior, commonly represented via a single deep net. Explicit priors in the form of a database that can be queried have also been shown to lead to encouraging results. To better benefit from available demonstrations, we develop a method to Combine Explicit and Implicit Priors (CEIP). CEIP exploits multiple implicit priors in the form of normalizing flows in parallel to form a single complex prior. Moreover, CEIP uses an effective explicit retrieval and push-forward mechanism to condition the implicit priors. In three challenging environments, we find the proposed CEIP method to improve upon sophisticated state-of-the-art techniques.
    From Play to Policy: Conditional Behavior Generation from Uncurated Robot Data. (arXiv:2210.10047v1 [cs.RO])
    While large-scale sequence modeling from offline data has led to impressive performance gains in natural language and image generation, directly translating such ideas to robotics has been challenging. One critical reason for this is that uncurated robot demonstration data, i.e. play data, collected from non-expert human demonstrators are often noisy, diverse, and distributionally multi-modal. This makes extracting useful, task-centric behaviors from such data a difficult generative modeling problem. In this work, we present Conditional Behavior Transformers (C-BeT), a method that combines the multi-modal generation ability of Behavior Transformer with future-conditioned goal specification. On a suite of simulated benchmark tasks, we find that C-BeT improves upon prior state-of-the-art work in learning from play data by an average of 45.7%. Further, we demonstrate for the first time that useful task-centric behaviors can be learned on a real-world robot purely from play data without any task labels or reward information. Robot videos are best viewed on our project website: https://play-to-policy.github.io
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v1 [cs.LG])
    Despite the great empirical success of actor-critic methods, their finite-time convergence is still poorly understood in its most practical form. In particular, the analysis of single-timescale actor-critic presents significant challenges due to the highly inaccurate critic estimation and the complex error propagation dynamics over iterations. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity, which is rarely the case in practical applications. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. Our analysis develops a novel framework that evaluates and controls the error propagation between actor and critic in a systematic way. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Overall, our results compare favorably to the existing literature on analyzing actor-critic in terms of considering the most practical settings and requiring weaker assumptions.
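    Schematically, "single-timescale" means the critic is updated with one Markovian sample per actor step at a comparable step size; a toy sketch with linear function approximators (the environment and hyperparameters are illustrative assumptions, not the paper's setting):

```python
import torch
import torch.nn as nn

# Toy 1-D environment: the state drifts with the action; reward prefers s near 0.
def env_step(s, a):
    s_next = 0.9 * s + a + 0.1 * torch.randn(1)
    return s_next, -(s_next ** 2)

actor = nn.Linear(1, 1)   # mean of a Gaussian policy
critic = nn.Linear(1, 1)  # value-function approximator
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-3)  # same timescale

s, gamma = torch.zeros(1), 0.95
for t in range(5000):
    dist = torch.distributions.Normal(actor(s), 0.5)
    a = dist.sample()
    s_next, r = env_step(s, a)
    # One Markovian sample per actor step; the TD error drives both updates.
    td = (r + gamma * critic(s_next) - critic(s)).detach()
    critic_loss = -(td * critic(s)).mean()        # semi-gradient TD(0)
    actor_loss = -(td * dist.log_prob(a)).mean()  # policy-gradient step
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    s = s_next.detach()
```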
    Generalizing in the Real World with Representation Learning. (arXiv:2210.09925v1 [cs.LG])
    Machine learning (ML) formalizes the problem of getting computers to learn from experience as optimization of performance according to some metric(s) on a set of data examples. This is in contrast to requiring behaviour specified in advance (e.g. by hard-coded rules). Formalization of this problem has enabled great progress in many applications with large real-world impact, including translation, speech recognition, self-driving cars, and drug discovery. But practical instantiations of this formalism make many assumptions - for example, that data are i.i.d.: independent and identically distributed - whose soundness is seldom investigated. And in making great progress in such a short time, the field has developed many norms and ad-hoc standards, focused on a relatively small range of problem settings. As applications of ML, particularly in artificial intelligence (AI) systems, become more pervasive in the real world, we need to critically examine these assumptions, norms, and problem settings, as well as the methods that have become de facto standards. There is much we still do not understand about how and why deep networks trained with stochastic gradient descent are able to generalize as well as they do, why they fail when they do, and how they will perform on out-of-distribution data. In this thesis I cover some of my work towards better understanding deep net generalization, identify several ways assumptions and problem settings fail to generalize to the real world, and propose ways to address those failures in practice.
    MMGA: Multimodal Learning with Graph Alignment. (arXiv:2210.09946v1 [cs.MM])
    Multimodal pre-training breaks down the modality barriers and allows the individual modalities to be mutually augmented with information, resulting in significant advances in representation learning. However, the graph modality, as a very general and important form of data, cannot easily interact with other modalities because of its non-regular nature. In this paper, we propose MMGA (Multimodal learning with Graph Alignment), a novel multimodal pre-training framework that incorporates information from the graph (social network), image and text modalities on social media to enhance user representation learning. In MMGA, a multi-step graph alignment mechanism is proposed to add self-supervision from the graph modality to optimize the image and text encoders, while using information from the image and text modalities to guide graph encoder learning. We conduct experiments on a dataset crawled from Instagram. The experimental results show that MMGA works well on the dataset and improves performance on the fan prediction task. We release our dataset, the first social media multimodal dataset with graph structure, comprising 60,000 users labeled with specific topics based on 2 million posts, to facilitate future research.
    Unsupervised visualization of image datasets using contrastive learning. (arXiv:2210.09879v1 [cs.LG])
    Visualization methods based on the nearest neighbor graph, such as t-SNE or UMAP, are widely used for visualizing high-dimensional data. Yet, these approaches only produce meaningful results if the nearest neighbors themselves are meaningful. For images represented in pixel space this is not the case, as distances in pixel space are often not capturing our sense of similarity and therefore neighbors are not semantically close. This problem can be circumvented by self-supervised approaches based on contrastive learning, such as SimCLR, relying on data augmentation to generate implicit neighbors, but these methods do not produce two-dimensional embeddings suitable for visualization. Here, we present a new method, called t-SimCNE, for unsupervised visualization of image data. T-SimCNE combines ideas from contrastive learning and neighbor embeddings, and trains a parametric mapping from the high-dimensional pixel space into two dimensions. We show that the resulting 2D embeddings achieve classification accuracy comparable to the state-of-the-art high-dimensional SimCLR representations, thus faithfully capturing semantic relationships. Using t-SimCNE, we obtain informative visualizations of the CIFAR-10 and CIFAR-100 datasets, showing rich cluster structure and highlighting artifacts and outliers.
    Scaling Adversarial Training to Large Perturbation Bounds. (arXiv:2210.09852v1 [cs.LG])
    The vulnerability of Deep Neural Networks to Adversarial Attacks has fuelled research towards building robust models. While most Adversarial Training algorithms aim at defending attacks constrained within low magnitude Lp norm bounds, real-world adversaries are not limited by such constraints. In this work, we aim to achieve adversarial robustness within larger bounds, against perturbations that may be perceptible, but do not change human (or Oracle) prediction. The presence of images that flip Oracle predictions and those that do not makes this a challenging setting for adversarial robustness. We discuss the ideal goals of an adversarial defense algorithm beyond perceptual limits, and further highlight the shortcomings of naively extending existing training algorithms to higher perturbation bounds. In order to overcome these shortcomings, we propose a novel defense, Oracle-Aligned Adversarial Training (OA-AT), to align the predictions of the network with that of an Oracle during adversarial training. The proposed approach achieves state-of-the-art performance at large epsilon bounds (such as an L-inf bound of 16/255 on CIFAR-10) while outperforming existing defenses (AWP, TRADES, PGD-AT) at standard bounds (8/255) as well.
    UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image. (arXiv:2210.09477v1 [cs.CV])
    We present UniTune, a simple and novel method for general text-driven image editing. UniTune gets as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high semantic and visual fidelity to the input image. UniTune uses text, an intuitive interface for art-direction, and does not require additional inputs, like masks or sketches. At the core of our method is the observation that with the right choice of parameters, we can fine-tune a large text-to-image diffusion model on a single image, encouraging the model to maintain fidelity to the input image while still allowing expressive manipulations. We used Imagen as our text-to-image model, but we expect UniTune to work with other large-scale models as well. We test our method in a range of different use cases, and demonstrate its wide applicability.
    Machine-Learning-Optimized Perovskite Nanoplatelet Synthesis. (arXiv:2210.09783v1 [physics.app-ph])
    With the demand for renewable energy and efficient devices rapidly increasing, a need arises to find and optimize novel (nano)materials. This can be an extremely tedious process, often relying significantly on trial and error. Machine learning has emerged recently as a powerful alternative; however, most approaches require a substantial amount of data points, i.e., syntheses. Here, we merge three machine-learning models with Bayesian optimization and are able to dramatically improve the quality of CsPbBr3 nanoplatelets (NPLs) using only approximately 200 total syntheses. The algorithm can predict the resulting PL emission maxima of the NPL dispersions based on the precursor ratios, leading to previously unobtainable 7- and 8-monolayer NPLs. Aided by heuristic knowledge, the algorithm should be easily applicable to other nanocrystal syntheses and significantly help to identify interesting compositions and rapidly improve their quality.
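    A compact sketch of a Bayesian-optimization loop of this flavor (the objective below is a hypothetical stand-in for synthesis quality as a function of a single precursor ratio; the paper's models and search space differ):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern
from scipy.stats import norm

# Hypothetical 1-D stand-in for synthesis quality vs. one precursor ratio.
def synthesis_quality(ratio):
    return -(ratio - 0.63) ** 2 + 0.02 * np.random.randn()

X = np.array([[0.2], [0.5], [0.9]])  # a few initial syntheses
y = np.array([synthesis_quality(r) for r in X.ravel()])

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)
    grid = np.linspace(0, 1, 200).reshape(-1, 1)
    mu, sd = gp.predict(grid, return_std=True)
    # Expected-improvement acquisition function.
    best = y.max()
    z = (mu - best) / (sd + 1e-9)
    ei = (mu - best) * norm.cdf(z) + sd * norm.pdf(z)
    x_next = grid[np.argmax(ei)]
    y_next = synthesis_quality(x_next[0])  # "run" the suggested synthesis
    X, y = np.vstack([X, [x_next]]), np.append(y, y_next)

print(X[np.argmax(y)])  # best precursor ratio found so far
```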
    SA-MLP: Distilling Graph Knowledge from GNNs into Structure-Aware MLP. (arXiv:2210.09609v1 [cs.LG])
    The message-passing mechanism helps Graph Neural Networks (GNNs) achieve remarkable results on various node classification tasks. Nevertheless, the recursive node fetching and aggregation in message-passing causes inference latency when deploying GNNs to large-scale graphs. One promising inference acceleration direction is to distill the GNNs into message-passing-free student multi-layer perceptrons (MLPs). However, the MLP student cannot fully learn the structure knowledge due to the lack of structure inputs, which causes inferior performance in the heterophily and inductive scenarios. To address this, we intend to inject structure information into MLP-like students in low-latency and interpretable ways. Specifically, we first design a Structure-Aware MLP (SA-MLP) student that encodes both features and structures without message-passing. Then, we introduce a novel structure-mixing knowledge distillation strategy to enhance the learning ability of MLPs for structure information. Furthermore, we design a latent structure embedding approximation technique with two-stage distillation for inductive scenarios. Extensive experiments on eight benchmark datasets under both transductive and inductive settings show that our SA-MLP can consistently outperform the teacher GNNs, while maintaining faster inference as MLPs. The source code of our work can be found at https://github.com/JC-202/SA-MLP.
    Consistent Multiclass Algorithms for Complex Metrics and Constraints. (arXiv:2210.09695v1 [stat.ML])
    We present consistent algorithms for multiclass learning with complex performance metrics and constraints, where the objective and constraint are defined by arbitrary functions of the confusion matrix. This setting includes many common performance metrics such as the multiclass G-mean and micro F1-measure, and constraints such as those on the classifier's precision and recall and more recent measures of fairness discrepancy. We give a general framework for designing consistent algorithms for such complex design goals by viewing the learning problem as an optimization problem over the set of feasible confusion matrices. We provide multiple instantiations of our framework under different assumptions on the performance metrics and constraints, and in each case show rates of convergence to the optimal (feasible) classifier (and thus asymptotic consistency). Experiments on a variety of multiclass classification and fairness-constrained problems show that our algorithms compare favorably to the state-of-the-art baselines.
    Graph Anomaly Detection with Unsupervised GNNs. (arXiv:2210.09535v1 [cs.LG])
    Graph-based anomaly detection finds numerous applications in the real world. Thus, there exists extensive literature on the topic that has recently shifted toward deep detection models due to advances in deep learning and graph neural networks (GNNs). A vast majority of prior work focuses on detecting node/edge/subgraph anomalies within a single graph, with much less work on graph-level anomaly detection in a graph database. This work aims to fill two gaps in the literature: We (1) design GLAM, an end-to-end graph-level anomaly detection model based on GNNs, and (2) focus on unsupervised model selection, which is notoriously hard due to the lack of labels, yet especially critical for deep NN based models with a long list of hyper-parameters. Further, we propose a new pooling strategy for graph-level embedding, called MMD-pooling, that is geared toward detecting distribution anomalies, which has not been considered before. Through extensive experiments on 15 real-world datasets, we show that (i) GLAM outperforms node-level and two-stage (i.e. not end-to-end) baselines, and (ii) model selection picks a significantly more effective model than the expectation (i.e., the average) -- without using any labels -- among candidates with otherwise large variation in performance.
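    The distribution-comparison machinery underlying an MMD-style pooling can be sketched as follows (an editor's illustration with an RBF kernel and a biased estimator, not GLAM's implementation):

```python
import numpy as np

def rbf_mmd2(X, Y, gamma=1.0):
    """Squared maximum mean discrepancy between samples X and Y under an RBF
    kernel (biased estimator); small values indicate similar distributions.
    A pooling built on this can flag graphs whose node-embedding distribution
    deviates from the norm."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    m, n = len(X), len(Y)
    return k(X, X).sum() / m**2 + k(Y, Y).sum() / n**2 - 2 * k(X, Y).sum() / (m * n)

emb_a = np.random.randn(50, 8)        # node embeddings of graph A
emb_b = np.random.randn(50, 8) + 2.0  # graph B's embeddings are shifted
print(rbf_mmd2(emb_a, emb_a[25:]), rbf_mmd2(emb_a, emb_b))  # small vs. large
```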
    Fine-tune your Classifier: Finding Correlations With Temperature. (arXiv:2210.09715v1 [cs.LG])
    Temperature is a widely used hyperparameter in various tasks involving neural networks, such as classification or metric learning, whose choice can have a direct impact on model performance. Most existing works select its value using hyperparameter optimization methods, requiring several runs to find the optimal value. We propose to analyze the impact of temperature on classification tasks by describing a dataset as a set of statistics computed on its representations, from which we can build a heuristic that provides a default temperature value. We study the correlation between these extracted statistics and the observed optimal temperatures. This preliminary study on more than a hundred combinations of datasets and feature extractors highlights promising results towards the construction of a general heuristic for temperature.
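    For context, temperature enters the loss simply by rescaling logits before the softmax; a minimal sketch:

```python
import torch
import torch.nn.functional as F

def tempered_cross_entropy(logits, targets, temperature=1.0):
    """Cross-entropy with temperature-scaled logits: T < 1 sharpens the
    softmax distribution, T > 1 flattens it."""
    return F.cross_entropy(logits / temperature, targets)

logits = torch.tensor([[2.0, 1.0, 0.1]])
target = torch.tensor([0])
for T in (0.5, 1.0, 2.0):
    print(T, tempered_cross_entropy(logits, target, T).item(),
          F.softmax(logits / T, dim=-1).max().item())
```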
    Robot Learning Theory of Mind through Self-Observation: Exploiting the Intentions-Beliefs Synergy. (arXiv:2210.09435v1 [cs.RO])
    In complex environments, where the human sensory system reaches its limits, our behaviour is strongly driven by our beliefs about the state of the world around us. Accessing others' beliefs, intentions, or mental states in general could thus allow for more effective social interactions in natural contexts. Yet these variables are not directly observable. Theory of Mind (TOM), the ability to attribute beliefs, intentions, or mental states to other agents, is a crucial feature of human social interaction and has become of interest to the robotics community. Recently, new models that are able to learn TOM have been introduced. In this paper, we show the synergy between learning to predict low-level mental states, such as intentions and goals, and attributing high-level ones, such as beliefs. Assuming that learning of beliefs can take place by observing one's own decision and belief estimation processes in partially observable environments, and using a simple feed-forward deep learning model, we show that when learning to predict others' intentions and actions, faster and more accurate predictions can be acquired if belief attribution is learnt simultaneously with action and intention prediction. We show that the learning performance improves even when observing agents with a different decision process, and is higher when observing beliefs-driven chunks of behaviour. We propose that our architectural approach can be relevant for the design of future adaptive social robots that should be able to autonomously understand and assist human partners in novel natural environments and tasks.
    Clustering-based Aggregations for Prediction in Event Streams. (arXiv:2210.09738v1 [cs.LG])
    Predicting the behaviour of shoppers provides valuable information for retailers, such as the expected spend of a shopper or the total turnover of a supermarket. The ability to make predictions on an individual level is useful, as it allows supermarkets to accurately perform targeted marketing. However, given the expected number of shoppers and their diverse behaviours, making accurate predictions on an individual level is difficult. This problem does not only arise in shopper behaviour, but also in various business processes, such as predicting when an invoice will be paid. In this paper we present CAPiES, a framework that focuses on the trade-off between predictive accuracy and per-entity usefulness in an online setting. By making predictions for a larger number of entities at a time, we improve predictive accuracy, but at the potential cost of usefulness, since we can say less about the individual entities. CAPiES is developed in an online setting, where we continuously update the prediction model and make new predictions over time. We show the existence of the trade-off in an experimental evaluation in two real-world scenarios: a supermarket with over 160,000 shoppers and a paint factory with over 171,000 invoices.
    CAN-BERT do it? Controller Area Network Intrusion Detection System based on BERT Language Model. (arXiv:2210.09439v1 [cs.LG])
    Due to the rising number of sophisticated customer functionalities, electronic control units (ECUs) are increasingly integrated into modern automotive systems. However, the high connectivity between the in-vehicle and the external networks paves the way for hackers who could exploit in-vehicle network protocols' vulnerabilities. Among these protocols, the Controller Area Network (CAN), known as the most widely used in-vehicle networking technology, lacks encryption and authentication mechanisms, making the communications delivered by distributed ECUs insecure. Inspired by the outstanding performance of bidirectional encoder representations from transformers (BERT) for improving many natural language processing tasks, we propose in this paper ``CAN-BERT", a deep learning based network intrusion detection system, to detect cyber attacks on the CAN bus protocol. We show that the BERT model can learn the sequence of arbitration identifiers (IDs) in the CAN bus for anomaly detection using the ``masked language model" unsupervised training objective. The experimental results on the ``Car Hacking: Attack \& Defense Challenge 2020" dataset show that ``CAN-BERT" outperforms state-of-the-art approaches. In addition to being able to identify in-vehicle intrusions in real-time within 0.8 ms to 3 ms w.r.t. CAN ID sequence length, it can also detect a wide variety of cyberattacks with an F1-score of between 0.81 and 0.99.
    Adaptive Oracle-Efficient Online Learning. (arXiv:2210.09385v1 [cs.LG])
    The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponentially-large (or even infinite) space of decisions and select that which performed the best on any dataset. But despite the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while performing well in worst-case settings, they do not adapt well to friendly environments. In this paper we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to the small-loss environment, under a particular condition which we call approximability (which is spiritually related to sufficient conditions provided by Dudík et al. [2020]). We identify a series of real-world settings, including online auctions and transductive online classification, for which approximability holds. We also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the oracle-efficient setting.
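    A minimal follow-the-perturbed-leader sketch over a finite set of experts, where the offline "oracle" is just an argmin over cumulative perturbed losses (the perturbation distribution and scale are illustrative choices):

```python
import numpy as np

def ftpl(loss_matrix, eta=1.0, seed=0):
    """Follow-the-perturbed-leader over K experts: at each round, call the
    'offline oracle' (here, an argmin) on cumulative losses minus a fresh
    random perturbation. One oracle call per round keeps it oracle-efficient."""
    rng = np.random.default_rng(seed)
    T, K = loss_matrix.shape
    cum = np.zeros(K)
    total = 0.0
    for t in range(T):
        noise = rng.exponential(scale=eta, size=K)
        choice = np.argmin(cum - noise)  # perturbed leader via the oracle
        total += loss_matrix[t, choice]
        cum += loss_matrix[t]
    return total

losses = np.random.default_rng(1).uniform(size=(1000, 5))
losses[:, 2] -= 0.3                             # expert 2 is best on average
print(ftpl(losses), losses.sum(axis=0).min())   # FTPL total vs. best expert
```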
    Risk of re-identification for shared clinical speech recordings. (arXiv:2210.09975v1 [eess.AS])
    Large, curated datasets are required to leverage speech-based tools in healthcare. These are costly to produce, resulting in increased interest in data sharing. As speech can potentially identify speakers (i.e., voiceprints), sharing recordings raises privacy concerns. We examine the re-identification risk for speech recordings, without reference to demographics or metadata, using a state-of-the-art speaker recognition system. We demonstrate that the risk is inversely related to the number of comparisons an adversary must consider, i.e., the search space. Risk is high for a small search space but drops as the search space grows (precision $> 0.85$ for $3 \times 10^{6}$ comparisons). Next, we show that the nature of a speech recording influences re-identification risk, with non-connected speech (e.g., vowel prolongation) being harder to identify. Our findings suggest that speaker recognition systems can be used to re-identify participants in specific circumstances, but in practice, the re-identification risk appears low.
    CNT (Conditioning on Noisy Targets): A new Algorithm for Leveraging Top-Down Feedback. (arXiv:2210.09505v1 [cs.LG])
    We propose a novel regularizer for supervised learning called Conditioning on Noisy Targets (CNT). This approach consists of conditioning the model on a noisy version of the target(s) (e.g., actions in imitation learning or labels in classification) at a random noise level (from small to large). At inference time, since we do not know the target, we run the network with only noise in place of the noisy target. CNT provides hints through the noisy label (with less noise, we can more easily infer the true target). This gives two main benefits: 1) the top-down feedback allows the model to focus on simpler and more digestible sub-problems, and 2) rather than learning to solve the task from scratch, the model first learns to master easy examples (with less noise), while slowly progressing toward harder examples (with more noise).
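    A hedged sketch of the scheme (the architecture, noise schedule, and the way the noise level is fed to the network are all illustrative assumptions):

```python
import torch
import torch.nn as nn

class CNTClassifier(nn.Module):
    """Sketch of Conditioning on Noisy Targets: the model receives the input
    plus a noisy version of the one-hot label and the noise level; at
    inference the label slot is pure noise."""
    def __init__(self, d_in=20, n_classes=5, hidden=64):
        super().__init__()
        self.n_classes = n_classes
        self.net = nn.Sequential(
            nn.Linear(d_in + n_classes + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, x, y=None):
        if y is None:  # inference: no target available, feed pure noise
            noisy = torch.randn(len(x), self.n_classes)
            level = torch.ones(len(x), 1)
        else:          # training: interpolate one-hot target with noise
            level = torch.rand(len(x), 1)  # random noise level per example
            onehot = nn.functional.one_hot(y, self.n_classes).float()
            noisy = (1 - level) * onehot + level * torch.randn_like(onehot)
        return self.net(torch.cat([x, noisy, level], dim=1))

model = CNTClassifier()
x, y = torch.randn(8, 20), torch.randint(0, 5, (8,))
loss = nn.functional.cross_entropy(model(x, y), y)  # train with noisy hints
preds = model(x).argmax(dim=1)                      # predict without targets
```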
    Probabilistic Forecasting Methods for System-Level Electricity Load Forecasting. (arXiv:2210.09399v1 [cs.LG])
    Load forecasts have become an integral part of energy security. Because many different influencing factors can be considered in such a forecast, there is a wide range of models that attempt to integrate these parameters in various ways. Given the growing importance of probabilistic load forecast models, this analysis presents several such approaches. The focus is on models for the short-term horizon, after which a model for the long-term horizon is presented. The presented models are then related to one another and examined with reference to their advantages and disadvantages, and the surveyed papers are analyzed with a focus on their comparability. Finally, an outlook on further areas of development in the literature is discussed.
    Learning Diversified Feature Representations for Facial Expression Recognition in the Wild. (arXiv:2210.09381v1 [cs.CV])
    Diversity of the features extracted by deep neural networks is important for enhancing the model generalization ability and accordingly its performance in different learning tasks. Facial expression recognition in the wild has attracted interest in recent years due to the challenges existing in this area for extracting discriminative and informative features from occluded images in real-world scenarios. In this paper, we propose a mechanism to diversify the features extracted by CNN layers of state-of-the-art facial expression recognition architectures for enhancing the model capacity in learning discriminative features. To evaluate the effectiveness of the proposed approach, we incorporate this mechanism in two state-of-the-art models to (i) diversify local/global features in an attention-based model and (ii) diversify features extracted by different learners in an ensemble-based model. Experimental results on three well-known facial expression recognition in-the-wild datasets, AffectNet, FER+, and RAF-DB, show the effectiveness of our method, achieving the state-of-the-art performance of 89.99% on RAF-DB, 89.34% on FER+ and the competitive accuracy of 60.02% on the AffectNet dataset.  ( 2 min )
    Training Scale-Invariant Neural Networks on the Sphere Can Happen in Three Regimes. (arXiv:2209.03695v2 [cs.LG] UPDATED)
    A fundamental property of deep learning normalization techniques, such as batch normalization, is making the pre-normalization parameters scale invariant. The intrinsic domain of such parameters is the unit sphere, and therefore their gradient optimization dynamics can be represented via spherical optimization with varying effective learning rate (ELR), which was studied previously. However, the varying ELR may obscure certain characteristics of the intrinsic loss landscape structure. In this work, we investigate the properties of training scale-invariant neural networks directly on the sphere using a fixed ELR. We discover three regimes of such training depending on the ELR value: convergence, chaotic equilibrium, and divergence. We study these regimes in detail, both through a theoretical examination of a toy example and through a thorough empirical analysis of real scale-invariant deep learning models. Each regime has unique features and reflects specific properties of the intrinsic loss landscape, some of which have strong parallels with previous research on both regular and scale-invariant neural networks training. Finally, we demonstrate how the discovered regimes are reflected in conventional training of normalized networks and how they can be leveraged to achieve better optima.
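    The fixed-ELR training studied here can be pictured as Riemannian gradient descent on the unit sphere; a minimal sketch (our illustration, not the authors' code):

        import numpy as np

        def sphere_step(w, grad, elr):
            # project the Euclidean gradient onto the tangent space at w
            riem_grad = grad - (grad @ w) * w
            w = w - elr * riem_grad
            return w / np.linalg.norm(w)   # retract back onto the unit sphere

    Sweeping the fixed `elr` in such a loop from small to large values is what exposes the three regimes: convergence, chaotic equilibrium, and divergence.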
    Estimating Optimal Infinite Horizon Dynamic Treatment Regimes via pT-Learning. (arXiv:2110.10719v2 [stat.ME] UPDATED)
    Recent advances in mobile health (mHealth) technology provide an effective way to monitor individuals' health statuses and deliver just-in-time personalized interventions. However, the practical use of mHealth technology raises unique challenges to existing methodologies on learning an optimal dynamic treatment regime. Many mHealth applications involve decision-making with large numbers of intervention options and under an infinite time horizon setting where the number of decision stages diverges to infinity. In addition, temporary medication shortages may cause optimal treatments to be unavailable, while it is unclear what alternatives can be used. To address these challenges, we propose a Proximal Temporal consistency Learning (pT-Learning) framework to estimate an optimal regime that is adaptively adjusted between deterministic and stochastic sparse policy models. The resulting minimax estimator avoids the double sampling issue in the existing algorithms. It can be further simplified and can easily incorporate off-policy data without mismatched distribution corrections. We study theoretical properties of the sparse policy and establish finite-sample bounds on the excess risk and performance error. The proposed method is provided in our proximalDTR package and is evaluated through extensive simulation studies and the OhioT1DM mHealth dataset.
    Warped Dynamic Linear Models for Time Series of Counts. (arXiv:2110.14790v3 [stat.ME] UPDATED)
    Dynamic Linear Models (DLMs) are commonly employed for time series analysis due to their versatile structure, simple recursive updating, ability to handle missing data, and probabilistic forecasting. However, the options for count time series are limited: Gaussian DLMs require continuous data, while Poisson-based alternatives often lack sufficient modeling flexibility. We introduce a novel semiparametric methodology for count time series by warping a Gaussian DLM. The warping function has two components: a (nonparametric) transformation operator that provides distributional flexibility and a rounding operator that ensures the correct support for the discrete data-generating process. We develop conjugate inference for the warped DLM, which enables analytic and recursive updates for the state space filtering and smoothing distributions. We leverage these results to produce customized and efficient algorithms for inference and forecasting, including Monte Carlo simulation for offline analysis and an optimal particle filter for online inference. This framework unifies and extends a variety of discrete time series models and is valid for natural counts, rounded values, and multivariate observations. Simulation studies illustrate the excellent forecasting capabilities of the warped DLM. The proposed approach is applied to a multivariate time series of daily overdose counts and demonstrates both modeling and computational successes.
    Distribution Compression in Near-linear Time. (arXiv:2111.07941v6 [stat.ML] UPDATED)
    In distribution compression, one aims to accurately summarize a probability distribution $\mathbb{P}$ using a small number of representative points. Near-optimal thinning procedures achieve this goal by sampling $n$ points from a Markov chain and identifying $\sqrt{n}$ points with $\widetilde{\mathcal{O}}(1/\sqrt{n})$ discrepancy to $\mathbb{P}$. Unfortunately, these algorithms suffer from quadratic or super-quadratic runtime in the sample size $n$. To address this deficiency, we introduce Compress++, a simple meta-procedure for speeding up any thinning algorithm while suffering at most a factor of $4$ in error. When combined with the quadratic-time kernel halving and kernel thinning algorithms of Dwivedi and Mackey (2021), Compress++ delivers $\sqrt{n}$ points with $\mathcal{O}(\sqrt{\log n/n})$ integration error and better-than-Monte-Carlo maximum mean discrepancy in $\mathcal{O}(n \log^3 n)$ time and $\mathcal{O}( \sqrt{n} \log^2 n )$ space. Moreover, Compress++ enjoys the same near-linear runtime given any quadratic-time input and reduces the runtime of super-quadratic algorithms by a square-root factor. In our benchmarks with high-dimensional Monte Carlo samples and Markov chains targeting challenging differential equation posteriors, Compress++ matches or nearly matches the accuracy of its input algorithm in orders of magnitude less time.
    Small Transformers Compute Universal Metric Embeddings. (arXiv:2209.06788v2 [cs.LG] UPDATED)
    We study representations of data from an arbitrary metric space $\mathcal{X}$ in the space of univariate Gaussian mixtures with a transport metric (Delon and Desolneux 2020). We derive embedding guarantees for feature maps implemented by small neural networks called \emph{probabilistic transformers}. Our guarantees are of memorization type: we prove that a probabilistic transformer of depth about $n\log(n)$ and width about $n^2$ can bi-H\"{o}lder embed any $n$-point dataset from $\mathcal{X}$ with low metric distortion, thus avoiding the curse of dimensionality. We further derive probabilistic bi-Lipschitz guarantees, which trade off the amount of distortion and the probability that a randomly chosen pair of points embeds with that distortion. If $\mathcal{X}$'s geometry is sufficiently regular, we obtain stronger, bi-Lipschitz guarantees for all points in the dataset. As applications, we derive neural embedding guarantees for datasets from Riemannian manifolds, metric trees, and certain types of combinatorial graphs. When instead embedding into multivariate Gaussian mixtures, we show that probabilistic transformers can compute bi-H\"{o}lder embeddings with arbitrarily small distortion.
    Anticipating Performativity by Predicting from Predictions. (arXiv:2208.07331v2 [stat.ML] UPDATED)
    Predictions about people, such as their expected educational achievement or their credit risk, can be performative and shape the outcome that they aim to predict. Understanding the causal effect of these predictions on the eventual outcomes is crucial for foreseeing the implications of future predictive models and selecting which models to deploy. However, this causal estimation task poses unique challenges: model predictions are usually deterministic functions of input features and highly correlated with outcomes. This can make the causal effects of predictions on outcomes impossible to disentangle from the direct effect of the covariates. We study this problem through the lens of causal identifiability, and despite the hardness of this problem in full generality, we highlight three natural scenarios where the causal relationship between covariates, predictions and outcomes can be identified from observational data: randomization in predictions, overparameterization of the predictive model deployed during data collection, and discrete prediction outputs. Empirically we show that given our identifiability conditions hold, standard variants of supervised learning that predict from predictions by treating the prediction as an input feature can indeed find transferable functional relationships that allow for conclusions about newly deployed predictive models. These positive results fundamentally rely on model predictions being recorded during data collection, bringing forward the importance of rethinking standard data collection practices to enable progress towards a better understanding of social outcomes and performative feedback loops.
    Tight Analysis of Extra-gradient and Optimistic Gradient Methods For Nonconvex Minimax Problems. (arXiv:2210.09382v1 [cs.LG])
    Despite the established convergence theory of Optimistic Gradient Descent Ascent (OGDA) and Extragradient (EG) methods for convex-concave minimax problems, little is known about the theoretical guarantees of these methods in nonconvex settings. To bridge this gap, for the first time, this paper establishes the convergence of OGDA and EG methods under the nonconvex-strongly-concave (NC-SC) and nonconvex-concave (NC-C) settings by providing a unified analysis through the lens of single-call extra-gradient methods. We further establish lower bounds on the convergence of GDA/OGDA/EG, shedding light on the tightness of our analysis. We also conduct experiments supporting our theoretical results. We believe our results will advance the theoretical understanding of OGDA and EG methods for solving complicated nonconvex minimax real-world problems, e.g., Generative Adversarial Networks (GANs) or robust neural network training.
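    For reference, the textbook extragradient update analyzed in the paper, shown on a bilinear toy problem (the toy setup is our illustration, not the paper's experiments):

        import numpy as np

        def extragradient(grad_x, grad_y, x, y, eta=0.1, steps=1000):
            for _ in range(steps):
                # extrapolation step
                x_half, y_half = x - eta * grad_x(x, y), y + eta * grad_y(x, y)
                # update step uses gradients at the extrapolated point
                x, y = (x - eta * grad_x(x_half, y_half),
                        y + eta * grad_y(x_half, y_half))
            return x, y

        # f(x, y) = x * y has its saddle point at (0, 0); plain GDA cycles
        # around it, while EG converges
        x, y = extragradient(lambda x, y: y, lambda x, y: x, 1.0, 1.0)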
    Generalizing in the Real World with Representation Learning. (arXiv:2210.09925v1 [cs.LG])
    Machine learning (ML) formalizes the problem of getting computers to learn from experience as optimization of performance according to some metric(s) on a set of data examples. This is in contrast to requiring behaviour specified in advance (e.g. by hard-coded rules). Formalization of this problem has enabled great progress in many applications with large real-world impact, including translation, speech recognition, self-driving cars, and drug discovery. But practical instantiations of this formalism make many assumptions - for example, that data are i.i.d.: independent and identically distributed - whose soundness is seldom investigated. And in making great progress in such a short time, the field has developed many norms and ad-hoc standards, focused on a relatively small range of problem settings. As applications of ML, particularly in artificial intelligence (AI) systems, become more pervasive in the real world, we need to critically examine these assumptions, norms, and problem settings, as well as the methods that have become de facto standards. There is much we still do not understand about how and why deep networks trained with stochastic gradient descent are able to generalize as well as they do, why they fail when they do, and how they will perform on out-of-distribution data. In this thesis I cover some of my work towards better understanding deep net generalization, identify several ways assumptions and problem settings fail to generalize to the real world, and propose ways to address those failures in practice.
    Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations. (arXiv:2205.13571v2 [cs.LG] UPDATED)
    Neural networks have achieved tremendous success in a large variety of applications. However, their memory footprint and computational demand can render them impractical in application settings with limited hardware or energy resources. In this work, we propose a novel algorithm to find efficient low-rank subnetworks. Remarkably, these subnetworks are determined and adapted already during the training phase and the overall time and memory resources required by both training and evaluating them are significantly reduced. The main idea is to restrict the weight matrices to a low-rank manifold and to update the low-rank factors rather than the full matrix during training. To derive training updates that are restricted to the prescribed manifold, we employ techniques from dynamic model order reduction for matrix differential equations. This allows us to provide approximation, stability, and descent guarantees. Moreover, our method automatically and dynamically adapts the ranks during training to achieve the desired approximation accuracy. The efficiency of the proposed method is demonstrated through a variety of numerical experiments on fully-connected and convolutional networks.
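    The core idea of updating low-rank factors rather than the full weight matrix, in its simplest (naive) form; the paper's actual updates come from dynamical low-rank integrators with stability guarantees, which this sketch omits:

        import numpy as np

        def lowrank_step(U, V, grad_W, lr):
            # W = U @ V.T is never formed; by the chain rule,
            # dL/dU = (dL/dW) V and dL/dV = (dL/dW)^T U
            return U - lr * (grad_W @ V), V - lr * (grad_W.T @ U)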
    Adaptive Oracle-Efficient Online Learning. (arXiv:2210.09385v1 [cs.LG])
    The classical algorithms for online learning and decision-making have the benefit of achieving the optimal performance guarantees, but suffer from computational complexity limitations when implemented at scale. More recent sophisticated techniques, which we refer to as oracle-efficient methods, address this problem by dispatching to an offline optimization oracle that can search through an exponentially-large (or even infinite) space of decisions and select the one that performed best on any dataset. But despite the benefits of computational feasibility, oracle-efficient algorithms exhibit one major limitation: while performing well in worst-case settings, they do not adapt well to friendly environments. In this paper we consider two such friendly scenarios, (a) "small-loss" problems and (b) IID data. We provide a new framework for designing follow-the-perturbed-leader algorithms that are oracle-efficient and adapt well to the small-loss environment, under a particular condition which we call approximability (which is spiritually related to sufficient conditions provided by Dud\'{i}k et al. [2020]). We identify a series of real-world settings, including online auctions and transductive online classification, for which approximability holds. We also extend the algorithm to an IID data setting and establish a "best-of-both-worlds" bound in the oracle-efficient setting.
    Measure-Theoretic Probability of Complex Co-occurrence and E-Integral. (arXiv:2210.09913v1 [stat.ML])
    Complex high-dimensional co-occurrence data arise increasingly often from systems of interacting physical, biological, and social processes, observed over discretely indexed modifiable areal units or continuously indexed locations of a study region. Modeling, predicting, and interpreting such co-occurrences are general and fundamental problems of statistics and machine learning across a broad variety of modern real-world applications. We define probability and conditional probability of co-occurrence in a general setting via set functions, developing a rigorous measure-theoretic foundation for data sparseness, the main challenge inherent to probabilistic modeling and reasoning about co-occurrence. Based on the defined conditional probability of co-occurrence, we investigate the behavior of a class of natural integrals called E-integrals and present results on their properties. The paper offers a novel measure-theoretic framework in which the E-integral, as a basic measure-theoretic concept, can serve as the starting point for the expectation-functional approach to probability theory preferred by Whittle (1992) and Pollard (2001), addressing the challenges of modern high-dimensional co-occurrence data and opening the door to further research in complex high-dimensional co-occurrence data science.
    Estimating the Arc Length of the Optimal ROC Curve and Lower Bounding the Maximal AUC. (arXiv:2110.09651v3 [math.ST] UPDATED)
    In this paper, we show the arc length of the optimal ROC curve is an $f$-divergence. By leveraging this result, we express the arc length using a variational objective and estimate it accurately using positive and negative samples. We show this estimator has a non-parametric convergence rate $O_p(n^{-\beta/4})$ ($\beta \in (0,1]$ depends on the smoothness). Using the same technique, we show the surface area between the optimal ROC curve and the diagonal can be expressed via a similar variational objective. These new insights lead to a novel classification procedure that maximizes an approximate lower bound of the maximal AUC. Experiments on CIFAR-10 show that the proposed two-step procedure achieves good AUC performance in imbalanced binary classification tasks.
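    A numerical illustration of the quantity being estimated (this uses the closed-form optimal ROC curve for two unit-variance Gaussians, not the paper's variational estimator):

        import numpy as np
        from scipy.stats import norm

        mu = 1.0                                     # class separation
        fpr = np.linspace(1e-6, 1 - 1e-6, 100001)
        tpr = norm.cdf(mu + norm.ppf(fpr))           # optimal ROC curve
        arc = np.sqrt(np.diff(fpr) ** 2 + np.diff(tpr) ** 2).sum()
        # arc length ranges from sqrt(2) (mu = 0, chance level) to 2 (perfect)
        print(arc)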
    What You See is What You Get: Principled Deep Learning via Distributional Generalization. (arXiv:2204.03230v2 [cs.LG] UPDATED)
    Having similar behavior at training time and test time $-$ what we call a "What You See Is What You Get" (WYSIWYG) property $-$ is desirable in machine learning. Models trained with standard stochastic gradient descent (SGD), however, do not necessarily have this property, as their complex behaviors such as robustness or subgroup performance can differ drastically between training and test time. In contrast, we show that Differentially-Private (DP) training provably ensures the high-level WYSIWYG property, which we quantify using a notion of distributional generalization. Applying this connection, we introduce new conceptual tools for designing deep-learning methods by reducing generalization concerns to optimization ones: to mitigate unwanted behavior at test time, it is provably sufficient to mitigate this behavior on the training data. By applying this novel design principle, which bypasses "pathologies" of SGD, we construct simple algorithms that are competitive with SOTA in several distributional-robustness applications, significantly improve the privacy vs. disparate impact trade-off of DP-SGD, and mitigate robust overfitting in adversarial training. Finally, we also improve on theoretical bounds relating DP, stability, and distributional generalization.
    Mean-Field Analysis of Two-Layer Neural Networks: Global Optimality with Linear Convergence Rates. (arXiv:2205.09860v2 [cs.LG] UPDATED)
    We consider optimizing two-layer neural networks in the mean-field regime where the learning dynamics of network weights can be approximated by the evolution in the space of probability measures over the weight parameters associated with the neurons. The mean-field regime is a theoretically attractive alternative to the NTK (lazy training) regime which is only restricted locally in the so-called neural tangent kernel space around specialized initializations. Several prior works (\cite{chizat2018global, mei2018mean}) establish the asymptotic global optimality of the mean-field regime, but it is still challenging to obtain a quantitative convergence rate due to the complicated unbounded nonlinearity of the training dynamics. This work establishes the first linear convergence result for vanilla two-layer neural networks trained by continuous-time noisy gradient descent in the mean-field regime. Our result relies on a novel time-dependent estimate of the logarithmic Sobolev constants for a family of measures determined by the evolving distribution of hidden neurons.
    Online Agnostic Multiclass Boosting. (arXiv:2205.15113v2 [cs.LG] UPDATED)
    Boosting is a fundamental approach in machine learning that enjoys both strong theoretical and practical guarantees. At a high-level, boosting algorithms cleverly aggregate weak learners to generate predictions with arbitrarily high accuracy. In this way, boosting algorithms convert weak learners into strong ones. Recently, Brukhim et al. extended boosting to the online agnostic binary classification setting. A key ingredient in their approach is a clean and simple reduction to online convex optimization, one that efficiently converts an arbitrary online convex optimizer to an agnostic online booster. In this work, we extend this reduction to multiclass problems and give the first boosting algorithm for online agnostic multiclass classification. Our reduction also enables the construction of algorithms for statistical agnostic, online realizable, and statistical realizable multiclass boosting.
    Tree-Values: selective inference for regression trees. (arXiv:2106.07816v2 [stat.ME] UPDATED)
    We consider conducting inference on the output of the Classification and Regression Tree (CART) [Breiman et al., 1984] algorithm. A naive approach to inference that does not account for the fact that the tree was estimated from the data will not achieve standard guarantees, such as Type 1 error rate control and nominal coverage. Thus, we propose a selective inference framework for conducting inference on a fitted CART tree. In a nutshell, we condition on the fact that the tree was estimated from the data. We propose a test for the difference in the mean response between a pair of terminal nodes that controls the selective Type 1 error rate, and a confidence interval for the mean response within a single terminal node that attains the nominal selective coverage. Efficient algorithms for computing the necessary conditioning sets are provided. We apply these methods in simulation and to a dataset involving the association between portion control interventions and caloric intake.
    Online Convex Optimization with Unbounded Memory. (arXiv:2210.09903v1 [cs.LG])
    Online convex optimization (OCO) is a widely used framework in online learning. In each round, the learner chooses a decision in some convex set and an adversary chooses a convex loss function, and then the learner suffers the loss associated with their chosen decision. However, in many of the motivating applications the loss of the learner depends not only on the current decision but on the entire history of decisions until that point. The OCO framework and existing generalizations thereof fail to capture this. In this work we introduce a generalization of the OCO framework, ``Online Convex Optimization with Unbounded Memory'', that captures long-term dependence on past decisions. We introduce the notion of $p$-effective memory capacity, $H_p$, that quantifies the maximum influence of past decisions on current losses. We prove an $O(\sqrt{H_1 T})$ policy regret bound and a stronger $O(\sqrt{H_p T})$ policy regret bound under mild additional assumptions. These bounds are optimal in terms of their dependence on the time horizon $T$. We show the broad applicability of our framework by using it to derive regret bounds, and to simplify existing regret bound derivations, for a variety of online learning problems including an online variant of performative prediction and online linear control.
    Finite-time analysis of single-timescale actor-critic. (arXiv:2210.09921v1 [cs.LG])
    Despite the great empirical success of actor-critic methods, their finite-time convergence is still poorly understood in the most practical form of the algorithm. In particular, the analysis of single-timescale actor-critic presents significant challenges due to the highly inaccurate critic estimation and the complex error propagation dynamics over iterations. Existing works on analyzing single-timescale actor-critic only focus on the i.i.d. sampling or tabular setting for simplicity, which is rarely the case in practical applications. We consider the more practical online single-timescale actor-critic algorithm on continuous state space, where the critic is updated with a single Markovian sample per actor step. We prove that the online single-timescale actor-critic method is guaranteed to find an $\epsilon$-approximate stationary point with $\widetilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity under standard assumptions, which can be further improved to $\mathcal{O}(\epsilon^{-2})$ under i.i.d. sampling. Our analysis develops a novel framework that evaluates and controls the error propagation between actor and critic in a systematic way. To our knowledge, this is the first finite-time analysis for the online single-timescale actor-critic method. Overall, our results compare favorably to the existing literature on analyzing actor-critic in terms of considering the most practical settings and requiring weaker assumptions.
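    Schematically, the single-timescale update analyzed here processes one Markovian transition per step, updating critic and actor together (an illustrative sketch with a linear critic; the names, features, and step sizes are placeholders, not the paper's setup):

        import numpy as np

        def ac_step(theta, w, phi, grad_log_pi, s, a, r, s_next,
                    alpha=0.01, beta=0.05, gamma=0.99):
            # linear critic V(s) = w @ phi(s); TD error from a single sample
            td = r + gamma * (w @ phi(s_next)) - w @ phi(s)
            w = w + beta * td * phi(s)                              # critic update
            theta = theta + alpha * td * grad_log_pi(theta, s, a)  # actor update
            return theta, w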
    Consistent Multiclass Algorithms for Complex Metrics and Constraints. (arXiv:2210.09695v1 [stat.ML])
    We present consistent algorithms for multiclass learning with complex performance metrics and constraints, where the objective and constraint are defined by arbitrary functions of the confusion matrix. This setting includes many common performance metrics such as the multiclass G-mean and micro F1-measure, and constraints such as those on the classifier's precision and recall and more recent measures of fairness discrepancy. We give a general framework for designing consistent algorithms for such complex design goals by viewing the learning problem as an optimization problem over the set of feasible confusion matrices. We provide multiple instantiations of our framework under different assumptions on the performance metrics and constraints, and in each case show rates of convergence to the optimal (feasible) classifier (and thus asymptotic consistency). Experiments on a variety of multiclass classification and fairness-constrained problems show that our algorithms compare favorably to the state-of-the-art baselines.
    Parsimonious Black-Box Adversarial Attacks via Efficient Combinatorial Optimization. (arXiv:1905.06635v2 [cs.LG] UPDATED)
    Solving for adversarial examples with projected gradient descent has been demonstrated to be highly effective in fooling neural network based classifiers. However, in the black-box setting, the attacker is limited only to query access to the network, and solving for a successful adversarial example becomes much more difficult. To this end, recent methods aim at estimating the true gradient signal based on the input queries but at the cost of excessive queries. We propose an efficient discrete surrogate to the optimization problem which does not require estimating the gradient and consequently becomes free of the first order update hyperparameters to tune. Our experiments on CIFAR-10 and ImageNet show state-of-the-art black-box attack performance with significant reduction in the required queries compared to a number of recently proposed methods. The source code is available at https://github.com/snu-mllab/parsimonious-blackbox-attack.
    On Accelerated Perceptrons and Beyond. (arXiv:2210.09371v1 [cs.LG])
    The classical Perceptron algorithm of Rosenblatt can be used to find a linear threshold function to correctly classify $n$ linearly separable data points, assuming the classes are separated by some margin $\gamma > 0$. A foundational result is that Perceptron converges after $\Omega(1/\gamma^{2})$ iterations. There have been several recent works that managed to improve this rate by a quadratic factor, to $\Omega(\sqrt{\log n}/\gamma)$, with more sophisticated algorithms. In this paper, we unify these existing results under one framework by showing that they can all be described through the lens of solving min-max problems using modern acceleration techniques, mainly through optimistic online learning. We then show that the proposed framework also leads to improved results for a series of problems beyond the standard Perceptron setting. Specifically, a) For the margin maximization problem, we improve the state-of-the-art result from $O(\log t/t^2)$ to $O(1/t^2)$, where $t$ is the number of iterations; b) We provide the first result on identifying the implicit bias property of the classical Nesterov's accelerated gradient descent (NAG) algorithm, and show NAG can maximize the margin with an $O(1/t^2)$ rate; c) For the classical $p$-norm Perceptron problem, we provide an algorithm with $\Omega(\sqrt{(p-1)\log n}/\gamma)$ convergence rate, while existing algorithms suffer from the $\Omega({(p-1)}/\gamma^2)$ convergence rate.
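    For reference, the classical Perceptron update the paper starts from; by Novikoff's theorem it makes at most $(R/\gamma)^2$ mistakes on data of norm at most $R$ separable with margin $\gamma$:

        import numpy as np

        def perceptron(X, y, max_epochs=100):
            w = np.zeros(X.shape[1])
            for _ in range(max_epochs):
                mistakes = 0
                for xi, yi in zip(X, y):          # labels yi in {-1, +1}
                    if yi * (w @ xi) <= 0:        # misclassified (or on boundary)
                        w += yi * xi
                        mistakes += 1
                if mistakes == 0:
                    return w                      # converged: data separated
            return w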
    Nonlinear Invariant Risk Minimization: A Causal Approach. (arXiv:2102.12353v6 [cs.LG] UPDATED)
    Due to spurious correlations, machine learning systems often fail to generalize to environments whose distributions differ from the ones used at training time. Prior work addressing this, either explicitly or implicitly, attempted to find a data representation that has an invariant relationship with the target. This is done by leveraging a diverse set of training environments to reduce the effect of spurious features and build an invariant predictor. However, these methods have generalization guarantees only when both data representation and classifiers come from a linear model class. We propose invariant Causal Representation Learning (iCaRL), an approach that enables out-of-distribution (OOD) generalization in the nonlinear setting (i.e., nonlinear representations and nonlinear classifiers). It builds upon a practical and general assumption: the prior over the data representation (i.e., a set of latent variables encoding the data) given the target and the environment belongs to general exponential family distributions. Based on this, we show that it is possible to identify the data representation up to simple transformations. We also prove that all direct causes of the target can be fully discovered, which further enables us to obtain generalization guarantees in the nonlinear setting. Extensive experiments on both synthetic and real-world datasets show that our approach outperforms a variety of baseline methods. Finally, in the discussion, we further explore the aforementioned assumption and propose a more general hypothesis, called the Agnostic Hypothesis: there exists a set of hidden causal factors affecting both inputs and outcomes. The Agnostic Hypothesis can provide a unifying view of machine learning. More importantly, it can inspire a new direction to explore a general theory for identifying hidden causal factors, which is key to enabling the OOD generalization guarantees.
    On Gradient Descent Convergence beyond the Edge of Stability. (arXiv:2206.04172v2 [cs.LG] UPDATED)
    Gradient Descent (GD) is a powerful workhorse of modern machine learning thanks to its scalability and efficiency in high-dimensional spaces. Its ability to find local minimisers is only guaranteed for losses with Lipschitz gradients, where it can be seen as a `bona-fide' discretisation of an underlying gradient flow. Yet, many ML setups involving overparametrised models do not fall into this problem class, which has motivated research beyond the so-called ``Edge of Stability'' (EoS), where the step-size crosses the admissibility threshold inversely proportional to the Lipschitz constant above. Perhaps surprisingly, GD has been empirically observed to still converge regardless of local instability and oscillatory behavior. The incipient theoretical analysis of this phenomenon has mainly focused on the overparametrised regime, where the effect of choosing a large learning rate may be associated with a `Sharpness-Minimisation' implicit regularisation within the manifold of minimisers, under appropriate asymptotic limits. In contrast, in this work we directly examine the conditions for such unstable convergence, focusing on simple, yet representative, learning problems. Specifically, we characterize a local condition involving third-order derivatives that stabilizes oscillations of GD above the EoS, and leverage such property in a teacher-student setting, under population loss. Finally, focusing on Matrix Factorization, we establish a non-asymptotic `Local Implicit Bias' of GD above the EoS, whereby quasi-symmetric initializations converge to symmetric solutions -- where sharpness is minimum amongst all minimisers.
    Composite Spatial Monte Carlo Integration Based on Generalized Least Squares. (arXiv:2204.03248v3 [stat.CO] UPDATED)
    Although evaluation of the expectations on the Ising model is essential in various applications, it is mostly infeasible because of intractable multiple summations. Spatial Monte Carlo integration (SMCI) is a sampling-based approximation that can provide high-accuracy estimates of such intractable expectations. To evaluate the expectation of a function of variables in a specific region (called the target region), SMCI considers a larger region containing the target region (called the sum region). In SMCI, the multiple summation for the variables in the sum region is executed exactly, and that in the outer region is evaluated by a sampling approximation such as standard Monte Carlo integration. It is guaranteed that the accuracy of the SMCI estimator improves monotonically as the size of the sum region increases. However, a haphazard expansion of the sum region could cause a combinatorial explosion, so we aim to improve the accuracy without such an expansion. In this paper, based on the theory of generalized least squares (GLS), a new effective method is proposed by combining multiple SMCI estimators. The validity of the proposed method is demonstrated theoretically and numerically. The results indicate that the proposed method can be effective in the inverse Ising problem (or Boltzmann machine learning).
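    A first-order instance of the idea for an Ising model $P(x) \propto \exp(x^\top J x / 2 + h^\top x)$ with $x \in \{-1,+1\}^d$ (a minimal sketch of a single SMCI estimator; the paper's GLS combination of multiple estimators is not shown):

        import numpy as np

        def smci_first_order(samples, J, h, i):
            # samples: (m, d) array of +/-1 configurations drawn from the model;
            # J symmetric with zero diagonal. The sum over spin i is done exactly
            # via its conditional expectation given the sampled neighbors:
            # E[x_i | x_rest] = tanh(sum_j J_ij x_j + h_i)
            field = samples @ J[i] + h[i]
            return np.tanh(field).mean()    # lower-variance estimate of E[x_i]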
    Attraction-Repulsion Spectrum in Neighbor Embeddings. (arXiv:2007.08902v4 [cs.LG] UPDATED)
    Neighbor embeddings are a family of methods for visualizing complex high-dimensional datasets using $k$NN graphs. To find the low-dimensional embedding, these algorithms combine an attractive force between neighboring pairs of points with a repulsive force between all points. One of the most popular examples of such algorithms is t-SNE. Here we empirically show that changing the balance between the attractive and the repulsive forces in t-SNE using the exaggeration parameter yields a spectrum of embeddings, which is characterized by a simple trade-off: stronger attraction can better represent continuous manifold structures, while stronger repulsion can better represent discrete cluster structures and yields higher $k$NN recall. We find that UMAP embeddings correspond to t-SNE with increased attraction; mathematical analysis shows that this is because the negative sampling optimisation strategy employed by UMAP strongly lowers the effective repulsion. Likewise, ForceAtlas2, commonly used for visualizing developmental single-cell transcriptomic data, yields embeddings corresponding to t-SNE with the attraction increased even more. At the extreme of this spectrum lie Laplacian Eigenmaps. Our results demonstrate that many prominent neighbor embedding algorithms can be placed onto the attraction-repulsion spectrum, and highlight the inherent trade-offs between them.
    A Mixing Time Lower Bound for a Simplified Version of BART. (arXiv:2210.09352v1 [stat.ML])
    Bayesian Additive Regression Trees (BART) is a popular Bayesian non-parametric regression algorithm. The posterior is a distribution over sums of decision trees, and predictions are made by averaging approximate samples from the posterior. The combination of strong predictive performance and the ability to provide uncertainty measures has led BART to be commonly used in the social sciences, biostatistics, and causal inference. BART uses Markov Chain Monte Carlo (MCMC) to obtain approximate posterior samples over a parameterized space of sums of trees, but it has often been observed that the chains are slow to mix. In this paper, we provide the first lower bound on the mixing time for a simplified version of BART in which we reduce the sum to a single tree and use a subset of the possible moves for the MCMC proposal distribution. Our lower bound for the mixing time grows exponentially with the number of data points. Inspired by this new connection between the mixing time and the number of data points, we perform rigorous simulations on BART. We show qualitatively that BART's mixing time increases with the number of data points. The slow mixing time of the simplified BART suggests a large variation between different runs of the simplified BART algorithm and a similar large variation is known for BART in the literature. This large variation could result in a lack of stability in the models, predictions, and posterior intervals obtained from the BART MCMC samples. Our lower bound and simulations suggest increasing the number of chains with the number of data points.
    Scaling Adversarial Training to Large Perturbation Bounds. (arXiv:2210.09852v1 [cs.LG])
    The vulnerability of Deep Neural Networks to Adversarial Attacks has fuelled research towards building robust models. While most Adversarial Training algorithms aim at defending attacks constrained within low magnitude Lp norm bounds, real-world adversaries are not limited by such constraints. In this work, we aim to achieve adversarial robustness within larger bounds, against perturbations that may be perceptible, but do not change human (or Oracle) prediction. The presence of images that flip Oracle predictions and those that do not makes this a challenging setting for adversarial robustness. We discuss the ideal goals of an adversarial defense algorithm beyond perceptual limits, and further highlight the shortcomings of naively extending existing training algorithms to higher perturbation bounds. In order to overcome these shortcomings, we propose a novel defense, Oracle-Aligned Adversarial Training (OA-AT), to align the predictions of the network with that of an Oracle during adversarial training. The proposed approach achieves state-of-the-art performance at large epsilon bounds (such as an L-inf bound of 16/255 on CIFAR-10) while outperforming existing defenses (AWP, TRADES, PGD-AT) at standard bounds (8/255) as well.
    Theoretical Guarantees for Permutation-Equivariant Quantum Neural Networks. (arXiv:2210.09974v1 [quant-ph])
    Despite the great promise of quantum machine learning models, there are several challenges one must overcome before unlocking their full potential. For instance, models based on quantum neural networks (QNNs) can suffer from excessive local minima and barren plateaus in their training landscapes. Recently, the nascent field of geometric quantum machine learning (GQML) has emerged as a potential solution to some of those issues. The key insight of GQML is that one should design architectures, such as equivariant QNNs, encoding the symmetries of the problem at hand. Here, we focus on problems with permutation symmetry (i.e., the group of symmetry $S_n$), and show how to build $S_n$-equivariant QNNs. We provide an analytical study of their performance, proving that they do not suffer from barren plateaus, quickly reach overparametrization, and can generalize well from small amounts of data. To verify our results, we perform numerical simulations for a graph state classification task. Our work provides the first theoretical guarantees for equivariant QNNs, thus indicating the extreme power and potential of GQML.
    Differentially Private Diffusion Models. (arXiv:2210.09929v1 [stat.ML])
    While modern machine learning models rely on increasingly large training datasets, data is often limited in privacy-sensitive domains. Generative models trained with differential privacy (DP) on sensitive data can sidestep this challenge, providing access to synthetic data instead. However, training DP generative models is highly challenging due to the noise injected into training to enforce DP. We propose to leverage diffusion models (DMs), an emerging class of deep generative models, and introduce Differentially Private Diffusion Models (DPDMs), which enforce privacy using differentially private stochastic gradient descent (DP-SGD). We motivate why DP-SGD is well suited for training DPDMs, and thoroughly investigate the DM parameterization and the sampling algorithm, which turn out to be crucial ingredients in DPDMs. Furthermore, we propose noise multiplicity, a simple yet powerful modification of the DM training objective tailored to the DP setting to boost performance. We validate our novel DPDMs on widely-used image generation benchmarks and achieve state-of-the-art (SOTA) performance by large margins. For example, on MNIST we improve the SOTA FID from 48.4 to 5.01 and downstream classification accuracy from 83.2% to 98.1% for the privacy setting DP-$(\varepsilon{=}10, \delta{=}10^{-5})$. Moreover, on standard benchmarks, classifiers trained on DPDM-generated synthetic data perform on par with task-specific DP-SGD-trained classifiers, which has not been demonstrated before for DP generative models. Project page and code: https://nv-tlabs.github.io/DPDM.
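    The underlying DP-SGD step the paper builds on, written out schematically (the proposed noise-multiplicity modification is specific to the diffusion training objective and is not shown):

        import numpy as np

        def dp_sgd_step(params, per_example_grads, lr, clip_norm, noise_mult):
            # clip each per-example gradient to bound its sensitivity
            clipped = [g * min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
                       for g in per_example_grads]
            g_bar = np.mean(clipped, axis=0)
            # Gaussian noise calibrated to the clipping norm and noise multiplier
            noise = np.random.normal(0.0, noise_mult * clip_norm / len(clipped),
                                     size=g_bar.shape)
            return params - lr * (g_bar + noise)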
    A tradeoff between universality of equivariant models and learnability of symmetries. (arXiv:2210.09444v1 [stat.ML])
    We prove an impossibility result, which in the context of function learning says the following: under certain conditions, it is impossible to simultaneously learn symmetries and functions equivariant under them using an ansatz consisting of equivariant functions. To formalize this statement, we carefully study notions of approximation for groups and semigroups. We analyze certain families of neural networks for whether they satisfy the conditions of the impossibility result: what we call ``linearly equivariant'' networks, and group-convolutional networks. A lot can be said precisely about linearly equivariant networks, making them theoretically useful. On the practical side, our analysis of group-convolutional neural networks allows us to generalize the well-known ``convolution is all you need'' theorem to non-homogeneous spaces. We additionally find an important difference between group convolution and semigroup convolution.
    Private Stochastic Optimization in the Presence of Outliers: Optimal Rates for (Non-Smooth) Convex Losses and Extension to Non-Convex Losses. (arXiv:2209.07403v2 [cs.LG] UPDATED)
    We study differentially private (DP) stochastic optimization (SO) with data containing outliers and loss functions that are (possibly) not Lipschitz continuous. To date, the vast majority of work on DP SO assumes that the loss is uniformly Lipschitz over data (i.e. stochastic gradients are uniformly bounded over all data points). While this assumption is convenient, it is often unrealistic: in many practical problems, the loss function may not be uniformly Lipschitz. Even when the loss function is Lipschitz continuous, the worst-case Lipschitz parameter of the loss over all data points may be extremely large due to outliers. In such cases, the error bounds for DP SO, which scale with the worst-case Lipschitz parameter of the loss, are vacuous. To address these limitations, this work does not require the loss function to be uniformly Lipschitz. Instead, building on a recent line of work [WXDX20, KLZ22], we make the weaker assumption that stochastic gradients have bounded $k$-th order moments for some $k \geq 2$. Compared with works on DP Lipschitz SO, our excess risk scales with the $k$-th moment bound instead of the Lipschitz parameter of the loss, allowing for significantly faster rates in the presence of outliers. For convex and strongly convex loss functions, we provide the first asymptotically optimal excess risk bounds (up to a logarithmic factor). In contrast to the prior works, our bounds do not require the loss function to be differentiable/smooth. We also devise an accelerated algorithm for smooth losses that runs in linear time and has excess risk that is tight in certain practical parameter regimes. Additionally, our work is the first to address non-convex non-Lipschitz loss functions satisfying the Proximal-PL inequality; this covers some practical machine learning models. Our Proximal-PL algorithm has near-optimal excess risk.
    Robust Reinforcement Learning using Offline Data. (arXiv:2208.05129v2 [cs.LG] UPDATED)
    The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems.
    Locally Smoothed Gaussian Process Regression. (arXiv:2210.09998v1 [stat.ML])
    We develop a novel framework to accelerate Gaussian process regression (GPR). In particular, we consider localization kernels at each data point to down-weight the contributions from other data points that are far away, and we derive the GPR model stemming from the application of such localization operation. Through a set of experiments, we demonstrate the competitive performance of the proposed approach compared to full GPR, other localized models, and deep Gaussian processes. Crucially, these performances are obtained with considerable speedups compared to standard global GPR due to the sparsification effect of the Gram matrix induced by the localization operation.
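    A rough sketch of the localization idea as we read it (illustrative only; in particular, a real implementation must use a window that keeps the product kernel positive semi-definite, which the triangular window below guarantees only for 1-D inputs):

        import numpy as np

        def localized_gp_mean(X, y, X_star, lengthscale=1.0, radius=3.0, noise=1e-2):
            def sqdist(A, B):
                return ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)

            def kernel(A, B):
                base = np.exp(-0.5 * sqdist(A, B) / lengthscale ** 2)
                window = np.clip(1.0 - np.sqrt(sqdist(A, B)) / radius, 0.0, None)
                return base * window   # contributions vanish beyond `radius`

            K = kernel(X, X) + noise * np.eye(len(X))
            return kernel(X_star, X) @ np.linalg.solve(K, y)   # posterior mean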
    When Does Group Invariant Learning Survive Spurious Correlations?. (arXiv:2206.14534v2 [cs.LG] UPDATED)
    By inferring latent groups in the training data, recent works introduce invariant learning to the case where environment annotations are unavailable. Typically, learning group invariance under a majority/minority split is empirically shown to be effective in improving out-of-distribution generalization on many datasets. However, theoretical guarantee for these methods on learning invariant mechanisms is lacking. In this paper, we reveal the insufficiency of existing group invariant learning methods in preventing classifiers from depending on spurious correlations in the training set. Specifically, we propose two criteria on judging such sufficiency. Theoretically and empirically, we show that existing methods can violate both criteria and thus fail in generalizing to spurious correlation shifts. Motivated by this, we design a new group invariant learning method, which constructs groups with statistical independence tests, and reweights samples by group label proportion to meet the criteria. Experiments on both synthetic and real data demonstrate that the new method significantly outperforms existing group invariant learning methods in generalizing to spurious correlation shifts.
    Causal inference of General Treatment Effects using Neural Networks with A Diverging Number of Confounders. (arXiv:2009.07055v4 [stat.ME] UPDATED)
    The estimation of causal effects is a primary goal of behavioral, social, economic and biomedical sciences. Under the unconfoundedness condition, adjustment for confounders requires estimating the nuisance functions relating outcome and/or treatment to confounders. This paper considers a generalized optimization framework for efficient estimation of general treatment effects using feedforward artificial neural networks (ANNs) when the number of covariates is allowed to increase with the sample size. We estimate the nuisance function by ANNs, and develop a new approximation error bound for the ANNs approximators when the nuisance function belongs to a mixed Sobolev space. We show that the ANNs can alleviate the curse of dimensionality under this circumstance. We further establish the consistency and asymptotic normality of the proposed treatment effects estimators, and apply a weighted bootstrap procedure for conducting inference. The proposed methods are illustrated via simulation studies and a real data application.
    Transfer learning with affine model transformation. (arXiv:2210.09745v1 [stat.ML])
    Supervised transfer learning (TL) has received considerable attention because of its potential to boost the predictive power of machine learning in cases with limited data. In a conventional scenario, cross-domain differences are modeled and estimated using a given set of source models and samples from a target domain. For example, if there is a functional relationship between source and target domains, only domain-specific factors are additionally learned using target samples to shift the source models to the target. However, the general methodology for modeling and estimating such cross-domain shifts has been less studied. This study presents a TL framework that simultaneously and separately estimates domain shifts and domain-specific factors using given target samples. Assuming consistency and invertibility of the domain transformation functions, we derive an optimal family of functions to represent the cross-domain shift. The newly derived class of transformation functions takes the same form as invertible neural networks using affine coupling layers, which are widely used in generative deep learning. We show that the proposed method encompasses a wide range of existing methods, including the most common TL procedure based on feature extraction using neural networks. We also clarify the theoretical properties of the proposed method, such as the convergence rate of the generalization error, and demonstrate the practical benefits of separately modeling and estimating domain-specific factors through several case studies.
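    A minimal sketch of an affine model shift under our reading of the abstract: the target function is modeled as $g_1(x) f_s(x) + g_2(x)$, with $g_1, g_2$ fit on the few target samples (here by a single ridge regression; all names are illustrative, not the paper's code):

        import numpy as np

        def fit_affine_transfer(f_source, feats, X_t, y_t, lam=1e-2):
            F = feats(X_t)                        # (n, k) feature matrix
            s = f_source(X_t)                     # source-model predictions
            Phi = np.hstack([F * s[:, None], F])  # columns for g1 and g2
            w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]),
                                Phi.T @ y_t)
            k = F.shape[1]

            def f_target(X):
                Fx = feats(X)
                return (Fx @ w[:k]) * f_source(X) + Fx @ w[k:]
            return f_target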
    Shallow and Deep Nonparametric Convolutions for Gaussian Processes. (arXiv:2206.08972v2 [stat.ML] UPDATED)
    A key challenge in the practical application of Gaussian processes (GPs) is selecting a proper covariance function. The moving average, or process convolutions, construction of GPs allows some additional flexibility, but still requires choosing a proper smoothing kernel, which is non-trivial. Previous approaches have built covariance functions by using GP priors over the smoothing kernel, and by extension the covariance, as a way to bypass the need to specify it in advance. However, such models have been limited in several ways: they are restricted to single dimensional inputs, e.g. time; they only allow modelling of single outputs and they do not scale to large datasets since inference is not straightforward. In this paper, we introduce a nonparametric process convolution formulation for GPs that alleviates these weaknesses by using a functional sampling approach based on Matheron's rule to perform fast sampling using interdomain inducing variables. Furthermore, we propose a composition of these nonparametric convolutions that serves as an alternative to classic deep GP models, and allows the covariance functions of the intermediate layers to be inferred from the data. We test the performance of our model on benchmarks for single output GPs, multiple output GPs and deep GPs and find that our approach can provide improvements over standard GP models, particularly for larger datasets.  ( 3 min )
    Amortized Inference for Causal Structure Learning. (arXiv:2205.12934v2 [cs.LG] UPDATED)
    Inferring causal structure poses a combinatorial search problem that typically involves evaluating structures with a score or independence test. The resulting search is costly, and designing suitable scores or tests that capture prior knowledge is difficult. In this work, we propose to amortize causal structure learning. Rather than searching over structures, we train a variational inference model to predict the causal structure from observational or interventional data. This allows us to bypass both the search over graphs and the hand-engineering of suitable score functions. Instead, our inference model acquires domain-specific inductive biases for causal discovery solely from data generated by a simulator. The architecture of our inference model emulates permutation invariances that are crucial for statistical efficiency in structure learning, which facilitates generalization to significantly larger problem instances than seen during training. On synthetic data and semisynthetic gene expression data, our models exhibit robust generalization capabilities when subject to substantial distribution shifts and significantly outperform existing algorithms, especially in the challenging genomics domain. Our code and models are publicly available at: https://github.com/larslorch/avici.  ( 2 min )
    Learning Sparse Fixed-Structure Gaussian Bayesian Networks. (arXiv:2107.10450v3 [cs.DS] UPDATED)
    Gaussian Bayesian networks (a.k.a. linear Gaussian structural equation models) are widely used to model causal interactions among continuous variables. In this work, we study the problem of learning a fixed-structure Gaussian Bayesian network up to a bounded error in total variation distance. We analyze the commonly used node-wise least squares regression (LeastSquares) and prove that it has a near-optimal sample complexity. We also study a couple of new algorithms for the problem:
    - BatchAvgLeastSquares takes the average of several batches of least squares solutions at each node, so that one can interpolate between the batch size and the number of batches. We show that BatchAvgLeastSquares also has near-optimal sample complexity.
    - CauchyEst takes the median of solutions to several batches of linear systems at each node. We show that the algorithm specialized to polytrees, CauchyEstTree, has near-optimal sample complexity.
    Experimentally, we show that for uncontaminated, realizable data, the LeastSquares algorithm performs best, but in the presence of contamination or DAG misspecification, CauchyEst/CauchyEstTree and BatchAvgLeastSquares respectively perform better.  ( 3 min )
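    The LeastSquares baseline in its simplest form: with the DAG structure known, regress each node on its parents to recover the edge coefficients (a sketch; noise-variance estimation and the robust variants are omitted):

        import numpy as np

        def nodewise_least_squares(samples, parents):
            """samples: (n, d) data matrix; parents: dict node -> list of parents."""
            coeffs = {}
            for j, pa in parents.items():
                if pa:
                    coeffs[j], *_ = np.linalg.lstsq(
                        samples[:, pa], samples[:, j], rcond=None)
                else:
                    coeffs[j] = np.array([])    # root node: no incoming edges
            return coeffs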
    Bagged $k$-Distance for Mode-Based Clustering Using the Probability of Localized Level Sets. (arXiv:2210.09786v1 [stat.ML])
    In this paper, we propose an ensemble learning algorithm named \textit{bagged $k$-distance for mode-based clustering} (\textit{BDMBC}) by putting forward a new measurement called the \textit{probability of localized level sets} (\textit{PLLS}), which enables us to find all clusters for varying densities with a global threshold. On the theoretical side, we show that with a properly chosen number of nearest neighbors $k_D$ in the bagged $k$-distance, the sub-sample size $s$, the bagging rounds $B$, and the number of nearest neighbors $k_L$ for the localized level sets, BDMBC can achieve optimal convergence rates for mode estimation. It turns out that with a relatively small $B$, the sub-sample size $s$ can be much smaller than the number of training data $n$ at each bagging round, and the number of nearest neighbors $k_D$ can be reduced simultaneously. Moreover, we establish optimal convergence results for the level set estimation of the PLLS in terms of Hausdorff distance, which reveals that BDMBC can find localized level sets for varying densities and thus enjoys local adaptivity. On the practical side, we conduct numerical experiments to empirically verify the effectiveness of BDMBC for mode estimation and level set estimation, which demonstrates the promising accuracy and efficiency of our proposed algorithm.  ( 2 min )
    Contextual bandits with concave rewards, and an application to fair ranking. (arXiv:2210.09957v1 [cs.LG])
    We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.  ( 2 min )
    Random Orthogonalization for Federated Learning in Massive MIMO Systems. (arXiv:2210.09881v1 [cs.IT])
    We propose a novel communication design, termed random orthogonalization, for federated learning (FL) in a massive multiple-input and multiple-output (MIMO) wireless system. The key novelty of random orthogonalization comes from the tight coupling of FL and two unique characteristics of massive MIMO -- channel hardening and favorable propagation. As a result, random orthogonalization can achieve natural over-the-air model aggregation without requiring transmitter side channel state information (CSI) for the uplink phase of FL, while significantly reducing the channel estimation overhead at the receiver. We extend this principle to the downlink communication phase and develop a simple but highly effective model broadcast method for FL. We also relax the massive MIMO assumption by proposing an enhanced random orthogonalization design for both uplink and downlink FL communications, that does not rely on channel hardening or favorable propagation. Theoretical analyses with respect to both communication and machine learning performance are carried out. In particular, an explicit relationship among the convergence rate, the number of clients, and the number of antennas is established. Experimental results validate the effectiveness and efficiency of random orthogonalization for FL in massive MIMO.  ( 2 min )
    $k$-Means Clustering for Persistent Homology. (arXiv:2210.10003v1 [stat.AP])
    Persistent homology is a fundamental methodology from topological data analysis that summarizes the lifetimes of topological features within a dataset as a persistence diagram; it has recently gained much popularity from its myriad successful applications to many domains. However, a significant challenge to its widespread implementation, especially in statistical methodology and machine learning algorithms, is the format of the persistence diagram as a multiset of half-open intervals. In this paper, we comprehensively study $k$-means clustering where the input is various embeddings of persistence diagrams, as well as persistence diagrams themselves and their generalizations as persistence measures. We show that clustering directly on persistence diagrams and measures far outperforms clustering on their vectorized representations, despite the more complex structure of diagrams and measures. Moreover, we prove convergence of the algorithm on persistence diagram space and establish theoretical properties of the solution to the optimization problem in the Karush--Kuhn--Tucker framework.  ( 2 min )
    Importance Weighting Correction of Regularized Least-Squares for Covariate and Target Shifts. (arXiv:2210.09709v1 [stat.ML])
    In many real-world problems, the training data and test data have different distributions. This situation is commonly referred to as dataset shift. The most common settings for dataset shift considered in the literature are {\em covariate shift} and {\em target shift}. Importance weighting (IW) correction is a universal method for correcting the bias present in learning scenarios under dataset shift. The question one may ask is: does IW correction work equally well for different dataset shift scenarios? By investigating the generalization properties of weighted kernel ridge regression (W-KRR) under covariate and target shifts, we show that the answer is negative, except when IW is bounded and the model is well-specified. In the latter cases, minimax optimal rates are achieved by importance-weighted kernel ridge regression (IW-KRR) in both covariate and target shift scenarios. Slightly relaxing the boundedness condition on the IW, we show that IW-KRR still achieves the optimal rates under target shift while leading to slower rates for covariate shift. In the case of model misspecification, we show that the performance of W-KRR under covariate shift can be substantially improved by designing an alternative reweighting function. The distinction between misspecified and well-specified scenarios does not seem to be crucial in learning problems under target shift.  ( 3 min )
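    The weighted estimator at the center of this analysis has a simple closed form. As a hedged sketch under assumed notation (w are importance weights, e.g. density-ratio estimates p_test/p_train under covariate shift), the representer theorem gives the coefficients below; this is generic weighted KRR, not the paper's specific reweighting designs.

    ```python
    import numpy as np

    def iw_krr_fit(X, y, w, lam, gamma):
        """Importance-weighted kernel ridge regression with an RBF kernel.

        Minimizes (1/n) * sum_i w_i * (y_i - f(x_i))^2 + lam * ||f||_H^2;
        setting the gradient to zero yields alpha = (W K + n*lam*I)^{-1} W y.
        """
        n = X.shape[0]
        sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        K = np.exp(-gamma * sq_dists)   # Gram matrix
        W = np.diag(w)                  # importance weights on the diagonal
        alpha = np.linalg.solve(W @ K + n * lam * np.eye(n), W @ y)
        return alpha, K

    # In-sample predictions: K @ alpha. Unbounded weights are exactly the
    # regime where the paper shows rates can degrade under covariate shift.
    ```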
    SQ Lower Bounds for Learning Single Neurons with Massart Noise. (arXiv:2210.09949v1 [cs.LG])
    We study the problem of PAC learning a single neuron in the presence of Massart noise. Specifically, for a known activation function $f: \mathbb{R} \to \mathbb{R}$, the learner is given access to labeled examples $(\mathbf{x}, y) \in \mathbb{R}^d \times \mathbb{R}$, where the marginal distribution of $\mathbf{x}$ is arbitrary and the corresponding label $y$ is a Massart corruption of $f(\langle \mathbf{w}, \mathbf{x} \rangle)$. The goal of the learner is to output a hypothesis $h: \mathbb{R}^d \to \mathbb{R}$ with small squared loss. For a range of activation functions, including ReLUs, we establish super-polynomial Statistical Query (SQ) lower bounds for this learning problem. In more detail, we prove that no efficient SQ algorithm can approximate the optimal error within any constant factor. Our main technical contribution is a novel SQ-hard construction for learning $\{ \pm 1\}$-weight Massart halfspaces on the Boolean hypercube that is interesting in its own right.  ( 2 min )

  • Open

    [D] GPT-3 is a DREAM for citation-farmers - Threat Model Tuesday #1
    GPT-3 and the multitude of similar models that have come out over the last couple of years likely represent a serious threat to scientific conferences. What can we do about it? Some historical context: computer-generated papers have been showing up in major publications since a model called SCIgen was released in 2005. SCIgen uses a simple context-free grammar to produce templatized papers that are basically pseudo-scientific gibberish. People are still finding those papers many years later. These papers are generally churned out to inflate citation statistics, or by well-meaning researchers probing suspected low publication standards at existing conferences (which I generally don’t recommend, since it adds to the mountains of papers that we have to churn through as reviewers). There’s a neglig…  ( 126 min )
    [R] Visual Reinforcement Learning with Self-Supervised 3D Representations
    Project page: https://yanjieze.com/3d4rl/ https://reddit.com/link/y7l4fg/video/25479kzr5nu91/player submitted by /u/TheLastRefugee [link] [comments]  ( 125 min )
    [R] Action-conditioned On-demand Motion Generation [ACMMM 2022]
    Paper: https://dl.acm.org/doi/10.1145/3503161.3548287 Project Page: https://roychowdhuryresearch.github.io/ODMO_ACMMM2022/ Github: https://github.com/roychowdhuryresearch/ODMO https://reddit.com/link/y7jdz4/video/rqdi20lntmu91/player submitted by /u/CapurLu [link] [comments]  ( 125 min )
    [D] any model / existing tool of redacting passwords/tokens from files?
    I have a massive amount of log files from various products, and they might contain passwords or tokens. Obviously there is no single pattern to identify them. I know that Amazon Comprehend can detect some passwords, but instead of building my own model, it feels like someone must have encountered this before. Is there any tool or library that performs this logic? submitted by /u/Arik1313 [link] [comments]  ( 125 min )
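    Dedicated open-source secret scanners exist for exactly this (e.g. gitleaks, truffleHog, and Yelp's detect-secrets), and most combine two heuristics: regexes for well-known credential formats, plus an entropy check for random-looking strings. A minimal hypothetical sketch of that combination (patterns and thresholds are illustrative, not a vetted ruleset):

    ```python
    import math
    import re

    KNOWN_PATTERNS = [
        re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key ID format
        re.compile(r"(?i)(password|passwd|pwd|token|secret)\s*[:=]\s*\S+"),
    ]

    def shannon_entropy(s):
        probs = [s.count(c) / len(s) for c in set(s)]
        return -sum(p * math.log2(p) for p in probs)

    def redact_line(line, entropy_threshold=4.0, min_len=16):
        # first, known credential shapes
        for pat in KNOWN_PATTERNS:
            line = pat.sub("[REDACTED]", line)
        # then, long high-entropy tokens that look like random secrets
        return " ".join(
            "[REDACTED]"
            if len(w) >= min_len and shannon_entropy(w) > entropy_threshold
            else w
            for w in line.split()
        )
    ```

    Expect false positives (hashes, UUIDs) and false negatives (human-chosen passwords have low entropy), which is why the dedicated tools ship large curated rule sets.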
    [P] ONNX model analysis tool in Rust
    I recently wrote an ONNX model analysis tool as a means of exploring the ONNX format. It's capable of summarizing key model stats and creating visualizations - all from your terminal! https://reddit.com/link/y79lng/video/8ik0283bnku91/player Check it out here: https://github.com/FL33TW00D/steelix Disclaimer: It's very much early days and may not work 100% for your model! submitted by /u/BlockDesigns [link] [comments]  ( 124 min )
    [D] What is a free tool for generating image segmentation masks?
    Need to teach a bunch of radiologists how to annotate the region of interest in an image for a segmentation task. I have been using the VGG online tool, but now, we need something that's easier and does not require as many clicks to get the work done. What's a good option for that? We had a look at solutions from Plainsight.ai and Labelbox. Are there any other alternatives? submitted by /u/l34df4rm3r [link] [comments]  ( 123 min )
    [D] How frustrating are the ML interviews these days!!! TOP 3% interview joke
    Hi all, just want to share my recent experience with you. I'm an ML engineer with 4 years of experience, mostly in NLP. Recently I needed a remote job, so I applied to company X, which claims to hire the top 3% (no one knows how they got this number). I applied two times. The first time, I passed the coding test but failed the technical interview because I wasn't able to solve 2 questions within 30 min (I solved the first one and almost had the second before time was up). Second trial: I acknowledged my weaknesses and grinded Leetcode for a while (since this is all that seems to matter these days to get a job), and applied again. This time I moved to the technical interview phase directly, again chatted a bit (what you say about your experience doesn't seem to matter at all) and he gave me a dat…  ( 155 min )
    [P] Stochastic Differentiable Programming: Unbiased Automatic Differentiation for Discrete Stochastic Programs (such as particle filters, agent-based models, and more!)
    Sharing our paper accepted at NeurIPS: Automatic Differentiation of Programs with Discrete Randomness. A summary is given in a Twitter thread, which gives a high-level overview of how the method works. The core idea behind the paper is the following: if we treat a program as a random variable X(p), can we come up with a number-type definition, in the spirit of forward-mode dual numbers, such that we get two random variables (X(p), Y(p)) with E[Y(p)] = dE[X(p)]/dp? We give a detailed derivation of how to do this properly. The result is an unbiased, low-variance, and fully automatic method for differentiating such programs. A fairly optimized implementation of the method is available as an open-source package: https://github.com/gaurav-arya/StochasticAD.jl There are still many things to do in this area. For example, from this it should be easy to train neural networks to generate differential equations of mean behavior over time which match the statistics of an agent-based model, but we can't say we've tried all of the applications. Also, there's a lot more to do in terms of compiler optimizations for this new AD system. But even if you don't use it, the idea is fun and cool and you should check it out! submitted by /u/ChrisRackauckas [link] [comments]  ( 130 min )
    [P] XAI Recipes for the HuggingFace 🤗 Image Classification Models
    Hi, Wanted to share a new tutorial that shows how to apply advanced XAI with the pytorch-grad-cam package, on all (or most of the?) vision models in HuggingFace. https://jacobgil.github.io/pytorch-gradcam-book/HuggingFace.html Applying XAI on a new model architecture with pytorch-grad-cam often involves understanding how the internal activations are shaped, and creating a reshape transform that brings them to a canonical expected shape that can be used for visualization. The goal here is to make XAI available for the new complex models and make it a real world tool that can be easily applied. submitted by /u/jacobgil [link] [comments]  ( 129 min )
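    For readers who haven't used the package: the key trick for transformers is a reshape transform that turns token activations back into a 2D feature map before the CAM computation. A condensed sketch along the lines of the tutorial's ViT recipe (layer choice and wrapper details vary by model family; treat the specifics as illustrative):

    ```python
    import torch
    from pytorch_grad_cam import GradCAM
    from transformers import ViTForImageClassification

    def reshape_transform(tensor, height=14, width=14):
        # drop the CLS token, then (batch, tokens, channels) -> (batch, channels, H, W)
        result = tensor[:, 1:, :].reshape(tensor.size(0), height, width, tensor.size(2))
        return result.permute(0, 3, 1, 2)

    class LogitsWrapper(torch.nn.Module):
        """HuggingFace models return ModelOutput objects; CAM wants raw logits."""
        def __init__(self, model):
            super().__init__()
            self.model = model
        def forward(self, x):
            return self.model(x).logits

    model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")
    target_layers = [model.vit.encoder.layer[-2].layernorm_before]
    cam = GradCAM(model=LogitsWrapper(model), target_layers=target_layers,
                  reshape_transform=reshape_transform)
    # grayscale_cam = cam(input_tensor=images, targets=None)  # (B, H, W) heatmaps
    ```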
    [D] What unique features could diffusion model-created images have that will help classify those fakes from real images?
    Just like the title, what kind of classification model would help with detecting diffusion-created fake images? Is there anything specific that such images have, that will help the classifier learn about it? (not the watermarks, as anyone can get rid of them while creating the images). Also, what kind of classifier would help with classifying between diffusion-created fakes and GAN-created fakes? submitted by /u/Top-Crab8018 [link] [comments]  ( 126 min )
    [P] Why I quit my lucrative job at Google to start Vectara (neural search as a service for developers everywhere)
    (Posting this on behalf of my cofounder, Amin Ahmad.) A friend of mine asked me recently why I decided to start Vectara. Now that we’ve left stealth mode and launched publicly, I thought some of you might also be wondering, so I hope the following paragraphs provide insight into those reasons. You are probably aware that the 2010s were marked by rapid advances in artificial intelligence, fueled by the dominance of deep neural networks as the tool of choice for machine learning. Perhaps no field advanced as rapidly as language understanding, which, at the dawn of the decade, and despite the existence of a huge body of linguistics research and decades of sustained effort, could not match the feats of comprehension regularly displayed by young children. Around 2016, m…  ( 126 min )
    [D] An idea about how memory works
    Guys, here is an idea about how memory really works. Because we say really, there will be no arithmetical operations (i.e., +, -, *, /) involved, let alone anything as complex as gradient descent; it is supposed to be purely biological. The whole idea is built solely on a long-proven neurochemical mechanism, synaptic plasticity, which says that the synaptic strength between two neurons enhances or weakens over time in response to their simultaneous firing. Or simply, "neurons that fire together wire together". Let us suppose that a neural network has two neurons A and B, and neuron A has a synaptic connection to neuron B. Now, for a stimulus on neuron A to make the two neurons fire together, the stimulus must trigger neuron A to fire, and afterward, the synaptic connection must transmit th…  ( 140 min )
    [D] Machine Learning conferences/journals with a mathematical slant?
    I'm an undergrad coming from an applied mathematics background, and have been fascinated by mathematical approaches to the foundations of deep learning and ML in general (e.g., geometric deep learning, Ising models). I'm currently working on a research project which is highly mathematical in flavour, and I was wondering if there are conferences, tracks, and/or journals geared towards more theoretical/mathematical results. Would also be great to hear about how such results might be received at major ML conferences like ICML. Thanks! submitted by /u/vajraadhvan [link] [comments]  ( 130 min )
    [D] Are new object detection architectures better in practice?
    So suppose we have been given an image and we have to return the number of objects of each class present in that image. Is going by the conventional route, i.e., first finding region proposal boxes and then applying classification, better than directly applying object detection architectures like YOLO and SSD? I know these architectures are relatively faster, but in my case training or prediction time is not an issue at all. For context, the image is different types of grains scattered on the ground. Any comment or suggestion would be much appreciated. submitted by /u/baby_yoda_fangay [link] [comments]  ( 126 min )
  • Open

    Trading generalized derivatives for classical ones
    Generalized functions have generalized derivatives. This is how we make sense of things like delta “functions” that are not functions, or functions that are not differentiable satisfying a differential equation. More on that here. A major theme in the modern approach to partial differential equations is to first look for solutions in a space of […] Trading generalized derivatives for classical ones first appeared on John D. Cook.  ( 5 min )
    The view from a Galilean moon
    The Galilean moons are the four largest moons of Jupiter, first observed by Galileo, contra Stigler’s law of eponymy. This post shows what the Jovian system looks like from the perspective of each of these moons, a sort of pre-Copernican perspective in a Jovian context. The view from Io Here’s what Europa would look like […] The view from a Galilean moon first appeared on John D. Cook.  ( 5 min )
    What if Copernicus had been a Martian?
    The genius of Copernicus was to realize that this plot becomes this plot under a clever change of coordinates. The first plot is the apparent motion of Mars, Jupiter, and Saturn as viewed from Earth over the course of 30 years. Of course Copernicus knew of Mercury and Venus as well, but I’ve omitted them […] What if Copernicus had been a Martian? first appeared on John D. Cook.  ( 4 min )
  • Open

    Visual Reinforcement Learning with Self-Supervised 3D Representations
    Project page: https://yanjieze.com/3d4rl/ https://reddit.com/link/y7l06v/video/xvqf5nvu4nu91/player submitted by /u/TheLastRefugee [link] [comments]  ( 118 min )
    REINFORCE Policy Gradient
    This is a follow-up on my previous post, REINFORCE on Frozen Lake, in which I needed help to see whether my state representation was correct or not. There was a comment suggesting using one-hot representations for my state and feature vectors, but for some reason that comment got removed. I applied those changes and my agent still fails. I tried applying grid search over the learning rates of w (value function parameter) and theta (policy parameter), but the agent just seems to be behaving randomly. Any ideas why? This is the colab notebook in case anyone wants to view it. PS: New to reddit, so my apologies if I'm breaking any guidelines; do let me know if I am. submitted by /u/anuraagshankar [link] [comments]  ( 118 min )
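    Since the question is whether the one-hot linear setup itself is sound, here is a minimal sketch of REINFORCE with a learned baseline over one-hot features for FrozenLake (names, learning rates, and the environment size are illustrative, not tuned):

    ```python
    import numpy as np

    n_states, n_actions = 16, 4               # FrozenLake 4x4
    theta = np.zeros((n_states, n_actions))  # policy parameters
    w = np.zeros(n_states)                    # baseline (value) parameters

    def one_hot(s):
        x = np.zeros(n_states)
        x[s] = 1.0
        return x

    def policy(s):
        prefs = theta.T @ one_hot(s)  # linear action preferences
        prefs -= prefs.max()          # numerically stable softmax
        p = np.exp(prefs)
        return p / p.sum()

    def reinforce_update(episode, alpha_theta=0.01, alpha_w=0.05, gamma=0.99):
        """episode: list of (state, action, reward) tuples."""
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + gamma * G
            x = one_hot(s)
            delta = G - w @ x         # advantage against the baseline
            w[:] += alpha_w * delta * x
            p = policy(s)
            # gradient of the log-softmax policy: x (e_a - pi)^T
            theta[:] += alpha_theta * (gamma ** t) * delta * np.outer(x, np.eye(n_actions)[a] - p)
    ```

    One caveat worth checking: FrozenLake's reward is sparse (only the goal pays off), so plain REINFORCE can look random for a long time; a baseline, small learning rates, and many episodes are usually needed before the policy visibly departs from uniform.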
    [D] Decision-time planning algorithms for discrete problems other than the AlphaZero family?
    I'm wondering if there are any other algorithms that perform decision-time planning on discrete problems? For continuous problems there is a bunch of different options, but I can't seem to find anything else for discrete ones. submitted by /u/gaymuslimsocialist [link] [comments]  ( 118 min )
    Action formulation from pytorch net
    Hello, I'm trying to apply deep reinforcement learning to a simulation I programmed. The simulation models the behavior of some number of electric vehicle users, tracking their energy consumption and location. When they are at a charging dock, the RL agent can distribute charge to them. I want my network to output a binary for each charging spot at each time step, i.e., 1 to give charge, 0 to not give charge. Is this feasible to formulate with PyTorch? If so, could you give me ideas on how to do so? Million thanks in advance. submitted by /u/arachnarus96 [link] [comments]  ( 120 min )
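    Yes, this is straightforward in PyTorch: give the network one output logit per charging spot and treat the action as a vector of independent Bernoullis (Gym calls this a MultiBinary action space). A minimal sketch under assumed names and dimensions:

    ```python
    import torch
    import torch.nn as nn

    class ChargingPolicy(nn.Module):
        def __init__(self, state_dim, n_spots, hidden=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, n_spots),  # one logit per charging spot
            )

        def forward(self, state):
            dist = torch.distributions.Bernoulli(logits=self.net(state))
            action = dist.sample()                    # 0/1 per spot
            log_prob = dist.log_prob(action).sum(-1)  # joint log-prob for policy gradients
            return action, log_prob

    policy = ChargingPolicy(state_dim=32, n_spots=10)
    action, log_prob = policy(torch.randn(1, 32))  # e.g. tensor([[1., 0., 1., ...]])
    ```

    The summed log-prob plugs directly into REINFORCE/PPO-style losses; for value-based methods you would instead need to enumerate or factorize the 2^n_spots joint actions.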
    skrl: Modular and Flexible Library for Reinforcement Learning
    skrl is an open-source modular library for Reinforcement Learning written in Python (using PyTorch) and designed with a focus on readability, simplicity, and transparency of algorithm implementation. In addition to supporting the OpenAI Gym and DeepMind environment interfaces, it allows loading and configuring NVIDIA Isaac Gym and NVIDIA Omniverse Isaac Gym environments, enabling agents' simultaneous training by scopes (subsets of environments among all available environments), which may or may not share resources, in the same run. Visit the documentation for usage details and examples: https://skrl.readthedocs.io/en/latest/ submitted by /u/Toni-SM [link] [comments]  ( 119 min )
    What is the difference between macro-action and option?
    submitted by /u/TK-SZ [link] [comments]  ( 117 min )
  • Open

    Why Neural Networks can learn (almost) anything
    submitted by /u/keghn [link] [comments]  ( 106 min )
    Need help w/ Mask RCNN Setup
    I have been trying to set up Mask RCNN by matterport on my PC and have failed many times. Does anyone know of a reliable tutorial whose setup actually worked for you? submitted by /u/ChocolateOwn7840 [link] [comments]  ( 110 min )
  • Open

    DSC Weekly 18 October 2022 – How the Pandemic Created a Time Warp
    Recessions, it has been said, are typically only visible in a rear-view mirror. In theory, we are in a recession when the economy shrinks over two quarters. In practice, however, not all recessions are the same. The post DSC Weekly 18 October 2022 – How the Pandemic Created a Time Warp appeared first on Data Science Central.  ( 23 min )
    Occam’s Razor – Simplify To Solve Problems
    Therefore, the golden principle to be at peace with these irritants in your life is simple: Simplify it. The post Occam’s Razor – Simplify To Solve Problems appeared first on Data Science Central.  ( 21 min )
    Dealing with the Ethical Workplace Dilemma of Emerging Tech
    With the increasing use of technology in the workplace, employers and employees face a new ethical dilemma. The adoption of new technologies has recently accelerated because of the worldwide pandemic sending several individuals from the office area to their residences. Indeed, various governments have reacted to the emergency by implementing strict regulations through lockdown and… Read More »Dealing with the Ethical Workplace Dilemma of Emerging Tech The post Dealing with the Ethical Workplace Dilemma of Emerging Tech appeared first on Data Science Central.  ( 20 min )
    How Data Visualization is Essential for the Banking and Finance Sector
    Representation of data using graphics such as charts, plots, infographics, heat maps, bubble clouds, scatter plots, mekko charts, animation, etc., is termed data visualization. Such visual displays and representation of information help communicate complex data relationships and data-driven insights in a way that makes it easy to understand and base decisions on. The post How Data Visualization is Essential for the Banking and Finance Sector appeared first on Data Science Central.  ( 21 min )
    Machine Learning Superstars: The Top 30 Influencers To Follow in 2023
    This list is based on LinkedIn. Criteria for selection include the number of followers (above 50k), the relevancy and contributions to the field, relevant education and professional experience, as well as recent activity. This list is in alphabetical order. Kirk Borne. Worldwide top influencer since 2013. Data Scientist. Global Speaker. Consultant. Astrophysicist. Space Scientist. Big… Read More »Machine Learning Superstars: The Top 30 Influencers To Follow in 2023 The post Machine Learning Superstars: The Top 30 Influencers To Follow in 2023 appeared first on Data Science Central.  ( 22 min )
    Drone-based IoT Connect Smart Islands
    As an application domain for IoT, smart cities initially showed promise but have struggled over the last decade. However, Smart islands (instead of smart cities) may be good for the success of the Internet of Things. The post Drone-based IoT Connect Smart Islands appeared first on Data Science Central.  ( 19 min )
  • Open

    I used Novel AI to generate this essay. I input the words “I am an AI and I will prove to you that I am sentient.” and I let the AI write the rest.
    submitted by /u/BrandNewLogicVL [link] [comments]  ( 109 min )
    Northeastern University's Institute of Experiential AI Seminar Series
    Join us for lunch and an engaging discussion at Northeastern University's Curry Student Center on Wednesday, 10/19! Or, join us virtually and BYOL for this seminar: "General Fine-Tuning (GFT): A Little Language for Deep Nets" Our Sr. Principal Research Scientist, Kenneth Church will introduce GFT and share how he 'read' 250M papers over his summer vacation submitted by /u/hadwahnos [link] [comments]  ( 109 min )
    Mix Your Face And A Style Together in Stable Diffusion!
    submitted by /u/PuppetHere [link] [comments]  ( 109 min )
    I made a movie about a movie that was made with AI. Trying to explain how it works and what I think.
    submitted by /u/vonderdeckentok [link] [comments]  ( 110 min )
    Whats the most advanced deepfake software available right now?
    My work is currently exploring ML and deepfake tech for animation, specifically for video games. After looking into it, there doesn't seem to be one particular program that stands above the rest. Do you have any suggestions on which software is best for advanced deformations based on large training/data models? submitted by /u/Qwerty177 [link] [comments]  ( 109 min )
    Protecting Human Art & Artists from AI?
    submitted by /u/JenkyMcJenkyPants [link] [comments]  ( 114 min )
    Breakthrough Tesla AI Supercomputer | New Google NLP Driven Robotics Perform Complex Multistep Tasks According To Speech Commands
    submitted by /u/kenickh [link] [comments]  ( 111 min )
    Research on customers&chatbots. You'll find insights on what to improve in your customer service automation
    Hi there! My team prepared a report on what customers love and hate about customer service chatbots. Together with our project managers and marketers, we analyzed data from thousands of conversations with real users and produced this research with insights, stats, and recommendations on how to overcome common struggles. This research would be helpful for: companies that have a chatbot that isn't performing well and don't know why; companies that want to implement a chatbot but are unsure about the best ways of creating one. Here are more details about our research: Use case: Customer support. Platform: Website. Language: English. Locations: Europe, USA. Industries: E-commerce, Retail, Retail health. Here's the link to the general article on the best chatbot practices, where you'll find the research. submitted by /u/Avandegraund [link] [comments]  ( 113 min )
    The Brilliant Language Of Lanes
    On Tesla’s AI day, the Autopilot team revealed the improvements and massive upgrades in their software. Overall, Full Self Driving (FSD) has received 35 software updates to date. Ashok Elluswamy, the Autopilot Director, announced that around 160,000 customers globally have been running the beta software of the autopilot and the self-driving system. This is a leap from 2,000 customers last year. https://analyticsindiamag.com/the-brilliant-language-of-lanes/ submitted by /u/analyticsindiam [link] [comments]  ( 108 min )
    Laion's new dataset shows how AI can help with AI training
    submitted by /u/Zirius_Sadfaces [link] [comments]  ( 108 min )
    AI for good using the AI Analytics toolkit powered by oneAPI “Good or Bad Pill?” by Eduardo Alvarez
    Learn about how AI is used to identify pills with issues before they get to you… submitted by /u/tonym-intel [link] [comments]  ( 109 min )
    Cyber Angel
    submitted by /u/widgia [link] [comments]  ( 108 min )
    I want to feed an AI a bunch of images to eventually let it categorize them, is there an open source API for this?
    Hi, I'm going to build a program that can take a bunch of images and categorize them. I'm thinking I first need an AI that I can feed a bunch of already-categorized images and say "this image is category A, this is B, this is also B", and so on, so that eventually it can take a bunch of new images and send them back categorized. Where do I begin? Any recommendations on where to start? There are so many text-to-image AIs out there, but I have no idea where to begin in this case. Thankful for any help! submitted by /u/hoozt [link] [comments]  ( 115 min )
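    What you are describing is supervised image classification, and the standard starting point is fine-tuning a pretrained network rather than training from scratch. A minimal sketch with torchvision (the data path and hyperparameters are placeholders; it assumes images sorted into one folder per category):

    ```python
    import torch
    import torch.nn as nn
    from torchvision import datasets, models, transforms

    tfm = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
        transforms.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
    ])
    # expects data/train/<category_name>/*.jpg
    train_set = datasets.ImageFolder("data/train", transform=tfm)
    loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, len(train_set.classes))  # new head

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(5):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    ```

    Hosted alternatives (AutoML-style image classification services) do the same thing without code, if training locally is not a requirement.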
    neograd - A deep learning framework created from scratch using Python and NumPy
    Hey everyone! I released v0.0.2 of neograd, a deep learning framework created from scratch using Python and NumPy, with automatic differentiation capabilities. I’d taken for granted that I understood how convolutions work. Just implement a sliding window, perform element-wise multiplication, take its sum - sounds so simple, right? Add to that accounting for the running time of the algorithm, the backward pass to get its gradients, and convolutions over volumes, and this turned out to be an excruciating undertaking. This release includes: gradient checking to check the correctness of gradients that are calculated by autograd; optimization algorithms like Momentum, RMSProp, and Adam; 2D and 3D convolution and 2D and 3D pooling layers for convolutional neural networks; saving trained models and weights to dis…  ( 113 min )
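    Gradient checking, the first feature listed, is worth illustrating since it is the standard way to validate any from-scratch autograd engine: compare the analytic gradient against central finite differences. A generic sketch of the idea (not neograd's actual API):

    ```python
    import numpy as np

    def grad_check(f, x, analytic_grad, eps=1e-6, tol=1e-5):
        """f: scalar function of a float numpy array x;
        analytic_grad: the gradient at x reported by the autograd engine."""
        numeric = np.zeros_like(x)
        it = np.nditer(x, flags=["multi_index"])
        while not it.finished:
            idx = it.multi_index
            orig = x[idx]
            x[idx] = orig + eps; f_plus = f(x)
            x[idx] = orig - eps; f_minus = f(x)
            x[idx] = orig
            numeric[idx] = (f_plus - f_minus) / (2 * eps)  # central difference
            it.iternext()
        # relative error is robust to the overall scale of the gradients
        denom = max(np.linalg.norm(numeric) + np.linalg.norm(analytic_grad), 1e-12)
        rel_err = np.linalg.norm(numeric - analytic_grad) / denom
        return rel_err < tol, rel_err
    ```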
    Clustering Pokémon in 15 minutes with Machine Learning
    Hey everyone, thought this would be a fun read where we demonstrate clustering on the Pokémon character statistics database in just 15 minutes. Read here if you're interested! If you have any question, feel free to ask away too. submitted by /u/PIEXCHANGE [link] [comments]  ( 108 min )
  • Open

    NVIDIA, Oracle CEOs in Fireside Chat Light Pathways to Enterprise AI
    Speeding adoption of enterprise AI and accelerated computing, Oracle CEO Safra Catz and NVIDIA founder and CEO Jensen Huang discussed their companies’ expanding collaboration in a fireside chat live streamed today from Oracle CloudWorld in Las Vegas. Oracle and NVIDIA announced plans to bring NVIDIA’s full accelerated computing stack to Oracle Cloud Infrastructure (OCI). It Read article > The post NVIDIA, Oracle CEOs in Fireside Chat Light Pathways to Enterprise AI appeared first on NVIDIA Blog.  ( 5 min )
    Meta’s Grand Teton Brings NVIDIA Hopper to Its Data Centers
    Meta today announced its next-generation AI platform, Grand Teton, including NVIDIA’s collaboration on design. Compared to the company’s previous generation Zion EX platform, the Grand Teton system packs in more memory, network bandwidth and compute capacity, said Alexis Bjorlin, vice president of Meta Infrastructure Hardware, at the 2022 OCP Global Summit, an Open Compute Project Read article > The post Meta’s Grand Teton Brings NVIDIA Hopper to Its Data Centers appeared first on NVIDIA Blog.  ( 4 min )
    Adobe MAX Kicks Off With Creative App Updates and 3D Artist Anna Natter Impresses This Week ‘In the NVIDIA Studio’
    Editor’s note: This post is part of our weekly In the NVIDIA Studio series, which celebrates featured artists, offers creative tips and tricks, and demonstrates how NVIDIA Studio technology improves creative workflows. In the coming weeks, we’ll be deep diving on new GeForce RTX 40 Series GPU features, technologies and resources, and how they dramatically Read article > The post Adobe MAX Kicks Off With Creative App Updates and 3D Artist Anna Natter Impresses This Week ‘In the NVIDIA Studio’ appeared first on NVIDIA Blog.  ( 10 min )
    Souped-Up Auto Quotes: ProovStation Delivers GPU-Driven AI Appraisals
    Vehicle appraisals are getting souped up with a GPU-accelerated AI overhaul. ProovStation, a four-year-old startup based in Lyon, France, is taking on the ambitious computer-vision quest of automating vehicle inspection and repair estimates, aiming AI-driven super-high-resolution stations at businesses worldwide. It recently launched three of its state-of-the-art vehicle inspection scanners at French retail giant Carrefour’s Read article > The post Souped-Up Auto Quotes: ProovStation Delivers GPU-Driven AI Appraisals appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Plotting the Training and Validation Loss Curves for the Transformer Model
    We have previously seen how to train the Transformer model for neural machine translation. Before moving on to inferencing the trained model, let us first explore how to modify the training code slightly to be able to plot the training and validation loss curves that can be generated during the learning process.  The training and […] The post Plotting the Training and Validation Loss Curves for the Transformer Model appeared first on Machine Learning Mastery.  ( 22 min )
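    The mechanics are framework-agnostic: record one training-loss and one validation-loss value per epoch inside the training loop, then plot both against the epoch index. A generic sketch (list names are illustrative; the post itself works through the Keras specifics):

    ```python
    import matplotlib.pyplot as plt

    def plot_loss_curves(train_losses, val_losses):
        """Both lists hold one average loss value per epoch."""
        epochs = range(1, len(train_losses) + 1)
        plt.plot(epochs, train_losses, label="training loss")
        plt.plot(epochs, val_losses, label="validation loss")
        plt.xlabel("epoch")
        plt.ylabel("loss")
        plt.legend()
        plt.title("Transformer training and validation loss")
        plt.show()
    ```

    A widening gap between the two curves is the usual early signal of overfitting; parallel curves that plateau suggest the model has stopped learning.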
    Attend the Data Science Symposium 2022
    Sponsored Post: The Center for Business Analytics at the University of Cincinnati will present its annual Data Science Symposium 2022 on November 8. This all-day in-person event will have three featured speakers and two tech talk tracks with four concurrent presentations in each track. The […] The post Attend the Data Science Symposium 2022 appeared first on Machine Learning Mastery.  ( 10 min )
  • Open

    Table Tennis: A Research Platform for Agile Robotics
    Posted by Avi Singh, Research Scientist, and Laura Graesser, Research Engineer, Robotics at Google Robot learning has been applied to a wide range of challenging real world tasks, including dexterous manipulation, legged locomotion, and grasping. It is less common to see robot learning applied to dynamic, high-acceleration tasks requiring tight-loop human-robot interactions, such as table tennis. There are two complementary properties of the table tennis task that make it interesting for robotic learning research. First, the task requires both speed and precision, which puts significant demands on a learning algorithm. At the same time, the problem is highly-structured (with a fixed, predictable environment) and naturally multi-agent (the robot can play with humans or another robot), makin…  ( 26 min )
  • Open

    AI artist showcase — Sundog
    Learn how artist Sundog uses his traditional filmmaking skills along with AI tools to create fantastic experiences  ( 10 min )
    Getting Started in AI , Data Science, and Machine Learning
    Lately, I have been getting many questions from folks who don’t have an engineering or computer science background, especially women…  ( 10 min )
    How Big is BIGDATA?
    How Big is it, exactly? Let’s study the newest data management techniques. Straight talk, no fuss!  ( 14 min )
    Customer-Churn Prediction Using Machine Learning
    Predicting the Telecom Customer Churn.  ( 10 min )
    Artificial Intelligence & Enterprise Content Management in the Future of Education-1
    AI and ECM technologies are already having a big impact on education these days, and that impact is only going to keep growing in the…  ( 7 min )
    Incorporating Sentiment Analysis into E-commerce
    Studying users’ behavior and understanding their sentiments have become essential to businesses with the increasing number of platforms operating…  ( 10 min )
  • Open

    Train a time series forecasting model faster with Amazon SageMaker Canvas Quick build
    Today, Amazon SageMaker Canvas introduces the ability to use the Quick build feature with time series forecasting use cases. This allows you to train models and generate the associated explainability scores in under 20 minutes, at which point you can generate predictions on new, unseen data. Quick build training enables faster experimentation to understand how […]  ( 8 min )
    Use Amazon SageMaker Canvas for exploratory data analysis
    Exploratory data analysis (EDA) is a common task performed by business analysts to discover patterns, understand relationships, validate assumptions, and identify anomalies in their data. In machine learning (ML), it’s important to first understand the data and its relationships before getting into model building. Traditional ML development cycles can sometimes take months and require advanced […]  ( 9 min )
    Run ensemble ML models on Amazon SageMaker
    Model deployment in machine learning (ML) is becoming increasingly complex. You want to deploy not just one ML model but large groups of ML models represented as ensemble workflows. These workflows are comprised of multiple ML models. Productionizing these ML models is challenging because you need to adhere to various performance and latency requirements. Amazon […]  ( 8 min )
  • Open

    SpA-Former: Transformer image shadow detection and removal via spatial attention. (arXiv:2206.10910v3 [cs.CV] UPDATED)
    In this paper, we propose an end-to-end SpA-Former to recover a shadow-free image from a single shaded image. Unlike traditional methods that require two steps, shadow detection followed by shadow removal, SpA-Former unifies these into a one-stage network that directly learns the mapping between shadowed and shadow-free images, without a separate shadow detection step. Thus, SpA-Former is adaptable to real image de-shadowing for shadows projected on different semantic regions. SpA-Former consists of a transformer layer, a series of joint Fourier transform residual blocks, and two-wheel joint spatial attention. The network is able to handle the task while achieving very fast processing efficiency. Our code is released at https://github.com/zhangbaijin/SpA-Former-shadow-removal  ( 2 min )
    A Unified Sequence Interface for Vision Tasks. (arXiv:2206.07669v2 [cs.CV] UPDATED)
    While language tasks are naturally expressed in a single, unified, modeling framework, i.e., generating sequences of tokens, this has not been the case in computer vision. As a result, there is a proliferation of distinct architectures and loss functions for different vision tasks. In this work we show that a diverse set of "core" computer vision tasks can also be unified if formulated in terms of a shared pixel-to-sequence interface. We focus on four tasks, namely, object detection, instance segmentation, keypoint detection, and image captioning, all with diverse types of outputs, e.g., bounding boxes or dense masks. Despite that, by formulating the output of each task as a sequence of discrete tokens with a unified interface, we show that one can train a neural network with a single model architecture and loss function on all these tasks, with no task-specific customization. To solve a specific task, we use a short prompt as task description, and the sequence output adapts to the prompt so it can produce task-specific output. We show that such a model can achieve competitive performance compared to well-established task-specific models.  ( 3 min )
    On the Effectiveness of Lipschitz-Driven Rehearsal in Continual Learning. (arXiv:2210.06443v2 [cs.LG] UPDATED)
    Rehearsal approaches enjoy immense popularity with Continual Learning (CL) practitioners. These methods collect samples from previously encountered data distributions in a small memory buffer; subsequently, they repeatedly optimize on the latter to prevent catastrophic forgetting. This work draws attention to a hidden pitfall of this widespread practice: repeated optimization on a small pool of data inevitably leads to tight and unstable decision boundaries, which are a major hindrance to generalization. To address this issue, we propose Lipschitz-DrivEn Rehearsal (LiDER), a surrogate objective that induces smoothness in the backbone network by constraining its layer-wise Lipschitz constants w.r.t. replay examples. By means of extensive experiments, we show that applying LiDER delivers a stable performance gain to several state-of-the-art rehearsal CL methods across multiple datasets, both in the presence and absence of pre-training. Through additional ablative experiments, we highlight peculiar aspects of buffer overfitting in CL and better characterize the effect produced by LiDER. Code is available at https://github.com/aimagelab/LiDER  ( 2 min )
    Free Probability for predicting the performance of feed-forward fully connected neural networks. (arXiv:2111.00841v3 [stat.ML] UPDATED)
    Gradient descent during the learning process of a neural network can be subject to many instabilities. The spectral density of the Jacobian is a key component for analyzing stability. Following the works of Pennington et al., such Jacobians are modeled using free multiplicative convolutions from Free Probability Theory (FPT). We present a reliable and very fast method for computing the associated spectral densities for a given architecture and initialization. This method has a controlled and proven convergence. Our technique is based on a homotopy method: an adaptive Newton-Raphson scheme that chains basins of attraction. To demonstrate the relevance of our method, we show that the relevant FPT metrics computed before training are highly correlated with final test accuracies - up to 85\%. We also nuance the idea that learning happens at the edge of chaos by giving evidence that a very desirable feature for neural networks is the hyperbolicity of their Jacobian at initialization.  ( 3 min )
    Improved Algorithms for Neural Active Learning. (arXiv:2210.00423v2 [cs.LG] UPDATED)
    We improve the theoretical and empirical performance of neural-network (NN)-based active learning algorithms for the non-parametric streaming setting. In particular, we introduce two regret metrics, based on minimizing the population loss, that are more suitable for active learning than the one used in state-of-the-art (SOTA) related work. The proposed algorithm then leverages the powerful representation of NNs for both exploitation and exploration, has a query decision-maker tailored to $k$-class classification problems with a performance guarantee, utilizes the full feedback, and updates parameters in a more practical and efficient manner. These careful designs lead to a better regret upper bound, improving by a multiplicative factor $O(\log T)$ and removing the curse of input dimensionality. Furthermore, we show that the algorithm can achieve the same performance as the Bayes-optimal classifier in the long run under the hard-margin setting in classification problems. In the end, we use extensive experiments to evaluate the proposed algorithm and SOTA baselines, to show the improved empirical performance.  ( 2 min )
    Optimistic Curiosity Exploration and Conservative Exploitation with Linear Reward Shaping. (arXiv:2209.07288v2 [cs.LG] UPDATED)
    In this work, we study the simple yet universally applicable case of reward shaping in value-based Deep Reinforcement Learning (DRL). We show that reward shifting in the form of a linear transformation is equivalent to changing the initialization of the $Q$-function in function approximation. Based on this equivalence, we bring the key insight that a positive reward shift leads to conservative exploitation, while a negative reward shift leads to curiosity-driven exploration. Accordingly, conservative exploitation improves offline RL value estimation, and optimistic value estimation improves exploration for online RL. We validate our insight on a range of RL tasks and show its improvement over baselines: (1) In offline RL, the conservative exploitation leads to improved performance based on off-the-shelf algorithms; (2) In online continuous control, multiple value functions with different shifting constants can be used to tackle the exploration-exploitation dilemma for better sample efficiency; (3) In discrete control tasks, a negative reward shift yields an improvement over the curiosity-based exploration method.  ( 2 min )
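    The claimed equivalence is easy to verify in the tabular case: Q-learning on rewards shifted by a constant b from a zero initialization tracks, exactly, Q-learning on the original rewards from an initialization of -b/(1-gamma), offset by b/(1-gamma). A small numerical sketch of that identity (sizes and constants are arbitrary):

    ```python
    import numpy as np

    gamma, alpha, b = 0.99, 0.1, -1.0  # negative shift -> optimistic effective init
    n_states, n_actions = 10, 4
    Q_shifted = np.zeros((n_states, n_actions))                # trained on r + b
    Q_orig = np.full((n_states, n_actions), -b / (1 - gamma))  # trained on r

    def q_update(Q, s, a, r, s_next):
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])

    rng = np.random.default_rng(0)
    for _ in range(1000):
        s, a = rng.integers(n_states), rng.integers(n_actions)
        r, s_next = rng.normal(), rng.integers(n_states)
        q_update(Q_shifted, s, a, r + b, s_next)  # shifted reward, zero init
        q_update(Q_orig, s, a, r, s_next)         # original reward, shifted init
    assert np.allclose(Q_shifted, Q_orig + b / (1 - gamma))
    ```

    Since a negative b corresponds to an optimistic effective initialization, exploration falls out of the same one-line change, matching the dichotomy described in the abstract.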
    FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings. (arXiv:2210.04620v2 [cs.LG] UPDATED)
    Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models, without centralizing data. The cross-silo FL setting corresponds to the case of few ($2$--$50$) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works have proposed representative datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist, thereby slowing algorithmic research in this critical application. In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between theory and practice of cross-silo FL. FLamby encompasses 7 healthcare datasets with natural splits, covering multiple tasks, modalities, and data volumes, each accompanied with baseline training code. As an illustration, we additionally benchmark standard FL algorithms on all datasets. Our flexible and modular suite allows researchers to easily download datasets, reproduce results and re-use the different components for their research. FLamby is available at~\url{www.github.com/owkin/flamby}.  ( 3 min )
    Self-supervision is not magic: Understanding Data Augmentation in Image Anomaly Detection. (arXiv:2208.07734v4 [cs.LG] UPDATED)
    Self-supervised learning (SSL) has emerged as a promising alternative for creating supervisory signals for real-world tasks, avoiding the extensive cost of labeling. SSL is particularly attractive for unsupervised tasks such as anomaly detection (AD), where labeled anomalies are costly to secure, difficult to simulate, or even nonexistent. A large catalog of augmentation functions has been used for SSL-based AD (SSAD) on image data, and recent works have observed that the type of augmentation has a significant impact on performance. Motivated by this, our work sets out to put image-based SSAD under a larger lens and carefully investigate the role of data augmentation in AD through extensive experiments on three different models across 420 different tasks. Our main finding is that self-supervision acts as yet another model hyperparameter and should be chosen carefully with regard to the nature of true anomalies. That is, the alignment between data augmentation and the underlying anomaly-generating mechanism in the given data is the key to the success of SSAD, and in the lack thereof, SSL even impairs (!) the accuracy. Moving beyond proposing another SSAD method, our study contributes to a better understanding of this growing area and lays out new directions for future research.  ( 3 min )
    Strong Lensing Source Reconstruction Using Continuous Neural Fields. (arXiv:2206.14820v2 [astro-ph.CO] UPDATED)
    From the nature of dark matter to the rate of expansion of our Universe, observations of distant galaxies distorted through strong gravitational lensing have the potential to answer some of the major open questions in astrophysics. Modeling galaxy-galaxy strong lensing observations presents a number of challenges as the exact configuration of both the background source and foreground lens galaxy is unknown. A timely call, prompted by a number of upcoming surveys anticipating high-resolution lensing images, demands methods that can efficiently model lenses at their full complexity. In this work, we introduce a method that uses continuous neural fields to non-parametrically reconstruct the complex morphology of a source galaxy while simultaneously inferring a distribution over foreground lens galaxy configurations. We demonstrate the efficacy of our method through experiments on simulated data targeting high-resolution lensing images similar to those anticipated in near-future astrophysical surveys.  ( 2 min )
    Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis. (arXiv:2206.02013v2 [cs.LG] UPDATED)
    Machine learning approaches commonly rely on the assumption of independent and identically distributed (i.i.d.) data. In reality, however, this assumption is almost always violated due to distribution shifts between environments. Although valuable learning signals can be provided by heterogeneous data from changing distributions, it is also known that learning under arbitrary (adversarial) changes is impossible. Causality provides a useful framework for modeling distribution shifts, since causal models encode both observational and interventional distributions. In this work, we explore the sparse mechanism shift hypothesis, which posits that distribution shifts occur due to a small number of changing causal conditionals. Motivated by this idea, we apply it to learning causal structure from heterogeneous environments, where i.i.d. data only allows for learning an equivalence class of graphs without restrictive assumptions. We propose the Mechanism Shift Score (MSS), a score-based approach amenable to various empirical estimators, which provably identifies the entire causal structure with high probability if the sparse mechanism shift hypothesis holds. Empirically, we verify behavior predicted by the theory and compare multiple estimators and score functions to identify the best approaches in practice. Compared to other methods, we show how MSS bridges a gap by both being nonparametric as well as explicitly leveraging sparse changes.  ( 3 min )
    Distraction is All You Need for Fairness. (arXiv:2203.07593v2 [cs.LG] UPDATED)
    Bias in training datasets must be managed to ensure parity or equal treatment across groups in classification tasks. With the recent growth of artificial intelligence models and their expanding role in automated decision-making, ensuring that these models are not biased is vital. There is an abundance of evidence suggesting that these models could contain or even amplify the bias present in the data on which they are trained, inherent to their objective functions and learning algorithms. Many researchers have approached this issue from different directions, e.g., changing the data to be statistically independent, or adversarial training that restricts the capabilities of a particular competitor aiming to maximize parity. These methods result in information loss and do not provide a suitable balance between accuracy and fairness, or do not ensure that biases in training are limited. To this end, we propose a powerful strategy for training deep learning models, called the Distraction module, which can be theoretically proven effective in controlling bias from affecting the classification results. This method can be utilized with different data types (e.g., tabular, images, graphs, etc.). We demonstrate the potency of the proposed method by testing it on the UCI Adult and Heritage Health datasets (tabular), the POKEC-Z, POKEC-N and NBA datasets (graph), and the CelebA dataset (vision). Using state-of-the-art methods proposed in the fairness literature for each dataset, we exhibit that our model is superior to these methods in minimizing bias while maintaining accuracy.  ( 3 min )
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v6 [cs.LG] UPDATED)
    Lipschitz-constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they remain commonly considered less accurate, and their properties in learning are still not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric that reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties to pave the way toward provably accurate, and provably robust, neural networks.  ( 3 min )
    Domain Adaptation via Maximizing Surrogate Mutual Information. (arXiv:2110.12184v3 [cs.LG] UPDATED)
    Unsupervised domain adaptation (UDA) aims to predict unlabeled data from target domain with access to labeled data from the source domain. In this work, we propose a novel framework called SIDA (Surrogate Mutual Information Maximization Domain Adaptation) with strong theoretical guarantees. To be specific, SIDA implements adaptation by maximizing mutual information (MI) between features. In the framework, a surrogate joint distribution models the underlying joint distribution of the unlabeled target domain. Our theoretical analysis validates SIDA by bounding the expected risk on target domain with MI and surrogate distribution bias. Experiments show that our approach is comparable with state-of-the-art unsupervised adaptation methods on standard UDA tasks.  ( 2 min )
    FedCross: Towards Accurate Federated Learning via Multi-Model Cross Aggregation. (arXiv:2210.08285v1 [cs.LG])
    Due to its remarkable performance in preserving data privacy for decentralized data scenarios, Federated Learning (FL) has been considered a promising distributed machine learning paradigm to deal with data silo problems. Typically, conventional FL approaches adopt a one-to-multi training scheme, where the cloud server keeps only one single global model for all the involved clients for the purpose of model aggregation. However, this scheme suffers from inferior classification performance, since a single global model cannot always accommodate all the incompatible convergence directions of local models, resulting in a low convergence rate and low classification accuracy. To address this issue, this paper presents an efficient FL framework named FedCross, which adopts a novel multi-to-multi FL training scheme based on our proposed similarity-based multi-model cross aggregation method. Unlike traditional FL methods, in each round of FL training, FedCross uses a small set of distinct intermediate models to conduct weighted fusion under the guidance of model similarities. In this way, the intermediate models used by FedCross can sufficiently respect the convergence characteristics of clients, thus leading to much fewer conflicts in tuning the convergence directions of clients. Finally, in the deployment stage, FedCross forms a global model for all the clients by performing federated averaging on the trained intermediate models.  ( 2 min )
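    To make "weighted fusion under the guidance of model similarities" concrete, here is a rough sketch of one plausible reading: each intermediate model is fused with the others in proportion to the cosine similarity of their flattened parameters. This is an illustration of the general idea only, not the paper's exact aggregation rule.

    ```python
    import torch

    def similarity_weighted_fusion(flat_params):
        """flat_params: list of 1-D tensors, one flattened parameter vector
        per intermediate model. Returns one fused vector per model."""
        fused = []
        for p_i in flat_params:
            sims = torch.stack([
                torch.cosine_similarity(p_i, p_j, dim=0).clamp(min=0.0)
                for p_j in flat_params
            ])
            weights = sims / sims.sum()  # self-similarity is 1, so the sum is > 0
            fused.append(sum(w * p for w, p in zip(weights, flat_params)))
        return fused
    ```

    Models whose updates point in compatible directions then reinforce each other, while near-orthogonal ones contribute little, which is one way to reduce the convergence-direction conflicts described above.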
    Markov Observation Models. (arXiv:2208.06368v2 [stat.ML] UPDATED)
    Herein, the Hidden Markov Model is expanded to allow for Markov chain observations. In particular, the observations are assumed to be a Markov chain whose one-step transition probabilities depend upon the hidden Markov chain. An Expectation-Maximization analog to the Baum-Welch algorithm is developed for this more general model to estimate the transition probabilities for both the hidden state and the observations, as well as to estimate the probabilities for the initial joint hidden-state-observation distribution. A belief-state (filter) recursion to track the hidden state then arises from the calculations of this Expectation-Maximization algorithm. A dynamic programming analog to the Viterbi algorithm is also developed to estimate the most likely sequence of hidden states given the sequence of observations.  ( 2 min )
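    The filtering recursion for this model is only a small change from the standard HMM forward pass: the emission term becomes an observation-to-observation transition probability conditioned on the hidden state. A sketch under assumed notation (the EM algorithm for estimating A, B, and pi0 is the paper's contribution; this implements only the filter):

    ```python
    import numpy as np

    def markov_obs_filter(A, B, pi0, obs):
        """A:   (H, H)    hidden-state transition matrix
           B:   (H, O, O) observation transition probs, conditioned on hidden state
           pi0: (H, O)    initial joint hidden-state/observation distribution
           obs: sequence of observed symbols
           Returns p(h_t | o_{1:t}) for each t."""
        belief = pi0[:, obs[0]]
        belief = belief / belief.sum()                # p(h_1 | o_1)
        beliefs = [belief]
        for t in range(1, len(obs)):
            pred = belief @ A                         # predict: p(h_t | o_{1:t-1})
            belief = pred * B[:, obs[t - 1], obs[t]]  # weight by obs transition
            belief = belief / belief.sum()            # normalize: p(h_t | o_{1:t})
            beliefs.append(belief)
        return np.array(beliefs)
    ```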
    Hybrid Intelligent Testing in Simulation-Based Verification. (arXiv:2205.09552v3 [cs.AR] UPDATED)
    Efficient and effective testing for simulation-based hardware verification is challenging. Using constrained random test generation, several millions of tests may be required to achieve coverage goals. The vast majority of tests do not contribute to coverage progress, yet they consume verification resources. In this paper, we propose a hybrid intelligent testing approach combining two methods that have previously been treated separately, namely Coverage-Directed Test Selection and Novelty-Driven Verification. Coverage-Directed Test Selection learns from coverage feedback to bias testing toward the most effective tests. Novelty-Driven Verification learns to identify and simulate stimuli that differ from previous stimuli, thereby reducing the number of simulations and increasing testing efficiency. We discuss the strengths and limitations of each method, and we show how our approach addresses each method's limitations, leading to hardware testing that is both efficient and effective.  ( 2 min )
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v3 [stat.ML] UPDATED)
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.  ( 2 min )
    BOAT: Bilateral Local Attention Vision Transformer. (arXiv:2201.13027v1 [cs.CV] CROSS LISTED)
Vision Transformers have achieved outstanding performance in many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture the relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper we further exploit the locality of patches in the feature space. We group the patches into multiple clusters using their features, and self-attention is computed within every cluster. Such feature-space local attention effectively captures connections between patches that lie in different local windows but are still relevant. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.  ( 2 min )
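As a rough illustration of feature-space local attention, the sketch below clusters patch features with plain k-means and runs self-attention within each cluster; the clustering method and the handling of cluster sizes are simplifications (BOAT uses balanced hierarchical clustering and combines this with image-space window attention).

```python
import torch
from sklearn.cluster import KMeans

def feature_space_attention(x, attn, num_clusters=4):
    """x: (N, C) patch features of one image; attn: nn.MultiheadAttention
    built with batch_first=True. Returns features after within-cluster
    self-attention."""
    labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(
        x.detach().cpu().numpy())
    out = torch.empty_like(x)
    for c in range(num_clusters):
        idx = torch.as_tensor((labels == c).nonzero()[0])
        group = x[idx].unsqueeze(0)              # (1, n_c, C)
        y, _ = attn(group, group, group)         # attention restricted to cluster
        out[idx] = y.squeeze(0)
    return out

# Usage: attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4,
#                                           batch_first=True)
```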
    Spartan: Differentiable Sparsity via Regularized Transportation. (arXiv:2205.14107v2 [cs.LG] UPDATED)
    We present Spartan, a method for training sparse neural network models with a predetermined level of sparsity. Spartan is based on a combination of two techniques: (1) soft top-k masking of low-magnitude parameters via a regularized optimal transportation problem and (2) dual averaging-based parameter updates with hard sparsification in the forward pass. This scheme realizes an exploration-exploitation tradeoff: early in training, the learner is able to explore various sparsity patterns, and as the soft top-k approximation is gradually sharpened over the course of training, the balance shifts towards parameter optimization with respect to a fixed sparsity mask. Spartan is sufficiently flexible to accommodate a variety of sparsity allocation policies, including both unstructured and block structured sparsity, as well as general cost-sensitive sparsity allocation mediated by linear models of per-parameter costs. On ImageNet-1K classification, Spartan yields 95% sparse ResNet-50 models and 90% block sparse ViT-B/16 models while incurring absolute top-1 accuracy losses of less than 1% compared to fully dense training.
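A simplified sketch of these two ingredients follows, combining a sigmoid relaxation of the top-k mask with hard sparsification in the forward pass via a straight-through estimator; Spartan's actual soft top-k is computed through a regularized optimal transport problem, which is replaced here by a cruder surrogate.

```python
import torch

def spartan_mask(w, sparsity=0.95, temperature=1.0):
    """Hard top-k mask in the forward pass; gradients flow through a soft
    sigmoid relaxation (straight-through)."""
    k = max(1, int((1 - sparsity) * w.numel()))
    scores = w.abs().flatten()
    threshold = torch.topk(scores, k).values.min()
    soft = torch.sigmoid((scores - threshold) / temperature)   # soft top-k
    hard = (scores >= threshold).float()                       # hard top-k
    mask = hard + soft - soft.detach()                         # straight-through
    return mask.view_as(w)

# Usage inside a layer: y = x @ (weight * spartan_mask(weight)).T
# Annealing `temperature` toward 0 over training shifts from exploring
# sparsity patterns to optimizing under a nearly fixed mask.
```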
    SGD with Coordinate Sampling: Theory and Practice. (arXiv:2105.11818v2 [stat.ML] UPDATED)
While classical forms of the stochastic gradient descent algorithm treat the different coordinates in the same way, a framework allowing for adaptive (non-uniform) coordinate sampling is developed to leverage structure in data. In a non-convex setting and including zeroth-order gradient estimates, almost sure convergence as well as non-asymptotic bounds are established. Within the proposed framework, we develop an algorithm, MUSKETEER, based on a reinforcement strategy: after collecting information on the noisy gradients, it samples the most promising coordinate (all for one); then it moves along the one direction yielding a significant decrease of the objective (one for all). Numerical experiments on both synthetic and real data examples confirm the effectiveness of MUSKETEER in large-scale problems.
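An illustrative sketch of the sampling loop follows; the softmax exploration rule and the running gradient statistic are assumptions made for brevity and do not reproduce MUSKETEER's exact reinforcement strategy.

```python
import numpy as np

def musketeer_sgd(grad_fn, x0, lr=0.1, eta=0.5, steps=1000, seed=0):
    """grad_fn(x): noisy gradient oracle; x0: initial point. Keeps a running
    score per coordinate ("all for one"), samples one promising coordinate,
    and moves only along it ("one for all")."""
    rng = np.random.default_rng(seed)
    x, scores = x0.astype(float).copy(), np.zeros(x0.size)
    d = x.size
    for _ in range(steps):
        logits = eta * scores
        p = np.exp(logits - logits.max())
        p /= p.sum()
        i = rng.choice(d, p=p)                    # sample a promising coordinate
        g = grad_fn(x)
        x[i] -= lr * g[i] / (d * p[i])            # importance-weighted step
        scores = 0.9 * scores + 0.1 * np.abs(g)   # collect gradient information
    return x

# Example: minimize ||x||^2 from a noisy gradient oracle.
# x = musketeer_sgd(lambda x: 2 * x + np.random.randn(x.size), np.ones(50))
```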
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v3 [cs.LG] UPDATED)
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler-Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.
    Experimental Standards for Deep Learning in Natural Language Processing Research. (arXiv:2204.06251v2 [cs.LG] UPDATED)
    The field of Deep Learning (DL) has undergone explosive growth during the last decade, with a substantial impact on Natural Language Processing (NLP) as well. Yet, compared to more established disciplines, a lack of common experimental standards remains an open challenge to the field at large. Starting from fundamental scientific principles, we distill ongoing discussions on experimental standards in NLP into a single, widely-applicable methodology. Following these best practices is crucial to strengthen experimental evidence, improve reproducibility and support scientific progress. These standards are further collected in a public repository to help them transparently adapt to future needs.
    Influencing Long-Term Behavior in Multiagent Reinforcement Learning. (arXiv:2203.03535v4 [cs.LG] UPDATED)
    The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.
    BrainGB: A Benchmark for Brain Network Analysis with Graph Neural Networks. (arXiv:2204.07054v2 [q-bio.NC] UPDATED)
Mapping the connectome of the human brain using structural or functional connectivity has become one of the most pervasive paradigms for neuroimaging analysis. Recently, Graph Neural Networks (GNNs) motivated from geometric deep learning have attracted broad interest due to their established power for modeling complex networked data. Despite their superior performance in many fields, there has not yet been a systematic study of how to design effective GNNs for brain network analysis. To bridge this gap, we present BrainGB, a benchmark for brain network analysis with GNNs. BrainGB standardizes the process by (1) summarizing brain network construction pipelines for both functional and structural neuroimaging modalities and (2) modularizing the implementation of GNN designs. We conduct extensive experiments on datasets across cohorts and modalities and recommend a set of general recipes for effective GNN designs on brain networks. To support open and reproducible research on GNN-based brain network analysis, we host the BrainGB website at https://braingb.us with models, tutorials, examples, as well as an out-of-the-box Python package. We hope that this work will provide useful empirical evidence and offer insights for future research in this novel and promising direction.
    StreamNet: A WAE for White Matter Streamline Analysis. (arXiv:2209.01498v2 [q-bio.QM] UPDATED)
We present StreamNet, an autoencoder architecture for the analysis of the highly heterogeneous geometry of large collections of white matter streamlines. The proposed framework takes advantage of geometry-preserving properties of the Wasserstein-1 metric in order to achieve direct encoding and reconstruction of entire bundles of streamlines. We show that the model not only accurately captures the distributive structures of streamlines in the population, but also achieves superior reconstruction performance between real and synthetic streamlines. Experimental model performance is evaluated on white matter streamlines resulting from T1-weighted diffusion imaging of 40 healthy controls, using a recent state-of-the-art bundle comparison metric that measures fiber-shape similarities.
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v2 [cs.LG] UPDATED)
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Gauge-equivariant flow models for sampling in lattice field theories with pseudofermions. (arXiv:2207.08945v3 [hep-lat] UPDATED)
    This work presents gauge-equivariant architectures for flow-based sampling in fermionic lattice field theories using pseudofermions as stochastic estimators for the fermionic determinant. This is the default approach in state-of-the-art lattice field theory calculations, making this development critical to the practical application of flow models to theories such as QCD. Methods by which flow-based sampling approaches can be improved via standard techniques such as even/odd preconditioning and the Hasenbusch factorization are also outlined. Numerical demonstrations in two-dimensional U(1) and SU(3) gauge theories with $N_f=2$ flavors of fermions are provided.
    Understanding Deep Neural Function Approximation in Reinforcement Learning via $\epsilon$-Greedy Exploration. (arXiv:2209.07376v2 [cs.LG] UPDATED)
This paper provides a theoretical study of deep neural function approximation in reinforcement learning (RL) with $\epsilon$-greedy exploration under the online setting. This problem setting is motivated by the successful deep Q-networks (DQN) framework that falls in this regime. In this work, we provide an initial attempt at theoretically understanding deep RL from the perspective of function class and neural network architectures (e.g., width and depth) beyond the ``linear'' regime. To be specific, we focus on the value-based algorithm with $\epsilon$-greedy exploration via deep (and two-layer) neural networks endowed with Besov (and Barron) function spaces, respectively, which aims at approximating an $\alpha$-smooth Q-function in a $d$-dimensional feature space. We prove that, with $T$ episodes, scaling the width $m = \widetilde{\mathcal{O}}(T^{\frac{d}{2\alpha + d}})$ and the depth $L=\mathcal{O}(\log T)$ of the neural network for deep RL is sufficient for learning with sublinear regret in Besov spaces. Moreover, for a two-layer neural network endowed with the Barron space, scaling the width $\Omega(\sqrt{T})$ is sufficient. To achieve this, the key issue in our analysis is how to estimate the temporal difference error under deep neural function approximation, as $\epsilon$-greedy exploration is not enough to ensure ``optimism''. Our analysis reformulates the temporal difference error in an $L^2(\mathrm{d}\mu)$-integrable space over a certain averaged measure $\mu$, and transforms it into a generalization problem under the non-iid setting. This might be of independent interest in RL theory for better understanding $\epsilon$-greedy exploration in deep RL.
    X-GOAL: Multiplex Heterogeneous Graph Prototypical Contrastive Learning. (arXiv:2109.03560v4 [cs.LG] UPDATED)
Graphs are powerful representations for relations among objects and have attracted plenty of attention. A fundamental challenge for graph learning is how to train an effective Graph Neural Network (GNN) encoder without labels, which are expensive and time-consuming to obtain. Contrastive Learning (CL) is one of the most popular paradigms to address this challenge, which trains GNNs by discriminating positive and negative node pairs. Despite the success of recent CL methods, two problems remain under-explored. First, how to reduce the semantic error introduced by random topology-based data augmentations. Traditional CL defines positive and negative node pairs via node-level topological proximity, which is based solely on the graph topology regardless of the semantic information of node attributes; thus some semantically similar nodes could be wrongly treated as negative pairs. Second, how to effectively model the multiplexity of real-world graphs, where nodes are connected by various relations and each relation could form a homogeneous graph layer. To solve these problems, we propose a novel multiplex heterogeneous graph prototypical contrastive learning (X-GOAL) framework to extract node embeddings. X-GOAL is comprised of two components: the GOAL framework, which learns node embeddings for each homogeneous graph layer, and an alignment regularization, which jointly models different layers by aligning layer-specific node embeddings. Specifically, the GOAL framework captures node-level information by a succinct graph transformation technique and captures cluster-level information by pulling nodes within the same semantic cluster closer in the embedding space. The alignment regularization aligns embeddings across layers at both the node and cluster levels. We evaluate X-GOAL on various real-world datasets and downstream tasks to demonstrate its effectiveness.
    A general framework for multi-step ahead adaptive conformal heteroscedastic time series forecasting. (arXiv:2207.14219v2 [stat.ML] UPDATED)
The exponential growth of machine learning (ML) has prompted a great deal of interest in quantifying the uncertainty of each prediction for a user-defined level of confidence, since ML is increasingly being used in high-stakes settings. Reliable ML via prediction intervals (PIs) that jointly account for epistemic and aleatoric uncertainty is therefore imperative and is a step towards increased trust in model forecasts. Conformal prediction (CP) is a lightweight distribution-free uncertainty quantification framework that works for any black-box model, yielding PIs that are valid under the mild assumption of exchangeability. CP-type methods are gaining popularity because they are easy to implement and computationally cheap; however, the exchangeability assumption immediately rules out time series forecasting. Although recent papers tackle distribution shift and asymptotic versions of CP, this is not enough for the general time series forecasting problem of producing H-step-ahead valid PIs. To attain such a goal, we propose a new method called AEnbMIMOCQR (Adaptive ensemble batch multi-input multi-output conformalized quantile regression), which produces valid PIs asymptotically and is appropriate for heteroscedastic time series. We compare the proposed method against competitive state-of-the-art methods on the NN5 forecasting competition dataset. All the code and data to reproduce the experiments are made available.
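The paper's AEnbMIMOCQR procedure is more involved, but the core difficulty, keeping intervals valid online once exchangeability fails, can be sketched with a simple additive-margin update in the spirit of adaptive conformal inference; everything below is illustrative, not the paper's algorithm.

```python
import numpy as np

def adaptive_intervals(lo_pred, hi_pred, y, alpha=0.1, gamma=0.05):
    """lo_pred/hi_pred: lower/upper quantile forecasts; y: realized values.
    Maintains an additive conformal margin q that widens after misses and
    shrinks after covers, so the long-run miss rate tracks alpha."""
    q, intervals, covered = 0.0, [], []
    for lo, hi, yt in zip(lo_pred, hi_pred, y):
        l, u = lo - q, hi + q
        miss = float(yt < l or yt > u)
        q += gamma * (miss - alpha)               # online margin update
        intervals.append((l, u))
        covered.append(1.0 - miss)
    return intervals, float(np.mean(covered))
```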
    URANUS: Radio Frequency Tracking, Classification and Identification of Unmanned Aircraft Vehicles. (arXiv:2207.06025v2 [cs.LG] UPDATED)
Safety and security issues for Critical Infrastructures (CI) are growing as attackers increasingly adopt drones as an attack vector flying in sensitive airspaces, such as airports, military bases, city centres, and crowded places. The rapid proliferation of drones for merchandise shipping, recreational activities, and other commercial applications raises severe concerns for CI operators due to violations and invasions of restricted airspaces. A cost-effective framework is needed to detect, classify and identify the presence of drones in such cases. In this paper, we demonstrate that CI operators can detect, classify and identify drones (multi-copter and fixed-wing) invading no-drone zones in a timely and efficient manner, with an inexpensive RF-based detection framework named URANUS. Our experiments show that, using a Random Forest classifier, we achieve a classification accuracy of 93.4% in the classification of one or multiple specific drones. The tracking performance achieves an average MAE of 0.3650, an MSE of 0.9254, and an R2 of 0.7502. Our framework has been released as open source, to enable the community to verify our findings and use URANUS as a ready-to-use basis for further analysis.
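For the classification stage, a minimal sketch along these lines could look like the following, assuming pre-extracted RF features `X` and drone class labels `y`; the feature pipeline and hyperparameters are placeholders, not URANUS's configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def classify_drones(X, y):
    """X: (n_samples, n_features) features extracted from RF captures;
    y: drone class labels. Returns the fitted model and test accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_tr, y_tr)
    return clf, accuracy_score(y_te, clf.predict(X_te))
```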
    BOND: Benchmarking Unsupervised Outlier Node Detection on Static Attributed Graphs. (arXiv:2206.10071v2 [cs.LG] UPDATED)
    Detecting which nodes in graphs are outliers is a relatively new machine learning task with numerous applications. Despite the proliferation of algorithms developed in recent years for this task, there has been no standard comprehensive setting for performance evaluation. Consequently, it has been difficult to understand which methods work well and when under a broad range of settings. To bridge this gap, we present--to the best of our knowledge--the first comprehensive benchmark for unsupervised outlier node detection on static attributed graphs called BOND, with the following highlights. (1) We benchmark the outlier detection performance of 14 methods ranging from classical matrix factorization to the latest graph neural networks. (2) Using nine real datasets, our benchmark assesses how the different detection methods respond to two major types of synthetic outliers and separately to "organic" (real non-synthetic) outliers. (3) Using an existing random graph generation technique, we produce a family of synthetically generated datasets of different graph sizes that enable us to compare the running time and memory usage of the different outlier detection algorithms. Based on our experimental results, we discuss the pros and cons of existing graph outlier detection algorithms, and we highlight opportunities for future research. Importantly, our code is freely available and meant to be easily extendable: https://github.com/pygod-team/pygod/tree/main/benchmark
    Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v5 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.
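A small sketch of a maximum-similarity kernel over a transformation group, here horizontal pixel shifts as an illustrative choice, plugged into an SVM through a precomputed Gram matrix; note that such a kernel is not guaranteed positive definite, which is exactly the regime the paper analyzes.

```python
import numpy as np
from sklearn.svm import SVC

def max_similarity_gram(A, B, shifts=(-2, -1, 0, 1, 2)):
    """A, B: (n, d) flattened images. K[i, j] is the maximum inner product of
    a_i with any shifted copy of b_j."""
    K = np.full((len(A), len(B)), -np.inf)
    for s in shifts:
        Bs = np.roll(B, s, axis=1)               # apply one group element
        K = np.maximum(K, A @ Bs.T)              # keep the best-matching transform
    return K

# Usage: clf = SVC(kernel="precomputed").fit(max_similarity_gram(Xtr, Xtr), ytr)
#        preds = clf.predict(max_similarity_gram(Xte, Xtr))
```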
    Is your noise correction noisy? PLS: Robustness to label noise with two stage detection. (arXiv:2210.04578v2 [cs.CV] UPDATED)
Designing robust algorithms capable of training accurate neural networks on uncurated datasets from the web has been the subject of much research, as it reduces the need for time-consuming human labor. The focus of many previous research contributions has been on the detection of different types of label noise; this paper instead proposes to improve the correction accuracy of noisy samples once they have been detected. In many state-of-the-art contributions, a two-phase approach is adopted where the noisy samples are detected before guessing a corrected pseudo-label in a semi-supervised fashion. The guessed pseudo-labels are then used in the supervised objective without ensuring that the label guess is likely to be correct. This can lead to confirmation bias, which reduces the noise robustness. Here we propose the pseudo-loss, a simple metric that we find to be strongly correlated with pseudo-label correctness on noisy samples. Using the pseudo-loss, we dynamically down-weight under-confident pseudo-labels throughout training to avoid confirmation bias and improve the network accuracy. We additionally propose a confidence-guided contrastive objective that learns robust representations by interpolating between a class-bound (supervised) objective for confidently corrected samples and an unsupervised representation objective for under-confident label corrections. Experiments demonstrate the state-of-the-art performance of our Pseudo-Loss Selection (PLS) algorithm on a variety of benchmark datasets, including curated data synthetically corrupted with in-distribution and out-of-distribution noise, and two real-world web noise datasets. Our experiments are fully reproducible at github.com/PaulAlbert31/SNCF
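A sketch of how such a pseudo-loss could be used for down-weighting follows; this is my reading of the idea, not the released code, and the weighting function is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def pseudo_loss_weights(logits, pseudo_labels, sharpen=2.0):
    """logits: (N, C) network outputs; pseudo_labels: (N, C) guessed soft
    labels. Low pseudo-loss -> weight near 1; high pseudo-loss -> near 0."""
    ce = -(pseudo_labels * F.log_softmax(logits, dim=1)).sum(dim=1)  # pseudo-loss
    return torch.exp(-sharpen * (ce - ce.min())).detach()

def weighted_pseudo_objective(logits, pseudo_labels):
    w = pseudo_loss_weights(logits, pseudo_labels)
    ce = -(pseudo_labels * F.log_softmax(logits, dim=1)).sum(dim=1)
    return (w * ce).mean()      # under-confident corrections contribute less
```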
    Towards Healing the Blindness of Score Matching. (arXiv:2209.07396v2 [stat.ML] UPDATED)
    Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.
    Generative Visual Prompt: Unifying Distributional Control of Pre-Trained Generative Models. (arXiv:2209.06970v2 [cs.CV] UPDATED)
Generative models (e.g., GANs, diffusion models) learn the underlying data distribution in an unsupervised manner. However, many applications of interest require sampling from a particular region of the output space or sampling evenly over a range of characteristics. For efficient sampling in these scenarios, we propose Generative Visual Prompt (PromptGen), a framework for distributional control over pre-trained generative models by incorporating knowledge of other off-the-shelf models. PromptGen defines control as energy-based models (EBMs) and samples images in a feed-forward manner by approximating the EBM with invertible neural networks, avoiding optimization at inference. Our experiments demonstrate how PromptGen can efficiently sample from several unconditional generative models (e.g., StyleGAN2, StyleNeRF, diffusion autoencoder, NVAE) in a controlled and/or de-biased manner using various off-the-shelf models: (1) with the CLIP model as control, PromptGen can sample images guided by text; (2) with image classifiers as control, PromptGen can de-bias generative models across a set of attributes or attribute combinations; and (3) with inverse graphics models as control, PromptGen can sample images of the same identity in different poses. (4) Finally, PromptGen reveals that the CLIP model shows a "reporting bias" when used as control, and PromptGen can further de-bias this controlled distribution in an iterative manner. The code is available at https://github.com/ChenWu98/Generative-Visual-Prompt.
    FiLM-Ensemble: Probabilistic Deep Learning via Feature-wise Linear Modulation. (arXiv:2206.00050v2 [cs.LG] UPDATED)
The ability to estimate epistemic uncertainty is often crucial when deploying machine learning in the real world, but modern methods often produce overconfident, uncalibrated uncertainty predictions. A common approach to quantify epistemic uncertainty, usable across a wide class of prediction models, is to train a model ensemble. In a naive implementation, the ensemble approach has high computational cost and high memory demand. This is particularly challenging for modern deep learning, where even a single deep network is already demanding in terms of compute and memory, and it has given rise to a number of attempts to emulate the model ensemble without actually instantiating separate ensemble members. We introduce FiLM-Ensemble, a deep, implicit ensemble method based on the concept of Feature-wise Linear Modulation (FiLM). That technique was originally developed for multi-task learning, with the aim of decoupling different tasks. We show that the idea can be extended to uncertainty quantification: by modulating the network activations of a single deep network with FiLM, one obtains a model ensemble with high diversity, and consequently well-calibrated estimates of epistemic uncertainty, with low computational overhead in comparison. Empirically, FiLM-Ensemble outperforms other implicit ensemble methods, and comes very close to the upper bound of an explicit ensemble of networks (sometimes even beating it), at a fraction of the memory cost.
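A compact sketch of the mechanism: one shared layer whose activations are modulated by member-specific FiLM parameters, so all "ensemble members" share the backbone weights. Layer shapes and placement are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class FiLMEnsembleBlock(nn.Module):
    def __init__(self, features, members):
        super().__init__()
        self.linear = nn.Linear(features, features)          # shared weights
        self.gamma = nn.Parameter(torch.ones(members, features))
        self.beta = nn.Parameter(torch.zeros(members, features))

    def forward(self, x, member):
        # feature-wise affine modulation selects one implicit ensemble member
        h = torch.relu(self.linear(x))
        return self.gamma[member] * h + self.beta[member]

# At test time, average the softmax outputs over all `members` forward passes
# to obtain the ensemble prediction and an epistemic-uncertainty estimate.
```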
    Friendly Noise against Adversarial Noise: A Powerful Defense against Data Poisoning Attacks. (arXiv:2208.10224v2 [cs.CR] UPDATED)
A powerful category of (invisible) data poisoning attacks modify a subset of training examples by small adversarial perturbations to change the prediction of certain test-time data. Existing defense mechanisms are impractical to deploy, as they often either drastically harm the generalization performance, or are attack-specific and prohibitively slow to apply. Here, we propose a simple but highly effective approach that, unlike existing methods, breaks various types of invisible poisoning attacks with only a slight drop in generalization performance. We make the key observation that attacks introduce local sharp regions of high training loss, which, when minimized, results in learning the adversarial perturbations and makes the attack successful. To break poisoning attacks, our key idea is to alleviate the sharp loss regions introduced by poisons. To do so, our approach comprises two components: an optimized friendly noise that is generated to maximally perturb examples without degrading the performance, and a randomly varying noise component. The combination of both components builds a very lightweight but extremely effective defense against the most powerful triggerless targeted and hidden-trigger backdoor poisoning attacks, including Gradient Matching, Bulls-eye Polytope, and Sleeper Agent. We show that our friendly noise is transferable to other architectures, and adaptive attacks cannot break our defense due to its random noise component.
    Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency. (arXiv:2206.08496v3 [cs.LG] UPDATED)
    Pre-training on time series poses a unique challenge due to the potential mismatch between pre-training and target domains, such as shifts in temporal dynamics, fast-evolving trends, and long-range and short-cyclic effects, which can lead to poor downstream performance. While domain adaptation methods can mitigate these shifts, most methods need examples directly from the target domain, making them suboptimal for pre-training. To address this challenge, methods need to accommodate target domains with different temporal dynamics and be capable of doing so without seeing any target examples during pre-training. Relative to other modalities, in time series, we expect that time-based and frequency-based representations of the same example are located close together in the time-frequency space. To this end, we posit that time-frequency consistency (TF-C) -- embedding a time-based neighborhood of an example close to its frequency-based neighborhood -- is desirable for pre-training. Motivated by TF-C, we define a decomposable pre-training model, where the self-supervised signal is provided by the distance between time and frequency components, each individually trained by contrastive estimation. We evaluate the new method on eight datasets, including electrodiagnostic testing, human activity recognition, mechanical fault detection, and physical status monitoring. Experiments against eight state-of-the-art methods show that TF-C outperforms baselines by 15.4% (F1 score) on average in one-to-one settings (e.g., fine-tuning an EEG-pretrained model on EMG data) and by 8.4% (precision) in challenging one-to-many settings (e.g., fine-tuning an EEG-pretrained model for either hand-gesture recognition or mechanical fault prediction), reflecting the breadth of scenarios that arise in real-world applications. Code and datasets: https://github.com/mims-harvard/TFC-pretraining.
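The consistency objective can be written schematically as below, under heavy simplifications of my own: the magnitude spectrum serves as the frequency view and an NT-Xent-style cross-view loss pulls the two embeddings of each example together.

```python
import torch
import torch.nn.functional as F

def tf_consistency_loss(x, time_enc, freq_enc, tau=0.1):
    """x: (B, T) time series batch; time_enc / freq_enc: encoders mapping to
    (B, D) embeddings (the frequency encoder takes the magnitude spectrum)."""
    z_t = F.normalize(time_enc(x), dim=1)
    z_f = F.normalize(freq_enc(torch.fft.rfft(x).abs()), dim=1)
    logits = z_t @ z_f.T / tau                          # cross-view similarities
    targets = torch.arange(x.size(0), device=x.device)  # positives on diagonal
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))
```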
    First Hitting Diffusion Models for Generating Manifold, Graph and Categorical Data. (arXiv:2209.01170v2 [cs.CV] UPDATED)
We propose a family of First Hitting Diffusion Models (FHDM), deep generative models that generate data with a diffusion process that terminates at a random first hitting time. This yields an extension of the standard fixed-time diffusion models that terminate at a pre-specified deterministic time. Although standard diffusion models are designed for continuous unconstrained data, FHDM is naturally designed to learn distributions on continuous as well as a range of discrete and structured domains. Moreover, FHDM enables instance-dependent termination times and accelerates the diffusion process to sample higher quality data with fewer diffusion steps. Technically, we train FHDM by maximum likelihood estimation on diffusion trajectories augmented from observed data with conditional first hitting processes (i.e., bridges) derived based on Doob's $h$-transform, deviating from the commonly used time-reversal mechanism. We apply FHDM to generate data in various domains such as point clouds (general continuous distributions), climate and geographical events on earth (continuous distributions on the sphere), unweighted graphs (distributions of binary matrices), and segmentation maps of 2D images (high-dimensional categorical distributions). We observe considerable improvement compared with state-of-the-art approaches in both quality and speed.
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v3 [stat.ML] UPDATED)
Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their inability to accurately capture the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets, the experimental results on which show that CARD in general outperforms state-of-the-art methods, including Bayesian neural network-based ones designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal. In addition, we utilize the stochastic nature of the generative model outputs to obtain a finer granularity in model confidence assessment at the instance level for classification tasks.
    Improved conformalized quantile regression. (arXiv:2207.02808v5 [stat.ML] UPDATED)
Conformalized quantile regression is a procedure that inherits the advantages of conformal prediction and quantile regression. That is, we use quantile regression to estimate the true conditional quantile and then apply a conformal step on a calibration set to ensure marginal coverage. In this way, we get adaptive prediction intervals that account for heteroscedasticity. However, the aforementioned conformal step lacks adaptiveness, as described in (Romano et al., 2019). To overcome this limitation, instead of applying a single conformal step after estimating conditional quantiles with quantile regression, we propose to cluster the explanatory variables weighted by their permutation importance with an optimized k-means and apply k conformal steps. To show that this improved version outperforms the classic version of conformalized quantile regression and is more adaptive to heteroscedasticity, we extensively compare the prediction intervals of both on open datasets.
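A condensed sketch of this procedure, assuming fitted lower/upper quantile regressors and precomputed permutation importances; the per-cluster quantile-level correction is the standard split-conformal one, and all hyperparameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_cqr(X_cal, y_cal, X_test, q_lo, q_hi, importances,
                  alpha=0.1, k=3):
    """q_lo / q_hi: fitted lower/upper quantile regressors with .predict();
    importances: (d,) permutation importances used as feature weights."""
    km = KMeans(n_clusters=k, n_init=10).fit(X_cal * importances)
    labels_cal = km.labels_
    labels_test = km.predict(X_test * importances)
    lo_t, hi_t = q_lo.predict(X_test), q_hi.predict(X_test)
    intervals = np.empty((len(X_test), 2))
    for c in range(k):
        m = labels_cal == c
        # standard CQR conformity scores, computed within this cluster only
        s = np.maximum(q_lo.predict(X_cal[m]) - y_cal[m],
                       y_cal[m] - q_hi.predict(X_cal[m]))
        level = min(1.0, (1 - alpha) * (1 + 1 / m.sum()))
        qc = np.quantile(s, level)
        t = labels_test == c
        intervals[t, 0] = lo_t[t] - qc
        intervals[t, 1] = hi_t[t] + qc
    return intervals
```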
    A Character-Level Length-Control Algorithm for Non-Autoregressive Sentence Summarization. (arXiv:2205.14522v2 [cs.CL] UPDATED)
    Sentence summarization aims at compressing a long sentence into a short one that keeps the main gist, and has extensive real-world applications such as headline generation. In previous work, researchers have developed various approaches to improve the ROUGE score, which is the main evaluation metric for summarization, whereas controlling the summary length has not drawn much attention. In our work, we address a new problem of explicit character-level length control for summarization, and propose a dynamic programming algorithm based on the Connectionist Temporal Classification (CTC) model. Results show that our approach not only achieves higher ROUGE scores but also yields more complete sentences.
    Online Bilevel Optimization: Regret Analysis of Online Alternating Gradient Methods. (arXiv:2207.02829v3 [math.OC] UPDATED)
    Online optimization is a well-established optimization paradigm that aims to make a sequence of correct decisions given knowledge of the correct answer to previous decision tasks. Bilevel programming involves a hierarchical optimization problem where the feasible region of the so-called outer problem is restricted by the graph of the solution set mapping of the inner problem. This paper brings these two ideas together and studies an online bilevel optimization setting in which a sequence of time-varying bilevel problems are revealed one after the other. We extend the known regret bounds for single-level online algorithms to the bilevel setting. Specifically, we introduce new notions of bilevel regret, develop an online alternating time-averaged gradient method that is capable of leveraging smoothness, and provide regret bounds in terms of the path-length of the inner and outer minimizer sequences.
    Differentially Private Learning Needs Hidden State (Or Much Faster Convergence). (arXiv:2203.05363v2 [stat.ML] UPDATED)
Prior work on differential privacy analysis of randomized SGD algorithms relies on composition theorems, where the implicit (unrealistic) assumption is that the internal state of the iterative algorithm is revealed to the adversary. As a result, the R\'enyi DP bounds derived by such composition-based analyses grow linearly with the number of training epochs. When the internal state of the algorithm is hidden, we prove a converging privacy bound for noisy stochastic gradient descent (on strongly convex smooth loss functions). We show how to take advantage of privacy amplification by sub-sampling and randomized post-processing, and characterize the dynamics of the privacy bound for "shuffle and partition" and "sample without replacement" stochastic mini-batch gradient descent schemes. We prove that, in these settings, our privacy bound converges exponentially fast and is substantially smaller than the composition bounds, notably after only a few training epochs. Thus, unless the DP algorithm converges fast, our privacy analysis shows that hidden state analysis can significantly amplify differential privacy.
    Tensor Completion with Provable Consistency and Fairness Guarantees for Recommender Systems. (arXiv:2204.01815v2 [cs.IR] UPDATED)
    In this paper we introduce a new consistency-based approach for defining and solving nonnegative/positive matrix and tensor completion problems. The novelty of the framework is that instead of artificially making the problem well-posed in the form of an application-arbitrary optimization problem, e.g., minimizing a bulk structural measure such as rank or norm, we show that a single property/constraint -- preserving unit-scale consistency -- guarantees both existence of a solution and, under relatively weak support assumptions, uniqueness. The framework and solution algorithms also generalize directly to tensors of arbitrary dimension while maintaining computational complexity that is linear in problem size for fixed dimension d. In the context of recommender system (RS) applications, we prove that two reasonable properties that should be expected to hold for any solution to the RS problem are sufficient to permit uniqueness guarantees to be established within our framework. This is remarkable because it obviates the need for heuristic-based statistical or AI methods despite what appear to be distinctly human/subjective variables at the heart of the problem. Key theoretical contributions include a general unit-consistent tensor-completion framework with proofs of its properties, e.g., consensus-order and fairness, and algorithms with optimal runtime and space complexities, e.g., O(1) term-completion with preprocessing complexity that is linear in the number of known terms of the matrix/tensor. From a practical perspective, the seamless ability of the framework to generalize to exploit high-dimensional structural relationships among key state variables, e.g., user and product attributes, offers a means for extracting significantly more information than is possible for alternative methods that cannot generalize beyond direct user-product relationships.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v2 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.
    Overparameterization from Computational Constraints. (arXiv:2208.12926v2 [cs.LG] UPDATED)
Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded learners need \emph{significantly more} model parameters than what information-theoretic learners need. Furthermore, we show that even more model parameters could be necessary for robust learning. In particular, for computationally bounded learners, we extend the recent result of Bubeck and Sellke [NeurIPS'2021], which shows that robust models might need more parameters, to the computational regime and show that bounded learners could provably need an even larger number of parameters. Then, we address the following related question: can we hope to remedy the situation for robust computationally bounded learning by restricting \emph{adversaries} to also be computationally bounded for the sake of obtaining models with fewer parameters? Here again, we show that this could be possible. Specifically, building on the work of Garg, Jha, Mahloujifar, and Mahmoody [ALT'2020], we demonstrate a learning task that can be learned efficiently and robustly against a computationally bounded attacker, while being robust against an information-theoretic attacker requires the learner to utilize significantly more parameters.
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v2 [cs.LG] UPDATED)
    There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
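The setup is easy to reproduce as a toy experiment; the sketch below trains a small MLP on a 3-sparse parity of 30 bits. Hyperparameters are illustrative, and the step at which the error drops varies with the seed and architecture.

```python
import torch
import torch.nn as nn

n, k, N = 30, 3, 20000
torch.manual_seed(0)
X = torch.randint(0, 2, (N, n)).float()
support = torch.randperm(n)[:k]                 # hidden sparse support
y = (X[:, support].sum(dim=1) % 2).long()       # parity of the k chosen bits

model = nn.Sequential(nn.Linear(n, 128), nn.ReLU(), nn.Linear(128, 2))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
for step in range(20001):
    idx = torch.randint(0, N, (256,))
    loss = loss_fn(model(X[idx]), y[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 2000 == 0:                        # watch for the abrupt drop
        with torch.no_grad():
            err = (model(X).argmax(dim=1) != y).float().mean()
        print(f"step {step:6d}  loss {loss.item():.3f}  error {err:.3f}")
```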
    Future Object Detection with Spatiotemporal Transformers. (arXiv:2204.10321v2 [cs.CV] UPDATED)
    We propose the task Future Object Detection, in which the goal is to predict the bounding boxes for all visible objects in a future video frame. While this task involves recognizing temporal and kinematic patterns, in addition to the semantic and geometric ones, it only requires annotations in the standard form for individual, single (future) frames, in contrast to expensive full sequence annotations. We propose to tackle this task with an end-to-end method, in which a detection transformer is trained to directly output the future objects. In order to make accurate predictions about the future, it is necessary to capture the dynamics in the scene, both object motion and the movement of the ego-camera. To this end, we extend existing detection transformers in two ways. First, we experiment with three different mechanisms that enable the network to spatiotemporally process multiple frames. Second, we provide ego-motion information to the model in a learnable manner. We show that both of these extensions improve the future object detection performance substantially. Our final approach learns to capture the dynamics and makes predictions on par with an oracle for prediction horizons up to 100 ms, and outperforms all baselines for longer prediction horizons. By visualizing the attention maps, we observe that a form of tracking emerges within the network. Code is available at github.com/atonderski/future-object-detection.
    How Does Pre-trained Wav2Vec 2.0 Perform on Domain Shifted ASR? An Extensive Benchmark on Air Traffic Control Communications. (arXiv:2203.16822v2 [eess.AS] UPDATED)
Recent work on self-supervised pre-training focuses on leveraging large-scale unlabeled speech data to build robust end-to-end (E2E) acoustic models (AM) that can later be fine-tuned on downstream tasks, e.g., automatic speech recognition (ASR). Yet, few works have investigated the impact on performance when the data properties substantially differ between the pre-training and fine-tuning phases, termed domain shift. We target this scenario by analyzing the robustness of Wav2Vec 2.0 and XLS-R models on downstream ASR for a completely unseen domain, air traffic control (ATC) communications. We benchmark these two models on several open-source and challenging ATC databases with signal-to-noise ratios between 5 and 20 dB. Relative word error rate (WER) reductions between 20% and 40% are obtained in comparison to hybrid-based ASR baselines by only fine-tuning E2E acoustic models with a smaller fraction of labeled data. We analyze WERs on the low-resource scenario and the gender bias carried by one ATC dataset.
    Active Bayesian Causal Inference. (arXiv:2206.02063v2 [cs.LG] UPDATED)
    Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference -- other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.
    CoRe: An Automated Pipeline for The Prediction of Liver Resection Complexity from Preoperative CT Scans. (arXiv:2210.08318v1 [cs.CV])
Surgical resections are the most prevalent curative treatment for primary liver cancer. Tumors located in critical positions are known to complicate liver resections (LR). While experienced surgeons in specialized medical centers may have the necessary expertise to accurately anticipate LR complexity and prepare accordingly, an objective method able to reproduce this behavior would have the potential to improve the standard routine of care and avoid intra- and postoperative complications. In this article, we propose CoRe, an automated medical image processing pipeline for the prediction of postoperative LR complexity from preoperative CT scans, using imaging biomarkers. The CoRe pipeline first segments the liver, lesions, and vessels with two deep learning networks. The liver vasculature is then pruned based on a topological criterion to define the hepatic central zone (HCZ), a convex volume circumscribing the major liver vessels, from which a new imaging biomarker, BHCZ, is derived. Additional biomarkers are extracted and leveraged to train and evaluate a LR complexity prediction model. An ablation study shows the HCZ-based biomarker to be the central feature in predicting LR complexity. The best predictive model reaches an accuracy, F1, and AUC of 77.3%, 75.4%, and 84.1%, respectively.
    Machine learning algorithms for three-dimensional mean-curvature computation in the level-set method. (arXiv:2208.09047v2 [cs.LG] UPDATED)
    We propose a data-driven mean-curvature solver for the level-set method. This work is the natural extension to $\mathbb{R}^3$ of our two-dimensional strategy in [arXiv:2201.12342][1] and the hybrid inference system of [DOI: 10.1016/j.jcp.2022.111291][2]. However, in contrast to [1,2], which built resolution-dependent neural-network dictionaries, here we develop a pair of models in $\mathbb{R}^3$, regardless of the mesh size. Our feedforward networks ingest transformed level-set, gradient, and curvature data to fix numerical mean-curvature approximations selectively for interface nodes. To reduce the problem's complexity, we have used the Gaussian curvature to classify stencils and fit our models separately to non-saddle and saddle patterns. Non-saddle stencils are easier to handle because they exhibit a curvature error distribution characterized by monotonicity and symmetry. While the latter has allowed us to train only on half the mean-curvature spectrum, the former has helped us blend the data-driven and the baseline estimations seamlessly near flat regions. On the other hand, the saddle-pattern error structure is less clear; thus, we have exploited no latent information beyond what is known. In this regard, we have trained our models on not only spherical but also sinusoidal and hyperbolic paraboloidal patches. Our approach to building their data sets is systematic but gleans samples randomly while ensuring well-balancedness. We have also resorted to standardization and dimensionality reduction as a preprocessing step and integrated regularization to minimize outliers. In addition, we leverage curvature rotation/reflection invariance to improve precision at inference time. Several experiments confirm that our proposed system can yield more accurate mean-curvature estimations than modern particle-based interface reconstruction and level-set schemes around under-resolved regions.
    How sensitive are translation systems to extra contexts? Mitigating gender bias in Neural Machine Translation models through relevant contexts. (arXiv:2205.10762v2 [cs.CL] UPDATED)
Neural Machine Translation systems built on top of Transformer-based architectures are routinely improving the state-of-the-art in translation quality according to word-overlap metrics. However, a growing number of studies also highlight the inherent gender bias that these models incorporate during training, which reflects poorly in their translations. In this work, we investigate whether these models can be instructed to fix their bias during inference using targeted, guided instructions as contexts. By translating relevant contextual sentences during inference along with the input, we observe large improvements in reducing the gender bias in translations, across three popular test suites (WinoMT, BUG, SimpleGen). We further propose a novel metric to assess several large pre-trained models (OPUS-MT, M2M-100) on their sensitivity towards using contexts during translation to correct their biases. Our approach requires no fine-tuning and thus can be used easily in production systems to de-bias translations from stereotypical gender-occupation bias. We hope our method, along with our metric, can be used to build better, bias-free translation systems.
    How Powerful are K-hop Message Passing Graph Neural Networks. (arXiv:2205.13328v3 [cs.LG] UPDATED)
    The most popular design paradigm for Graph Neural Networks (GNNs) is 1-hop message passing -- aggregating information from 1-hop neighbors repeatedly. However, the expressive power of 1-hop message passing is bounded by the Weisfeiler-Lehman (1-WL) test. Recently, researchers extended 1-hop message passing to K-hop message passing by aggregating information from K-hop neighbors of nodes simultaneously. However, there is no work on analyzing the expressive power of K-hop message passing. In this work, we theoretically characterize the expressive power of K-hop message passing. Specifically, we first formally differentiate two different kernels of K-hop message passing which are often misused in previous works. We then characterize the expressive power of K-hop message passing by showing that it is more powerful than 1-WL and can distinguish almost all regular graphs. Despite the higher expressive power, we show that K-hop message passing still cannot distinguish some simple regular graphs and its expressive power is bounded by 3-WL. To further enhance its expressive power, we introduce a KP-GNN framework, which improves K-hop message passing by leveraging the peripheral subgraph information in each hop. We show that KP-GNN can distinguish many distance regular graphs which could not be distinguished by previous distance encoding or 3-WL methods. Experimental results verify the expressive power and effectiveness of KP-GNN. KP-GNN achieves competitive results across all benchmark datasets.
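A bare-bones sketch of K-hop aggregation on a dense adjacency matrix, using the shortest-path-distance kernel the paper distinguishes; KP-GNN additionally encodes peripheral-subgraph information at each hop, which is omitted here.

```python
import torch

def k_hop_aggregate(A, H, K):
    """A: (n, n) dense 0/1 adjacency (float); H: (n, d) node features.
    Concatenates mean messages from nodes at shortest-path distance exactly
    j, for j = 1..K."""
    n = A.size(0)
    reach = torch.eye(n, dtype=torch.bool)       # nodes already reached
    Ap = torch.eye(n)
    outs = []
    for _ in range(K):
        Ap = (Ap @ A).clamp(max=1)               # walks one step longer
        exactly_j = Ap.bool() & ~reach           # first reached at this hop
        reach |= Ap.bool()
        deg = exactly_j.float().sum(dim=1, keepdim=True).clamp(min=1)
        outs.append((exactly_j.float() @ H) / deg)
    return torch.cat(outs, dim=1)                # (n, K * d)
```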
    An Empirical Study on Disentanglement of Negative-free Contrastive Learning. (arXiv:2206.04756v2 [cs.LG] UPDATED)
Negative-free contrastive learning methods have attracted a lot of attention owing to their simplicity and impressive performance in large-scale pretraining. However, their disentanglement property remains unexplored. In this paper, we examine negative-free contrastive learning methods to study the disentanglement property empirically. We find that existing disentanglement metrics fail to make meaningful measurements for high-dimensional representation models, so we propose a new disentanglement metric based on the Mutual Information between latent representations and data factors. With this proposed metric, we benchmark the disentanglement property of negative-free contrastive learning on both popular synthetic datasets and a real-world dataset, CelebA. Our study shows that the investigated methods can learn a well-disentangled subset of representation. To the best of our knowledge, we are the first to extend the study of disentangled representation learning to high-dimensional representation space and to introduce negative-free contrastive learning methods into this area. The source code of this paper is available at \url{https://github.com/noahcao/disentanglement_lib_med}.
    Deep Differentiable Logic Gate Networks. (arXiv:2210.08277v1 [cs.LG])
    Recently, research has increasingly focused on developing efficient neural network architectures. In this work, we explore logic gate networks for machine learning tasks by learning combinations of logic gates. These networks comprise logic gates such as "AND" and "XOR", which allow for very fast execution. The difficulty in learning logic gate networks is that they are conventionally non-differentiable and therefore do not allow training with gradient descent. Thus, to allow for effective training, we propose differentiable logic gate networks, an architecture that combines real-valued logics and a continuously parameterized relaxation of the network. The resulting discretized logic gate networks achieve fast inference speeds, e.g., beyond a million images of MNIST per second on a single CPU core.
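    As a minimal sketch of the relaxation (ours, not the authors' code): each two-input gate is given a real-valued relaxation on [0, 1], and a softmax over learnable logits mixes the candidate gates, so the choice of gate itself becomes differentiable. The paper considers all 16 two-input gates; four are kept here for brevity.
        import torch

        # Real-valued relaxations of two-input logic gates on probabilities in [0, 1].
        GATES = [
            lambda a, b: a * b,              # AND
            lambda a, b: a + b - a * b,      # OR
            lambda a, b: a + b - 2 * a * b,  # XOR
            lambda a, b: 1 - a * b,          # NAND
        ]

        class SoftGate(torch.nn.Module):
            """One learnable gate: a softmax mixture over gate relaxations.
            After training, the argmax gate can be kept, yielding a discrete circuit."""
            def __init__(self):
                super().__init__()
                self.logits = torch.nn.Parameter(torch.zeros(len(GATES)))

            def forward(self, a, b):
                w = torch.softmax(self.logits, dim=0)
                return sum(wi * g(a, b) for wi, g in zip(w, GATES))

        gate = SoftGate()
        a, b = torch.rand(8), torch.rand(8)
        gate(a, b).sum().backward()          # differentiable w.r.t. the gate choice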
    Equivariant Graph Hierarchy-Based Neural Networks. (arXiv:2202.10643v2 [cs.LG] UPDATED)
    Equivariant Graph neural Networks (EGNs) are powerful in characterizing the dynamics of multi-body physical systems. Existing EGNs conduct flat message passing, which, however, cannot capture the spatial/dynamical hierarchies of complex systems, limiting substructure discovery and global information fusion. In this paper, we propose Equivariant Hierarchy-based Graph Networks (EGHNs), which consist of three key components: generalized Equivariant Matrix Message Passing (EMMP), E-Pool, and E-UpPool. In particular, EMMP improves the expressivity of conventional equivariant message passing, E-Pool assigns the quantities of the low-level nodes to high-level clusters, and E-UpPool leverages the high-level information to update the dynamics of the low-level nodes. As their names imply, both E-Pool and E-UpPool are guaranteed to be equivariant, respecting physical symmetry. Extensive experimental evaluations verify the effectiveness of our EGHN on several applications including multi-object dynamics simulation, motion capture, and protein dynamics modeling.
    Distinguishing Learning Rules with Brain Machine Interfaces. (arXiv:2206.13448v2 [cs.NE] UPDATED)
    Despite extensive theoretical work on biologically plausible learning rules, clear evidence about whether and how such rules are implemented in the brain has been difficult to obtain. We consider biologically plausible supervised- and reinforcement-learning rules and ask whether changes in network activity during learning can be used to determine which learning rule is being used. Supervised learning requires a credit-assignment model estimating the mapping from neural activity to behavior, and, in a biological organism, this model will inevitably be an imperfect approximation of the ideal mapping, leading to a bias in the direction of the weight updates relative to the true gradient. Reinforcement learning, on the other hand, requires no credit-assignment model and tends to make weight updates following the true gradient direction. We derive a metric to distinguish between learning rules by observing changes in the network activity during learning, given that the mapping from brain to behavior is known by the experimenter. Because brain-machine interface (BMI) experiments allow for precise knowledge of this mapping, we model a cursor-control BMI task using recurrent neural networks, showing that learning rules can be distinguished in simulated experiments using only observations that a neuroscience experimenter would plausibly have access to.
    Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning. (arXiv:2206.07376v2 [cs.LG] UPDATED)
    Keeping risk under control is often more crucial than maximizing expected reward in real-world decision-making situations, such as finance, robotics, and autonomous driving. The most natural choice of risk measure is variance, yet it penalizes upside volatility as much as downside volatility. Instead, the (downside) semivariance, which captures the negative deviation of a random variable below its mean, is more suitable for risk-averse purposes. This paper aims at optimizing the mean-semivariance (MSV) criterion in reinforcement learning w.r.t. the steady reward distribution. Since semivariance is time-inconsistent and does not satisfy the standard Bellman equation, traditional dynamic programming methods are not directly applicable to MSV problems. To tackle this challenge, we resort to Perturbation Analysis (PA) theory and establish the performance difference formula for MSV. We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function. Further, we propose two on-policy algorithms based on policy gradient theory and the trust region method. Finally, we conduct diverse experiments, from simple bandit problems to continuous control tasks in MuJoCo, which demonstrate the effectiveness of our proposed methods.
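    For reference, a minimal sketch of the criterion itself (ours; the trade-off weight lam is an illustrative assumption):
        import numpy as np

        def semivariance(rewards):
            # (Downside) semivariance: mean squared deviation below the mean.
            r = np.asarray(rewards, dtype=float)
            downside = np.minimum(r - r.mean(), 0.0)
            return np.mean(downside ** 2)

        def msv_objective(rewards, lam=1.0):
            # Mean-semivariance criterion: mean reward minus a semivariance penalty.
            r = np.asarray(rewards, dtype=float)
            return r.mean() - lam * semivariance(r)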
    FOCUS: Fairness via Agent-Awareness for Federated Learning on Heterogeneous Data. (arXiv:2207.10265v2 [cs.LG] UPDATED)
    Federated learning (FL) provides an effective collaborative training paradigm, allowing local agents to train a global model jointly without sharing their local data, to protect privacy. However, due to the heterogeneous nature of local data, it is challenging to optimize or even define the fairness of the trained global model for the agents. For instance, existing work usually considers accuracy equity as fairness for different agents in FL, which is limited, especially under a heterogeneous setting, since it is intuitively "unfair" to force agents with high-quality data to achieve accuracy similar to those who contribute low-quality data. In this work, we aim to address such limitations and propose a formal fairness definition for FL, fairness via agent-awareness (FAA), which takes the different contributions of heterogeneous agents into account. Under FAA, the performance of agents with high-quality data will not be sacrificed merely due to the existence of large numbers of agents with low-quality data. In addition, we propose a fair FL training algorithm based on agent clustering (FOCUS) to achieve fairness in FL as measured by FAA. Theoretically, we prove the convergence and optimality of FOCUS under mild conditions for linear and general convex loss functions with bounded smoothness. We also prove that FOCUS always achieves higher fairness in terms of FAA compared with standard FedAvg under both linear and general convex loss functions. Empirically, we evaluate FOCUS on four datasets, including synthetic data, images, and texts, under different settings, and we show that FOCUS achieves significantly higher fairness in terms of FAA while maintaining similar or even higher prediction accuracy compared with FedAvg and other existing fair FL algorithms.
    Biologically Plausible Learning using GAIT-prop Scales to ImageNet. (arXiv:2102.11598v2 [cs.LG] UPDATED)
    Many of the recent advances in the field of artificial intelligence have been fueled by the highly successful backpropagation of error (BP) algorithm, which efficiently solves the credit assignment problem in artificial neural networks. However, it is unlikely that BP is implemented in its usual form within biological neural networks, because of its reliance on non-local information in propagating error gradients. Since biological neural networks are capable of highly efficient learning and responses from BP trained models can be related to neural responses, it seems reasonable that a biologically viable approximation of BP underlies synaptic plasticity in the brain. Gradient Adjusted Incremental Target Propagation (GAIT-prop) has recently been derived directly from BP and has been shown to successfully train networks in a more biologically plausible manner. However, so far, GAIT-prop has only been shown to work on relatively low-dimensional problems, such as handwritten-digit recognition. This work addresses some of the scaling issues in GAIT-prop and shows it to achieve performance comparable to BP on the much more challenging CIFAR10 and ImageNet datasets.
    On Enforcing Better Conditioned Meta-Learning for Rapid Few-Shot Adaptation. (arXiv:2206.07260v2 [cs.LG] UPDATED)
    Inspired by the concept of preconditioning, we propose a novel method to increase adaptation speed for gradient-based meta-learning methods without incurring extra parameters. We demonstrate that recasting the optimization problem to a non-linear least-squares formulation provides a principled way to actively enforce a $\textit{well-conditioned}$ parameter space for meta-learning models based on the concepts of the condition number and local curvature. Our comprehensive evaluations show that the proposed method significantly outperforms its unconstrained counterpart especially during initial adaptation steps, while achieving comparable or better overall results on several few-shot classification tasks -- creating the possibility of dynamically choosing the number of adaptation steps at inference time.
    Subgraph Permutation Equivariant Networks. (arXiv:2111.11840v4 [cs.LG] UPDATED)
    In this work we develop a new method, named Sub-graph Permutation Equivariant Networks (SPEN), which provides a framework for building graph neural networks that operate on sub-graphs, use a base update function that is permutation equivariant, and are equivariant to a novel choice of automorphism group. Message passing neural networks have been shown to be limited in their expressive power, and recent approaches to overcome this either lack scalability or require structural information to be encoded into the feature space. The general framework presented here overcomes the scalability issues associated with global permutation equivariance by operating more locally on sub-graphs. In addition, operating on sub-graphs improves the expressive power of higher-dimensional global permutation equivariant networks, due to the fact that two non-distinguishable graphs often contain distinguishable sub-graphs. Furthermore, the proposed framework only requires a choice of $k$-hops for creating ego-network sub-graphs and a choice of representation space for each layer, which makes the method easily applicable across a range of graph-based domains. We experimentally validate the method on a range of graph benchmark classification tasks, demonstrating statistically indistinguishable results from the state of the art on six out of seven benchmarks. Further, we demonstrate that the use of local update functions offers a significant improvement in GPU memory over global methods.
    E-Branchformer: Branchformer with Enhanced merging for speech recognition. (arXiv:2210.00077v2 [eess.AS] UPDATED)
    Conformer, combining convolution and self-attention sequentially to capture both local and global information, has shown remarkable performance and is currently regarded as the state of the art for automatic speech recognition (ASR). Several other studies have explored integrating convolution and self-attention, but they have not managed to match Conformer's performance. The recently introduced Branchformer achieves comparable performance to Conformer by using dedicated branches for convolution and self-attention and merging the local and global context from each branch. In this paper, we propose E-Branchformer, which enhances Branchformer by applying an effective merging method and stacking additional point-wise modules. E-Branchformer sets new state-of-the-art word error rates (WERs) of 1.81% and 3.65% on the LibriSpeech test-clean and test-other sets without using any external training data.
    Constraint Learning to Define Trust Regions in Predictive-Model Embedded Optimization. (arXiv:2201.04429v2 [cs.LG] UPDATED)
    There is a recent proliferation of research on the integration of machine learning and optimization. One expansive area within this research stream is predictive-model embedded optimization, which proposes the use of pre-trained predictive models as surrogates for uncertain or highly complex objective functions. In this setting, features of the predictive models become decision variables in the optimization problem. Despite a recent surge of publications in this area, only a few papers note the importance of incorporating trust-region considerations in this decision-making pipeline, i.e., enforcing solutions to be similar to the data used to train the predictive models. Without such constraints, the evaluation of the predictive model at solutions obtained from optimization cannot be trusted, and the solutions may be impractical. In this paper, we provide an overview of the approaches appearing in the literature to construct a trust region, and propose three alternative approaches. Our numerical evaluation highlights that trust-region constraints learned through isolation forests, one of the newly proposed approaches, outperform all previously suggested approaches, both in terms of solution quality and computational time.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v2 [cs.LG] UPDATED)
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP and its implications for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level DP are less suitable as real-world privacy regulations typically concern the in-silo data subjects rather than the silos themselves. In this work, we instead consider an alternative notion of silo-specific sample-level DP, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy requirements, silos are incentivized to federate more with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide an empirical study of competing methods as well as a theoretical characterization of MR-MTL for mean estimation, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
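    As a toy illustration of MR-MTL (ours; it uses scalar mean estimation and ignores the DP noise the paper analyzes), the mean-regularizer pulls each silo's model toward the average of all silo models, interpolating between purely local training (lam = 0) and a single shared, FedAvg-like model (large lam):
        import numpy as np

        def mr_mtl_mean_estimation(local_means, lam, steps=200, lr=0.1):
            m = np.asarray(local_means, dtype=float)
            w = m.copy()                            # start from the local solutions
            for _ in range(steps):
                w_bar = w.mean()                    # current average of silo models
                grad = (w - m) + lam * (w - w_bar)  # local fit + mean-regularizer
                w -= lr * grad
            return w

        # Hypothetical usage: three silos with heterogeneous local means.
        print(mr_mtl_mean_estimation([0.0, 1.0, 5.0], lam=0.5))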
    Supervised Learning for Coverage-Directed Test Selection in Simulation-Based Verification. (arXiv:2205.08524v3 [cs.AR] UPDATED)
    Constrained random test generation is one of the most widely adopted methods for generating stimuli for simulation-based verification. Randomness leads to test diversity, but tests tend to repeatedly exercise the same design logic. Constraints are written (typically manually) to bias random tests towards interesting, hard-to-reach, and yet-untested logic. However, as verification progresses, most constrained random tests yield little to no effect on functional coverage. If stimuli generation consumes significantly less resources than simulation, then a better approach involves randomly generating a large number of tests, selecting the most effective subset, and only simulating that subset. In this paper, we introduce a novel method for automatic constraint extraction and test selection. This method, which we call coverage-directed test selection, is based on supervised learning from coverage feedback. Our method biases selection towards tests that have a high probability of increasing functional coverage, and prioritises them for simulation. We show how coverage-directed test selection can reduce manual constraint writing, prioritise effective tests, reduce verification resource consumption, and accelerate coverage closure on a large, real-life industrial hardware design.
    GraB: Finding Provably Better Data Permutations than Random Reshuffling. (arXiv:2205.10733v2 [cs.LG] UPDATED)
    Random reshuffling, which randomly permutes the dataset each epoch, is widely adopted in model training because it yields faster convergence than with-replacement sampling. Recent studies indicate that greedily chosen data orderings can further speed up convergence empirically, but such greedy orderings lack theoretical justification and have limited utility due to their non-trivial memory and computation overhead. In this paper, we first formulate an example-ordering framework named herding and show that SGD with herding converges at the rate $O(T^{-2/3})$ on smooth, non-convex objectives, faster than the $O(n^{1/3}T^{-2/3})$ obtained by random reshuffling, where $n$ denotes the number of data points and $T$ denotes the total number of iterations. To reduce the memory overhead, we leverage discrepancy minimization theory to propose an online Gradient Balancing algorithm (GraB) that enjoys the same rate as herding, while reducing the memory usage from $O(nd)$ to just $O(d)$ and the computation from $O(n^2)$ to $O(n)$, where $d$ denotes the model dimension. We show empirically on applications including MNIST, CIFAR10, WikiText and GLUE that GraB can outperform random reshuffling in terms of both training and validation performance, and can even outperform state-of-the-art greedy ordering while reducing memory usage by over $100\times$.
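    A much-simplified sketch of the gradient-balancing idea (ours; it omits the stale-gradient bookkeeping and other details of the actual algorithm): greedily choose signs that keep a running sum of centred per-example gradients small, then order the positively signed examples before the reversed negatively signed ones for the next epoch.
        import numpy as np

        def balance_order(grads):
            z = grads - grads.mean(axis=0)          # centred per-example gradients
            run = np.zeros(z.shape[1])
            front, back = [], []
            for i, g in enumerate(z):
                # Pick the sign that keeps the running signed sum small.
                if np.linalg.norm(run + g) <= np.linalg.norm(run - g):
                    run += g
                    front.append(i)
                else:
                    run -= g
                    back.append(i)
            return front + back[::-1]               # next epoch's example ordering

        order = balance_order(np.random.randn(16, 4))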
    ST-CoNAL: Consistency-Based Acquisition Criterion Using Temporal Self-Ensemble for Active Learning. (arXiv:2207.02182v2 [cs.CV] UPDATED)
    Modern deep learning has achieved great success in various fields. However, it requires the labeling of huge amounts of data, which is expensive and labor-intensive. Active learning (AL), which identifies the most informative samples to be labeled, is becoming increasingly important for maximizing the efficiency of the training process. Existing AL methods mostly use only a single, final, fixed model for acquiring the samples to be labeled. This strategy may not be sufficient, since it does not account for the structural uncertainty of a model for the given training data when acquiring samples. In this study, we propose a novel acquisition criterion based on a temporal self-ensemble generated by conventional stochastic gradient descent (SGD) optimization. These self-ensemble models are obtained by capturing the intermediate network weights of SGD iterations. Our acquisition function relies on a consistency measure between the student and teacher models: a fixed number of temporal self-ensemble models serve as the student models, and the teacher model is constructed by averaging the weights of the student models. Using the proposed acquisition criterion, we present an AL algorithm, namely student-teacher consistency-based AL (ST-CoNAL). Experiments conducted on image classification tasks on the CIFAR-10, CIFAR-100, Caltech-256, and Tiny ImageNet datasets demonstrate that ST-CoNAL achieves significantly better performance than existing acquisition methods. Furthermore, extensive experiments show the robustness and effectiveness of our methods.
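    A rough sketch of the acquisition step (ours; the paper's exact consistency measure may differ, so a KL-based stand-in is used): score each unlabeled sample by the average divergence between the student predictions and the weight-averaged teacher's prediction, and query the most inconsistent samples.
        import numpy as np

        def consistency_scores(student_probs, teacher_probs, eps=1e-12):
            # student_probs: (n_students, n_samples, n_classes)
            # teacher_probs: (n_samples, n_classes), from the weight-averaged teacher
            kl = np.sum(student_probs * np.log((student_probs + eps) /
                                               (teacher_probs[None] + eps)), axis=-1)
            return kl.mean(axis=0)                  # one score per unlabeled sample

        scores = consistency_scores(np.full((3, 5, 2), 0.5), np.full((5, 2), 0.5))
        query = np.argsort(-scores)[:2]             # acquire the top-scoring samples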
    ScanMix: Learning from Severe Label Noise via Semantic Clustering and Semi-Supervised Learning. (arXiv:2103.11395v3 [cs.CV] UPDATED)
    We propose a new training algorithm, ScanMix, that explores semantic clustering and semi-supervised learning (SSL) to allow superior robustness to severe label noise and competitive robustness to non-severe label noise, in comparison to state-of-the-art (SOTA) methods. ScanMix is based on the expectation-maximisation framework, where the E-step estimates the latent variable to cluster the training images based on their appearance and classification results, and the M-step optimises the SSL classification and learns effective feature representations via semantic clustering. We present a theoretical result that shows the correctness and convergence of ScanMix, and an empirical result that shows that ScanMix achieves SOTA results on CIFAR-10/-100 (with symmetric, asymmetric and semantic label noise), Red Mini-ImageNet (from the Controlled Noisy Web Labels), Clothing1M and WebVision. In all benchmarks with severe label noise, our results are competitive with the current SOTA.
    The Lepto-Variance of Stock Returns. (arXiv:2207.04867v2 [q-fin.ST] UPDATED)
    The Regression Tree (RT) sorts the samples using a specific feature and finds the split point that produces the maximum variance reduction from a node to its children. Our key observation is that the best feature to use (in terms of MSE drop) is always the target itself, as this most clearly separates the target. Thus using the target as the splitting factor provides an upper bound on the MSE drop (or, equivalently, a lower bound on the residual MSE of the children). Based on this observation, we define the k-bit lepto-variance $\lambda_k^2$ of a target variable (equivalently, the lepto-variance at depth k) as the variance that cannot be removed by any regression tree of depth k. As the upper-bound performance for any feature, we believe $\lambda_k^2$ to be an interesting statistical concept related to the underlying structure of the sample, as it quantifies the resolving power of the RT for the sample. The maximum variance that may be explained using RTs of depth up to k is called the sample k-bit macro-variance. At any depth, the total sample variance is thus decomposed into lepto-variance $\lambda^2$ and macro-variance $\mu^2$. We demonstrate the concept by performing 1- and 2-bit RT-based lepto-structure analysis of daily IBM stock returns.
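    A minimal sketch of the 1-bit case (ours): sort the target, scan every split point, and keep the smallest residual child variance. By the observation above, no feature can beat a split on the target itself, so this residual is the 1-bit lepto-variance.
        import numpy as np

        def lepto_variance_1bit(y):
            y = np.sort(np.asarray(y, dtype=float))
            n = len(y)
            best = np.var(y)                        # depth-0 baseline: no split
            for i in range(1, n):
                left, right = y[:i], y[i:]
                resid = (i * np.var(left) + (n - i) * np.var(right)) / n
                best = min(best, resid)
            return best

        returns = np.random.standard_t(df=3, size=1000)   # heavy-tailed toy "returns"
        lepto = lepto_variance_1bit(returns)
        macro = np.var(returns) - lepto                   # 1-bit macro-variance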
    Improved Robust Algorithms for Learning with Discriminative Feature Feedback. (arXiv:2209.03753v2 [cs.LG] UPDATED)
    Discriminative Feature Feedback is a setting proposed by Dasgupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work showed that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models. In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.
    CsFEVER and CTKFacts: Acquiring Czech data for fact verification. (arXiv:2201.11115v3 [cs.CL] UPDATED)
    In this paper, we examine several methods of acquiring Czech data for automated fact-checking, a task commonly modeled as classification of textual claim veracity w.r.t. a corpus of trusted ground truths. We attempt to collect sets of data in the form of a factual claim, evidence within the ground-truth corpus, and a veracity label (supported, refuted, or not enough info). As a first attempt, we generate a Czech version of the large-scale FEVER dataset built on top of the Wikipedia corpus. We take a hybrid approach of machine translation and document alignment; the approach and the tools we provide can be easily applied to other languages. We discuss its weaknesses and inaccuracies, propose a future approach for their cleaning, and publish the 127k resulting translations, as well as a version of this dataset reliably applicable to the Natural Language Inference task - CsFEVER-NLI. Furthermore, we collect a novel dataset of 3,097 claims, annotated using the corpus of 2.2M articles of the Czech News Agency. We present its extended annotation methodology based on the FEVER approach and, as the underlying corpus is kept a trade secret, also publish a standalone version of the dataset for the task of Natural Language Inference, which we call CTKFactsNLI. We analyze both acquired datasets for spurious cues - annotation patterns leading to model overfitting. CTKFacts is further examined for inter-annotator agreement, thoroughly cleaned, and a typology of common annotator errors is extracted. Finally, we provide baseline models for all stages of the fact-checking pipeline and publish the NLI datasets, as well as our annotation platform and other experimental data.
    Everyone's Preference Changes Differently: Weighted Multi-Interest Retrieval Model. (arXiv:2207.06652v3 [cs.IR] UPDATED)
    User embeddings (vectorized representations of a user) are essential in recommendation systems. Numerous approaches have been proposed to construct a representation of the user in order to find similar items for retrieval tasks, and they have proven effective in industrial recommendation systems as well. Recently, people have discovered the power of using multiple embeddings to represent a user, with the hope that each embedding represents the user's interest in a certain topic. With multi-interest representations, it is important to model the user's preference over the different topics and how that preference changes over time. However, existing approaches either fail to estimate the user's affinity to each interest or unreasonably assume that every interest of every user fades at the same rate over time, thus hurting the recall of candidate retrieval. In this paper, we propose the Multi-Interest Preference (MIP) model, an approach that not only produces multiple interests for users by using the user's sequential engagement more effectively, but also automatically learns a set of weights to represent the preference over each embedding, so that candidates can be retrieved from each interest proportionally. Extensive experiments on various industrial-scale datasets demonstrate the effectiveness of our approach.
    Deep Fusion Prior for Plenoptic Super-Resolution All-in-Focus Imaging. (arXiv:2110.05706v5 [cs.CV] UPDATED)
    Multi-focus image fusion (MFIF) and super-resolution (SR) are inverse problems of the imaging model; the purpose of MFIF and SR is to obtain an all-in-focus, high-resolution 2D mapping of targets. Though various MFIF and SR methods have been designed, almost all of them deal with MFIF and SR separately. This paper unifies the MFIF and SR problems from the physical perspective as multi-focus image super-resolution fusion (MFISRF), and we propose a novel unified dataset-free unsupervised framework named deep fusion prior (DFP), based on the deep image prior (DIP), to address MFISRF with a single model. Experiments have shown that our proposed DFP approaches or even outperforms state-of-the-art MFIF and SR method combinations. To the best of our knowledge, DFP is the first dataset-free unsupervised method to implement the multi-focus fusion and super-resolution tasks simultaneously. Additionally, DFP is a general framework; thus its networks and focus measurement tactics can be continuously updated to further improve MFISRF performance. DFP code is open source and available at this http URL
    Accelerating Inhibitor Discovery With A Deep Generative Foundation Model: Validation for SARS-CoV-2 Drug Targets. (arXiv:2204.09042v3 [q-bio.QM] UPDATED)
    The discovery of novel inhibitor molecules for emerging drug-target proteins is widely acknowledged as a challenging inverse design problem: exhaustive exploration of the vast chemical search space is impractical, especially when the target structure or active molecules are unknown. Here we experimentally validate the broad utility of a deep generative framework trained at scale on protein sequences, small molecules, and their mutual interactions, one that is unbiased toward any specific target. As demonstrators, we consider two dissimilar and relevant SARS-CoV-2 targets: the main protease and the spike protein (receptor binding domain, RBD). To perform target-aware design of novel inhibitor molecules, protein sequence-conditioned sampling of the generative foundation model is performed. Despite using only the target sequence information, and without performing any target-specific adaptation of the generative model, micromolar-level inhibition was observed in in vitro experiments for two candidates out of only four synthesized for each target. The most potent spike RBD inhibitor also exhibited activity against several variants in live virus neutralization assays. These results therefore establish that a single, broadly deployable generative foundation model for accelerated hit discovery is effective and efficient, even in the most general case where neither target structure nor binder information is available.
    Additive MIL: Intrinsically Interpretable Multiple Instance Learning for Pathology. (arXiv:2206.01794v2 [cs.CV] UPDATED)
    Multiple Instance Learning (MIL) has been widely applied in pathology towards solving critical problems such as automating cancer diagnosis and grading, predicting patient prognosis, and therapy response. Deploying these models in a clinical setting requires careful inspection of these black boxes during development and deployment to identify failures and maintain physician trust. In this work, we propose a simple formulation of MIL models, which enables interpretability while maintaining similar predictive performance. Our Additive MIL models enable spatial credit assignment such that the contribution of each region in the image can be exactly computed and visualized. We show that our spatial credit assignment coincides with regions used by pathologists during diagnosis and improves upon classical attention heatmaps from attention MIL models. We show that any existing MIL model can be made additive with a simple change in function composition. We also show how these models can be used to debug model failures, identify spurious features, and highlight class-wise regions of interest, enabling their use in high-stakes environments such as clinical decision-making.
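    A minimal sketch of the additive formulation (ours; the paper's models also include an attention module before scoring, omitted here): the bag logits are an exact sum of per-instance logits, so each patch's contribution can be read off directly.
        import torch

        class AdditiveMIL(torch.nn.Module):
            def __init__(self, d, n_classes):
                super().__init__()
                self.score = torch.nn.Linear(d, n_classes)

            def forward(self, patches):              # patches: (n_patches, d)
                contrib = self.score(patches)        # per-patch class logits
                return contrib.sum(dim=0), contrib   # bag logits + exact credit map

        model = AdditiveMIL(d=128, n_classes=2)
        bag_logits, per_patch = model(torch.randn(50, 128))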
    Crossover-SGD: A gossip-based communication in distributed deep learning for alleviating large mini-batch problem and enhancing scalability. (arXiv:2012.15198v2 [cs.LG] UPDATED)
    Distributed deep learning is an effective way to reduce the training time of deep learning for large datasets as well as complex models. However, the limited scalability caused by network overheads makes it difficult to synchronize the parameters of all workers. To resolve this problem, gossip-based methods, which demonstrate stable scalability regardless of the number of workers, have been proposed. However, to use gossip-based methods in general cases, their validation accuracy under large mini-batches needs to be verified. We therefore first empirically study the characteristics of gossip methods in the large mini-batch problem and observe that gossip methods preserve higher validation accuracy than AllReduce-SGD (Stochastic Gradient Descent) when the batch size is increased and the number of workers is fixed. However, the delayed parameter propagation of gossip-based models decreases validation accuracy at large node scales. To cope with this problem, we propose Crossover-SGD, which alleviates the delayed propagation of weight parameters via segment-wise communication and a load-balancing random network topology. We also adapt hierarchical communication to limit the number of workers in gossip-based communication methods. To validate the effectiveness of our proposed method, we conduct empirical experiments and observe that Crossover-SGD shows higher node scalability than SGP (Stochastic Gradient Push).
    Telehealthcare and Telepathology in Pandemic: A Noninvasive, Low-Cost Micro-Invasive and Multimodal Real-Time Online Application for Early Diagnosis of COVID-19 Infection. (arXiv:2109.07846v2 [cs.LG] UPDATED)
    The coronavirus pandemic crippled healthcare facilities, mandating lockdowns and promoting remote work to contain the spread of the virus and stop the overcrowding of hospitalized patients. As a result, telehealth has become increasingly popular for offering low-risk care to patients. However, constant virus mutation into new forms and a general lack of test kits, particularly in developing nations, have made it harder to prevent the next potential waves of infection. In this research, a unique cloud-based application for the early identification of individuals who may have COVID-19 infection is proposed. The application provides five modes of diagnosis: from possible symptoms (f1), cough sound (f2), specific blood biomarkers (f3), Raman spectral data of blood specimens (f4), and ECG signal paper-based images (f5). When a user selects an option and enters the information, the data is sent to the cloud server. The deployed machine learning (ML) and deep learning (DL) models classify the data in real time and inform the user of the likelihood of COVID-19 infection. Our deployed models can classify with an accuracy of 100%, 99.80%, 99.55%, 95.65%, and 77.59% from f3, f4, f5, f2, and f1, respectively. Moreover, the sensitivity for f2, f3, and f4 is 100%, which indicates correct identification of COVID-positive patients and is significant in limiting the spread of the virus. Additionally, another ML model, seen to offer 92% accuracy, serves to identify patients who, out of a large group admitted to the hospital cohort, need immediate critical care support, by estimating the mortality risk of patients from blood parameters. The instantaneous multimodal nature of our technique offers multiplex and accurate diagnostic methods, highlighting the effectiveness of telehealth as a simple, widely available, and low-cost diagnostic solution, even for future pandemics.
    The alignment property of SGD noise and how it helps select flat minima: A stability analysis. (arXiv:2207.02628v3 [stat.ML] UPDATED)
    The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness -- as measured by the Frobenius norm of the Hessian -- is bounded \emph{independently} of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: The noise concentrates in sharp directions of local landscape and the magnitude is proportional to loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are also justified by extensive experiments on CIFAR-10 dataset.
    First-Order Optimization Inspired from Finite-Time Convergent Flows. (arXiv:2010.02990v4 [cs.LG] UPDATED)
    In this paper, we investigate the performance of two first-order optimization algorithms obtained from the forward Euler discretization of finite-time optimization flows. These flows are the rescaled-gradient flow (RGF) and the signed-gradient flow (SGF), and consist of non-Lipschitz or discontinuous dynamical systems that converge locally in finite time to the minima of gradient-dominated functions. We propose an Euler discretization for these first-order finite-time flows and provide convergence guarantees in the deterministic and the stochastic settings. We then apply the proposed algorithms to academic examples as well as deep neural network training, where we empirically test their performance on the SVHN dataset. Our results show that our schemes demonstrate faster convergence than standard optimization alternatives.
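    A minimal sketch of the two discretized updates (ours; the step size and the normalization safeguard are illustrative assumptions):
        import numpy as np

        def rgf_step(x, grad_fn, eta=0.01):
            # Forward-Euler step of a rescaled-gradient flow:
            # a fixed-length step along the normalized gradient.
            g = grad_fn(x)
            return x - eta * g / max(np.linalg.norm(g), 1e-12)

        def sgf_step(x, grad_fn, eta=0.01):
            # Forward-Euler step of the signed-gradient flow:
            # a fixed step along the element-wise sign of the gradient.
            return x - eta * np.sign(grad_fn(x))

        # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
        x = np.array([1.0, -2.0])
        for _ in range(300):
            x = sgf_step(x, lambda v: v)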
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models. (arXiv:2110.08500v3 [stat.ML] UPDATED)
    We theoretically analyze the model selection consistency of the least absolute shrinkage and selection operator (Lasso) for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso without post-thresholding is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega(d^3 \log p)$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture of Meng, Obuchi, and Kabashima (2021) based on the non-rigorous replica method from statistical physics, and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as for RR graphs can be obtained under mild assumptions on the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.
    Multi-Game Decision Transformers. (arXiv:2205.15241v2 [cs.AI] UPDATED)
    A longstanding goal of the field of AI is a method for learning a highly capable, generalist agent from diverse experience. In the subfields of vision and language, this was largely achieved by scaling up transformer-based models and training them on large, diverse datasets. Motivated by this progress, we investigate whether the same strategy can be used to produce generalist reinforcement learning agents. Specifically, we show that a single transformer-based model - with a single set of weights - trained purely offline can play a suite of up to 46 Atari games simultaneously at close-to-human performance. When trained and evaluated appropriately, we find that the same trends observed in language and vision hold, including scaling of performance with model size and rapid adaptation to new games via fine-tuning. We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning, and find that our Multi-Game Decision Transformer models offer the best scalability and performance. We release the pre-trained models and code to encourage further research in this direction.
    Rethinking and Scaling Up Graph Contrastive Learning: An Extremely Efficient Approach with Group Discrimination. (arXiv:2206.01535v2 [cs.LG] UPDATED)
    Graph contrastive learning (GCL) alleviates the heavy reliance on label information for graph representation learning (GRL) via self-supervised learning schemes. The core idea is to learn by maximising mutual information between similar instances, which requires similarity computation between pairs of node instances. However, GCL is inefficient in both time and memory consumption, and normally requires a large number of training epochs to be well-trained on large-scale datasets. Inspired by an observation of a technical defect (i.e., inappropriate usage of the Sigmoid function) in two representative GCL works, DGI and MVGRL, we revisit GCL and introduce a new learning paradigm for self-supervised graph representation learning, namely Group Discrimination (GD), and propose a novel GD-based method called Graph Group Discrimination (GGD). Instead of similarity computation, GGD directly discriminates between two groups of node samples with a very simple binary cross-entropy loss. In addition, GGD requires far fewer training epochs than GCL methods to obtain competitive performance on large-scale datasets. These two advantages make GGD highly efficient. Extensive experiments show that GGD outperforms state-of-the-art self-supervised methods on eight datasets. In particular, GGD can be trained in 0.18 seconds (6.44 seconds including data preprocessing) on ogbn-arxiv, which is orders of magnitude (10,000+) faster than GCL baselines, while consuming much less memory. Trained for 9 hours on ogbn-papers100M, which has over a billion edges, GGD outperforms its GCL counterparts in both accuracy and efficiency.
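    A rough sketch of the group-discrimination loss (ours; the scalar summarizer and the corruption scheme are simplified stand-ins for the paper's design): each node embedding is summarized to a single score, and real versus corrupted nodes are separated with a plain binary cross-entropy, with no pairwise similarity computation.
        import torch

        def group_discrimination_loss(h_real, h_corrupt):
            scores = torch.cat([h_real.sum(dim=1), h_corrupt.sum(dim=1)])
            labels = torch.cat([torch.ones(len(h_real)), torch.zeros(len(h_corrupt))])
            return torch.nn.functional.binary_cross_entropy_with_logits(scores, labels)

        # h_real: GNN embeddings of the graph; h_corrupt: embeddings of a corrupted view.
        loss = group_discrimination_loss(torch.randn(32, 64), torch.randn(32, 64))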
    POTATO: exPlainable infOrmation exTrAcTion framewOrk. (arXiv:2201.13230v2 [cs.CL] UPDATED)
    We present POTATO, a task- and language-independent framework for human-in-the-loop (HITL) learning of rule-based text classifiers using graph-based features. POTATO handles any type of directed graph and supports parsing text into Abstract Meaning Representations (AMR), Universal Dependencies (UD), and 4lang semantic graphs. A Streamlit-based user interface allows users to build rule systems from graph patterns, provides real-time evaluation based on ground-truth data, and suggests rules by ranking graph features using interpretable machine learning models. Users can also provide patterns over graphs using regular expressions, and POTATO can recommend refinements of such rules. POTATO is applied in projects across domains and languages, including classification tasks on German legal text and English social media data. All components of our system are written in Python, can be installed via pip, and are released under an MIT License on GitHub.
    TrafficFlowGAN: Physics-informed Flow based Generative Adversarial Network for Uncertainty Quantification. (arXiv:2206.09319v2 [cs.LG] UPDATED)
    This paper proposes the TrafficFlowGAN, a physics-informed flow based generative adversarial network (GAN), for uncertainty quantification (UQ) of dynamical systems. TrafficFlowGAN adopts a normalizing flow model as the generator to explicitly estimate the data likelihood. This flow model is trained to maximize the data likelihood and to generate synthetic data that can fool a convolutional discriminator. We further regularize this training process using prior physics information, so-called physics-informed deep learning (PIDL). To the best of our knowledge, we are the first to propose an integration of flow, GAN and PIDL for the UQ problems. We take the traffic state estimation (TSE), which aims to estimate the traffic variables (e.g. traffic density and velocity) using partially observed data, as an example to demonstrate the performance of our proposed model. We conduct numerical experiments where the proposed model is applied to learn the solutions of stochastic differential equations. The results demonstrate the robustness and accuracy of the proposed model, together with the ability to learn a machine learning surrogate model. We also test it on a real-world dataset, the Next Generation SIMulation (NGSIM), to show that the proposed TrafficFlowGAN can outperform the baselines, including the pure flow model, the physics-informed flow model, and the flow based GAN model.
    Gradient Descent: The Ultimate Optimizer. (arXiv:1909.13371v2 [cs.LG] UPDATED)
    Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time. We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).
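    The paper derives hypergradients automatically inside backpropagation; the manually derived special case for the step size, which the paper generalizes, looks roughly like this sketch of ours on a toy quadratic. Since d(theta_t)/d(alpha) = -grad_{t-1}, the hypergradient of the loss w.r.t. alpha is -grad_t . grad_{t-1}, so alpha is updated by its own gradient step:
        import numpy as np

        def hypergradient_sgd(grad_fn, theta, alpha=0.01, beta=1e-4, steps=100):
            prev_grad = np.zeros_like(theta)
            for _ in range(steps):
                g = grad_fn(theta)
                alpha += beta * np.dot(g, prev_grad)  # gradient step on alpha itself
                theta = theta - alpha * g
                prev_grad = g
            return theta, alpha

        # Toy quadratic f(theta) = 0.5 * ||theta||^2, so grad_fn is the identity.
        theta, alpha = hypergradient_sgd(lambda t: t, np.array([3.0, -1.0]))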
    Sampling with Riemannian Hamiltonian Monte Carlo in a Constrained Space. (arXiv:2202.01908v2 [cs.LG] UPDATED)
    We demonstrate for the first time that ill-conditioned, non-smooth, constrained distributions in very high dimension, upwards of 100,000, can be sampled efficiently $\textit{in practice}$. Our algorithm incorporates constraints into the Riemannian version of Hamiltonian Monte Carlo and maintains sparsity. This allows us to achieve a mixing rate independent of smoothness and condition numbers. On benchmark data sets in systems biology and linear programming, our algorithm outperforms existing packages by orders of magnitude. In particular, we achieve a 1,000-fold speed-up for sampling from the largest published human metabolic network (RECON3D). Our package has been incorporated into the COBRA toolbox.
    DI-NIDS: Domain Invariant Network Intrusion Detection System. (arXiv:2210.08252v1 [cs.CR])
    The performance of machine learning based network intrusion detection systems (NIDSs) severely degrades when deployed on a network with feature distributions significantly different from those of the training dataset. In various applications, such as computer vision, domain adaptation techniques have been successful in mitigating the gap between the distributions of training and test data. In the case of network intrusion detection, however, state-of-the-art domain adaptation approaches have had limited success. According to recent studies, as well as our own results, the performance of an NIDS deteriorates considerably when the `unseen' test dataset does not follow the training dataset distribution; in some cases, swapping the train and test datasets makes this even more severe. In order to enhance the generalisability of machine learning based network intrusion detection systems, we propose to extract domain-invariant features using adversarial domain adaptation from multiple network domains, and then apply an unsupervised technique for recognising abnormalities, i.e., intrusions. More specifically, we train a domain-adversarial neural network on labelled source domains, extract the domain-invariant features, and train a One-Class SVM (OSVM) model to detect anomalies. At test time, we feed the unlabelled test data through the feature extractor network to project it into a domain-invariant space, and then apply the OSVM on the extracted features to achieve our final goal of detecting intrusions. Our extensive experiments on the NIDS benchmark datasets NFv2-CIC-2018 and NFv2-UNSW-NB15 show that our proposed setup demonstrates superior cross-domain performance in comparison to previous approaches.
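    A minimal sketch of the detection stage (ours; extract_features stands in for the trained domain-adversarial feature extractor, and the identity placeholder below is purely illustrative):
        import numpy as np
        from sklearn.svm import OneClassSVM

        def detect_intrusions(extract_features, X_benign, X_test, nu=0.05):
            f_train = extract_features(X_benign)     # domain-invariant features
            f_test = extract_features(X_test)
            ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(f_train)
            return ocsvm.predict(f_test) == -1       # True = flagged as an intrusion

        flags = detect_intrusions(lambda X: X,
                                  np.random.randn(200, 8), np.random.randn(50, 8))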
    Deep Policy Iteration with Integer Programming for Inventory Management. (arXiv:2112.02215v2 [cs.LG] UPDATED)
    We present a Reinforcement Learning (RL) based framework for optimizing long-term discounted reward problems with large combinatorial action spaces and state-dependent constraints. These characteristics are common to many operations management problems, e.g., network inventory replenishment, where managers have to deal with uncertain demand, lost sales, and capacity constraints that result in more complex feasible action spaces. Our proposed Programmable Actor Reinforcement Learning (PARL) uses a deep-policy iteration method that leverages neural networks (NNs) to approximate the value function and combines it with mathematical programming (MP) and sample average approximation (SAA) to solve for the per-step action optimally while accounting for combinatorial action spaces and state-dependent constraint sets. We show how the proposed methodology can be applied to complex inventory replenishment problems where analytical solutions are intractable. We also benchmark the proposed algorithm against state-of-the-art RL algorithms and commonly used replenishment heuristics, and find that the proposed algorithm considerably outperforms existing methods, by as much as 14.7\% on average, in various supply chain settings. This improvement in performance of PARL over benchmark algorithms can be attributed to better inventory cost management, especially in inventory-constrained settings. Furthermore, in a simpler back-order setting where the optimal solution is tractable, we find that the RL-based policy also converges to the optimal policy. Finally, to make RL algorithms more accessible for inventory management researchers, we also discuss a modular Python library we developed that can be used to test the performance of RL algorithms with various supply chain structures. This library can spur future research in developing practical and near-optimal algorithms for inventory management problems.
    LEMON: LanguagE ModeL for Negative Sampling of Knowledge Graph Embeddings. (arXiv:2203.04703v3 [cs.AI] UPDATED)
    Knowledge Graph Embedding models have become an important area of machine learning. These models provide a latent representation of entities and relations in a knowledge graph which can then be used in downstream machine learning tasks such as link prediction. The learning process of such models can be performed by contrasting positive and negative triples. While all triples of a KG are considered positive, negative triples are usually not readily available. Therefore, the choice of the sampling method used to obtain negative triples plays a crucial role in the performance and effectiveness of Knowledge Graph Embedding models. Most current methods fetch negative samples from a random distribution of entities in the underlying Knowledge Graph, which often also includes meaningless triples. Other known methods use adversarial techniques or generative neural networks, which consequently reduce the efficiency of the process. In this paper, we propose an approach for generating informative negative samples that considers available complementary knowledge about entities. In particular, pre-trained language models are used to form neighborhood clusters, utilizing the distances between entity representations obtained from their textual information. Our comprehensive evaluations demonstrate the effectiveness of the proposed approach on benchmark Knowledge Graphs with textual information for the link prediction task.
    Towards Green Automated Machine Learning: Status Quo and Future Directions. (arXiv:2111.05850v3 [cs.LG] UPDATED)
    Automated machine learning (AutoML) strives for the automatic configuration of machine learning algorithms and their composition into an overall (software) solution - a machine learning pipeline - tailored to the learning task (dataset) at hand. Over the last decade, AutoML has developed into an independent research field with hundreds of contributions. At the same time, AutoML is being criticised for its high resource consumption, as many approaches rely on the (costly) evaluation of many machine learning pipelines, as well as expensive large-scale experiments across many datasets and approaches. In the spirit of recent work on Green AI, this paper proposes Green AutoML, a paradigm to make the whole AutoML process more environmentally friendly. To this end, we first elaborate on how to quantify the environmental footprint of an AutoML tool. Afterward, different strategies for designing and benchmarking an AutoML tool w.r.t. its "greenness", i.e. sustainability, are summarized. Finally, we elaborate on how to be transparent about the environmental footprint and what kinds of research incentives could direct the community in a more sustainable AutoML research direction. Additionally, we propose a sustainability checklist to be attached to every AutoML paper, featuring all core aspects of Green AutoML.
    Robust Binary Models by Pruning Randomly-initialized Networks. (arXiv:2202.01341v2 [cs.LG] UPDATED)
    Robustness to adversarial attacks was shown to require a larger model capacity, and thus a larger memory footprint. In this paper, we introduce an approach to obtain robust yet compact models by pruning randomly-initialized binary networks. Unlike adversarial training, which learns the model parameters, we initialize the model parameters as either +1 or -1, keep them fixed, and find a subnetwork structure that is robust to attacks. Our method confirms the Strong Lottery Ticket Hypothesis in the presence of adversarial attacks, and extends it to binary networks. Furthermore, it yields networks that are more compact than those of existing works, with competitive performance, by 1) adaptively pruning different network layers; 2) exploiting an effective binary initialization scheme; 3) incorporating a final batch normalization layer to improve training stability. Our experiments demonstrate that our approach not only consistently outperforms state-of-the-art robust binary networks, but can even achieve accuracy better than that of full-precision models on some datasets. Finally, we show the structured patterns of our pruned binary networks.
    Invariant Representation Driven Neural Classifier for Anti-QCD Jet Tagging. (arXiv:2201.07199v5 [hep-ph] UPDATED)
    We leverage representation learning and the inductive bias in neural-net-based Standard Model jet classification tasks to detect non-QCD signal jets. In establishing the framework for classification-based anomaly detection in jet physics, we demonstrate that, with a \emph{well-calibrated} and \emph{powerful enough feature extractor}, a well-trained \emph{mass-decorrelated} supervised Standard Model neural jet classifier can serve as a strong generic anti-QCD jet tagger for effectively reducing the QCD background. Imposing \emph{data-augmented} mass-invariance (and thus decoupling the dominant factor) not only facilitates background estimation, but also induces more substructure-aware representation learning. We are able to reach excellent tagging efficiencies for all the test signals considered. In the best case, we reach a background rejection rate of 51 and a significance improvement factor of 3.6 at 50\% signal acceptance, with the jet mass decorrelated. This study indicates that supervised Standard Model jet classifiers have great potential in general new physics searches.
    Robust Low-tubal-rank Tensor Completion based on Tensor Factorization and Maximum Correntropy Criterion. (arXiv:2010.11740v2 [cs.LG] UPDATED)
    The goal of tensor completion is to recover a tensor from a subset of its entries, often by exploiting its low-rank property. Among several useful definitions of tensor rank, the low-tubal-rank was shown to give a valuable characterization of the inherent low-rank structure of a tensor. While some low-tubal-rank tensor completion algorithms with favorable performance have been recently proposed, these algorithms utilize second-order statistics to measure the error residual, which may not work well when the observed entries contain large outliers. In this paper, we propose a new objective function for low-tubal-rank tensor completion, which uses correntropy as the error measure to mitigate the effect of the outliers. To efficiently optimize the proposed objective, we leverage a half-quadratic minimization technique whereby the optimization is transformed to a weighted low-tubal-rank tensor factorization problem. Subsequently, we propose two simple and efficient algorithms to obtain the solution and provide their convergence and complexity analysis. Numerical results using both synthetic and real data demonstrate the robust and superior performance of the proposed algorithms.
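    A minimal sketch of the half-quadratic reweighting at the heart of the approach (ours; sigma is the correntropy kernel bandwidth, an illustrative choice here): each observed residual receives a weight exp(-e^2 / (2 sigma^2)), so gross outliers are down-weighted and the problem reduces to weighted low-tubal-rank factorization.
        import numpy as np

        def correntropy_weights(residuals, sigma=1.0):
            return np.exp(-residuals ** 2 / (2 * sigma ** 2))

        # Toy residuals with one gross outlier: its weight collapses toward zero.
        e = np.array([0.1, -0.2, 0.05, 25.0])
        w = correntropy_weights(e)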
    QuAnt: Quantum Annealing with Learnt Couplings. (arXiv:2210.08114v1 [quant-ph])
    Modern quantum annealers can find high-quality solutions to combinatorial optimisation objectives given as quadratic unconstrained binary optimisation (QUBO) problems. Unfortunately, obtaining suitable QUBO forms in computer vision remains challenging and currently requires problem-specific analytical derivations. Moreover, such explicit formulations impose tangible constraints on solution encodings. In stark contrast to prior work, this paper proposes to learn QUBO forms from data through gradient backpropagation instead of deriving them. As a result, the solution encodings can be chosen flexibly and compactly. Furthermore, our methodology is general and virtually independent of the specifics of the target problem type. We demonstrate the advantages of learnt QUBOs on the diverse problem types of graph matching, 2D point cloud alignment and 3D rotation estimation. Our results are competitive with the previous quantum state of the art while requiring much fewer logical and physical qubits, enabling our method to scale to larger problems. The code and the new dataset will be open-sourced.
    MoRSE: Deep Learning-based Arm Gesture Recognition for Search and Rescue Operations. (arXiv:2210.08307v1 [cs.LG])
    Efficient and quick remote communication in search and rescue operations can be life-saving for the first responders. However, while operating in the field, means of communication based on text, image, and audio are not suitable for several disaster scenarios. In this paper, we present a smartwatch-based application, which utilizes a Deep Learning (DL) model to recognize a set of predefined arm gestures and map them into Morse code via vibrations, enabling remote communication amongst first responders. The model performance was evaluated by training it using 4,200 gestures performed by 7 subjects (cross-validation) wearing a smartwatch on their dominant arm. Our DL model relies on convolutional pooling and surpasses the performance of existing DL approaches and common machine learning classifiers, obtaining gesture recognition accuracy above 95%. We conclude by discussing the results and providing future directions.
    Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning. (arXiv:2110.04645v2 [cs.LG] UPDATED)
    Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to as the initial burn-in cost), our method improves -- by at least a factor of $S^5A^3$ -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.
    Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity. (arXiv:2206.07659v2 [cs.LG] UPDATED)
    We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v2 [cs.LG] UPDATED)
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    PDEBENCH: An Extensive Benchmark for Scientific Machine Learning. (arXiv:2210.07182v2 [cs.LG] UPDATED)
    Machine learning-based modeling of physical systems has experienced increased interest in recent years. Despite some impressive progress, there is still a lack of benchmarks for Scientific ML that are easy to use but still challenging and representative of a wide range of problems. We introduce PDEBench, a benchmark suite of time-dependent simulation tasks based on Partial Differential Equations (PDEs). PDEBench comprises both code and data to benchmark the performance of novel machine learning models against both classical numerical simulations and machine learning baselines. Our proposed set of benchmark problems contributes the following unique features: (1) a much wider range of PDEs compared to existing benchmarks, ranging from relatively common examples to more realistic and difficult problems; (2) much larger ready-to-use datasets compared to prior work, comprising multiple simulation runs across a larger number of initial and boundary conditions and PDE parameters; (3) more extensible source code with user-friendly APIs for data generation and baseline results with popular machine learning models (FNO, U-Net, PINN, Gradient-Based Inverse Method). PDEBench allows researchers to extend the benchmark freely for their own purposes using a standardized API and to compare the performance of new models to existing baseline methods. We also propose new evaluation metrics with the aim of providing a more holistic understanding of learning methods in the context of Scientific ML. With those metrics we identify tasks which are challenging for recent ML methods and propose these tasks as future challenges for the community. The code is available at https://github.com/pdebench/PDEBench.
    Learning Dynamical Systems via Koopman Operator Regression in Reproducing Kernel Hilbert Spaces. (arXiv:2205.14027v2 [cs.LG] UPDATED)
    We study a class of dynamical systems modelled as Markov chains that admit an invariant distribution via the corresponding transfer, or Koopman, operator. While data-driven algorithms to reconstruct such operators are well known, their relationship with statistical learning is largely unexplored. We formalize a framework to learn the Koopman operator from finite data trajectories of the dynamical system. We consider the restriction of this operator to a reproducing kernel Hilbert space and introduce a notion of risk, from which different estimators naturally arise. We link the risk with the estimation of the spectral decomposition of the Koopman operator. These observations motivate a reduced-rank operator regression (RRR) estimator. We derive learning bounds for the proposed estimator, holding both in i.i.d. and non i.i.d. settings, the latter in terms of mixing coefficients. Our results suggest RRR might be beneficial over other widely used estimators as confirmed in numerical experiments both for forecasting and mode decomposition.
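    A hedged numpy sketch of the classical reduced-rank regression construction the estimator builds on, written in a fixed finite feature basis (the ridge term and names are illustrative assumptions; the paper works in an RKHS):
        import numpy as np

        def rrr(phi_x, phi_y, rank, reg=1e-6):
            # Ridge-regularized least squares, then projection of the fitted values
            # onto their top singular directions -- the classical RRR solution.
            d = phi_x.shape[1]
            b_ols = np.linalg.solve(phi_x.T @ phi_x + reg * np.eye(d), phi_x.T @ phi_y)
            _, _, vt = np.linalg.svd(phi_x @ b_ols, full_matrices=False)
            proj = vt[:rank].T @ vt[:rank]   # rank-r projector
            return b_ols @ proj              # rank-r estimate of the operator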
    A Policy-Guided Imitation Approach for Offline Reinforcement Learning. (arXiv:2210.08323v1 [cs.LG])
    Offline reinforcement learning (RL) methods can generally be categorized into two types: RL-based and Imitation-based. RL-based methods could in principle enjoy out-of-distribution generalization but suffer from erroneous off-policy evaluation. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. In this study, we propose an alternative approach, inheriting the training stability of imitation-style methods while still allowing logical out-of-distribution generalization. We decompose the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. During training, the guide-policy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling where it should go so that the reward can be maximized, serving as the \textit{Prophet}. By doing so, our algorithm allows \textit{state-compositionality} from the dataset, rather than \textit{action-compositionality} conducted in prior imitation-style methods. We dub this new approach Policy-guided Offline RL (\texttt{POR}). \texttt{POR} demonstrates state-of-the-art performance on D4RL, a standard benchmark for offline RL. We also highlight the benefits of \texttt{POR} in terms of improving with supplementary suboptimal data and easily adapting to new tasks by only changing the guide-policy.
    Towards Axiomatic, Hierarchical, and Symbolic Explanation for Deep Models. (arXiv:2111.06206v4 [cs.LG] UPDATED)
    This paper aims to show that the inference logic of a deep model can be faithfully approximated as a sparse, symbolic causal graph. Such a causal graph potentially bridges the gap between connectionism and symbolism. To this end, the faithfulness of the causal graph is theoretically guaranteed, because we show that the causal graph can well mimic the model's output on an exponential number of different masked samples. Besides, such a causal graph can be further simplified and re-written as an And-Or graph (AOG), which explains the logical relationship between interactive concepts encoded by the deep model, without losing much explanation accuracy.
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v3 [cs.LG] UPDATED)
    Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works. Code is available at https://github.com/saurabhgarg1996/ATC_code/.
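    The ATC recipe is simple enough to state in a few lines. A hedged sketch using maximum softmax confidence as the score (the paper also considers other scores such as negative entropy; names are illustrative):
        import numpy as np

        def atc_estimate(source_conf, source_correct, target_conf):
            # Choose the threshold so that the fraction of labeled *source* points
            # scoring above it matches the source validation accuracy, then report
            # the fraction of *target* points above that same threshold.
            acc = source_correct.mean()
            t = np.quantile(source_conf, 1.0 - acc)
            return (target_conf > t).mean()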
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v1 [stat.ME])
    We consider the estimation of average treatment effects in observational studies without the standard assumption of unconfoundedness. We propose a new framework of robust causal inference under the general observational study setting with the possible existence of unobserved confounders. Our approach is based on the method of distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case and can be extended to the difference-in-difference and regression discontinuity designs as well as instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology to real-world settings.
    MASTAF: A Model-Agnostic Spatio-Temporal Attention Fusion Network for Few-shot Video Classification. (arXiv:2112.04585v3 [cs.CV] UPDATED)
    We propose MASTAF, a Model-Agnostic Spatio-Temporal Attention Fusion network for few-shot video classification. MASTAF takes input from a general video spatial and temporal representation, e.g., using 2D CNN, 3D CNN, and Video Transformer. Then, to make the most of such representations, we use self- and cross-attention models to highlight the critical spatio-temporal region to increase the inter-class variations and decrease the intra-class variations. Last, MASTAF applies a lightweight fusion network and a nearest neighbor classifier to classify each query video. We demonstrate that MASTAF improves the state-of-the-art performance on three few-shot video classification benchmarks (UCF101, HMDB51, and Something-Something-V2), e.g., reaching 91.6%, 69.5%, and 60.7% for five-way one-shot video classification, respectively.
    Exact Solutions of a Deep Linear Network. (arXiv:2202.04777v4 [stat.ML] UPDATED)
    This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that zero is a special point in deep neural network architecture. We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, qualitatively different from a network with only $1$ hidden layer. Practically, our result implies that common deep learning initialization methods are insufficient to ease the optimization of neural networks in general.
    Providing Error Detection for Deep Learning Image Classifiers Using Self-Explainability. (arXiv:2210.08210v1 [cs.CV])
    This paper proposes a self-explainable Deep Learning (SE-DL) system for an image classification problem that performs self-error detection. The self-error detection is key to improving the DL system's safe operation, especially in safety-critical applications such as automotive systems. An SE-DL system outputs both the class prediction and an explanation for that prediction, giving humans insight into how the system makes its predictions. Additionally, we leverage the explanation of the proposed SE-DL system to detect potential class prediction errors of the system. The proposed SE-DL system uses a set of concepts to generate the explanation. The concepts are human-understandable lower-level image features in each input image relevant to the higher-level class of that image. We present a concept selection methodology for scoring all concepts and selecting a subset of them based on their contribution to the error detection performance of the proposed SE-DL system. Finally, we present different error detection schemes using the proposed SE-DL system to compare them against an error detection scheme without any SE-DL system.
    AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation. (arXiv:2201.12570v4 [q-bio.BM] UPDATED)
    Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently using computational approaches. Here, we present \texttt{AntBO}: a combinatorial Bayesian optimisation framework enabling efficient \textit{in silico} design of the CDRH3 region. Ideally, antibodies are expected to have high target specificity and developability. We introduce a CDRH3 trust region that restricts the search to sequences with favourable developability scores to achieve this goal. For benchmarking, \texttt{AntBO} uses the \texttt{Absolut!} software suite as a black-box oracle to score the target specificity and affinity of designed antibodies \textit{in silico} in an unconstrained fashion~\citep{robert2021one}. The experiments performed for $159$ discretised antigens used in \texttt{Absolut!} demonstrate the benefit of \texttt{AntBO} in designing CDRH3 regions with diverse biophysical properties. In under $200$ calls to black-box oracle, \texttt{AntBO} can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, \texttt{AntBO} finds very-high affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude \texttt{AntBO} brings automated antibody design methods closer to what is practically viable for in vitro experimentation.
    Efficient Scheduling of Data Augmentation for Deep Reinforcement Learning. (arXiv:2206.00518v3 [cs.LG] UPDATED)
    In deep reinforcement learning (RL), data augmentation is widely considered as a tool to induce a set of useful priors about semantic consistency and improve sample efficiency and generalization performance. However, even when the prior is useful for generalization, distilling it into the RL agent often interferes with RL training and degrades sample efficiency. Meanwhile, the agent is forgetful of the prior due to the non-stationary nature of RL. These observations suggest two extreme schedules of distillation: (i) over the entire training; or (ii) only at the end. Hence, we devise a stand-alone network distillation method to inject the consistency prior at any time (even after RL), and a simple yet efficient framework to automatically schedule the distillation. Specifically, the proposed framework first focuses on mastering train environments regardless of generalization by adaptively deciding which ({\it or no}) augmentation to use for training. After this, we add the distillation to extract the remaining benefits for generalization from all the augmentations, which requires no additional new samples. In our experiments, we demonstrate the utility of the proposed framework, in particular of postponing the augmentation to the end of RL training.
    Application of deep convolutional neural networks to the computer-assisted diagnosis of Alzheimer's disease. (arXiv:2210.08330v1 [eess.IV])
    Currently, the diagnosis of Alzheimer's disease is a complex and error-prone process. Improving this diagnosis could allow earlier detection of the disease and improve the quality of life of patients and their families. For this work, we will use 249 brain images from two modalities: PET and MRI, taken from the ADNI database, and labelled into three classes according to the degree of development of Alzheimer's disease. We propose the development of a convolutional neural network to perform the classification of these images, during which, we will study the appropriate depth of the networks for this problem, the importance of pre-processing medical images, the use of transfer learning and data augmentation techniques as tools to reduce the effects of the problem of having too little data, and the simultaneous use of multiple medical imaging modalities. We also propose the application of an evaluation method that guarantees a good degree of repeatability of the results even when using a small dataset. Following this evaluation method, our best final model, which makes use of transfer learning with COVID-19 data, achieves an accuracy of 68\%. In addition, in an independent test set, this same model achieves 70\% accuracy, a promising result given the small size of our dataset. We further conclude that augmenting the depth of the networks helps with this problem, that image pre-processing is a fundamental process to address this type of medical problem, and that the use of data augmentation and the use of pre-trained networks with images of other diseases can provide significant improvements.
    Multi-Armed Bandits with Censored Consumption of Resources. (arXiv:2011.00813v2 [cs.LG] UPDATED)
    We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates the actual amount of allocated resources of each learning round as well as the optimality of realizable rewards. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of its regret upper bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.
    A Fast Post-Training Pruning Framework for Transformers. (arXiv:2204.09656v2 [cs.CL] UPDATED)
    Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-base and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain the models.
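    A hedged sketch of the Fisher-information mask-search ingredient: score each attention head by the squared gradient of the loss with respect to its all-ones mask over a sample dataset. The `head_mask` forward interface below is an assumption for illustration, not the paper's API:
        import torch

        def fisher_head_importance(model, head_mask, loader, loss_fn):
            # head_mask: all-ones tensor with requires_grad=True, one entry per head.
            importance = torch.zeros_like(head_mask)
            for x, y in loader:
                head_mask.grad = None
                loss = loss_fn(model(x, head_mask=head_mask), y)
                loss.backward()
                importance += head_mask.grad.detach() ** 2   # diagonal empirical Fisher
            return importance   # prune the heads with the smallest scores first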
    Exploration of Interpretability Techniques for Deep COVID-19 Classification using Chest X-ray Images. (arXiv:2006.02570v3 [eess.IV] UPDATED)
    The outbreak of COVID-19 has shocked the entire world with its fairly rapid spread and has challenged different sectors. One of the most effective ways to limit its spread is the early and accurate diagnosis of infected patients. Medical imaging such as X-ray and Computed Tomography (CT) combined with the potential of Artificial Intelligence (AI) plays an essential role in supporting the medical staff in the diagnosis process. Thereby, five different deep learning models (ResNet18, ResNet34, InceptionV3, InceptionResNetV2, and DenseNet161) and their Ensemble have been used in this paper to classify COVID-19, pneumoni{\ae} and healthy subjects using Chest X-Ray images. Multi-label classification was performed to predict multiple pathologies for each patient, if present. Foremost, the interpretability of each of the networks was thoroughly studied using local interpretability methods - occlusion, saliency, input X gradient, guided backpropagation, integrated gradients, and DeepLIFT, and using a global technique - neuron activation profiles. The mean Micro-F1 score of the models for COVID-19 classifications ranges from 0.66 to 0.875, and is 0.89 for the Ensemble of the network models. The qualitative results depicted the ResNets to be the most interpretable models. This research demonstrates the importance of using interpretability methods to compare different models before making the decision regarding the best-performing model.
    Effects of Image Size on Deep Learning. (arXiv:2101.11508v6 [cs.CV] UPDATED)
    In this paper the main objective is to determine the best size of late gadolinium enhancement (LGE)-magnetic resonance imaging (MRI) images for the training dataset to achieve optimal deep learning training outcomes. To determine the new size of LGE-MRI images from the reference training dataset, non-extra pixel and extra pixel interpolation algorithms are used. A novel strategy based on thresholding, median filtering, and subtraction operations is introduced and applied to remove extra class labels in interpolated ground truth (GT) segmentation masks. Fully automated quantification is achieved using the expectation maximization, weighted intensity, a priori information (EWA) algorithm, and the outcome of automatic semantic segmentation of LGE-MRI images with the convolutional neural network (CNN). In the experiments, common class metrics are used to evaluate the quality of semantic segmentation with a CNN architecture of interest (U-net) against the GT segmentation. Arbitrary threshold, comparison of the sums, and sums of differences are criteria or options used to estimate the relationship between semi-automatic and fully automated quantification of MI results. A close relationship between semi-automatic or manual and fully automated quantification of MI results was identified more clearly with the dataset of bigger LGE-MRI images than with the dataset of smaller ones: the best quantification results based on the bigger images were 55.5% closer to the manual or semi-automatic results, while the best results based on the smaller images were only 22.2% closer to the manual results.
    A Prompt Array Keeps the Bias Away: Debiasing Vision-Language Models with Adversarial Learning. (arXiv:2203.11933v3 [cs.LG] UPDATED)
    Vision-language models can encode societal biases and stereotypes, but there are challenges to measuring and mitigating these multimodal harms due to lacking measurement robustness and feature degradation. To address these challenges, we investigate bias measures and apply ranking metrics for image-text representations. We then investigate debiasing methods and show that prepending learned embeddings to text queries that are jointly trained with adversarial debiasing and a contrastive loss reduces various bias measures with minimal degradation to the image-text representation.
    DyFEn: Agent-Based Fee Setting in Payment Channel Networks. (arXiv:2210.08197v1 [cs.LG])
    In recent years, with the development of easy-to-use learning environments, the implementation and reproducible benchmarking of reinforcement learning algorithms have been greatly accelerated by utilizing these frameworks. In this article, we introduce the Dynamic Fee learning Environment (DyFEn), an open-source real-world financial network model. It can provide a testbed for evaluating different reinforcement learning techniques. To illustrate the promise of DyFEn, we present a challenging problem: simultaneous multi-channel dynamic fee setting for off-chain payment channels. This problem is well-known in the Bitcoin Lightning Network and has no effective solutions. Specifically, we report the empirical results of several commonly used deep reinforcement learning methods on this dynamic fee setting task as a baseline for further experiments. To the best of our knowledge, this work proposes the first virtual learning environment based on a simulation of blockchain and distributed ledger technologies, unlike many others which are based on physics simulations or game platforms.
    Movement Penalized Bayesian Optimization with Application to Wind Energy Systems. (arXiv:2210.08087v1 [stat.ML])
    Contextual Bayesian optimization (CBO) is a powerful framework for sequential decision-making given side information, with important applications, e.g., in wind energy systems. In this setting, the learner receives context (e.g., weather conditions) at each round, and has to choose an action (e.g., turbine parameters). Standard algorithms assume no cost for switching their decisions at every round. However, in many practical applications, there is a cost associated with such changes, which should be minimized. We introduce the episodic CBO with movement costs problem and, based on the online learning approach for metrical task systems of Coester and Lee (2019), propose a novel randomized mirror descent algorithm that makes use of Gaussian Process confidence bounds. We compare its performance with the offline optimal sequence for each episode and provide rigorous regret guarantees. We further demonstrate our approach on the important real-world application of altitude optimization for Airborne Wind Energy Systems. In the presence of substantial movement costs, our algorithm consistently outperforms standard CBO algorithms.
    BERTraffic: BERT-based Joint Speaker Role and Speaker Change Detection for Air Traffic Control Communications. (arXiv:2110.05781v3 [eess.AS] UPDATED)
    Automatic speech recognition (ASR) allows transcribing the communications between air traffic controllers (ATCOs) and aircraft pilots. The transcriptions are used later to extract ATC named entities, e.g., aircraft callsigns. One common challenge is speech activity detection (SAD) and speaker diarization (SD): when they fail, segments from two or more speakers remain in the same recording, jeopardizing the overall performance. We propose a system that combines SAD and a BERT model to perform speaker change detection and speaker role detection (SRD) by chunking ASR transcripts, i.e., SD with a defined number of speakers together with SRD. The proposed model is evaluated on real-life public ATC databases. Our BERT SD model baseline reaches up to 10% and 20% token-based Jaccard error rate (JER) in public and private ATC databases. We also achieved relative improvements of 32% and 7.7% in JER and SD error rate (DER), respectively, compared to VBx, a well-known SD system.
    A graph representation based on fluid diffusion model for data analysis: theoretical aspects and enhanced community detection. (arXiv:2112.04388v2 [cs.SI] UPDATED)
    Representing data by means of graph structures is one of the most effective approaches to extracting information in several data analysis applications. This is especially true when multimodal datasets are investigated, as records collected by means of diverse sensing strategies are taken into account and explored. Nevertheless, classic graph signal processing is based on a model for information propagation that is configured according to a heat diffusion mechanism. This system imposes several constraints and assumptions on the data properties that might not be valid for multimodal data analysis, especially when large scale datasets collected from heterogeneous sources are considered, so that the accuracy and robustness of the outcomes might be severely jeopardized. In this paper, we introduce a novel model for graph definition based on fluid diffusion. The proposed approach improves the ability of graph-based data analysis to take into account several issues of modern data analysis in operational scenarios, so as to provide a platform for precise, versatile, and efficient understanding of the phenomena underlying the records under exam, and to fully exploit the potential provided by the diversity of the records in obtaining a thorough characterization of the data and their significance. In this work, we focus our attention on using this fluid diffusion model to drive a community detection scheme, i.e., to divide multimodal datasets into many groups according to similarity among nodes in an unsupervised fashion. Experimental results achieved by testing real multimodal datasets in diverse application scenarios show that our method strongly outperforms state-of-the-art schemes for community detection in multimodal data analysis.
    Federated Learning of Large Models at the Edge via Principal Sub-Model Training. (arXiv:2208.13141v2 [cs.LG] UPDATED)
    Limited compute, memory, and communication capabilities of edge users create a significant bottleneck for federated learning (FL) of large models. Current literature typically tackles the challenge with a heterogeneous client setting or allows training to be offloaded to the server. However, the former requires a fraction of clients to train near-full models, which may not be achievable at the edge; while the latter can compromise privacy with sharing of intermediate representations or labels. In this work, we consider a realistic, but much less explored, cross-device FL setting in which no client has the capacity to train a full large model nor is willing to share any intermediate representations with the server. To this end, we present the Principal Sub-Model (PriSM) training methodology, which leverages the model's low-rank structure and kernel orthogonality to train sub-models in the orthogonal kernel space. More specifically, by applying singular value decomposition to the original kernels in the server model, PriSM first obtains a set of principal orthogonal kernels with importance weighted by their singular values. Thereafter, PriSM utilizes a novel sampling strategy that selects different subsets of the principal kernels independently to create sub-models for clients with reduced computation and communication requirements. Importantly, a kernel with a large singular value is assigned a high sampling probability. Thus, each sub-model is a low-rank approximation of the full large model, and all clients together achieve nearly full coverage of the principal kernels. To further improve memory efficiency, PriSM exploits low-rank structure in intermediate representations and allows each sub-model to learn only a subset of them while still preserving training performance.
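    A hedged numpy sketch of the sampling step: decompose a layer's flattened kernels with an SVD and draw a sub-model's principal kernels with probability proportional to their singular values (shapes and names are illustrative assumptions):
        import numpy as np

        def sample_principal_kernels(weight, keep, rng=np.random.default_rng()):
            # weight: (out_channels, fan_in) matrix of flattened conv kernels.
            u, s, vt = np.linalg.svd(weight, full_matrices=False)
            probs = s / s.sum()          # larger singular value -> higher probability
            idx = rng.choice(len(s), size=keep, replace=False, p=probs)
            return u[:, idx] * s[idx], vt[idx]   # factors of one client's sub-model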
    Distributionally Robust Multiclass Classification and Applications in Deep Image Classifiers. (arXiv:2210.08198v1 [cs.CV])
    We develop a Distributionally Robust Optimization (DRO) formulation for Multiclass Logistic Regression (MLR), which could tolerate data contaminated by outliers. The DRO framework uses a probabilistic ambiguity set defined as a ball of distributions that are close to the empirical distribution of the training set in the sense of the Wasserstein metric. We relax the DRO formulation into a regularized learning problem whose regularizer is a norm of the coefficient matrix. We establish out-of-sample performance guarantees for the solutions to our model, offering insights on the role of the regularizer in controlling the prediction error. We apply the proposed method in rendering deep Vision Transformer (ViT)-based image classifiers robust to random and adversarial attacks. Specifically, using the MNIST and CIFAR-10 datasets, we demonstrate reductions in test error rate by up to 83.5% and loss by up to 91.3% compared with baseline methods, by adopting a novel random training method.
    On Mixup Regularization. (arXiv:2006.06049v3 [cs.LG] UPDATED)
    Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has empirically been shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.
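    For reference, the augmentation itself is a few lines; a minimal numpy sketch with one shared mixing coefficient per batch, as is common (names are illustrative):
        import numpy as np

        def mixup(x, y_onehot, alpha=0.2, rng=np.random.default_rng()):
            # Convex combinations of random training pairs and of their labels.
            lam = rng.beta(alpha, alpha)
            perm = rng.permutation(len(x))
            return lam * x + (1 - lam) * x[perm], lam * y_onehot + (1 - lam) * y_onehot[perm]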
    The Stability-Efficiency Dilemma: Investigating Sequence Length Warmup for Training GPT Models. (arXiv:2108.06084v4 [cs.LG] UPDATED)
    Recent works have demonstrated great success in pre-training large-scale autoregressive language models on massive GPUs. To reduce the wall-clock training time, a common practice is to increase the batch size and learning rate. However, such practice is often brittle and leads to a so-called stability-efficiency dilemma: increasing the batch sizes and learning rates leads to better training efficiency but can also result in training instability, leading to poor generalization accuracy or failed runs. To better understand this phenomenon, we conduct an in-depth analysis on large-scale pre-training experiments replicating the GPT-2 model. We find that there is a strong correlation between training instability and extreme values of gradient variance, and that samples with long sequence lengths contribute to these extreme gradient variance values, especially at the beginning of training, indicating that long sequence lengths can be a main source of training instability. Based on the analysis, we present a Sequence Length Warmup method that aims to solve the training stability-efficiency dilemma. Experiments replicating GPT-2 models show that our approach enables stable training with 8x larger batch size and 4x larger learning rate, whereas the baseline approach struggles with training instability. To achieve the same or better zero-shot evaluation results, our method reduces the required number of training tokens and wall clock time by up to 2.2x and 3.7x, respectively. Experiments replicating the GPT-3 model (125M) show that our approach enables stable training with 8x larger batch size and 40x larger learning rate, and retains 99% of the zero-shot accuracy on 11 tasks using 10x less data and 17x less time compared to the original GPT-3 training recipe, while the baseline diverges under the same settings and only retains 95% of the accuracy at a lower learning rate.
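    A hedged sketch of the scheduling idea, using a linear schedule; the paper's exact schedule and the parameter values here are assumptions:
        def sequence_length(step, warmup_steps, start_len=64, max_len=2048):
            # Grow the training sequence length linearly during warmup, then train
            # at full length; long sequences, a key source of extreme gradient
            # variance early in training, are thereby avoided at the start.
            if step >= warmup_steps:
                return max_len
            return int(start_len + step / warmup_steps * (max_len - start_len))

        # e.g., truncate each batch: batch = tokens[:, :sequence_length(step, 10_000)]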
    Temporal Abstraction in Reinforcement Learning with the Successor Representation. (arXiv:2110.05740v2 [cs.LG] UPDATED)
    Reasoning at multiple levels of temporal abstraction is one of the key attributes of intelligence. In reinforcement learning, this is often modeled through temporally extended courses of actions called options. Options allow agents to make predictions and to operate at different levels of abstraction within an environment. Nevertheless, approaches based on the options framework often start with the assumption that a reasonable set of options is known beforehand. When this is not the case, there are no definitive answers for which options one should consider. In this paper, we argue that the successor representation (SR), which encodes states based on the pattern of state visitation that follows them, can be seen as a natural substrate for the discovery and use of temporal abstractions. To support our claim, we take a big picture view of recent results, showing how the SR can be used to discover options that facilitate either temporally-extended exploration or planning. We cast these results as instantiations of a general framework for option discovery in which the agent's representation is used to identify useful options, which are then used to further improve its representation. This results in a virtuous, never-ending, cycle in which both the representation and the options are constantly refined based on each other. Beyond option discovery itself, we also discuss how the SR allows us to augment a set of options into a combinatorially large counterpart without additional learning. This is achieved through the combination of previously learned options. Our empirical evaluation focuses on options discovered for exploration and on the use of the SR to combine them. The results of our experiments shed light on important design decisions involved in the definition of options and demonstrate the synergy of different methods based on the SR, such as eigenoptions and the option keyboard.
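    The SR itself admits a one-line temporal-difference update; a hedged tabular sketch (names and constants are illustrative):
        import numpy as np

        def sr_td_update(psi, s, s_next, alpha=0.1, gamma=0.95):
            # psi[s] estimates the discounted expected future visitation of every
            # state when starting from s; options such as eigenoptions can then be
            # derived from the spectrum of the learned psi matrix.
            onehot = np.zeros(psi.shape[1])
            onehot[s] = 1.0
            psi[s] += alpha * (onehot + gamma * psi[s_next] - psi[s])
            return psi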
    DC-BENCH: Dataset Condensation Benchmark. (arXiv:2207.09639v2 [cs.LG] UPDATED)
    Dataset Condensation is a newly emerging technique aiming at learning a tiny dataset that captures the rich information encoded in the original dataset. As the size of datasets contemporary machine learning models rely on becomes increasingly large, condensation methods become a prominent direction for accelerating network training and reducing data storage. Although numerous methods have been proposed in this rapidly growing field, evaluating and comparing different condensation methods is non-trivial and still remains an open issue. The quality of condensed datasets is often shadowed by many critical contributing factors to the end performance, such as data augmentation and model architectures. The lack of a systematic way to evaluate and compare condensation methods not only hinders our understanding of existing techniques, but also discourages practical usage of the synthesized datasets. This work provides the first large-scale standardized benchmark on Dataset Condensation. It consists of a suite of evaluations to comprehensively reflect the generalizability and effectiveness of condensation methods through the lens of their generated datasets. Leveraging this benchmark, we conduct a large-scale study of current condensation methods, and report many insightful findings that open up new possibilities for future development. The benchmark library, including evaluators, baseline methods, and generated datasets, is open-sourced to facilitate future research and application.
    Know Thyself: Transferable Visual Control Policies Through Robot-Awareness. (arXiv:2107.09047v3 [cs.LG] UPDATED)
    Training visual control policies from scratch on a new robot typically requires generating large amounts of robot-specific data. How might we leverage data previously collected on another robot to reduce or even completely remove this need for robot-specific data? We propose a "robot-aware control" paradigm that achieves this by exploiting readily available knowledge about the robot. We then instantiate this in a robot-aware model-based RL policy by training modular dynamics models that couple a transferable, robot-aware world dynamics module with a robot-specific, potentially analytical, robot dynamics module. This also enables us to set up visual planning costs that separately consider the robot agent and the world. Our experiments on tabletop manipulation tasks with simulated and real robots demonstrate that these plug-in improvements dramatically boost the transferability of visual model-based RL policies, even permitting zero-shot transfer of visual manipulation skills onto new robots. Project website: https://www.seas.upenn.edu/~hued/rac
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v6 [stat.ML] UPDATED)
    We study the generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD) in the under- and over-parameterized regimes. In this work, we derive precise non-asymptotic error bounds of RF regression under both constant and polynomial-decay step-size SGD settings, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimum-norm interpolator, as a theoretical justification of using SGD in practice.
    A Closer Look at the Calibration of Differentially Private Learners. (arXiv:2210.08248v1 [cs.LG])
    We systematically study the calibration of classifiers trained with differentially private stochastic gradient descent (DP-SGD) and observe miscalibration across a wide range of vision and language tasks. Our analysis identifies per-example gradient clipping in DP-SGD as a major cause of miscalibration, and we show that existing approaches for improving calibration with differential privacy only provide marginal improvements in calibration error while occasionally causing large degradations in accuracy. As a solution, we show that differentially private variants of post-processing calibration methods such as temperature scaling and Platt scaling are surprisingly effective and have negligible utility cost to the overall model. Across 7 tasks, temperature scaling and Platt scaling with DP-SGD result in an average 3.1-fold reduction in the in-domain expected calibration error and incur at most a minor percentage drop in accuracy.
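    For orientation, the non-private core of temperature scaling fits a single scalar on held-out logits; a hedged sketch (the paper's variant additionally privatizes this fitting step):
        import numpy as np
        from scipy.optimize import minimize_scalar

        def fit_temperature(logits, labels):
            # Find T > 0 minimizing the NLL of softmax(logits / T) on held-out data.
            def nll(t):
                z = logits / t
                z = z - z.max(axis=1, keepdims=True)          # numerical stability
                logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
                return -logp[np.arange(len(labels)), labels].mean()
            return minimize_scalar(nll, bounds=(0.05, 20.0), method="bounded").x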
    Classification of Web Phishing Kits for early detection by platform providers. (arXiv:2210.08273v1 [cs.CR])
    Phishing kits are tools that dark side experts provide to the community of criminal phishers to facilitate the construction of malicious Web sites. As these kits evolve in sophistication, providers of Web-based services need to keep pace with continuous complexity. We present an original classification of a corpus of over 2000 recent phishing kits according to their adopted evasion and obfuscation functions. We carry out an initial deterministic analysis of the source code of the kits to extract the most discriminant features and information about their principal authors. We then integrate this initial classification through supervised machine learning models. Thanks to the ground-truth achieved in the first step, we can demonstrate whether and which machine learning models are able to suitably classify even the kits adopting novel evasion and obfuscation techniques that were unseen during the training phase. We compare different algorithms and evaluate their robustness in the realistic case in which only a small number of phishing kits are available for training. This paper represents an initial but important step to support Web service providers and analysts in improving early detection mechanisms and intelligence operations for the phishing kits that might be installed on their platforms.
    Exploring Transformer Backbones for Heterogeneous Treatment Effect Estimation. (arXiv:2202.01336v5 [cs.LG] UPDATED)
    Previous works on Treatment Effect Estimation (TEE) are not in widespread use because they are predominantly theoretical, making strong parametric assumptions that are intractable for practical application. Recent work uses multilayer perceptrons (MLPs) for modeling causal relationships; however, MLPs lag far behind recent advances in ML methodology, which limits their applicability and generalizability. To extend beyond the single domain formulation and towards more realistic learning scenarios, we explore model design spaces beyond MLPs, i.e., transformer backbones, which provide flexibility where attention layers govern interactions among treatments and covariates to exploit structural similarities of potential outcomes for confounding control. Through careful model design, Transformers as Treatment Effect Estimators (TransTEE) is proposed. We show empirically that TransTEE can: (1) serve as a general purpose treatment effect estimator that significantly outperforms competitive baselines in a variety of challenging TEE problems (e.g., discrete, continuous, structured, or dosage-associated treatments) and is applicable both when covariates are tabular and when they consist of structural data (e.g., texts, graphs); (2) yield multiple advantages: compatibility with propensity score modeling, parameter efficiency, robustness to continuous treatment value distribution shifts, explainability in covariate adjustment, and real-world utility in auditing pre-trained language models.
    Zero-shot meta-learning for small-scale data from human subjects. (arXiv:2203.16309v3 [cs.LG] UPDATED)
    While developments in machine learning led to impressive performance gains on big data, many human subjects data are, in actuality, small and sparsely labeled. Existing methods applied to such data often do not easily generalize to out-of-sample subjects. Instead, models must make predictions on test data that may be drawn from a different distribution, a problem known as \textit{zero-shot learning}. To address this challenge, we develop an end-to-end framework using a meta-learning approach, which enables the model to rapidly adapt to a new prediction task with limited training data for out-of-sample test data. We use three real-world small-scale human subjects datasets (two randomized control studies and one observational study), for which we predict treatment outcomes for held-out treatment groups. Our model learns the latent treatment effects of each intervention and, by design, can naturally handle multi-task predictions. We show that our model performs the best holistically for each held-out group and especially when the test group is distinctly different from the training group. Our model has implications for improved generalization of small-size human studies to the wider population.
    Practical Benefits of Feature Feedback Under Distribution Shift. (arXiv:2110.07566v2 [cs.CL] UPDATED)
    In attempts to develop sample-efficient and interpretable algorithms, researchers have explored myriad mechanisms for collecting and exploiting feature feedback (or rationales), auxiliary annotations provided for training (but not test) instances that highlight salient evidence. Examples include bounding boxes around objects and salient spans in text. Despite its intuitive appeal, feature feedback has not delivered significant gains in practical problems as assessed on iid holdout sets. However, recent works on counterfactually augmented data suggest an alternative benefit of supplemental annotations, beyond interpretability: lessening sensitivity to spurious patterns and consequently delivering gains in out-of-domain evaluations. We speculate that while existing methods for incorporating feature feedback have delivered negligible in-sample performance gains, they may nevertheless provide out-of-domain benefits. Our experiments addressing sentiment analysis show that feature feedback methods perform significantly better on various natural out-of-domain datasets despite comparable in-domain evaluations. By contrast, performance on natural language inference remains comparable. Finally, we compare the tasks where feature feedback does (and does not) help.
    PirouNet: Creating Dance through Artist-Centric Deep Learning. (arXiv:2207.12126v2 [cs.LG] UPDATED)
    Using Artificial Intelligence (AI) to create dance choreography with intention is still at an early stage. Methods that conditionally generate dance sequences remain limited in their ability to follow choreographer-specific creative direction, often relying on external prompts or supervised learning. In the same vein, fully annotated dance datasets are rare and labor intensive. To fill this gap and help leverage deep learning as a meaningful tool for choreographers, we propose "PirouNet", a semi-supervised conditional recurrent variational autoencoder together with a dance labeling web application. PirouNet allows dance professionals to annotate data with their own subjective creative labels and subsequently generate new bouts of choreography based on their aesthetic criteria. Thanks to the proposed semi-supervised approach, PirouNet only requires a small portion of the dataset to be labeled, typically on the order of 1%. We demonstrate PirouNet's capabilities as it generates original choreography based on the "Laban Time Effort", an established dance notion describing intention for a movement's time dynamics. We extensively evaluate PirouNet's dance creations through a series of qualitative and quantitative metrics, validating its applicability as a tool for choreographers.
    PI-QT-Opt: Predictive Information Improves Multi-Task Robotic Reinforcement Learning at Scale. (arXiv:2210.08217v1 [cs.RO])
    The predictive information, the mutual information between the past and future, has been shown to be a useful representation learning auxiliary loss for training reinforcement learning agents, as the ability to model what will happen next is critical to success on many control tasks. While existing studies are largely restricted to training specialist agents on single-task settings in simulation, in this work, we study modeling the predictive information for robotic agents and its importance for general-purpose agents that are trained to master a large repertoire of diverse skills from large amounts of data. Specifically, we introduce Predictive Information QT-Opt (PI-QT-Opt), a QT-Opt agent augmented with an auxiliary loss that learns representations of the predictive information to solve up to 297 vision-based robot manipulation tasks in simulation and the real world with a single set of parameters. We demonstrate that modeling the predictive information significantly improves success rates on the training tasks and leads to better zero-shot transfer to unseen novel tasks. Finally, we evaluate PI-QT-Opt on real robots, achieving substantial and consistent improvement over QT-Opt in multiple experimental settings of varying environments, skills, and multi-task configurations.
    ToupleGDD: A Fine-Designed Solution of Influence Maximization by Deep Reinforcement Learning. (arXiv:2210.07500v1 [cs.SI] CROSS LISTED)
    Online social platforms have become more and more popular, and the dissemination of information on social networks has attracted wide attention from industry and academia. Aiming at selecting a small subset of nodes with maximum influence on networks, the Influence Maximization (IM) problem has been extensively studied. Since it is #P-hard to compute the influence spread given a seed set, the state-of-the-art methods, including heuristic and approximation algorithms, face great difficulties with theoretical guarantees, time efficiency, generalization, etc. This makes them unable to adapt to large-scale networks and more complex applications. With the latest achievements of Deep Reinforcement Learning (DRL) in artificial intelligence and other fields, many works have focused on exploiting DRL to solve combinatorial optimization problems. Inspired by this, we propose a novel end-to-end DRL framework, ToupleGDD, to address the IM problem in this paper, which incorporates three coupled graph neural networks for network embedding and double deep Q-networks for parameter learning. Previous efforts to solve the IM problem with DRL trained their models on a subgraph of the whole network and then tested their performance on the whole graph, which makes the performance of their models unstable across different networks. However, our model is trained on several small randomly generated graphs and tested on completely different networks, and can obtain results that are very close to the state-of-the-art methods. In addition, our model is trained with a small budget, and it can perform well under various large budgets in the test, showing strong generalization ability. Finally, we conduct extensive experiments on synthetic and realistic datasets, and the experimental results prove the effectiveness and superiority of our model.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v3 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(n^{-1}+ m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $n$ is the local sample size on each worker, $m$ is the number of workers, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(n^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+ \lambda^{1+\alpha} + \phi_\mathcal{S})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature for the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD is positively correlated with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments with VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at \url{https://github.com/Raiden-Zhu/Generalization-of-DSGD}.
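    For readers unfamiliar with D-SGD, one synchronous round looks as follows: each worker takes a local gradient step and then averages with its neighbors through a gossip matrix $W$ whose spectrum determines the $\lambda$ in the bound. This is a generic sketch of vanilla D-SGD, not code from the paper.

        import numpy as np

        def dsgd_round(params, grads, W, lr):
            """One round of decentralized SGD for m workers.

            params: (m, d) array; row i is worker i's current model.
            grads:  (m, d) array of local stochastic gradients.
            W:      (m, m) doubly-stochastic gossip matrix of the topology;
                    1 - lambda is its spectral gap.
            """
            local = params - lr * grads   # local SGD step on every worker
            return W @ local              # neighbor averaging along the topology

        # A ring topology mixes slowly (lambda close to 1), while the complete
        # graph W = np.ones((m, m)) / m recovers centralized mini-batch SGD.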
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v3 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only on the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handling resource constraints in tabular NAS, using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
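    A rough sketch of the rejection-sampling idea follows: sample candidate architectures from the controller, discard infeasible ones before any training, and renormalize the log-probabilities by a Monte-Carlo estimate of the feasibility mass. The single-categorical controller and the callables reward_of and fits_budget are hypothetical simplifications, and the renormalization shown is a crude stand-in for the paper's exact gradient correction.

        import torch

        def controller_loss(logits, reward_of, fits_budget, n_samples=64):
            dist = torch.distributions.Categorical(logits=logits)
            archs = dist.sample((n_samples,))
            feasible = [a for a in archs if fits_budget(a.item())]
            # Monte-Carlo estimate of P(feasible); renormalizing by it targets
            # the policy *conditioned* on satisfying the resource constraint.
            log_p_feas = torch.log(torch.tensor(max(len(feasible) / n_samples, 1e-6)))
            loss = torch.zeros(())
            for a in feasible:  # infeasible samples are never trained or rewarded
                loss = loss - reward_of(a.item()) * (dist.log_prob(a) - log_p_feas)
            return loss / max(len(feasible), 1)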
    Multi-Domain Active Learning: Literature Review and Comparative Study. (arXiv:2106.13516v6 [cs.LG] UPDATED)
    Multi-domain learning (MDL) refers to learning a set of models simultaneously, where each model is specialized to perform a task in a particular domain. Generally, a high labeling effort is required in MDL, as data needs to be labeled by human experts for every domain. Active learning (AL) can be utilized in MDL to reduce the labeling effort by only using the most informative data. The resultant paradigm is termed multi-domain active learning (MDAL). In this work, we provide an exhaustive literature review of the fields relevant to MDAL, including AL, cross-domain information sharing schemes, and cross-domain instance evaluation approaches. It is found that the few studies directly conducted on MDAL cannot serve as off-the-shelf solutions for more general MDAL tasks. To fill this gap, we construct a pipeline for MDAL and present a comprehensive comparative study of thirty different algorithms, established by combining six representative MDL models and five commonly used AL strategies. We evaluate the algorithms on six datasets involving textual and visual classification tasks. In most cases, AL brings notable improvements to MDL, and the naive BvSB (best vs. second best) uncertainty strategy can perform competitively with state-of-the-art AL strategies. Besides, BvSB with the MAN (multinomial adversarial networks) model can consistently achieve top or above-average performance on all the datasets. Furthermore, we qualitatively analyze the behaviors of the well-performing strategies and models, shedding light on their superior performance in the comparison. Finally, we recommend using BvSB with the MAN model in the application of MDAL due to their good performance in the experiments.
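    Since BvSB is the strategy the study ends up recommending, a minimal sketch may be useful; it simply queries the samples with the smallest gap between the two most probable classes (generic active-learning code, not from the paper's pipeline).

        import numpy as np

        def bvsb_margins(probs):
            """probs: (n_samples, n_classes) predicted probabilities.
            Returns best-versus-second-best margins; small = informative."""
            top2 = np.partition(probs, -2, axis=1)[:, -2:]
            return top2[:, 1] - top2[:, 0]

        def select_batch(probs, k):
            return np.argsort(bvsb_margins(probs))[:k]  # query the k smallest margins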
    End-to-end Algorithm Synthesis with Recurrent Networks: Logical Extrapolation Without Overthinking. (arXiv:2202.05826v3 [cs.LG] UPDATED)
    Machine learning systems perform well on pattern matching tasks, but their ability to perform algorithmic or logical reasoning is not well understood. One important reasoning capability is algorithmic extrapolation, in which models trained only on small/simple reasoning problems can synthesize complex strategies for large/complex problems at test time. Algorithmic extrapolation can be achieved through recurrent systems, which can be iterated many times to solve difficult reasoning problems. We observe that this approach fails to scale to highly complex problems because behavior degenerates when many iterations are applied -- an issue we refer to as "overthinking." We propose a recall architecture that keeps an explicit copy of the problem instance in memory so that it cannot be forgotten. We also employ a progressive training routine that prevents the model from learning behaviors that are specific to iteration number and instead pushes it to learn behaviors that can be repeated indefinitely. These innovations prevent the overthinking problem, and enable recurrent systems to solve extremely hard extrapolation tasks.
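    The recall mechanism admits a very small sketch: the recurrent block receives its own state concatenated with the untouched problem instance at every iteration, so the instance cannot be forgotten no matter how many iterations are applied. The convolutional form and layer sizes below are illustrative assumptions.

        import torch
        import torch.nn as nn

        class RecallRecurrentNet(nn.Module):
            def __init__(self, in_ch=3, width=64, out_ch=2):
                super().__init__()
                self.project = nn.Conv2d(in_ch, width, 3, padding=1)
                # the recurrent block sees its own state *and* the raw input
                self.block = nn.Conv2d(width + in_ch, width, 3, padding=1)
                self.head = nn.Conv2d(width, out_ch, 3, padding=1)

            def forward(self, x, n_iters):
                h = torch.relu(self.project(x))
                for _ in range(n_iters):   # may far exceed the training depth
                    h = torch.relu(self.block(torch.cat([h, x], dim=1)))
                return self.head(h)

    Roughly speaking, the progressive training routine then trains on varying, randomly chosen iteration counts, so that no behavior becomes tied to a specific iteration index.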
    Extreme-Long-short Term Memory for Time-series Prediction. (arXiv:2210.08244v1 [cs.LG])
    The emergence of Long Short-Term Memory (LSTM) solves the problems of vanishing and exploding gradients in traditional Recurrent Neural Networks (RNN). LSTM, as a new type of RNN, has been widely used in various fields, such as text prediction, wind speed forecasting, depression prediction from EEG signals, etc. The results show that improving the efficiency of LSTM can help to improve efficiency in other application areas. In this paper, we propose an advanced LSTM algorithm, the Extreme Long Short-Term Memory (E-LSTM), which adds the inverse-matrix part of the Extreme Learning Machine (ELM) as a new "gate" into the structure of LSTM. This "gate" preprocesses a portion of the data and incorporates the processed data into the cell update of the LSTM, so as to obtain more accurate results with fewer training rounds and thus reduce the overall training time. In this research, the E-LSTM model is used for the text prediction task. Experimental results showed that the E-LSTM sometimes takes longer to perform a single training round, but when tested on a small dataset, the new E-LSTM requires only 2 epochs to match the results that the traditional LSTM obtains after 7 epochs. Therefore, the E-LSTM retains the high accuracy of the traditional LSTM while improving the training speed and the overall efficiency of the LSTM.
    A Primal-Dual Algorithm for Hybrid Federated Learning. (arXiv:2210.08106v1 [cs.LG])
    Very few methods exist for hybrid federated learning, where clients hold only subsets of both features and samples. Yet, this scenario is very important in practical settings. We provide a fast, robust algorithm for hybrid federated learning that hinges on Fenchel duality. We prove the convergence of the algorithm to the same solution as if the model were trained centrally, in a variety of practical regimes. Furthermore, we provide experimental results that demonstrate the performance improvements of the algorithm over a commonly used method in federated learning, FedAvg. We also provide privacy considerations and necessary steps to protect client data.
    Linear Scalarization for Byzantine-robust learning on non-IID data. (arXiv:2210.08287v1 [cs.LG])
    In this work we study the problem of Byzantine-robust learning when data among clients is heterogeneous. We focus on poisoning attacks targeting the convergence of SGD. Although this problem has received great attention, the main Byzantine defenses rely on the IID assumption, causing them to fail when the data distribution is non-IID even with no attack. We propose the use of Linear Scalarization (LS) as an enhancing method to enable current defenses to circumvent Byzantine attacks in the non-IID setting. The LS method is based on the incorporation of a trade-off vector that penalizes the suspected malicious clients. Empirical analysis corroborates that the proposed LS variants are viable in the IID setting. For mild to strong non-IID data splits, LS is comparable to or outperforms current approaches under state-of-the-art Byzantine attack scenarios.
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v2 [stat.ML] UPDATED)
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion, and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in its variational parametrisation, combining their superior predictive performance with principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.
    A Multistep Frank-Wolfe Method. (arXiv:2210.08110v1 [math.OC])
    The Frank-Wolfe algorithm has regained much interest for its use in structurally constrained machine learning applications. However, one major limitation of the Frank-Wolfe algorithm is its slow local convergence due to zig-zagging behavior. We view the zig-zagging phenomenon in the Frank-Wolfe method as an artifact of discretization, and propose multistep Frank-Wolfe variants whose truncation errors decay as $O(\Delta^p)$, where $p$ is the method's order. This strategy "stabilizes" the method and allows tools like line search and momentum to be more effective. However, our results suggest that the worst-case convergence rate of Runge-Kutta-type discretization schemes, measured in the iteration count $k$, cannot improve upon that of the vanilla Frank-Wolfe method. Still, we believe that this analysis adds to the growing knowledge of flow analysis for optimization methods, and is a cautionary tale on the ultimate usefulness of multistep methods.
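    As a reference point, the vanilla method the paper compares against takes a single Euler-like step per iteration toward the output of a linear minimization oracle; the multistep variants replace this with higher-order Runge-Kutta-style combinations. Below is the standard algorithm on the probability simplex (textbook code, not the paper's variant).

        import numpy as np

        def frank_wolfe(grad_f, x0, n_iters=100):
            x = x0.copy()
            for k in range(n_iters):
                g = grad_f(x)
                s = np.zeros_like(x)
                s[np.argmin(g)] = 1.0       # LMO on the simplex: best vertex
                gamma = 2.0 / (k + 2.0)     # classic step-size schedule
                x = (1 - gamma) * x + gamma * s
            return x

        # Example: minimize ||Ax - b||^2 over the simplex:
        # frank_wolfe(lambda x: 2 * A.T @ (A @ x - b), np.ones(n) / n)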
    Motion Inspired Unsupervised Perception and Prediction in Autonomous Driving. (arXiv:2210.08061v1 [cs.CV])
    Learning-based perception and prediction modules in modern autonomous driving systems typically rely on expensive human annotation and are designed to perceive only a handful of predefined object categories. This closed-set paradigm is insufficient for the safety-critical autonomous driving task, where the autonomous vehicle needs to process arbitrarily many types of traffic participants and their motion behaviors in a highly dynamic world. To address this difficulty, this paper pioneers a novel and challenging direction, i.e., training perception and prediction models to understand open-set moving objects with no human supervision. Our proposed framework uses self-learned flow to trigger an automated meta-labeling pipeline to achieve automatic supervision. 3D detection experiments on the Waymo Open Dataset show that our method significantly outperforms classical unsupervised approaches and is even competitive with the counterpart that uses supervised scene flow. We further show that our approach generates highly promising results in open-set 3D detection and trajectory prediction, confirming its potential in closing the safety gap to fully supervised systems.
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v2 [stat.ML] UPDATED)
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples and the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone, and thus does not depend on the target function, which need not lie in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore, the width does not need to grow polynomially with the number of samples in order to obtain high-probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control via the low effective rank of the Fisher Information Matrix is still underexplored theoretically.
    Autoinverse: Uncertainty Aware Inversion of Neural Networks. (arXiv:2208.13780v2 [cs.LG] UPDATED)
    Neural networks are powerful surrogates for numerous forward processes. The inversion of such surrogates is extremely valuable in science and engineering. The most important property of a successful neural inverse method is the performance of its solutions when deployed in the real world, i.e., on the native forward process (and not only the learned surrogate). We propose Autoinverse, a highly automated approach for inverting neural network surrogates. Our main insight is to seek inverse solutions in the vicinity of reliable data which have been sampled from the forward process and used for training the surrogate model. Autoinverse finds such solutions by taking into account the predictive uncertainty of the surrogate and minimizing it during the inversion. Apart from high accuracy, Autoinverse enforces the feasibility of solutions, comes with embedded regularization, and is initialization-free. We verify our proposed method by addressing a set of real-world problems in control, fabrication, and design.
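    The core idea lends itself to a short sketch: optimize the input of the surrogate to match a target output while penalizing predictive uncertainty, which keeps the solution near the training data. Here uncertainty is approximated by the variance of a deep ensemble; the ensemble, beta, and optimizer settings are illustrative assumptions rather than the paper's exact formulation.

        import torch

        def invert(ensemble, y_target, x0, steps=500, beta=1.0, lr=1e-2):
            x = x0.clone().requires_grad_(True)
            opt = torch.optim.Adam([x], lr=lr)
            for _ in range(steps):
                preds = torch.stack([f(x) for f in ensemble])  # (n_models, out_dim)
                fit = (preds.mean(0) - y_target).pow(2).sum()  # match the target
                uncertainty = preds.var(0).sum()               # stay near training data
                loss = fit + beta * uncertainty
                opt.zero_grad(); loss.backward(); opt.step()
            return x.detach()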
    Improving Your Graph Neural Networks: A High-Frequency Booster. (arXiv:2210.08251v1 [cs.LG])
    Graph neural networks (GNNs) hold the promise of learning efficient representations of graph-structured data, and one of their most important applications is semi-supervised node classification. However, in this application, GNN frameworks tend to fail due to two issues: over-smoothing and heterophily. The most popular GNNs build on the message-passing framework, and recent research shows that these GNNs are often bounded by low-pass filters from a signal processing perspective. We thus incorporate high-frequency information into GNNs to alleviate this inherent problem. In this paper, we argue that the complement of the original graph incorporates a high-pass filter and propose Complement Laplacian Regularization (CLAR) for an efficient enhancement of high-frequency components. The experimental results demonstrate that CLAR helps GNNs tackle over-smoothing and improves expressiveness on heterophilic graphs, yielding up to a 3.6% improvement over popular baselines, while ensuring topological robustness.
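    The complement-graph construction is easy to write down. The sketch below builds the Laplacian of the complement graph, which acts as a high-pass filter relative to the original graph; the trace form of the regularizer and its weighting are assumptions on our part, not the paper's exact loss (and the dense complement is only practical for small graphs).

        import torch

        def complement_laplacian(adj):
            """adj: (n, n) binary adjacency matrix, no self-loops."""
            n = adj.size(0)
            comp = 1.0 - adj - torch.eye(n)   # complement adjacency
            deg = torch.diag(comp.sum(dim=1))
            return deg - comp

        def clar_term(h, adj, weight=0.1):
            # tr(H^T L_c H): one plausible form of a complement-Laplacian
            # regularizer on the node representations h.
            return weight * torch.trace(h.t() @ complement_laplacian(adj) @ h)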
    Old can be Gold: Better Gradient Flow can Make Vanilla-GCNs Great Again. (arXiv:2210.08122v1 [cs.LG])
    Despite the enormous success of Graph Convolutional Networks (GCNs) in modeling graph-structured data, most current GCNs are shallow due to the notoriously challenging problems of over-smoothing and information squashing, along with the conventional difficulties caused by vanishing gradients and over-fitting. Previous works have primarily focused on the study of over-smoothing and over-squashing phenomena in training deep GCNs. Surprisingly, in comparison with CNNs/RNNs, very limited attention has been given to understanding how healthy gradient flow can benefit the trainability of deep GCNs. In this paper, we first provide a new perspective of gradient flow to understand the substandard performance of deep GCNs and hypothesize that by facilitating healthy gradient flow, we can significantly improve their trainability, as well as achieve state-of-the-art (SOTA) level performance from vanilla-GCNs. Next, we argue that blindly adopting the Glorot initialization for GCNs is not optimal, and derive a topology-aware isometric initialization scheme for vanilla-GCNs based on the principles of isometry. Additionally, contrary to the ad-hoc addition of skip-connections, we propose gradient-guided dynamic rewiring of vanilla-GCNs with skip connections. Our dynamic rewiring method uses the gradient flow within each layer during training to introduce skip-connections adaptively, on demand. We provide extensive empirical evidence across multiple datasets that our methods improve gradient flow in deep vanilla-GCNs and significantly boost their performance to comfortably compete with and outperform many state-of-the-art methods. Codes are available at: https://github.com/VITA-Group/GradientGCN.
    A Transformer-based Network for Deformable Medical Image Registration. (arXiv:2202.12104v3 [eess.IV] UPDATED)
    Deformable medical image registration plays an important role in clinical diagnosis and treatment. Recently, deep learning (DL) based image registration methods have been widely investigated and have shown excellent computational speed. However, these methods cannot provide sufficient registration accuracy because of their limited ability to represent both the global and local features of the moving and fixed images. To address this issue, this paper proposes a transformer-based image registration method. This method uses a distinctive transformer to extract the global and local image features for generating the deformation fields, based on which the registered image is produced in an unsupervised way. Our method improves registration accuracy effectively by means of a self-attention mechanism and bi-level information flow. Experimental results on brain MR image datasets such as LPBA40 and OASIS-1 demonstrate that, compared with several traditional and DL-based registration methods, our method provides higher registration accuracy in terms of Dice values.
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v2 [stat.ML] UPDATED)
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic model - the Variational Auto-Encoder (VAE). We discuss the two generalization gaps that affect VAEs and show that overfitting is usually dominated by amortized inference. Based on this observation, we propose a new training objective that improves the generalization of amortized inference. We demonstrate how our method can improve performance in the context of image modeling and lossless compression.
    Multi-trainer Interactive Reinforcement Learning System. (arXiv:2210.08050v1 [cs.LG])
    Interactive reinforcement learning can effectively facilitate agent training via human feedback. However, such methods often require the human teacher to know what the correct action is that the agent should take; in other words, if the human teacher is not always reliable, it will not be able to consistently guide the agent through its training. In this paper, we propose a more effective interactive reinforcement learning system by introducing multiple trainers, namely Multi-Trainer Interactive Reinforcement Learning (MTIRL), which aggregates the binary feedback from multiple imperfect trainers into a more reliable reward for training an agent in a reward-sparse environment. In particular, our trainer feedback aggregation experiments show that our aggregation method has the best accuracy when compared with majority voting, weighted voting, and the Bayesian method. Finally, we conduct a grid-world experiment to show that the policy trained by MTIRL with the review model is closer to the optimal policy than that without a review model.
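    As a flavor of how binary feedback from imperfect trainers can be aggregated, here is a weighted-vote sketch in which each trainer's vote is weighted by the log-odds of its estimated reliability; this is a generic stand-in, not the paper's exact aggregation model.

        import numpy as np

        def aggregate_feedback(votes, reliability):
            """votes: array of +1/-1 feedback from the trainers.
            reliability: each trainer's estimated probability of being right."""
            log_odds = np.log(reliability / (1.0 - reliability))
            score = np.sum(votes * log_odds)   # quality-weighted vote
            return 1.0 if score > 0 else -1.0

        # aggregate_feedback(np.array([1, 1, -1]), np.array([0.9, 0.6, 0.55])) -> 1.0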
    Model-Free Characterizations of the Hamilton-Jacobi-Bellman Equation and Convex Q-Learning in Continuous Time. (arXiv:2210.08131v1 [math.OC])
    Convex Q-learning is a recent approach to reinforcement learning, motivated by the possibility of a firmer theory for convergence, and the possibility of making use of greater a priori knowledge regarding policy or value function structure. This paper explores algorithm design in the continuous time domain, with finite-horizon optimal control objective. The main contributions are (i) Algorithm design is based on a new Q-ODE, which defines the model-free characterization of the Hamilton-Jacobi-Bellman equation. (ii) The Q-ODE motivates a new formulation of Convex Q-learning that avoids the approximations appearing in prior work. The Bellman error used in the algorithm is defined by filtered measurements, which is beneficial in the presence of measurement noise. (iii) A characterization of boundedness of the constraint region is obtained through a non-trivial extension of recent results from the discrete time setting. (iv) The theory is illustrated in application to resource allocation for distributed energy resources, for which the theory is ideally suited.
    A Simple Convergence Proof of Adam and Adagrad. (arXiv:2003.02395v3 [stat.ML] UPDATED)
    We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper bound which is explicit in the constants of the problem, the parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. However, when used with its default parameters, Adam does not converge; like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy-ball momentum decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.
    Autoencoder based Anomaly Detection and Explained Fault Localization in Industrial Cooling Systems. (arXiv:2210.08011v1 [cs.LG])
    Anomaly detection in large industrial cooling systems is very challenging due to the high data dimensionality, inconsistent sensor recordings, and lack of labels. The state of the art for automated anomaly detection in these systems typically relies on expert knowledge and thresholds; however, data is viewed in isolation, and complex, multivariate relationships are neglected. In this work, we present an autoencoder-based end-to-end workflow for anomaly detection suitable for multivariate time series data in large industrial cooling systems, including explained fault localization and root cause analysis based on expert knowledge. We identify system failures using a threshold on the total reconstruction error (autoencoder reconstruction error including all sensor signals). For fault localization, we compute the individual reconstruction error (autoencoder reconstruction error for each sensor signal), allowing us to identify the signals that contribute most to the total reconstruction error. Expert knowledge is provided via a look-up table, enabling root-cause analysis and assignment to the affected subsystem. We demonstrated our findings on a cooling system unit including 34 sensors over an 8-month period, using 4-fold cross validation and automatically created labels based on thresholds provided by domain experts. We reached an F1-score of 0.56, whereas the autoencoder results showed a higher consistency score (CS of 0.92) compared to the automatically created labels (CS of 0.62) -- indicating that the anomaly is recognized in a very stable manner. The main anomaly was found by both the autoencoder and the automatically created labels, and was also recorded in the log files. Further, the explained fault localization highlighted the most affected component for the main anomaly in a very consistent manner.
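    The detection-plus-localization logic reduces to a few lines. This is a schematic restatement of the total versus individual reconstruction errors described above; the threshold and window handling are simplified.

        import numpy as np

        def detect_and_localize(x, x_hat, threshold):
            """x, x_hat: (n_timesteps, n_sensors) raw and reconstructed signals."""
            per_sensor = (x - x_hat) ** 2        # individual reconstruction error
            total = per_sensor.sum(axis=1)       # total reconstruction error
            anomalous = total > threshold        # flag system failures
            # rank sensors by mean contribution during anomalous periods
            # (assumes at least one timestep was flagged)
            ranking = np.argsort(-per_sensor[anomalous].mean(axis=0))
            return anomalous, ranking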
    No imputation without representation. (arXiv:2206.14254v2 [cs.LG] UPDATED)
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
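    The recommended default is directly available in scikit-learn: SimpleImputer(add_indicator=True) appends one binary indicator column per feature that contains missing values. A minimal sketch follows, with cost-complexity pruning standing in for the tree pruning the authors caution is needed:

        from sklearn.impute import SimpleImputer
        from sklearn.pipeline import make_pipeline
        from sklearn.tree import DecisionTreeClassifier

        clf = make_pipeline(
            # mean imputation plus missing-indicators, the paper's safe default
            SimpleImputer(strategy="mean", add_indicator=True),
            # trees overfit on indicators without pruning; ccp_alpha is one option
            DecisionTreeClassifier(ccp_alpha=1e-3),
        )
        # clf.fit(X_train, y_train); clf.predict(X_test)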
    Multi-Modal Representation Learning with Self-Adaptive Threshold for Commodity Verification. (arXiv:2208.11064v4 [cs.LG] UPDATED)
    In this paper, we propose a method to identify identical commodities. In e-commerce scenarios, commodities are usually described by both images and text. By definition, identical commodities are those that have identical key attributes and are cognitively identical to consumers. There are two main challenges: 1) The extraction and fusion of multi-modal representation. 2) The ability to verify identical commodities by comparing the similarity between representations and a threshold. To address the above problems, we propose an end-to-end multi-modal representation learning method with self-adaptive threshold. We use a dual-stream network to extract multi-modal commodity embeddings and threshold embeddings separately and then concatenate them to obtain commodity representation. Our method is able to adaptively adjust the threshold according to different commodities while maintaining the indexability of the commodity representation space. We experimentally validate the advantages of self-adaptive threshold and the effectiveness of multimodal representation fusion. Besides, our method achieves third place with an F1 score of 0.8936 on the second task of the CCKS-2022 Knowledge Graph Evaluation for Digital Commerce Competition. Code and pretrained models are available at https://github.com/hanchenchen/CCKS2022-track2-solution.
    Training Recurrent Neural Networks by Sequential Least Squares and the Alternating Direction Method of Multipliers. (arXiv:2112.15348v3 [cs.LG] UPDATED)
    This paper proposes a novel algorithm for training recurrent neural network models of nonlinear dynamical systems from an input/output training dataset. Arbitrary convex and twice-differentiable loss functions and regularization terms are handled by sequential least squares and either a line-search (LS) or a trust-region method of Levenberg-Marquardt (LM) type for ensuring convergence. In addition, to handle non-smooth regularization terms such as $\ell_1$, $\ell_0$, and group-Lasso regularizers, as well as to impose possibly non-convex constraints such as integer and mixed-integer constraints, we combine sequential least squares with the alternating direction method of multipliers (ADMM). We call the resulting algorithm NAILS (nonconvex ADMM iterations and least squares) when line search (LS) is used, or NAILM if a trust-region method (LM) is employed instead. The training method, which is also applicable to feedforward neural networks as a special case, is tested on three nonlinear system identification problems.
    IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach. (arXiv:2209.03895v2 [cs.CL] UPDATED)
    In this paper, we describe our participation in subtask 1 of CASE-2022, Event Causality Identification with the Causal News Corpus. We address the Causal Relation Identification (CRI) task by exploiting a set of simple yet complementary techniques for fine-tuning language models (LMs) on a small number of annotated examples (i.e., a few-shot configuration). We follow a prompt-based prediction approach for fine-tuning LMs in which the CRI task is treated as a masked language modeling (MLM) problem. This approach allows LMs natively pre-trained on MLM problems to directly generate textual responses to CRI-specific prompts. We compare the performance of this method against ensemble techniques trained on the entire dataset. Our best-performing submission was fine-tuned with only 256 instances per class, 15.7% of all available data, and yet obtained the second-best precision (0.82), third-best accuracy (0.82), and an F1-score (0.85) very close to that reported by the winning team (0.86).
    CAREER: Transfer Learning for Economic Prediction of Labor Sequence Data. (arXiv:2202.08370v3 [cs.LG] UPDATED)
    Labor economists regularly analyze employment data by fitting predictive models to small, carefully constructed longitudinal survey datasets. Although modern machine learning methods offer promise for such problems, these survey datasets are too small to take advantage of them. In recent years large datasets of online resumes have also become available, providing data about the career trajectories of millions of individuals. However, standard econometric models cannot take advantage of their scale or incorporate them into the analysis of survey data. To this end we develop CAREER, a transformer-based model that uses transfer learning to learn representations of job sequences. CAREER is first fit to large, passively-collected resume data and then fine-tuned to smaller, better-curated datasets for economic inferences. We fit CAREER to a dataset of 24 million job sequences from resumes, and fine-tune its representations on longitudinal survey datasets. We find that CAREER forms accurate predictions of job sequences on three widely-used economics datasets. We further find that CAREER can be used to form good predictions of other downstream variables; incorporating CAREER into a wage model provides better predictions than the econometric models currently in use.
    Product Ranking for Revenue Maximization with Multiple Purchases. (arXiv:2210.08268v1 [cs.LG])
    Product ranking is the core problem for revenue-maximizing online retailers. To design proper product ranking algorithms, various consumer choice models have been proposed to characterize consumers' behaviors when they are provided with a list of products. However, existing works assume that each consumer purchases at most one product, or keeps viewing the product list after purchasing a product, which does not agree with common practice in real scenarios. In this paper, we assume that each consumer can purchase multiple products at will. To model consumers' willingness to view and purchase, we set a random attention span and purchase budget, which determine the maximum number of products that they view and purchase, respectively. Under this setting, we first design an optimal ranking policy for when the online retailer can precisely model consumers' behaviors. Based on this policy, we further develop the Multiple-Purchase-with-Budget UCB (MPB-UCB) algorithms with $\tilde{O}(\sqrt{T})$ regret that estimate consumers' behaviors and maximize revenue simultaneously in online settings. Experiments on both synthetic and semi-synthetic datasets prove the effectiveness of the proposed algorithms.
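    The bandit machinery can be illustrated with a stripped-down optimistic ranking rule: estimate each product's purchase probability, add an exploration bonus, and present products in decreasing order of the optimistic index. This is a simplified stand-in for MPB-UCB, which additionally handles attention spans and purchase budgets.

        import numpy as np

        def optimistic_ranking(purchases, views, t):
            """purchases[i], views[i]: counts for product i after t rounds."""
            mean = purchases / np.maximum(views, 1)
            bonus = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(views, 1))
            return np.argsort(-(mean + bonus))  # rank products by UCB index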
    Differentiable Model Compression via Pseudo Quantization Noise. (arXiv:2104.09987v3 [stat.ML] UPDATED)
    We propose DiffQ, a differentiable method for model compression that quantizes model parameters without gradient approximations (e.g., the Straight-Through Estimator). We suggest adding independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and the number of bits used. Given a single hyper-parameter balancing between the quantized model size and accuracy, DiffQ optimizes the number of bits used per individual weight or groups of weights, in end-to-end training. We experimentally verify that our method is competitive with STE-based quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12-layer transformer-based model by more than a factor of 8 (less than 4 bits of precision per weight on average), with a loss of only 0.3% in model accuracy. Code is available at github.com/facebookresearch/diffq.
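    The pseudo quantization noise itself is nearly a one-liner: during training, each weight is perturbed by uniform noise whose width matches the quantization step, which is differentiable in the weights (and in the bit-width, when the bit-width is relaxed to a soft parameter). This sketch fixes the bit-width for simplicity; DiffQ learns it.

        import torch

        def pseudo_quant_noise(w, bits):
            scale = w.detach().abs().max()
            delta = 2 * scale / (2 ** bits - 1)          # quantization step size
            noise = (torch.rand_like(w) - 0.5) * delta   # Uniform(-delta/2, delta/2)
            return w + noise                             # used in training passes

        # At inference the weights are actually rounded to the learned bit-width,
        # e.g. torch.round(w / delta) * delta with the same delta.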
    Symmetry Teleportation for Accelerated Optimization. (arXiv:2205.10637v2 [cs.LG] UPDATED)
    Existing gradient-based optimization methods update parameters locally, in a direction that minimizes the loss function. We study a different approach, symmetry teleportation, that allows parameters to travel a large distance on the loss level set, in order to improve the convergence speed in subsequent steps. Teleportation exploits symmetries in the loss landscape of optimization problems. We derive loss-invariant group actions for test functions in optimization and multi-layer neural networks, and prove a necessary condition for teleportation to improve convergence rate. We also show that our algorithm is closely related to second order methods. Experimentally, we show that teleportation improves the convergence speed of gradient descent and AdaGrad for several optimization problems including test functions, multi-layer regressions, and MNIST classification.
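    The loss-invariant group action is easy to verify numerically for a two-layer linear network $f(x) = W_2 W_1 x$: acting with any invertible $g$ as $(W_1, W_2) \mapsto (g W_1, W_2 g^{-1})$ leaves the function, and hence the loss, unchanged. Teleportation then searches over $g$ for a point on this level set with a larger gradient norm. A small numerical check (not the paper's code):

        import torch

        def teleport(W1, W2, g):
            return g @ W1, W2 @ torch.linalg.inv(g)

        W1, W2 = torch.randn(16, 8), torch.randn(4, 16)
        g = torch.eye(16) + 0.1 * torch.randn(16, 16)  # group element near identity
        V1, V2 = teleport(W1, W2, g)
        x = torch.randn(8, 32)
        # same function on every input, so the loss level set is preserved
        print(torch.allclose(W2 @ W1 @ x, V2 @ V1 @ x, atol=1e-3, rtol=1e-3))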
    Model Preserving Compression for Neural Networks. (arXiv:2108.00065v2 [cs.LG] UPDATED)
    After training complex deep learning models, a common task is to compress the model to reduce compute and storage demands. When compressing, it is desirable to preserve the original model's per-example decisions (e.g., to go beyond top-1 accuracy or preserve robustness), maintain the network's structure, automatically determine per-layer compression levels, and eliminate the need for fine tuning. No existing compression methods simultaneously satisfy these criteria -- we introduce a principled approach that does by leveraging interpolative decompositions. Our approach simultaneously selects and eliminates channels (analogously, neurons), then constructs an interpolation matrix that propagates a correction into the next layer, preserving the network's structure. Consequently, our method achieves good performance even without fine tuning and admits theoretical analysis. Our theoretical generalization bound for a one-layer network lends itself naturally to a heuristic that allows our method to automatically choose per-layer sizes for deep networks. We demonstrate the efficacy of our approach with strong empirical performance on a variety of tasks, models, and datasets -- from simple one-hidden-layer networks to deep networks on ImageNet.
    Efficient Active Learning with Abstention. (arXiv:2204.00043v2 [stat.ML] UPDATED)
    The goal of active learning is to achieve the same accuracy achievable by passive learning, while using much fewer labels. Exponential savings in terms of label complexity have been proved in very special cases, but fundamental lower bounds show that such improvements are impossible in general. This suggests a need to explore alternative goals for active learning. Learning with abstention is one such alternative. In this setting, the active learning algorithm may abstain from prediction and incur an error that is marginally smaller than random guessing. We develop the first computationally efficient active learning algorithm with abstention. Our algorithm provably achieves $\mathsf{polylog}(\frac{1}{\varepsilon})$ label complexity, without any low noise conditions. Such performance guarantee reduces the label complexity by an exponential factor, relative to passive learning and active learning that is not allowed to abstain. Furthermore, our algorithm is guaranteed to only abstain on hard examples (where the true label distribution is close to a fair coin), a novel property we term \emph{proper abstention} that also leads to a host of other desirable characteristics (e.g., recovering minimax guarantees in the standard setting, and avoiding the undesirable ``noise-seeking'' behavior often seen in active learning). We also provide novel extensions of our algorithm that achieve \emph{constant} label complexity and deal with model misspecification.
    Differentiable Hybrid Traffic Simulation. (arXiv:2210.08046v1 [cs.GR])
    We introduce a novel differentiable hybrid traffic simulator, which simulates traffic using a hybrid model of both macroscopic and microscopic models and can be directly integrated into a neural network for traffic control and flow optimization. This is the first differentiable traffic simulator for macroscopic and hybrid models that can compute gradients for traffic states across time steps and inhomogeneous lanes. To compute the gradient flow between two types of traffic models in a hybrid framework, we present a novel intermediate conversion component that bridges the lanes in a differentiable manner as well. We also show that we can use analytical gradients to accelerate the overall process and enhance scalability. Thanks to these gradients, our simulator can provide more efficient and scalable solutions for complex learning and control problems posed in traffic engineering than other existing algorithms. Refer to https://sites.google.com/umd.edu/diff-hybrid-traffic-sim for our project.
    ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions. (arXiv:2106.12231v2 [stat.ML] UPDATED)
    We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.
    A Survey on Knowledge Graph-based Methods for Automated Driving. (arXiv:2210.08119v1 [cs.RO])
    Automated driving is one of the most active research areas in computer science. Deep learning methods have made remarkable breakthroughs in machine learning in general and in automated driving (AD) in particular. However, there are still unsolved problems in guaranteeing the reliability and safety of automated systems, especially in effectively incorporating all available information and knowledge into the driving task. Knowledge graphs (KGs) have recently gained significant attention from both industry and academia for applications that benefit from exploiting structured, dynamic, and relational data. The complexity of graph-structured data, with complex relationships and inter-dependencies between objects, has posed significant challenges to existing machine learning algorithms. However, recent progress in knowledge graph embeddings and graph neural networks allows applying machine learning to graph-structured data. Therefore, we motivate and discuss the potential benefits of KGs applied to the main tasks of AD, including 1) ontologies, 2) perception, 3) scene understanding, 4) motion planning, and 5) validation. We then survey, analyze, and categorize ontologies and KG-based approaches for AD. We discuss current research challenges and propose promising future research directions for KG-based solutions for AD.
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v2 [cs.LG] UPDATED)
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms can take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality and (2) apply techniques from online learning to learn predictors, tune robustness-consistency trade-offs, and bound the sample complexity. We demonstrate the effectiveness of our approach by applying it to bipartite matching, ski-rental, page migration, and job scheduling. In several settings we improve upon multiple existing results with a much simpler analysis, while in the others we provide the first learning-theoretic guarantees.
    Code Recommendation for Open Source Software Developers. (arXiv:2210.08332v1 [cs.SE])
    Open Source Software (OSS) is forming the spines of technology infrastructures, attracting millions of talents to contribute. Notably, it is challenging and critical to consider both the developers' interests and the semantic features of the project code to recommend appropriate development tasks to OSS developers. In this paper, we formulate the novel problem of code recommendation, whose purpose is to predict the future contribution behaviors of developers given their interaction history, the semantic features of source code, and the hierarchical file structures of projects. Considering the complex interactions among multiple parties within the system, we propose CODER, a novel graph-based code recommendation framework for open source software developers. CODER jointly models microscopic user-code interactions and macroscopic user-project interactions via a heterogeneous graph and further bridges the two levels of information through aggregation on file-structure graphs that reflect the project hierarchy. Moreover, due to the lack of reliable benchmarks, we construct three large-scale datasets to facilitate future research in this direction. Extensive experiments show that our CODER framework achieves superior performance under various experimental settings, including intra-project, cross-project, and cold-start recommendation. We will release all the datasets, code, and utilities for data retrieval upon the acceptance of this work.
    Almost Optimal Batch-Regret Tradeoff for Batch Linear Contextual Bandits. (arXiv:2110.08057v3 [cs.LG] UPDATED)
    We study the optimal batch-regret tradeoff for batch linear contextual bandits. For any batch number $M$, number of actions $K$, time horizon $T$, and dimension $d$, we provide an algorithm and prove its regret guarantee, which, due to technical reasons, features a two-phase expression as the time horizon $T$ grows. We also prove a lower bound theorem that surprisingly shows the optimality of our two-phase regret upper bound (up to logarithmic factors) in the \emph{full range} of the problem parameters, thereby establishing the exact batch-regret tradeoff. Compared to the recent work \citep{ruan2020linear}, which showed that $M = O(\log \log T)$ batches suffice to achieve the asymptotically minimax-optimal regret without the batch constraints, our algorithm is simpler and easier to implement in practice. Furthermore, our algorithm achieves the optimal regret for all $T \geq d$, while \citep{ruan2020linear} requires that $T$ be greater than an unrealistically large polynomial of $d$. Along our analysis, we also prove a new matrix concentration inequality with dependence on dynamic upper bounds, which, to the best of our knowledge, is the first of its kind in the literature and may be of independent interest.
    Domain Adaptation under Open Set Label Shift. (arXiv:2207.13048v2 [cs.LG] UPDATED)
    We introduce the problem of domain adaptation under Open Set Label Shift (OSLS) where the label distribution can change arbitrarily and a new class may arrive during deployment, but the class-conditional distributions p(x|y) are domain-invariant. OSLS subsumes domain adaptation under label shift and Positive-Unlabeled (PU) learning. The learner's goals here are two-fold: (a) estimate the target label distribution, including the novel class; and (b) learn a target classifier. First, we establish necessary and sufficient conditions for identifying these quantities. Second, motivated by advances in label shift and PU learning, we propose practical methods for both tasks that leverage black-box predictors. Unlike typical Open Set Domain Adaptation (OSDA) problems, which tend to be ill-posed and amenable only to heuristics, OSLS offers a well-posed problem amenable to more principled machinery. Experiments across numerous semi-synthetic benchmarks on vision, language, and medical datasets demonstrate that our methods consistently outperform OSDA baselines, achieving 10--25% improvements in target domain accuracy. Finally, we analyze the proposed methods, establishing finite-sample convergence to the true label marginal and convergence to optimal classifier for linear models in a Gaussian setup. Code is available at https://github.com/acmi-lab/Open-Set-Label-Shift.
    Domain Adaptation meets Individual Fairness. And they get along. (arXiv:2205.00504v2 [stat.ML] UPDATED)
    Many instances of algorithmic bias are caused by distributional shifts. For example, machine learning (ML) models often perform worse on demographic groups that are underrepresented in the training data. In this paper, we leverage this connection between algorithmic fairness and distribution shifts to show that algorithmic fairness interventions can help ML models overcome distribution shifts, and that domain adaptation methods (for overcoming distribution shifts) can mitigate algorithmic biases. In particular, we show that (i) enforcing suitable notions of individual fairness (IF) can improve the out-of-distribution accuracy of ML models under the covariate shift assumption and that (ii) it is possible to adapt representation alignment methods for domain adaptation to enforce individual fairness. The former is unexpected because IF interventions were not developed with distribution shifts in mind. The latter is also unexpected because representation alignment is not a common approach in the individual fairness literature.
    Learned Video Compression for YUV 4:2:0 Content Using Flow-based Conditional Inter-frame Coding. (arXiv:2210.08225v1 [eess.IV])
    This paper proposes a learning-based video compression framework for variable-rate coding of YUV 4:2:0 content. Most existing learning-based video compression models adopt the traditional hybrid coding architecture, which involves temporal prediction followed by residual coding. However, recent studies have shown that residual coding is sub-optimal from an information-theoretic perspective. In addition, most existing models are optimized with respect to RGB content, and they require separate models for variable-rate coding. To address these issues, this work incorporates conditional inter-frame coding for YUV 4:2:0 content. We introduce a conditional flow-based inter-frame coder to improve inter-frame coding efficiency. To adapt our codec to YUV 4:2:0 content, we adopt a simple strategy of using space-to-depth and depth-to-space conversions. Lastly, we employ a rate-adaption net to achieve variable-rate coding without training multiple models. Experimental results show that our model performs better than x265 on the UVG and MCL-JCV datasets in terms of PSNR-YUV. However, on the more challenging datasets from ISCAS'22 GC, there is still ample room for improvement. This insufficient performance is due to the lack of inter-frame coding capability at large GOP sizes, and can be mitigated by increasing the model capacity and applying an error-propagation-aware training strategy.
    SS-BERT: Mitigating Identity Terms Bias in Toxic Comment Classification by Utilising the Notion of "Subjectivity" and "Identity Terms". (arXiv:2109.02691v1 [cs.CL] CROSS LISTED)
    Toxic comment classification models are often found biased toward identity terms which are terms characterizing a specific group of people such as "Muslim" and "black". Such bias is commonly reflected in false-positive predictions, i.e. non-toxic comments with identity terms. In this work, we propose a novel approach to tackle such bias in toxic comment classification, leveraging the notion of subjectivity level of a comment and the presence of identity terms. We hypothesize that when a comment is made about a group of people that is characterized by an identity term, the likelihood of that comment being toxic is associated with the subjectivity level of the comment, i.e. the extent to which the comment conveys personal feelings and opinions. Building upon the BERT model, we propose a new structure that is able to leverage these features, and thoroughly evaluate our model on 4 datasets of varying sizes and representing different social media platforms. The results show that our model can consistently outperform BERT and a SOTA model devised to address identity term bias in a different way, with a maximum improvement in F1 of 2.43% and 1.91% respectively.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v3 [stat.ML] UPDATED)
    The classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. While previous work focuses on the implicit regularization induced by stochastic gradient descent (SGD), we study here how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces another form of implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, lower bounds for SGD to remain in these stagnation sets are derived. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima.
    D.MCA: Outlier Detection with Explicit Micro-Cluster Assignments. (arXiv:2210.08212v1 [cs.LG])
    How can we detect outliers, both scattered and clustered, and also explicitly assign them to respective micro-clusters, without knowing a priori how many micro-clusters exist? How can we perform both tasks in-house, i.e., without any post-hoc processing, so that both detection and assignment can benefit from each other? Presenting outliers in separate micro-clusters is informative to analysts in many real-world applications. However, a na\"ive solution based on post-hoc clustering of the outliers detected by any existing method suffers from two main drawbacks: (a) appropriate hyperparameter values are commonly unknown for clustering, and most algorithms struggle with clusters of varying shapes and densities; (b) detection and assignment cannot benefit from one another. In this paper, we propose D.MCA to $\underline{D}$etect outliers with explicit $\underline{M}$icro-$\underline{C}$luster $\underline{A}$ssignment. Our method performs both detection and assignment iteratively, and in-house, by using a novel strategy that prunes entire micro-clusters out of the training set to improve the performance of the detection. It also benefits from a novel strategy that prevents clustered outliers from masking each other -- a well-known problem in the literature. Also, D.MCA is designed to be robust to a critical hyperparameter by employing a hyperensemble "warm up" phase. Experiments performed on 16 real-world and synthetic datasets demonstrate that D.MCA outperforms 8 state-of-the-art competitors, especially on the explicit outlier micro-cluster assignment task.
    Unveiling the Sampling Density in Non-Uniform Geometric Graphs. (arXiv:2210.08219v1 [cs.LG])
    A powerful framework for studying graphs is to consider them as geometric graphs: nodes are randomly sampled from an underlying metric space, and any pair of nodes is connected if their distance is less than a specified neighborhood radius. Currently, the literature mostly focuses on uniform sampling and constant neighborhood radius. However, real-world graphs are likely to be better represented by a model in which the sampling density and the neighborhood radius can both vary over the latent space. For instance, in a social network communities can be modeled as densely sampled areas, and hubs as nodes with larger neighborhood radius. In this work, we first perform a rigorous mathematical analysis of this (more general) class of models, including derivations of the resulting graph shift operators. The key insight is that graph shift operators should be corrected in order to avoid potential distortions introduced by the non-uniform sampling. Then, we develop methods to estimate the unknown sampling density in a self-supervised fashion. Finally, we present exemplary applications in which the learnt density is used to 1) correct the graph shift operator and improve performance on a variety of tasks, 2) improve pooling, and 3) extract knowledge from networks. Our experimental findings support our theory and provide strong evidence for our model.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v2 [cs.LG] UPDATED)
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To our best knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    PhysGNN: A Physics-Driven Graph Neural Network Based Model for Predicting Soft Tissue Deformation in Image-Guided Neurosurgery. (arXiv:2109.04352v3 [eess.IV] UPDATED)
    Correctly capturing intraoperative brain shift in image-guided neurosurgical procedures is a critical task for aligning preoperative data with intraoperative geometry and ensuring accurate surgical navigation. While the finite element method (FEM) is a proven technique for effectively approximating soft tissue deformation through biomechanical formulations, its degree of success boils down to a trade-off between accuracy and speed. To circumvent this problem, the most recent works in this domain have proposed leveraging data-driven models obtained by training various machine learning algorithms -- e.g., random forests, artificial neural networks (ANNs) -- with the results of finite element analysis (FEA) to speed up tissue deformation approximations by prediction. These methods, however, do not account for the structure of the finite element (FE) mesh during training, which provides information on node connectivity as well as the distances between nodes and can aid in approximating tissue deformation based on the proximity of force-load points to the rest of the mesh nodes. Therefore, this work proposes PhysGNN, a novel data-driven model that approximates the solution of the FEM by leveraging graph neural networks (GNNs), which are capable of accounting for the mesh structural information and of inductive learning over unstructured grids and complex topological structures. Empirically, we demonstrate that the proposed architecture, PhysGNN, delivers accurate and fast soft tissue deformation approximations, is competitive with state-of-the-art (SOTA) algorithms, and offers enhanced computational feasibility, making it suitable for neurosurgical settings.
    BagFlip: A Certified Defense against Data Poisoning. (arXiv:2205.13634v2 [cs.LG] UPDATED)
    Machine learning models are vulnerable to data-poisoning attacks, in which an attacker maliciously modifies the training set to change the prediction of a learned model. In a trigger-less attack, the attacker can modify the training set but not the test inputs, while in a backdoor attack the attacker can also modify test inputs. Existing model-agnostic defense approaches either cannot handle backdoor attacks or do not provide effective certificates (i.e., a proof of a defense). We present BagFlip, a model-agnostic certified approach that can effectively defend against both trigger-less and backdoor attacks. We evaluate BagFlip on image classification and malware detection datasets. BagFlip is equal to or more effective than the state-of-the-art approaches for trigger-less attacks and more effective than the state-of-the-art approaches for backdoor attacks.
    Receding Horizon Inverse Reinforcement Learning. (arXiv:2206.04477v2 [cs.LG] UPDATED)
    Inverse reinforcement learning (IRL) seeks to infer a cost function that explains the underlying goals and preferences of expert demonstrations. This paper presents receding horizon inverse reinforcement learning (RHIRL), a new IRL algorithm for high-dimensional, noisy, continuous systems with black-box dynamic models. RHIRL addresses two key challenges of IRL: scalability and robustness. To handle high-dimensional continuous systems, RHIRL matches the induced optimal trajectories with expert demonstrations locally in a receding horizon manner and 'stitches' together the local solutions to learn the cost; it thereby avoids the 'curse of dimensionality'. This contrasts sharply with earlier algorithms that match with expert demonstrations globally over the entire high-dimensional state space. To be robust against imperfect expert demonstrations and control noise, RHIRL learns a state-dependent cost function 'disentangled' from system dynamics under mild conditions. Experiments on benchmark tasks show that RHIRL outperforms several leading IRL algorithms in most instances. We also prove that the cumulative error of RHIRL grows linearly with the task duration.
    How Does Pseudo-Labeling Affect the Generalization Error of the Semi-Supervised Gibbs Algorithm?. (arXiv:2210.08188v1 [cs.IT])
    This paper provides an exact characterization of the expected generalization error (gen-error) for semi-supervised learning (SSL) with pseudo-labeling via the Gibbs algorithm. This characterization is expressed in terms of the symmetrized KL information between the output hypothesis, the pseudo-labeled dataset, and the labeled dataset. It can be applied to obtain distribution-free upper and lower bounds on the gen-error. Our findings offer the new insight that the generalization performance of SSL with pseudo-labeling is affected not only by the information between the output hypothesis and the input training data but also by the information \emph{shared} between the \emph{labeled} and \emph{pseudo-labeled} data samples. To deepen our understanding, we further explore two examples -- mean estimation and logistic regression. In particular, we analyze how the ratio of the number of unlabeled to labeled data, $\lambda$, affects the gen-error in both scenarios. As $\lambda$ increases, the gen-error for mean estimation decreases and then saturates at a value larger than when all the samples are labeled; the gap can be quantified \emph{exactly} with our analysis and depends on the \emph{cross-covariance} between the labeled and pseudo-labeled data samples. In logistic regression, the gen-error and the variance component of the excess risk also decrease as $\lambda$ increases.
    A Scalable Reinforcement Learning Approach for Attack Allocation in Swarm to Swarm Engagement Problems. (arXiv:2210.08319v1 [cs.RO])
    In this work we propose a reinforcement learning (RL) framework that controls the density of a large-scale swarm for engaging with adversarial swarm attacks. Although there is a significant amount of existing work on applying artificial intelligence methods to swarm control, the analysis of interactions between two adversarial swarms remains a rather understudied area. Most existing work on this subject develops strategies by making restrictive assumptions about the strategy and dynamics of the adversarial swarm. Our main contribution is the formulation of the swarm-to-swarm engagement problem as a Markov Decision Process and the development of RL algorithms that can compute engagement strategies without knowledge of the strategy/dynamics of the adversarial swarm. Simulation results show that the developed framework can handle a wide array of large-scale engagement scenarios in an efficient manner.
    When to Trust Your Simulator: Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning. (arXiv:2206.13464v2 [cs.LG] UPDATED)
    Learning effective reinforcement learning (RL) policies to solve real-world complex tasks can be quite challenging without a high-fidelity simulation environment. In most cases, we are only given imperfect simulators with simplified dynamics, which inevitably lead to severe sim-to-real gaps in RL policy learning. The recently emerged field of offline RL provides another possibility to learn policies directly from pre-collected historical data. However, to achieve reasonable performance, existing offline RL algorithms need impractically large offline data with sufficient state-action space coverage for training. This brings up a new question: is it possible to combine learning from limited real data in offline RL and unrestricted exploration through imperfect simulators in online RL to address the drawbacks of both approaches? In this study, we propose the Dynamics-Aware Hybrid Offline-and-Online Reinforcement Learning (H2O) framework to provide an affirmative answer to this question. H2O introduces a dynamics-aware policy evaluation scheme, which adaptively penalizes the Q function learning on simulated state-action pairs with large dynamics gaps, while also simultaneously allowing learning from a fixed real-world dataset. Through extensive simulation and real-world tasks, as well as theoretical analysis, we demonstrate the superior performance of H2O against other cross-domain online and offline RL algorithms. H2O provides a brand new hybrid offline-and-online RL paradigm, which can potentially shed light on future RL algorithm design for solving practical real-world tasks.
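    To make the dynamics-aware penalization concrete, here is a minimal caricature in Python (not the authors' implementation): standard TD learning on real transitions plus a penalty on simulated state-action pairs weighted by an assumed dynamics-gap estimate u(s, a) in [0, 1]; the helper names and the exact penalty form are illustrative assumptions.

        import torch

        def h2o_style_q_loss(q_net, target_q, real_batch, sim_batch, dynamics_gap, gamma=0.99):
            # TD loss on the fixed real-world dataset
            s, a, r, s2 = real_batch
            with torch.no_grad():
                y = r + gamma * target_q(s2).max(dim=1).values
            td = (q_net(s).gather(1, a[:, None]).squeeze(1) - y).pow(2).mean()
            # penalize Q-values on simulated pairs, more strongly where the
            # estimated sim-to-real dynamics gap u(s, a) is large
            ss, aa = sim_batch
            u = dynamics_gap(ss, aa)
            penalty = (u * q_net(ss).gather(1, aa[:, None]).squeeze(1)).mean()
            return td + penalty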
    Handling missing values in healthcare data: A systematic review of deep learning-based imputation techniques. (arXiv:2210.08258v1 [cs.LG])
    Objective: The proper handling of missing values is critical to delivering reliable estimates and decisions, especially in high-stakes fields such as clinical research. The increasing diversity and complexity of data have led many researchers to develop deep learning (DL)-based imputation techniques. We conducted a systematic review to evaluate the use of these techniques, with a particular focus on data types, aiming to assist healthcare researchers from various disciplines in dealing with missing values. Methods: We searched five databases (MEDLINE, Web of Science, Embase, CINAHL, and Scopus) for articles published prior to August 2021 that applied DL-based models to imputation. We assessed selected publications from four perspectives: health data types, model backbone (i.e., main architecture), imputation strategies, and comparison with non-DL-based methods. Based on data types, we created an evidence map to illustrate the adoption of DL models. Results: We included 64 articles, of which static tabular data (26.6%, 17/64) and temporal data (37.5%, 24/64) were the most frequently investigated. We found that model backbones differed among data types, as did the imputation strategies. The "integrated" strategy, in which the imputation task is solved concurrently with downstream tasks, was popular for tabular temporal (50%, 12/24) and multi-modal data (71.4%, 5/7), but limited for other data types. Moreover, DL-based imputation methods yielded better imputation accuracy than non-DL-based methods in most studies. Conclusion: DL-based imputation models can be customized based on data type to address the corresponding missing patterns, and their associated "integrated" strategy can enhance the efficacy of imputation, especially when the data are complex. Future research may focus on the portability and fairness of DL-based models for healthcare data imputation.
    Diffusion Models: A Comprehensive Survey of Methods and Applications. (arXiv:2209.00796v8 [cs.LG] UPDATED)
    Diffusion models have emerged as a powerful new family of deep generative models with record-breaking performance in many applications, including image synthesis, video generation, and molecule design. In this survey, we provide an overview of the rapidly expanding body of work on diffusion models, categorizing the research into three key areas: efficient sampling, improved likelihood estimation, and handling data with special structures. We also discuss the potential for combining diffusion models with other generative models for enhanced results. We further review the wide-ranging applications of diffusion models in fields spanning from computer vision, natural language processing, temporal data modeling, to interdisciplinary applications in other scientific disciplines. This survey aims to provide a contextualized, in-depth look at the state of diffusion models, identifying the key areas of focus and pointing to potential areas for further exploration. Github: https://github.com/YangLing0818/Diffusion-Models-Papers-Survey-Taxonomy.
    ProtoVAE: A Trustworthy Self-Explainable Prototypical Variational Model. (arXiv:2210.08151v1 [cs.LG])
    The need for interpretable models has fostered the development of self-explainable classifiers. Prior approaches are either based on multi-stage optimization schemes, impacting the predictive performance of the model, or produce explanations that are not transparent or trustworthy, or that do not capture the diversity of the data. To address these shortcomings, we propose ProtoVAE, a variational autoencoder-based framework that learns class-specific prototypes in an end-to-end manner and enforces trustworthiness and diversity by regularizing the representation space and introducing an orthonormality constraint. Finally, the model is designed to be transparent by directly incorporating the prototypes into the decision process. Extensive comparisons with previous self-explainable approaches demonstrate the superiority of ProtoVAE, highlighting its ability to generate trustworthy and diverse explanations while not degrading predictive performance.
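    A minimal sketch of one ingredient, the orthonormality regularization that keeps class-specific prototypes diverse, is given below; the exact formulation in the paper may differ, so this function is only a generic stand-in.

        import torch

        def orthonormality_loss(prototypes):
            """Penalize deviation of each class's (mean-centered) prototype Gram
            matrix from the identity, pushing prototypes towards orthonormality."""
            loss = 0.0
            for P in prototypes:                  # P: (n_protos, latent_dim) per class
                P = P - P.mean(dim=0, keepdim=True)
                G = P @ P.t()
                loss = loss + (G - torch.eye(P.shape[0])).pow(2).sum()
            return loss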
    Injecting Domain Knowledge from Empirical Interatomic Potentials to Neural Networks for Predicting Material Properties. (arXiv:2210.08047v1 [cs.LG])
    For decades, atomistic modeling has played a crucial role in predicting the behavior of materials in numerous fields ranging from nanotechnology to drug discovery. The most accurate methods in this domain are rooted in first-principles quantum mechanical calculations such as density functional theory (DFT). Because these methods have remained computationally prohibitive, practitioners have traditionally focused on defining physically motivated closed-form expressions known as empirical interatomic potentials (EIPs) that approximately model the interactions between atoms in materials. In recent years, neural network (NN)-based potentials trained on quantum mechanical (DFT-labeled) data have emerged as a more accurate alternative to conventional EIPs. However, the generalizability of these models relies heavily on the amount of labeled training data, which is often still insufficient to generate models suitable for general-purpose applications. In this paper, we propose two generic strategies that take advantage of unlabeled training instances to inject domain knowledge from conventional EIPs to NNs in order to increase their generalizability. The first strategy, based on weakly supervised learning, trains an auxiliary classifier on EIPs and selects the best-performing EIP to generate energies to supplement the ground-truth DFT energies in training the NN. The second strategy, based on transfer learning, first pretrains the NN on a large set of easily obtainable EIP energies, and then fine-tunes it on ground-truth DFT energies. Experimental results on three benchmark datasets demonstrate that the first strategy improves baseline NN performance by 5% to 51% while the second improves baseline performance by up to 55%. Combining them further boosts performance.
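    The second (transfer learning) strategy reduces to a familiar pretrain-then-finetune loop; the sketch below is illustrative, with loaders, epoch counts, and the reduced fine-tuning learning rate assumed rather than taken from the paper.

        import torch
        from torch import nn

        def pretrain_then_finetune(model, eip_loader, dft_loader, lr=1e-3):
            """Pretrain on cheap EIP-labeled energies, then fine-tune on scarce
            DFT ground truth. Loaders yield (structure_features, energy) pairs."""
            loss_fn = nn.MSELoss()
            stages = [(eip_loader, 50, lr),        # large, cheaply labeled set
                      (dft_loader, 20, lr / 10)]   # small, accurately labeled set
            for loader, epochs, stage_lr in stages:
                opt = torch.optim.Adam(model.parameters(), lr=stage_lr)
                for _ in range(epochs):
                    for x, e in loader:
                        opt.zero_grad()
                        loss_fn(model(x).squeeze(-1), e).backward()
                        opt.step()
            return model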
    Neural Attentive Circuits. (arXiv:2210.08031v1 [cs.LG])
    Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on the CIFAR and CUB datasets by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing, and text classification from ASCII bytes, thereby confirming their general purpose nature.
    Towards a Fully Autonomous UAV Controller for Moving Platform Detection and Landing. (arXiv:2210.08120v1 [cs.RO])
    While Unmanned Aerial Vehicles (UAVs) are increasingly deployed in several missions, their inability to land autonomously in a reliable and consistent manner poses a major setback for deploying such systems truly autonomously. In this paper we present an autonomous UAV landing system for landing on a moving platform. In contrast to existing attempts, the proposed system relies only on the camera sensor and has been designed to be as lightweight as possible. It can be deployed on a low-power platform as part of the drone payload, without depending on external communication or any other sensors. The system relies on a Neural Network (NN) based controller, for which a target- and environment-agnostic simulator was created and used for training and testing, via Reinforcement Learning (RL) and Proximal Policy Optimization (PPO), to optimally control and steer the drone towards landing on the target. In real-world testing, the system achieved an average deviation of 15 cm from the center of the target over 40 landing attempts.
    Feature Distribution Matching for Federated Domain Generalization. (arXiv:2203.11635v3 [cs.LG] UPDATED)
    Multi-source domain adaptation has been intensively studied. The distribution shift in features inherent to specific domains causes the negative transfer problem, degrading a model's generality to unseen tasks. In Federated Learning (FL), learned model parameters are shared to train a global model that leverages the underlying knowledge across client models trained on separate data domains. Nonetheless, the data confidentiality of FL hinders the effectiveness of traditional domain adaptation methods that require prior knowledge of different domain data. We propose a new federated domain generalization method called Federated Knowledge Alignment (FedKA). FedKA leverages feature distribution matching in a global workspace such that the global model can learn domain-invariant client features under the constraint of unknown client data. FedKA employs a federated voting mechanism that generates target domain pseudo-labels based on the consensus from clients to facilitate global model fine-tuning. We performed extensive experiments, including an ablation study, to evaluate the effectiveness of the proposed method in both image and text classification tasks using different model architectures. The empirical results show that FedKA achieves performance gains of 8.8% and 3.5% in Digit-Five and Office-Caltech10, respectively, and a gain of 0.7% in Amazon Review with extremely limited training data. Moreover, we studied the effectiveness of FedKA in alleviating the negative transfer of FL based on a new criterion called Group Effect. The results show that FedKA can reduce negative transfer, improving the performance gain via model aggregation by 4 times.
    Inductive Logical Query Answering in Knowledge Graphs. (arXiv:2210.08008v1 [cs.AI])
    Formulating and answering logical queries is a standard communication interface for knowledge graphs (KGs). Alleviating the notorious incompleteness of real-world KGs, neural methods achieved impressive results in link prediction and complex query answering tasks by learning representations of entities, relations, and queries. Still, most existing query answering methods rely on transductive entity embeddings and cannot generalize to KGs containing new entities without retraining the entity embeddings. In this work, we study the inductive query answering task where inference is performed on a graph containing new entities with queries over both seen and unseen entities. To this end, we devise two mechanisms leveraging inductive node and relational structure representations powered by graph neural networks (GNNs). Experimentally, we show that inductive models are able to perform logical reasoning at inference time over unseen nodes generalizing to graphs up to 500% larger than training ones. Exploring the efficiency--effectiveness trade-off, we find the inductive relational structure representation method generally achieves higher performance, while the inductive node representation method is able to answer complex queries in the inference-only regime without any training on queries and scales to graphs of millions of nodes. Code is available at https://github.com/DeepGraphLearning/InductiveQE.
    Multi-class Classification from Multiple Unlabeled Datasets with Partial Risk Regularization. (arXiv:2207.01555v2 [cs.LG] UPDATED)
    Recent years have witnessed great success of supervised deep learning, where predictive models are trained from a large amount of fully labeled data. However, in practice, labeling such big data can be very costly and may not even be possible for privacy reasons. Therefore, in this paper, we aim to learn an accurate classifier without any class labels. More specifically, we consider the case where multiple sets of unlabeled data and only their class priors, i.e., the proportions of each class, are available. Under this problem setup, we first derive an unbiased estimator of the classification risk that can be computed from the given unlabeled sets and theoretically analyze the generalization error of the learned classifier. We then find that the classifier obtained in this way tends to overfit as its empirical risks go negative during training. To prevent overfitting, we further propose a partial risk regularization that keeps the partial risks with respect to unlabeled datasets and classes at certain levels. Experiments demonstrate that our method effectively mitigates overfitting and outperforms state-of-the-art methods for learning from multiple unlabeled sets.
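    For intuition, the classical construction of such an estimator treats each unlabeled set as a known mixture of class-conditional distributions and inverts the prior matrix; the sketch below follows that textbook recipe and is not necessarily the paper's exact estimator. Note that the resulting empirical risk can go negative, which is precisely the overfitting symptom the proposed partial risk regularization targets.

        import numpy as np

        def risk_from_unlabeled_sets(loss_per_class, thetas, test_priors):
            """Unbiased risk from m unlabeled sets with known class priors.

            loss_per_class: (m, C) array; entry (i, c) is the average, over set i,
                of the loss the classifier incurs when the label is taken to be c.
            thetas: (m, C) matrix of class priors of each unlabeled set
                (set i has density p_i = sum_c thetas[i, c] * p(x | y = c)).
            test_priors: (C,) class priors of the test distribution.
            """
            # recover class-conditional expected losses by (pseudo-)inverting
            # the mixture relation E_{p_i}[g] = sum_c thetas[i, c] E_{p(x|c)}[g]
            class_cond = np.linalg.pinv(thetas) @ loss_per_class  # (C, C)
            return float(test_priors @ np.diag(class_cond))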
    Near-Optimal Regret Bounds for Multi-batch Reinforcement Learning. (arXiv:2210.08238v1 [cs.LG])
    In this paper, we study the episodic reinforcement learning (RL) problem modeled by finite-horizon Markov Decision Processes (MDPs) with a constraint on the number of batches. In the multi-batch reinforcement learning framework, the agent is required to commit, in advance, to a time schedule of policy updates; this is particularly suitable for scenarios where changing the policy adaptively is costly. Given a finite-horizon MDP with $S$ states, $A$ actions and planning horizon $H$, we design a computationally efficient algorithm to achieve near-optimal regret of $\tilde{O}(\sqrt{SAH^3K\ln(1/\delta)})$\footnote{$\tilde{O}(\cdot)$ hides logarithmic terms of $(S,A,H,K)$} in $K$ episodes using $O\left(H+\log_2\log_2(K) \right)$ batches with confidence parameter $\delta$. To the best of our knowledge, this is the first $\tilde{O}(\sqrt{SAH^3K})$ regret bound with $O(H+\log_2\log_2(K))$ batch complexity. Meanwhile, we show that to achieve $\tilde{O}(\mathrm{poly}(S,A,H)\sqrt{K})$ regret, the number of batches is at least $\Omega\left(H/\log_A(K)+ \log_2\log_2(K) \right)$, which matches our upper bound up to logarithmic terms. Our technical contributions are two-fold: 1) a near-optimal design scheme to explore over the unlearned states; 2) a computationally efficient algorithm to explore certain directions with an approximated transition model.
    Classification of animal sounds in a hyperdiverse rainforest using Convolutional Neural Networks. (arXiv:2111.14971v2 [cs.LG] UPDATED)
    To protect tropical forest biodiversity, we need to be able to detect it reliably, cheaply, and at scale. Automated species detection from passively recorded soundscapes via machine-learning approaches is a promising technique towards this goal, but it is constrained by the necessity of large training data sets. Using soundscapes from a tropical forest in Borneo and a Convolutional Neural Network (CNN) model created with transfer learning, we investigate i) the minimum viable training data set size for accurate prediction of call types ('sonotypes'), and ii) the extent to which data augmentation can overcome the issue of small training data sets. We found that even relatively high sample sizes (> 80 per call type) lead to mediocre accuracy, which, however, improves significantly with data augmentation, including at extremely small sample sizes, regardless of taxonomic group or call characteristics. Our results suggest that transfer learning and data augmentation can make the use of CNNs to classify species' vocalizations feasible even for small soundscape-based projects with many rare species. Retraining our open-source model requires only basic programming skills, making it possible for individual conservation initiatives to adapt it to their local context and enable more evidence-informed management of biodiversity.
    Deep Regression Unlearning. (arXiv:2210.08196v1 [cs.LG])
    With the introduction of data protection and privacy regulations, it has become crucial to remove the lineage of data on demand from a machine learning system. In the past few years, there has been notable development in machine unlearning to remove the information of certain training data points efficiently and effectively from the model. In this work, we explore unlearning in a regression problem, particularly in deep learning models. Unlearning in classification and simple linear regression has been investigated considerably, but unlearning in deep regression models has largely remained an untouched problem. In this work, we introduce deep regression unlearning methods that generalize well and are robust to privacy attacks. We propose the Blindspot unlearning method, which uses a novel weight optimization process. A randomly initialized model, partially exposed to the retain samples, and a copy of the original model are used together to selectively imprint knowledge about the data that we wish to keep and scrub the information of the data we wish to forget. We also propose a Gaussian distribution based fine-tuning method for regression unlearning. The existing evaluation metrics for unlearning in a classification task are not directly applicable to regression unlearning, so we adapt these metrics for the regression task. We devise a membership inference attack to check for privacy leaks in the unlearned regression model. We conduct experiments on regression tasks for computer vision, natural language processing and forecasting applications. Our deep regression unlearning methods show excellent performance across all of these datasets and metrics.
    Robust Flow-based Conformal Inference (FCI) with Statistical Guarantee. (arXiv:2205.10732v2 [stat.ML] UPDATED)
    Conformal prediction aims to determine precise levels of confidence in predictions for new objects using past experience. However, the commonly used exchangeability assumption between the training data and testing data limits its use with contaminated testing sets. In this paper, we develop a novel flow-based conformal inference (FCI) method to build predictive sets and infer outliers for complex and high-dimensional data. We leverage ideas from adversarial flows to transfer the input data to a random vector with known distribution. Our roundtrip transformation maps the input data to a low-dimensional space while preserving the conditional distribution of the input data given each class label, which enables us to construct a non-conformity score for uncertainty quantification. Our approach is applicable and robust when the testing data are contaminated. We evaluate our method, robust flow-based conformal inference, on benchmark datasets. We find that it produces effective predictive sets and accurate outlier detection, and is more powerful than competing approaches.
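    Once a non-conformity score is available (in FCI it would come from the flow-transformed low-dimensional representation), the conformal machinery itself is standard; the following sketch shows split-conformal p-values and the resulting predictive set, with the score treated as an opaque input.

        import numpy as np

        def conformal_pvalue(score_test, scores_calibration):
            """Standard split-conformal p-value for one test score."""
            s = np.asarray(scores_calibration)
            return (1.0 + np.sum(s >= score_test)) / (len(s) + 1.0)

        def predictive_set(scores_per_class, calib_scores_per_class, alpha=0.1):
            """Include every label whose conformal p-value exceeds alpha."""
            return [c for c, s in scores_per_class.items()
                    if conformal_pvalue(s, calib_scores_per_class[c]) > alpha]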
    Pseudo AI Bias. (arXiv:2210.08141v1 [cs.AI])
    Pseudo Artificial Intelligence bias (PAIB) is broadly disseminated in the literature; it can cause unnecessary AI fear in society, exacerbate the enduring inequities and disparities in access to and sharing of the benefits of AI applications, and waste social capital invested in AI research. This study systematically reviews publications in the literature to present three types of PAIBs, arising from: a) misunderstandings, b) pseudo mechanical bias, and c) over-expectations. We discuss the consequences of and solutions to PAIBs, including certifying users of AI applications to mitigate AI fears, providing customized user guidance for AI applications, and developing systematic approaches to monitor bias. We conclude that PAIB due to misunderstandings, pseudo mechanical bias, and over-expectations of algorithmic predictions is socially harmful.
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v2 [stat.ML] UPDATED)
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, algorithmic guarantees for VI are still less well understood. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, resting upon the theory of gradient flows on the Bures--Wasserstein space of Gaussian measures. Akin to MCMC, our approach comes with strong theoretical guarantees when $\pi$ is log-concave.
    Near-Optimal No-Regret Learning Dynamics for General Convex Games. (arXiv:2206.08742v3 [cs.GT] UPDATED)
    A recent line of work has established uncoupled learning dynamics such that, when employed by all players in a game, each player's \emph{regret} after $T$ repetitions grows polylogarithmically in $T$, an exponential improvement over the traditional guarantees within the no-regret framework. However, so far these results have only been limited to certain classes of games with structured strategy spaces -- such as normal-form and extensive-form games. Whether $O(\text{polylog} T)$ regret bounds can be obtained for general convex and compact strategy sets -- which occur in many fundamental models in economics and multiagent systems -- while retaining efficient strategy updates was an important open question. In this paper, we answer it in the positive by establishing the first uncoupled learning algorithm with $O(\log T)$ per-player regret in general \emph{convex games}, that is, games with concave utility functions supported on arbitrary convex and compact strategy sets. Our learning dynamics are based on an instantiation of optimistic follow-the-regularized-leader over an appropriately \emph{lifted} space using a \emph{self-concordant regularizer} that is, peculiarly, not a barrier for the feasible region. Further, our learning dynamics are efficiently implementable given access to a proximal oracle for the convex strategy set, leading to $O(\log\log T)$ per-iteration complexity; we also give extensions when access to only a \emph{linear} optimization oracle is assumed. Finally, we adapt our dynamics to guarantee $O(\sqrt{T})$ regret in the adversarial regime. Even in those special cases where prior results apply, our algorithm improves over the state-of-the-art regret bounds either in terms of the dependence on the number of iterations or on the dimension of the strategy sets.
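    Schematically, optimistic follow-the-regularized-leader plays, at round $t+1$, $$x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\langle x,\; m_{t+1} + \sum_{s=1}^{t} \ell_s \Big\rangle + \frac{1}{\eta} R(x), \qquad m_{t+1} = \ell_t,$$ where $\ell_s$ is the loss vector observed at round $s$, $m_{t+1}$ is the optimistic prediction of the next loss, $\eta$ is the learning rate, and $R$ is the regularizer; the paper's instantiation runs this update over an appropriately lifted space with a self-concordant, non-barrier regularizer, details this schematic omits.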
    Parameter-free Dynamic Graph Embedding for Link Prediction. (arXiv:2210.08189v1 [cs.LG])
    Dynamic interaction graphs have been widely adopted to model the evolution of user-item interactions over time. There are two crucial factors when modelling user preferences for link prediction in dynamic interaction graphs: 1) collaborative relationships among users and 2) user personalized interaction patterns. Existing methods often implicitly consider these two factors together, which may lead to noisy user modelling when the two factors diverge. In addition, they usually require time-consuming parameter learning with back-propagation, which is prohibitive for real-time user preference modelling. To this end, this paper proposes FreeGEM, a parameter-free dynamic graph embedding method for link prediction. Firstly, to take advantage of the collaborative relationships, we propose an incremental graph embedding engine to obtain user/item embeddings, an Online-Monitor-Offline architecture consisting of an Online module to approximately embed users/items over time, a Monitor module to estimate the approximation error in real time, and an Offline module to calibrate the user/item embeddings when the online approximation errors exceed a threshold. Meanwhile, we integrate attribute information into the model, which enables FreeGEM to better model users belonging to underrepresented groups. Secondly, we design a personalized dynamic interaction pattern modeller, which combines dynamic time decay with an attention mechanism to model user short-term interests. Experimental results on two link prediction tasks show that FreeGEM can outperform the state-of-the-art methods in accuracy while achieving over 36X improvement in efficiency. All code and datasets can be found in https://github.com/FudanCISL/FreeGEM.
    Where to Begin? On the Impact of Pre-Training and Initialization in Federated Learning. (arXiv:2210.08090v1 [cs.LG])
    An oft-cited challenge of federated learning is the presence of heterogeneity. \emph{Data heterogeneity} refers to the fact that data from different clients may follow very different distributions. \emph{System heterogeneity} refers to the fact that client devices have different system capabilities. A considerable number of federated optimization methods address this challenge. In the literature, empirical evaluations usually start federated training from random initialization. However, in many practical applications of federated learning, the server has access to proxy data for the training task that can be used to pre-train a model before starting federated training. We empirically study the impact of starting from a pre-trained model in federated learning using four standard federated learning benchmark datasets. Unsurprisingly, starting from a pre-trained model reduces the training time required to reach a target error rate and enables the training of more accurate models (up to 40\%) than is possible when starting from random initialization. Surprisingly, we also find that starting federated learning from a pre-trained initialization reduces the effect of both data and system heterogeneity. We recommend that future work proposing and evaluating federated optimization methods evaluate the performance when starting from random and pre-trained initializations. We also believe this study raises several questions for further work on understanding the role of heterogeneity in federated optimization.
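    For reference, the experimental recipe amounts to running vanilla FedAvg from different starting points; a minimal sketch follows (uniform averaging with equal-sized clients assumed), where only the initialization of global_model changes between the random and pre-trained conditions.

        import copy
        import torch

        def fedavg(global_model, client_loaders, rounds=10, local_steps=5, lr=0.01):
            for _ in range(rounds):
                client_states = []
                for loader in client_loaders:
                    local = copy.deepcopy(global_model)      # broadcast global weights
                    opt = torch.optim.SGD(local.parameters(), lr=lr)
                    for _, (x, y) in zip(range(local_steps), loader):
                        opt.zero_grad()
                        torch.nn.functional.cross_entropy(local(x), y).backward()
                        opt.step()
                    client_states.append(local.state_dict())
                # uniform average of client weights
                avg = {k: torch.stack([cs[k].float() for cs in client_states]).mean(0)
                       for k in client_states[0]}
                global_model.load_state_dict(avg)
            return global_model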
    Substructure-Atom Cross Attention for Molecular Representation Learning. (arXiv:2210.08243v1 [cs.LG])
    Designing a neural network architecture for molecular representation is crucial for AI-driven drug discovery and molecule design. In this work, we propose a new framework for molecular representation learning. Our contribution is threefold: (a) demonstrating the usefulness of incorporating substructure information into node-wise molecular features, (b) designing a two-branch network, consisting of a transformer and a graph neural network, whose branches are fused with asymmetric attention, and (c) requiring neither heuristic features nor computationally expensive information from molecules. Using 1.8 million molecules collected from the ChEMBL and PubChem databases, we pretrain our network to learn a general representation of molecules with minimal supervision. The experimental results show that our pretrained network achieves competitive performance on 11 downstream tasks for molecular property prediction.
    Rethinking Symmetric Matrix Factorization: A More General and Better Clustering Perspective. (arXiv:2209.02528v2 [cs.LG] UPDATED)
    Nonnegative matrix factorization (NMF) is widely used for clustering with strong interpretability. Among general NMF problems, symmetric NMF is a special one that plays an important role in graph clustering, where each element measures the similarity between data points. Most existing symmetric NMF algorithms require the factor matrices to be nonnegative and only focus on minimizing the gap between the similarity matrix and its approximation for clustering, without considering other potential regularization terms that can yield better clustering. In this paper, we explore factorizing a symmetric matrix that does not have to be nonnegative, presenting an efficient factorization algorithm with a regularization term to boost the clustering performance. Moreover, a more general framework is proposed to solve symmetric matrix factorization problems with different constraints on the factor matrices.
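    The basic unconstrained setting the paper generalizes can be sketched as plain gradient descent on $\|A - UU^\top\|_F^2$ with an optional ridge term; the authors' algorithm and regularization are more elaborate, so this is only a baseline illustration.

        import numpy as np

        def symmetric_factorize(A, rank, steps=2000, lr=1e-2, reg=0.0):
            """Minimize ||A - U U^T||_F^2 + reg * ||U||_F^2, U unconstrained in sign."""
            rng = np.random.default_rng(0)
            U = 0.1 * rng.standard_normal((A.shape[0], rank))
            for _ in range(steps):
                R = U @ U.T - A                      # residual (A assumed symmetric)
                U -= lr * (4.0 * R @ U + 2.0 * reg * U)
            return U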
    Neural Lyapunov Control of Unknown Nonlinear Systems with Stability Guarantees. (arXiv:2206.01913v2 [eess.SY] UPDATED)
    Learning for control of dynamical systems with formal guarantees remains a challenging task. This paper proposes a learning framework to simultaneously stabilize an unknown nonlinear system with a neural controller and learn a neural Lyapunov function to certify a region of attraction (ROA) for the closed-loop system. The algorithmic structure consists of two neural networks and a satisfiability modulo theories (SMT) solver. The first neural network is responsible for learning the unknown dynamics. The second neural network aims to identify a valid Lyapunov function and a provably stabilizing nonlinear controller. The SMT solver then verifies that the candidate Lyapunov function indeed satisfies the Lyapunov conditions. We provide theoretical guarantees of the proposed learning framework in terms of the closed-loop stability for the unknown nonlinear system. We illustrate the effectiveness of the approach with a set of numerical experiments.
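    A minimal sketch of the learner's side is shown below: an empirical loss enforcing the two Lyapunov conditions (positivity away from the origin, and decrease along the learned dynamics under the neural controller) on sampled states. The SMT verification step that certifies the conditions exactly is omitted, and all function names are illustrative.

        import torch

        def lyapunov_loss(V, controller, f_hat, states, eps=1e-3):
            x = states.requires_grad_(True)
            v = V(x)
            # Lie derivative: Vdot(x) = grad V(x) . f_hat(x, u(x))
            grad_v = torch.autograd.grad(v.sum(), x, create_graph=True)[0]
            vdot = (grad_v * f_hat(x, controller(x))).sum(dim=1)
            positivity = torch.relu(eps - v.squeeze(-1)).mean()  # want V(x) >= eps
            decrease = torch.relu(vdot + eps).mean()             # want Vdot(x) <= -eps
            return positivity + decrease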
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v2 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data share the same distribution as the upcoming testing samples. However, this stationarity assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it in an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationary nature of the problem and the lack of supervision make it challenging to tackle. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness of our approach and support the theoretical findings.
    Pishgu: Universal Path Prediction Architecture through Graph Isomorphism and Attentive Convolution. (arXiv:2210.08057v1 [cs.CV])
    Path prediction is an essential task for several real-world real-time applications, from autonomous driving and video surveillance to environmental monitoring. Most existing approaches are computation-intensive and only target a narrow domain (e.g., a specific point of view for a particular subject). However, many real-time applications demand a universal path predictor that can work across different subjects (vehicles, pedestrians), perspectives (bird's-eye, high-angle), and scenes (sidewalk, highway). This article proposes Pishgu, a universal graph isomorphism approach for attentive path prediction that accounts for environmental challenges. Pishgu captures the inter-dependencies among the subjects in each frame by taking advantage of Graph Isomorphism Networks. In addition, an attention module is adopted to represent the intrinsic relations of the subjects of interest with their surroundings. We evaluate the adaptability of our approach on multiple publicly available vehicle (bird's-eye view) and pedestrian (bird's-eye and high-angle view) path prediction datasets. Pishgu's universal solution outperforms existing domain-focused methods, producing state-of-the-art results for the vehicle bird's-eye view by 42% and 61% and for pedestrian high-angle views by 23% and 22% in terms of ADE and FDE, respectively. Moreover, we analyze the domain-specific details of various datasets to understand their effect on path prediction and model interpretation. Although our model is a single solution for path prediction problems and defines a new standard in multiple domains, its complexity remains comparable to that of state-of-the-art models, making it suitable for real-world applications. We also report the latency and throughput for all three domains on multiple embedded processors.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v2 [cs.CR] UPDATED)
    Recent works show that adversarial examples exist for random neural networks [Daniely and Schacham, 2020] and that these examples can be found using a single step of gradient ascent [Bubeck et al., 2021]. In this work, we extend this line of work to "lazy training" of neural networks -- a dominant model in deep learning theory in which neural networks are provably efficiently learnable. We show that over-parametrized neural networks that are guaranteed to generalize well and enjoy strong computational guarantees remain vulnerable to attacks generated using a single step of gradient ascent.
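    The attack in question is just one step of gradient ascent on the loss; the FGSM-style sign variant below is a hedged stand-in for the exact step used in the cited works, but suffices to convey the idea.

        import torch

        def single_step_attack(model, x, y, step_size=0.05):
            """One gradient-ascent step on the loss, starting from input x."""
            x = x.clone().requires_grad_(True)
            torch.nn.functional.cross_entropy(model(x), y).backward()
            return (x + step_size * x.grad.sign()).detach()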
    Disentangled Representation Learning for RF Fingerprint Extraction under Unknown Channel Statistics. (arXiv:2208.02724v2 [eess.SP] UPDATED)
    Deep learning (DL) applied to a device's radio-frequency fingerprint~(RFF) has attracted significant attention in physical-layer authentication due to its extraordinary classification performance. Conventional DL-RFF techniques are trained by adopting maximum likelihood estimation~(MLE). Although their discriminability has recently been extended to unknown devices in open-set scenarios, they still tend to overfit the channel statistics embedded in the training dataset. This restricts their practical applications, as it is challenging to collect sufficient training data capturing the characteristics of all possible wireless channel environments. To address this challenge, we propose a DL framework of disentangled representation~(DR) learning that first learns to factor the signals into a device-relevant component and a device-irrelevant component via adversarial learning. Then, it shuffles these two parts within a dataset for implicit data augmentation, which imposes a strong regularization on RFF extractor learning to avoid possible overfitting of device-irrelevant channel statistics, without collecting additional data from unknown channels. Experiments validate that the proposed approach, referred to as DR-based RFF, outperforms conventional methods in terms of generalizability to unknown devices even under unknown complicated propagation environments (e.g., dispersive multipath fading channels), even though all the training data are collected in a simple environment dominated by direct line-of-sight~(LoS) propagation paths.
    Digital Audio Forensics: Blind Human Voice Mimicry Detection. (arXiv:2209.12573v2 [cs.SD] UPDATED)
    Audio is one of the most widely used means of human communication, but it can also easily be misused to deceive people. With the AI revolution, the related technologies are now accessible to almost everyone, making it simple for criminals to commit crimes and forgeries. In this work, we introduce a deep learning method to develop a classifier that blindly classifies an input audio as real or mimicked. The proposed model was trained on a set of important features extracted from a large dataset of audios, yielding a classifier that was tested on the same set of features from different audios. Two datasets were created for this work: an all-English dataset and a mixed dataset (Arabic and English). These datasets have been made available to the research community through GitHub at https://github.com/SaSs7/Dataset. For comparison, the audios were also classified through human inspection, with the subjects being native speakers. The ensuing results were interesting and exhibited formidable accuracy.
    A Low-cost Humanoid Prototype Intended to assist people with disability using Raspberry Pi. (arXiv:2210.08116v1 [cs.RO])
    This paper delineates the making of a humanoid prototype intended to assist people with disability (PWD). The assistance that this prototype offers is rather rudimentary; however, our key focus is to make the prototype cost-friendly while preserving its humanoid-like functionalities. Considering the growing need for robots, provisions for installing further features have been made in this project. The prototype is of humanoid shape, harnessing the power of an Artificial Neural Network (ANN) to converse with users. The prototype uses a Raspberry Pi, and as the computational capability of a Raspberry Pi is minimal, we cut corners to squeeze out the last drop of performance and make it as efficient as possible.
    Invertible Monotone Operators for Normalizing Flows. (arXiv:2210.08176v1 [cs.LG])
    Normalizing flows model probability distributions by learning invertible transformations that transform a simple distribution into a complex one. Since the architecture of ResNet-based normalizing flows is more flexible than that of coupling-based models, ResNet-based normalizing flows have been widely studied in recent years. Despite their architectural flexibility, it is well known that current ResNet-based models suffer from constrained Lipschitz constants. In this paper, we propose the monotone formulation to overcome the issue of the Lipschitz constants using monotone operators and provide an in-depth theoretical analysis. Furthermore, we construct an activation function called Concatenated Pila (CPila) to improve gradient flow. The resulting model, Monotone Flows, exhibits excellent performance on multiple density estimation benchmarks (MNIST, CIFAR-10, ImageNet32, ImageNet64). Code is available at https://github.com/mlvlab/MonotoneFlows.
    Attention Regularized Laplace Graph for Domain Adaptation. (arXiv:2210.08170v1 [cs.CV])
    In leveraging manifold learning in domain adaptation (DA), graph embedding-based DA methods have shown their effectiveness in preserving the data manifold through the Laplace graph. However, current graph embedding DA methods suffer from two issues: 1) they are only concerned with preserving the underlying data structures in the embedding and ignore sub-domain adaptation, which requires taking into account intra-class similarity and inter-class dissimilarity, thereby leading to negative transfer; 2) manifold learning is performed across different feature/label spaces separately, thereby hindering unified comprehensive manifold learning. In this paper, starting from our previous DGA-DA, we propose a novel DA method, namely Attention Regularized Laplace Graph-based Domain Adaptation (ARG-DA), to remedy the aforementioned issues. Specifically, by weighting the importance of different sub-domain adaptation tasks, we propose the Attention Regularized Laplace Graph for class-aware DA, thereby generating attention regularized DA. Furthermore, using a specifically designed FEEL strategy, our approach dynamically unifies the alignment of manifold structures across different feature/label spaces, thus leading to comprehensive manifold learning. Comprehensive experiments are carried out to verify the effectiveness of the proposed DA method, which consistently outperforms state-of-the-art DA methods on 7 standard DA benchmarks, i.e., 37 cross-domain image classification tasks including object, face, and digit images. An in-depth analysis of the proposed DA method is also presented, covering sensitivity, convergence, and robustness.
    GFlowCausal: Generative Flow Networks for Causal Discovery. (arXiv:2210.08185v1 [cs.LG])
    Causal discovery aims to uncover causal structure among a set of variables. Score-based approaches mainly focus on searching for the best Directed Acyclic Graph (DAG) based on a predefined score function. However, most of them are not applicable on a large scale due to the limited scalability of the search. Inspired by active learning in generative flow networks, we propose a novel approach to learning a DAG from observational data, called GFlowCausal. It converts the graph search problem into a generation problem, in which directed edges are added gradually. GFlowCausal aims to learn the best policy to generate high-reward DAGs by sequential actions with probabilities proportional to predefined rewards. We propose a plug-and-play module based on transitive closure to ensure efficient sampling. Theoretical analysis shows that this module guarantees acyclicity effectively and ensures consistency between final states and fully-connected graphs. We conduct extensive experiments on both synthetic and real datasets, and the results show that the proposed approach is superior and also performs well in a large-scale setting.
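    The acyclicity bookkeeping can be illustrated with a reachability (transitive closure) mask: an edge u -> v is safe to add iff v does not already reach u. The paper's plug-and-play module maintains this closure incrementally for efficiency; the sketch below recomputes it from scratch for clarity.

        import numpy as np

        def valid_edge_mask(adj):
            """mask[u, v] is True iff adding u -> v keeps the graph acyclic."""
            n = adj.shape[0]
            reach = adj > 0
            for k in range(n):                    # Floyd-Warshall-style closure
                reach = reach | (reach[:, k:k+1] & reach[k:k+1, :])
            return ~reach.T & ~np.eye(n, dtype=bool) & (adj == 0)

        adj = np.zeros((4, 4))
        adj[0, 1] = adj[1, 2] = 1                 # chain 0 -> 1 -> 2
        assert not valid_edge_mask(adj)[2, 0]     # 2 -> 0 would close a cycle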
    EXoN: EXplainable encoder Network. (arXiv:2105.10867v3 [stat.ML] UPDATED)
    We propose a new semi-supervised learning method for the Variational AutoEncoder (VAE) that yields a customized and explainable latent space via the EXplainable encoder Network (EXoN). Customization means a manual design of the latent space layout for specific labeled data. To improve the performance of our VAE in a classification task without loss of performance as a generative model, we employ a new semi-supervised classification method called SCI (Soft-label Consistency Interpolation). The classification loss and the Kullback-Leibler divergence play a crucial role in constructing the explainable latent space. The variability of samples generated from our proposed model depends on a specific subspace, called the activated latent subspace. Our numerical results with the MNIST and CIFAR-10 datasets show that EXoN produces an explainable latent space and reduces the cost of investigating representation patterns on the latent space.
    Robust Feature-Level Adversaries are Interpretability Tools. (arXiv:2110.03605v5 [cs.LG] UPDATED)
    The literature on adversarial attacks in computer vision typically focuses on pixel-level perturbations. These tend to be very difficult to interpret. Recent work that manipulates the latent representations of image generators to create "feature-level" adversarial perturbations gives us an opportunity to explore perceptible, interpretable adversarial attacks. We make three contributions. First, we observe that feature-level attacks provide useful classes of inputs for studying representations in models. Second, we show that these adversaries are versatile and highly robust. We demonstrate that they can be used to produce targeted, universal, disguised, physically-realizable, and black-box attacks at the ImageNet scale. Third, we show how these adversarial images can be used as a practical interpretability tool for identifying bugs in networks. We use these adversaries to make predictions about spurious associations between features and classes which we then test by designing "copy/paste" attacks in which one natural image is pasted into another to cause a targeted misclassification. Our results suggest that feature-level attacks are a promising approach for rigorous interpretability research. They support the design of tools to better understand what a model has learned and diagnose brittle feature associations. Code is available at https://github.com/thestephencasper/feature_level_adv
    AMD-DBSCAN: An Adaptive Multi-density DBSCAN for datasets of extremely variable density. (arXiv:2210.08162v1 [cs.DB])
    DBSCAN is widely used among density-based clustering algorithms. However, with the increasing demand for multi-density clustering, traditional DBSCAN cannot achieve good clustering results on multi-density datasets. To address this problem, an adaptive multi-density DBSCAN algorithm (AMD-DBSCAN) is proposed in this paper. An improved parameter adaptation method in AMD-DBSCAN searches for multiple parameter pairs (i.e., Eps and MinPts), the key parameters that determine the clustering results and performance, thereby allowing the model to be applied to multi-density datasets. Moreover, AMD-DBSCAN requires only one hyperparameter, avoiding complicated repetitive initialization operations. Furthermore, the variance of the number of neighbors (VNN) is proposed to measure the difference in density between clusters. The experimental results show that AMD-DBSCAN reduces execution time by an average of 75% due to lower algorithm complexity compared with the traditional adaptive algorithm. In addition, AMD-DBSCAN improves accuracy by 24.7% on average over the state-of-the-art design on multi-density datasets of extremely variable density, while incurring no performance loss in single-density scenarios.
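    A toy version of the multi-density idea, with the parameter pairs given rather than adaptively searched as AMD-DBSCAN does, can be written with scikit-learn's DBSCAN: run the densest setting first and let each subsequent pair cluster only the points left unassigned.

        import numpy as np
        from sklearn.cluster import DBSCAN

        def multi_density_dbscan(X, param_pairs):
            """param_pairs: iterable of (eps, min_samples); smallest eps runs first."""
            labels = -np.ones(len(X), dtype=int)
            next_label = 0
            unassigned = np.arange(len(X))
            for eps, min_samples in sorted(param_pairs):
                if len(unassigned) == 0:
                    break
                sub = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[unassigned])
                for c in set(sub) - {-1}:         # keep clusters, re-queue noise
                    labels[unassigned[sub == c]] = next_label
                    next_label += 1
                unassigned = unassigned[sub == -1]
            return labels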
    Gradient-Free Methods for Deterministic and Stochastic Nonsmooth Nonconvex Optimization. (arXiv:2209.05045v3 [math.OC] UPDATED)
    Nonsmooth nonconvex optimization problems broadly emerge in machine learning and business decision making, whereas two core challenges impede the development of efficient solution methods with finite-time convergence guarantees: the lack of a computationally tractable optimality criterion and the lack of computationally powerful oracles. The contributions of this paper are two-fold. First, we establish the relationship between the celebrated Goldstein subdifferential~\citep{Goldstein-1977-Optimization} and uniform smoothing, thereby providing the basis and intuition for the design of gradient-free methods that guarantee finite-time convergence to a set of Goldstein stationary points. Second, we propose the gradient-free method (GFM) and stochastic GFM (SGFM) for solving a class of nonsmooth nonconvex optimization problems and prove that both of them can return a $(\delta,\epsilon)$-Goldstein stationary point of a Lipschitz function $f$ at an expected convergence rate of $O(d^{3/2}\delta^{-1}\epsilon^{-4})$, where $d$ is the problem dimension. Two-phase versions of GFM and SGFM are also proposed and proven to achieve improved large-deviation results. Finally, we demonstrate the effectiveness of 2-SGFM on training ReLU neural networks with the \textsc{MNIST} dataset.
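    The workhorse of such gradient-free methods is a function-value-only estimator of the gradient of the uniformly smoothed surrogate $f_\delta(x) = \mathbb{E}_u[f(x + \delta u)]$; below is a minimal sketch of the standard two-point spherical estimator and one descent step (illustrative, not the authors' exact algorithm).

        import numpy as np

        def two_point_gradient_estimate(f, x, delta, n_samples=1):
            """Estimate grad f_delta(x) from function values only."""
            d = x.shape[0]
            g = np.zeros(d)
            for _ in range(n_samples):
                u = np.random.randn(d)
                u /= np.linalg.norm(u)            # uniform direction on the sphere
                g += (d / (2 * delta)) * (f(x + delta * u) - f(x - delta * u)) * u
            return g / n_samples

        f = lambda x: np.abs(x).sum()             # nonsmooth test function ||x||_1
        x = np.ones(5)
        x = x - 0.1 * two_point_gradient_estimate(f, x, delta=0.05, n_samples=8)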
    Robot Navigation Anticipative Strategies in Deep Reinforcement Motion Planning. (arXiv:2210.08280v1 [cs.RO])
    The navigation of robots in dynamic urban environments requires elaborate anticipative strategies for the robot to avoid collisions with dynamic objects, such as bicycles or pedestrians, and to be human-aware. We have developed and analyzed three anticipative strategies in motion planning that take into account the future motion of mobile objects moving at up to 18 km/h. First, we used our hybrid policy, resulting from Deep Deterministic Policy Gradient (DDPG) training and the Social Force Model (SFM), and tested it in simulation in four complex map scenarios with many pedestrians. Second, we used these anticipative strategies in real-life experiments with the hybrid motion planning method and the ROS Navigation Stack with Dynamic Window Approach (NS-DWA). Both the simulations and the real-life experiments show very good performance in open environments and also in mixed scenarios with narrow spaces.
    Certified Robustness Against Natural Language Attacks by Causal Intervention. (arXiv:2205.12331v3 [cs.LG] UPDATED)
    Deep learning models have achieved great success in many fields, yet they are vulnerable to adversarial examples. This paper follows a causal perspective to look into the adversarial vulnerability and proposes Causal Intervention by Semantic Smoothing (CISS), a novel framework towards robustness against natural language attacks. Instead of merely fitting observational data, CISS learns causal effects p(y|do(x)) by smoothing in the latent semantic space to make robust predictions, which scales to deep architectures and avoids tedious construction of noise customized for specific attacks. CISS is provably robust against word substitution attacks, as well as empirically robust even when perturbations are strengthened by unknown attack algorithms. For example, on YELP, CISS surpasses the runner-up by 6.7% in terms of certified robustness against word substitutions, and achieves 79.4% empirical robustness when syntactic attacks are integrated.
    HP-GMN: Graph Memory Networks for Heterophilous Graphs. (arXiv:2210.08195v1 [cs.LG])
    Graph neural networks (GNNs) have achieved great success in various graph problems. However, most GNNs are Message Passing Neural Networks (MPNNs) based on the homophily assumption, where nodes with the same label are connected in graphs. Real-world problems instead often exhibit heterophily, where nodes with different labels are connected. MPNNs fail to address the heterophily problem because they mix information from different distributions and are not good at capturing global patterns. Therefore, in this paper we investigate a novel Graph Memory Networks model on Heterophilous Graphs (HP-GMN) to address the heterophily problem. In HP-GMN, local information and global patterns are learned by local statistics and the memory to facilitate prediction. We further propose regularization terms to help the memory learn global information. We conduct extensive experiments to show that our method achieves state-of-the-art performance on both homophilous and heterophilous graphs.
    Active Learning from the Web. (arXiv:2210.08205v1 [cs.LG])
    Labeling data is one of the most costly processes in machine learning pipelines. Active learning is a standard approach to alleviating this problem. Pool-based active learning first builds a pool of unlabelled data and iteratively selects data to be labeled so that the total number of required labels is minimized, keeping the model performance high. Many effective criteria for choosing data from the pool have been proposed in the literature. However, how to build the pool is less explored. Specifically, most of the methods assume that a task-specific pool is given for free. In this paper, we advocate that such a task-specific pool is not always available and propose the use of a myriad of unlabelled data on the Web as the pool to which active learning is applied. As the pool is extremely large, it is likely that relevant data exist in the pool for many tasks, and we do not need to explicitly design and build the pool for each task. The challenge is that we cannot compute the acquisition scores of all data exhaustively due to the size of the pool. We propose an efficient method, Seafaring, to retrieve informative data in terms of active learning from the Web using a user-side information retrieval algorithm. In the experiments, we use the online Flickr environment as the pool for active learning. This pool contains more than ten billion images and is several orders of magnitude larger than the existing pools in the literature for active learning. We confirm that our method performs better than existing approaches that use a small unlabelled pool.
    Reliability and Robustness analysis of Machine Learning based Phishing URL Detectors. (arXiv:2005.08454v2 [cs.CR] UPDATED)
    ML-based Phishing URL (MLPU) detectors serve as the first level of defence to protect users and organisations from falling victim to phishing attacks. Lately, a few studies have launched successful adversarial attacks against specific MLPU detectors, raising questions about their practical reliability and usage. Nevertheless, the robustness of these systems has not been extensively investigated; their security vulnerabilities therefore remain largely unknown, which calls for testing their robustness. In this article, we propose a methodology to investigate the reliability and robustness of 50 representative state-of-the-art MLPU models. Firstly, we propose a cost-effective adversarial URL generator, URLBUG, that creates an adversarial URL dataset. Subsequently, we reproduce 50 MLPU (traditional ML and deep learning) systems and record their baseline performance. Lastly, we test the considered MLPU systems on the adversarial dataset and analyze their robustness and reliability using box plots and heat maps. Our results show that the generated adversarial URLs have valid syntax and can be registered at a median annual price of \$11.99, and that of the 13\% of adversarial URLs that were already registered, 63.94\% were used for malicious purposes. Moreover, the considered MLPU models' Matthews Correlation Coefficient (MCC) dropped from a median of 0.92 to 0.02 when tested against $Adv_\mathrm{data}$, indicating that the baseline MLPU models are unreliable in their current form. Further, our findings identify several security vulnerabilities of these systems and provide future directions for researchers to design dependable and secure MLPU systems.
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v2 [cs.LG] UPDATED)
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training deep neural forecasters on the fly is notoriously challenging because of their limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting that simultaneously deals with abruptly changing and repeating patterns. In particular, FSNet improves on a slowly-learned backbone by dynamically balancing fast adaptation to recent changes against retrieving similar old knowledge. FSNet achieves this mechanism via the interaction of two complementary components: an adapter to monitor each layer's contribution to the loss, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code is available at \url{https://github.com/salesforce/fsnet}.
    Machine Learning Approach for Predicting Students Academic Performance and Study Strategies based on their Motivation. (arXiv:2210.08186v1 [cs.LG])
    This research aims to develop machine learning models for predicting students' academic performance and study strategies, which could be generalized to all courses in higher education. Key learning attributes (intrinsic, extrinsic, autonomy, relatedness, competence, and self-esteem) essential to students' learning process were used in building the models. Determining the broad effect of these attributes on students' academic performance and study strategy is the center of our interest. To investigate this, we used Scikit-learn in Python to build five machine learning models (Decision Tree, K-Nearest Neighbour, Random Forest, Linear/Logistic Regression, and Support Vector Machine) for both regression and classification tasks. The models were trained, evaluated, and tested for accuracy using data from 924 university dentistry students collected by Chilean authors through a quantitative research design. A comparative analysis revealed that tree-based models such as the random forest (with a prediction accuracy of 94.9%) and the decision tree show the best results compared to the linear, support vector, and k-nearest neighbour models. The models built in this research can be used to predict student performance and study strategy so that appropriate interventions can be implemented to improve student learning progress. Thus, incorporating strategies that improve diverse student learning attributes into the design of online educational systems may increase the likelihood of students continuing with their learning tasks as required. Moreover, the results show that the attributes can be modelled together and used to adapt/personalize the learning process.
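    Since the study builds its models with Scikit-learn, the comparison it describes can be sketched as follows; the synthetic data stands in for the (non-public) student dataset, and six features mirror the six motivation attributes only loosely.

```python
# Hedged sketch of the five-model comparison on synthetic stand-in data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=924, n_features=6, random_state=0)
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "K-Nearest Neighbour": KNeighborsClassifier(),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(),
}
for name, model in models.items():
    acc = cross_val_score(model, X, y, cv=5).mean()   # 5-fold CV accuracy
    print(f"{name}: {acc:.3f}")
```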
    Whole-body tumor segmentation of 18F -FDG PET/CT using a cascaded and ensembled convolutional neural networks. (arXiv:2210.08068v1 [cs.CV])
    Background: A crucial initial processing step for quantitative PET/CT analysis is the segmentation of tumor lesions, enabling accurate feature extraction, tumor characterization, oncologic staging, and image-based therapy response assessment. Manual lesion segmentation is, however, associated with enormous effort and cost and is thus infeasible in clinical routine. Goal: The goal of this study was to report the performance of a deep neural network designed to automatically segment regions suspected of cancer in whole-body 18F-FDG PET/CT images in the context of the AutoPET challenge. Method: A cascaded approach was developed in which a stacked ensemble of 3D U-Net CNNs processed the PET/CT images at a fixed 6mm resolution. A refiner network composed of residual layers enhanced the 6mm segmentation mask to the original resolution. Results: 930 cases were used to train the model; 50% were histologically proven cancer patients and 50% were healthy controls. We obtained a Dice score of 0.68 on 84 stratified test cases. Manual and automatic Metabolic Tumor Volume (MTV) were highly correlated (R2 = 0.969, slope = 0.947). Inference time was 89.7 seconds on average. Conclusion: The proposed algorithm accurately segmented regions suspicious for cancer in whole-body 18F-FDG PET/CT images.
    Knowledge Distillation approach towards Melanoma Detection. (arXiv:2210.08086v1 [cs.CV])
    Melanoma is regarded as the most threatening of all skin cancers. There is a pressing need to build systems that can aid in the early detection of melanoma and enable timely treatment of patients. Recent methods are geared towards machine learning based systems where the task is posed as image recognition, i.e., tagging dermoscopic images of skin lesions as melanoma or non-melanoma. Even though these methods show promising results in terms of accuracy, they are computationally quite expensive to train, which calls into question their deployability in clinical settings or on memory-constrained devices. To address this issue, we focus on building simple and performant models with few layers (less than ten, compared to hundreds) and fewer learnable parameters (0.26 million (M), compared to 42.5M), using knowledge distillation, with the goal of detecting melanoma from dermoscopic images. First, we train a teacher model using a ResNet-50 to detect melanoma. Using the teacher model, we train the student model, known as Distilled Student Network (DSNet), which has around 0.26M parameters, via knowledge distillation, achieving an accuracy of 91.7%. We compare against ImageNet pre-trained models such as MobileNet, VGG-16, Inception-V3, EfficientNet-B0, ResNet-50 and ResNet-101. We find that our approach works well in terms of inference runtime compared to other pre-trained models: 2.57 seconds compared to 14.55 seconds. We find that DSNet (0.26M parameters), which is 15 times smaller, consistently performs better than EfficientNet-B0 (4M parameters) in both melanoma and non-melanoma detection across Precision, Recall and F1 scores.
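    The teacher-student training described here follows the standard soft-label distillation recipe; a minimal PyTorch sketch (temperature and weighting are illustrative choices, not the paper's):

```python
# Hedged sketch of the standard soft-label knowledge-distillation loss
# (Hinton et al.) that a ResNet-50 teacher -> DSNet student setup would use.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: match the teacher's tempered class distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: ordinary cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

student_logits = torch.randn(8, 2)              # melanoma vs. non-melanoma
teacher_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student_logits, teacher_logits, labels))
```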
    Prediction of drug effectiveness in rheumatoid arthritis patients based on machine learning algorithms. (arXiv:2210.08016v1 [q-bio.QM])
    Rheumatoid arthritis (RA) is an autoimmune condition caused when patients' immune system mistakenly targets their own tissue. Machine learning (ML) has the potential to identify patterns in patient electronic health records (EHR) to forecast the best clinical treatment to improve patient outcomes. This study introduced a \textbf{D}rug \textbf{R}esponse \textbf{P}rediction (DRP) framework with two main goals: 1) design a data processing pipeline to extract information from tabular clinical data, and then preprocess it for functional use, and 2) predict RA patients' responses to drugs and evaluate classification models' performance. We propose a novel two-stage ML framework based on European Alliance of Associations for Rheumatology (EULAR) criteria cutoffs to model drug effectiveness. Our model, Stacked-Ensemble DRP, was developed and cross-validated using data from 425 RA patients. The evaluation used a subset of 124 patients (30\%) from the same data source. In the evaluation of the test set, two-stage DRP leads to improved classification accuracy over other end-to-end classification models for binary classification. Our proposed method provides a complete pipeline to predict disease activity scores and identify the group that does not respond well to anti-TNF treatments, thus showing promise in supporting clinical decisions based on EHR information. Codes and sample fictional datasets to test our model are given at \url{https://github.com/Gaskell-1206/Ensemble_DRP}.  ( 3 min )
    Reference Based Color Transfer for Medical Volume Rendering. (arXiv:2210.08083v1 [cs.CV])
    The benefits of medical imaging are enormous. Medical images provide considerable amounts of anatomical information and this facilitates medical practitioners in performing effective disease diagnosis and deciding upon the best course of medical treatment. A transition from traditional monochromatic medical images like CT scans, X-Rays or MRI images to a colored 3D representation of the anatomical structure further enhances the capabilities of medical professionals in extracting valuable medical information. The proposed framework in our research starts with performing color transfer by finding deep semantic correspondence between two medical images: a colored reference image, and a monochromatic CT scan or an MRI image. We extend this idea of reference-based colorization technique to perform colored volume rendering from a stack of grayscale medical images. Furthermore, we also propose to use an effective reference image recommendation system to aid in the selection of good reference images. With our approach, we successfully perform colored medical volume visualization and essentially eliminate the painstaking process of user interaction with a transfer function to obtain color and opacity parameters for volume rendering.  ( 2 min )
    Just Round: Quantized Observation Spaces Enable Memory Efficient Learning of Dynamic Locomotion. (arXiv:2210.08065v1 [cs.RO])
    Deep reinforcement learning (DRL) is one of the most powerful tools for synthesizing complex robotic behaviors. But training DRL models is incredibly compute and memory intensive, requiring large training datasets and replay buffers to achieve performant results. This poses a challenge for the next generation of field robots that will need to learn on the edge to adapt to their environment. In this paper, we begin to address this issue through observation space quantization. We evaluate our approach using four simulated robot locomotion tasks and two state-of-the-art DRL algorithms, the on-policy Proximal Policy Optimization (PPO) and off-policy Soft Actor-Critic (SAC) and find that observation space quantization reduces overall memory costs by as much as 4.2x without impacting learning performance.  ( 2 min )
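    A minimal sketch of the titular idea, rounding and narrowing observations before they enter a replay buffer; the precision and storage choices are illustrative, not the authors' exact scheme.

```python
# Hedged sketch of observation-space quantization by rounding.
import numpy as np

def quantize_obs(obs, decimals=2):
    """Round observations and store them at reduced precision."""
    return np.round(obs, decimals=decimals).astype(np.float16)

obs = np.random.randn(4, 17)                    # e.g., a batch of MuJoCo states
q = quantize_obs(obs)
full = obs.astype(np.float32)
print(f"buffer bytes: {q.nbytes} vs {full.nbytes}")   # ~2x smaller per entry here
```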
    Inferring Versatile Behavior from Demonstrations by Matching Geometric Descriptors. (arXiv:2210.08121v1 [cs.RO])
    Humans intuitively solve tasks in versatile ways, varying their behavior in terms of trajectory-based planning and for individual steps. Thus, they can easily generalize and adapt to new and changing environments. Current Imitation Learning algorithms often only consider unimodal expert demonstrations and act in a state-action-based setting, making it difficult for them to imitate human behavior in case of versatile demonstrations. Instead, we combine a mixture of movement primitives with a distribution matching objective to learn versatile behaviors that match the expert's behavior and versatility. To facilitate generalization to novel task configurations, we do not directly match the agent's and expert's trajectory distributions but rather work with concise geometric descriptors which generalize well to unseen task configurations. We empirically validate our method on various robot tasks using versatile human demonstrations and compare to imitation learning algorithms in a state-action setting as well as a trajectory-based setting. We find that the geometric descriptors greatly help in generalizing to new task configurations and that combining them with our distribution-matching objective is crucial for representing and reproducing versatile behavior.  ( 2 min )
    TestAug: A Framework for Augmenting Capability-based NLP Tests. (arXiv:2210.08097v1 [cs.SE])
    The recently proposed capability-based NLP testing allows model developers to test the functional capabilities of NLP models, revealing functional failures that cannot be detected by the traditional heldout mechanism. However, existing work on capability-based testing requires extensive manual efforts and domain expertise in creating the test cases. In this paper, we investigate a low-cost approach for the test case generation by leveraging the GPT-3 engine. We further propose to use a classifier to remove the invalid outputs from GPT-3 and expand the outputs into templates to generate more test cases. Our experiments show that TestAug has three advantages over the existing work on behavioral testing: (1) TestAug can find more bugs than existing work; (2) The test cases in TestAug are more diverse; and (3) TestAug largely saves the manual efforts in creating the test suites. The code and data for TestAug can be found at our project website (https://guanqun-yang.github.io/testaug/) and GitHub (https://github.com/guanqun-yang/testaug).  ( 2 min )
    Self-supervised Graph Learning for Long-tailed Cognitive Diagnosis. (arXiv:2210.08169v1 [cs.CY])
    Cognitive diagnosis is a fundamental yet critical research task in the field of intelligent education, which aims to discover the proficiency level of different students on specific knowledge concepts. Despite the effectiveness of existing efforts, previous methods always considered the mastery level over the whole student population, so they still suffer from the Long Tail Effect: the model performs poorly on the large number of students with sparse data. To relieve this situation, we propose a Self-supervised Cognitive Diagnosis (SCD) framework that leverages self-supervised learning to assist graph-based cognitive diagnosis, improving performance on students with sparse data. Specifically, we came up with a graph confusion method that drops edges under some special rules to generate different sparse views of the graph. By maximizing the consistency of the representation of the same node under different views, the model can focus more on long-tailed students. Additionally, we propose an importance-based view generation rule to improve the influence of long-tailed students. Extensive experiments on real-world datasets show the effectiveness of our approach, especially on students with sparse data.  ( 2 min )
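    The view-generation step can be sketched as random edge dropping; note that the paper's importance-based drop rule is replaced here by a uniform one for brevity, and the graph is a toy stand-in.

```python
# Hedged sketch of generating two sparse graph views by random edge dropping,
# the basic operation behind the consistency objective described above.
import numpy as np

def drop_edges(edge_index, drop_prob, rng):
    """edge_index: (2, E) array of COO edges; keep each edge w.p. 1-drop_prob."""
    keep = rng.random(edge_index.shape[1]) >= drop_prob
    return edge_index[:, keep]

rng = np.random.default_rng(0)
edge_index = rng.integers(0, 100, size=(2, 500))   # toy student-exercise graph
view_a = drop_edges(edge_index, 0.2, rng)
view_b = drop_edges(edge_index, 0.2, rng)
print(view_a.shape, view_b.shape)                  # two different sparse views
```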
    TweetNERD -- End to End Entity Linking Benchmark for Tweets. (arXiv:2210.08129v1 [cs.CL])
    Named Entity Recognition and Disambiguation (NERD) systems are foundational for information retrieval, question answering, event detection, and other natural language processing (NLP) applications. We introduce TweetNERD, a dataset of 340K+ Tweets across 2010-2021, for benchmarking NERD systems on Tweets. This is the largest and most temporally diverse open sourced dataset benchmark for NERD on Tweets and can be used to facilitate research in this area. We describe evaluation setup with TweetNERD for three NERD tasks: Named Entity Recognition (NER), Entity Linking with True Spans (EL), and End to End Entity Linking (End2End); and provide performance of existing publicly available methods on specific TweetNERD splits. TweetNERD is available at: https://doi.org/10.5281/zenodo.6617192 under Creative Commons Attribution 4.0 International (CC BY 4.0) license. Check out more details at https://github.com/twitter-research/TweetNERD.  ( 2 min )
    Bayesian Spline Learning for Equation Discovery of Nonlinear Dynamics with Quantified Uncertainty. (arXiv:2210.08095v1 [cs.LG])
    Nonlinear dynamics are ubiquitous in science and engineering applications, but the physics of most complex systems is far from being fully understood. Discovering interpretable governing equations from measurement data can help us understand and predict the behavior of complex dynamic systems. Although extensive work has recently been done in this field, robustly distilling explicit model forms from very sparse data with considerable noise remains intractable. Moreover, quantifying and propagating the uncertainty of the identified system from noisy data is challenging, and relevant literature is still limited. To bridge this gap, we develop a novel Bayesian spline learning framework to identify parsimonious governing equations of nonlinear (spatio)temporal dynamics from sparse, noisy data with quantified uncertainty. The proposed method utilizes spline basis to handle the data scarcity and measurement noise, upon which a group of derivatives can be accurately computed to form a library of candidate model terms. The equation residuals are used to inform the spline learning in a Bayesian manner, where approximate Bayesian uncertainty calibration techniques are employed to approximate posterior distributions of the trainable parameters. To promote sparsity, an iterative sequential-threshold Bayesian learning approach is developed, using an alternating direction optimization strategy to systematically approximate L0 sparsity constraints. The proposed algorithm is evaluated on multiple nonlinear dynamical systems governed by canonical ordinary and partial differential equations, and the merit/superiority of the proposed method is demonstrated by comparison with state-of-the-art methods.  ( 3 min )
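    The iterative sequential-threshold idea can be sketched with plain thresholded least squares (SINDy-style); the paper's Bayesian spline treatment of noise and uncertainty is not reproduced here, and the toy system is our own.

```python
# Hedged sketch of sequential-threshold sparse regression for equation discovery.
import numpy as np

def sequential_threshold_lstsq(Theta, dxdt, threshold=0.1, n_iter=10):
    """Theta: (n, p) library of candidate terms; dxdt: (n,) derivatives."""
    xi = np.linalg.lstsq(Theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold          # prune weak terms
        xi[small] = 0.0
        big = ~small
        if big.any():                           # refit on the surviving terms
            xi[big] = np.linalg.lstsq(Theta[:, big], dxdt, rcond=None)[0]
    return xi

# Toy example: recover dx/dt = -2x + 0.5x^3 from noisy samples.
rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3])
dxdt = -2 * x + 0.5 * x**3 + 0.01 * rng.standard_normal(200)
print(sequential_threshold_lstsq(Theta, dxdt))  # ~[0, -2, 0, 0.5]
```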
    Hierarchical Decentralized Deep Reinforcement Learning Architecture for a Simulated Four-Legged Agent. (arXiv:2210.08003v1 [cs.AI])
    Legged locomotion is widespread in nature and has inspired the design of current robots. The controller of these legged robots is often realized as one centralized instance. However, in nature, control of movement happens in a hierarchical and decentralized fashion. Introducing these biological design principles into robotic control systems has motivated this work. We tackle the question of whether decentralized and hierarchical control is beneficial for legged robots and present a novel decentralized, hierarchical architecture to control a simulated legged agent. Three different tasks varying in complexity are designed to benchmark five architectures (centralized, decentralized, hierarchical and two different combinations of hierarchical decentralized architectures). The results demonstrate that decentralizing the different levels of the hierarchical architectures facilitates learning of the agent, ensures more energy efficient movements as well as robustness towards new unseen environments. Furthermore, this comparison sheds light on the importance of modularity in hierarchical architectures to solve complex goal-directed tasks. We provide an open-source code implementation of our architecture (https://github.com/wzaielamri/hddrl).  ( 2 min )
    Is Face Recognition Safe from Realizable Attacks?. (arXiv:2210.08178v1 [cs.CV])
    Face recognition is a popular form of biometric authentication and, due to its widespread use, attacks have become more common as well. Recent studies show that Face Recognition Systems are vulnerable to attacks that can lead to erroneous identification of faces. Interestingly, most of these attacks are either white-box or manipulate facial images in ways that are not physically realizable. In this paper, we propose an attack scheme where the attacker generates realistic synthesized face images with subtle perturbations and physically realizes them on their own face to attack black-box face recognition systems. Comprehensive experiments and analyses show that subtle perturbations realized on the attacker's face can create successful attacks on state-of-the-art face recognition systems in black-box settings. Our study exposes the underlying vulnerability of Face Recognition Systems to realizable black-box attacks.  ( 2 min )
    Zonotope Domains for Lagrangian Neural Network Verification. (arXiv:2210.08069v1 [cs.LG])
    Neural network verification aims to provide provable bounds for the output of a neural network for a given input range. Notable prior works in this domain have either generated bounds using abstract domains, which preserve some dependency between intermediate neurons in the network; or framed verification as an optimization problem and solved a relaxation using Lagrangian methods. A key drawback of the latter technique is that each neuron is treated independently, thereby ignoring important neuron interactions. We provide an approach that merges these two threads and uses zonotopes within a Lagrangian decomposition. Crucially, we can decompose the problem of verifying a deep neural network into the verification of many 2-layer neural networks. While each of these problems is provably hard, we provide efficient relaxation methods that are amenable to efficient dual ascent procedures. Our technique yields bounds that improve upon both linear programming and Lagrangian-based verification techniques in both time and bound tightness.  ( 2 min )
    A Kernel Approach for PDE Discovery and Operator Learning. (arXiv:2210.08140v1 [stat.ML])
    This article presents a three-step framework for learning and solving partial differential equations (PDEs) using kernel methods. Given a training set consisting of pairs of noisy PDE solutions and source/boundary terms on a mesh, kernel smoothing is utilized to denoise the data and approximate derivatives of the solution. This information is then used in a kernel regression model to learn the algebraic form of the PDE. The learned PDE is then used within a kernel based solver to approximate the solution of the PDE with a new source/boundary term, thereby constituting an operator learning framework. The proposed method is mathematically interpretable and amenable to analysis, and convenient to implement. Numerical experiments compare the method to state-of-the-art algorithms and demonstrate its superior performance on small amounts of training data and for PDEs with spatially variable coefficients.  ( 2 min )
    On the Relationship Between Variational Inference and Auto-Associative Memory. (arXiv:2210.08013v1 [cs.LG])
    In this article, we propose a variational inference formulation of auto-associative memories, allowing us to combine perceptual inference and memory retrieval into the same mathematical framework. In this formulation, the prior probability distribution onto latent representations is made memory dependent, thus pulling the inference process towards previously stored representations. We then study how different neural network approaches to variational inference can be applied in this framework. We compare methods relying on amortized inference such as Variational Auto Encoders and methods relying on iterative inference such as Predictive Coding and suggest combining both approaches to design new auto-associative memory models. We evaluate the obtained algorithms on the CIFAR10 and CLEVR image datasets and compare them with other associative memory models such as Hopfield Networks, End-to-End Memory Networks and Neural Turing Machines.  ( 2 min )
    Partial Identification of Treatment Effects with Implicit Generative Models. (arXiv:2210.08139v1 [cs.LG])
    We consider the problem of partial identification, the estimation of bounds on the treatment effects from observational data. Although studied using discrete treatment variables or in specific causal graphs (e.g., instrumental variables), partial identification has been recently explored using tools from deep generative modeling. We propose a new method for partial identification of average treatment effects (ATEs) in general causal graphs using implicit generative models comprising continuous and discrete random variables. Since ATE with continuous treatment is generally non-regular, we leverage the partial derivatives of response functions to define a regular approximation of ATE, a quantity we call uniform average treatment derivative (UATD). We prove that our algorithm converges to tight bounds on ATE in linear structural causal models (SCMs). For nonlinear SCMs, we empirically show that using UATD leads to tighter and more stable bounds than methods that directly optimize the ATE.  ( 2 min )
    Deep Learning based Super-Resolution for Medical Volume Visualization with Direct Volume Rendering. (arXiv:2210.08080v1 [cs.GR])
    Modern-day display systems demand high-quality rendering. However, rendering at higher resolution requires a large number of data samples and is computationally expensive. Recent advances in deep learning-based image and video super-resolution techniques motivate us to investigate such networks for high-fidelity upscaling of frames rendered at a lower resolution to a higher resolution. While our work focuses on super-resolution of medical volume visualization performed with direct volume rendering, it is also applicable for volume visualization with other rendering techniques. We propose a learning-based technique where our proposed system uses color information along with other supplementary features gathered from our volume renderer to learn efficient upscaling of a low-resolution rendering to a higher-resolution space. Furthermore, to improve temporal stability, we also implement the temporal reprojection technique for accumulating history samples in volumetric rendering.  ( 2 min )
  • Open

    Efficient Active Learning with Abstention. (arXiv:2204.00043v2 [stat.ML] UPDATED)
    The goal of active learning is to achieve the same accuracy achievable by passive learning, while using much fewer labels. Exponential savings in terms of label complexity have been proved in very special cases, but fundamental lower bounds show that such improvements are impossible in general. This suggests a need to explore alternative goals for active learning. Learning with abstention is one such alternative. In this setting, the active learning algorithm may abstain from prediction and incur an error that is marginally smaller than random guessing. We develop the first computationally efficient active learning algorithm with abstention. Our algorithm provably achieves $\mathsf{polylog}(\frac{1}{\varepsilon})$ label complexity, without any low noise conditions. Such performance guarantee reduces the label complexity by an exponential factor, relative to passive learning and active learning that is not allowed to abstain. Furthermore, our algorithm is guaranteed to only abstain on hard examples (where the true label distribution is close to a fair coin), a novel property we term \emph{proper abstention} that also leads to a host of other desirable characteristics (e.g., recovering minimax guarantees in the standard setting, and avoiding the undesirable ``noise-seeking'' behavior often seen in active learning). We also provide novel extensions of our algorithm that achieve \emph{constant} label complexity and deal with model misspecification.
    Towards Healing the Blindness of Score Matching. (arXiv:2209.07396v2 [stat.ML] UPDATED)
    Score-based divergences have been widely used in machine learning and statistics applications. Despite their empirical success, a blindness problem has been observed when using these for multi-modal distributions. In this work, we discuss the blindness problem and propose a new family of divergences that can mitigate the blindness problem. We illustrate our proposed divergence in the context of density estimation and report improved performance compared to traditional approaches.
    AntBO: Towards Real-World Automated Antibody Design with Combinatorial Bayesian Optimisation. (arXiv:2201.12570v4 [q-bio.BM] UPDATED)
    Antibodies are canonically Y-shaped multimeric proteins capable of highly specific molecular recognition. The CDRH3 region located at the tip of variable chains of an antibody dominates antigen-binding specificity. Therefore, it is a priority to design optimal antigen-specific CDRH3 regions to develop therapeutic antibodies. However, the combinatorial nature of CDRH3 sequence space makes it impossible to search for an optimal binding sequence exhaustively and efficiently using computational approaches. Here, we present \texttt{AntBO}: a combinatorial Bayesian optimisation framework enabling efficient \textit{in silico} design of the CDRH3 region. Ideally, antibodies are expected to have high target specificity and developability. We introduce a CDRH3 trust region that restricts the search to sequences with favourable developability scores to achieve this goal. For benchmarking, \texttt{AntBO} uses the \texttt{Absolut!} software suite as a black-box oracle to score the target specificity and affinity of designed antibodies \textit{in silico} in an unconstrained fashion~\citep{robert2021one}. The experiments performed for $159$ discretised antigens used in \texttt{Absolut!} demonstrate the benefit of \texttt{AntBO} in designing CDRH3 regions with diverse biophysical properties. In under $200$ calls to black-box oracle, \texttt{AntBO} can suggest antibody sequences that outperform the best binding sequence drawn from 6.9 million experimentally obtained CDRH3s and a commonly used genetic algorithm baseline. Additionally, \texttt{AntBO} finds very-high affinity CDRH3 sequences in only 38 protein designs whilst requiring no domain knowledge. We conclude \texttt{AntBO} brings automated antibody design methods closer to what is practically viable for in vitro experimentation.
    Ordered Subgraph Aggregation Networks. (arXiv:2206.11168v3 [cs.LG] UPDATED)
    Numerous subgraph-enhanced graph neural networks (GNNs) have emerged recently, provably boosting the expressive power of standard (message-passing) GNNs. However, there is a limited understanding of how these approaches relate to each other and to the Weisfeiler-Leman hierarchy. Moreover, current approaches either use all subgraphs of a given size, sample them uniformly at random, or use hand-crafted heuristics instead of learning to select subgraphs in a data-driven manner. Here, we offer a unified way to study such architectures by introducing a theoretical framework and extending the known expressivity results of subgraph-enhanced GNNs. Concretely, we show that increasing subgraph size always increases the expressive power and develop a better understanding of their limitations by relating them to the established $k\text{-}\mathsf{WL}$ hierarchy. In addition, we explore different approaches for learning to sample subgraphs using recent methods for backpropagating through complex discrete probability distributions. Empirically, we study the predictive performance of different subgraph-enhanced GNNs, showing that our data-driven architectures increase prediction accuracy on standard benchmark datasets compared to non-data-driven subgraph-enhanced graph neural networks while reducing computation time.
    Markov Observation Models. (arXiv:2208.06368v2 [stat.ML] UPDATED)
    Herein, the Hidden Markov Model is expanded to allow for Markov chain observations. In particular, the observations are assumed to be a Markov chain whose one step transition probabilities depend upon the hidden Markov chain. An Expectation-Maximization analog to the Baum-Welch algorithm is developed for this more general model to estimate the transition probabilities for both the hidden state and for the observations as well as to estimate the probabilities for the initial joint hidden-state-observation distribution. A belief-state, or filter, recursion to track the hidden state then arises from the calculations of this Expectation-Maximization algorithm. A dynamic programming analog to the Viterbi algorithm is also developed to estimate the most likely sequence of hidden states given the sequence of observations.
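    The belief-state recursion has a compact form once the observation chain's transition kernel is conditioned on the hidden state; a minimal sketch with toy matrices (all values illustrative, not from the paper):

```python
# Hedged sketch of the filter recursion for an HMM whose observations form a
# Markov chain, i.e. the emission is p(y_t | y_{t-1}, x_t).
import numpy as np

def filter_markov_obs(A, B, pi, y):
    """A: (S,S) hidden transitions; B: (S,O,O) observation transitions per
    hidden state; pi: (S,O) initial joint distribution; y: observations."""
    alpha = pi[:, y[0]]
    alpha /= alpha.sum()                        # belief p(x_0 | y_0)
    for t in range(1, len(y)):
        alpha = (alpha @ A) * B[:, y[t - 1], y[t]]
        alpha /= alpha.sum()                    # normalize to a belief state
    return alpha

A = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[[0.8, 0.2], [0.3, 0.7]],        # obs chain under hidden state 0
              [[0.4, 0.6], [0.6, 0.4]]])       # obs chain under hidden state 1
pi = np.array([[0.25, 0.25], [0.25, 0.25]])
print(filter_markov_obs(A, B, pi, [0, 0, 1, 1, 1]))
```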
    A general framework for multi-step ahead adaptive conformal heteroscedastic time series forecasting. (arXiv:2207.14219v2 [stat.ML] UPDATED)
    The exponential growth of machine learning (ML) has prompted a great deal of interest in quantifying the uncertainty of each prediction for a user-defined level of confidence, since nowadays ML is increasingly being used in high-stakes settings. Reliable ML via prediction intervals (PIs) that jointly account for epistemic and aleatoric uncertainty is therefore imperative and is a step towards increased trust in model forecasts. Conformal prediction (CP) is a lightweight distribution-free uncertainty quantification framework that works for any black-box model, yielding PIs that are valid under the mild assumption of exchangeability. CP-type methods are gaining popularity due to being easy to implement and computationally cheap; however, the exchangeability assumption immediately excludes time series forecasting. Although recent papers tackle distribution shift and asymptotic versions of CP, this is not enough for the general time series forecasting problem of producing H-step ahead valid PIs. To attain such a goal, we propose a new method called AEnbMIMOCQR (Adaptive ensemble batch multi-input multi-output conformalized quantile regression), which produces valid PIs asymptotically and is appropriate for heteroscedastic time series. We compare the proposed method against state-of-the-art competitive methods on the NN5 forecasting competition dataset. All the code and data to reproduce the experiments are made available.
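    For contrast with the time-series setting the paper targets, the basic split conformal procedure it builds on looks as follows under exchangeability; the model and data are illustrative stand-ins.

```python
# Hedged sketch of plain split conformal prediction intervals.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (600, 1))
y = np.sin(X[:, 0]) + 0.2 * rng.standard_normal(600)

X_tr, y_tr = X[:300], y[:300]                   # proper training set
X_cal, y_cal = X[300:500], y[300:500]           # calibration set
X_te = X[500:]

model = GradientBoostingRegressor().fit(X_tr, y_tr)
scores = np.abs(y_cal - model.predict(X_cal))   # absolute residuals
alpha = 0.1
n = len(scores)
q = np.quantile(scores, np.ceil((1 - alpha) * (n + 1)) / n)  # finite-sample level
pred = model.predict(X_te)
lower, upper = pred - q, pred + q               # ~90% marginal coverage
print(lower[:3], upper[:3])
```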
    Domain Adaptation meets Individual Fairness. And they get along. (arXiv:2205.00504v2 [stat.ML] UPDATED)
    Many instances of algorithmic bias are caused by distributional shifts. For example, machine learning (ML) models often perform worse on demographic groups that are underrepresented in the training data. In this paper, we leverage this connection between algorithmic fairness and distribution shifts to show that algorithmic fairness interventions can help ML models overcome distribution shifts, and that domain adaptation methods (for overcoming distribution shifts) can mitigate algorithmic biases. In particular, we show that (i) enforcing suitable notions of individual fairness (IF) can improve the out-of-distribution accuracy of ML models under the covariate shift assumption and that (ii) it is possible to adapt representation alignment methods for domain adaptation to enforce individual fairness. The former is unexpected because IF interventions were not developed with distribution shifts in mind. The latter is also unexpected because representation alignment is not a common approach in the individual fairness literature.
    The alignment property of SGD noise and how it helps select flat minima: A stability analysis. (arXiv:2207.02628v3 [stat.ML] UPDATED)
    The phenomenon that stochastic gradient descent (SGD) favors flat minima has played a critical role in understanding the implicit regularization of SGD. In this paper, we provide an explanation of this striking phenomenon by relating the particular noise structure of SGD to its \emph{linear stability} (Wu et al., 2018). Specifically, we consider training over-parameterized models with square loss. We prove that if a global minimum $\theta^*$ is linearly stable for SGD, then it must satisfy $\|H(\theta^*)\|_F\leq O(\sqrt{B}/\eta)$, where $\|H(\theta^*)\|_F, B,\eta$ denote the Frobenius norm of Hessian at $\theta^*$, batch size, and learning rate, respectively. Otherwise, SGD will escape from that minimum \emph{exponentially} fast. Hence, for minima accessible to SGD, the sharpness -- as measured by the Frobenius norm of the Hessian -- is bounded \emph{independently} of the model size and sample size. The key to obtaining these results is exploiting the particular structure of SGD noise: The noise concentrates in sharp directions of local landscape and the magnitude is proportional to loss value. This alignment property of SGD noise provably holds for linear networks and random feature models (RFMs), and is empirically verified for nonlinear networks. Moreover, the validity and practical relevance of our theoretical findings are also justified by extensive experiments on CIFAR-10 dataset.
    Bregman dynamics, contact transformations and convex optimization. (arXiv:1912.02928v4 [math.OC] UPDATED)
    Recent research on accelerated gradient methods of use in optimization has demonstrated that these methods can be derived as discretizations of dynamical systems. This, in turn, has provided a basis for more systematic investigations, especially into the geometric structure of those dynamical systems and their structure-preserving discretizations. In this work, we introduce dynamical systems defined through a contact geometry which are not only naturally suited to the optimization goal but also subsume all previous methods based on geometric dynamical systems. As a consequence, all the deterministic flows used in optimization share an extremely interesting geometric property: they are invariant under contact transformations. In our main result, we exploit this observation to show that the celebrated Bregman Hamiltonian system can always be transformed into an equivalent but separable Hamiltonian by means of a contact transformation. This in turn enables the development of fast and robust discretizations through geometric contact splitting integrators. As an illustration, we propose the Relativistic Bregman algorithm, and show in some paradigmatic examples that it compares favorably with respect to standard optimization algorithms such as classical momentum and Nesterov's accelerated gradient.
    TabNAS: Rejection Sampling for Neural Architecture Search on Tabular Datasets. (arXiv:2204.07615v3 [cs.LG] UPDATED)
    The best neural architecture for a given machine learning problem depends on many factors: not only the complexity and structure of the dataset, but also on resource constraints including latency, compute, energy consumption, etc. Neural architecture search (NAS) for tabular datasets is an important but under-explored problem. Previous NAS algorithms designed for image search spaces incorporate resource constraints directly into the reinforcement learning (RL) rewards. However, for NAS on tabular datasets, this protocol often discovers suboptimal architectures. This paper develops TabNAS, a new and more effective approach to handle resource constraints in tabular NAS using an RL controller motivated by the idea of rejection sampling. TabNAS immediately discards any architecture that violates the resource constraints without training or learning from that architecture. TabNAS uses a Monte-Carlo-based correction to the RL policy gradient update to account for this extra filtering step. Results on several tabular datasets demonstrate the superiority of TabNAS over previous reward-shaping methods: it finds better models that obey the constraints.
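    The rejection step can be sketched as follows; the search space, cost model, and controller are toy stand-ins, and the Monte-Carlo policy-gradient correction the paper introduces is omitted.

```python
# Hedged sketch of rejection sampling in a NAS controller: architectures that
# violate the resource constraint are discarded before any training.
import numpy as np

rng = np.random.default_rng(0)
hidden_sizes = [32, 64, 128, 256]               # hypothetical search space
probs = np.full((2, 4), 0.25)                   # controller: 2 layers, 4 choices each

def param_cost(arch, n_in=16, n_out=1):
    """Parameter count of a dense network with the sampled hidden sizes."""
    sizes = [n_in] + [hidden_sizes[i] for i in arch] + [n_out]
    return sum(a * b for a, b in zip(sizes, sizes[1:]))

budget = 20_000
accepted = []
while len(accepted) < 5:
    arch = tuple(rng.choice(4, p=probs[l]) for l in range(2))
    if param_cost(arch) <= budget:              # reject infeasible samples for free
        accepted.append(arch)
print(accepted)
```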
    Posterior Refinement Improves Sample Efficiency in Bayesian Neural Networks. (arXiv:2205.10041v2 [cs.LG] UPDATED)
    Monte Carlo (MC) integration is the de facto method for approximating the predictive distribution of Bayesian neural networks (BNNs). But, even with many MC samples, Gaussian-based BNNs could still yield bad predictive performance due to the posterior approximation's error. Meanwhile, alternatives to MC integration tend to be more expensive or biased. In this work, we experimentally show that the key to good MC-approximated predictive distributions is the quality of the approximate posterior itself. However, previous methods for obtaining accurate posterior approximations are expensive and non-trivial to implement. We, therefore, propose to refine Gaussian approximate posteriors with normalizing flows. When applied to last-layer BNNs, it yields a simple \emph{post hoc} method for improving pre-existing parametric approximations. We show that the resulting posterior approximation is competitive with even the gold-standard full-batch Hamiltonian Monte Carlo.
    Adapting to Online Label Shift with Provable Guarantees. (arXiv:2207.02121v2 [cs.LG] UPDATED)
    The standard supervised learning paradigm works effectively when training data shares the same distribution as the upcoming testing samples. However, this stationary assumption is often violated in real-world applications, especially when testing data appear in an online fashion. In this paper, we formulate and investigate the problem of \emph{online label shift} (OLaS): the learner trains an initial model from the labeled offline data and then deploys it to an unlabeled online environment where the underlying label distribution changes over time but the label-conditional density does not. The non-stationarity nature and the lack of supervision make the problem challenging to be tackled. To address the difficulty, we construct a new unbiased risk estimator that utilizes the unlabeled data, which exhibits many benign properties albeit with potential non-convexity. Building upon that, we propose novel online ensemble algorithms to deal with the non-stationarity of the environments. Our approach enjoys optimal \emph{dynamic regret}, indicating that the performance is competitive with a clairvoyant who knows the online environments in hindsight and then chooses the best decision for each round. The obtained dynamic regret bound scales with the intensity and pattern of label distribution shift, hence exhibiting the adaptivity in the OLaS problem. Extensive experiments are conducted to validate the effectiveness and support our theoretical findings.
    Variational inference via Wasserstein gradient flows. (arXiv:2205.15902v2 [stat.ML] UPDATED)
    Along with Markov chain Monte Carlo (MCMC) methods, variational inference (VI) has emerged as a central computational approach to large-scale Bayesian inference. Rather than sampling from the true posterior $\pi$, VI aims at producing a simple but effective approximation $\hat \pi$ to $\pi$ for which summary statistics are easy to compute. However, unlike the well-studied MCMC methodology, algorithmic guarantees for VI are still relatively less well-understood. In this work, we propose principled methods for VI, in which $\hat \pi$ is taken to be a Gaussian or a mixture of Gaussians, which rest upon the theory of gradient flows on the Bures--Wasserstein space of Gaussian measures. Akin to MCMC, it comes with strong theoretical guarantees when $\pi$ is log-concave.
    Topology-aware Generalization of Decentralized SGD. (arXiv:2206.12680v3 [cs.LG] UPDATED)
    This paper studies the algorithmic stability and generalizability of decentralized stochastic gradient descent (D-SGD). We prove that the consensus model learned by D-SGD is $\mathcal{O}{(n^{-1}+ m^{-1} +\lambda^2)}$-stable in expectation in the non-convex non-smooth setting, where $n$ is the local sample size on each worker, $m$ is the worker number, and $1-\lambda$ is the spectral gap that measures the connectivity of the communication topology. These results then deliver an $\mathcal{O}{(n^{-(1+\alpha)/2}+ m^{-(1+\alpha)/2}+ \lambda^{1+\alpha} + \phi_\mathcal{S})}$ in-average generalization bound, which is non-vacuous even when $\lambda$ is close to $1$, in contrast to the vacuous bounds suggested by existing literature on the projected version of D-SGD. Our theory indicates that the generalizability of D-SGD has a positive correlation with the spectral gap, and can explain why consensus control in the initial training phase can ensure better generalization. Experiments of VGG-11 and ResNet-18 on CIFAR-10, CIFAR-100 and Tiny-ImageNet justify our theory. To the best of our knowledge, this is the first work on the topology-aware generalization of vanilla D-SGD. Code is available at \url{https://github.com/Raiden-Zhu/Generalization-of-DSGD}.
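    The spectral gap $1-\lambda$ appearing in these bounds is easy to compute for standard topologies; a minimal sketch with uniform-weight mixing matrices (our own illustrative choices):

```python
# Hedged sketch: spectral gap of doubly-stochastic gossip matrices.
import numpy as np

def spectral_gap(W):
    eig = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eig[1]                         # 1 - second-largest |eigenvalue|

def ring(m):
    """Ring topology with uniform 1/3 weights on self and both neighbors."""
    W = np.zeros((m, m))
    for i in range(m):
        W[i, i] = W[i, (i - 1) % m] = W[i, (i + 1) % m] = 1 / 3
    return W

m = 16
print("complete:", spectral_gap(np.full((m, m), 1 / m)))   # gap = 1
print("ring:    ", spectral_gap(ring(m)))                  # gap shrinks as m grows
```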
    Deep Learning Aided Laplace Based Bayesian Inference for Epidemiological Systems. (arXiv:2210.08865v1 [stat.CO])
    Parameter estimation and associated uncertainty quantification is an important problem in dynamical systems characterized by ordinary differential equation (ODE) models that are often nonlinear. Typically, such models have analytically intractable trajectories which result in likelihoods and posterior distributions that are similarly intractable. Bayesian inference for ODE systems via simulation methods requires numerical approximations to produce inference with high accuracy, at a cost of heavy computational power and slow convergence. At the same time, Artificial Neural Networks (ANNs) offer tractability that can be utilized to construct an approximate but tractable likelihood and posterior distribution. In this paper we propose a hybrid approach, where Laplace-based Bayesian inference is combined with an ANN architecture for obtaining approximations to the ODE trajectories as a function of the unknown initial values and system parameters. Suitable choices of a collocation grid and customized loss functions are proposed to fine-tune the ODE trajectories and Laplace approximation. The effectiveness of our proposed methods is demonstrated using an epidemiological system with non-analytical solutions, the Susceptible-Infectious-Removed (SIR) model for infectious diseases, based on simulated and real-life influenza datasets. The novelty and attractiveness of our proposed approach include (i) a new development of Bayesian inference using ANN architectures for ODE based dynamical systems, and (ii) a computationally fast posterior inference by avoiding convergence issues of benchmark Markov Chain Monte Carlo methods. These two features establish the developed approach as an accurate alternative to traditional Bayesian computational methods, with improved computational cost.
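    The SIR forward model at the core of the proposed pipeline can be sketched with SciPy; parameter values below are illustrative, not fitted to any dataset.

```python
# Hedged sketch of the SIR ODE system solved numerically.
import numpy as np
from scipy.integrate import solve_ivp

def sir(t, y, beta, gamma):
    S, I, R = y
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

beta, gamma = 0.5, 0.1                          # transmission / recovery rates
y0 = [0.99, 0.01, 0.0]                          # normalized initial S, I, R
sol = solve_ivp(sir, (0, 100), y0, args=(beta, gamma),
                t_eval=np.linspace(0, 100, 11))
print(np.round(sol.y[1], 3))                    # infectious fraction over time
```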
    Hidden Progress in Deep Learning: SGD Learns Parities Near the Computational Limit. (arXiv:2207.08799v2 [cs.LG] UPDATED)
    There is mounting empirical evidence of emergent phenomena in the capabilities of deep learning methods as we scale up datasets, model sizes, and training times. While there are some accounts of how these resources modulate statistical capacity, far less is known about their effect on the computational problem of model training. This work conducts such an exploration through the lens of learning $k$-sparse parities of $n$ bits, a canonical family of problems which pose theoretical computational barriers. In this setting, we find that neural networks exhibit surprising phase transitions when scaling up dataset size and running time. In particular, we demonstrate empirically that with standard training, a variety of architectures learn sparse parities with $n^{O(k)}$ examples, with loss (and error) curves abruptly dropping after $n^{O(k)}$ iterations. These positive results nearly match known SQ lower bounds, even without an explicit sparsity-promoting prior. We elucidate the mechanisms of these phenomena with a theoretical analysis: we find that the phase transition in performance is not due to SGD "stumbling in the dark" until it finds the hidden set of features (a natural algorithm which also runs in $n^{O(k)}$ time); instead, we show that SGD gradually amplifies a Fourier gap in the population gradient.
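    The learning problem itself is easy to instantiate; a minimal sketch of generating $(n, k)$-sparse parity data, where labels are the XOR of a hidden $k$-subset of the $n$ input bits:

```python
# Hedged sketch of the sparse-parity data-generating process studied above.
import numpy as np

def sparse_parity_batch(n, k, batch, rng, support=None):
    if support is None:
        support = rng.choice(n, size=k, replace=False)  # hidden feature set
    X = rng.integers(0, 2, size=(batch, n))
    y = X[:, support].sum(axis=1) % 2                   # parity of the subset
    return X, y, support

rng = np.random.default_rng(0)
X, y, S = sparse_parity_batch(n=50, k=3, batch=8, rng=rng)
print("hidden support:", S, " labels:", y)
```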
    Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity. (arXiv:2206.07659v2 [cs.LG] UPDATED)
    We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.
    On Privacy and Personalization in Cross-Silo Federated Learning. (arXiv:2206.07902v2 [cs.LG] UPDATED)
    While the application of differential privacy (DP) has been well-studied in cross-device federated learning (FL), there is a lack of work considering DP and its implications for cross-silo FL, a setting characterized by a limited number of clients each containing many data subjects. In cross-silo FL, usual notions of client-level DP are less suitable as real-world privacy regulations typically concern the in-silo data subjects rather than the silos themselves. In this work, we instead consider an alternative notion of silo-specific sample-level DP, where silos set their own privacy targets for their local examples. Under this setting, we reconsider the roles of personalization in federated learning. In particular, we show that mean-regularized multi-task learning (MR-MTL), a simple personalization framework, is a strong baseline for cross-silo FL: under stronger privacy requirements, silos are incentivized to federate more with each other to mitigate DP noise, resulting in consistent improvements relative to standard baseline methods. We provide an empirical study of competing methods as well as a theoretical characterization of MR-MTL for mean estimation, highlighting the interplay between privacy and cross-silo data heterogeneity. Our work serves to establish baselines for private cross-silo FL as well as identify key directions of future work in this area.
    Adversarial Robustness is at Odds with Lazy Training. (arXiv:2207.00411v2 [cs.CR] UPDATED)
    Recent works show that adversarial examples exist for random neural networks [Daniely and Schacham, 2020] and that these examples can be found using a single step of gradient ascent [Bubeck et al., 2021]. In this work, we extend this line of work to "lazy training" of neural networks -- a dominant model in deep learning theory in which neural networks are provably efficiently learnable. We show that over-parametrized neural networks that are guaranteed to generalize well and enjoy strong computational guarantees remain vulnerable to attacks generated using a single step of gradient ascent.
    Overparameterization from Computational Constraints. (arXiv:2208.12926v2 [cs.LG] UPDATED)
    Overparameterized models with millions of parameters have been hugely successful. In this work, we ask: can the need for large models be, at least in part, due to the \emph{computational} limitations of the learner? Additionally, we ask, is this situation exacerbated for \emph{robust} learning? We show that this indeed could be the case. We show learning tasks for which computationally bounded learners need \emph{significantly more} model parameters than what information-theoretic learners need. Furthermore, we show that even more model parameters could be necessary for robust learning. In particular, for computationally bounded learners, we extend the recent result of Bubeck and Sellke [NeurIPS'2021] which shows that robust models might need more parameters, to the computational regime and show that bounded learners could provably need an even larger number of parameters. Then, we address the following related question: can we hope to remedy the situation for robust computationally bounded learning by restricting \emph{adversaries} to also be computationally bounded for sake of obtaining models with fewer parameters? Here again, we show that this could be possible. Specifically, building on the work of Garg, Jha, Mahloujifar, and Mahmoody [ALT'2020], we demonstrate a learning task that can be learned efficiently and robustly against a computationally bounded attacker, while to be robust against an information-theoretic attacker requires the learner to utilize significantly more parameters.
    Packed-Ensembles for Efficient Uncertainty Estimation. (arXiv:2210.09184v1 [cs.LG])
    Deep Ensembles (DE) are a prominent approach achieving excellent performance on key metrics such as accuracy, calibration, uncertainty estimation, and out-of-distribution detection. However, hardware limitations of real-world systems constrain them to smaller ensembles and lower-capacity networks, significantly deteriorating their performance and properties. We introduce Packed-Ensembles (PE), a strategy to design and train lightweight structured ensembles by carefully modulating the dimension of their encoding space. We leverage grouped convolutions to parallelize the ensemble into a single common backbone and forward pass to improve training and inference speeds. PE is designed to work under the memory budget of a single standard neural network. Through extensive studies we show that PE faithfully preserves the properties of DE, e.g., diversity, and matches their performance in terms of accuracy, calibration, out-of-distribution detection, and robustness to distribution shift.
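    The grouped-convolution idea can be illustrated in a few lines of PyTorch: setting `groups=M` partitions the channels so that one backbone and one forward pass emulate M independent ensemble members. This is a schematic of the mechanism only, not the authors' architecture.

```python
import torch
import torch.nn as nn

M = 4  # ensemble size packed into one backbone

class PackedConvBlock(nn.Module):
    """groups=M makes each member's channels interact only with its own
    group, so one forward pass runs all M members in parallel."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch * M, out_ch * M, kernel_size=3,
                              padding=1, groups=M)
        self.act = nn.ReLU()

    def forward(self, x):
        return self.act(self.conv(x))

x = torch.randn(8, 3, 32, 32)
x = x.repeat(1, M, 1, 1)          # replicate the input once per member
y = PackedConvBlock(3, 16)(x)     # shape (8, 16*M, 32, 32)
members = y.chunk(M, dim=1)       # per-member feature maps
print(len(members), members[0].shape)
```

    Averaging the per-member outputs recovers the usual ensemble prediction, while their spread can serve as an uncertainty estimate.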
    Temporal-Spatial dependencies ENhanced deep learning model (TSEN) for household leverage series forecasting. (arXiv:2210.08668v1 [cs.LG])
    Building an accurate model for financial time series forecasting requires analyzing both temporal and spatial patterns, which is challenging due to the complex nature of temporal-spatial dynamics: time series from different locations often have distinct patterns, and for the same time series, patterns may vary over time. Inspired by the successful applications of deep learning, we propose a new model for forecasting household leverage in China. Our solution consists of multiple RNN-based layers and an attention layer: each RNN-based layer automatically learns the temporal pattern of a specific series together with multivariate exogenous series, and the attention layer then learns the spatial correlation weights and obtains the global representations simultaneously. The results show that the new approach captures the temporal-spatial dynamics of household leverage well and yields more accurate and robust predictions. Moreover, the simulation studies also show that clustering and choosing correlated series are necessary to obtain accurate forecasting results.
    Forward-Backward Latent State Inference for Hidden Continuous-Time semi-Markov Chains. (arXiv:2210.09058v1 [stat.ML])
    Hidden semi-Markov Models (HSMMs) - while broadly in use - are restricted to a discrete and uniform time grid. They are thus not well suited to explain often irregularly spaced discrete event data from continuous-time phenomena. We show that non-sampling-based latent state inference used in HSMMs can be generalized to latent Continuous-Time semi-Markov Chains (CTSMCs). We formulate integro-differential forward and backward equations adjusted to the observation likelihood and introduce an exact integral equation for the Bayesian posterior marginals and a scalable Viterbi-type algorithm for posterior path estimates. The presented equations can be efficiently solved using well-known numerical methods. As a practical tool, variable-step HSMMs are introduced. We evaluate our approaches in latent state inference scenarios in comparison to classical HSMMs.
    Accelerating Inhibitor Discovery With A Deep Generative Foundation Model: Validation for SARS-CoV-2 Drug Targets. (arXiv:2204.09042v3 [q-bio.QM] UPDATED)
    The discovery of novel inhibitor molecules for emerging drug-target proteins is widely acknowledged as a challenging inverse design problem: Exhaustive exploration of the vast chemical search space is impractical, especially when the target structure or active molecules are unknown. Here we experimentally validate the broad utility of a deep generative framework trained at scale on protein sequences, small molecules, and their mutual interactions -- one that is unbiased toward any specific target. As demonstrators, we consider two dissimilar and relevant SARS-CoV-2 targets: the main protease and the spike protein (receptor binding domain, RBD). To perform target-aware design of novel inhibitor molecules, protein sequence-conditioned sampling on the generative foundation model is performed. Despite using only the target sequence information, and without performing any target-specific adaptation of the generative model, micromolar-level inhibition was observed in in vitro experiments for two candidates out of only four synthesized for each target. The most potent spike RBD inhibitor also exhibited activity against several variants in live virus neutralization assays. These results establish that a single, broadly deployable generative foundation model for accelerated hit discovery is effective and efficient, even in the most general case where neither target structure nor binder information is available.
    Improved Robust Algorithms for Learning with Discriminative Feature Feedback. (arXiv:2209.03753v2 [cs.LG] UPDATED)
    Discriminative Feature Feedback is a setting proposed by Dasgupta et al. (2018), which provides a protocol for interactive learning based on feature explanations that are provided by a human teacher. The features distinguish between the labels of pairs of possibly similar instances. That work has shown that learning in this model can have considerable statistical and computational advantages over learning in standard label-based interactive learning models. In this work, we provide new robust interactive learning algorithms for the Discriminative Feature Feedback model, with mistake bounds that are significantly lower than those of previous robust algorithms for this setting. In the adversarial setting, we reduce the dependence on the number of protocol exceptions from quadratic to linear. In addition, we provide an algorithm for a slightly more restricted model, which obtains an even smaller mistake bound for large models with many exceptions. In the stochastic setting, we provide the first algorithm that converges to the exception rate with a polynomial sample complexity. Our algorithm and analysis for the stochastic setting involve a new construction that we call Feature Influence, which may be of wider applicability.
    On generalization bounds for deep networks based on loss surface implicit regularization. (arXiv:2201.04545v3 [stat.ML] UPDATED)
    Classical statistical learning theory implies that fitting too many parameters leads to overfitting and poor performance. That modern deep neural networks generalize well despite a large number of parameters contradicts this finding and constitutes a major unsolved problem towards explaining the success of deep learning. While previous work focuses on the implicit regularization induced by stochastic gradient descent (SGD), we study here how the local geometry of the energy landscape around local minima affects the statistical properties of SGD with Gaussian gradient noise. We argue that under reasonable assumptions, the local geometry forces SGD to stay close to a low dimensional subspace and that this induces another form of implicit regularization and results in tighter bounds on the generalization error for deep neural networks. To derive generalization error bounds for neural networks, we first introduce a notion of stagnation sets around the local minima and impose a local essential convexity property of the population risk. Under these conditions, we derive lower bounds on the probability that SGD remains in these stagnation sets. If stagnation occurs, we derive a bound on the generalization error of deep neural networks involving the spectral norms of the weight matrices but not the number of network parameters. Technically, our proofs are based on controlling the change of parameter values in the SGD iterates and local uniform convergence of the empirical loss functions based on the entropy of suitable neighborhoods around local minima.
    Statistical, Robustness, and Computational Guarantees for Sliced Wasserstein Distances. (arXiv:2210.09160v1 [stat.ML])
    Sliced Wasserstein distances preserve properties of classic Wasserstein distances while being more scalable for computation and estimation in high dimensions. The goal of this work is to quantify this scalability from three key aspects: (i) empirical convergence rates; (ii) robustness to data contamination; and (iii) efficient computational methods. For empirical convergence, we derive fast rates with explicit dependence of constants on dimension, subject to log-concavity of the population distributions. For robustness, we characterize minimax optimal, dimension-free robust estimation risks, and show an equivalence between robust sliced 1-Wasserstein estimation and robust mean estimation. This enables lifting statistical and algorithmic guarantees available for the latter to the sliced 1-Wasserstein setting. Moving on to computational aspects, we analyze the Monte Carlo estimator for the average-sliced distance, demonstrating that larger dimension can result in faster convergence of the numerical integration error. For the max-sliced distance, we focus on a subgradient-based local optimization algorithm that is frequently used in practice, albeit without formal guarantees, and establish an $O(\epsilon^{-4})$ computational complexity bound for it. Our theory is validated by numerical experiments, which altogether provide a comprehensive quantitative account of the scalability question.
    CARD: Classification and Regression Diffusion Models. (arXiv:2206.07275v3 [stat.ML] UPDATED)
    Learning the distribution of a continuous or categorical response variable $\boldsymbol y$ given its covariates $\boldsymbol x$ is a fundamental problem in statistics and machine learning. Deep neural network-based supervised learning algorithms have made great progress in predicting the mean of $\boldsymbol y$ given $\boldsymbol x$, but they are often criticized for their limited ability to accurately capture the uncertainty of their predictions. In this paper, we introduce classification and regression diffusion (CARD) models, which combine a denoising diffusion-based conditional generative model and a pre-trained conditional mean estimator, to accurately predict the distribution of $\boldsymbol y$ given $\boldsymbol x$. We demonstrate the outstanding ability of CARD in conditional distribution prediction with both toy examples and real-world datasets, the experimental results on which show that CARD in general outperforms state-of-the-art methods, including Bayesian neural network-based ones that are designed for uncertainty estimation, especially when the conditional distribution of $\boldsymbol y$ given $\boldsymbol x$ is multi-modal. In addition, we utilize the stochastic nature of the generative model outputs to obtain a finer granularity in model confidence assessment at the instance level for classification tasks.
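    A rough sketch of the training objective for the regression case, assuming a plain conditional denoising-diffusion loss in which the covariates and the pre-trained mean estimator's output enter as conditioning inputs; CARD's actual forward process is additionally centered on the mean estimator, which this toy omits.

```python
import torch
import torch.nn as nn

T = 100
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

f = lambda x: 2.0 * x  # stand-in for the pre-trained conditional mean estimator

# eps_theta predicts the injected noise from (y_t, x, f(x), t)
eps_theta = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))

def card_style_loss(x, y):
    t = torch.randint(0, T, (x.shape[0],))
    ab = alpha_bar[t].unsqueeze(1)
    eps = torch.randn_like(y)
    y_t = ab.sqrt() * y + (1.0 - ab).sqrt() * eps        # forward noising of y
    inp = torch.cat([y_t, x, f(x), t.float().unsqueeze(1) / T], dim=1)
    return ((eps_theta(inp) - eps) ** 2).mean()

# Toy regression data whose conditional mean matches f(x)
x = torch.rand(256, 1)
y = 2.0 * x + 0.1 * torch.randn(256, 1)
opt = torch.optim.Adam(eps_theta.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = card_style_loss(x, y)
    loss.backward()
    opt.step()
print(loss.item())
```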
    Robust Low-tubal-rank Tensor Completion based on Tensor Factorization and Maximum Correntropy Criterion. (arXiv:2010.11740v2 [cs.LG] UPDATED)
    The goal of tensor completion is to recover a tensor from a subset of its entries, often by exploiting its low-rank property. Among several useful definitions of tensor rank, the low-tubal-rank was shown to give a valuable characterization of the inherent low-rank structure of a tensor. While some low-tubal-rank tensor completion algorithms with favorable performance have been recently proposed, these algorithms utilize second-order statistics to measure the error residual, which may not work well when the observed entries contain large outliers. In this paper, we propose a new objective function for low-tubal-rank tensor completion, which uses correntropy as the error measure to mitigate the effect of the outliers. To efficiently optimize the proposed objective, we leverage a half-quadratic minimization technique whereby the optimization is transformed to a weighted low-tubal-rank tensor factorization problem. Subsequently, we propose two simple and efficient algorithms to obtain the solution and provide their convergence and complexity analysis. Numerical results using both synthetic and real data demonstrate the robust and superior performance of the proposed algorithms.
    Leveraging Unlabeled Data to Predict Out-of-Distribution Performance. (arXiv:2201.04234v3 [cs.LG] UPDATED)
    Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions that may cause performance drops. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold. ATC outperforms previous methods across several model architectures, types of distribution shifts (e.g., due to synthetic corruptions, dataset reproduction, or novel subpopulations), and datasets (Wilds, ImageNet, Breeds, CIFAR, and MNIST). In our experiments, ATC estimates target performance $2$-$4\times$ more accurately than prior methods. We also explore the theoretical foundations of the problem, proving that, in general, identifying the accuracy is just as hard as identifying the optimal predictor and thus, the efficacy of any method rests upon (perhaps unstated) assumptions on the nature of the shift. Finally, analyzing our method on some toy distributions, we provide insights concerning when it works. Code is available at https://github.com/saurabhgarg1996/ATC_code/.
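    The ATC rule is simple to state in code; a numpy sketch, assuming max-softmax confidence as the score (the paper also considers negative entropy), with synthetic confidences standing in for real model outputs:

```python
import numpy as np

def atc_estimate(source_conf, source_correct, target_conf):
    """Average Thresholded Confidence: choose the threshold t so that the
    fraction of source confidences above t matches source accuracy, then
    report the fraction of target confidences above t."""
    t = np.quantile(source_conf, 1.0 - source_correct.mean())
    return (target_conf > t).mean()

# Toy example with synthetic confidences
rng = np.random.default_rng(0)
source_conf = rng.beta(5, 2, size=10_000)           # labeled source scores
source_correct = rng.random(10_000) < source_conf   # correctness ~ confidence
target_conf = rng.beta(4, 3, size=10_000)           # unlabeled target scores
print(f"predicted target accuracy: "
      f"{atc_estimate(source_conf, source_correct, target_conf):.3f}")
```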
    Adversarial Meta-Learning of Gamma-Minimax Estimators That Leverage Prior Knowledge. (arXiv:2012.05465v3 [stat.ME] UPDATED)
    Bayes estimators are well known to provide a means to incorporate prior knowledge that can be expressed in terms of a single prior distribution. However, when this knowledge is too vague to express with a single prior, an alternative approach is needed. Gamma-minimax estimators provide such an approach. These estimators minimize the worst-case Bayes risk over a set $\Gamma$ of prior distributions that are compatible with the available knowledge. Traditionally, Gamma-minimaxity is defined for parametric models. In this work, we define Gamma-minimax estimators for general models and propose adversarial meta-learning algorithms to compute them when the set of prior distributions is constrained by generalized moments. Accompanying convergence guarantees are also provided. We also introduce a neural network class that provides a rich, but finite-dimensional, class of estimators from which a Gamma-minimax estimator can be selected. We illustrate our method in two settings, namely entropy estimation and a prediction problem that arises in biodiversity studies.
    Learning Predictions for Algorithms with Predictions. (arXiv:2202.09312v2 [cs.LG] UPDATED)
    A burgeoning paradigm in algorithm design is the field of algorithms with predictions, in which algorithms can take advantage of a possibly-imperfect prediction of some aspect of the problem. While much work has focused on using predictions to improve competitive ratios, running times, or other performance measures, less effort has been devoted to the question of how to obtain the predictions themselves, especially in the critical online setting. We introduce a general design approach for algorithms that learn predictors: (1) identify a functional dependence of the performance measure on the prediction quality and (2) apply techniques from online learning to learn predictors, tune robustness-consistency trade-offs, and bound the sample complexity. We demonstrate the effectiveness of our approach by applying it to bipartite matching, ski-rental, page migration, and job scheduling. In several settings we improve upon multiple existing results with a much simpler analysis, while in others we provide the first learning-theoretic guarantees.
    ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions. (arXiv:2106.12231v2 [stat.ML] UPDATED)
    We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.
    On the Identifiability of Nonlinear ICA: Sparsity and Beyond. (arXiv:2206.07751v2 [cs.LG] UPDATED)
    Nonlinear independent component analysis (ICA) aims to recover the underlying independent latent sources from their observable nonlinear mixtures. How to make the nonlinear ICA model identifiable up to certain trivial indeterminacies is a long-standing problem in unsupervised learning. Recent breakthroughs reformulate the standard independence assumption of sources as conditional independence given some auxiliary variables (e.g., class labels and/or domain/time indexes) as weak supervision or inductive bias. However, nonlinear ICA with unconditional priors cannot benefit from such developments. We explore an alternative path and consider only assumptions on the mixing process, such as Structural Sparsity. We show that under specific instantiations of such constraints, the independent latent sources can be identified from their nonlinear mixtures up to a permutation and a component-wise transformation, thus achieving nontrivial identifiability of nonlinear ICA without auxiliary variables. We provide estimation methods and validate the theoretical results experimentally. The results on image data suggest that our conditions may hold in a number of practical data generating processes.
    Multi-Armed Bandits with Censored Consumption of Resources. (arXiv:2011.00813v2 [cs.LG] UPDATED)
    We consider a resource-aware variant of the classical multi-armed bandit problem: In each round, the learner selects an arm and determines a resource limit. It then observes a corresponding (random) reward, provided the (random) amount of consumed resources remains below the limit. Otherwise, the observation is censored, i.e., no reward is obtained. For this problem setting, we introduce a measure of regret, which incorporates the actual amount of allocated resources of each learning round as well as the optimality of realizable rewards. Thus, to minimize regret, the learner needs to set a resource limit and choose an arm in such a way that the chance to realize a high reward within the predefined resource limit is high, while the resource limit itself should be kept as low as possible. We propose a UCB-inspired online learning algorithm, which we analyze theoretically in terms of its regret upper bound. In a simulation study, we show that our learning algorithm outperforms straightforward extensions of standard multi-armed bandit algorithms.  ( 2 min )
    Conditional Neural Processes for Molecules. (arXiv:2210.09211v1 [stat.ML])
    Neural processes (NPs) are models for transfer learning with properties reminiscent of Gaussian Processes (GPs). They are adept at modelling data consisting of few observations of many related functions on the same input space and are trained by minimizing a variational objective, which is computationally much less expensive than the Bayesian updating required by GPs. So far, most studies of NPs have focused on low-dimensional datasets which are not representative of realistic transfer learning tasks. Drug discovery is one application area that is characterized by datasets consisting of many chemical properties or functions which are sparsely observed, yet depend on shared features or representations of the molecular inputs. This paper applies the conditional neural process (CNP) to DOCKSTRING, a dataset of docking scores for benchmarking ML models. CNPs show competitive performance in few-shot learning tasks relative to supervised learning baselines common in QSAR modelling, as well as to an alternative model for transfer learning based on pre-training and refining neural network regressors. We present a Bayesian optimization experiment which showcases the probabilistic nature of CNPs and discuss shortcomings of the model in uncertainty quantification.
    Sparse Kronecker Product Decomposition: A General Framework of Signal Region Detection in Image Regression. (arXiv:2210.09128v1 [cs.CV])
    This paper presents the first frequentist framework for signal region detection in high-resolution and high-order image regression problems. Image data and scalar-on-image regression have been intensively studied in recent years. However, most existing studies on such topics focus on outcome prediction, while research on image region detection is rather limited, even though the latter is often more important. In this paper, we develop a general framework named Sparse Kronecker Product Decomposition (SKPD) to tackle this issue. The SKPD framework is general in the sense that it works for image data represented as both matrices (e.g., 2D grayscale images) and (high-order) tensors (e.g., 2D colored images, brain MRI/fMRI data). Moreover, unlike many Bayesian approaches, our framework is computationally scalable for high-resolution image problems. Specifically, our framework includes: 1) the one-term SKPD; 2) the multi-term SKPD; and 3) the nonlinear SKPD. We propose nonconvex optimization problems to estimate the one-term and multi-term SKPDs and develop path-following algorithms for the nonconvex optimization. The computed solutions of the path-following algorithms are guaranteed to converge to the truth with a suitably chosen initialization, even though the optimization is nonconvex. Moreover, region detection consistency can also be guaranteed by the one-term and multi-term SKPD. The nonlinear SKPD is closely connected to shallow convolutional neural networks (CNNs), in particular CNNs with one convolutional layer and one fully connected layer. The effectiveness of SKPDs is validated on real brain imaging data from the UK Biobank database.  ( 3 min )
    Movement Penalized Bayesian Optimization with Application to Wind Energy Systems. (arXiv:2210.08087v1 [stat.ML])
    Contextual Bayesian optimization (CBO) is a powerful framework for sequential decision-making given side information, with important applications, e.g., in wind energy systems. In this setting, the learner receives context (e.g., weather conditions) at each round, and has to choose an action (e.g., turbine parameters). Standard algorithms assume no cost for switching their decisions at every round. However, in many practical applications, there is a cost associated with such changes, which should be minimized. We introduce the episodic CBO with movement costs problem and, based on the online learning approach for metrical task systems of Coester and Lee (2019), propose a novel randomized mirror descent algorithm that makes use of Gaussian Process confidence bounds. We compare its performance with the offline optimal sequence for each episode and provide rigorous regret guarantees. We further demonstrate our approach on the important real-world application of altitude optimization for Airborne Wind Energy Systems. In the presence of substantial movement costs, our algorithm consistently outperforms standard CBO algorithms.
    Operator Shifting for Model-based Policy Evaluation. (arXiv:2110.12658v2 [cs.LG] UPDATED)
    In model-based reinforcement learning, the transition matrix and reward vector are often estimated from random samples subject to noise. Even if the estimated model is an unbiased estimate of the true underlying model, the value function computed from the estimated model is biased. We introduce an operator shifting method for reducing the error introduced by the estimated model. When the error is measured in the residual norm, we prove that the shifting factor is always positive and upper bounded by $1+O\left(1/n\right)$, where $n$ is the number of samples used in learning each row of the transition matrix. We also propose a practical numerical algorithm for implementing the operator shifting.
    Score-Based Generative Models Detect Manifolds. (arXiv:2206.01018v3 [stat.ML] UPDATED)
    Score-based generative models (SGMs) need to approximate the scores $\nabla \log p_t$ of the intermediate distributions as well as the final distribution $p_T$ of the forward process. The theoretical underpinnings of the effects of these approximations are still lacking. We find precise conditions under which SGMs are able to produce samples from an underlying (low-dimensional) data manifold $\mathcal{M}$. This assures us that SGMs are able to generate the "right kind of samples". For example, taking $\mathcal{M}$ to be the subset of images of faces, we find conditions under which the SGM robustly produces an image of a face, even though the relative frequencies of these images might not accurately represent the true data generating distribution. Moreover, this analysis is a first step towards understanding the generalization properties of SGMs: Taking $\mathcal{M}$ to be the set of all training samples, our results provide a precise description of when the SGM memorizes its training data.  ( 2 min )
    Generalized Variational Inference in Function Spaces: Gaussian Measures meet Bayesian Deep Learning. (arXiv:2205.06342v2 [stat.ML] UPDATED)
    We develop a framework for generalized variational inference in infinite-dimensional function spaces and use it to construct a method termed Gaussian Wasserstein inference (GWI). GWI leverages the Wasserstein distance between Gaussian measures on the Hilbert space of square-integrable functions in order to determine a variational posterior using a tractable optimisation criterion and avoids pathologies arising in standard variational function space inference. An exciting application of GWI is the ability to use deep neural networks in the variational parametrisation of GWI, combining their superior predictive performance with the principled uncertainty quantification analogous to that of Gaussian processes. The proposed method obtains state-of-the-art performance on several benchmark datasets.  ( 2 min )
    Concentration bounds for the empirical angular measure with statistical learning applications. (arXiv:2104.03966v3 [math.ST] UPDATED)
    The angular measure on the unit sphere characterizes the first-order dependence structure of the components of a random vector in extreme regions and is defined in terms of standardized margins. Its statistical recovery is an important step in learning problems involving observations far away from the center. In the common situation that the components of the vector have different distributions, the rank transformation offers a convenient and robust way of standardizing data in order to build an empirical version of the angular measure based on the most extreme observations. However, the study of the sampling distribution of the resulting empirical angular measure is challenging. It is the purpose of the paper to establish finite-sample bounds for the maximal deviations between the empirical and true angular measures, uniformly over classes of Borel sets of controlled combinatorial complexity. The bounds are valid with high probability and, up to logarithmic factors, scale as the square root of the effective sample size. The bounds are applied to provide performance guarantees for two statistical learning procedures tailored to extreme regions of the input space and built upon the empirical angular measure: binary classification in extreme regions through empirical risk minimization and unsupervised anomaly detection through minimum-volume sets of the sphere.  ( 3 min )
    Free Probability for predicting the performance of feed-forward fully connected neural networks. (arXiv:2111.00841v3 [stat.ML] UPDATED)
    Gradient descent during the learning process of a neural network can be subject to many instabilities. The spectral density of the Jacobian is a key component for analyzing stability. Following the works of Pennington et al., such Jacobians are modeled using free multiplicative convolutions from Free Probability Theory (FPT). We present a reliable and very fast method for computing the associated spectral densities for a given architecture and initialization. This method has a controlled and proven convergence. Our technique is based on a homotopy method: an adaptive Newton-Raphson scheme that chains basins of attraction. To demonstrate the relevance of our method, we show that the relevant FPT metrics computed before training are highly correlated to final test accuracies - up to 85\%. We also nuance the idea that learning happens at the edge of chaos by giving evidence that a very desirable feature for neural networks is the hyperbolicity of their Jacobian at initialization.
    Pay attention to your loss: understanding misconceptions about 1-Lipschitz neural networks. (arXiv:2104.05097v6 [cs.LG] UPDATED)
    Lipschitz constrained networks have gathered considerable attention in the deep learning community, with usages ranging from Wasserstein distance estimation to the training of certifiably robust classifiers. However, they are still commonly considered less accurate, and their properties in learning are not fully understood. In this paper we clarify the matter: when it comes to classification, 1-Lipschitz neural networks enjoy several advantages over their unconstrained counterparts. First, we show that these networks are as accurate as classical ones, and can fit arbitrarily difficult boundaries. Then, relying on a robustness metric that reflects operational needs, we characterize the most robust classifier: the WGAN discriminator. Next, we show that 1-Lipschitz neural networks generalize well under milder assumptions. Finally, we show that hyper-parameters of the loss are crucial for controlling the accuracy-robustness trade-off. We conclude that they exhibit appealing properties to pave the way toward provably accurate, and provably robust, neural networks.
    Data Subsampling for Bayesian Neural Networks. (arXiv:2210.09141v1 [stat.ML])
    Markov Chain Monte Carlo (MCMC) algorithms do not scale well to large datasets, leading to difficulties in Neural Network posterior sampling. In this paper, we apply a generalization of the Metropolis Hastings algorithm that allows us to restrict the evaluation of the likelihood to small mini-batches in a Bayesian inference context. Since it requires the computation of a so-called "noise penalty" determined by the variance of the training loss function over the mini-batches, we refer to this data subsampling strategy as Penalty Bayesian Neural Networks - PBNNs. Its implementation on top of MCMC is straightforward, as the variance of the loss function merely reduces the acceptance probability. Compared to other samplers, we empirically show that PBNN achieves good predictive performance for a given mini-batch size. Varying the size of the mini-batches enables a natural calibration of the predictive distribution and provides an inbuilt protection against overfitting. We expect PBNN to be particularly suited for cases when data sets are distributed across multiple decentralized devices, as is typical in federated learning.
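    A toy sketch of the penalized acceptance step, assuming a Gaussian-style penalty equal to half the estimated variance of the noisy log-likelihood ratio; the paper's exact penalty is defined via the mini-batch variance of the training loss, which this only approximates.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, size=2000)
batches = np.split(data, 20)                      # 20 mini-batches of 100

def batch_log_lik(theta, idx):
    """Per-batch log-likelihoods for a toy 1-d Gaussian model."""
    return np.array([-0.5 * ((batches[i] - theta) ** 2).sum() for i in idx])

theta, chain = 0.0, []
for _ in range(5000):
    prop = theta + 0.05 * rng.normal()            # random-walk proposal
    idx = rng.choice(20, size=5, replace=False)   # subsample mini-batches
    diff = batch_log_lik(prop, idx) - batch_log_lik(theta, idx)
    delta = diff.mean() * 20                      # noisy full log-ratio
    penalty = diff.var(ddof=1) / 5 * 20**2 / 2    # ~ half of Var(delta)
    if np.log(rng.random()) < delta - penalty:    # penalized MH acceptance
        theta = prop
    chain.append(theta)
print(np.mean(chain[1000:]))                      # posterior mean near 2.0
```

    As the abstract notes, the penalty only ever lowers the acceptance probability, so noisier mini-batch estimates make the chain more conservative.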
    No imputation without representation. (arXiv:2206.14254v2 [cs.LG] UPDATED)
    By filling in missing values in datasets, imputation allows these datasets to be used with algorithms that cannot handle missing values by themselves. However, missing values may in principle contribute useful information that is lost through imputation. The missing-indicator approach can be used in combination with imputation to instead represent this information as a part of the dataset. There are several theoretical considerations why missing-indicators may or may not be beneficial, but there has not been any large-scale practical experiment on real-life datasets to test this question for machine learning predictions. We perform this experiment for three imputation strategies and a range of different classification algorithms, on the basis of twenty real-life datasets. We find that on these datasets, missing-indicators generally increase classification performance. In addition, we find no evidence for most algorithms that nearest neighbour and iterative imputation lead to better performance than simple mean/mode imputation. Therefore, we recommend the use of missing-indicators with mean/mode imputation as a safe default, with the caveat that for decision trees, pruning is necessary to prevent overfitting. In a follow-up experiment, we determine attribute-specific missingness thresholds for each classifier above which missing-indicators are more likely than not to increase classification performance, and observe that these thresholds are much lower for categorical than for numerical attributes. Finally, we argue that mean imputation of numerical attributes may preserve some of the information from missing values, and we show that in the absence of missing-indicators, it can similarly be useful to apply mean imputation to one-hot encoded categorical attributes instead of mode imputation.
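    The recommended default of mean/mode imputation plus missing-indicators is directly available in scikit-learn via `add_indicator=True`; a minimal sketch (the tiny array is illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, np.nan],
              [2.0, 3.0],
              [np.nan, 5.0]])

# add_indicator=True appends one binary column per feature that contains
# missing values, so the missingness information is kept in the dataset.
imp = SimpleImputer(strategy="mean", add_indicator=True)
print(imp.fit_transform(X))
# columns: imputed feature 1, imputed feature 2, indicator 1, indicator 2
```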
    Distributed Computation for Marginal Likelihood based Model Choice. (arXiv:1910.04672v4 [stat.CO] UPDATED)
    We propose a general method for distributed Bayesian model choice, using the marginal likelihood, where a data set is split in non-overlapping subsets. These subsets are only accessed locally by individual workers and no data is shared between the workers. We approximate the model evidence for the full data set through Monte Carlo sampling from the posterior on every subset generating a model evidence per subset. The results are combined using a novel approach which corrects for the splitting using summary statistics of the generated samples. Our divide-and-conquer approach enables Bayesian model choice in the large data setting, exploiting all available information but limiting communication between workers. We derive theoretical error bounds that quantify the resulting trade-off between computational gain and loss in precision. The embarrassingly parallel nature yields important speed-ups when used on massive data sets as illustrated by our real world experiments. In addition, we show how the suggested approach can be extended to model choice within a reversible jump setting that explores multiple feature combinations within one run.
    Exact Solutions of a Deep Linear Network. (arXiv:2202.04777v4 [stat.ML] UPDATED)
    This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that zero is a special point in deep neural network architecture. We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than $1$ hidden layer, qualitatively different from a network with only $1$ hidden layer. Practically, our result implies that common deep learning initialization methods are insufficient to ease the optimization of neural networks in general.
    Sample-Efficient Reinforcement Learning of Partially Observable Markov Games. (arXiv:2206.01315v2 [cs.LG] UPDATED)
    This paper considers the challenging tasks of Multi-Agent Reinforcement Learning (MARL) under partial observability, where each agent only sees her own individual observations and actions that reveal incomplete information about the underlying state of the system. This paper studies these tasks under the general model of multiplayer general-sum Partially Observable Markov Games (POMGs), which is significantly larger than the standard model of Imperfect Information Extensive-Form Games (IIEFGs). We identify a rich subclass of POMGs -- weakly revealing POMGs -- in which sample-efficient learning is tractable. In the self-play setting, we prove that a simple algorithm combining optimism and Maximum Likelihood Estimation (MLE) is sufficient to find approximate Nash equilibria, correlated equilibria, as well as coarse correlated equilibria of weakly revealing POMGs, in a polynomial number of samples when the number of agents is small. In the setting of playing against adversarial opponents, we show that a variant of our optimistic MLE algorithm is capable of achieving sublinear regret when being compared against the optimal maximin policies. To the best of our knowledge, this work provides the first line of sample-efficient results for learning POMGs.
    Fast Gaussian Process Predictions on Large Geospatial Fields with Prediction-Point Dependent Basis Functions. (arXiv:2210.09168v1 [cs.LG])
    In order to perform GP predictions fast in large geospatial fields with small-scale variations, a computational complexity that is independent of the number of measurements $N$ and the size of the field is crucial. In this setting, GP approximations using $m$ basis functions require $\mathcal{O}(Nm^2+m^3)$ computations. Using finite-support basis functions reduces the number of computations required to perform a single prediction to $\mathcal{O}(m^3)$, after a one-time training cost of $\mathcal{O}(N)$. The prediction cost increases with increasing field size, as the number of required basis functions $m$ grows with the size of the field relative to the size of the spatial variations. To prevent the prediction speed from depending on field size, we propose leveraging the property that a subset of the trained system is a trained subset of the system, using only a local subset of $m'\ll m$ finite-support basis functions centered around each prediction point to perform predictions. Our proposed approximation requires $\mathcal{O}(m'^3)$ operations to perform each prediction after a one-time training cost of $\mathcal{O}(N)$. We show on real-life spatial data that our approach matches the prediction error of state-of-the-art methods and that it performs faster predictions, also compared to state-of-the-art approximations that lower the prediction cost from $\mathcal{O}(m^3)$ to $\mathcal{O}(m\log(m))$ using a conjugate gradient solver. Finally, we demonstrate that our approach can perform fast predictions on a global bathymetry dataset using millions of basis functions and tens of millions of measurements on a laptop computer.
    Causal Discovery in Heterogeneous Environments Under the Sparse Mechanism Shift Hypothesis. (arXiv:2206.02013v2 [cs.LG] UPDATED)
    Machine learning approaches commonly rely on the assumption of independent and identically distributed (i.i.d.) data. In reality, however, this assumption is almost always violated due to distribution shifts between environments. Although valuable learning signals can be provided by heterogeneous data from changing distributions, it is also known that learning under arbitrary (adversarial) changes is impossible. Causality provides a useful framework for modeling distribution shifts, since causal models encode both observational and interventional distributions. In this work, we explore the sparse mechanism shift hypothesis, which posits that distribution shifts occur due to a small number of changing causal conditionals. Motivated by this idea, we apply it to learning causal structure from heterogeneous environments, where i.i.d. data only allows for learning an equivalence class of graphs without restrictive assumptions. We propose the Mechanism Shift Score (MSS), a score-based approach amenable to various empirical estimators, which provably identifies the entire causal structure with high probability if the sparse mechanism shift hypothesis holds. Empirically, we verify behavior predicted by the theory and compare multiple estimators and score functions to identify the best approaches in practice. Compared to other methods, we show how MSS bridges a gap by both being nonparametric as well as explicitly leveraging sparse changes.
    Dimension free ridge regression. (arXiv:2210.08571v1 [math.ST])
    Random matrix theory has become a widely useful tool in high-dimensional statistics and theoretical machine learning. However, random matrix theory is largely focused on the proportional asymptotics in which the number of columns grows proportionally to the number of rows of the data matrix. This is not always the most natural setting in statistics, where columns correspond to covariates and rows to samples. With the objective to move beyond the proportional asymptotics, we revisit ridge regression ($\ell_2$-penalized least squares) on i.i.d. data $(x_i, y_i)$, $i\le n$, where $x_i$ is a feature vector and $y_i = \beta^\top x_i +\epsilon_i \in\mathbb{R}$ is a response. We allow the feature vector to be high-dimensional, or even infinite-dimensional, in which case it belongs to a separable Hilbert space, and assume either $z_i := \Sigma^{-1/2}x_i$ to have i.i.d. entries, or to satisfy a certain convex concentration property. Within this setting, we establish non-asymptotic bounds that approximate the bias and variance of ridge regression in terms of the bias and variance of an `equivalent' sequence model (a regression model with diagonal design matrix). The approximation is up to multiplicative factors bounded by $(1\pm \Delta)$ for some explicitly small $\Delta$. Previously, such an approximation result was known only in the proportional regime and only up to additive errors: in particular, it did not allow one to characterize the behavior of the excess risk when this converges to $0$. Our general theory recovers earlier results in the proportional regime (with better error rates). As a new application, we obtain a completely explicit and sharp characterization of ridge regression for Hilbert covariates with regularly varying spectrum. Finally, we analyze the overparametrized near-interpolation setting and obtain sharp `benign overfitting' guarantees.
    Zonotope Domains for Lagrangian Neural Network Verification. (arXiv:2210.08069v1 [cs.LG])
    Neural network verification aims to provide provable bounds for the output of a neural network for a given input range. Notable prior works in this domain have either generated bounds using abstract domains, which preserve some dependency between intermediate neurons in the network; or framed verification as an optimization problem and solved a relaxation using Lagrangian methods. A key drawback of the latter technique is that each neuron is treated independently, thereby ignoring important neuron interactions. We provide an approach that merges these two threads and uses zonotopes within a Lagrangian decomposition. Crucially, we can decompose the problem of verifying a deep neural network into the verification of many 2-layer neural networks. While each of these problems is provably hard, we provide efficient relaxation methods that are amenable to efficient dual ascent procedures. Our technique yields bounds that improve upon both linear programming and Lagrangian-based verification techniques in both time and bound tightness.
    The Terminating-Random Experiments Selector: Fast High-Dimensional Variable Selection with False Discovery Rate Control. (arXiv:2110.06048v5 [stat.ME] UPDATED)
    We propose the Terminating-Random Experiments (T-Rex) selector, a fast variable selection method for high-dimensional data. The T-Rex selector controls a user-defined target false discovery rate (FDR) while maximizing the number of selected variables. This is achieved by fusing the solutions of multiple early terminated random experiments. The experiments are conducted on a combination of the original predictors and multiple sets of randomly generated dummy predictors. A finite sample proof based on martingale theory for the FDR control property is provided. Numerical simulations confirm that the FDR is controlled at the target level while allowing for a high power. We prove under mild conditions that the dummies can be sampled from any univariate probability distribution with finite expectation and variance. The computational complexity of the proposed method is linear in the number of variables. The T-Rex selector outperforms state-of-the-art methods for FDR control on a simulated genome-wide association study (GWAS), while its sequential computation time is more than two orders of magnitude lower than that of the strongest benchmark methods. The open source R package TRexSelector containing the implementation of the T-Rex selector is available on CRAN.
    A Unified Algorithm for Stochastic Path Problems. (arXiv:2210.09255v1 [cs.LG])
    We study reinforcement learning in stochastic path (SP) problems. The goal in these problems is to maximize the expected sum of rewards until the agent reaches a terminal state. We provide the first regret guarantees in this general problem by analyzing a simple optimistic algorithm. Our regret bound matches the best known results for the well-studied special case of stochastic shortest path (SSP) with all non-positive rewards. For SSP, we present an adaptation procedure for the case when the scale of rewards $B_\star$ is unknown. We show that there is no price for adaptation, and our regret bound matches that with a known $B_\star$. We also provide a scale adaptation procedure for the special case of stochastic longest paths (SLP) where all rewards are non-negative. However, unlike in SSP, we show through a lower bound that there is an unavoidable price for adaptation.
    Model Criticism for Long-Form Text Generation. (arXiv:2210.08444v1 [cs.CL])
    Language models have demonstrated the ability to generate highly fluent text; however, it remains unclear whether their output retains coherent high-level structure (e.g., story progression). Here, we propose to apply a statistical tool, model criticism in latent space, to evaluate the high-level structure of the generated text. Model criticism compares the distributions between real and generated data in a latent space obtained according to an assumptive generative process. Different generative processes identify specific failure modes of the underlying model. We perform experiments on three representative aspects of high-level discourse -- coherence, coreference, and topicality -- and find that transformer-based language models are able to capture topical structures but have a harder time maintaining structural coherence or modeling coreference.
    Generalization Gap in Amortized Inference. (arXiv:2205.11640v2 [stat.ML] UPDATED)
    The ability of likelihood-based probabilistic models to generalize to unseen data is central to many machine learning applications such as lossless compression. In this work, we study the generalization of a popular class of probabilistic models -- the Variational Auto-Encoder (VAE). We discuss the two generalization gaps that affect VAEs and show that overfitting is usually dominated by amortized inference. Based on this observation, we propose a new training objective that improves the generalization of amortized inference. We demonstrate how our method can improve performance in the context of image modeling and lossless compression.  ( 2 min )
    Spectral Bias Outside the Training Set for Deep Networks in the Kernel Regime. (arXiv:2206.02927v2 [stat.ML] UPDATED)
    We provide quantitative bounds measuring the $L^2$ difference in function space between the trajectory of a finite-width network trained on finitely many samples and the idealized kernel dynamics of infinite width and infinite data. An implication of the bounds is that the network is biased to learn the top eigenfunctions of the Neural Tangent Kernel not just on the training set but over the entire input space. This bias depends on the model architecture and input distribution alone and thus does not depend on the target function, which does not need to be in the RKHS of the kernel. The result is valid for deep architectures with fully connected, convolutional, and residual layers. Furthermore, the width does not need to grow polynomially with the number of samples in order to obtain high probability bounds up to a stopping time. The proof exploits the low-effective-rank property of the Fisher Information Matrix at initialization, which implies a low effective dimension of the model (far smaller than the number of parameters). We conclude that local capacity control from the low effective rank of the Fisher Information Matrix is still underexplored theoretically.  ( 3 min )
    k-Sliced Mutual Information: A Quantitative Study of Scalability with Dimension. (arXiv:2206.08526v2 [cs.IT] UPDATED)
    Sliced mutual information (SMI) is defined as an average of mutual information (MI) terms between one-dimensional random projections of the random variables. It serves as a surrogate measure of dependence to classic MI that preserves many of its properties but is more scalable to high dimensions. However, a quantitative characterization of how SMI itself and estimation rates thereof depend on the ambient dimension, which is crucial to the understanding of scalability, remain obscure. This work provides a multifaceted account of the dependence of SMI on dimension, under a broader framework termed $k$-SMI, which considers projections to $k$-dimensional subspaces. Using a new result on the continuity of differential entropy in the 2-Wasserstein metric, we derive sharp bounds on the error of Monte Carlo (MC)-based estimates of $k$-SMI, with explicit dependence on $k$ and the ambient dimension, revealing their interplay with the number of samples. We then combine the MC integrator with the neural estimation framework to provide an end-to-end $k$-SMI estimator, for which optimal convergence rates are established. We also explore asymptotics of the population $k$-SMI as dimension grows, providing Gaussian approximation results with a residual that decays under appropriate moment bounds. All our results trivially apply to SMI by setting $k=1$. Our theory is validated with numerical experiments and is applied to sliced InfoGAN, which altogether provide a comprehensive quantitative account of the scalability question of $k$-SMI, including SMI as a special case when $k=1$.  ( 3 min )
    A new nonparametric interpoint distance-based measure for assessment of clustering. (arXiv:2210.08972v1 [cs.LG])
    A new interpoint distance-based measure is proposed to identify the optimal number of clusters present in a data set. Designed in a nonparametric fashion, it is independent of the distribution of the given data. Interpoint distances between the data members make our cluster validity index applicable to univariate and multivariate data measured on arbitrary scales, or having observations in any dimensional space where the number of study variables can be even larger than the sample size. Our proposed criterion is compatible with any clustering algorithm and can be used to determine the unknown number of clusters or to assess the quality of the resulting clusters for a data set. Demonstration through synthetic and real-life data establishes its superiority over the well-known clustering accuracy measures in the literature.  ( 2 min )
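    The abstract does not define the measure itself, so the sketch below uses a classical interpoint-distance-based index (the silhouette score) as a stand-in while following the same recipe: cluster for each candidate $k$, score the partition from pairwise distances, keep the best $k$.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Three well-separated 2-d Gaussian blobs as toy data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 3, 6)])

# Score each candidate k with a distance-based validity index; the
# silhouette score is a generic stand-in for the paper's proposed measure.
scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=0).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))   # expect k = 3 for this toy data
```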
    On Mixup Regularization. (arXiv:2006.06049v3 [cs.LG] UPDATED)
    Mixup is a data augmentation technique that creates new examples as convex combinations of training points and labels. This simple technique has empirically been shown to improve the accuracy of many state-of-the-art models in different settings and applications, but the reasons behind this empirical success remain poorly understood. In this paper we take a substantial step in explaining the theoretical foundations of Mixup, by clarifying its regularization effects. We show that Mixup can be interpreted as a standard empirical risk minimization estimator subject to a combination of data transformation and random perturbation of the transformed data. We gain two core insights from this new interpretation. First, the data transformation suggests that, at test time, a model trained with Mixup should also be applied to transformed data, a one-line change in code that we show empirically to improve both accuracy and calibration of the prediction. Second, we show how the random perturbation of the new interpretation of Mixup induces multiple known regularization schemes, including label smoothing and reduction of the Lipschitz constant of the estimator. These schemes interact synergistically with each other, resulting in a self-calibrated and effective regularization effect that prevents overfitting and overconfident predictions. We corroborate our theoretical analysis with experiments that support our conclusions.  ( 3 min )
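    For reference, the Mixup transformation itself is only a few lines; a numpy sketch assuming one-hot labels and $\lambda \sim \mathrm{Beta}(\alpha, \alpha)$ as in the original formulation:

```python
import numpy as np

def mixup(x, y, alpha=0.2, rng=np.random.default_rng()):
    """Return convex combinations of a batch with a shuffled copy of
    itself; y is assumed one-hot so labels mix the same way as inputs."""
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

x = np.random.randn(32, 3072)                        # toy flattened images
y = np.eye(10)[np.random.randint(0, 10, size=32)]    # one-hot labels
x_mix, y_mix = mixup(x, y)
```

    Consistent with the first insight above, a model trained this way can also be evaluated on transformed inputs at test time, which is a one-line change in the data pipeline.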
    Breaking the Sample Complexity Barrier to Regret-Optimal Model-Free Reinforcement Learning. (arXiv:2110.04645v2 [cs.LG] UPDATED)
    Achieving sample efficiency in online episodic reinforcement learning (RL) requires optimally balancing exploration and exploitation. When it comes to a finite-horizon episodic Markov decision process with $S$ states, $A$ actions and horizon length $H$, substantial progress has been achieved towards characterizing the minimax-optimal regret, which scales on the order of $\sqrt{H^2SAT}$ (modulo log factors) with $T$ the total number of samples. While several competing solution paradigms have been proposed to minimize regret, they are either memory-inefficient, or fall short of optimality unless the sample size exceeds an enormous threshold (e.g., $S^6A^4 \,\mathrm{poly}(H)$ for existing model-free methods). To overcome such a large sample size barrier to efficient RL, we design a novel model-free algorithm, with space complexity $O(SAH)$, that achieves near-optimal regret as soon as the sample size exceeds the order of $SA\,\mathrm{poly}(H)$. In terms of this sample size requirement (also referred to the initial burn-in cost), our method improves -- by at least a factor of $S^5A^3$ -- upon any prior memory-efficient algorithm that is asymptotically regret-optimal. Leveraging the recently introduced variance reduction strategy (also called {\em reference-advantage decomposition}), the proposed algorithm employs an {\em early-settled} reference update rule, with the aid of two Q-learning sequences with upper and lower confidence bounds. The design principle of our early-settled variance reduction method might be of independent interest to other RL settings that involve intricate exploration-exploitation trade-offs.  ( 3 min )
    On the Double Descent of Random Features Models Trained with SGD. (arXiv:2110.06910v6 [stat.ML] UPDATED)
    We study the generalization properties of random features (RF) regression in high dimensions optimized by stochastic gradient descent (SGD) in the under-/over-parameterized regime. In this work, we derive precise non-asymptotic error bounds for RF regression under both constant and polynomial-decay step-size SGD settings, and observe the double descent phenomenon both theoretically and empirically. Our analysis shows how to cope with multiple randomness sources of initialization, label noise, and data sampling (as well as stochastic gradients) with no closed-form solution, and also goes beyond the commonly-used Gaussian/spherical data assumption. Our theoretical results demonstrate that, with SGD training, RF regression still generalizes well for interpolation learning, and is able to characterize the double descent behavior by the unimodality of variance and monotonic decrease of bias. Besides, we also prove that the constant step-size SGD setting incurs no loss in convergence rate when compared to the exact minimum-norm interpolator, as a theoretical justification of using SGD in practice.  ( 3 min )
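    A small numpy sketch of the object under study: random-features regression fitted by constant step-size SGD. The toy dimensions and the ReLU feature map are illustrative choices; the paper's analysis covers far more general settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, n = 20, 500, 1000          # input dim, #features, #samples

# Random features: phi(x) = relu(W x) with a fixed random matrix W
W = rng.normal(size=(m, d)) / np.sqrt(d)
phi = lambda X: np.maximum(W @ X.T, 0.0).T

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)  # noisy linear target

theta = np.zeros(m)
step = 0.5 / m                   # constant step size
for _ in range(20 * n):          # several passes, one sample per step
    i = rng.integers(n)
    f = phi(X[i:i+1]) @ theta
    theta -= step * (f - y[i]) * phi(X[i:i+1])[0]   # SGD on squared loss

print(np.mean((phi(X) @ theta - y) ** 2))           # training MSE
```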
    Differentiable Model Compression via Pseudo Quantization Noise. (arXiv:2104.09987v3 [stat.ML] UPDATED)
    We propose DiffQ, a differentiable method for model compression that quantizes model parameters without gradient approximations (e.g., the Straight Through Estimator). We suggest adding independent pseudo quantization noise to model parameters during training to approximate the effect of a quantization operator. DiffQ is differentiable both with respect to the unquantized weights and the number of bits used. Given a single hyper-parameter balancing between the quantized model size and accuracy, DiffQ optimizes the number of bits used per individual weight or groups of weights, in end-to-end training. We experimentally verify that our method is competitive with STE-based quantization techniques on several benchmarks and architectures for image classification, language modeling, and audio source separation. For instance, on the ImageNet dataset, DiffQ compresses a 12-layer transformer-based model by more than a factor of 8 (below 4 bits of precision per weight on average), with a loss of only 0.3% in model accuracy. Code is available at github.com/facebookresearch/diffq.  ( 2 min )
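    The central trick, replacing the quantizer with additive pseudo quantization noise during training, can be sketched as follows. This simplified uniform-noise model with a single scalar bit-width is an assumption for illustration; DiffQ's actual parameterization and model-size penalty are in the linked repository.

```python
import torch

def pseudo_quant(w, bits):
    """Differentiable stand-in for quantization: add uniform noise whose
    scale matches the rounding error of a uniform `bits`-bit quantizer.
    Differentiable in both the weights and the real-valued bit-width."""
    w_range = (w.max() - w.min()).detach()       # range treated as constant
    step = w_range / (2.0 ** bits - 1.0)         # quantization step size
    noise = (torch.rand_like(w) - 0.5) * step    # ~ U(-step/2, step/2)
    return w + noise

w = torch.randn(64, 32, requires_grad=True)
bits = torch.tensor(4.0, requires_grad=True)     # trained via a size penalty
loss = (pseudo_quant(w, bits) ** 2).mean()       # toy task loss
loss.backward()                                  # gradients reach w and bits
print(w.grad.shape, bits.grad)
```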
    Data-Efficient Pipeline for Offline Reinforcement Learning with Limited Data. (arXiv:2210.08642v1 [cs.LG])
    Offline reinforcement learning (RL) can be used to improve future performance by leveraging historical data. There exist many different algorithms for offline RL, and it is well recognized that these algorithms, and their hyperparameter settings, can lead to decision policies with substantially differing performance. This prompts the need for pipelines that allow practitioners to systematically perform algorithm-hyperparameter selection for their setting. Critically, in most real-world settings, this pipeline must only involve the use of historical data. Inspired by statistical model selection methods for supervised learning, we introduce a task- and method-agnostic pipeline for automatically training, comparing, selecting, and deploying the best policy when the provided dataset is limited in size. In particular, our work highlights the importance of performing multiple data splits to produce more reliable algorithm-hyperparameter selection. While this is a common approach in supervised learning, to our knowledge, this has not been discussed in detail in the offline RL setting. We show it can have substantial impacts when the dataset is small. Compared to alternate approaches, our proposed pipeline outputs higher-performing deployed policies from a broad range of offline policy learning algorithms and across various simulation domains in healthcare, education, and robotics. This work contributes toward the development of a general-purpose meta-algorithm for automatic algorithm-hyperparameter selection for offline RL.  ( 3 min )
    A Kernel Approach for PDE Discovery and Operator Learning. (arXiv:2210.08140v1 [stat.ML])
    This article presents a three-step framework for learning and solving partial differential equations (PDEs) using kernel methods. Given a training set consisting of pairs of noisy PDE solutions and source/boundary terms on a mesh, kernel smoothing is utilized to denoise the data and approximate derivatives of the solution. This information is then used in a kernel regression model to learn the algebraic form of the PDE. The learned PDE is then used within a kernel based solver to approximate the solution of the PDE with a new source/boundary term, thereby constituting an operator learning framework. The proposed method is mathematically interpretable and amenable to analysis, and convenient to implement. Numerical experiments compare the method to state-of-the-art algorithms and demonstrate its superior performance on small amounts of training data and for PDEs with spatially variable coefficients.  ( 2 min )
    Robust Flow-based Conformal Inference (FCI) with Statistical Guarantee. (arXiv:2205.10732v2 [stat.ML] UPDATED)
    Conformal prediction aims to determine precise levels of confidence in predictions for new objects using past experience. However, the commonly used exchangeability assumption between the training and testing data limits its usage in dealing with contaminated testing sets. In this paper, we develop a novel flow-based conformal inference (FCI) method to build predictive sets and infer outliers for complex and high-dimensional data. We leverage ideas from adversarial flows to transfer the input data to a random vector with a known distribution. Our roundtrip transformation maps the input data to a low-dimensional space while preserving the conditional distribution of the input data given each class label, which enables us to construct a non-conformity score for uncertainty quantification. Our approach is applicable and robust when the testing data is contaminated. We evaluate our method, robust flow-based conformal inference, on benchmark datasets. We find that it produces effective predictive sets and accurate outlier detection and is more powerful relative to competing approaches.  ( 2 min )
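    For context, a minimal split-conformal sketch showing how any non-conformity score (the paper constructs its score from the flow-based roundtrip transformation) turns into a predictive set; assumes a calibration set large enough for the quantile level to be well defined.
        import numpy as np

        def conformal_set(scores_cal, scores_test_per_class, alpha=0.1):
            """Classes whose non-conformity score falls below the calibrated
            quantile form the predictive set."""
            n = len(scores_cal)
            level = np.ceil((n + 1) * (1 - alpha)) / n   # assumes n >= 1/alpha - 1
            q = np.quantile(scores_cal, level, method="higher")
            return [k for k, s in enumerate(scores_test_per_class) if s <= q]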
    Neural Attentive Circuits. (arXiv:2210.08031v1 [cs.LG])
    Recent work has seen the development of general purpose neural architectures that can be trained to perform tasks across diverse data modalities. General purpose models typically make few assumptions about the underlying data-structure and are known to perform well in the large-data regime. At the same time, there has been growing interest in modular neural architectures that represent the data using sparsely interacting modules. These models can be more robust out-of-distribution, computationally efficient, and capable of sample-efficient adaptation to new data. However, they tend to make domain-specific assumptions about the data, and present challenges in how module behavior (i.e., parameterization) and connectivity (i.e., their layout) can be jointly learned. In this work, we introduce a general purpose, yet modular neural architecture called Neural Attentive Circuits (NACs) that jointly learns the parameterization and a sparse connectivity of neural modules without using domain knowledge. NACs are best understood as the combination of two systems that are jointly trained end-to-end: one that determines the module configuration and the other that executes it on an input. We demonstrate qualitatively that NACs learn diverse and meaningful module configurations on the NLVR2 dataset without additional supervision. Quantitatively, we show that by incorporating modularity in this way, NACs improve upon a strong non-modular baseline in terms of low-shot adaptation on the CIFAR and CUB datasets by about 10%, and OOD robustness on Tiny ImageNet-R by about 2.5%. Further, we find that NACs can achieve an 8x speedup at inference time while losing less than 3% performance. Finally, we find NACs to yield competitive results on diverse data modalities spanning point-cloud classification, symbolic processing and text classification from ASCII bytes, thereby confirming their general-purpose nature.  ( 3 min )
    Forget Unlearning: Towards True Data-Deletion in Machine Learning. (arXiv:2210.08911v1 [stat.ML])
    Unlearning has emerged as a technique to efficiently erase information of deleted records from learned models. We show, however, that the influence created by the original presence of a data point in the training set can still be detected after running certified unlearning algorithms (which can result in its reconstruction by an adversary). Thus, under realistic assumptions about the dynamics of model releases over time and in the presence of adaptive adversaries, we show that unlearning is not equivalent to data deletion and does not guarantee the "right to be forgotten." We then propose a more robust data-deletion guarantee and show that it is necessary to satisfy differential privacy to ensure true data deletion. Under our notion, we propose an accurate, computationally efficient, and secure data-deletion machine learning algorithm in the online setting, based on the noisy gradient descent algorithm.  ( 2 min )
    A Simple Convergence Proof of Adam and Adagrad. (arXiv:2003.02395v3 [stat.ML] UPDATED)
    We provide a simple proof of convergence covering both the Adam and Adagrad adaptive optimization algorithms when applied to smooth (possibly non-convex) objective functions with bounded gradients. We show that in expectation, the squared norm of the objective gradient averaged over the trajectory has an upper-bound which is explicit in the constants of the problem, parameters of the optimizer, the dimension $d$, and the total number of iterations $N$. This bound can be made arbitrarily small, and with the right hyper-parameters, Adam can be shown to converge with the same rate of convergence $O(d\ln(N)/\sqrt{N})$. When used with the default parameters, Adam doesn't converge, however, and just like constant step-size SGD, it moves away from the initialization point faster than Adagrad, which might explain its practical success. Finally, we obtain the tightest dependency on the heavy ball momentum decay rate $\beta_1$ among all previous convergence bounds for non-convex Adam and Adagrad, improving from $O((1-\beta_1)^{-3})$ to $O((1-\beta_1)^{-1})$.  ( 2 min )
    Active Bayesian Causal Inference. (arXiv:2206.02063v2 [cs.LG] UPDATED)
    Causal discovery and causal reasoning are classically treated as separate and consecutive tasks: one first infers the causal graph, and then uses it to estimate causal effects of interventions. However, such a two-stage approach is uneconomical, especially in terms of actively collected interventional data, since the causal query of interest may not require a fully-specified causal model. From a Bayesian perspective, it is also unnatural, since a causal query (e.g., the causal graph or some causal effect) can be viewed as a latent quantity subject to posterior inference -- other unobserved quantities that are not of direct interest (e.g., the full causal model) ought to be marginalized out in this process and contribute to our epistemic uncertainty. In this work, we propose Active Bayesian Causal Inference (ABCI), a fully-Bayesian active learning framework for integrated causal discovery and reasoning, which jointly infers a posterior over causal models and queries of interest. In our approach to ABCI, we focus on the class of causally-sufficient, nonlinear additive noise models, which we model using Gaussian processes. We sequentially design experiments that are maximally informative about our target causal query, collect the corresponding interventional data, and update our beliefs to choose the next experiment. Through simulations, we demonstrate that our approach is more data-efficient than several baselines that only focus on learning the full causal graph. This allows us to accurately learn downstream causal queries from fewer samples while providing well-calibrated uncertainty estimates for the quantities of interest.  ( 3 min )
    Pareto Set Learning for Expensive Multi-Objective Optimization. (arXiv:2210.08495v1 [cs.NE])
    Expensive multi-objective optimization problems can be found in many real-world applications, where their objective function evaluations involve expensive computations or physical experiments. It is desirable to obtain an approximate Pareto front with a limited evaluation budget. Multi-objective Bayesian optimization (MOBO) has been widely used for finding a finite set of Pareto optimal solutions. However, it is well-known that the whole Pareto set is on a continuous manifold and can contain infinite solutions. The structural properties of the Pareto set are not well exploited in existing MOBO methods, and the finite-set approximation may not contain the most preferred solution(s) for decision-makers. This paper develops a novel learning-based method to approximate the whole Pareto set for MOBO, which generalizes the decomposition-based multi-objective optimization algorithm (MOEA/D) from finite populations to models. We design a simple and powerful acquisition search method based on the learned Pareto set, which naturally supports batch evaluation. In addition, with our proposed model, decision-makers can readily explore any trade-off area in the approximate Pareto set for flexible decision-making. This work represents the first attempt to model the Pareto set for expensive multi-objective optimization. Experimental results on different synthetic and real-world problems demonstrate the effectiveness of our proposed method.  ( 2 min )
    Learning Fast and Slow for Online Time Series Forecasting. (arXiv:2202.11672v2 [cs.LG] UPDATED)
    The fast adaptation capability of deep neural networks in non-stationary environments is critical for online time series forecasting. Successful solutions require handling changes to new and recurring patterns. However, training a deep neural forecaster on the fly is notoriously challenging because of its limited ability to adapt to non-stationary environments and the catastrophic forgetting of old knowledge. In this work, inspired by the Complementary Learning Systems (CLS) theory, we propose Fast and Slow learning Networks (FSNet), a holistic framework for online time-series forecasting to simultaneously deal with abrupt changes and repeating patterns. Particularly, FSNet improves the slowly-learned backbone by dynamically balancing fast adaptation to recent changes and retrieving similar old knowledge. FSNet achieves this mechanism via an interaction between two complementary components: an adapter to monitor each layer's contribution to the loss, and an associative memory to support remembering, updating, and recalling repeating events. Extensive experiments on real and synthetic datasets validate FSNet's efficacy and robustness to both new and recurring patterns. Our code is available at \url{https://github.com/salesforce/fsnet}.  ( 2 min )
    Theory for Equivariant Quantum Neural Networks. (arXiv:2210.08566v1 [quant-ph])
    Most currently used quantum neural network architectures have little-to-no inductive biases, leading to trainability and generalization issues. Inspired by a similar problem, recent breakthroughs in classical machine learning address this crux by creating models encoding the symmetries of the learning task. This is materialized through the usage of equivariant neural networks whose action commutes with that of the symmetry. In this work, we import these ideas to the quantum realm by presenting a general theoretical framework to understand, classify, design and implement equivariant quantum neural networks. As a special implementation, we show how standard quantum convolutional neural networks (QCNN) can be generalized to group-equivariant QCNNs where both the convolutional and pooling layers are equivariant under the relevant symmetry group. Our framework can be readily applied to virtually all areas of quantum machine learning, and provides hope to alleviate central challenges such as barren plateaus, poor local minima, and sample complexity.  ( 2 min )
    RbX: Region-based explanations of prediction models. (arXiv:2210.08721v1 [stat.ML])
    We introduce region-based explanations (RbX), a novel, model-agnostic method to generate local explanations of scalar outputs from a black-box prediction model using only query access. RbX is based on a greedy algorithm for building a convex polytope that approximates a region of feature space where model predictions are close to the prediction at some target point. This region is fully specified by the user on the scale of the predictions, rather than on the scale of the features. The geometry of this polytope - specifically the change in each coordinate necessary to escape the polytope - quantifies the local sensitivity of the predictions to each of the features. These "escape distances" can then be standardized to rank the features by local importance. RbX is guaranteed to satisfy a "sparsity axiom," which requires that features which do not enter into the prediction model are assigned zero importance. At the same time, real data examples and synthetic experiments show how RbX can more readily detect all locally relevant features than existing methods.  ( 2 min )
    Differentially Private Learning Needs Hidden State (Or Much Faster Convergence). (arXiv:2203.05363v2 [stat.ML] UPDATED)
    Prior work on the differential privacy analysis of randomized SGD algorithms relies on composition theorems, where the implicit (unrealistic) assumption is that the internal state of the iterative algorithm is revealed to the adversary. As a result, the R\'enyi DP bounds derived by such composition-based analyses grow linearly with the number of training epochs. When the internal state of the algorithm is hidden, we prove a converging privacy bound for noisy stochastic gradient descent (on strongly convex smooth loss functions). We show how to take advantage of privacy amplification by sub-sampling and randomized post-processing, and prove the dynamics of the privacy bound for "shuffle and partition" and "sample without replacement" stochastic mini-batch gradient descent schemes. We prove that, in these settings, our privacy bound converges exponentially fast and is substantially smaller than the composition bounds, notably after a few training epochs. Thus, unless the DP algorithm converges fast, our privacy analysis shows that hidden state analysis can significantly amplify differential privacy.  ( 2 min )
    EXoN: EXplainable encoder Network. (arXiv:2105.10867v3 [stat.ML] UPDATED)
    We propose a new semi-supervised learning method of Variational AutoEncoder (VAE) which yields a customized and explainable latent space by EXplainable encoder Network (EXoN). Customization means a manual design of latent space layout for specific labeled data. To improve the performance of our VAE in a classification task without the loss of performance as a generative model, we employ a new semi-supervised classification method called SCI (Soft-label Consistency Interpolation). The classification loss and the Kullback-Leibler divergence play a crucial role in constructing explainable latent space. The variability of generated samples from our proposed model depends on a specific subspace, called activated latent subspace. Our numerical results with MNIST and CIFAR-10 datasets show that EXoN produces an explainable latent space and reduces the cost of investigating representation patterns on the latent space.  ( 2 min )
    Gradient Descent: The Ultimate Optimizer. (arXiv:1909.13371v2 [cs.LG] UPDATED)
    Working with any gradient-based machine learning algorithm involves the tedious task of tuning the optimizer's hyperparameters, such as its step size. Recent work has shown how the step size can itself be optimized alongside the model parameters by manually deriving expressions for "hypergradients" ahead of time. We show how to automatically compute hypergradients with a simple and elegant modification to backpropagation. This allows us to easily apply the method to other optimizers and hyperparameters (e.g. momentum coefficients). We can even recursively apply the method to its own hyper-hyperparameters, and so on ad infinitum. As these towers of optimizers grow taller, they become less sensitive to the initial choice of hyperparameters. We present experiments validating this for MLPs, CNNs, and RNNs. Finally, we provide a simple PyTorch implementation of this algorithm (see people.csail.mit.edu/kach/gradient-descent-the-ultimate-optimizer).  ( 2 min )
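    A minimal sketch of the hypergradient idea on a toy quadratic (the paper folds this into backpropagation itself and stacks it recursively; this hand-rolled version just differentiates one SGD step with respect to the step size):
        import torch

        torch.manual_seed(0)
        w = torch.randn(10, requires_grad=True)
        lr = torch.tensor(0.1, requires_grad=True)       # the step size is a parameter too
        hyper_lr = 1e-3                                   # step size for the step size
        loss_fn = lambda v: (v ** 2).sum()

        for _ in range(100):
            g = torch.autograd.grad(loss_fn(w), w)[0]    # ordinary gradient
            trial = w.detach() - lr * g                   # one SGD step, differentiable in lr
            hyper_g, = torch.autograd.grad(loss_fn(trial), lr)
            with torch.no_grad():
                lr -= hyper_lr * hyper_g                  # hypergradient descent on lr
            w = (w.detach() - lr.detach() * g).requires_grad_(True)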
    Learning from Few Samples: Transformation-Invariant SVMs with Composition and Locality at Multiple Scales. (arXiv:2109.12784v5 [cs.LG] UPDATED)
    Motivated by the problem of learning with small sample sizes, this paper shows how to incorporate into support-vector machines (SVMs) those properties that have made convolutional neural networks (CNNs) successful. Particularly important is the ability to incorporate domain knowledge of invariances, e.g., translational invariance of images. Kernels based on the \textit{maximum} similarity over a group of transformations are not generally positive definite. Perhaps it is for this reason that they have not been studied theoretically. We address this lacuna and show that positive definiteness indeed holds \textit{with high probability} for kernels based on the maximum similarity in the small training sample set regime of interest, and that they do yield the best results in that regime. We also show how additional properties such as their ability to incorporate local features at multiple spatial scales, e.g., as done in CNNs through max pooling, and to provide the benefits of composition through the architecture of multiple layers, can also be embedded into SVMs. We verify through experiments on widely available image sets that the resulting SVMs do provide superior accuracy in comparison to well-established deep neural network benchmarks for small sample sizes.  ( 3 min )
    What Makes Convolutional Models Great on Long Sequence Modeling?. (arXiv:2210.09298v1 [cs.LG])
    Convolutional models have been widely used in multiple domains. However, most existing models only use local convolution, making the model unable to handle long-range dependencies efficiently. Attention overcomes this problem by aggregating global information, but also makes the computational complexity quadratic in the sequence length. Recently, Gu et al. [2021] proposed a model called S4, inspired by the state space model. S4 can be efficiently implemented as a global convolutional model whose kernel size equals the input sequence length. S4 can model much longer sequences than Transformers and achieves significant gains over SoTA on several long-range tasks. Despite its empirical success, however, S4's design is involved: it requires sophisticated parameterization and initialization schemes, and as a result is less intuitive and hard to use. Here we aim to demystify S4 and extract the basic principles that contribute to its success as a global convolutional model. We focus on the structure of the convolution kernel and identify two critical but intuitive principles enjoyed by S4 that are sufficient to make up an effective global convolutional model: 1) the parameterization of the convolutional kernel needs to be efficient, in the sense that the number of parameters should scale sub-linearly with sequence length; 2) the kernel needs to satisfy a decaying structure in which the weights for convolving with closer neighbors are larger than those for more distant ones. Based on these two principles, we propose a simple yet effective convolutional model called Structured Global Convolution (SGConv). SGConv exhibits strong empirical performance across several tasks: 1) with faster speed, SGConv surpasses S4 on the Long Range Arena and Speech Command datasets; 2) when plugged into standard language and vision models, it shows the potential to improve both efficiency and performance.  ( 3 min )
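    A minimal sketch of a kernel obeying the two principles (the multi-scale blocks and decay weights are illustrative, not the paper's exact parameterization): short learnable blocks are upsampled to geometrically growing lengths, so parameters scale as O(log L), and each block is damped so weights decay with distance; the convolution itself runs through the FFT.
        import torch
        import torch.nn.functional as F

        def sgconv_kernel(blocks, alpha=0.5):
            """Concatenate upsampled, geometrically damped sub-kernels."""
            parts = []
            for b, k in enumerate(blocks):
                up = F.interpolate(k[None, None], scale_factor=2 ** b,
                                   mode="linear", align_corners=False)[0, 0]
                parts.append((alpha ** b) * up)           # weights decay with distance
            return torch.cat(parts)

        def fft_conv(u, kernel):
            L = u.shape[-1]
            k = F.pad(kernel, (0, L - kernel.shape[-1]))
            return torch.fft.irfft(torch.fft.rfft(u, n=2 * L)
                                   * torch.fft.rfft(k, n=2 * L))[..., :L]

        blocks = [torch.randn(16) for _ in range(5)]      # kernel length 496 from 80 params
        y = fft_conv(torch.randn(1, 512), sgconv_kernel(blocks))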
    Streaming PAC-Bayes Gaussian process regression with a performance guarantee for online decision making. (arXiv:2210.08486v1 [cs.LG])
    As a powerful Bayesian non-parametric method, the Gaussian process (GP) has played a significant role in Bayesian optimization and signal processing. GPs have also advanced online decision-making systems because their posterior distribution has a closed-form solution. However, their training and inference process requires all historic data to be stored and the GP model to be trained from scratch. For those reasons, several online GP algorithms, such as O-SGPR and O-SVGP, have been specifically designed for streaming settings. In this paper, we present a new theoretical framework for online GPs based on the online probably approximately correct (PAC) Bayes theory. The framework offers both a guarantee of generalization performance and good accuracy. Instead of minimizing the marginal likelihood, our algorithm optimizes both the empirical risk function and a regularization term that is proportional to the divergence between the prior and posterior distributions of the parameters. In addition to its theoretical appeal, the algorithm performs well empirically on several regression datasets. Compared to other online GP algorithms, ours yields a generalization guarantee and very competitive accuracy.  ( 2 min )
    Industry-Scale Orchestrated Federated Learning for Drug Discovery. (arXiv:2210.08871v1 [cs.LG])
    To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant no. 831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies and startups. To the best of our knowledge, the MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential data sets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographic, secure way following each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administrated in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.  ( 3 min )
    WILD-SCAV: Benchmarking FPS Gaming AI on Unity3D-based Environments. (arXiv:2210.09026v1 [cs.LG])
    Recent advances in deep reinforcement learning (RL) have demonstrated complex decision-making capabilities in simulation environments such as the Arcade Learning Environment, MuJoCo, and ViZDoom. However, they are hardly extensible to more complicated problems, mainly due to the lack of complexity and variation in the environments they are trained and tested on. Furthermore, they are not extensible to an open-world environment to facilitate long-term exploration research. To learn realistic task-solving capabilities, we need to develop an environment with greater diversity and complexity. We developed WILD-SCAV, a powerful and extensible environment based on a 3D open-world FPS (First-Person Shooter) game, to bridge the gap. It provides realistic 3D environments of variable complexity, various tasks, and multiple modes of interaction, where agents can learn to perceive 3D environments, navigate and plan, and compete and cooperate in a human-like manner. WILD-SCAV also supports different complexities, such as configurable maps with different terrains, building structures and distributions, and multi-agent settings with cooperative and competitive tasks. The experimental results on configurable complexity, multi-tasking, and multi-agent scenarios demonstrate the effectiveness of WILD-SCAV in benchmarking various RL algorithms, as well as its potential to give rise to intelligent agents with generalized task-solving abilities. The link to our open-sourced code can be found here: https://github.com/inspirai/wilderness-scavenger.  ( 2 min )
    Implicit models, latent compression, intrinsic biases, and cheap lunches in community detection. (arXiv:2210.09186v1 [cs.SI])
    The task of community detection, which aims to partition a network into clusters of nodes to summarize its large-scale structure, has spawned the development of many competing algorithms with varying objectives. Some community detection methods are inferential, explicitly deriving the clustering objective through a probabilistic generative model, while other methods are descriptive, dividing a network according to an objective motivated by a particular application, making it challenging to compare these methods on the same scale. Here we present a solution to this problem that associates any community detection objective, inferential or descriptive, with its corresponding implicit network generative model. This allows us to compute the description length of a network and its partition under arbitrary objectives, providing a principled measure to compare the performance of different algorithms without the need for "ground truth" labels. Our approach also gives access to instances of the community detection problem that are optimal to any given algorithm, and in this way reveals intrinsic biases in popular descriptive methods, explaining their tendency to overfit. Using our framework, we compare a number of community detection methods on artificial networks, and on a corpus of over 500 structurally diverse empirical networks. We find that more expressive community detection methods exhibit consistently superior compression performance on structured data instances, without having degraded performance on a minority of situations where more specialized algorithms perform optimally. Our results undermine the implications of the "no free lunch" theorem for community detection, both conceptually and in practice, since it is confined to unstructured data instances, unlike relevant community detection problems which are structured by requirement.  ( 3 min )
    Confound-leakage: Confound Removal in Machine Learning Leads to Leakage. (arXiv:2210.09232v1 [cs.LG])
    Machine learning (ML) approaches to data analysis are now widely adopted in many fields including epidemiology and medicine. To apply these approaches, confounds must first be removed as is commonly done by featurewise removal of their variance by linear regression before applying ML. Here, we show this common approach to confound removal biases ML models, leading to misleading results. Specifically, this common deconfounding approach can leak information such that what are null or moderate effects become amplified to near-perfect prediction when nonlinear ML approaches are subsequently applied. We identify and evaluate possible mechanisms for such confound-leakage and provide practical guidance to mitigate its negative impact. We demonstrate the real-world importance of confound-leakage by analyzing a clinical dataset where accuracy is overestimated for predicting attention deficit hyperactivity disorder (ADHD) with depression as a confound. Our results have wide-reaching implications for implementation and deployment of ML workflows and beg caution against na\"ive use of standard confound removal approaches.  ( 2 min )
    Distributed Estimation and Inference for Semi-parametric Binary Response Models. (arXiv:2210.08393v1 [math.ST])
    The development of modern technology has enabled data collection of unprecedented size, which poses new challenges to many statistical estimation and inference problems. This paper studies the maximum score estimator of a semi-parametric binary choice model under a distributed computing environment without pre-specifying the noise distribution. An intuitive divide-and-conquer estimator is computationally expensive and restricted by a non-regular constraint on the number of machines, due to the highly non-smooth nature of the objective function. We propose (1) a one-shot divide-and-conquer estimator after smoothing the objective to relax the constraint, and (2) a multi-round estimator to completely remove the constraint via iterative smoothing. We specify an adaptive choice of kernel smoother with a sequentially shrinking bandwidth to achieve the superlinear improvement of the optimization error over the multiple iterations. The improved statistical accuracy per iteration is derived, and a quadratic convergence up to the optimal statistical error rate is established. We further provide two generalizations to handle the heterogeneity of datasets with covariate shift and high-dimensional problems where the parameter of interest is sparse.  ( 2 min )
    Skeptical inferences in multi-label ranking with sets of probabilities. (arXiv:2210.08576v1 [stat.ML])
    In this paper, we consider the problem of making skeptical inferences for the multi-label ranking problem. We assume that our uncertainty is described by a convex set of probabilities (i.e., a credal set) defined over the set of labels. Instead of learning a singleton prediction (i.e., a single completed ranking over the labels), we thus seek skeptical inferences in terms of set-valued predictions consisting of completed rankings.  ( 2 min )
    Resolving the Mixing Time of the Langevin Algorithm to its Stationary Distribution for Log-Concave Sampling. (arXiv:2210.08448v1 [math.ST])
    Sampling from a high-dimensional distribution is a fundamental task in statistics, engineering, and the sciences. A particularly canonical approach is the Langevin Algorithm, i.e., the Markov chain for the discretized Langevin Diffusion. This is the sampling analog of Gradient Descent. Despite being studied for several decades in multiple communities, tight mixing bounds for this algorithm remain unresolved even in the seemingly simple setting of log-concave distributions over a bounded domain. This paper completely characterizes the mixing time of the Langevin Algorithm to its stationary distribution in this setting (and others). This mixing result can be combined with any bound on the discretization bias in order to sample from the stationary distribution of the continuous Langevin Diffusion. In this way, we disentangle the study of the mixing and bias of the Langevin Algorithm. Our key insight is to introduce a technique from the differential privacy literature to the sampling literature. This technique, called Privacy Amplification by Iteration, uses as a potential a variant of R\'enyi divergence that is made geometrically aware via Optimal Transport smoothing. This gives a short, simple proof of optimal mixing bounds and has several additional appealing properties. First, our approach removes all unnecessary assumptions required by other sampling analyses. Second, our approach unifies many settings: it extends unchanged if the Langevin Algorithm uses projections, stochastic mini-batch gradients, or strongly convex potentials (whereby our mixing time improves exponentially). Third, our approach exploits convexity only through the contractivity of a gradient step -- reminiscent of how convexity is used in textbook proofs of Gradient Descent. In this way, we offer a new approach towards further unifying the analyses of optimization and sampling algorithms.  ( 3 min )
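    For reference, one step of the (projected) Langevin Algorithm analyzed here is just a gradient step plus Gaussian noise; a minimal sketch with an illustrative target (a standard Gaussian restricted to the unit ball):
        import numpy as np

        def langevin_step(x, grad_f, eta, proj=lambda z: z):
            """Discretized Langevin Diffusion: gradient descent plus noise."""
            noise = np.sqrt(2 * eta) * np.random.standard_normal(x.shape)
            return proj(x - eta * grad_f(x) + noise)

        proj_ball = lambda z: z / max(1.0, np.linalg.norm(z))  # projection onto the unit ball
        x = np.zeros(2)
        for _ in range(10000):
            x = langevin_step(x, grad_f=lambda z: z, eta=1e-2, proj=proj_ball)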
    On the Identifiability and Estimation of Causal Location-Scale Noise Models. (arXiv:2210.09054v1 [stat.ML])
    We study the class of location-scale or heteroscedastic noise models (LSNMs), in which the effect $Y$ can be written as a function of the cause $X$ and a noise source $N$ independent of $X$, which may be scaled by a positive function $g$ over the cause, i.e., $Y = f(X) + g(X)N$. Despite the generality of the model class, we show the causal direction is identifiable up to some pathological cases. To empirically validate these theoretical findings, we propose two estimators for LSNMs: an estimator based on (non-linear) feature maps, and one based on probabilistic neural networks. Both model the conditional distribution of $Y$ given $X$ as a Gaussian parameterized by its natural parameters. Since the neural network approach can fit functions of arbitrary complexity, it has an edge over the feature map-based approach in terms of empirical performance. When the feature maps are correctly specified, however, we can prove that our estimator is jointly concave, which allows us to derive stronger guarantees for the cause-effect identification task.  ( 2 min )
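    A minimal sketch of the data-generating model (f and g here are illustrative choices, not from the paper), together with the natural-parameter view of the Gaussian conditional that both proposed estimators fit:
        import numpy as np

        rng = np.random.default_rng(0)
        n = 2000
        X = rng.uniform(-2, 2, n)
        f = lambda x: np.tanh(2 * x)                  # cause-effect mechanism
        g = lambda x: 0.2 + 0.5 * np.abs(x)           # positive scale function
        Y = f(X) + g(X) * rng.standard_normal(n)      # Y = f(X) + g(X)*N, N independent of X

        # Ground-truth natural parameters of the Gaussian p(Y|X):
        eta1 = f(X) / g(X) ** 2                       # mu / sigma^2
        eta2 = -1.0 / (2 * g(X) ** 2)                 # -1 / (2 sigma^2)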
    Positive-Unlabeled Learning using Random Forests via Recursive Greedy Risk Minimization. (arXiv:2210.08461v1 [cs.LG])
    The need to learn from positive and unlabeled data, or PU learning, arises in many applications and has attracted increasing interest. While random forests are known to perform well on many tasks with positive and negative data, recent PU algorithms are generally based on deep neural networks, and the potential of tree-based PU learning is under-explored. In this paper, we propose new random forest algorithms for PU learning. Key to our approach is a new interpretation of decision tree algorithms for positive and negative data as \emph{recursive greedy risk minimization algorithms}. We extend this perspective to the PU setting to develop new decision tree learning algorithms that directly minimize PU-data-based estimators of the expected risk. This allows us to develop an efficient PU random forest algorithm, PU extra trees. Our approach features three desirable properties: it is robust to the choice of the loss function, in the sense that various loss functions lead to the same decision trees; it requires little hyperparameter tuning compared to neural-network-based PU learning; and it supports a feature importance measure that directly quantifies a feature's contribution to risk minimization. Our algorithms demonstrate strong performance on several datasets. Our code is available at \url{https://github.com/puetpaper/PUExtraTrees}.  ( 2 min )
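    A minimal sketch of the kind of PU-data risk estimator such trees can greedily minimize (this is the standard non-negative PU estimator with a zero-one loss; the paper's exact estimators and splitting criterion may differ, and the class prior pi is assumed known):
        import numpy as np

        def pu_risk(scores_p, scores_u, pi, loss=lambda z: (z <= 0).astype(float)):
            """pi * E_p[l(g)] + max(0, E_u[l(-g)] - pi * E_p[l(-g)])."""
            r_p_pos = loss(scores_p).mean()           # positives classified negative
            r_u_neg = loss(-scores_u).mean()          # unlabeled classified positive
            r_p_neg = loss(-scores_p).mean()          # l(-g) on positives, for the correction
            return pi * r_p_pos + max(0.0, r_u_neg - pi * r_p_neg)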
    On Model Selection Consistency of Lasso for High-Dimensional Ising Models. (arXiv:2110.08500v3 [stat.ML] UPDATED)
    We theoretically analyze the model selection consistency of least absolute shrinkage and selection operator (Lasso) for high-dimensional Ising models. For random regular (RR) graphs of size $p$ with regular node degree $d$ and uniform couplings $\theta_0$, it is rigorously proved that Lasso without post-thresholding is model selection consistent in the whole paramagnetic phase with the same order of sample complexity $n=\Omega{(d^3\log{p})}$ as that of $\ell_1$-regularized logistic regression ($\ell_1$-LogR). This result is consistent with the conjecture in $\textit{Meng, Obuchi, and Kabashima 2021}$ using the non-rigorous replica method from statistical physics and thus complements it with a rigorous proof. For general tree-like graphs, it is demonstrated that the same result as RR graphs can be obtained under mild assumptions of the dependency condition and incoherence condition. Moreover, we provide a rigorous proof of the model selection consistency of Lasso with post-thresholding for general tree-like graphs in the paramagnetic phase without further assumptions on the dependency and incoherence conditions. Experimental results agree well with our theoretical analysis.  ( 2 min )
    Automatic Emergency Dust-Free solution on-board International Space Station with Bi-GRU (AED-ISS). (arXiv:2210.08549v1 [stat.AP])
    With rising attention to the issue of PM2.5 and PM0.3, particulate matter has become not only a potential threat to the environment and human health, but also a hazard to instruments onboard the International Space Station (ISS). Our team aims to relate various concentrations of particulate matter to magnetic fields, humidity, acceleration, temperature, pressure and CO2 concentration. Our goal is to establish an early warning system (EWS) that can forecast the levels of particulate matter and provide ample reaction time for astronauts to protect their instruments in some experiments, or increase the accuracy of their measurements. In addition, the constructed model can be further developed into a prototype of a remote-sensing smoke alarm for applications related to fires. In this article, we implement a Bi-GRU (Bidirectional Gated Recurrent Unit) algorithm that collects data from the past 90 minutes and predicts, for the next minute, the level of particulates larger than 2.5 micrometers per 0.1 liter, which serves as an early warning.  ( 2 min )
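    A minimal sketch of the forecaster's shape (the hidden size and the seven-sensor input are illustrative, not the paper's configuration): 90 minutes of multivariate telemetry in, the next-minute particulate level out.
        import torch
        import torch.nn as nn

        class BiGRUForecaster(nn.Module):
            def __init__(self, n_features=7, hidden=64):
                super().__init__()
                self.gru = nn.GRU(n_features, hidden, batch_first=True,
                                  bidirectional=True)
                self.head = nn.Linear(2 * hidden, 1)   # 2x for the two directions

            def forward(self, x):                      # x: (batch, 90, n_features)
                out, _ = self.gru(x)
                return self.head(out[:, -1])           # predict from the final time step

        model = BiGRUForecaster()
        pred = model(torch.randn(8, 90, 7))            # e.g. B-field, humidity, accel, T, P, CO2, ...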
    Distributionally Robust Causal Inference with Observational Data. (arXiv:2210.08326v1 [stat.ME])
    We consider the estimation of average treatment effects in observational studies without the standard assumption of unconfoundedness. We propose a new framework of robust causal inference under the general observational study setting with the possible existence of unobserved confounders. Our approach is based on the method of distributionally robust optimization and proceeds in two steps. We first specify the maximal degree to which the distribution of unobserved potential outcomes may deviate from that of observed outcomes. We then derive sharp bounds on the average treatment effects under this assumption. Our framework encompasses the popular marginal sensitivity model as a special case and can be extended to the difference-in-differences and regression discontinuity designs as well as instrumental variables. Through simulation and empirical studies, we demonstrate the applicability of the proposed methodology to real-world settings.  ( 2 min )
    SGD with Coordinate Sampling: Theory and Practice. (arXiv:2105.11818v2 [stat.ML] UPDATED)
    While classical forms of the stochastic gradient descent algorithm treat the different coordinates in the same way, a framework allowing for adaptive (non-uniform) coordinate sampling is developed to leverage structure in data. In a non-convex setting and including zeroth-order gradient estimates, almost sure convergence as well as non-asymptotic bounds are established. Within the proposed framework, we develop an algorithm, MUSKETEER, based on a reinforcement strategy: after collecting information on the noisy gradients, it samples the most promising coordinate (all for one); then it moves along the one direction yielding a substantial decrease of the objective (one for all). Numerical experiments on both synthetic and real data examples confirm the effectiveness of MUSKETEER in large-scale problems.  ( 2 min )
    Active Learning with Neural Networks: Insights from Nonparametric Statistics. (arXiv:2210.08367v1 [cs.LG])
    Deep neural networks have great representation power, but typically require large numbers of training examples. This motivates deep active learning methods that can significantly reduce the amount of labeled training data. Empirical successes of deep active learning have been recently reported in the literature, however, rigorous label complexity guarantees of deep active learning have remained elusive. This constitutes a significant gap between theory and practice. This paper tackles this gap by providing the first near-optimal label complexity guarantees for deep active learning. The key insight is to study deep active learning from the nonparametric classification perspective. Under standard low noise conditions, we show that active learning with neural networks can provably achieve the minimax label complexity, up to disagreement coefficient and other logarithmic terms. When equipped with an abstention option, we further develop an efficient deep active learning algorithm that achieves $\mathsf{polylog}(\frac{1}{\epsilon})$ label complexity, without any low noise assumptions. We also provide extensions of our results beyond the commonly studied Sobolev/H\"older spaces and develop label complexity guarantees for learning in Radon $\mathsf{BV}^2$ spaces, which have recently been proposed as natural function spaces associated with neural networks.  ( 2 min )

  • Open

    [P] Clustering Pokémon in 15 minutes with Machine Learning
    Hey everyone, thought this would be a fun read where we demonstrate clustering on the Pokémon character statistics database in just 15 minutes. Read here if you're interested! If you have any questions, feel free to ask away too. submitted by /u/PIEXCHANGE [link] [comments]  ( 124 min )
    [D] Video Tracking vs Image detection
    Hello all, how is video tracking different from image detection? From my understanding, tracking within a video can simply be done as per-frame object detection, then using NMS to combine these objects (based on their overlap). However, my friend told me this might not be an efficient method (because it is per-frame). What is the current norm for doing video tracking? Does it run at the per-frame level? submitted by /u/Dense-Smf-6032 [link] [comments]  ( 141 min )
    What is the best graph classification Graph Neural Network for large graphs ? [R] [P] [D]
    Hey everyone, I am currently doing a particle physics project in which I have to develop a GNN that classifies 2 different particle interactions. The problem is that depending on the cut, my graphs, which each represent an interaction of type 0 or 1, are on average of size 500 - 1 000 000 nodes with around 8 edges per node. I already understand that I will have to dynamically create the graphs and run them through the GNN, as there are around 100 000 events that need to be classified and having 100 000 events of size say 50 000 nodes in active memory is not possible. I was therefore wondering if reddit had an idea for the most interesting architecture and general algorithm to use in this scenario. The graphs are directed, the edges have no weights and the nodes have features. Each graph represents one particle interaction, and therefore this is a binary classification task. Please link me to any paper I might have missed on this, because so far I mostly find algorithms that do not scale well at all, or that are designed for node classification rather than graph classification. Thanks a lot for your help! submitted by /u/elii17 [link] [comments]  ( 123 min )
    [D] Question on Copyright Datasets in Europe
    Hi reddit, I recently read a lot of posts about AI art. A common theme is that the dataset used could be copyrighted and no permission is granted yet. The US's current law on this seems ambiguous. So, I looked into the Directive on Copyright in the Digital Single Market. To my understanding, in Europe, the creator of anything copyrighted should be able to opt their work out of AI training, right? If they do so, is the usage of their work considered illegal for commercial machine learning? Thanks for any answers. submitted by /u/PrinzChiyo [link] [comments]  ( 124 min )
    Paper looking at when DNNs learn different features during training? [R]
    For complex tasks (e.g. language generation), it's known that DNNs first learn simple patterns (e.g. word frequency), followed by more complex ones (e.g. grammar), followed by yet more complex ones (e.g. semantics). I'm looking for papers that specifically investigate this phenomenon, of what gets learned when during training. I'm sure there are simple cases that have been examined, for example a neural network trained on MNIST learning to distinguish 0 from 1 before it learns to distinguish 0 from 8. The one example I can think of off the top of my head is the Grokking paper, but that's looking at a slightly different (and less intuitive) phenomenon. Thanks in advance! submitted by /u/oops-no-more-profile [link] [comments]  ( 124 min )
    [D] Clustering after instance segmentation
    I trained a SINGLE class instance segmentation model with Detectron2 and YOLACT. Both perform quite well. What I want to do next: crop out detected instances; obtain image embeddings using PCA / (variational) autoencoders (any suggestions?); do some sort of clustering based on those embeddings (k-means, PCA). Does anyone think this pipeline makes sense? Could you guys provide any suggestions for image embedding techniques? I am expecting this pipeline to group the class object into 2 categories based on shape: straight and bent. This feature is most visible to the human eye, but I'm not sure if this works. Thanks a lot! submitted by /u/vocdex [link] [comments]  ( 127 min )
    [P] Awesome Image Segmentation Project Based on Deep Learning (5.6k star)
    Hi, I'd like to introduce PaddleSeg, which provides the ability to design, train and deploy segmentation models. This might be of some help to you. Hope you enjoy it. Code and docs: https://github.com/PaddlePaddle/PaddleSeg Feature set: supports several tasks (semantic segmentation, interactive segmentation, panoptic segmentation, image matting, etc.); provides 40+ state-of-the-art semantic segmentation models and 140+ high-quality pre-trained models; provides an efficient interactive segmentation tool (EISeg) for annotating segmentation images; releases a variety of human matting and portrait segmentation models for practical application without training; supports 3D medical image segmentation; etc. https://i.redd.it/pi0pjee2ncu91.gif https://i.redd.it/7cuiqre3ncu91.gif submitted by /u/Effective_Tax_2096 [link] [comments]  ( 124 min )
    [D] Now that Colab has introduced "compute units", which are the best free/cheap alternatives?
    I liked the flat fees of Colab Pro and Pro+. Maybe this new system is even cheaper for me, but there's something psychological about having to pay for credits instead of a flat fee that I don't like. I've heard Paperspace is good but the storage is expensive. What cheap/free alternatives are there? submitted by /u/zuccoff [link] [comments]  ( 126 min )
  • Open

    Using AI art to turn the Palace of Fine Arts into something “out of this world” 🪄
    submitted by /u/imaginfinity [link] [comments]  ( 109 min )
    Stable Diffusion Weekly AI Art Video and Images 10.16.22
    submitted by /u/prfitofthesngularity [link] [comments]  ( 108 min )
    A pragmatic metric for Artificial General Intelligence
    submitted by /u/lorepieri [link] [comments]  ( 108 min )
    Masters in AI
    Anyone here pursuing a master's in AI? Is there a university accepting a relatively lower GPA? I'd really appreciate honest feedback. submitted by /u/Electronic_Fold_4395 [link] [comments]  ( 109 min )
    ARTIFICIAL INTELLIGENCE - The Dark Side
    submitted by /u/IntelligentTip8745 [link] [comments]  ( 110 min )
    Weekly China AI News: VW Teams up with Horizon Robotics; Meet WeChat's New Language Model; AI Art Creates New Jobs
    submitted by /u/trcytony [link] [comments]  ( 110 min )
    What are good marketing approaches or ways to bring AI-generated art out of the virtual world?
    There is a machine that makes normal images look like they are painted, and I want to print AI-generated art with it. What are other cool ways to hype your AI-generated art? submitted by /u/xXNOdrugsForMEXx [link] [comments]  ( 111 min )
    David Chalmers, "Are Large Language Models Sentient?"
    submitted by /u/rickonti [link] [comments]  ( 109 min )
    AI Art Competition: Get your work printed, displayed, and sold in a NYC art gallery
    The Human-Assisted Art Project is putting on an AI-focused gallery exhibition to showcase the potential of AI for art and artists. We are running an open competition as part of this show. Submit your AI-generated work for a chance to have a large-scale museum-quality print of your work shown and potentially sold in a gallery exhibition this December! The pieces will also be displayed in a concurrent online walkthrough gallery. You can take part in the competition from anywhere in the world. For the in-person show, we are also collaborating with select artists to create additional AI-inspired works, and will have some interactive exhibits such as a live human performance of AI-generated music. Much more info is available at HumanAssistedArt.com, and I'm happy to answer any questions here as well. submitted by /u/FletcherHeisler [link] [comments]  ( 109 min )
    Where can we find an organized list of AI tools?
    I was trying to find particular AI tools and I could only find "The Top 10 Whatever AI" articles. It's very cumbersome, and you only find the most popular ones from the only ones the author know about. I do check here often, I get some AI newsletters, some AI channels in youtube… and I still find very cool AI projects that have been around for a while and I never knew about them. I remember those GitHub awesome lists, where the community creates and maintains a 360º view of tools and resources related to a specific topic. Is there something like this out there? If not, would the community be willing to participate in a new one? submitted by /u/kmtrp [link] [comments]  ( 112 min )
    I'm doing a research paper on black boxes and artificial intelligence. Besides the drawbacks, what else should I talk about ?
    submitted by /u/Careless-Yogurt-7871 [link] [comments]  ( 109 min )
    DALL-E 2 on Microsoft Azure: the market for generative models matures
    submitted by /u/bendee983 [link] [comments]  ( 108 min )
    GANs in The Age of Diffusion Models
    The new cool kid in town, diffusion models, has somehow made GANs look obsolete. When it comes to generating images, tools like DALLE-2, Stable Diffusion, and Midjourney have been excelling at the task and have taken over the field completely. While there are obvious reasons why diffusion models are gaining popularity for image synthesis, generative adversarial networks (GANs) saw the same popularity and sparked the same interest when they were revived in 2017, three years after they were proposed by Ian Goodfellow. https://analyticsindiamag.com/gans-in-the-age-of-diffusion-models/ submitted by /u/analyticsindiam [link] [comments]  ( 112 min )
    Using Ai Chatbots in Games.
    Do you guys think game developers will start using AI chatbots in video games to let the player talk to NPCs? Like, instead of having a list of things you can say, you could talk about anything with an NPC and they would tell you anything they know. I'm not sure, but if you used something like Character.ai then you could even give them distinct personalities, with things they know about and things they can't tell you because they don't know. submitted by /u/Fiuron [link] [comments]  ( 110 min )
    What is machine learning, and why do we need it?
    submitted by /u/edvanceredu [link] [comments]  ( 113 min )
    AI-related things I can gift to someone?
    Hi everyone, I want to get my fiancée a gift for her birthday; she is a PhD student in AI. I work in IT too but more on the coding side. I would love some cool ideas for things I can gift her related to AI. I thought about buying her a subscription to Midjourney or OpenAI, but if you have a list of cool AIs she can try, it would help me a lot. Thank you so much in advance. submitted by /u/ilyabarigou [link] [comments]  ( 111 min )
    Meet Ai-Da, The World’s First Robot Artist
    submitted by /u/sopadebombillas [link] [comments]  ( 107 min )
    Is there an AI powered tool which can help guide blind people through daily life tasks/video games?
    I heard about this project called Mars Vision which had this goal in mind, but I don't think they ever got it off the ground, as I have not heard anything about them in a while. Basically, it would be some kind of complex computer vision AI that would see a ladder in a video game, and then play an accompanying spatial sound to tell the user where to go in the video game or real life. For example, if I issued the command "Where's my cat?", it would scan a camera feed and when it located what it thought was a cat, it would play a sound to guide me there. Does this exist? I can't find anything online, but it would quite literally be life-changing for me. There are also tools like The vOICe, but this is synthetic vision and has almost nothing to do with AI; it is relevant in the sense that it is a sound-based solution that can help guide blind users. Thoughts? submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 110 min )
    TrollPhace- The Cube visuals by AI Manifest
    All audio credits to TrollPhace! submitted by /u/Available_Tadpole829 [link] [comments]  ( 110 min )
    If implementing any cognitive function was as easy as making an API call, what would you develop?
    I am curious as to what you can imagine. Assume that by making an API call you could inject into any system any number of cognitive functions akin to, but not limited by, what biological organisms can do. The ability to see, recognize patterns, contextualize, optimize, act, regulate, plan, navigate the world, be social, protect a target, find resources, build…etc. Anything you can imagine except being conscious (assuming consciousness is beyond scope) submitted by /u/ImportanceDecent92 [link] [comments]  ( 109 min )
    I'm blind, and while I love what we're doing with AI for art, is there anything the blind community can play around with like Music?
    Not super sure if this exists, but I would personally love being able to type into a prompt field and have some sort of music generated, whether orchestral or otherwise. I could spend probably hours just showing it to friends, but upon further inspection, I don't think this exists? Would love to hear what you guys think though. Thanks! submitted by /u/ChipsAhoiMcCoy [link] [comments]  ( 115 min )
  • Open

    Ukrainian Success Through the Lens of Agile
    The world is watching the war Russia is waging against the Ukrainians. Most of it applauds how the Ukrainians adapt to a rapidly evolving environment and occasionally prevail. This article describes how the Ukrainians' agility serves them quite well in the conflict. The post Ukrainian Success Through the Lens of Agile appeared first on Data Science Central.  ( 24 min )
    Data Management as a Business Discipline – Part 3: Enabling Frameworks
    A Business Discipline consists of systematic research, observation, measurement, and experimentation resulting in the assimilation of learnings into laws, theorems, concepts, principles, practices, frameworks, and formulas to enable the consistent application and ongoing enhancements from the real-world application of that discipline. The post Data Management as a Business Discipline – Part 3: Enabling Frameworks appeared first on Data Science Central.  ( 20 min )
    Python File Input/Output: Read & Write Files in Python
    Python is an object-oriented, high-level language that provides many built-in functions which are easy for users to work with. Python has built-in functions for handling files, such as creating a file, writing into a file, reading a file, and updating a file. Scope of the Article → In this article, we will learn about file handling… Read More »Python File Input/Output: Read & Write Files in Python The post Python File Input/Output: Read & Write Files in Python appeared first on Data Science Central.  ( 22 min )
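    A short example of the built-ins the article covers (the file name is illustrative):
        # Create/write, append, then read a text file with Python's built-in open().
        with open("notes.txt", "w") as f:      # "w" creates or truncates the file
            f.write("first line\n")

        with open("notes.txt", "a") as f:      # "a" appends without truncating
            f.write("second line\n")

        with open("notes.txt", "r") as f:      # "r" reads; the with-block closes the file
            for line in f:
                print(line.rstrip())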
    What is trustable data? Why do you need it?
    Trustable data can be defined as data that comes from specific and trusted sources and is used according to its intended purpose. It is delivered in the appropriate format and time frames for specific users. The post What is trustable data? Why do you need it? appeared first on Data Science Central.  ( 20 min )
    Top 5 Blockchain Certifications to Enhance your Job-Readiness
    Although blockchain technology has been around for only about a decade, it has earned widespread popularity across the globe. Of late, blockchain has been accepted across industry sectors as a disruptive technology that helps improve business processes. Initially conceived as a technology to keep digital monetary transactions safe and tamper-proof, the technology has… Read More »Top 5 Blockchain Certifications to Enhance your Job-Readiness The post Top 5 Blockchain Certifications to Enhance your Job-Readiness appeared first on Data Science Central.  ( 20 min )
  • Open

    "CARP: Robust Preference Learning for Storytelling via Contrastive Reinforcement Learning", Castricato et al 2022 {EleutherAI/CarperAI}
    submitted by /u/gwern [link] [comments]  ( 118 min )
    Partially observable Continuous Control Gym Environment
    Hi, I am doing a bit of research on partially observable environments. Previously I was working with an environment which claimed to have partially observable observations, but PPO was able to solve it using just the current observation. My question is: which partially observable continuous control environments are generally used in research settings? submitted by /u/ZIGGY-Zz [link] [comments]  ( 120 min )
    REINFORCE in frozen lake
    I've started following Spinning Up's RL roadmap, and started off with REINFORCE with baseline from Sutton and Barto. I decided to implement it for the frozen lake environment on Google Colab, but the agent fails to learn anything at all. The environment is a 4x4 grid with 16 states in all and 4 actions. I think the cause is one of two things. (1) My state and feature vector representation: I've assumed the weights to be linear in my features, and I'm representing the state as a vector of size 4, the binary representation of the state returned by the env. In the book, action preferences are calculated using a feature vector x(s,a), so I represent x as a vector of size 6: the first 4 places encode the state and the last 2 encode the action in binary. These representations let me use weight vectors w and theta of sizes 4 and 6, but I suspect I'm making a mistake here (see the sketch below for an alternative feature encoding). (2) The actual code being faulty. This is the colab notebook in case anyone wants to view it. Any idea what is causing my agent to fail? submitted by /u/anuraagshankar [link] [comments]  ( 116 min )
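    One plausible culprit is the compact binary encoding: action preferences that are linear in 4 state bits plus 2 action bits cannot express arbitrary per-state action preferences, so the policy may be unable to represent a good solution at all. Below is a hedged sketch of the standard one-hot (tabular) alternative for a 4x4 FrozenLake; the sizes follow the post, but the function names are ours, not from the notebook.

        import numpy as np

        N_STATES, N_ACTIONS = 16, 4

        def x(s, a):
            # One-hot state-action feature (size 64) instead of 6 binary bits.
            v = np.zeros(N_STATES * N_ACTIONS)
            v[s * N_ACTIONS + a] = 1.0
            return v

        def action_probs(s, theta):
            # Softmax over linear action preferences h(s, a) = theta . x(s, a).
            prefs = np.array([theta @ x(s, a) for a in range(N_ACTIONS)])
            prefs -= prefs.max()            # for numerical stability
            e = np.exp(prefs)
            return e / e.sum()

    With this encoding, theta holds one weight per state-action pair, so REINFORCE with baseline can learn an independent preference for every cell of the grid.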
    Deep Reinforcement Learning and Entity Component Systems
    I am happy to announce a new blog about bridges between video games and deep reinforcement learning. First, I will build an entity-component system (ECS) able to handle realistic multiplayer distributed 3D games integrated with deep learning systems. https://ludc.github.io/video_games_and_deep_reinforcement_learning/ It is a personal blog that only contains my thoughts. It is associated with a github repository that will evolve week after week, https://github.com/ludc/gymecs, until we obtain a satisfying multiplayer 3D game engine, pure Python, dedicated to research and fun. Of course, I don't expect to reach AAA quality since Python is bad at rendering, but we will still have good 3D graphics (this will come in a few weeks) and strong AI! Let's keep in touch. submitted by /u/ludoden [link] [comments]  ( 117 min )
    How do I define a state for this problem
    I'm working on a problem where an agent has to decide whether to place or cancel orders based on stock prices. In a case like this, how do I define a state? My concern is that just considering the current price would not give a fully Markov state. My rationale is that in a falling market it is more likely that the price continues to fall, so the history of states plays an important role. In this case, should I take a fixed-length vector of historical prices (e.g. the last 10 prices) as my state? And in such a case, would I have to use an RNN to approximate the value function? submitted by /u/theanswerisnt42 [link] [comments]  ( 120 min )
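    A fixed window of recent prices is indeed a common way to approximate a Markov state, and with a fixed-length window a plain feed-forward value function suffices; an RNN becomes necessary mainly when you feed one observation per step and let the network carry the history itself. A minimal sketch, where the window length and the normalization are our own illustrative choices, not anything from the post:

        from collections import deque
        import numpy as np

        class PriceWindowState:
            """State = last k prices, expressed as returns relative to the
            current price. Assumes at least one price has been observed."""
            def __init__(self, k=10):
                self.k = k
                self.prices = deque(maxlen=k)

            def update(self, price):
                self.prices.append(price)

            def vector(self):
                p = np.array(self.prices, dtype=float)
                if len(p) < self.k:  # pad with the oldest price until the window fills
                    p = np.concatenate([np.full(self.k - len(p), p[0]), p])
                return p / p[-1] - 1.0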
    RL for internet protocols
    I vaguely remember the website of a company that uses RL for internet protocols. Let's speculate about this task. Imagine you are in charge of creating a protocol that must replace standard TCP. Alice and Bob are 2 cooperating agents that must use IP in order to transfer a file between A and B without data loss in the minimum possible time (beware that there is also congestion control). Possible actions for the sender: wait Z milliseconds; send from byte X to byte Y; others I can't think of now. Possible actions for the receiver: send an acknowledgement certifying you correctly received bytes in the range [X,Y]; others I can't think of now. First question: what would you read? Are there articles in the literature about this task? Second question: which algorithm or set of algorithms do you think is best suited for the task, and why? submitted by /u/Moist_Turnip [link] [comments]  ( 118 min )
  • Open

    The science of strength: How data analytics is transforming college basketball
    Adam Petway, strength and conditioning coach for the University of Louisville, is using his MIT Professional Education training to improve player performance off the court.  ( 7 min )
  • Open

    Host code-server on Amazon SageMaker
    Machine learning (ML) teams need the flexibility to choose their integrated development environment (IDE) when working on a project. It allows you to have a productive developer experience and innovate at speed. You may even use multiple IDEs within a project. Amazon SageMaker lets ML teams choose to work from fully managed, cloud-based environments within […]  ( 7 min )
    Real estate brokerage firm John L. Scott uses Amazon Textract and Amazon Comprehend to strike racially restrictive language from property deeds for homeowners
    Founded more than 91 years ago in Seattle, John L. Scott Real Estate’s core value is Living Life as a Contribution®. The firm helps homebuyers find and buy the home of their dreams, while also helping sellers move into the next chapter of their home ownership journey. John L. Scott currently operates over 100 offices […]  ( 6 min )
  • Open

    How the orbit of one planet appears from another
    This post shows how the orbits of some planets appear from other planets. I give a few of my favorite examples and include a Python program you could use to create your own plots. We will assume the planets move in circles around the sun. They don't exactly—they don't exactly move in ellipses either—but their orbits […] How the orbit of one planet appears from another first appeared on John D. Cook.  ( 5 min )
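    In the same spirit as the post's program (which is not reproduced here), a short sketch assuming circular, coplanar orbits; the radii and periods are approximate published values, and the rest is our own illustration.

        import numpy as np
        import matplotlib.pyplot as plt

        def apparent_orbit(r_home, T_home, r_other, T_other, years=30):
            # Position of each planet on a circle, encoded as a complex number;
            # the apparent orbit is simply the difference of the two paths.
            t = np.linspace(0, years, 20000)
            home  = r_home  * np.exp(2j * np.pi * t / T_home)
            other = r_other * np.exp(2j * np.pi * t / T_other)
            rel = other - home
            plt.plot(rel.real, rel.imag, linewidth=0.5)
            plt.gca().set_aspect("equal")
            plt.show()

        # Mars as seen from Earth: radii in AU, periods in years (approximate).
        apparent_orbit(1.0, 1.0, 1.524, 1.881)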
  • Open

    SOCRATES: A Stereo Camera Trap for Monitoring of Biodiversity. (arXiv:2209.09070v2 [cs.CV] UPDATED)
    The development and application of modern technology is an essential basis for the efficient monitoring of species in natural habitats and landscapes to trace the development of ecosystems, species communities, and populations, and to analyze reasons for changes. For estimating animal abundance using methods such as camera trap distance sampling, spatial information of natural habitats in terms of 3D (three-dimensional) measurements is crucial. Additionally, 3D information improves the accuracy of animal detection using camera trapping. This study presents a novel approach to 3D camera trapping featuring highly optimized hardware and software. This approach employs stereo vision to infer 3D information of natural habitats and is designated as StereO CameRA Trap for monitoring of biodivErSity (SOCRATES). A comprehensive evaluation of SOCRATES shows not only a $3.23\%$ improvement in animal detection (bounding box $\text{mAP}_{75}$) but also its superior applicability for estimating animal abundance using camera trap distance sampling. The software and documentation of SOCRATES are provided at https://github.com/timmh/socrates  ( 2 min )
    EcoFormer: Energy-Saving Attention with Linear Complexity. (arXiv:2209.09004v2 [cs.CV] UPDATED)
    Transformer is a transformative framework that models sequential data and has achieved remarkable performance on a wide range of tasks, but with high computational and energy cost. To improve its efficiency, a popular choice is to compress the models via binarization, which constrains the floating-point values into binary ones to significantly reduce resource consumption via cheap bitwise operations. However, existing binarization methods only aim at minimizing the information loss for the input distribution statistically, while ignoring the pairwise similarity modeling at the core of the attention. To this end, we propose a new binarization paradigm customized to high-dimensional softmax attention via kernelized hashing, called EcoFormer, to map the original queries and keys into low-dimensional binary codes in Hamming space. The kernelized hash functions are learned to match the ground-truth similarity relations extracted from the attention map in a self-supervised way. Based on the equivalence between the inner product of binary codes and the Hamming distance as well as the associative property of matrix multiplication, we can approximate the attention in linear complexity by expressing it as a dot-product of binary codes. Moreover, the compact binary representations of queries and keys enable us to replace most of the expensive multiply-accumulate operations in attention with simple accumulations to save considerable on-chip energy footprint on edge devices. Extensive experiments on both vision and language tasks show that EcoFormer consistently achieves comparable performance with standard attentions while consuming much fewer resources. For example, based on PVTv2-B0 and ImageNet-1K, EcoFormer achieves a 73% on-chip energy footprint reduction with only a 0.33% performance drop compared to the standard attention. Code is available at https://github.com/ziplab/EcoFormer.  ( 3 min )
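    The identity behind the linear-complexity claim is elementary: for b-bit codes with entries in {-1, +1}, the inner product equals b minus twice the Hamming distance, so attention scores over binary codes need only cheap accumulations. A sketch with random sign codes standing in for the paper's learned kernelized hashes:

        import numpy as np

        def binary_scores(Qb, Kb):
            # For +/-1 codes: matches - mismatches = b - 2 * Hamming distance,
            # which equals the ordinary inner product Qb @ Kb.T.
            b = Qb.shape[-1]
            hamming = np.sum(Qb[:, None, :] != Kb[None, :, :], axis=-1)
            return b - 2 * hamming

        Qb = np.sign(np.random.randn(4, 16))   # stand-ins for hashed queries
        Kb = np.sign(np.random.randn(5, 16))   # stand-ins for hashed keys
        assert np.allclose(binary_scores(Qb, Kb), Qb @ Kb.T)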
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v2 [cs.LG] UPDATED)
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct \emph{PAC prediction sets}, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.  ( 2 min )
    Uncertainty Quantification with Pre-trained Language Models: A Large-Scale Empirical Analysis. (arXiv:2210.04714v2 [cs.CL] UPDATED)
    Pre-trained language models (PLMs) have gained increasing popularity due to their compelling prediction performance in diverse natural language processing (NLP) tasks. When formulating a PLM-based prediction pipeline for NLP tasks, it is also crucial for the pipeline to minimize the calibration error, especially in safety-critical applications. That is, the pipeline should reliably indicate when we can trust its predictions. In particular, there are various considerations behind the pipeline: (1) the choice and (2) the size of PLM, (3) the choice of uncertainty quantifier, (4) the choice of fine-tuning loss, and many more. Although prior work has looked into some of these considerations, they usually draw conclusions based on a limited scope of empirical studies. There still lacks a holistic analysis on how to compose a well-calibrated PLM-based prediction pipeline. To fill this void, we compare a wide range of popular options for each consideration based on three prevalent NLP classification tasks and the setting of domain shift. In response, we recommend the following: (1) use ELECTRA for PLM encoding, (2) use larger PLMs if possible, (3) use Temp Scaling as the uncertainty quantifier, and (4) use Focal Loss for fine-tuning.  ( 3 min )
    Double Sampling Randomized Smoothing. (arXiv:2206.07912v3 [cs.LG] UPDATED)
    Neural networks (NNs) are known to be vulnerable against adversarial perturbations, and thus there is a line of work aiming to provide robustness certification for NNs, such as randomized smoothing, which samples smoothing noises from a certain distribution to certify the robustness for a smoothed classifier. However, as shown by previous work, the certified robust radius in randomized smoothing suffers from scaling to large datasets ("curse of dimensionality"). To overcome this hurdle, we propose a Double Sampling Randomized Smoothing (DSRS) framework, which exploits the sampled probability from an additional smoothing distribution to tighten the robustness certification of the previous smoothed classifier. Theoretically, under mild assumptions, we prove that DSRS can certify $\Theta(\sqrt d)$ robust radius under $\ell_2$ norm where $d$ is the input dimension, implying that DSRS may be able to break the curse of dimensionality of randomized smoothing. We instantiate DSRS for a generalized family of Gaussian smoothing and propose an efficient and sound computing method based on customized dual optimization considering sampling error. Extensive experiments on MNIST, CIFAR-10, and ImageNet verify our theory and show that DSRS certifies larger robust radii than existing baselines consistently under different settings. Code is available at https://github.com/llylly/DSRS.
    AI-Assisted Discovery of Quantitative and Formal Models in Social Science. (arXiv:2210.00563v2 [cs.SC] UPDATED)
    In social science, formal and quantitative models, such as ones describing economic growth and collective action, are used to formulate mechanistic explanations, provide predictions, and uncover questions about observed phenomena. Here, we demonstrate the use of a machine learning system to aid the discovery of symbolic models that capture nonlinear and dynamical relationships in social science datasets. By extending neuro-symbolic methods to find compact functions and differential equations in noisy and longitudinal data, we show that our system can be used to discover interpretable models from real-world data in economics and sociology. Augmenting existing workflows with symbolic regression can help uncover novel relationships and explore counterfactual models during the scientific process. We propose that this AI-assisted framework can bridge parametric and non-parametric models commonly employed in social science research by systematically exploring the space of nonlinear models and enabling fine-grained control over expressivity and interpretability.  ( 2 min )
    Interpretable (not just posthoc-explainable) medical claims modeling for discharge placement to prevent avoidable all-cause readmissions or death. (arXiv:2208.12814v2 [cs.CY] UPDATED)
    We developed an inherently interpretable multilevel Bayesian framework for representing variation in regression coefficients that mimics the piecewise linearity of ReLU-activated deep neural networks. We used the framework to formulate a survival model for using medical claims to predict hospital readmission and death that focuses on discharge placement, adjusting for confounding in estimating causal local average treatment effects. We trained the model on a 5\% sample of Medicare beneficiaries from 2008 and 2011, based on their 2009--2011 inpatient episodes, and then tested the model on 2012 episodes. The model scored an AUROC of approximately 0.76 on predicting all-cause readmissions (defined using official CMS methodology) or death within 30 days of discharge, being competitive against XGBoost and a Bayesian deep neural network, demonstrating that one need not sacrifice interpretability for accuracy. Crucially, we provide what blackboxes cannot -- the exact gold-standard global interpretation of the model, identifying relative risk factors and quantifying the effect of discharge placement. We also show that the posthoc explainer SHAP fails to provide accurate explanations.  ( 3 min )
    MAGIC: Microlensing Analysis Guided by Intelligent Computation. (arXiv:2206.08199v2 [astro-ph.IM] UPDATED)
    The modeling of binary microlensing light curves via the standard sampling-based method can be challenging, because of the time-consuming light-curve computation and the pathological likelihood landscape in the high-dimensional parameter space. In this work, we present MAGIC, which is a machine-learning framework to efficiently and accurately infer the microlensing parameters of binary events with realistic data quality. In MAGIC, binary microlensing parameters are divided into two groups and inferred separately with different neural networks. The key feature of MAGIC is the introduction of a neural controlled differential equation, which provides the capability to handle light curves with irregular sampling and large data gaps. Based on simulated light curves, we show that MAGIC can achieve fractional uncertainties of a few percent on the binary mass ratio and separation. We also test MAGIC on a real microlensing event. MAGIC is able to locate degenerate solutions even when large data gaps are introduced. As irregular samplings are common in astronomical surveys, our method also has implications for other studies that involve time series.  ( 3 min )
    Geometric Scattering on Measure Spaces. (arXiv:2208.08561v2 [stat.ML] UPDATED)
    The scattering transform is a multilayered, wavelet-based transform initially introduced as a model of convolutional neural networks (CNNs) that has played a foundational role in our understanding of these networks' stability and invariance properties. Subsequently, there has been widespread interest in extending the success of CNNs to data sets with non-Euclidean structure, such as graphs and manifolds, leading to the emerging field of geometric deep learning. In order to improve our understanding of the architectures used in this new field, several papers have proposed generalizations of the scattering transform for non-Euclidean data structures such as undirected graphs and compact Riemannian manifolds without boundary. In this paper, we introduce a general, unified model for geometric scattering on measure spaces. Our proposed framework includes previous work on geometric scattering as special cases but also applies to more general settings such as directed graphs, signed graphs, and manifolds with boundary. We propose a new criterion that identifies to which groups a useful representation should be invariant and show that this criterion is sufficient to guarantee that the scattering transform has desirable stability and invariance properties. Additionally, we consider finite measure spaces that are obtained from randomly sampling an unknown manifold. We propose two methods for constructing a data-driven graph on which the associated graph scattering transform approximates the scattering transform on the underlying manifold. Moreover, we use a diffusion-maps based approach to prove quantitative estimates on the rate of convergence of one of these approximations as the number of sample points tends to infinity. Lastly, we showcase the utility of our method on spherical images, directed graphs, and on high-dimensional single-cell data.  ( 3 min )
    Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP. (arXiv:2208.05516v2 [cs.LG] UPDATED)
    Web-crawled datasets have enabled remarkable generalization capabilities in recent image-text models such as CLIP (Contrastive Language-Image pre-training) or Flamingo, but little is known about the dataset creation processes. In this work, we introduce a testbed of six publicly available data sources - YFCC, LAION, Conceptual Captions, WIT, RedCaps, Shutterstock - to investigate how pre-training distributions induce robustness in CLIP. We find that the performance of the pre-training data varies substantially across distribution shifts, with no single data source dominating. Moreover, we systematically study the interactions between these data sources and find that combining multiple sources does not necessarily yield better models, but rather dilutes the robustness of the best individual data source. We complement our empirical findings with theoretical insights from a simple setting, where combining the training data also results in diluted robustness. In addition, our theoretical model provides a candidate explanation for the success of the CLIP-based data filtering technique recently employed in the LAION dataset. Overall our results demonstrate that simply gathering a large amount of data from the web is not the most effective way to build a pre-training dataset for robust generalization, necessitating further study into dataset design. Code is available at https://github.com/mlfoundations/clip_quality_not_quantity.  ( 3 min )
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v2 [stat.ML] UPDATED)
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.  ( 2 min )
    Bootstrapping Multilingual Semantic Parsers using Large Language Models. (arXiv:2210.07313v1 [cs.CL])
    Despite cross-lingual generalization demonstrated by pre-trained multilingual models, the translate-train paradigm of transferring English datasets across multiple languages remains the key ingredient for training task-specific multilingual models. However, for many low-resource languages, the availability of a reliable translation service entails significant amounts of costly human-annotated translation pairs. Further, the translation services for low-resource languages may continue to be brittle due to domain mismatch between the task-specific input text and the general-purpose text used while training the translation models. We consider the task of multilingual semantic parsing and demonstrate the effectiveness and flexibility offered by large language models (LLMs) for translating English datasets into several languages via few-shot prompting. We provide (i) extensive comparisons with prior translate-train methods across 50 languages demonstrating that LLMs can serve as highly effective data translators, outperforming prior translation-based methods on 40 out of 50 languages; and (ii) a comprehensive study of the key design choices that enable effective data translation via prompted LLMs.  ( 2 min )
    BLOX: Macro Neural Architecture Search Benchmark and Algorithms. (arXiv:2210.07271v1 [cs.LG])
    Neural architecture search (NAS) has been successfully used to design numerous high-performance neural networks. However, NAS is typically compute-intensive, so most existing approaches restrict the search to decide the operations and topological structure of a single block only, then the same block is stacked repeatedly to form an end-to-end model. Although such an approach reduces the size of search space, recent studies show that a macro search space, which allows blocks in a model to be different, can lead to better performance. To provide a systematic study of the performance of NAS algorithms on a macro search space, we release Blox - a benchmark that consists of 91k unique models trained on the CIFAR-100 dataset. The dataset also includes runtime measurements of all the models on a diverse set of hardware platforms. We perform extensive experiments to compare existing algorithms that are well studied on cell-based search spaces, with the emerging blockwise approaches that aim to make NAS scalable to much larger macro search spaces. The benchmark and code are available at https://github.com/SamsungLabs/blox.  ( 2 min )
    Walk-and-Relate: A Random-Walk-based Algorithm for Representation Learning on Sparse Knowledge Graphs. (arXiv:2209.08769v2 [cs.LG] UPDATED)
    Knowledge graph (KG) embedding techniques use structured relationships between entities to learn low-dimensional representations of entities and relations. The traditional KG embedding techniques (such as TransE and DistMult) estimate these embeddings via simple models developed over observed KG triplets. These approaches differ in their triplet scoring loss functions. As these models only use the observed triplets to estimate the embeddings, they are prone to suffer from the data sparsity that usually occurs in real-world knowledge graphs, i.e., the lack of enough triplets per entity. To address this issue, we propose an efficient method to augment the number of triplets. We use random walks to create additional triplets, such that the relations carried by these introduced triplets entail the metapath induced by the random walks. We also provide approaches to accurately and efficiently select informative metapaths from the possible set of metapaths induced by the random walks. The proposed approaches are model-agnostic, and the augmented training dataset can be used with any KG embedding approach out of the box. Experimental results obtained on the benchmark datasets show the advantages of the proposed approach.
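    A rough sketch of the augmentation step as described: short random walks over the observed triplets yield new triplets whose relation is the metapath induced by the walk. The paper's filtering of informative metapaths is omitted, and all names below are ours.

        import random

        def augment_triplets(triplets, walk_len=2, n_walks=1000):
            out = {}
            for h, r, t in triplets:
                out.setdefault(h, []).append((r, t))
            augmented = []
            for _ in range(n_walks):
                start = random.choice(list(out))
                path, cur = [], start
                for _ in range(walk_len):
                    if cur not in out:          # dead end: discard this walk
                        break
                    r, cur = random.choice(out[cur])
                    path.append(r)
                else:
                    # New triplet whose relation is the induced metapath.
                    augmented.append((start, tuple(path), cur))
            return augmented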
    Ergodic variational flows. (arXiv:2205.07475v2 [stat.ML] UPDATED)
    This work presents a new class of variational family -- ergodic variational flows -- that not only enables tractable i.i.d. sampling and density evaluation, but also comes with MCMC-like convergence guarantees. Ergodic variational flows consist of a mixture of repeated applications of a measure-preserving and ergodic map to an initial reference distribution. We provide mild conditions under which the variational distribution converges weakly and in total variation to the target as the number of steps in the flow increases; this convergence holds regardless of the value of variational parameters, though different parameter values may result in faster or slower convergence. We develop a practical implementation of the flow family using Hamiltonian dynamics combined with deterministic momentum refreshment, including a tunable step size to optimize the trade-off between simulation fidelity and computational cost. Simulated and real data experiments provide an empirical verification of the convergence theory, and demonstrate that the method provides more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.  ( 2 min )
    Recognition Models to Learn Dynamics from Partial Observations with Neural ODEs. (arXiv:2205.12550v2 [eess.SY] UPDATED)
    Identifying dynamical systems from experimental data is a notably difficult task. Prior knowledge generally helps, but the extent of this knowledge varies with the application, and customized models are often needed. Neural ordinary differential equations can be written as a flexible framework for system identification and can incorporate a broad spectrum of physical insight, giving physical interpretability to the resulting latent space. In the case of partial observations, however, the data points cannot directly be mapped to the latent state of the ODE. Hence, we propose to design recognition models, in particular inspired by nonlinear observer theory, to link the partial observations to the latent state. We demonstrate the performance of the proposed approach on numerical simulations and on an experimental dataset from a robotic exoskeleton.  ( 2 min )
    MaxWeight With Discounted UCB: A Provably Stable Scheduling Policy for Nonstationary Multi-Server Systems With Unknown Statistics. (arXiv:2209.01126v2 [cs.LG] UPDATED)
    Multi-server queueing systems are widely used models for job scheduling in machine learning, wireless networks, and crowdsourcing. This paper considers a multi-server system with multiple servers and multiple types of jobs. The system maintains a separate queue for each type of jobs. For each time slot, each available server picks a job from a queue and then serves the job until it is complete. The arrival rates of the queues and the mean service times are unknown and even nonstationary. We propose the MaxWeight with discounted upper confidence bound (UCB) algorithm, which simultaneously learns the statistics and schedules jobs to servers. We prove that the proposed algorithm can stabilize the queues when the arrival rates are strictly within the service capacity region. Specifically, we prove that the queue lengths are bounded in the mean under the assumption that the mean service times change relatively slowly over time and the arrival rates are bounded away from the capacity region by a constant whose value depends on the discount factor used in the discounted UCB. Simulation results confirm that the proposed algorithm can stabilize the queues and that it outperforms MaxWeight with empirical mean and MaxWeight with discounted empirical mean. The proposed algorithm is also better than MaxWeight with UCB in the nonstationary setting.  ( 3 min )
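    A simplified sketch of the rule's flavor, not the paper's exact algorithm: discounted statistics give a UCB estimate of each queue's service rate, and MaxWeight serves the queue maximizing queue length times that estimate. The notation and the assumption that service rates lie in [0, 1] are ours.

        import numpy as np

        def pick_queue(queue_lens, disc_sum, disc_cnt, disc_time):
            # disc_sum / disc_cnt: discounted sums of observed service outcomes
            # and of visits; disc_time: discounted total number of time slots.
            mean = disc_sum / np.maximum(disc_cnt, 1e-9)
            bonus = np.sqrt(np.log(max(disc_time, 2.0)) / np.maximum(disc_cnt, 1e-9))
            ucb = np.minimum(mean + bonus, 1.0)
            return int(np.argmax(np.asarray(queue_lens) * ucb))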
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v4 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that the ICK framework can be used to incorporate prior information into neural networks in many applications.  ( 3 min )
    Augmenting Neural Networks with Priors on Function Values. (arXiv:2202.04798v4 [cs.LG] UPDATED)
    The need for function estimation in label-limited settings is common in the natural sciences. At the same time, prior knowledge of function values is often available in these domains. For example, data-free biophysics-based models can be informative on protein properties, while quantum-based computations can be informative on small molecule properties. How can we coherently leverage such prior knowledge to help improve a neural network model that is quite accurate in some regions of input space -- typically near the training data -- but wildly wrong in other regions? Bayesian neural networks (BNN) enable the user to specify prior information only on the neural network weights, not directly on the function values. Moreover, there is in general no clear mapping between these. Herein, we tackle this problem by developing an approach to augment BNNs with prior information on the function values themselves. Our probabilistic approach yields predictions that rely more heavily on the prior information when the epistemic uncertainty is large, and more heavily on the neural network when the epistemic uncertainty is small.  ( 2 min )
    Spherical Channels for Modeling Atomic Interactions. (arXiv:2206.14331v2 [physics.chem-ph] UPDATED)
    Modeling the energy and forces of atomic systems is a fundamental problem in computational chemistry with the potential to help address many of the world's most pressing problems, including those related to energy scarcity and climate change. These calculations are traditionally performed using Density Functional Theory, which is computationally very expensive. Machine learning has the potential to dramatically improve the efficiency of these calculations from days or hours to seconds. We propose the Spherical Channel Network (SCN) to model atomic energies and forces. The SCN is a graph neural network where nodes represent atoms and edges their neighboring atoms. The atom embeddings are a set of spherical functions, called spherical channels, represented using spherical harmonics. We demonstrate that, by rotating the embeddings based on the 3D edge orientation, more information may be utilized while maintaining the rotational equivariance of the messages. While equivariance is a desirable property, we find that by relaxing this constraint in both message passing and aggregation, improved accuracy may be achieved. We demonstrate state-of-the-art results on the large-scale Open Catalyst dataset in both energy and force prediction for numerous tasks and metrics.  ( 3 min )
    Indirect Active Learning. (arXiv:2206.01454v2 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.  ( 2 min )
    E2V-SDE: From Asynchronous Events to Fast and Continuous Video Reconstruction via Neural Stochastic Differential Equations. (arXiv:2206.07578v2 [cs.CV] UPDATED)
    Event cameras respond to brightness changes in the scene asynchronously and independently for every pixel. Due to these properties, event cameras have distinct features: high dynamic range (HDR), high temporal resolution, and low power consumption. However, the output of event cameras must be processed into an alternative representation for computer vision tasks, and it is usually noisy, causing poor performance in areas with few events. In recent years, numerous researchers have attempted to reconstruct videos from events, but the results do not provide good quality videos due to a lack of temporal information from irregular and discontinuous data. To overcome these difficulties, we introduce E2V-SDE, whose dynamics are governed in a latent space by stochastic differential equations (SDEs). Therefore, E2V-SDE can rapidly reconstruct images at arbitrary time steps and make realistic predictions on unseen data. In addition, we successfully adopted a variety of image composition techniques for improving image clarity and temporal consistency. By conducting extensive experiments on simulated and real-scene datasets, we verify that our model outperforms state-of-the-art approaches under various video reconstruction settings. In terms of image quality, the LPIPS score improves by up to 12% and the reconstruction speed is 87% higher than that of ET-Net.  ( 3 min )
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v3 [cs.LG] UPDATED)
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.  ( 3 min )
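    As a concrete illustration of "analytically computes the linear part": the first-order member of the solver family reduces to a one-line update in the half-log-SNR variable $\lambda_t = \log(\alpha_t/\sigma_t)$. A sketch assuming schedule functions alpha(t), sigma(t) and a noise-prediction network, not the authors' released implementation.

        import numpy as np

        def dpm_solver_1_step(x_s, s, t, eps_model, alpha, sigma):
            # Solve the diffusion ODE from time s to t: the linear part is exact,
            # only the noise-prediction term is approximated (first order).
            lam_s = np.log(alpha(s) / sigma(s))
            lam_t = np.log(alpha(t) / sigma(t))
            h = lam_t - lam_s
            return (alpha(t) / alpha(s)) * x_s - sigma(t) * np.expm1(h) * eps_model(x_s, s)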
    Tunable Complexity Benchmarks for Evaluating Physics-Informed Neural Networks on Coupled Ordinary Differential Equations. (arXiv:2210.07880v1 [stat.ML])
    In this work, we assess the ability of physics-informed neural networks (PINNs) to solve increasingly-complex coupled ordinary differential equations (ODEs). We focus on a pair of benchmarks: discretized partial differential equations and harmonic oscillators, each of which has a tunable parameter that controls its complexity. Even by varying network architecture and applying a state-of-the-art training method that accounts for "difficult" training regions, we show that PINNs eventually fail to produce correct solutions to these benchmarks as their complexity -- the number of equations and the size of time domain -- increases. We identify several reasons why this may be the case, including insufficient network capacity, poor conditioning of the ODEs, and high local curvature, as measured by the Laplacian of the PINN loss.
    Towards Understanding Grokking: An Effective Theory of Representation Learning. (arXiv:2205.10343v2 [cs.LG] UPDATED)
    We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. We find representation learning to occur only in a "Goldilocks zone" (including comprehension and grokking) between memorization and confusion. We find on transformers the grokking phase stays closer to the memorization phase (compared to the comprehension phase), leading to delayed generalization. The Goldilocks phase is reminiscent of "intelligence from starvation" in Darwinian evolution, where resource limitations drive discovery of more efficient solutions. This study not only provides intuitive explanations of the origin of grokking, but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.
    Mixture-of-Experts with Expert Choice Routing. (arXiv:2202.09368v2 [cs.LG] UPDATED)
    Sparsely-activated Mixture-of-experts (MoE) models allow the number of parameters to greatly increase while keeping the amount of computation for a given token or a given sample unchanged. However, a poor expert routing strategy (e.g. one resulting in load imbalance) can cause certain experts to be under-trained, leading to an expert being under- or over-specialized. Prior work allocates a fixed number of experts to each token using a top-k function regardless of the relative importance of different tokens. To address this, we propose a heterogeneous mixture-of-experts employing an expert choice method. Instead of letting tokens select the top-k experts, we have experts selecting the top-k tokens. As a result, each token can be routed to a variable number of experts and each expert can have a fixed bucket size. We systematically study pre-training speedups using the same computational resources of the Switch Transformer top-1 and GShard top-2 gating of prior work and find that our method improves training convergence time by more than 2x. For the same computational cost, our method demonstrates higher performance in fine-tuning 11 selected tasks in the GLUE and SuperGLUE benchmarks. For a smaller activation cost, our method outperforms the T5 dense model in 7 out of the 11 tasks.
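    A minimal sketch of the routing rule itself (the router scores and the capacity value are illustrative; the paper's batched implementation differs).

        import numpy as np

        def expert_choice(token_scores, capacity):
            # token_scores: [n_tokens, n_experts] router affinities.
            # Each expert picks its top-`capacity` tokens, so experts have fixed
            # buckets while tokens receive a variable number of experts.
            n_tokens, n_experts = token_scores.shape
            assignment = np.zeros_like(token_scores, dtype=bool)
            for e in range(n_experts):
                top = np.argsort(-token_scores[:, e])[:capacity]
                assignment[top, e] = True
            return assignment  # assignment[t, e]: expert e processes token t

        scores = np.random.rand(8, 4)
        print(expert_choice(scores, capacity=2).sum(axis=0))  # every expert: 2 tokens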
    The cluster structure function. (arXiv:2201.01222v3 [cs.LG] UPDATED)
    For each partition of a data set into a given number of parts there is a partition such that every part is as much as possible a good model (an "algorithmic sufficient statistic") for the data in that part. Since this can be done for every number between one and the number of data points, the result is a function, the cluster structure function. It maps the number of parts of a partition to values related to the deficiencies of the parts as good models. Such a function starts with a value at least zero for no partition of the data set and descends to zero for the partition of the data set into singleton parts. The optimal clustering is the one chosen to minimize the cluster structure function. The theory behind the method is expressed in algorithmic information theory (Kolmogorov complexity). In practice the Kolmogorov complexities involved are approximated by a concrete compressor. We give examples using real data sets: the MNIST handwritten digits and the segmentation of real cells as used in stem cell research.
    Reliability Assessment and Safety Arguments for Machine Learning Components in System Assurance. (arXiv:2112.00646v2 [cs.SE] UPDATED)
    The increasing use of Machine Learning (ML) components embedded in autonomous systems -- so-called Learning-Enabled Systems (LESs) -- has resulted in the pressing need to assure their functional safety. As for traditional functional safety, the emerging consensus within both industry and academia is to use assurance cases for this purpose. Typically, assurance cases support claims of reliability in support of safety, and can be viewed as a structured way of organising arguments and evidence generated from safety analysis and reliability modelling activities. While such assurance activities are traditionally guided by consensus-based standards developed from vast engineering experience, LESs pose new challenges in safety-critical applications due to the characteristics and design of ML models. In this article, we first present an overall assurance framework for LESs with an emphasis on quantitative aspects, e.g., breaking down system-level safety targets to component-level requirements and supporting claims stated in reliability metrics. We then introduce a novel model-agnostic Reliability Assessment Model (RAM) for ML classifiers that utilises the operational profile and robustness verification evidence. We discuss the model assumptions and the inherent challenges of assessing ML reliability uncovered by our RAM and propose solutions for practical use. Probabilistic safety argument templates at the lower ML component-level are also developed based on the RAM. Finally, to evaluate and demonstrate our methods, we not only conduct experiments on synthetic/benchmark datasets but also scope our methods with case studies on simulated Autonomous Underwater Vehicles and physical Unmanned Ground Vehicles.
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v3 [stat.ML] UPDATED)
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state of the art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.
    Autoregressive Perturbations for Data Poisoning. (arXiv:2206.03693v3 [cs.LG] UPDATED)
    The prevalence of data scraping from social media as a means to obtain datasets has led to growing concerns regarding unauthorized use of data. Data poisoning attacks have been proposed as a bulwark against scraping, as they make data "unlearnable" by adding small, imperceptible perturbations. Unfortunately, existing methods require knowledge of both the target architecture and the complete dataset so that a surrogate network can be trained, the parameters of which are used to generate the attack. In this work, we introduce autoregressive (AR) poisoning, a method that can generate poisoned data without access to the broader dataset. The proposed AR perturbations are generic, can be applied across different datasets, and can poison different architectures. Compared to existing unlearnable methods, our AR poisons are more resistant against common defenses such as adversarial training and strong data augmentations. Our analysis further provides insight into what makes an effective data poison.
    A Recommendation Approach based on Similarity-Popularity Models of Complex Networks. (arXiv:2210.07816v1 [cs.IR])
    Recommender systems have become an essential tool for providers and users of online services and goods, especially with the increased use of the Internet to access information and purchase products and services. This work proposes a novel recommendation method based on complex networks generated by a similarity-popularity model to predict ratings. We first construct a model of a network having users and items as nodes from observed ratings and then use it to predict unseen ratings. The prospect of producing accurate rating predictions using a similarity-popularity model with hidden metric spaces and dot-product similarity is explored. The proposed approach is implemented and experimentally compared against baseline and state-of-the-art recommendation methods on 21 datasets from various domains. The experimental results demonstrate that the proposed method produces accurate predictions and outperforms existing methods. We also show that the proposed approach produces superior results in low dimensions, proving its effectiveness for data visualization and exploration.
    ODBO: Bayesian Optimization with Search Space Prescreening for Directed Protein Evolution. (arXiv:2205.09548v3 [q-bio.BM] UPDATED)
    Directed evolution is a versatile technique in protein engineering that mimics the process of natural selection by iteratively alternating between mutagenesis and screening in order to search for sequences that optimize a given property of interest, such as catalytic activity and binding affinity to a specified target. However, the space of possible proteins is too large to search exhaustively in the laboratory, and functional proteins are scarce in the vast sequence space. Machine learning (ML) approaches can accelerate directed evolution by learning to map protein sequences to functions without building a detailed model of the underlying physics, chemistry and biological pathways. Despite the great potential held by these ML methods, they encounter severe challenges in identifying the most suitable sequences for a targeted function. These failures can be attributed to the common practice of adopting a high-dimensional feature representation for protein sequences and inefficient search methods. To address these issues, we propose an efficient, experimental design-oriented closed-loop optimization framework for protein directed evolution, termed ODBO, which employs a combination of a novel low-dimensional protein encoding strategy and Bayesian optimization enhanced with search space prescreening via outlier detection. We further design an initial sample selection strategy to minimize the number of experimental samples for training ML models. We conduct and report four protein directed evolution experiments that substantiate the capability of the proposed framework for finding variants with properties of interest. We expect the ODBO framework to greatly reduce the experimental and time costs of directed evolution, and it can be further generalized as a powerful tool for adaptive experimental design in a broader context.
    Green Hierarchical Vision Transformer for Masked Image Modeling. (arXiv:2205.13515v2 [cs.CV] UPDATED)
    We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of three key designs. First, for window attention, we propose a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a uniform partition that visible patches within each local window of arbitrary size can be grouped with equal size, where masked self-attention is then performed within each group. Second, we further improve the grouping strategy via the Dynamic Programming algorithm to minimize the overall computation cost of the attention on the grouped patches. Third, as for the convolution layers, we convert them to the Sparse Convolution that works seamlessly with the sparse data, i.e., the visible patches in MIM. As a result, MIM can now work on most, if not all, hierarchical ViTs in a green and efficient way. For example, we can train the hierarchical ViTs, e.g., Swin Transformer and Twins Transformer, about 2.7$\times$ faster and reduce the GPU memory usage by 70%, while still enjoying competitive performance on ImageNet classification and the superiority on downstream COCO object detection benchmarks. Code and pre-trained models have been made publicly available at https://github.com/LayneH/GreenMIM.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v2 [stat.ML] UPDATED)
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal w.r.t. an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum that requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings much fewer than $N^{2}$ neurons are required to reach the minima.
    Unsupervised Model Selection for Time-series Anomaly Detection. (arXiv:2210.01078v2 [cs.LG] UPDATED)
    Anomaly detection in time-series has a wide range of practical applications. While numerous anomaly detection methods have been proposed in the literature, a recent survey concluded that no single method is the most accurate across various datasets. To make matters worse, anomaly labels are scarce and rarely available in practice. The practical problem of selecting the most accurate model for a given dataset without labels has received little attention in the literature. This paper answers that question: given an unlabeled dataset and a set of candidate anomaly detectors, how can we select the most accurate model? To this end, we identify three classes of surrogate (unsupervised) metrics, namely, prediction error, model centrality, and performance on injected synthetic anomalies, and show that some metrics are highly correlated with standard supervised anomaly detection performance metrics such as the $F_1$ score, but to varying degrees. We formulate metric combination with multiple imperfect surrogate metrics as a robust rank aggregation problem. We then provide theoretical justification behind the proposed approach. Large-scale experiments on multiple real-world datasets demonstrate that our proposed unsupervised approach is as effective as selecting the most accurate model based on partially labeled data.
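    The aggregation idea can be illustrated with plain Borda-style rank averaging; the paper uses a robust rank aggregation, so the version below is only a simplified sketch with our own names.

        import numpy as np
        from scipy.stats import rankdata

        def aggregate_model_ranks(metric_scores):
            # metric_scores: [n_metrics, n_models], higher = better per metric.
            # Each surrogate metric ranks the candidate detectors (1 = best);
            # averaging the ranks gives a combined ordering.
            ranks = np.vstack([rankdata(-row) for row in metric_scores])
            return np.argsort(ranks.mean(axis=0))  # model indices, best first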
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v2 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of a predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.
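    The QRM objective itself is compact: compute per-domain risks and take their $\alpha$-quantile. EQRM smooths the empirical risk distribution before taking the quantile; the plain empirical version below is only illustrative.

        import numpy as np

        def quantile_risk(per_domain_risks, alpha=0.9):
            # Minimizing this over predictors targets risk that holds with
            # probability alpha over domains, between average- and worst-case.
            return np.quantile(np.asarray(per_domain_risks), alpha)

        print(quantile_risk([0.12, 0.15, 0.11, 0.40, 0.13], alpha=0.9))  # ~0.30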
    Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian Gradients. (arXiv:2206.06295v4 [cs.LG] UPDATED)
    Minimizing the inclusive Kullback-Leibler (KL) divergence with stochastic gradient descent (SGD) is challenging since its gradient is defined as an integral over the posterior. Recently, multiple methods have been proposed to run SGD with biased gradient estimates obtained from a Markov chain. This paper provides the first non-asymptotic convergence analysis of these methods by establishing their mixing rate and gradient variance. To do this, we demonstrate that these methods, which we collectively refer to as Markov chain score ascent (MCSA) methods, can be cast as special cases of the Markov chain gradient descent framework. Furthermore, by leveraging this new understanding, we develop a novel MCSA scheme, parallel MCSA (pMCSA), that achieves a tighter bound on the gradient variance. We demonstrate that this improved theoretical result translates to superior empirical performance.
    B\'ezier Gaussian Processes for Tall and Wide Data. (arXiv:2209.00343v2 [stat.ML] UPDATED)
    Modern approximations to Gaussian processes are suitable for "tall data", with a cost that scales well in the number of observations, but under-perform on "wide data", scaling poorly in the number of input features. That is, as the number of input features grows, good predictive performance requires the number of summarising variables, and their associated cost, to grow rapidly. We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features, but requires only linear cost in both number of observations and input features. This scaling is achieved through our introduction of the B\'ezier buttress, which allows approximate inference without computing matrix inverses or determinants. We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression, and empirically demonstrate the kernel's ability to scale to both tall and wide datasets.
    Provably Efficient Offline Multi-agent Reinforcement Learning via Strategy-wise Bonus. (arXiv:2206.00159v2 [cs.LG] UPDATED)
    This paper considers offline multi-agent reinforcement learning. We propose the strategy-wise concentration principle which directly builds a confidence interval for the joint strategy, in contrast to the point-wise concentration principle that builds a confidence interval for each point in the joint action space. For two-player zero-sum Markov games, by exploiting the convexity of the strategy-wise bonus, we propose a computationally efficient algorithm whose sample complexity enjoys a better dependency on the number of actions than the prior methods based on the point-wise bonus. Furthermore, for offline multi-agent general-sum Markov games, based on the strategy-wise bonus and a novel surrogate function, we give the first algorithm whose sample complexity scales only with $\sum_{i=1}^m A_i$, where $A_i$ is the action size of the $i$-th player and $m$ is the number of players. In sharp contrast, the sample complexity of methods based on the point-wise bonus would scale with the size of the joint action space $\prod_{i=1}^m A_i$ due to the curse of multiagents. Lastly, all of our algorithms can naturally take a pre-specified strategy class $\Pi$ as input and output a strategy that is close to the best strategy in $\Pi$. In this setting, the sample complexity only scales with $\log |\Pi|$ instead of $\sum_{i=1}^m A_i$.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v2 [math.OC] UPDATED)
    Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such a mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balances the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
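    The nested structure can be sketched in a few lines: an inner loop runs an off-the-shelf adaptive maximizer on the dual variable until a (progressively tightened) stopping criterion fires, and the outer loop takes one adaptive step on the primal variable. The objective, optimizers, and schedules below are hypothetical stand-ins:

        import torch

        # Toy nonconvex-strongly-concave objective f(x, y): minimize over x,
        # maximize over y. All settings here are hypothetical.
        f = lambda x, y: (x ** 2).sum() + (x * y).sum() - 0.5 * (y ** 2).sum()
        x = torch.randn(5, requires_grad=True)
        y = torch.randn(5, requires_grad=True)
        opt_x = torch.optim.Adam([x], lr=1e-2)   # outer (primal) optimizer
        opt_y = torch.optim.Adam([y], lr=1e-2)   # inner (dual) optimizer

        for t in range(1, 201):
            # Inner loop: adaptively maximize over y with a controllable
            # stopping criterion that tightens as t grows.
            for _ in range(50):
                opt_y.zero_grad()
                (-f(x.detach(), y)).backward()
                if y.grad.norm() < 1.0 / t:
                    break
                opt_y.step()
            # Outer step: one adaptive update on the primal variable.
            opt_x.zero_grad()
            f(x, y.detach()).backward()
            opt_x.step()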
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v2 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    PAN: Pulse Ansatz on NISQ Machines. (arXiv:2208.01215v2 [quant-ph] UPDATED)
    Variational quantum algorithms (VQAs) have demonstrated great potential in the NISQ era. In the workflow of VQA, the parameters of the ansatz are iteratively updated to approximate the desired quantum states. We have seen various efforts to draft better ansatz with fewer gates. In quantum computers, the gate ansatz will eventually be transformed into control signals such as microwave pulses on transmons. And the control pulses need elaborate calibration to minimize errors such as over-rotation and under-rotation. In the case of VQAs, this procedure introduces redundancy, but the variational properties of VQAs can naturally handle problems of over-rotation and under-rotation by updating the amplitude and frequency parameters. Therefore, we propose PAN, a native-pulse ansatz generator framework for VQAs. We generate native-pulse ansatz with trainable parameters for amplitudes and frequencies. In our proposed PAN, we are tuning parametric pulses, which are natively supported on NISQ computers. Considering that parameter-shift rules do not hold for native-pulse ansatz, we need to deploy non-gradient optimizers. To constrain the number of parameters sent to the optimizer, we adopt a progressive way to generate our native-pulse ansatz. Experiments are conducted on both simulators and quantum devices to validate our methods. When adopted on NISQ machines, PAN improved performance while decreasing latency by an average of 86%. PAN is able to achieve 96.482% and 99.336% accuracy for VQE tasks on H2 and HeH+ respectively. An average accuracy of 97.27% is achieved for medium-size VQE tasks on CO2, H2O, and NaH. PAN also demonstrates advantages on QAOA tasks even with considerable noise in NISQ machines.
    Scalable Stochastic Parametric Verification with Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v2 [cs.LG] UPDATED)
    Parametric verification of linear temporal properties for stochastic models can be expressed as computing the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) aims at inferring the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. As observations are costly and noisy, smMC is framed as a Bayesian inference problem so that the estimates have an additional quantification of the uncertainty. The original smMC formulation uses Gaussian Processes (GP), inferred by means of the Expectation Propagation algorithm. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the well-known scalability issues of GP. In this paper, we exploit recent advances in probabilistic machine learning to overcome this limitation, making Bayesian inference for smMC scalable to larger datasets and enabling its application to models with high-dimensional parameter spaces. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). The core ingredient of SVI is a stochastic gradient-based optimization that makes inference easily parallelizable and that enables GPU acceleration. In this paper, we compare the performance of smMC against that of SV-smMC in terms of scalability, computational efficiency, and the accuracy of the reconstructed satisfaction function.
    CELEST: Federated Learning for Globally Coordinated Threat Detection. (arXiv:2205.11459v2 [cs.CR] UPDATED)
    The cyber-threat landscape has evolved tremendously in recent years, with new threat variants emerging daily, and large-scale coordinated campaigns becoming more prevalent. In this study, we propose CELEST (CollaborativE LEarning for Scalable Threat detection), a federated machine learning framework for global threat detection over HTTP, which is one of the most commonly used protocols for malware dissemination and communication. CELEST leverages federated learning in order to collaboratively train a global model across multiple clients who keep their data locally, thus providing increased privacy and confidentiality assurances. Through a novel active learning component integrated with the federated learning technique, our system continuously discovers and learns the behavior of new, evolving, and globally-coordinated cyber threats. We show that CELEST is able to expose attacks that are largely invisible to individual organizations. For instance, in one challenging attack scenario with data exfiltration malware, the global model achieves a three-fold increase in Precision-Recall AUC compared to the local model. We also introduce DTrust, a poisoning detection and mitigation method designed specifically for federated learning in the collaborative threat detection domain. DTrust successfully detects poisoning clients using the feedback from participating clients to investigate and remove them from the training process. We deploy CELEST on two university networks and show that it is able to detect malicious HTTP communication with high precision and low false positive rates. Furthermore, during its deployment, CELEST detected 42 previously unknown malicious URLs and 20 malicious domains in one day, which were confirmed to be malicious by VirusTotal.
    Generalized Anomaly Detection. (arXiv:2110.15108v2 [cs.LG] UPDATED)
    We study anomaly detection for the case when the normal class consists of more than one object category. This is an obvious generalization of the standard one-class anomaly detection problem. However, we show that jointly using multiple one-class anomaly detectors to solve this problem yields poorer results as compared to training a single one-class anomaly detector on all normal object categories together. We further develop a new anomaly detector called DeepMAD that learns compact distinguishing features by exploiting the multiple normal object categories. This algorithm achieves higher AUC values for different datasets compared to two top performing one-class algorithms that either are trained on each normal object category or jointly trained on all normal object categories combined. In addition to theoretical results we present empirical results using the CIFAR-10, fMNIST, CIFAR-100, and a new dataset we developed called RECYCLE.
    GRPE: Relative Positional Encoding for Graph Transformer. (arXiv:2201.12787v3 [cs.LG] UPDATED)
    We propose a novel positional encoding for learning graphs with the Transformer architecture. Existing approaches either linearize a graph to encode absolute position in the sequence of nodes, or encode relative position with respect to another node using bias terms. The former loses the preciseness of relative position through linearization, while the latter loses a tight integration of node-edge and node-topology interaction. To overcome the weaknesses of the previous approaches, our method encodes a graph without linearization and considers both node-topology and node-edge interaction. We name our method Graph Relative Positional Encoding (GRPE), dedicated to graph representation learning. Experiments conducted on various graph datasets show that the proposed method outperforms previous approaches significantly. Our code is publicly available at https://github.com/lenscloth/GRPE.
    Gradient Obfuscation Gives a False Sense of Security in Federated Learning. (arXiv:2206.04055v2 [cs.CR] UPDATED)
    Federated learning has been proposed as a privacy-preserving machine learning framework that enables multiple clients to collaborate without sharing raw data. However, client privacy protection is not guaranteed by design in this framework. Prior work has shown that the gradient sharing strategies in federated learning can be vulnerable to data reconstruction attacks. In practice, though, clients may not transmit raw gradients considering the high communication cost or due to privacy enhancement requirements. Empirical studies have demonstrated that gradient obfuscation, including intentional obfuscation via gradient noise injection and unintentional obfuscation via gradient compression, can provide more privacy protection against reconstruction attacks. In this work, we present a new data reconstruction attack framework targeting the image classification task in federated learning. We show that commonly adopted gradient postprocessing procedures, such as gradient quantization, gradient sparsification, and gradient perturbation, may give a false sense of security in federated learning. Contrary to prior studies, we argue that privacy enhancement should not be treated as a byproduct of gradient compression. Additionally, we design a new method under the proposed framework to reconstruct the image at the semantic level. We quantify the semantic privacy leakage and compare it with conventional metrics based on image similarity scores. Our comparisons challenge the image data leakage evaluation schemes in the literature. The results emphasize the importance of revisiting and redesigning the privacy protection mechanisms for client data in existing federated learning algorithms.
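    For context, the gradient-matching baseline that such attacks build on can be sketched in a few lines: optimize a dummy input so that its gradients reproduce the shared ones (a simplified DLG-style loop; the paper's semantic-level reconstruction is more involved, and the model and data here are hypothetical):

        import torch

        # Victim model and one private example (toy setting).
        model = torch.nn.Linear(32, 2)
        x_true, y_true = torch.randn(1, 32), torch.tensor([1])
        loss = torch.nn.functional.cross_entropy(model(x_true), y_true)
        true_grads = torch.autograd.grad(loss, model.parameters())

        # Reconstruct the input by matching gradients (label assumed known).
        x_dummy = torch.randn(1, 32, requires_grad=True)
        opt = torch.optim.Adam([x_dummy], lr=0.1)
        for _ in range(300):
            opt.zero_grad()
            dummy_loss = torch.nn.functional.cross_entropy(model(x_dummy), y_true)
            dummy_grads = torch.autograd.grad(dummy_loss, model.parameters(),
                                              create_graph=True)
            sum(((dg - tg) ** 2).sum()
                for dg, tg in zip(dummy_grads, true_grads)).backward()
            opt.step()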
    Posterior Collapse of a Linear Latent Variable Model. (arXiv:2205.04009v2 [cs.LG] UPDATED)
    This work identifies the existence and cause of a type of posterior collapse that frequently occurs in the Bayesian deep learning practice. For a general linear latent variable model that includes linear variational autoencoders as a special case, we precisely identify the nature of posterior collapse to be the competition between the likelihood and the regularization of the mean due to the prior. Our result suggests that posterior collapse may be related to neural collapse and dimensional collapse and could be a subclass of a general problem of learning for deeper architectures.
    DouFu: A Double Fusion Joint Learning Method For Driving Trajectory Representation. (arXiv:2205.08356v2 [cs.LG] UPDATED)
    Driving trajectory representation learning is of great significance for various location-based services, such as driving pattern mining and route recommendation. However, previous representation generation approaches rarely address three challenges: 1) how to represent the intricate semantic intentions of mobility inexpensively; 2) complex and weak spatial-temporal dependencies due to the sparsity and heterogeneity of the trajectory data; 3) route selection preferences and their correlation to driving behavior. In this paper, we propose a novel multimodal fusion model, DouFu, for joint trajectory representation learning, which applies multimodal learning and an attention fusion module to capture the internal characteristics of trajectories. We first design movement, route, and global features generated from the trajectory data and urban functional zones and then analyze them respectively with the attention encoder or feed-forward network. The attention fusion module incorporates route features with movement features to create a better spatial-temporal embedding. With the global semantic feature, DouFu produces a comprehensive embedding for each trajectory. We evaluate representations generated by our method and other baseline models on classification and clustering tasks. Empirical results show that DouFu outperforms other models by more than 10% on most downstream learners, such as linear regression and support vector machines.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v2 [cs.LG] UPDATED)
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild, such as Google's Vizier database, one of the world's largest HPO datasets. Our extensive experiments demonstrate that the OptFormer can simultaneously imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.
    Improving Subgraph Representation Learning via Multi-View Augmentation. (arXiv:2205.13038v3 [cs.LG] UPDATED)
    Subgraph representation learning based on Graph Neural Network (GNN) has exhibited broad applications in scientific advancements, such as predictions of molecular structure-property relationships and collective cellular function. In particular, graph augmentation techniques have shown promising results in improving graph-based and node-based classification tasks. Still, they have rarely been explored in the existing GNN-based subgraph representation learning studies. In this study, we develop a novel multi-view augmentation mechanism to improve subgraph representation learning models and thus the accuracy of downstream prediction tasks. Our augmentation technique creates multiple variants of subgraphs and embeds these variants into the original graph to achieve highly improved training efficiency, scalability, and accuracy. Benchmark experiments on several real-world biological and physiological datasets demonstrate the superiority of our proposed multi-view augmentation techniques in subgraph representation learning.
    Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). (arXiv:2209.07263v2 [cs.LG] UPDATED)
    We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness with the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness. Moreover, under the non-lazy training regime, we demonstrate how the width of a two-layer ReLU network benefits robustness. Our theoretical developments improve the results by [Huang et al. NeurIPS21; Wu et al. NeurIPS21] and are consistent with [Bubeck and Sellke NeurIPS21; Bubeck et al. COLT21].
    Quantity over Quality: Training an AV Motion Planner with Large Scale Commodity Vision Data. (arXiv:2203.01681v2 [cs.RO] UPDATED)
    With the Autonomous Vehicle (AV) industry shifting towards machine-learned approaches for motion planning, the performance of self-driving systems is starting to rely heavily on large quantities of expert driving demonstrations. However, collecting this demonstration data typically involves expensive HD sensor suites (LiDAR + RADAR + cameras), which quickly becomes financially infeasible at the scales required. This motivates the use of commodity sensors like cameras for data collection, which are an order of magnitude cheaper than HD sensor suites, but offer lower fidelity. Leveraging these sensors for training an AV motion planner opens a financially viable path to observe the `long tail' of driving events. As our main contribution we show it is possible to train a high-performance motion planner using commodity vision data which outperforms planners trained on HD-sensor data for a fraction of the cost. To the best of our knowledge, we are the first to demonstrate this using real-world data. We compare the performance of the autonomy system on these two different sensor configurations, and show that we can compensate for the lower sensor fidelity by means of increased quantity: a planner trained on 100h of commodity vision data outperforms the one with 25h of expensive HD data. We also share the engineering challenges we had to tackle to make this work.
    Beyond IID: data-driven decision-making in heterogeneous environments. (arXiv:2206.09642v2 [cs.LG] UPDATED)
    In this work, we study data-driven decision-making and depart from the classical identically and independently distributed (i.i.d.) assumption. We present a new framework in which historical samples are generated from unknown and different distributions, which we dub heterogeneous environments. These distributions are assumed to lie in a heterogeneity ball with known radius and centered around the (also) unknown future (out-of-sample) distribution on which the performance of a decision will be evaluated. We quantify the asymptotic worst-case regret that is achievable by central data-driven policies such as Sample Average Approximation, but also by rate-optimal ones, as a function of the radius of the heterogeneity ball. Our work shows that the type of achievable performance varies considerably across different combinations of problem classes and notions of heterogeneity. We demonstrate the versatility of our framework by comparing achievable guarantees for the heterogeneous version of widely studied data-driven problems such as pricing, ski-rental, and newsvendor. En route, we establish a new connection between data-driven decision-making and distributionally robust optimization.
    Efficiently Controlling Multiple Risks with Pareto Testing. (arXiv:2210.07913v1 [cs.LG])
    Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiable high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes -- including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered -- to simultaneously control and optimize various accuracy and cost metrics.
    An Interpretive Constrained Linear Model for ResNet and MgNet. (arXiv:2112.07441v2 [cs.CV] UPDATED)
    We propose a constrained linear data-feature-mapping model as an interpretable mathematical model for image classification using a convolutional neural network (CNN). From this viewpoint, we establish detailed connections between the traditional iterative schemes for linear systems and the architectures of the basic blocks of ResNet- and MgNet-type models. Using these connections, we present some modified ResNet models that compared with the original models have fewer parameters and yet can produce more accurate results, thereby demonstrating the validity of this constrained learning data-feature-mapping assumption. Based on this assumption, we further propose a general data-feature iterative scheme to show the rationality of MgNet. We also provide a systematic numerical study on MgNet to show its success and advantages in image classification problems and demonstrate its advantages in comparison with established networks.
    Physics-Driven Deep Learning for Computational Magnetic Resonance Imaging. (arXiv:2203.12215v3 [eess.IV] UPDATED)
    Physics-driven deep learning methods have emerged as a powerful tool for computational magnetic resonance imaging (MRI) problems, pushing reconstruction performance to new limits. This article provides an overview of the recent developments in incorporating physics information into learning-based MRI reconstruction. We consider inverse problems with both linear and non-linear forward models for computational MRI, and review the classical approaches for solving these. We then focus on physics-driven deep learning approaches, covering physics-driven loss functions, plug-and-play methods, generative models, and unrolled networks. We highlight domain-specific challenges such as real- and complex-valued building blocks of neural networks, and translational applications in MRI with linear and non-linear forward models. Finally, we discuss common issues and open challenges, and draw connections to the importance of physics-driven learning when combined with other downstream tasks in the medical imaging pipeline.
    On the non-universality of deep learning: quantifying the cost of symmetry. (arXiv:2208.03113v2 [cs.LG] UPDATED)
    We prove limitations on what neural networks trained by noisy gradient descent (GD) can efficiently learn. Our results apply whenever GD training is equivariant, which holds for many standard architectures and initializations. As applications, (i) we characterize the functions that fully-connected networks can weak-learn on the binary hypercube and unit sphere, demonstrating that depth-2 is as powerful as any other depth for this task; (ii) we extend the merged-staircase necessity result for learning with latent low-dimensional structure [ABM22] to beyond the mean-field regime. Under cryptographic assumptions, we also show hardness results for learning with fully-connected networks trained by stochastic gradient descent (SGD).
    USB: A Unified Semi-supervised Learning Benchmark for Classification. (arXiv:2208.07204v2 [cs.LG] UPDATED)
    Semi-supervised learning (SSL) improves model generalization by leveraging massive unlabeled data to augment limited labeled samples. However, currently, popular SSL evaluation protocols are often constrained to computer vision (CV) tasks. In addition, previous work typically trains deep neural networks from scratch, which is time-consuming and environmentally unfriendly. To address the above issues, we construct a Unified SSL Benchmark (USB) for classification by selecting 15 diverse, challenging, and comprehensive tasks from CV, natural language processing (NLP), and audio processing (Audio), on which we systematically evaluate the dominant SSL methods, and also open-source a modular and extensible codebase for fair evaluation of these SSL methods. We further provide the pre-trained versions of the state-of-the-art neural models for CV tasks to make the cost affordable for further tuning. USB enables the evaluation of a single SSL algorithm on more tasks from multiple domains but with less cost. Specifically, on a single NVIDIA V100, only 39 GPU days are required to evaluate FixMatch on 15 tasks in USB while 335 GPU days (279 GPU days on 4 CV datasets except for ImageNet) are needed on 5 CV tasks with TorchSSL.
    projUNN: efficient method for training deep networks with unitary matrices. (arXiv:2203.05483v3 [cs.LG] UPDATED)
    In learning with recurrent or very deep feed-forward networks, employing unitary matrices in each layer can be very effective at maintaining long-range stability. However, restricting network parameters to be unitary typically comes at the cost of expensive parameterizations or increased training runtime. We propose instead an efficient method based on rank-$k$ updates -- or their rank-$k$ approximation -- that maintains performance at a nearly optimal training runtime. We introduce two variants of this method, named Direct (projUNN-D) and Tangent (projUNN-T) projected Unitary Neural Networks, that can parameterize full $N$-dimensional unitary or orthogonal matrices with a training runtime scaling as $O(kN^2)$. Our method either projects low-rank gradients onto the closest unitary matrix (projUNN-T) or transports unitary matrices in the direction of the low-rank gradient (projUNN-D). Even in the fastest setting ($k=1$), projUNN is able to train a model's unitary parameters to reach comparable performance against baseline implementations. In recurrent neural network settings, projUNN closely matches or exceeds benchmarked results from prior unitary neural networks. Finally, we preliminarily explore projUNN in training orthogonal convolutional neural networks, which are currently unable to outperform state-of-the-art models but can potentially enhance stability and robustness at large depth.
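    The primitive underlying projUNN-T is the projection of an updated weight matrix back onto the closest unitary, i.e., the polar factor of the matrix. A dense sketch via the SVD is below; projUNN's contribution is performing this implicitly for rank-$k$ updates in $O(kN^2)$, which the illustration omits:

        import numpy as np

        def project_to_unitary(a):
            # Closest unitary in Frobenius norm: drop the singular values,
            # keeping the polar factor u @ vh of the SVD.
            u, _, vh = np.linalg.svd(a)
            return u @ vh

        rng = np.random.default_rng(0)
        w = project_to_unitary(rng.standard_normal((8, 8)))  # orthogonal init
        grad = rng.standard_normal((8, 8))                   # hypothetical gradient
        w = project_to_unitary(w - 0.1 * grad)               # update, then project
        print(np.allclose(w @ w.T, np.eye(8)))               # stays orthogonal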
    An Experimental Study on Private Aggregation of Teacher Ensemble Learning for End-to-End Speech Recognition. (arXiv:2210.05614v2 [cs.SD] UPDATED)
    Differential privacy (DP) is one data protection avenue to safeguard user information used for training deep models by imposing noisy distortion on privacy data. Such a noise perturbation often results in a severe performance degradation in automatic speech recognition (ASR) in order to meet a privacy budget $\varepsilon$. Private aggregation of teacher ensemble (PATE) utilizes ensemble probabilities to improve ASR accuracy when dealing with the noise effects controlled by small values of $\varepsilon$. We extend PATE learning to work with dynamic patterns, namely speech utterances, and perform a first experimental demonstration that it prevents acoustic data leakage in ASR training. We evaluate three end-to-end deep models, including LAS, hybrid CTC/attention, and RNN transducer, on the open-source LibriSpeech and TIMIT corpora. PATE learning-enhanced ASR models outperform the benchmark DP-SGD mechanisms, especially under strict DP budgets, giving relative word error rate reductions between 26.2% and 27.5% for an RNN transducer model evaluated with LibriSpeech. We also introduce a DP-preserving ASR solution for pretraining on public speech corpora.
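    The aggregation step at the heart of PATE is easy to state for label outputs: histogram the teacher votes, add Laplace noise, and release the noisy argmax (the paper's contribution is extending this idea to sequential ASR outputs; the vote data below is hypothetical):

        import numpy as np

        def pate_label(teacher_votes, num_classes, gamma=0.1, rng=None):
            # Noisy-max aggregation: smaller gamma means more noise and a
            # tighter privacy budget.
            rng = rng or np.random.default_rng()
            counts = np.bincount(teacher_votes, minlength=num_classes).astype(float)
            counts += rng.laplace(scale=1.0 / gamma, size=num_classes)
            return int(counts.argmax())

        # Hypothetical: 25 teachers vote on a 10-class example.
        rng = np.random.default_rng(0)
        votes = rng.integers(0, 10, size=25)
        print(pate_label(votes, num_classes=10, gamma=0.1, rng=rng))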
    Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables. (arXiv:2003.12659v3 [stat.ML] UPDATED)
    Identification theory for causal effects in causal models associated with hidden variable directed acyclic graphs (DAGs) is well studied. However, the corresponding algorithms are underused due to the complexity of estimating the identifying functionals they output. In this work, we bridge the gap between identification and estimation of population-level causal effects involving a single treatment and a single outcome. We derive influence function based estimators that exhibit double robustness for the identified effects in a large class of hidden variable DAGs where the treatment satisfies a simple graphical criterion; this class includes models yielding the adjustment and front-door functionals as special cases. We also provide necessary and sufficient conditions under which the statistical model of a hidden variable DAG is nonparametrically saturated and implies no equality constraints on the observed data distribution. Further, we derive an important class of hidden variable DAGs that imply observed data distributions observationally equivalent (up to equality constraints) to fully observed DAGs. In these classes of DAGs, we derive estimators that achieve the semiparametric efficiency bounds for the target of interest where the treatment satisfies our graphical criterion. Finally, we provide a sound and complete identification algorithm that directly yields a weight based estimation strategy for any identifiable effect in hidden variable causal models.
    Cumulo: A Dataset for Learning Cloud Classes. (arXiv:1911.04227v3 [physics.ao-ph] UPDATED)
    One of the greatest sources of uncertainty in future climate projections comes from limitations in modelling clouds and in understanding how different cloud types interact with the climate system. A key first step in reducing this uncertainty is to accurately classify cloud types at high spatial and temporal resolution. In this paper, we introduce Cumulo, a benchmark dataset for training and evaluating global cloud classification models. It consists of one year of 1km resolution MODIS hyperspectral imagery merged with pixel-width 'tracks' of CloudSat cloud labels. Bringing these complementary datasets together is a crucial first step, enabling the Machine-Learning community to develop innovative new techniques which could greatly benefit the Climate community. To showcase Cumulo, we provide baseline performance analysis using an invertible flow generative model (IResNet), which further allows us to discover new sub-classes for a given cloud class by exploring the latent space. To compare methods, we introduce a set of evaluation criteria, to identify models that are not only accurate, but also physically-realistic. Cumulo can be downloaded from https://www.dropbox.com/sh/i3s9q2v2jjyk2it/AACxXnXfMF5wuIqLXqH4NJOra?dl=0 .
    Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization. (arXiv:1910.02008v5 [math.ST] UPDATED)
    In this paper, we are concerned with a non-asymptotic analysis of sampling algorithms used in nonconvex optimization. In particular, we obtain non-asymptotic estimates in Wasserstein-1 and Wasserstein-2 distances for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). In addition, the aforementioned Wasserstein-2 convergence result can be applied to establish a non-asymptotic error bound for the expected excess risk. Crucially, these results are obtained under a local Lipschitz condition and a local dissipativity condition where we remove the uniform dependence in the data stream. We illustrate the importance of this relaxation by presenting examples from variational inference and from index tracking optimization.
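    For reference, the SGLD iteration analyzed here is a gradient step plus injected Gaussian noise; a minimal sketch on a hypothetical double-well potential (with stochastic gradients, grad_u would be a mini-batch estimate):

        import numpy as np

        def sgld(grad_u, theta0, step=1e-3, n_iters=10_000, rng=None):
            # theta_{k+1} = theta_k - step * grad_U(theta_k)
            #               + sqrt(2 * step) * xi,  xi ~ N(0, I)
            rng = rng or np.random.default_rng()
            theta = np.asarray(theta0, dtype=float)
            for _ in range(n_iters):
                noise = rng.standard_normal(theta.shape)
                theta = theta - step * grad_u(theta) + np.sqrt(2 * step) * noise
            return theta

        grad_u = lambda x: 4 * x * (x ** 2 - 1)   # U(x) = (x^2 - 1)^2
        print(sgld(grad_u, theta0=[2.0]))         # hovers near a well at +/-1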
    Covariate-informed Representation Learning to Prevent Posterior Collapse of iVAE. (arXiv:2202.04206v3 [stat.ML] UPDATED)
    The recently proposed identifiable variational autoencoder (iVAE) framework provides a promising approach for learning latent independent components (ICs). iVAEs use auxiliary covariates to build an identifiable generation structure from covariates to ICs to observations, and the posterior network approximates ICs given observations and covariates. Though the identifiability is appealing, we show that iVAEs can have local minimum solutions where observations and the approximated ICs are independent given covariates, a phenomenon we refer to as the posterior collapse problem of iVAEs. To overcome this problem, we develop a new approach, covariate-informed iVAE (CI-iVAE), by considering a mixture of encoder and posterior distributions in the objective function. In doing so, the objective function prevents posterior collapse, resulting in latent representations that contain more information about the observations. Furthermore, CI-iVAEs extend the original iVAE objective function to a larger class and find the optimal one among them, thus having tighter evidence lower bounds than the original iVAE. Experiments on simulation datasets, EMNIST, Fashion-MNIST, and a large-scale brain imaging dataset demonstrate the effectiveness of our new method.
    Learning To Rank Diversely. (arXiv:2210.07774v1 [cs.IR])
    Airbnb is a two-sided marketplace, bringing together hosts who own listings for rent, with prospective guests from around the globe. Applying neural network-based learning to rank techniques has led to significant improvements in matching guests with hosts. These improvements in ranking were driven by a core strategy: order the listings by their estimated booking probabilities, then iterate on techniques to make these booking probability estimates more and more accurate. Embedded implicitly in this strategy was an assumption that the booking probability of a listing could be determined independently of other listings in search results. In this paper we discuss how this assumption, pervasive throughout the commonly-used learning to rank frameworks, is false. We provide a theoretical foundation correcting this assumption, followed by efficient neural network architectures based on the theory. Explicitly accounting for possible similarities between listings, and reducing them to diversify the search results, generated a strong positive impact. We discuss these metric wins as part of the online A/B tests of the theory. Our method provides a practical way to diversify search results for large-scale production ranking systems.
    Sarcasm Detection using Hybrid Neural Network. (arXiv:1908.07414v2 [cs.LG] UPDATED)
    Sarcasm Detection has enjoyed great interest from the research community; however, the task of predicting sarcasm in a text remains an elusive problem for machines. Past studies mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. To overcome these shortcomings, we introduce a new dataset which contains news headlines from a sarcastic news website and a real news website. Next, we propose a hybrid Neural Network architecture with an attention mechanism which provides insights about what actually makes sentences sarcastic. Through experiments, we show that the proposed model improves upon the baseline by ~5% in terms of classification accuracy.
    Optimal AdaBoost Converges. (arXiv:2210.07808v1 [stat.ML])
    The following work is a preprint collection of formal proofs regarding the convergence properties of the AdaBoost machine learning algorithm's classifier and margins. Various math and computer science papers have been written regarding conjectures and special cases of these convergence properties. Furthermore, the margins of AdaBoost feature prominently in the research surrounding the algorithm. As the centerpiece of this paper, we present how AdaBoost's classifier and margins converge on values that agree with decades of research. After this, we show how various quantities associated with the combined classifier converge.
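    For concreteness, the normalized margins whose convergence is studied can be computed directly from the weak-learner outputs and their weights (the matrices below are hypothetical):

        import numpy as np

        def adaboost_margins(h, alpha, y):
            # Normalized margin: y * sum_t alpha_t h_t(x) / sum_t alpha_t.
            # h: (T, n) weak-learner predictions in {-1, +1};
            # alpha: (T,) nonnegative weights; y: (n,) labels in {-1, +1}.
            return y * (alpha @ h) / alpha.sum()

        h = np.array([[ 1,  1, -1,  1],
                      [ 1, -1, -1,  1],
                      [ 1,  1,  1, -1]])
        alpha = np.array([0.9, 0.5, 0.3])
        y = np.array([1, 1, -1, 1])
        print(adaboost_margins(h, alpha, y))  # in [-1, 1]; > 0 means correct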
    Interpretable and Effective Reinforcement Learning for Attacking against Graph-based Rumor Detection. (arXiv:2201.05819v2 [cs.LG] UPDATED)
    Social networks are frequently polluted by rumors, which can be detected by advanced models such as graph neural networks. However, the models are vulnerable to attacks, and understanding the vulnerabilities is critical to rumor detection in practice. To discover subtle vulnerabilities, we design a powerful attacking algorithm to camouflage rumors in social networks based on reinforcement learning that can interact with and attack any black-box detector. The environment has exponentially large state spaces, high-order graph dependencies, and delayed noisy rewards, making it difficult for state-of-the-art end-to-end approaches to learn features, owing to large learning costs and the expressive limitations of deep graph models. Instead, we design domain-specific features to avoid learning features and produce interpretable attack policies. To further speed up policy optimization, we devise: (i) a credit assignment method that decomposes delayed rewards to atomic attacking actions proportional to their camouflage effects on target rumors; (ii) a time-dependent control variate to reduce reward variance due to large graphs and many attacking steps, supported by the reward variance analysis and a Bayesian analysis of the prediction distribution. On three real-world datasets of rumor detection tasks, we demonstrate: (i) the effectiveness of the learned attacking policy compared to rule-based attacks and current end-to-end approaches; (ii) the usefulness of the proposed credit assignment strategy and variance reduction components; (iii) the interpretability of the policy when generating strong attacks via the case study.
    Fine-grained Category Discovery under Coarse-grained supervision with Hierarchical Weighted Self-contrastive Learning. (arXiv:2210.07733v1 [cs.CL])
    Novel category discovery aims at adapting models trained on known categories to novel categories. Previous works only focus on the scenario where known and novel categories are of the same granularity. In this paper, we investigate a new practical scenario called Fine-grained Category Discovery under Coarse-grained supervision (FCDC). FCDC aims at discovering fine-grained categories with only coarse-grained labeled data, which can adapt models to categories of different granularity from known ones and reduce significant labeling cost. It is also a challenging task since supervised training on coarse-grained categories tends to focus on inter-class distance (distance between coarse-grained classes) but ignore intra-class distance (distance between fine-grained sub-classes) which is essential for separating fine-grained categories. Considering most current methods cannot transfer knowledge from coarse-grained level to fine-grained level, we propose a hierarchical weighted self-contrastive network by building a novel weighted self-contrastive module and combining it with supervised learning in a hierarchical manner. Extensive experiments on public datasets show both effectiveness and efficiency of our model over compared methods. Code and data are available at https://github.com/Lackel/Hierarchical_Weighted_SCL.
    Hierarchical Policy Blending as Inference for Reactive Robot Control. (arXiv:2210.07890v1 [cs.RO])
    Motion generation in cluttered, dense, and dynamic environments is a central topic in robotics, rendered as a multi-objective decision-making problem. Current approaches trade-off between safety and performance. On the one hand, reactive policies guarantee fast response to environmental changes at the risk of suboptimal behavior. On the other hand, planning-based motion generation provides feasible trajectories, but the high computational cost may limit the control frequency and thus safety. To combine the benefits of reactive policies and planning, we propose a hierarchical motion generation method. Moreover, we adopt probabilistic inference methods to formalize the hierarchical model and stochastic optimization. We realize this approach as a weighted product of stochastic, reactive expert policies, where planning is used to adaptively compute the optimal weights over the task horizon. This stochastic optimization avoids local optima and proposes feasible reactive plans that find paths in cluttered and dense environments. Our extensive experimental study in planar navigation and 6DoF manipulation shows that our proposed hierarchical motion generation method outperforms both myopic reactive controllers and online re-planning methods.
    Object-Category Aware Reinforcement Learning. (arXiv:2210.07802v1 [cs.LG])
    Object-oriented reinforcement learning (OORL) is a promising way to improve the sample efficiency and generalization ability over standard RL. Recent works that try to solve OORL tasks without additional feature engineering mainly focus on learning the object representations and then solving tasks via reasoning based on these object representations. However, none of these works tries to explicitly model the inherent similarity between different object instances of the same category. Objects of the same category should share similar functionalities; therefore, the category is the most critical property of an object. Following this insight, we propose a novel framework named Object-Category Aware Reinforcement Learning (OCARL), which utilizes the category information of objects to facilitate both perception and reasoning. OCARL consists of three parts: (1) Category-Aware Unsupervised Object Discovery (UOD), which discovers the objects as well as their corresponding categories; (2) Object-Category Aware Perception, which encodes the category information and is also robust to the incompleteness of (1) at the same time; (3) Object-Centric Modular Reasoning, which adopts multiple independent and object-category-specific networks when reasoning based on objects. Our experiments show that OCARL can improve both the sample efficiency and generalization in the OORL domain.
    Generalization Properties of NAS under Activation and Skip Connection Search. (arXiv:2209.07238v2 [cs.LG] UPDATED)
    Neural Architecture Search (NAS) has fostered the automatic discovery of state-of-the-art neural architectures. Despite the progress achieved with NAS, so far little attention has been paid to theoretical guarantees for NAS. In this work, we study the generalization properties of NAS under a unifying framework enabling (deep) layer skip connection search and activation function search. To this end, we derive the lower (and upper) bounds of the minimum eigenvalue of the Neural Tangent Kernel (NTK) under the (in)finite-width regime using a certain search space including mixed activation functions, fully connected, and residual neural networks. We use the minimum eigenvalue to establish generalization error bounds of NAS in the stochastic gradient descent training. Importantly, we theoretically and experimentally show how the derived results can guide NAS to select the top-performing architectures, even in the case without training, leading to a train-free algorithm based on our theory. Accordingly, our numerical validation sheds light on the design of computationally efficient methods for NAS. Our analysis is non-trivial due to the coupling of various architectures and activation functions under the unifying framework and has its own interest in providing the lower bound of the minimum eigenvalue of NTK in deep learning theory.
    FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems. (arXiv:2210.07768v1 [cs.IR])
    Deep learning has been widely deployed for online ads systems to predict Click-Through Rate (CTR). Machine learning researchers and practitioners frequently retrain CTR models to test their new extracted features. However, the CTR model training often relies on a large number of raw input data logs. Hence, the feature extraction can take a significant proportion of the training time for an industrial-level CTR model. In this paper, we propose FeatureBox, a novel end-to-end training framework that pipelines the feature extraction and the training on GPU servers to save the intermediate I/O of the feature extraction. We rewrite computation-intensive feature extraction operators as GPU operators and leave the memory-intensive operator on CPUs. We introduce a layer-wise operator scheduling algorithm to schedule these heterogeneous operators. We present a light-weight GPU memory management algorithm that supports dynamic GPU memory allocation with minimal overhead. We experimentally evaluate FeatureBox and compare it with the previous in-production feature extraction framework on two real-world ads applications. The results confirm the effectiveness of our proposed method.
    A Sequence-Aware Recommendation Method Based on Complex Networks. (arXiv:2210.07814v1 [cs.IR])
    Online stores and service providers rely heavily on recommendation software to guide users through the vast amount of available products. Consequently, the field of recommender systems has attracted increased attention from the industry and academia alike, but despite this joint effort, the field still faces several challenges. For instance, most existing work models the recommendation problem as a matrix completion problem to predict the user preference for an item. This abstraction prevents the system from utilizing the rich information from the ordered sequence of user actions logged in online sessions. To address this limitation, researchers have recently developed a promising new breed of algorithms called sequence-aware recommender systems to predict the user's next action by utilizing the time series composed of the sequence of actions in an ongoing user session. This paper proposes a novel sequence-aware recommendation approach based on a complex network generated by the hidden metric space model, which combines node similarity and popularity to generate links. We build a network model from data and then use it to predict the user's subsequent actions. The network model provides an additional source of information that improves the accuracy of the recommendations. The proposed method is implemented and tested experimentally on a large dataset. The results show that the proposed approach performs better than state-of-the-art recommendation methods.
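    The general flavor of graph-based, sequence-aware recommendation can be sketched without the hidden-metric-space machinery: build a weighted item graph from consecutive actions in sessions, then score candidates for the next action by edge weight (a deliberately simplified stand-in for the paper's model; the session data is hypothetical):

        from collections import defaultdict

        sessions = [["a", "b", "c"], ["a", "b", "d"],
                    ["b", "c", "d"], ["a", "c"]]

        # Weighted directed item graph from consecutive actions.
        edges = defaultdict(float)
        for s in sessions:
            for u, v in zip(s, s[1:]):
                edges[(u, v)] += 1.0

        def recommend(current_item, k=2):
            # Rank candidate next items by outgoing edge weight.
            cands = [(v, w) for (u, v), w in edges.items() if u == current_item]
            return [v for v, _ in sorted(cands, key=lambda t: -t[1])[:k]]

        print(recommend("a"))   # ['b', 'c']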
    Asymmetric Student-Teacher Networks for Industrial Anomaly Detection. (arXiv:2210.07829v1 [cs.LG])
    Industrial defect detection is commonly addressed with anomaly detection (AD) methods where no or only incomplete data of potentially occurring defects is available. This work discovers previously unknown problems of student-teacher approaches for AD and proposes a solution, where two neural networks are trained to produce the same output for the defect-free training examples. The core assumption of student-teacher networks is that the distance between the outputs of both networks is larger for anomalies since they are absent in training. However, previous methods suffer from the similarity of student and teacher architecture, such that the distance is undesirably small for anomalies. For this reason, we propose asymmetric student-teacher networks (AST). We train a normalizing flow for density estimation as a teacher and a conventional feed-forward network as a student to trigger large distances for anomalies: The bijectivity of the normalizing flow enforces a divergence of teacher outputs for anomalies compared to normal data. Outside the training distribution the student cannot imitate this divergence due to its fundamentally different architecture. Our AST network compensates for wrongly estimated likelihoods by a normalizing flow, which was alternatively used for anomaly detection in previous work. We show that our method produces state-of-the-art results on the two currently most relevant defect detection datasets MVTec AD and MVTec 3D-AD regarding image-level anomaly detection on RGB and 3D data.
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v3 [cs.LG] UPDATED)
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. This result holds even if the data is linearly separable (which means achieving standard generalization is easy), and more generally for any parameterized function classes as long as their VC dimension is at most polynomial in the number of parameters. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Hybrid Decentralized Optimization: First- and Zeroth-Order Optimizers Can Be Jointly Leveraged For Faster Convergence. (arXiv:2210.07703v1 [cs.LG])
    Distributed optimization has become one of the standard ways of speeding up machine learning training, and most of the research in the area focuses on distributed first-order, gradient-based methods. Yet, there are settings where some computationally-bounded nodes may not be able to implement first-order, gradient-based optimization, while they could still contribute to joint optimization tasks. In this paper, we initiate the study of hybrid decentralized optimization, studying settings where nodes with zeroth-order and first-order optimization capabilities co-exist in a distributed system, and attempt to jointly solve an optimization task over some data distribution. We essentially show that, under reasonable parameter settings, such a system can not only withstand noisier zeroth-order agents but can even benefit from integrating such agents into the optimization process, rather than ignoring their information. At the core of our approach is a new analysis of distributed optimization with noisy and possibly-biased gradient estimators, which may be of independent interest. Experimental results on standard optimization tasks confirm our analysis, showing that hybrid first-zeroth order optimization can be practical.
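    The two estimator types being combined are easy to contrast in code: a first-order node returns the true (stochastic) gradient, while a zeroth-order node forms a two-point finite-difference estimate from function values only. A toy sketch on a hypothetical quadratic, with both contributions naively averaged:

        import numpy as np

        rng = np.random.default_rng(0)
        f = lambda x: ((x - 1.0) ** 2).sum()   # objective (optimum at ones)
        grad_f = lambda x: 2.0 * (x - 1.0)     # available to 1st-order nodes

        def zeroth_order_grad(f, x, mu=1e-4):
            # Two-point estimate: d * (f(x + mu u) - f(x)) / mu * u, with u
            # uniform on the unit sphere. Unbiased for a smoothed version of
            # f, but noticeably noisier than the true gradient.
            u = rng.standard_normal(x.shape)
            u /= np.linalg.norm(u)
            return x.size * (f(x + mu * u) - f(x)) / mu * u

        x = rng.standard_normal(5)
        for _ in range(500):
            g = 0.5 * (grad_f(x) + zeroth_order_grad(f, x))  # hybrid average
            x -= 0.05 * g
        print(np.round(x, 2))   # approaches the all-ones optimum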
    Self-Adaptive Training: Bridging Supervised and Self-Supervised Learning. (arXiv:2101.08732v3 [cs.LG] UPDATED)
    We propose self-adaptive training -- a unified training algorithm that dynamically calibrates and enhances training processes by model predictions without incurring an extra computational cost -- to advance both supervised and self-supervised learning of deep neural networks. We analyze the training dynamics of deep networks on training data that are corrupted by, e.g., random noise and adversarial examples. Our analysis shows that model predictions are able to magnify useful underlying information in data and this phenomenon occurs broadly even in the absence of any label information, highlighting that model predictions could substantially benefit the training processes: self-adaptive training improves the generalization of deep networks under noise and enhances the self-supervised representation learning. The analysis also sheds light on understanding deep learning, e.g., a potential explanation of the recently-discovered double-descent phenomenon in empirical risk minimization and the collapsing issue of the state-of-the-art self-supervised learning algorithms. Experiments on the CIFAR, STL, and ImageNet datasets verify the effectiveness of our approach in three applications: classification with label noise, selective classification, and linear evaluation. To facilitate future research, the code has been made publicly available at https://github.com/LayneH/self-adaptive-training.
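    A minimal sketch of the core calibration idea, maintaining an exponential moving average of the model's own predictions as per-example soft targets, might look as follows (the hyperparameters, toy model, and absence of a warm-up phase are illustrative assumptions; the paper's exact schedule is not reproduced):

```python
import torch
import torch.nn.functional as F

def self_adaptive_step(model, opt, x, targets, idx, alpha=0.9):
    """One step: refresh per-example soft targets with the model's own
    predictions, then train against the refreshed targets."""
    logits = model(x)
    with torch.no_grad():
        targets[idx] = alpha * targets[idx] + (1 - alpha) * F.softmax(logits, dim=1)
    loss = -(targets[idx] * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

model = torch.nn.Linear(20, 5)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
n = 128
x = torch.randn(n, 20)
y = torch.randint(0, 5, (n,))
targets = F.one_hot(y, 5).float()   # soft targets start from the given labels
for epoch in range(10):
    self_adaptive_step(model, opt, x, targets, torch.arange(n))
```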
    Flattened Graph Convolutional Networks For Recommendation. (arXiv:2210.07769v1 [cs.IR])
    Graph Convolutional Networks (GCNs) and their variants have achieved significant performance on various recommendation tasks. However, many existing GCN models tend to perform recursive aggregations among all related nodes, which can incur a severe computational burden and hinder their application to large-scale recommendation tasks. To this end, this paper proposes the flattened GCN (FlatGCN) model, which is able to achieve superior performance with remarkably lower complexity compared with existing models. Our main contribution is three-fold. First, we propose a simplified but powerful GCN architecture which aggregates the neighborhood information using one flattened GCN layer instead of recursively. The aggregation step in FlatGCN is parameter-free, such that it can be pre-computed with parallel computation to save memory and computational cost. Second, we propose an informative neighbor-infomax sampling method to select the most valuable neighbors by measuring the correlation among neighboring nodes based on a principled metric. Third, we propose a layer ensemble technique which improves the expressiveness of the learned representations by assembling the layer-wise neighborhood representations at the final layer. Extensive experiments on three datasets verify that our proposed model outperforms existing GCN models considerably and yields up to a few orders of magnitude speedup in training efficiency.
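    Because the flattened aggregation involves no learnable parameters, it can be computed once up front, as the following sketch illustrates (the mean-aggregation rule and toy graph are illustrative assumptions, not the paper's exact operator):

```python
import numpy as np

def flat_aggregate(adj, feats):
    """One flattened, parameter-free aggregation: row-normalized adjacency
    times node features (mean over neighbors)."""
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                  # guard isolated nodes
    return (adj / deg) @ feats

n, f = 6, 4
adj = np.zeros((n, n))
for i, j in [(0, 1), (0, 2), (1, 3), (2, 4), (4, 5)]:
    adj[i, j] = adj[j, i] = 1.0
feats = np.random.randn(n, f)

# Precompute once (trivially parallelizable); the downstream recommender
# consumes the aggregated features as fixed inputs.
agg = flat_aggregate(adj, feats)
print(agg.shape)
```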
    Revisiting Realistic Test-Time Training: Sequential Inference and Adaptation by Anchored Clustering. (arXiv:2206.02721v2 [cs.CV] UPDATED)
    Deploying models on target domain data subject to distribution shift requires adaptation. Test-time training (TTT) has emerged as a solution to this adaptation under a realistic scenario where access to the full source domain data is not available and instant inference on the target domain is required. Despite many efforts into TTT, there is confusion over the experimental settings, thus leading to unfair comparisons. In this work, we first revisit TTT assumptions and categorize TTT protocols by two key factors. Among the multiple protocols, we adopt a realistic sequential test-time training (sTTT) protocol, under which we further develop a test-time anchored clustering (TTAC) approach to enable stronger test-time feature learning. TTAC discovers clusters in both the source and target domains and matches the target clusters to the source ones to improve generalization. Pseudo-label filtering and iterative updating are developed to improve the effectiveness and efficiency of anchored clustering. We demonstrate that under all TTT protocols TTAC consistently outperforms the state-of-the-art methods on six TTT datasets. We hope this work will provide fair benchmarking of TTT methods, and that future research will be compared within the respective protocols. A demo code is available at https://github.com/Gorilla-Lab-SCUT/TTAC.
    Automated dysgraphia detection by deep learning with SensoGrip. (arXiv:2210.07659v1 [cs.LG])
    Dysgraphia, a handwriting learning disability, has a serious negative impact on children's academic results, daily life and overall wellbeing. Early detection of dysgraphia allows for an early start of a targeted intervention. Several studies have investigated dysgraphia detection by machine learning algorithms using a digital tablet. However, these studies deployed classical machine learning algorithms with manual feature extraction and selection as well as binary classification: either dysgraphia or no dysgraphia. In this work, we investigated fine grading of handwriting capabilities by predicting the SEMS score (between 0 and 12) with deep learning. Our approach provides an accuracy of more than 99% and a root mean square error lower than one, with automatic rather than manual feature extraction and selection. Furthermore, we used a smart pen called SensoGrip, a pen equipped with sensors to capture handwriting dynamics, instead of a tablet, enabling writing evaluation in more realistic scenarios.
    Close the Gate: Detecting Backdoored Models in Federated Learning based on Client-Side Deep Layer Output Analysis. (arXiv:2210.07714v1 [cs.CR])
    Federated Learning (FL) is a scheme for collaboratively training Deep Neural Networks (DNNs) with multiple data sources from different clients. Instead of sharing the data, each client trains the model locally, resulting in improved privacy. However, recently so-called targeted poisoning attacks have been proposed that allow individual clients to inject a backdoor into the trained model. Existing defenses against these backdoor attacks either rely on techniques like Differential Privacy to mitigate the backdoor, or analyze the weights of the individual models and apply outlier detection methods that restrict these defenses to certain data distributions. However, adding noise to the models' parameters or excluding benign outliers might also reduce the accuracy of the collaboratively trained model. Additionally, allowing the server to inspect the clients' models creates a privacy risk due to existing knowledge extraction methods. We propose \textit{CrowdGuard}, a model filtering defense that mitigates backdoor attacks by leveraging the clients' data to analyze the individual models before aggregation. To prevent data leaks, the server sends the individual models to secure enclaves running in client-located Trusted Execution Environments. To effectively distinguish benign and poisoned models, even if the data of different clients are not independently and identically distributed (non-IID), we introduce a novel metric called \textit{HLBIM} to analyze the outputs of the DNN's hidden layers. We show that the applied significance-based detection algorithm can effectively detect poisoned models, even in non-IID scenarios.
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v1 [stat.ML])
    In this paper, we interpret disentanglement from the manifold perspective and trace how it naturally leads to a necessary and sufficient condition for disentanglement: the disentangled factors must commute with each other. Along the way, we show how some technical results have consequences for the compression and disentanglement of generative models, and we also discuss the practical and theoretical implications of commutativity. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.
    Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models. (arXiv:2210.07723v1 [stat.ML])
    Various privacy-preserving frameworks that respect the individual's privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the $L_2$-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the ability to fit models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. In addition to deriving the equivalence of both algorithms, we also showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.
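    The CWB fitting loop at the heart of the method can be sketched as follows for the $L_2$-loss, with one univariate linear base learner per feature (the distributed and privacy-preserving aspects are omitted; all settings here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + 0.1 * rng.standard_normal(n)

pred = np.zeros(n)
nu = 0.1                                 # learning rate
coefs = np.zeros(p)                      # accumulated effect per feature
for _ in range(200):
    resid = y - pred                     # negative gradient of the L2-loss
    # Fit every candidate base learner (one univariate linear learner per
    # feature) to the residuals; keep only the best-fitting one.
    betas = X.T @ resid / (X ** 2).sum(axis=0)
    sse = ((resid[:, None] - X * betas) ** 2).sum(axis=0)
    j = int(np.argmin(sse))
    coefs[j] += nu * betas[j]
    pred += nu * betas[j] * X[:, j]
print("recovered coefficients:", np.round(coefs, 2))
```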
    Learning image representations for anomaly detection: application to discovery of histological alterations in drug development. (arXiv:2210.07675v1 [cs.CV])
    We present a system for anomaly detection in histopathological images. In histology, normal samples are usually abundant, whereas anomalous (pathological) cases are scarce or not available. Under such settings, one-class classifiers trained on healthy data can detect out-of-distribution anomalous samples. Such approaches combined with pre-trained Convolutional Neural Network (CNN) representations of images were previously employed for anomaly detection (AD). However, pre-trained off-the-shelf CNN representations may not be sensitive to abnormal conditions in tissues, while natural variations of healthy tissue may result in distant representations. To adapt representations to relevant details in healthy tissue, we propose training a CNN on an auxiliary task that discriminates healthy tissue of different species, organs, and staining reagents. Almost no additional labeling workload is required, since healthy samples come automatically with the aforementioned labels. During training we enforce compact image representations with a center-loss term, which further improves representations for AD. The proposed system outperforms established AD methods on a published dataset of liver anomalies. Moreover, it provides results comparable to those of conventional methods specifically tailored to quantifying liver anomalies. We show that our approach can be used for toxicity assessment of candidate drugs at early development stages and thereby may reduce expensive late-stage drug attrition.
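    The center-loss term can be sketched as follows (shapes, class structure, and the joint-training setup are illustrative assumptions, not the paper's exact recipe):

```python
import torch

def center_loss(features, labels, centers):
    """Mean squared distance of each representation to its class center;
    pulls same-class representations into compact clusters."""
    return ((features - centers[labels]) ** 2).sum(dim=1).mean()

num_classes, dim = 8, 32                 # e.g., species/organ/stain combinations
centers = torch.zeros(num_classes, dim, requires_grad=True)
feats = torch.randn(16, dim, requires_grad=True)
labels = torch.randint(0, num_classes, (16,))

loss = center_loss(feats, labels, centers)
loss.backward()                          # gradients flow to features and centers
print(float(loss))
```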
    Learning Generalizable Models for Vehicle Routing Problems via Knowledge Distillation. (arXiv:2210.07686v1 [cs.LG])
    Recent neural methods for vehicle routing problems always train and test the deep models on the same instance distribution (i.e., uniform). To tackle the consequent cross-distribution generalization concerns, we bring knowledge distillation to this field and propose an Adaptive Multi-Distribution Knowledge Distillation (AMDKD) scheme for learning more generalizable deep models. In particular, our AMDKD leverages various knowledge from multiple teachers trained on exemplar distributions to yield a lightweight yet generalist student model. Meanwhile, we equip AMDKD with an adaptive strategy that allows the student to concentrate on difficult distributions, so as to absorb hard-to-master knowledge more effectively. Extensive experimental results show that, compared with the baseline neural methods, our AMDKD is able to achieve competitive results on both unseen in-distribution and out-of-distribution instances, which are either randomly synthesized or adopted from benchmark datasets (i.e., TSPLIB and CVRPLIB). Notably, our AMDKD is generic and consumes fewer computational resources for inference.
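    A minimal sketch of the underlying offline distillation loss, a temperature-scaled KL term mixed with the task loss, is shown below (the temperature, mixing weight, and classification-style output are illustrative assumptions; the paper's adaptive multi-teacher weighting is not reproduced):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=2.0, lam=0.5):
    """Temperature-scaled KL to the teacher's soft targets, mixed with the
    ordinary task loss on the hard labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1 - lam) * hard

student_logits = torch.randn(4, 10, requires_grad=True)  # e.g., next-node scores
teacher_logits = torch.randn(4, 10)                      # precomputed offline
labels = torch.randint(0, 10, (4,))
kd_loss(student_logits, teacher_logits, labels).backward()
```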
    PrivMVMF: Privacy-Preserving Multi-View Matrix Factorization for Recommender Systems. (arXiv:2210.07775v1 [cs.IR])
    With an increasing focus on data privacy, there have been pilot studies on recommender systems in a federated learning (FL) framework, where multiple parties collaboratively train a model without sharing their data. Most of these studies assume that the conventional FL framework can fully protect user privacy. However, our study shows that there are serious privacy risks in matrix factorization within federated recommender systems. This paper first provides a rigorous theoretical analysis of the server reconstruction attack in four scenarios in federated recommender systems, followed by comprehensive experiments. The empirical results demonstrate that the FL server could infer users' information with accuracy >80% based on the uploaded gradients from FL nodes. The robustness analysis suggests that our reconstruction attack outperforms random guessing by >30% under Laplace noise with scale $b$ no larger than 0.5 for all scenarios. The paper then proposes a new privacy-preserving framework based on homomorphic encryption, Privacy-Preserving Multi-View Matrix Factorization (PrivMVMF), to enhance user data privacy protection in federated recommender systems. The proposed PrivMVMF is successfully implemented and tested thoroughly with the MovieLens dataset.
    Graph Machine Learning for Design of High-Octane Fuels. (arXiv:2206.00619v2 [cs.LG] UPDATED)
    Fuels with high knock resistance enable modern spark-ignition engines to achieve high efficiency and thus low CO2 emissions. Identification of molecules with desired autoignition properties, indicated by a high research octane number and a high octane sensitivity, is therefore of great practical relevance and can be supported by computer-aided molecular design (CAMD). Recent developments in the field of graph machine learning (graph-ML) provide novel, promising tools for CAMD. We propose a modular graph-ML CAMD framework that integrates generative graph-ML models with graph neural networks and optimization, enabling the design of molecules with desired ignition properties in a continuous molecular space. In particular, we explore the potential of Bayesian optimization and genetic algorithms in combination with generative graph-ML models. The graph-ML CAMD framework successfully identifies well-established high-octane components. It also suggests new candidates, one of which we experimentally investigate and use to illustrate the need for further auto-ignition training data.
    Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes. (arXiv:2210.07612v1 [stat.ML])
    The quality of many modern machine learning models improves as model complexity increases, an effect that has been quantified, for predictive performance, with the non-monotonic double descent learning curve. Here, we address the overarching question: is there an analogous theory of double descent for models which estimate uncertainty? We provide a partially affirmative and partially negative answer in the setting of Gaussian processes (GP). Under standard assumptions, we prove that higher model quality for optimally-tuned GPs (including uncertainty prediction) under marginal likelihood is realized for larger input dimensions, and therefore exhibits a monotone error curve. After showing that marginal likelihood does not naturally exhibit double descent in the input dimension, we highlight related forms of posterior predictive loss that do exhibit non-monotonicity. Finally, we verify empirically that our results hold for real data, beyond our considered assumptions, and we explore consequences involving synthetic covariates.
    Comparing interpretation methods in mental state decoding analyses with deep learning models. (arXiv:2205.15581v2 [q-bio.NC] UPDATED)
    Deep learning (DL) models find increasing application in mental state decoding, where researchers seek to understand the mapping between mental states (e.g., perceiving fear or joy) and brain activity by identifying those brain regions (and networks) whose activity allows these states to be accurately identified (i.e., decoded). Once a DL model has been trained to accurately decode a set of mental states, neuroimaging researchers often make use of interpretation methods from explainable artificial intelligence research to understand the model's learned mappings between mental states and brain activity. Here, we compare the explanation performance of prominent interpretation methods in a mental state decoding analysis of three functional Magnetic Resonance Imaging (fMRI) datasets. Our findings demonstrate a gradient between two key characteristics of an explanation in mental state decoding, namely, its biological plausibility and faithfulness: interpretation methods with high explanation faithfulness, which capture the model's decision process well, generally provide explanations that are biologically less plausible than the explanations of interpretation methods with less explanation faithfulness. Based on this finding, we provide specific recommendations for the application of interpretation methods in mental state decoding.
    Pareto-aware Neural Architecture Generation for Diverse Computational Budgets. (arXiv:2210.07634v1 [cs.LG])
    Designing feasible and effective architectures under diverse computational budgets, incurred by different applications/devices, is essential for deploying deep models in real-world applications. To achieve this goal, existing methods often perform an independent architecture search process for each target budget, which is very inefficient yet unnecessary. More critically, these independent search processes cannot share their learned knowledge (i.e., the distribution of good architectures) with each other and thus often result in limited search results. To address these issues, we propose a Pareto-aware Neural Architecture Generator (PNAG) which only needs to be trained once and dynamically produces the Pareto optimal architecture for any given budget via inference. To train our PNAG, we learn the whole Pareto frontier by jointly finding multiple Pareto optimal architectures under diverse budgets. Such a joint search algorithm not only greatly reduces the overall search cost but also improves the search results. Extensive experiments on three hardware platforms (i.e., mobile device, CPU, and GPU) show the superiority of our method over existing methods.
    Simpson's Paradox in Recommender Fairness: Reconciling differences between per-user and aggregated evaluations. (arXiv:2210.07755v1 [cs.IR])
    There has been a flurry of research in recent years on notions of fairness in ranking and recommender systems, particularly on how to evaluate whether a recommender allocates exposure equally across groups of relevant items (also known as provider fairness). While this research has laid an important foundation, it gave rise to different approaches depending on whether relevant items are compared per-user/per-query or aggregated across users. Despite both being established and intuitive, we discover that these two notions can lead to opposite conclusions, a form of Simpson's Paradox. We reconcile these notions and show that the tension is due to differences in the distributions of users for whom items are relevant, and we break down the important factors of a user's recommendations. Based on this new understanding, practitioners might be interested in either notion, but might face challenges with the per-user metric due to partial observability of relevance and user satisfaction, typical in real-world recommenders. We describe a technique based on distribution matching to estimate the per-user metric in such a scenario. We demonstrate on simulated and real-world recommender data the effectiveness and usefulness of such an approach.
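    The paradox is easy to reproduce numerically. In the toy instance below (all numbers are illustrative, not from the paper), group A receives more exposure per relevant item than group B for every user, yet less once exposures and relevances are aggregated across users:

```python
import numpy as np

# rows = users, columns = groups (A, B)
relevant = np.array([[1.0, 9.0],
                     [9.0, 1.0]])        # relevant items per user and group
rate = np.array([[0.8, 0.7],
                 [0.3, 0.2]])            # per-user exposure per relevant item
exposure = relevant * rate               # total exposure per user and group

print("per-user rates (A vs B):")
print(exposure / relevant)               # A wins in every row...
agg = exposure.sum(axis=0) / relevant.sum(axis=0)
print("aggregated rates (A vs B):", agg) # ...yet B wins after aggregation
```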
    Federated Best Arm Identification with Heterogeneous Clients. (arXiv:2210.07780v1 [cs.LG])
    We study best arm identification in a federated multi-armed bandit setting with a central server and multiple clients, when each client has access to a {\em subset} of arms and each arm yields independent Gaussian observations. The {\em reward} from an arm at any given time is defined as the average of the observations generated at this time across all the clients that have access to the arm. The end goal is to identify the best arm (the arm with the largest mean reward) of each client with the least expected stopping time, subject to an upper bound on the error probability (i.e., the {\em fixed-confidence regime}). We provide a lower bound on the growth rate of the expected time to find the best arm of each client. Furthermore, we show that for any algorithm whose upper bound on the expected time to find the best arms matches with the lower bound up to a multiplicative constant, the ratio of any two consecutive communication time instants must be bounded, a result that is of independent interest. We then provide the first-known lower bound on the expected number of {\em communication rounds} required to find the best arms. We propose a novel algorithm based on the well-known {\em Track-and-Stop} strategy that communicates only at exponential time instants, and derive asymptotic upper bounds on its expected time to find the best arms and the expected number of communication rounds, where the asymptotics is one of vanishing error probabilities.
    Revisiting Optimal Convergence Rate for Smooth and Non-convex Stochastic Decentralized Optimization. (arXiv:2210.07863v1 [cs.LG])
    Decentralized optimization is effective for saving communication in large-scale machine learning. Although numerous algorithms have been proposed with theoretical guarantees and empirical successes, the performance limits in decentralized optimization, especially the influence of network topology and its associated weight matrix on the optimal convergence rate, have not been fully understood. While (Lu and Sa, 2021) have recently provided an optimal rate for non-convex stochastic decentralized optimization with weight matrices defined over linear graphs, the optimal rate with general weight matrices remains unclear. This paper revisits non-convex stochastic decentralized optimization and establishes an optimal convergence rate with general weight matrices. In addition, we also establish the optimal rate when the non-convex loss functions further satisfy the Polyak-Lojasiewicz (PL) condition. Following existing lines of analysis in the literature cannot achieve these results. Instead, we leverage the Ring-Lattice graph to admit general weight matrices while maintaining the optimal relation between the graph diameter and weight matrix connectivity. Lastly, we develop a new decentralized algorithm to nearly attain the above two optimal rates under additional mild conditions.
    Distributional Reward Estimation for Effective Multi-Agent Deep Reinforcement Learning. (arXiv:2210.07636v1 [cs.LG])
    Multi-agent reinforcement learning has drawn increasing attention in practice, e.g., in robotics and autonomous driving, as it can explore optimal policies using samples generated by interacting with the environment. However, high reward uncertainty remains a problem when we want to train a satisfactory model, because obtaining high-quality reward feedback is usually expensive and even infeasible. To handle this issue, previous methods mainly focus on passive reward correction. At the same time, recent active reward estimation methods have proven to be a recipe for reducing the effect of reward uncertainty. In this paper, we propose a novel Distributional Reward Estimation framework for effective Multi-Agent Reinforcement Learning (DRE-MARL). Our main idea is to design multi-action-branch reward estimation and policy-weighted reward aggregation for stabilized training. Specifically, we design the multi-action-branch reward estimation to model reward distributions on all action branches. Then we utilize reward aggregation to obtain stable updating signals during training. Our intuition is that considering all possible consequences of actions could be useful for learning policies. The superiority of DRE-MARL is demonstrated using benchmark multi-agent scenarios, compared with SOTA baselines in terms of both effectiveness and robustness.
    Not All Neighbors Are Worth Attending to: Graph Selective Attention Networks for Semi-supervised Learning. (arXiv:2210.07715v1 [cs.LG])
    Graph attention networks (GATs) are powerful tools for analyzing graph data from various real-world scenarios. To learn representations for downstream tasks, GATs generally attend to all neighbors of the central node when aggregating the features. In this paper, we show that a large portion of the neighbors are irrelevant to the central nodes in many real-world graphs, and can be excluded from neighbor aggregation. Taking this cue, we present Selective Attention (SA) and a series of novel attention mechanisms for graph neural networks (GNNs). SA leverages diverse forms of learnable node-node dissimilarity to acquire the scope of attention for each node, from which irrelevant neighbors are excluded. We further propose Graph Selective Attention networks (SATs) to learn representations from the highly correlated node features identified and investigated by the different SA mechanisms. Lastly, a theoretical analysis of the expressive power of the proposed SATs and a comprehensive empirical study of the SATs on challenging real-world datasets against state-of-the-art GNNs are presented to demonstrate the effectiveness of SATs.
    Abstract-to-Executable Trajectory Translation for One-Shot Task Generalization. (arXiv:2210.07658v1 [cs.LG])
    Training long-horizon robotic policies in complex physical environments is essential for many applications, such as robotic manipulation. However, learning a policy that can generalize to unseen tasks is challenging. In this work, we propose to achieve one-shot task generalization by decoupling plan generation and plan execution. Specifically, our method solves complex long-horizon tasks in three steps: build a paired abstract environment by simplifying geometry and physics, generate abstract trajectories, and solve the original task by an abstract-to-executable trajectory translator. In the abstract environment, complex dynamics such as physical manipulation are removed, making abstract trajectories easier to generate. However, this introduces a large domain gap between abstract trajectories and the actual executed trajectories as abstract trajectories lack low-level details and are not aligned frame-to-frame with the executed trajectory. In a manner reminiscent of language translation, our approach leverages a seq-to-seq model to overcome the large domain gap between the abstract and executable trajectories, enabling the low-level policy to follow the abstract trajectory. Experimental results on various unseen long-horizon tasks with different robot embodiments demonstrate the practicability of our methods to achieve one-shot task generalization.
    Mix and Reason: Reasoning over Semantic Topology with Data Mixing for Domain Generalization. (arXiv:2210.07571v1 [cs.CV])
    Domain generalization (DG) enables generalizing a learning machine from multiple seen source domains to an unseen target one. The general objective of DG methods is to learn semantic representations that are independent of domain labels, which is theoretically sound but empirically challenged due to the complex mixture of common and domain-specific factors. Although disentangling the representations into two disjoint parts has been gaining momentum in DG, the strong presumption over the data limits its efficacy in many real-world scenarios. In this paper, we propose Mix and Reason (\mire), a new DG framework that learns semantic representations by enforcing the structural invariance of semantic topology. \mire\ consists of two key components, namely, Category-aware Data Mixing (CDM) and Adaptive Semantic Topology Refinement (ASTR). CDM mixes two images from different domains by virtue of activation maps generated by two complementary classification losses, making the classifier focus on the representations of semantic objects. ASTR introduces relation graphs to represent semantic topology, which is progressively refined via the interactions between local feature aggregation and global cross-domain relational reasoning. Experiments on multiple DG benchmarks validate the effectiveness and robustness of the proposed \mire.
    Unconditional Image-Text Pair Generation with Multimodal Cross Quantizer. (arXiv:2204.07537v2 [cs.CV] UPDATED)
    Although deep generative models have gained a lot of attention, most of the existing works are designed for unimodal generation. In this paper, we explore a new method for unconditional image-text pair generation. We design Multimodal Cross-Quantization VAE (MXQ-VAE), a novel vector quantizer for joint image-text representations, with which we discover that a joint image-text representation space is effective for semantically consistent image-text pair generation. To learn a multimodal semantic correlation in a quantized space, we combine VQ-VAE with a Transformer encoder and apply an input masking strategy. Specifically, MXQ-VAE accepts a masked image-text pair as input and learns a quantized joint representation space, so that the input can be converted to a unified code sequence, then we perform unconditional image-text pair generation with the code sequence. Extensive experiments show the correlation between the quantized joint space and the multimodal generation capability on synthetic and real-world datasets. In addition, we demonstrate the superiority of our approach in these two aspects over several baselines. The source code is publicly available at: https://github.com/ttumyche/MXQ-VAE.
    The Invariant Ground Truth of Affect. (arXiv:2210.07630v1 [cs.AI])
    Affective computing strives to unveil the unknown relationship between affect elicitation, the manifestation of affect, and affect annotations. The ground truth of affect, however, is predominantly attributed to the affect labels, which inadvertently include biases inherent to the subjective nature of emotion and its labeling. The response to such limitations is usually to augment the dataset with more annotations per data point; however, this is not possible when we are interested in self-reports via first-person annotation. Moreover, outlier detection methods based on inter-annotator agreement only consider the annotations themselves and ignore the context and the corresponding affect manifestation. This paper reframes the ways one may obtain a reliable ground truth of affect by transferring aspects of causation theory to affective computing. In particular, we assume that the ground truth of affect can be found in the causal relationships between elicitation, manifestation and annotation that remain \emph{invariant} across tasks and participants. To test our assumption we employ causation-inspired methods for detecting outliers in affective corpora and building affect models that are robust across participants and tasks. We validate our methodology within the domain of digital games, with experimental results showing that it can successfully detect outliers and boost the accuracy of affect models. To the best of our knowledge, this study presents the first attempt to integrate causation tools in affective computing, making a crucial and decisive step towards general affect modeling.
    Self-Supervised 2D/3D Registration for X-Ray to CT Image Fusion. (arXiv:2210.07611v1 [eess.IV])
    Deep Learning-based 2D/3D registration enables fast, robust, and accurate X-ray to CT image fusion when large annotated paired datasets are available for training. However, the need for paired CT volume and X-ray images with ground truth registration limits the applicability in interventional scenarios. An alternative is to use simulated X-ray projections from CT volumes, thus removing the need for paired annotated datasets. Deep Neural Networks trained exclusively on simulated X-ray projections can perform significantly worse on real X-ray images due to the domain gap. We propose a self-supervised 2D/3D registration framework combining simulated training with unsupervised feature and pixel space domain adaptation to overcome the domain gap and eliminate the need for paired annotated datasets. Our framework achieves a registration accuracy of 1.83$\pm$1.16 mm with a high success ratio of 90.1% on real X-ray images showing a 23.9% increase in success ratio compared to reference annotation-free algorithms.
    Monte Carlo Augmented Actor-Critic for Sparse Reward Deep Reinforcement Learning from Suboptimal Demonstrations. (arXiv:2210.07432v1 [cs.LG])
    Providing densely shaped reward functions for RL algorithms is often exceedingly challenging, motivating the development of RL algorithms that can learn from easier-to-specify sparse reward functions. This sparsity poses new exploration challenges. One common way to address this problem is using demonstrations to provide initial signal about regions of the state space with high rewards. However, prior RL-from-demonstrations algorithms introduce significant complexity and many hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a parameter-free modification to standard actor-critic algorithms which initializes the replay buffer with demonstrations and computes a modified $Q$-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high $Q$-values in corresponding regions of the state space. Experiments across $5$ continuous control domains suggest that MCAC can be used to significantly increase learning efficiency across $6$ commonly used RL and RL-from-demonstrations algorithms. See https://sites.google.com/view/mcac-rl for code and supplementary material.
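    The modified target is simple to state in code. The sketch below computes the elementwise maximum of a TD(0) target and the discounted Monte Carlo reward-to-go on a toy sparse-reward trajectory (buffer layout and critic values are illustrative assumptions):

```python
import numpy as np

gamma = 0.99

def mc_reward_to_go(rewards):
    """Discounted return from each step to the end of the trajectory."""
    rtg = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def mcac_targets(rewards, next_q, dones, rtg):
    td = rewards + gamma * (1.0 - dones) * next_q   # standard TD(0) target
    return np.maximum(td, rtg)                      # MCAC: max of TD and MC

rewards = np.array([0.0, 0.0, 0.0, 1.0])            # sparse reward at the end
next_q = np.array([0.1, 0.1, 0.1, 0.0])             # toy critic estimates
dones = np.array([0.0, 0.0, 0.0, 1.0])
print(mcac_targets(rewards, next_q, dones, mc_reward_to_go(rewards)))
```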
    Mutual Information Regularized Offline Reinforcement Learning. (arXiv:2210.07484v1 [cs.LG])
    Offline reinforcement learning (RL) aims at learning an effective policy from offline datasets without active interactions with the environment. The major challenge of offline RL is the distribution shift that appears when out-of-distribution actions are queried, which makes the policy improvement direction biased by extrapolation errors. Most existing methods address this problem by penalizing the policy for deviating from the behavior policy during policy improvement or making conservative updates for value functions during policy evaluation. In this work, we propose a novel MISA framework to approach offline RL from the perspective of Mutual Information between States and Actions in the dataset by directly constraining the policy improvement direction. Intuitively, mutual information measures the mutual dependence of actions and states, which reflects how a behavior agent reacts to certain environment states during data collection. To effectively utilize this information to facilitate policy learning, MISA constructs lower bounds of mutual information parameterized by the policy and Q-values. We show that optimizing this lower bound is equivalent to maximizing the likelihood of a one-step improved policy on the offline dataset. In this way, we constrain the policy improvement direction to lie in the data manifold. The resulting algorithm simultaneously augments the policy evaluation and improvement by adding a mutual information regularization. MISA is a general offline RL framework that unifies conservative Q-learning (CQL) and behavior regularization methods (e.g., TD3+BC) as special cases. Our experiments show that MISA performs significantly better than existing methods and achieves new state-of-the-art on various tasks of the D4RL benchmark.
    Bandwidth-efficient distributed neural network architectures with application to body sensor networks. (arXiv:2210.07750v1 [cs.LG])
    In this paper, we describe a conceptual design methodology to design distributed neural network architectures that can perform efficient inference within sensor networks with communication bandwidth constraints. The different sensor channels are distributed across multiple sensor devices, which have to exchange data over bandwidth-limited communication channels to solve, e.g., a classification task. Our design methodology starts from a user-defined centralized neural network and transforms it into a distributed architecture in which the channels are distributed over different nodes. The distributed network consists of two parallel branches of which the outputs are fused at the fusion center. The first branch collects classification results from local, node-specific classifiers while the second branch compresses each node's signal and then reconstructs the multi-channel time series for classification at the fusion center. We further improve bandwidth gains by dynamically activating the compression path when the local classifications do not suffice. We validate this method on a motor execution task in an emulated EEG sensor network and analyze the resulting bandwidth-accuracy trade-offs. Our experiments show that the proposed framework enables up to a factor 20 in bandwidth reduction with minimal loss (up to 2%) in classification accuracy compared to the centralized baseline on the demonstrated motor execution task. The proposed method offers a way to smoothly transform a centralized architecture to a distributed, bandwidth-efficient network amenable for low-power sensor networks. While the application focus of this paper is on wearable brain-computer interfaces, the proposed methodology can be applied in other sensor network-like applications as well.
    Training speech emotion classifier without categorical annotations. (arXiv:2210.07642v1 [cs.SD])
    There are two paradigms of emotion representation: categorical labeling and dimensional description in continuous space. The emotion recognition task can therefore be treated as classification or as regression. The main aim of this study is to investigate the relation between these two representations and propose a classification pipeline that uses only dimensional annotation. The proposed approach contains a regressor model which is trained to predict a vector of continuous values in the dimensional representation for given speech audio. The output of this model can be interpreted as an emotional category using a mapping algorithm. We investigated the performance of combinations of three feature extractors, three neural network architectures, and three mapping algorithms on two different corpora. Our study shows the advantages and limitations of the classification-via-regression approach.
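    One possible mapping algorithm of the kind described, nearest-centroid assignment in valence-arousal space, can be sketched as follows (the reference coordinates are illustrative placeholders, not values from the paper or any corpus):

```python
import numpy as np

# Hypothetical reference points for categories in (valence, arousal) space.
centroids = {
    "happy":   np.array([0.8, 0.6]),
    "angry":   np.array([-0.6, 0.7]),
    "sad":     np.array([-0.7, -0.5]),
    "neutral": np.array([0.0, 0.0]),
}

def map_to_category(va_pred):
    """Nearest-centroid mapping from a dimensional prediction to a category."""
    names = list(centroids)
    dists = [np.linalg.norm(va_pred - centroids[k]) for k in names]
    return names[int(np.argmin(dists))]

print(map_to_category(np.array([0.7, 0.5])))    # -> "happy"
print(map_to_category(np.array([-0.1, 0.05])))  # -> "neutral"
```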
    A Continuous Time Framework for Discrete Denoising Models. (arXiv:2205.14987v2 [stat.ML] UPDATED)
    We provide the first complete continuous time framework for denoising diffusion models of discrete data. This is achieved by formulating the forward noising process and corresponding reverse time generative process as Continuous Time Markov Chains (CTMCs). The model can be efficiently trained using a continuous time version of the ELBO. We simulate the high dimensional CTMC using techniques developed in chemical physics and exploit our continuous time framework to derive high performance samplers that we show can outperform discrete time methods for discrete data. The continuous time treatment also enables us to derive a novel theoretical result bounding the error between the generated sample distribution and the true data distribution.
    HGARN: Hierarchical Graph Attention Recurrent Network for Human Mobility Prediction. (arXiv:2210.07765v1 [cs.LG])
    Human mobility prediction is a fundamental task essential for various applications, including urban planning, transportation services, and location recommendation. Existing approaches often ignore activity information crucial for reasoning about human preferences and routines, or adopt a simplified representation of the dependencies between time, activities and locations. To address these issues, we present the Hierarchical Graph Attention Recurrent Network (HGARN) for human mobility prediction. Specifically, we construct a hierarchical graph based on all users' history mobility records and employ a Hierarchical Graph Attention Module to capture complex time-activity-location dependencies. This way, HGARN can learn representations with rich contextual semantics to model user preferences at the global level. We also propose a model-agnostic history-enhanced confidence (MaHec) label to focus our model on each user's individual-level preferences. Finally, we introduce a Recurrent Encoder-Decoder Module, which employs recurrent structures to jointly predict users' next activities (as an auxiliary task) and locations. For model evaluation, we test the performance of HGARN against existing SOTA methods in recurring and explorative settings. The recurring setting focuses more on assessing models' capabilities to capture users' individual-level preferences. In contrast, the results in the explorative setting tend to reflect the power of different models to learn users' global-level preferences. Overall, our model outperforms other baselines significantly in the main, recurring, and explorative settings based on two real-world human mobility data benchmarks. Source code of HGARN is available at https://github.com/YihongT/HGARN.
    G2A2: An Automated Graph Generator with Attributes and Anomalies. (arXiv:2210.07449v1 [cs.LG])
    Many data-mining applications use dynamic attributed graphs to represent relational information; but due to security and privacy concerns, there is a dearth of available datasets that can be represented as dynamic attributed graphs. Even when such datasets are available, they do not have ground truth that can be used to train deep-learning models. Thus, we present G2A2, an automated graph generator with attributes and anomalies, which encompasses (1) probabilistic models to generate a dynamic bipartite graph, representing time-evolving connections between two independent sets of entities, (2) realistic injection of anomalies using a novel algorithm that captures the general properties of graph anomalies across domains, and (3) a deep generative model to produce realistic attributes, learned from an existing real-world dataset. Using the maximum mean discrepancy (MMD) metric to evaluate the realism of a G2A2-generated graph against three real-world graphs, G2A2 outperforms Kronecker graph generation by reducing the MMD distance by up to six-fold (6x).
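    For reference, the (biased) empirical MMD^2 statistic with an RBF kernel used for such comparisons can be sketched as follows (the bandwidth and toy inputs are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimator: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    return (rbf_kernel(X, X, sigma).mean()
            + rbf_kernel(Y, Y, sigma).mean()
            - 2 * rbf_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(200, 3))   # e.g., per-graph statistics
close = rng.normal(0.1, 1.0, size=(200, 3))  # a generator close to the data
far = rng.normal(2.0, 1.0, size=(200, 3))    # a generator far from the data
print(mmd2(real, close), mmd2(real, far))    # the better generator scores lower
```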
    (1,1)-Cluster Editing is Polynomial-time Solvable. (arXiv:2210.07722v1 [cs.DS])
    A graph $H$ is a cluster graph if $H$ is a vertex-disjoint union of cliques. Abu-Khzam (2017) introduced the $(a,d)$-{Cluster Editing} problem, where for fixed natural numbers $a,d$, given a graph $G$ and vertex-weights $a^*:\ V(G)\rightarrow \{0,1,\dots, a\}$ and $d^*:\ V(G)\rightarrow \{0,1,\dots, d\}$, we are to decide whether $G$ can be turned into a cluster graph by deleting at most $d^*(v)$ edges incident to every $v\in V(G)$ and adding at most $a^*(v)$ edges incident to every $v\in V(G)$. Results by Komusiewicz and Uhlmann (2012) and Abu-Khzam (2017) provided a dichotomy of complexity (in P or NP-complete) of $(a,d)$-{Cluster Editing} for all pairs $a,d$ apart from $a=d=1.$ Abu-Khzam (2017) conjectured that $(1,1)$-{Cluster Editing} is in P. We resolve Abu-Khzam's conjecture in the affirmative by (i) providing a series of five polynomial-time reductions to $C_3$-free and $C_4$-free graphs of maximum degree at most 3, and (ii) designing a polynomial-time algorithm for solving $(1,1)$-{Cluster Editing} on $C_3$-free and $C_4$-free graphs of maximum degree at most 3.
    Safe Model-Based Reinforcement Learning with an Uncertainty-Aware Reachability Certificate. (arXiv:2210.07553v1 [cs.RO])
    Safe reinforcement learning (RL), which learns constraint-satisfying policies, provides a promising path toward broader safety-critical applications of RL in real-world problems such as robotics. Among all safe RL approaches, model-based methods further reduce training-time violations due to their high sample efficiency. However, a lack of safety robustness against model uncertainties remains an issue in safe model-based RL, especially regarding training-time safety. In this paper, we propose a distributional reachability certificate (DRC) and its Bellman equation to address model uncertainties and characterize robust persistently safe states. Furthermore, we build a safe RL framework to resolve the constraints required by the DRC and its corresponding shield policy. We also devise a line search method to maintain safety and reach higher returns simultaneously while leveraging the shield policy. Comprehensive experiments on classical benchmarks such as constrained tracking and navigation indicate that the proposed algorithm achieves comparable returns with much fewer constraint violations during training.
    Provable Subspace Identification Under Post-Nonlinear Mixtures. (arXiv:2210.07532v1 [cs.LG])
    Unsupervised mixture learning (UML) aims at identifying linearly or nonlinearly mixed latent components in a blind manner. UML is known to be challenging: Even learning linear mixtures requires highly nontrivial analytical tools, e.g., independent component analysis or nonnegative matrix factorization. In this work, the post-nonlinear (PNL) mixture model -- where unknown element-wise nonlinear functions are imposed onto a linear mixture -- is revisited. The PNL model is widely employed in different fields ranging from brain signal classification, speech separation, remote sensing, to causal discovery. To identify and remove the unknown nonlinear functions, existing works often assume different properties on the latent components (e.g., statistical independence or probability-simplex structures). This work shows that under a carefully designed UML criterion, the existence of a nontrivial null space associated with the underlying mixing system suffices to guarantee identification/removal of the unknown nonlinearity. Compared to prior works, our finding largely relaxes the conditions of attaining PNL identifiability, and thus may benefit applications where no strong structural information on the latent components is known. A finite-sample analysis is offered to characterize the performance of the proposed approach under realistic settings. To implement the proposed learning criterion, a block coordinate descent algorithm is proposed. A series of numerical experiments corroborate our theoretical claims.
    CAB: Comprehensive Attention Benchmarking on Long Sequence Modeling. (arXiv:2210.07661v1 [cs.LG])
    Transformer has achieved remarkable success in language, image, and speech processing. Recently, various efficient attention architectures have been proposed to improve transformer's efficiency while largely preserving its efficacy, especially in modeling long sequences. A widely-used benchmark to test these efficient methods' capability on long-range modeling is Long Range Arena (LRA). However, LRA only focuses on the standard bidirectional (or noncausal) self attention, and completely ignores cross attentions and unidirectional (or causal) attentions, which are equally important to downstream applications. Although designing cross and causal variants of an attention method is straightforward for vanilla attention, it is often challenging for efficient attentions with subquadratic time and memory complexity. In this paper, we propose Comprehensive Attention Benchmark (CAB) under a fine-grained attention taxonomy with four distinguishable attention patterns, namely, noncausal self, causal self, noncausal cross, and causal cross attentions. CAB collects seven real-world tasks from different research areas to evaluate efficient attentions under the four attention patterns. Among these tasks, CAB validates efficient attentions in eight backbone networks to show their generalization across neural architectures. We conduct exhaustive experiments to benchmark the performances of nine widely-used efficient attention architectures designed with different philosophies on CAB. Extensive experimental results also shed light on the fundamental problems of efficient attentions, such as efficiency length against vanilla attention, performance consistency across attention patterns, the benefit of attention mechanisms, and interpolation/extrapolation on long-context language modeling.
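    The four attention patterns in the taxonomy can be made concrete as boolean masks over (query, key) positions, as in the sketch below (masking future keys by position is one common convention for the causal cross case; sequence lengths are illustrative):

```python
import numpy as np

def attention_mask(q_len, k_len, causal):
    """True where query position i may attend to key position j."""
    mask = np.ones((q_len, k_len), dtype=bool)
    if causal:
        mask &= np.tril(np.ones((q_len, k_len), dtype=bool))
    return mask

n, m = 4, 6   # query length; key length for the cross patterns
patterns = {
    "noncausal self":  attention_mask(n, n, causal=False),
    "causal self":     attention_mask(n, n, causal=True),
    "noncausal cross": attention_mask(n, m, causal=False),
    "causal cross":    attention_mask(n, m, causal=True),
}
for name, mask in patterns.items():
    print(name)
    print(mask.astype(int))
```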
    See Blue Sky: Deep Image Dehaze Using Paired and Unpaired Training Images. (arXiv:2210.07594v1 [cs.CV])
    The issue of image haze removal has attracted wide attention in recent years. However, most existing haze removal methods cannot restore scenes with a clear blue sky, since the color and texture information of objects in the original hazy image is insufficient. To remedy this, we propose a cycle generative adversarial network to construct a novel end-to-end image dehazing model. We adopt outdoor image datasets to train our model, including a real-world unpaired image dataset and a paired image dataset, to ensure that the generated images are close to real scenes. Based on the cycle structure, our model adds four different kinds of loss functions to constrain the effect: adversarial loss, cycle consistency loss, photorealism loss and paired L1 loss. These four constraints improve the overall quality and visual appeal of the restored images and keep the reconstructions free from distortion. The proposed model removes haze from images and also restores the sky to a clean blue (as if captured in sunny weather).
    Latent Temporal Flows for Multivariate Analysis of Wearables Data. (arXiv:2210.07475v1 [cs.LG])
    Increased use of sensor signals from wearable devices as rich sources of physiological data has sparked growing interest in developing health monitoring systems to identify changes in an individual's health profile. Indeed, machine learning models for sensor signals have enabled a diverse range of healthcare related applications including early detection of abnormalities, fertility tracking, and adverse drug effect prediction. However, these models can fail to account for the dependent high-dimensional nature of the underlying sensor signals. In this paper, we introduce Latent Temporal Flows, a method for multivariate time-series modeling tailored to this setting. We assume that a set of sequences is generated from a multivariate probabilistic model of an unobserved time-varying low-dimensional latent vector. Latent Temporal Flows simultaneously recovers a transformation of the observed sequences into lower-dimensional latent representations via deep autoencoder mappings, and estimates a temporally-conditioned probabilistic model via normalizing flows. Using data from the Apple Heart and Movement Study (AH&MS), we illustrate promising forecasting performance on these challenging signals. Additionally, by analyzing two and three dimensional representations learned by our model, we show that we can identify participants' $\text{VO}_2\text{max}$, a main indicator and summary of cardio-respiratory fitness, using only lower-level signals. Finally, we show that the proposed method consistently outperforms the state-of-the-art in multi-step forecasting benchmarks (achieving at least a $10\%$ performance improvement) on several real-world datasets, while enjoying increased computational efficiency.
    Discrete Optimal Transport with Independent Marginals is #P-Hard. (arXiv:2203.01161v2 [math.OC] UPDATED)
    We study the computational complexity of the optimal transport problem that evaluates the Wasserstein distance between the distributions of two K-dimensional discrete random vectors. The best known algorithms for this problem run in polynomial time in the maximum of the number of atoms of the two distributions. However, if the components of either random vector are independent, then this number can be exponential in K even though the size of the problem description scales linearly with K. We prove that the described optimal transport problem is #P-hard even if all components of the first random vector are independent uniform Bernoulli random variables, while the second random vector has merely two atoms, and even if only approximate solutions are sought. We also develop a dynamic programming-type algorithm that approximates the Wasserstein distance in pseudo-polynomial time when the components of the first random vector follow arbitrary independent discrete distributions, and we identify special problem instances that can be solved exactly in strongly polynomial time.
    Counterfactual Neural Temporal Point Process for Estimating Causal Influence of Misinformation on Social Media. (arXiv:2210.07518v1 [cs.LG])
    Recent years have witnessed the rise of misinformation campaigns that spread specific narratives on social media to manipulate public opinion in different areas, such as politics and healthcare. Consequently, an effective and efficient automatic methodology to estimate the influence of misinformation on user beliefs and activities is needed. However, existing works on misinformation impact estimation either rely on small-scale psychological experiments or can only discover the correlation between user behaviour and misinformation. To address these issues, in this paper, we build a causal framework that models the causal effect of misinformation from the perspective of temporal point processes. To handle large-scale data, we design an efficient yet precise way to estimate the Individual Treatment Effect (ITE) via a neural temporal point process and Gaussian mixture models. Extensive experiments on a synthetic dataset verify the effectiveness and efficiency of our model. We further apply our model to a real-world dataset of social media posts and engagements about COVID-19 vaccines. The experimental results indicate that our model recognizes an identifiable causal effect of misinformation that harms people's subjective emotions toward the vaccines.
    Amortized Inference for Heterogeneous Reconstruction in Cryo-EM. (arXiv:2210.07387v1 [cs.CV])
    Cryo-electron microscopy (cryo-EM) is an imaging modality that provides unique insights into the dynamics of proteins and other building blocks of life. The algorithmic challenge of jointly estimating the poses, 3D structure, and conformational heterogeneity of a biomolecule from millions of noisy and randomly oriented 2D projections in a computationally efficient manner, however, remains unsolved. Our method, cryoFIRE, performs ab initio heterogeneous reconstruction with unknown poses in an amortized framework, thereby avoiding the computationally expensive step of pose search while enabling the analysis of conformational heterogeneity. Poses and conformation are jointly estimated by an encoder while a physics-based decoder aggregates the images into an implicit neural representation of the conformational space. We show that our method can provide one order of magnitude speedup on datasets containing millions of images without any loss of accuracy. We validate that the joint estimation of poses and conformations can be amortized over the size of the dataset. For the first time, we prove that an amortized method can extract interpretable dynamic information from experimental datasets.
    FedFM: Anchor-based Feature Matching for Data Heterogeneity in Federated Learning. (arXiv:2210.07615v1 [cs.LG])
    One of the key challenges in federated learning (FL) is local data distribution heterogeneity across clients, which may cause inconsistent feature spaces across clients. To address this issue, we propose a novel method, FedFM, which guides each client's features to match shared category-wise anchors (landmarks in feature space). This method attempts to mitigate the negative effects of data heterogeneity in FL by aligning each client's feature space. Besides, we tackle the challenge of a varying objective function and provide a convergence guarantee for FedFM. In FedFM, to mitigate the phenomenon of overlapping feature spaces across categories and enhance the effectiveness of feature matching, we further propose a more precise and effective feature matching loss called contrastive-guiding (CG), which guides each local feature to match the corresponding anchor while keeping away from non-corresponding anchors. Additionally, to achieve higher efficiency and flexibility, we propose a FedFM variant, called FedFM-Lite, in which clients communicate with the server with fewer synchronization rounds and lower communication bandwidth costs. Through extensive experiments, we demonstrate that FedFM with CG outperforms several existing works in quantitative and qualitative comparisons. FedFM-Lite can achieve better performance than state-of-the-art methods with five to ten times lower communication cost.
    Characterizing the Influence of Graph Elements. (arXiv:2210.07441v1 [cs.LG])
    The influence function, a method from robust statistics, measures the change in model parameters, or in functions of model parameters, caused by the removal or modification of training instances. It is an efficient and useful post-hoc method for studying the interpretability of machine learning models without the need for expensive model re-training. Recently, graph convolution networks (GCNs), which operate on graph data, have attracted a great deal of attention. However, there is no preceding research on influence functions for GCNs that sheds light on the effects of removing training nodes/edges from an input graph. Since the nodes/edges in a graph are interdependent in GCNs, it is challenging to derive influence functions for GCNs. To fill this gap, we start with the simple graph convolution (SGC) model that operates on an attributed graph and formulate an influence function to approximate the changes in model parameters when a node or an edge is removed from the graph. Moreover, we theoretically analyze the error bound of the estimated influence of removing an edge. We experimentally validate the accuracy and effectiveness of our influence estimation function. In addition, we show that the influence function of an SGC model can be used to estimate the impact of removing training nodes/edges on the test performance of the SGC without re-training the model. Finally, we demonstrate how to use influence functions to guide adversarial attacks on GCNs effectively.
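    For orientation, the classical influence-function estimate that this line of work builds on fits in a few lines: the first-order change in parameters from removing a training point $z$ is approximately $H^{-1} \nabla_\theta L(z)$, where $H$ is the Hessian of the training loss. A minimal sketch with toy numbers follows; the paper's graph-specific derivation, which handles node/edge interdependence, is not reproduced here.

        import numpy as np

        def influence_of_removal(grad_z, hessian):
            # First-order parameter change when training point z is
            # removed: H^{-1} grad L(z), up to a 1/n scaling.
            return np.linalg.solve(hessian, grad_z)

        H = np.array([[2.0, 0.3], [0.3, 1.5]])  # Hessian of the training loss
        g = np.array([0.4, -0.2])               # gradient of the loss at z
        print(influence_of_removal(g, H))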
    Robust Candidate Generation for Entity Linking on Short Social Media Texts. (arXiv:2210.07472v1 [cs.CL])
    Entity Linking (EL) is the gateway into Knowledge Bases. Recent advances in EL utilize dense retrieval approaches for Candidate Generation, which address some of the shortcomings of the lookup-based approach of matching NER mentions against pre-computed dictionaries. In this work, we show that in the domain of Tweets, such methods struggle, as users often use informal spelling and limited context, and tweets lack specificity, among other issues. We investigate these challenges on a large and recent Tweets benchmark for EL, empirically evaluate lookup and dense retrieval approaches, and demonstrate that a hybrid solution using long contextual representations from Wikipedia is necessary to achieve considerable gains over previous work, reaching 0.93 recall.
    PCFG-based Natural Language Interface Improves Generalization for Controlled Text Generation. (arXiv:2210.07431v1 [cs.CL])
    Existing work on controlled text generation (CTG) assumes a control interface of categorical attributes. In this work, we propose a natural language (NL) interface, where we craft a PCFG to embed the control attributes into natural language commands, and propose variants of existing CTG models that take commands as input. In our experiments, we design tailored setups to test the models' generalization abilities. We find that our PCFG-based command generation approach is effective for handling unseen commands compared to fixed-set templates, and that our proposed NL models can effectively generalize to unseen attributes (a new ability enabled by the NL interface) as well as unseen attribute combinations. Interestingly, we discover that the simple conditional generation approach, enhanced with our proposed NL interface, is a strong baseline in those challenging settings.
    Skill-Based Reinforcement Learning with Intrinsic Reward Matching. (arXiv:2210.07426v1 [cs.LG])
    While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample-efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM is competitive with previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and enables us to utilize pretrained skills far more effectively on challenging tabletop manipulation tasks.
    Using Graph Algorithms to Pretrain Graph Completion Transformers. (arXiv:2210.07453v1 [cs.LG])
    Recent work on Graph Neural Networks has demonstrated that self-supervised pretraining can further enhance performance on downstream graph, link, and node classification tasks. However, the efficacy of pretraining tasks has not been fully investigated for downstream large knowledge graph completion tasks. Using a contextualized knowledge graph embedding approach, we investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination. We leverage the versatility of our Transformer-based model to explore graph structure generation pretraining tasks, which are typically inapplicable to most graph embedding methods. We further propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task across three downstream knowledge graph completion datasets. In a multitask setting that combines all pretraining tasks, our method surpasses some of the latest and strongest-performing knowledge graph embedding methods on all metrics for FB15K-237, on MRR and Hits@1 for WN18RR, and on MRR and Hits@10 for JF17K (a knowledge hypergraph dataset).
    ExAug: Robot-Conditioned Navigation Policies via Geometric Experience Augmentation. (arXiv:2210.07450v1 [cs.RO])
    Machine learning techniques rely on large and diverse datasets for generalization. Computer vision, natural language processing, and other applications can often reuse public datasets to train many different models. However, due to differences in physical configurations, it is challenging to leverage public datasets for training robotic control policies on new robot platforms or for new tasks. In this work, we propose a novel framework, ExAug, to augment the experiences of different robot platforms from multiple datasets in diverse environments. ExAug leverages a simple principle: by extracting 3D information in the form of a point cloud, we can create far more complex and structured augmentations, both generating synthetic images and applying geometry-aware penalization, that would have been suitable in the same situation for a different robot with a different size, turning radius, and camera placement. The trained policy is evaluated on two new robot platforms with three different cameras in indoor and outdoor environments with obstacles.
    Disentanglement of Correlated Factors via Hausdorff Factorized Support. (arXiv:2210.07347v1 [cs.LG])
    A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model's representations with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we propose a relaxed disentanglement criterion - the Hausdorff Factorized Support (HFS) criterion - that encourages a factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, in parts with over 60% relative improvement over existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization.
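    The following is a minimal sketch of how a Hausdorff-distance penalty between the support of the joint latent codes and a factorized reference support could be computed on a minibatch. The per-dimension resampling used to build the factorized reference set is an illustrative construction, not the paper's exact estimator.

        import torch

        def hausdorff(a, b):
            # Symmetric Hausdorff distance between point sets a: (n, d), b: (m, d).
            d = torch.cdist(a, b)
            return torch.max(d.min(dim=1).values.max(),
                             d.min(dim=0).values.max())

        z = torch.randn(128, 2)                      # minibatch of 2-D latents
        idx = torch.randint(0, z.size(0), (128, 2))  # resample each dim independently
        z_fact = torch.stack([z[idx[:, 0], 0], z[idx[:, 1], 1]], dim=1)
        penalty = hausdorff(z, z_fact)               # encourages a factorized support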
    A Comprehensive Study on Large-Scale Graph Training: Benchmarking and Rethinking. (arXiv:2210.07494v1 [cs.LG])
    Large-scale graph training is a notoriously challenging problem for graph neural networks (GNNs). Because the evolving graph structure is woven into the training process, vanilla GNNs usually fail to scale up, limited by GPU memory space. Although numerous scalable GNN architectures have been proposed, we still lack a comprehensive survey and fair benchmark of this reservoir of methods from which to derive a rationale for designing scalable GNNs. To this end, we first systematically formulate the representative methods of large-scale graph training into several branches and further establish a fair and consistent benchmark for them via greedy hyperparameter search. In addition, regarding efficiency, we theoretically evaluate the time and space complexity of the various branches and empirically compare them w.r.t. GPU memory usage, throughput, and convergence. Furthermore, we analyze the pros and cons of the various branches of scalable GNNs and then present a new ensembling training scheme, named EnGCN, to address the existing issues. Remarkably, our proposed method has achieved new state-of-the-art (SOTA) performance on large-scale datasets. Our code is available at https://github.com/VITA-Group/Large_Scale_GCN_Benchmarking.
    Vision Transformer Visualization: What Neurons Tell and How Neurons Behave? (arXiv:2210.07646v1 [cs.CV])
    Recently, vision transformers (ViTs) have been applied successfully to various tasks in computer vision. However, important questions, such as why they work and how they behave, still remain largely unknown. In this paper, we propose an effective visualization technique to assist us in exposing the information carried in neurons and feature embeddings across the ViT's layers. Our approach starts from the computational process of ViTs, with a focus on visualizing the local and global information in input images and the latent feature embeddings at multiple levels. Visualizations at the input and at the level-0 embeddings reveal interesting findings, such as support for why ViTs are generally robust to image occlusions and patch shuffling, and that, unlike CNNs, level-0 embeddings already carry rich semantic details. Next, we develop a rigorous framework to perform effective visualizations across layers, exposing the effects of ViT filters and their grouping/clustering behaviors with respect to object patches. Finally, we provide comprehensive experiments on real datasets to qualitatively and quantitatively demonstrate the merit of our proposed methods and our findings. Code: https://github.com/byM1902/ViT_visualization
    Efficiently Computing Local Lipschitz Constants of Neural Networks via Bound Propagation. (arXiv:2210.07394v1 [cs.LG])
    Lipschitz constants are connected to many properties of neural networks, such as robustness, fairness, and generalization. Existing methods for computing Lipschitz constants either produce relatively loose upper bounds or are limited to small networks. In this paper, we develop an efficient framework for computing the $\ell_\infty$ local Lipschitz constant of a neural network by tightly upper bounding the norm of the Clarke Jacobian via linear bound propagation. We formulate the computation of local Lipschitz constants as a linear bound propagation process on a high-order backward graph induced by the chain rule of the Clarke Jacobian. To enable linear bound propagation, we derive tight linear relaxations for specific nonlinearities in the Clarke Jacobian. This formulation unifies existing ad-hoc approaches such as RecurJac, which can be seen as a special case of ours with weaker relaxations. The bound propagation framework also allows us to easily borrow the popular Branch-and-Bound (BaB) approach from neural network verification to further tighten Lipschitz constants. Experiments show that on tiny models, our method produces bounds comparable to those of exact methods that cannot scale to slightly larger models; on larger models, our method efficiently produces tighter results than existing relaxed or naive methods, and it scales to much larger practical models that previous works could not handle. We also demonstrate an application to provable monotonicity analysis. Code is available at https://github.com/shizhouxing/Local-Lipschitz-Constants.
    CaloDVAE : Discrete Variational Autoencoders for Fast Calorimeter Shower Simulation. (arXiv:2210.07430v1 [physics.ins-det])
    Calorimeter simulation is the most computationally expensive part of Monte Carlo generation of samples necessary for analysis of experimental data at the Large Hadron Collider (LHC). The High-Luminosity upgrade of the LHC would require an even larger amount of such samples. We present a technique based on Discrete Variational Autoencoders (DVAEs) to simulate particle showers in Electromagnetic Calorimeters. We discuss how this work paves the way towards exploration of quantum annealing processors as sampling devices for generation of simulated High Energy Physics datasets.
    Hierarchical Diffusion Models for Singing Voice Neural Vocoder. (arXiv:2210.07508v1 [cs.SD])
    Recent progress in deep generative models has improved the quality of neural vocoders in the speech domain. However, it remains challenging to generate high-quality singing voice due to a wider variety of musical expressions in pitch, loudness, and pronunciation. In this work, we propose a hierarchical diffusion model for singing voice neural vocoders. The proposed method consists of multiple diffusion models operating at different sampling rates; the model at the lowest sampling rate focuses on generating accurate low-frequency components such as pitch, and the other models progressively generate the waveform at higher sampling rates based on the data at the lower sampling rates and acoustic features. Experimental results show that the proposed method produces high-quality singing voice for multiple singers, outperforming state-of-the-art neural vocoders with a similar range of computational costs.
    GLACIAL: Granger and Learning-based Causality Analysis for Longitudinal Studies. (arXiv:2210.07416v1 [cs.LG])
    The Granger framework is widely used for discovering causal relationships based on time-varying signals. Implementations of Granger causality (GC) are mostly developed for densely sampled time-series data. A substantially different setting, particularly common in population health applications, is the longitudinal study design, in which multiple individuals are followed and sparsely observed a limited number of times. Longitudinal studies commonly track many variables, which are likely governed by nonlinear dynamics that might have individual-specific idiosyncrasies and exhibit both direct and indirect causes. Furthermore, real-world longitudinal data often suffer from widespread missingness. GC methods are not well-suited to handle these issues. In this paper, we intend to fill this methodological gap. We propose to marry the GC framework with a machine-learning-based prediction model. We call our approach GLACIAL, which stands for "Granger and LeArning-based CausalIty Analysis for Longitudinal studies." GLACIAL treats individuals as independent samples and uses average prediction accuracy on held-out individuals to test for effects of causal relationships. GLACIAL employs a multi-task neural network trained with input feature dropout to efficiently learn nonlinear dynamic relationships between a large number of variables, handle missing values, and probe causal links. Extensive experiments on synthetic and real data demonstrate the utility of GLACIAL and how it can outperform competitive baselines.
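    A minimal sketch of the kind of multi-task predictor with input-feature dropout described above (layer sizes and dropout rate are illustrative assumptions): masking a candidate cause at the input and measuring the drop in hold-out prediction accuracy for a candidate effect is what probes the causal link.

        import torch
        import torch.nn as nn

        class MultiTaskPredictor(nn.Module):
            # Predicts every variable's next observation from the current ones;
            # dropout on the inputs doubles as a tool for probing causal links.
            def __init__(self, n_vars, hidden=64, p_drop=0.3):
                super().__init__()
                self.input_dropout = nn.Dropout(p_drop)
                self.net = nn.Sequential(
                    nn.Linear(n_vars, hidden), nn.ReLU(),
                    nn.Linear(hidden, n_vars),  # one output per variable
                )

            def forward(self, x):
                return self.net(self.input_dropout(x))

        model = MultiTaskPredictor(n_vars=10)
        pred = model(torch.randn(32, 10))  # 32 visits, 10 tracked variables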
    Invariance-adapted decomposition and Lasso-type contrastive learning. (arXiv:2210.07413v1 [stat.ML])
    Recent years have witnessed the effectiveness of contrastive learning in obtaining representations of a dataset that are useful for interpretation and downstream tasks. However, the mechanisms behind this effectiveness have not been thoroughly analyzed, and many studies have been conducted to investigate the data structures captured by contrastive learning. In particular, the recent study of \citet{content_isolate} has shown that contrastive learning is capable of decomposing the data space into the space that is invariant to all augmentations and its complement. In this paper, we introduce the notion of an invariance-adapted latent space that decomposes the data space into the intersections of the invariant spaces of each augmentation and their complements. This decomposition generalizes the one introduced in \citet{content_isolate} and describes a structure analogous to the frequencies in the harmonic analysis of a group. We experimentally show that contrastive learning with a lasso-type metric can be used to find an invariance-adapted latent space, thereby suggesting a new potential for contrastive learning. We also investigate when such a latent space can be identified up to mixings within each component.
    Learning to Efficiently Plan Robust Frictional Multi-Object Grasps. (arXiv:2210.07420v1 [cs.RO])
    We consider a decluttering problem where multiple rigid convex polygonal objects rest in randomly placed positions and orientations on a planar surface and must be efficiently transported to a packing box using both single and multi-object grasps. Prior work considered frictionless multi-object grasping. In this paper, we introduce friction to increase picks per hour. We train a neural network using real examples to plan robust multi-object grasps. In physical experiments, we find an 11.7% increase in success rates, a 1.7x increase in picks per hour, and an 8.2x decrease in grasp planning time compared to prior work on multi-object grasping. Videos are available at https://youtu.be/pEZpHX5FZIs.
    Learning entanglement breakdown as a phase transition by confusion. (arXiv:2202.00348v3 [quant-ph] UPDATED)
    Quantum technologies require methods for preparing and manipulating entangled multiparticle states. However, the problem of determining whether a given quantum state is entangled or separable is known to be NP-hard in general, and even the task of detecting entanglement breakdown for a given class of quantum states is difficult. In this work, we develop an approach for revealing entanglement breakdown using a machine learning technique known as 'learning by confusion'. We consider a family of quantum states parameterized such that a single critical value divides the states within this family into separable and entangled. We demonstrate that the 'learning by confusion' scheme allows us to determine this critical value. Specifically, we study the performance of the method for two-qubit, two-qutrit, and two-ququart entangled states. In addition, we investigate the properties of the local depolarization and the generalized amplitude damping channel in the framework of the confusion scheme. Within our approach, by setting the parameterization of special trajectories, we obtain an entanglement-breakdown 'phase diagram' of a quantum channel, which indicates regions of entangled (separable) states and the entanglement-breakdown region. We then extend the 'learning by confusion' scheme to recognize whether an arbitrary given state is entangled or separable. We show that the developed method provides correct answers for a variety of states, including entangled states with positive partial transpose. We also present a more practical version of the method, which is suitable for studying entanglement breakdown in noisy intermediate-scale quantum devices, and demonstrate its performance using an available cloud-based IBM quantum processor.
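    A minimal, non-quantum sketch of the 'learning by confusion' loop on toy 1-D data (the quantum-state version would replace the toy samples with measured features): classifiers are trained against deliberately "confused" labels at trial critical values, and the resulting accuracy curve is W-shaped, with its middle peak at the true transition.

        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # Toy data: the feature distribution shifts at the true critical value 0.5.
        p = np.random.rand(2000)  # control parameter for each sample
        x = np.where(p < 0.5, np.random.normal(0, 1, 2000),
                     np.random.normal(2, 1, 2000)).reshape(-1, 1)

        for pc in np.linspace(0.1, 0.9, 9):     # trial critical values
            y = (p > pc).astype(int)            # deliberately "confused" labels
            xtr, xte, ytr, yte = train_test_split(x, y, test_size=0.3)
            acc = LogisticRegression().fit(xtr, ytr).score(xte, yte)
            print(f"trial critical value {pc:.1f}: accuracy {acc:.2f}")
        # The curve is W-shaped: high at the extremes (trivial labels) and
        # peaked in the middle at the true critical value (~0.5 here).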
    Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget. (arXiv:2202.04487v2 [cs.LG] UPDATED)
    We consider the combinatorial bandits problem with semi-bandit feedback under finite sampling budget constraints, in which the learner can carry out its action only a limited number of times, specified by an overall budget. The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received. Unlike existing works, we study this problem in a non-stochastic setting with subset-dependent feedback, i.e., the semi-bandit feedback received could be generated by an oblivious adversary and might also depend on the chosen set of arms. In addition, we consider a general feedback scenario covering both the numerical and the preference-based case and introduce a sound theoretical framework for this setting, guaranteeing sensible notions of optimal arms, which a learner seeks to find. We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies, from aggressive to conservative. Theoretical questions about the sufficient and necessary budget for the algorithm to find the best arm are answered and complemented by lower bounds for any learning algorithm for this problem scenario.
    Spatiotemporal Classification with limited labels using Constrained Clustering for large datasets. (arXiv:2210.07522v1 [cs.LG])
    Creating separable representations via representation learning and clustering is critical for analyzing large unstructured datasets with only a few labels. Separable representations can lead to supervised models with better classification capabilities and can additionally aid in generating new labeled samples. Most unsupervised and semi-supervised methods for analyzing large datasets do not leverage the existing small amounts of labels to obtain better representations. In this paper, we propose a spatiotemporal clustering paradigm that uses spatial and temporal features combined with a constrained loss to produce separable representations. We demonstrate this method on the newly published ReaLSAT dataset, a dataset of surface water dynamics for over 680,000 lakes across the world, making it an essential dataset in terms of ecology and sustainability. Using this large unlabelled dataset, we first show that a spatiotemporal representation is better than a purely spatial or purely temporal representation. We then show how to learn even better representations using a constrained loss with few labels. We conclude by showing how our method, using few labels, can pick out new labeled samples from the unlabeled data, which can be used to augment supervised methods, leading to better classification.
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v1 [stat.ML])
    As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and, in certain cases, necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.
    Black-box Safety Analysis and Retraining of DNNs based on Feature Extraction and Clustering. (arXiv:2201.05077v4 [cs.SE] UPDATED)
    Deep neural networks (DNNs) have demonstrated superior performance over classical machine learning to support many features in safety-critical systems. Although DNNs are now widely used in such systems (e.g., self-driving cars), there is limited progress regarding automated support for functional safety analysis in DNN-based systems. For example, the identification of the root causes of errors, to enable both risk analysis and DNN retraining, remains an open problem. In this paper, we propose SAFE, a black-box approach to automatically characterize the root causes of DNN errors. SAFE relies on a transfer learning model pre-trained on ImageNet to extract features from error-inducing images. It then applies a density-based clustering algorithm to detect arbitrarily shaped clusters of images modeling plausible causes of error. Finally, the clusters are used to effectively retrain and improve the DNN. The black-box nature of SAFE is motivated by our objective of not requiring changes or even access to the DNN internals, to facilitate adoption. Experimental results show the superior ability of SAFE in identifying different root causes of DNN errors based on case studies in the automotive domain. It also yields significant improvements in DNN accuracy after retraining, while saving significant execution time and memory compared to alternatives.
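    A minimal sketch of the extract-then-cluster core of such a pipeline, assuming a torchvision ResNet-50 and scikit-learn's DBSCAN as stand-ins for the paper's specific backbone and clustering choices:

        import torch
        import torchvision.models as models
        import torchvision.transforms as T
        from sklearn.cluster import DBSCAN

        # ImageNet-pretrained backbone with the classifier head removed;
        # black-box with respect to the DNN under analysis.
        backbone = models.resnet50(weights="IMAGENET1K_V2")
        backbone.fc = torch.nn.Identity()
        backbone.eval()

        preprocess = T.Compose([
            T.Resize(256), T.CenterCrop(224), T.ToTensor(),
            T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
        ])

        @torch.no_grad()
        def embed(images):  # list of PIL images -> (n, 2048) features
            return backbone(torch.stack([preprocess(im) for im in images])).numpy()

        # Each density-based cluster of error-inducing images is a candidate
        # root cause and can drive targeted retraining:
        # clusters = DBSCAN(eps=8.0, min_samples=5).fit_predict(embed(error_images))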
    A Reinforcement Learning Approach to Estimating Long-term Treatment Effects. (arXiv:2210.07536v1 [cs.LG])
    Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare, and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measuring long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transitions are nonstationary, we develop a new algorithm for a class of nonstationary problems and demonstrate promising results on two synthetic datasets and one online store dataset.
    DyLoRA: Parameter Efficient Tuning of Pre-trained Models using Dynamic Search-Free Low-Rank Adaptation. (arXiv:2210.07558v1 [cs.CL])
    With the ever-growing size of pre-trained models (PMs), fine-tuning them has become more expensive and resource-hungry. As a remedy, low-rank adapters (LoRA) keep the main pre-trained weights of the model frozen and just introduce some learnable truncated SVD modules (so-called LoRA blocks) to the model. While LoRA blocks are parameter efficient, they suffer from two major problems: first, the size of these blocks is fixed and cannot be modified after training (for example, if we need to change the rank of LoRA blocks, then we need to re-train them from scratch); second, optimizing their rank requires an exhaustive search and effort. In this work, we introduce a dynamic low-rank adaptation (DyLoRA) technique to address these two problems together. Our DyLoRA method trains LoRA blocks for a range of ranks instead of a single rank by sorting out the representation learned by the adapter module at different ranks during training. We evaluate our solution on different tasks of the GLUE benchmark using the RoBERTa model. Our results show that we can train dynamic search-free models with DyLoRA at least $7\times$ faster than LoRA without significantly compromising performance. Moreover, our models can perform consistently well on a much larger range of ranks compared to LoRA.
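    A minimal sketch of the rank-sampling idea at the heart of DyLoRA, assuming a single linear layer; the scaling convention and initialization are illustrative, not the paper's exact recipe:

        import torch
        import torch.nn as nn

        class DyLoRALinear(nn.Module):
            # Frozen base weight plus a LoRA block trained over a *range* of
            # ranks: each step samples b <= r_max and uses only the first b
            # components, so any truncated rank remains usable at inference.
            def __init__(self, d_in, d_out, r_max=8, alpha=16):
                super().__init__()
                self.weight = nn.Parameter(torch.randn(d_out, d_in),
                                           requires_grad=False)
                self.A = nn.Parameter(torch.randn(r_max, d_in) * 0.01)
                self.B = nn.Parameter(torch.zeros(d_out, r_max))
                self.r_max, self.scale = r_max, alpha / r_max

            def forward(self, x, rank=None):
                b = rank or int(torch.randint(1, self.r_max + 1, (1,)))
                delta = self.B[:, :b] @ self.A[:b, :]  # truncated low-rank update
                return x @ (self.weight + self.scale * delta).t()

        layer = DyLoRALinear(32, 32)
        y_train = layer(torch.randn(4, 32))           # random rank per step
        y_infer = layer(torch.randn(4, 32), rank=4)   # fixed rank at deployment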
    Communication-Efficient Adam-Type Algorithms for Distributed Data Mining. (arXiv:2210.07454v1 [cs.LG])
    Distributed data mining is an emerging research topic for effectively and efficiently addressing hard data mining tasks with big data, which are partitioned and computed on different worker nodes instead of one centralized server. Nevertheless, distributed learning methods often suffer from a communication bottleneck when the network bandwidth is limited or the model size is large. To address this critical issue, many gradient compression methods have been proposed recently to reduce the communication cost of multiple optimization algorithms. However, the current applications of gradient compression to adaptive gradient methods, which are widely adopted because of their excellent performance in training DNNs, do not achieve the same ideal compression rate or convergence rate as Sketched-SGD. To address this limitation, in this paper, we propose a class of novel distributed Adam-type algorithms (i.e., SketchedAMSGrad) utilizing sketching, a promising compression technique that reduces the communication cost from $O(d)$ to $O(\log(d))$, where $d$ is the parameter dimension. In our theoretical analysis, we prove that our new algorithm achieves a fast convergence rate of $O(\frac{1}{\sqrt{nT}} + \frac{1}{(k/d)^2 T})$ with a communication cost of $O(k \log(d))$ at each iteration. Compared with single-machine AMSGrad, our algorithm achieves linear speedup with respect to the number of workers $n$. Experimental results on training various DNNs in a distributed paradigm validate the efficiency of our algorithms.
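    For intuition, a generic count-sketch round trip (hash buckets plus random signs) shows how a $d$-dimensional gradient can be squeezed into $k$ counters and approximately recovered; this is an illustrative stand-in, not the specific sketching operator the paper analyzes:

        import numpy as np

        def count_sketch(v, k, seed=0):
            # Hash each coordinate to one of k buckets with a random sign.
            rng = np.random.default_rng(seed)
            buckets = rng.integers(0, k, v.shape[0])
            signs = rng.choice([-1.0, 1.0], v.shape[0])
            sketch = np.zeros(k)
            np.add.at(sketch, buckets, signs * v)
            return sketch, buckets, signs

        def unsketch(sketch, buckets, signs):
            # Unbiased per-coordinate estimate read back from the buckets.
            return signs * sketch[buckets]

        g = np.random.randn(10_000)           # a worker's gradient
        s, b, sg = count_sketch(g, k=1024)    # O(k) communication instead of O(d)
        g_hat = unsketch(s, b, sg)
        print(np.corrcoef(g, g_hat)[0, 1])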
    Intra-session Context-aware Feed Recommendation in Live Systems. (arXiv:2210.07815v1 [cs.IR])
    Feed recommendation allows users to browse items continually until they feel uninterested and leave the session, which differs from traditional recommendation scenarios. Within a session, a user's decision to continue browsing or not substantially affects the occurrence of later clicks. However, this type of exposure bias is generally ignored or not explicitly modeled in most feed recommendation studies. In this paper, we model this effect as part of the intra-session context and propose a novel intra-session Context-aware Feed Recommendation (INSCAFER) framework to maximize total views and total clicks simultaneously. User click and browsing decisions are jointly learned in a multi-task setting, and the intra-session context is encoded by the session-wise exposed-item sequence. We deploy our model on Alipay, improving all key business benchmarks. Our method sheds light on feed recommendation studies that aim to optimize session-level click and view metrics.
    Similarity and Generalization: From Noise to Corruption. (arXiv:2201.12803v2 [cs.LG] UPDATED)
    Contrastive learning aims to extract distinctive features from data by finding an embedding representation where similar samples are close to each other and different ones are far apart. We study how neural networks (NNs) generalize the concept of similarity in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and online/offline correspondence. While DD examines how the network adjusts to the dataset during a long training time or as the number of parameters increases, online/offline correspondence compares network performance while varying the quality (diversity) of the dataset. We focus on the simplest contrastive learning representative: Siamese Neural Networks (SNNs). We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric, but it preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performance under SLN and PLN for an equal amount of noise, in dense datasets SLN outperforms PLN in the overparametrized region. Indeed, in this regime, PLN similarity violations become macroscopic, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise for all the scenarios considered.
    Theory and Approximate Solvers for Branched Optimal Transport with Multiple Sources. (arXiv:2210.07702v1 [cs.LG])
    Branched Optimal Transport (BOT) is a generalization of optimal transport in which transportation costs along an edge are subadditive. This subadditivity models an increase in transport efficiency when shipping mass along the same route, favoring branched transportation networks. We here study the NP-hard optimization of BOT networks connecting a finite number of sources and sinks in $\mathbb{R}^2$. First, we show how to efficiently find the best geometry of a BOT network for many sources and sinks, given a topology. Second, we argue that a topology with more than three edges meeting at a branching point is never optimal. Third, we show that the results obtained for the Euclidean plane generalize directly to optimal transportation networks on two-dimensional Riemannian manifolds. Finally, we present a simple but effective approximate BOT solver combining geometric optimization with a combinatorial optimization of the network topology.
    When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture. (arXiv:2210.07540v1 [cs.CV])
    Vision Transformers (ViTs) have recently achieved competitive performance in broad vision tasks. Unfortunately, on popular threat models, naturally trained ViTs are shown to provide no more adversarial robustness than convolutional neural networks (CNNs). Adversarial training is still required for ViTs to defend against such adversarial attacks. In this paper, we provide the first and comprehensive study on the adversarial training recipe of ViTs via extensive evaluation of various training techniques across benchmark datasets. We find that pre-training and SGD optimizer are necessary for ViTs' adversarial training. Further considering ViT as a new type of model architecture, we investigate its adversarial robustness from the perspective of its unique architectural components. We find, when randomly masking gradients from some attention blocks or masking perturbations on some patches during adversarial training, the adversarial robustness of ViTs can be remarkably improved, which may potentially open up a line of work to explore the architectural information inside the newly designed models like ViTs. Our code is available at https://github.com/mo666666/When-Adversarial-Training-Meets-Vision-Transformers.
    A Scalable Finite Difference Method for Deep Reinforcement Learning. (arXiv:2210.07487v1 [cs.LG])
    Several low-bandwidth, distributable black-box optimization algorithms have recently been shown to perform nearly as well as more refined modern methods in some Deep Reinforcement Learning domains. In this work we investigate a core problem with the use of distributed workers in such systems. Further, we investigate the dramatic differences in performance between the popular Adam gradient descent algorithm and the simplest form of stochastic gradient descent. These investigations produce a stable, low-bandwidth learning algorithm that achieves 100% usage of all connected CPUs under typical conditions.
    Predicting Fine-Tuning Performance with Probing. (arXiv:2210.07352v1 [cs.CL])
    Large NLP models have recently shown impressive performance in language understanding tasks, typically evaluated by their fine-tuned performance. Alternatively, probing has received increasing attention as a lightweight method for interpreting the intrinsic mechanisms of large NLP models. In probing, post-hoc classifiers are trained on "out-of-domain" datasets that diagnose specific abilities. While probing language models has led to insightful findings, these findings appear disjointed from the development of the models themselves. This paper explores the utility of probing deep NLP models to extract a proxy signal widely used in model development: the fine-tuning performance. We find that it is possible to use the accuracies of only three probing tests to predict fine-tuning performance with errors 40%-80% smaller than baselines. We further discuss possible avenues where probing can empower the development of deep NLP models.
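    The predictor itself can be as simple as a linear regression from the three probing accuracies to the fine-tuning score. A toy sketch with made-up numbers, purely to show the shape of the data:

        from sklearn.linear_model import LinearRegression

        probe_accs = [[0.61, 0.72, 0.55],   # one row of 3 probing accuracies
                      [0.66, 0.75, 0.60],   # per model checkpoint (toy values)
                      [0.70, 0.80, 0.64],
                      [0.74, 0.83, 0.69]]
        finetune_accs = [0.81, 0.84, 0.87, 0.90]  # toy fine-tuning scores

        predictor = LinearRegression().fit(probe_accs, finetune_accs)
        print(predictor.predict([[0.72, 0.82, 0.66]]))  # new checkpoint estimate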
    Watermarking Pre-trained Language Models with Backdooring. (arXiv:2210.07543v1 [cs.CL])
    Large pre-trained language models (PLMs) have proven to be a crucial component of modern natural language processing systems. PLMs typically need to be fine-tuned on task-specific downstream datasets, which makes it hard to claim ownership of PLMs and protect the developer's intellectual property due to the catastrophic forgetting phenomenon. We show that PLMs can be watermarked with a multi-task learning framework by embedding backdoors triggered by specific inputs defined by the owners, and that these watermarks are hard to remove even when the watermarked PLMs are fine-tuned on multiple downstream tasks. In addition to using rare words as triggers, we show that combinations of common words can also be used as backdoor triggers to avoid easy detection. Extensive experiments on multiple datasets demonstrate that the embedded watermarks can be robustly extracted with a high success rate and are little affected by follow-up fine-tuning.
    Improved automated lesion segmentation in whole-body FDG/PET-CT via Test-Time Augmentation. (arXiv:2210.07761v1 [eess.IV])
    Numerous oncology indications have extensively quantified metabolically active tumors using positron emission tomography (PET) and computed tomography (CT). 18F-fluorodeoxyglucose PET (FDG-PET) is frequently utilized in clinical practice and clinical drug research to detect and measure metabolically active malignancies. The assessment of tumor burden using manual or computer-assisted tumor segmentation in FDG-PET images is widespread, and deep learning algorithms have produced effective solutions in this area. However, there may be a need to improve the performance of a pre-trained deep learning network without the opportunity to modify the network itself. We investigate the potential benefits of test-time augmentation for segmenting tumors from PET-CT pairings. We applied a new framework of multilevel and multimodal tumor segmentation techniques that can simultaneously consider PET and CT data. In this study, we improve the network using a learnable composition of test-time augmentations. We trained U-Net and Swin UNETR on the training database to determine how different test-time augmentations improve segmentation performance. We also developed an algorithm that finds an optimal set of test-time augmentation contribution coefficients. Using the newly trained U-Net and Swin UNETR results, we defined an optimal set of coefficients for test-time augmentation and utilized them in combination with a pre-trained, fixed nnU-Net. The ultimate idea is to improve performance at test time, when the model is fixed: averaging the predictions with varying ratios on the augmented data can improve prediction accuracy. Our code will be available at https://github.com/sepidehamiri/pet_seg_unet
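    A minimal sketch of weighted test-time-augmentation averaging for a fixed segmentation model; the flip transforms and fixed weights below stand in for the learned contribution coefficients described above:

        import torch

        def tta_predict(model, volume, augs, inverses, weights):
            # Average predictions over augmented copies of the input,
            # mapping each prediction back to the original geometry.
            model.eval()
            with torch.no_grad():
                preds = [inv(model(aug(volume)))
                         for aug, inv in zip(augs, inverses)]
            w = torch.tensor(weights) / sum(weights)
            return sum(wi * p for wi, p in zip(w, preds))

        # Flips along the last spatial axis are self-inverse transforms.
        augs = [lambda x: x, lambda x: torch.flip(x, dims=[-1])]
        invs = [lambda x: x, lambda x: torch.flip(x, dims=[-1])]
        # fused = tta_predict(net, pet_ct_volume, augs, invs, weights=[0.6, 0.4])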
    LEATHER: A Framework for Learning to Generate Human-like Text in Dialogue. (arXiv:2210.07777v1 [cs.CL])
    Algorithms for text-generation in dialogue can be misguided. For example, in task-oriented settings, reinforcement learning that optimizes only task-success can lead to abysmal lexical diversity. We hypothesize this is due to poor theoretical understanding of the objectives in text-generation and their relation to the learning process (i.e., model training). To this end, we propose a new theoretical framework for learning to generate text in dialogue. Compared to existing theories of learning, our framework allows for analysis of the multi-faceted goals inherent to text-generation. We use our framework to develop theoretical guarantees for learners that adapt to unseen data. As an example, we apply our theory to study data-shift within a cooperative learning algorithm proposed for the GuessWhat?! visual dialogue game. From this insight, we propose a new algorithm, and empirically, we demonstrate our proposal improves both task-success and human-likeness of the generated text. Finally, we show statistics from our theory are empirically predictive of multiple qualities of the generated dialogue, suggesting our theory is useful for model-selection when human evaluations are not available.
    Demystifying Self-supervised Trojan Attacks. (arXiv:2210.07346v1 [cs.CR])
    As an emerging machine learning paradigm, self-supervised learning (SSL) is able to learn high-quality representations for complex data without data labels. Prior work shows that, besides obviating the reliance on labeling, SSL also benefits adversarial robustness by making it more challenging for the adversary to manipulate model predictions. However, whether this robustness benefit generalizes to other types of attacks remains an open question. We explore this question in the context of trojan attacks by showing that SSL is as vulnerable as supervised learning to trojan attacks. Specifically, we design and evaluate CTRL, an extremely simple self-supervised trojan attack. By polluting a tiny fraction of training data (less than 1%) with indistinguishable poisoning samples, CTRL causes any trigger-embedded input to be misclassified to the adversary's desired class with a high probability (over 99%) at inference. More importantly, through the lens of CTRL, we study the mechanisms underlying self-supervised trojan attacks. With both empirical and analytical evidence, we reveal that the representation invariance property of SSL, which benefits adversarial robustness, may also be the very reason making SSL highly vulnerable to trojan attacks. We further discuss the fundamental challenges of defending against self-supervised trojan attacks, pointing to promising directions for future research.
    ScionFL: Secure Quantized Aggregation for Federated Learning. (arXiv:2210.07376v1 [cs.CR])
    Privacy concerns in federated learning (FL) are commonly addressed with secure aggregation schemes that prevent a central party from observing plaintext client updates. However, most such schemes neglect orthogonal FL research that aims at reducing communication between clients and the aggregator and is instrumental in facilitating cross-device FL with thousands and even millions of (mobile) participants. In particular, quantization techniques can typically reduce client-server communication by a factor of 32. In this paper, we unite both research directions by introducing an efficient secure aggregation framework based on outsourced multi-party computation (MPC) that supports any linear quantization scheme. Specifically, we design a novel approximate version of an MPC-based secure aggregation protocol with support for multiple stochastic quantization schemes, including ones that utilize the randomized Hadamard transform and Kashin's representation. In our empirical performance evaluation, we show that with no additional overhead for clients and moderate inter-server communication, we achieve similar training accuracy as insecure schemes for standard FL benchmarks. Beyond this, we present an efficient extension to our secure quantized aggregation framework that effectively defends against state-of-the-art untargeted poisoning attacks.
    Estimation of the Sample Frechet Mean: A Convolutional Neural Network Approach. (arXiv:2210.07401v1 [cs.LG])
    This work addresses the rising demand for novel tools in statistics and machine learning for "graph-valued random variables" by proposing a fast algorithm to compute the sample Frechet mean, which replaces the concept of the sample mean for graphs (or networks). We use convolutional neural networks to learn the morphology of the graphs in a set of graphs. Our experiments on several ensembles of random graphs demonstrate that our method can reliably recover the sample Frechet mean.
    Meta-Query-Net: Resolving Purity-Informativeness Dilemma in Open-set Active Learning. (arXiv:2210.07805v1 [cs.LG])
    Unlabeled data examples awaiting annotation inevitably contain open-set noise. A few active learning studies have attempted to deal with this open-set noise during sample selection by filtering out the noisy examples. However, because focusing on the purity of the examples in a query set leads to overlooking their informativeness, the best balance between purity and informativeness remains an important question. In this paper, to solve this purity-informativeness dilemma in open-set active learning, we propose a novel Meta-Query-Net (MQ-Net) that adaptively finds the best balance between the two factors. Specifically, by leveraging the multi-round property of active learning, we train MQ-Net using a query set without an additional validation set. Furthermore, a clear dominance relationship between unlabeled examples is effectively captured by MQ-Net through a novel skyline regularization. Extensive experiments on multiple open-set active learning scenarios demonstrate that the proposed MQ-Net achieves a 20.14% improvement in accuracy compared with state-of-the-art methods.
    MTEB: Massive Text Embedding Benchmark. (arXiv:2210.07316v1 [cs.CL])
    Text embeddings are commonly evaluated on a small set of datasets from a single task not covering their possible applications to other tasks. It is unclear whether state-of-the-art embeddings on semantic textual similarity (STS) can be equally well applied to other tasks like clustering or reranking. This makes progress in the field difficult to track, as various models are constantly being proposed without proper evaluation. To solve this problem, we introduce the Massive Text Embedding Benchmark (MTEB). MTEB spans 8 embedding tasks covering a total of 56 datasets and 112 languages. Through the benchmarking of 33 models on MTEB, we establish the most comprehensive benchmark of text embeddings to date. We find that no particular text embedding method dominates across all tasks. This suggests that the field has yet to converge on a universal text embedding method and scale it up sufficiently to provide state-of-the-art results on all embedding tasks. MTEB comes with open-source code and a public leaderboard at https://huggingface.co/spaces/mteb/leaderboard.
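    Running the benchmark follows the repository's documented pattern; any model exposing an encode() method works. The task names and model below are illustrative choices, not endorsements:

        # pip install mteb sentence-transformers
        from mteb import MTEB
        from sentence_transformers import SentenceTransformer

        model = SentenceTransformer("all-MiniLM-L6-v2")  # any model with .encode()
        evaluation = MTEB(tasks=["Banking77Classification", "STSBenchmark"])
        results = evaluation.run(model, output_folder="results/all-MiniLM-L6-v2")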
    Optimal Auctions through Deep Learning: Advances in Differentiable Economics. (arXiv:1706.03459v6 [cs.GT] UPDATED)
    Designing an incentive compatible auction that maximizes expected revenue is an intricate task. The single-item case was resolved in a seminal piece of work by Myerson in 1981, but more than 40 years later a full analytical understanding of the optimal design still remains elusive for settings with two or more items. In this work, we initiate the exploration of the use of tools from deep learning for the automated design of optimal auctions. We model an auction as a multi-layer neural network, frame optimal auction design as a constrained learning problem, and show how it can be solved using standard machine learning pipelines. In addition to providing generalization bounds, we present extensive experimental results, recovering essentially all known solutions that come from the theoretical analysis of optimal auction design problems and obtaining novel mechanisms for settings in which the optimal mechanism is unknown.
    Secure Multiparty Computation for Synthetic Data Generation from Distributed Data. (arXiv:2210.07332v1 [cs.CR])
    Legal and ethical restrictions on accessing relevant data inhibit data science research in critical domains such as health, finance, and education. Synthetic data generation algorithms with privacy guarantees are emerging as a paradigm to break this data logjam. Existing approaches, however, assume that the data holders supply their raw data to a trusted curator, who uses it as fuel for synthetic data generation. This severely limits the applicability, as much of the valuable data in the world is locked up in silos, controlled by entities who cannot show their data to each other or a central aggregator without raising privacy concerns. To overcome this roadblock, we propose the first solution in which data holders only share encrypted data for differentially private synthetic data generation. Data holders send shares to servers who perform Secure Multiparty Computation (MPC) computations while the original data stays encrypted. We instantiate this idea in an MPC protocol for the Multiplicative Weights with Exponential Mechanism (MWEM) algorithm to generate synthetic data based on real data originating from many data holders without reliance on a single point of failure.
    Machine Generated Text: A Comprehensive Survey of Threat Models and Detection Methods. (arXiv:2210.07321v1 [cs.CL])
    Advances in natural language generation (NLG) have resulted in machine generated text that is increasingly difficult to distinguish from human authored text. Powerful open-source models are freely available, and user-friendly tools democratizing access to generative models are proliferating. The great potential of state-of-the-art NLG systems is tempered by the multitude of avenues for abuse. Detection of machine generated text is a key countermeasure for reducing abuse of NLG models, with significant technical challenges and numerous open problems. We provide a survey that includes both 1) an extensive analysis of threat models posed by contemporary NLG systems, and 2) the most complete review of machine generated text detection methods to date. This survey places machine generated text within its cybersecurity and social context, and provides strong guidance for future work addressing the most critical threat models, and ensuring detection systems themselves demonstrate trustworthiness through fairness, robustness, and accountability.
    Kernel-Whitening: Overcome Dataset Bias with Isotropic Sentence Embedding. (arXiv:2210.07547v1 [cs.CL])
    Dataset bias has attracted increasing attention recently for its detrimental effect on the generalization ability of fine-tuned models. The current mainstream solution is to design an additional shallow model to pre-identify biased instances. However, such two-stage methods scale up the computational complexity of the training process and obstruct valid feature information while mitigating bias. To address this issue, we utilize a representation normalization method which aims at disentangling the correlations between the features of encoded sentences. We find it also promising for eliminating the bias problem by providing an isotropic data distribution. We further propose Kernel-Whitening, a Nystrom kernel approximation method, to achieve more thorough debiasing of nonlinear spurious correlations. Our framework is end-to-end, with a time consumption similar to that of fine-tuning. Experiments show that Kernel-Whitening significantly improves the performance of BERT on out-of-distribution datasets while maintaining in-distribution accuracy.
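    For intuition, here is a plain linear (ZCA-style) whitening of sentence embeddings, the isotropy-inducing step that Kernel-Whitening extends with a Nystrom kernel approximation to reach nonlinear correlations. This linear version is a simplified stand-in, not the paper's full method:

        import numpy as np

        def whiten(embeddings, eps=1e-6):
            # Decorrelate embedding dimensions so the empirical
            # distribution is approximately isotropic.
            mu = embeddings.mean(axis=0, keepdims=True)
            x = embeddings - mu
            u, s, _ = np.linalg.svd(x.T @ x / len(x))
            W = u @ np.diag(1.0 / np.sqrt(s + eps)) @ u.T
            return x @ W, mu, W

        emb = np.random.randn(1000, 64) @ np.random.randn(64, 64)  # correlated dims
        white, mu, W = whiten(emb)
        print(np.allclose(np.cov(white.T), np.eye(64), atol=0.1))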
    Smart Headset, Computer Vision and Machine Learning for Efficient Prawn Farm Management. (arXiv:2210.07436v1 [cs.CV])
    Understanding the growth and distribution of prawns is critical for optimising feed and harvest strategies. An inadequate understanding of prawn growth can lead to reduced financial gain, for example when crops are harvested too early. The key to maintaining a good understanding of prawn growth is frequent sampling. However, the most commonly adopted sampling practice, the cast net approach, cannot sample the prawns at a high frequency, as it is expensive and laborious. An alternative approach is to sample prawns from the feed trays that farm workers inspect each day. This allows growth data collection at a high frequency (each day), but measuring prawns manually each day is a laborious task. In this article, we propose a new approach that utilises smart glasses, a depth camera, computer vision, and machine learning to detect prawn distribution and growth from feed trays. A smart headset was built to allow farmers to collect prawn data while performing daily feed tray checks. A computer vision and machine learning pipeline was developed and demonstrated to detect the growth trends of prawns in four prawn ponds over a growing season.
    COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. (arXiv:2202.11705v3 [cs.CL] UPDATED)
    Many applications of text generation require incorporating different constraints to control the semantics or style of generated text. These constraints can be hard (e.g., ensuring certain keywords are included in the output) and soft (e.g., contextualizing the output with the left- or right-hand context). In this paper, we present Energy-based Constrained Decoding with Langevin Dynamics (COLD), a decoding framework which unifies constrained generation as specifying constraints through an energy function, then performing efficient differentiable reasoning over the constraints through gradient-based sampling. COLD decoding is a flexible framework that can be applied directly to off-the-shelf left-to-right language models without the need for any task-specific fine-tuning, as demonstrated through three challenging text generation applications: lexically-constrained generation, abductive reasoning, and counterfactual reasoning. Our experiments on these constrained generation tasks point to the effectiveness of our approach, both in terms of automatic and human evaluation.
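    A minimal sketch of the Langevin-dynamics core: gradient steps on a soft token sequence with injected Gaussian noise, where the energy function would sum the language-model fluency term and the constraint terms. The toy energy below merely stands in for those terms:

        import torch

        def langevin_decode(energy_fn, length, vocab, steps=200, lr=0.1, noise=0.01):
            # y holds "soft" token logits; each step follows the energy
            # gradient plus Gaussian noise, then we discretize at the end.
            y = torch.randn(length, vocab, requires_grad=True)
            for _ in range(steps):
                energy = energy_fn(torch.softmax(y, dim=-1))
                grad, = torch.autograd.grad(energy, y)
                with torch.no_grad():
                    y -= lr * grad
                    y += noise * torch.randn_like(y)
            return y.argmax(dim=-1)

        # Toy energy preferring token 7 everywhere (stands in for LM +
        # constraint energies).
        print(langevin_decode(lambda p: -p[:, 7].sum(), length=5, vocab=50))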
    Communication-Efficient Topologies for Decentralized Learning with $O(1)$ Consensus Rate. (arXiv:2210.07881v1 [math.OC])
    Decentralized optimization is an emerging paradigm in distributed learning in which agents achieve network-wide solutions by peer-to-peer communication without a central server. Since communication tends to be slower than computation, when each agent communicates with only a few neighboring agents per iteration, they can complete iterations faster than with more agents or a central server. However, the total number of iterations to reach a network-wide solution is affected by the speed at which the agents' information is "mixed" by communication. We find that popular communication topologies either have large maximum degrees (such as stars and complete graphs) or are ineffective at mixing information (such as rings and grids). To address this problem, we propose a new family of topologies, EquiTopo, which has an (almost) constant degree and a network-size-independent consensus rate, the measure we use for mixing efficiency. In the proposed family, EquiStatic has a degree of $\Theta(\ln(n))$, where $n$ is the network size, while a series of time-dependent one-peer topologies, EquiDyn, has a constant degree of 1. We generate EquiDyn through a certain random sampling procedure. Both achieve an $n$-independent consensus rate. We apply them to decentralized SGD and decentralized gradient tracking and obtain faster communication and better convergence, both theoretically and empirically. Our code is implemented through BlueFog and is available at https://github.com/kexinjinnn/EquiTopo
    Consistent Sufficient Explanations and Minimal Local Rules for explaining regression and classification models. (arXiv:2111.04658v2 [stat.ML] UPDATED)
    To explain the decision of any model, we extend the notion of probabilistic Sufficient Explanations (P-SE). For each instance, this approach selects the minimal subset of features that is sufficient to yield the same prediction with high probability while removing other features. The crux of P-SE is to compute the conditional probability of maintaining the same prediction. Therefore, we introduce an accurate and fast estimator of this probability via Random Forests for any data $(\boldsymbol{X}, Y)$ and show its efficiency through a theoretical analysis of its consistency. As a consequence, we extend P-SE to regression problems. In addition, we deal with non-discrete features, without learning the distribution of $\boldsymbol{X}$ or needing the model for making predictions. Finally, we introduce local rule-based explanations for regression/classification based on P-SE and compare our approaches with other explainable AI methods. These methods are available as a Python package at https://github.com/salimamoukou/acv00.
    NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge. (arXiv:2109.04286v2 [cs.LG] UPDATED)
    We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that capture nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; the 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach. We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data shows that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables.
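    For reference, the NOTEARS-style acyclicity constraint that this formulation builds on can be computed as below; in NTS-NOTEARS the adjacency weights are derived from the 1D-CNN parameters, which is omitted from this sketch.

```python
# The NOTEARS acyclicity function: h(W) = tr(exp(W ∘ W)) - d is zero exactly
# when the weighted adjacency matrix W encodes a DAG, and positive otherwise.
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    d = W.shape[0]
    return np.trace(expm(W * W)) - d   # elementwise square, matrix exponential

dag = np.array([[0.0, 1.5], [0.0, 0.0]])   # edge 0 -> 1 only: acyclic
cyc = np.array([[0.0, 1.5], [0.7, 0.0]])   # edges 0 <-> 1: a cycle
print(acyclicity(dag), acyclicity(cyc))    # ~0.0 vs > 0
```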
    Reinforcement Learning with Unbiased Policy Evaluation and Linear Function Approximation. (arXiv:2210.07338v1 [cs.LG])
    We provide performance guarantees for a variant of simulation-based policy iteration for controlling Markov decision processes that involves the use of stochastic approximation algorithms along with state-of-the-art techniques that are useful for very large MDPs, including lookahead, function approximation, and gradient descent. Specifically, we analyze two algorithms. The first involves a least squares approach, where a new set of weights associated with feature vectors is obtained via least squares minimization at each iteration. The second is a two-time-scale stochastic approximation algorithm that takes several steps of gradient descent towards the least squares solution before obtaining the next iterate using a stochastic approximation algorithm.
    Post-Training Quantization for Energy Efficient Realization of Deep Neural Networks. (arXiv:2210.07906v1 [cs.LG])
    The biggest challenge for the deployment of Deep Neural Networks (DNNs) close to the generated data on edge devices is their size, i.e., memory footprint and computational complexity. Both are significantly reduced with quantization. With the resulting lower word-length, the energy efficiency of DNNs increases proportionally. However, lower word-length typically causes accuracy degradation. To counteract this effect, the quantized DNN is retrained. Unfortunately, training costs up to 5000x more energy than the inference of the quantized DNN. To address this issue, we propose a post-training quantization flow without the need for retraining. For this, we investigated different quantization options. Furthermore, our analysis systematically assesses the impact of reduced word-lengths of weights and activations, revealing a clear trend for the choice of word-length. Neither aspect has been systematically investigated so far. Our results are independent of the depth of the DNNs and apply to uniform quantization, allowing fast quantization of a given pre-trained DNN. We exceed the state of the art for 6-bit quantization by 2.2% Top-1 accuracy on ImageNet. Without retraining, our quantization to 8 bit surpasses floating-point accuracy.
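    As a minimal illustration of the design space studied here, the sketch below applies symmetric, per-tensor uniform quantization to a weight tensor at different word-lengths; the exact quantization options compared in the paper may differ.

```python
# Minimal sketch of uniform post-training quantization of a weight tensor to a
# given word-length, with no retraining. Symmetric, per-tensor scaling is one
# common choice among several.
import numpy as np

def quantize_uniform(w, bits):
    qmax = 2 ** (bits - 1) - 1              # signed integer range
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax)
    return q * scale                        # dequantized ("fake-quantized")

w = np.random.default_rng(0).normal(size=1000).astype(np.float32)
for bits in (8, 6, 4):
    err = np.abs(w - quantize_uniform(w, bits)).mean()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```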
    Exponential Convergence of Deep Operator Networks for Elliptic Partial Differential Equations. (arXiv:2112.08125v2 [math.NA] UPDATED)
    We construct and analyze approximation rates of deep operator networks (ONets) between infinite-dimensional spaces that emulate with an exponential rate of convergence the coefficient-to-solution map of elliptic second-order partial differential equations. In particular, we consider problems set in $d$-dimensional periodic domains, $d=1, 2, \dots$, and with analytic right-hand sides and coefficients. Our analysis covers linear, elliptic second order divergence-form PDEs as, e.g., diffusion-reaction problems, parametric diffusion equations, and elliptic systems such as linear isotropic elastostatics in heterogeneous materials. We leverage the exponential convergence of spectral collocation methods for boundary value problems whose solutions are analytic. In the present periodic and analytic setting, this follows from classical elliptic regularity. Within the ONet branch and trunk construction of [Chen and Chen, 1993] and of [Lu et al., 2021], we show the existence of deep ONets which emulate the coefficient-to-solution map to a desired accuracy in the $H^1$ norm, uniformly over the coefficient set. We prove that the neural networks in the ONet have size $\mathcal{O}(\left|\log(\varepsilon)\right|^\kappa)$, where $\varepsilon>0$ is the approximation accuracy, for some $\kappa>0$ depending on the physical space dimension.
    Prompt Conditioned VAE: Enhancing Generative Replay for Lifelong Learning in Task-Oriented Dialogue. (arXiv:2210.07783v1 [cs.CL])
    Lifelong learning (LL) is vital for advanced task-oriented dialogue (ToD) systems. To address the catastrophic forgetting issue of LL, generative replay methods are widely employed to consolidate past knowledge with generated pseudo samples. However, most existing generative replay methods use only a single task-specific token to control their models. This scheme is usually not strong enough to constrain the generative model due to insufficient information involved. In this paper, we propose a novel method, prompt conditioned VAE for lifelong learning (PCLL), to enhance generative replay by incorporating tasks' statistics. PCLL captures task-specific distributions with a conditional variational autoencoder, conditioned on natural language prompts to guide the pseudo-sample generation. Moreover, it leverages a distillation process to further consolidate past knowledge by alleviating the noise in pseudo samples. Experiments on natural language understanding tasks of ToD systems demonstrate that PCLL significantly outperforms competitive baselines in building LL models.
    A Lightweight Moving Target Defense Framework for Multi-purpose Malware Affecting IoT Devices. (arXiv:2210.07719v1 [cs.CR])
    Malware affecting Internet of Things (IoT) devices is rapidly growing due to the relevance of this paradigm in real-world scenarios. Specialized literature has also detected a trend towards multi-purpose malware able to execute different malicious actions such as remote control, data leakage, encryption, or code hiding, among others. Protecting IoT devices against this kind of malware is challenging due to their well-known vulnerabilities and limitations in terms of CPU, memory, and storage. To improve this situation, the moving target defense (MTD) paradigm was proposed a decade ago and has shown promising results, but there is a lack of IoT MTD solutions dealing with multi-purpose malware. Thus, this work proposes four MTD mechanisms changing IoT devices' network, data, and runtime environment to mitigate multi-purpose malware. Furthermore, it presents a lightweight and IoT-oriented MTD framework to decide what, when, and how the MTD mechanisms are deployed. Finally, the efficiency and effectiveness of the framework and MTD mechanisms are evaluated in a real-world scenario with one IoT spectrum sensor affected by multi-purpose malware.
    Accelerating RNN-based Speech Enhancement on a Multi-Core MCU with Mixed FP16-INT8 Post-Training Quantization. (arXiv:2210.07692v1 [cs.SD])
    This paper presents an optimized methodology to design and deploy Speech Enhancement (SE) algorithms based on Recurrent Neural Networks (RNNs) on a state-of-the-art MicroController Unit (MCU), with 1+8 general-purpose RISC-V cores. To achieve low-latency execution, we propose an optimized software pipeline interleaving parallel computation of LSTM or GRU recurrent blocks, featuring vectorized 8-bit integer (INT8) and 16-bit floating-point (FP16) compute units, with manually-managed memory transfers of model parameters. To ensure minimal accuracy degradation with respect to the full-precision models, we propose a novel FP16-INT8 Mixed-Precision Post-Training Quantization (PTQ) scheme that compresses the recurrent layers to 8-bit while the bit precision of the remaining layers is kept at FP16. Experiments are conducted on multiple LSTM and GRU based SE models trained on the Valentini dataset, featuring up to 1.24M parameters. Thanks to the proposed approaches, we speed up the computation by up to 4x with respect to the lossless FP16 baselines. Unlike uniform 8-bit quantization, which degrades the PESQ score by 0.3 on average, the Mixed-Precision PTQ scheme leads to a degradation of only 0.06, while achieving a 1.4-1.7x memory saving. Thanks to this compression, we cut the power cost of the external memory by fitting the large models on the limited on-chip non-volatile memory, and we gain an MCU power saving of up to 2.5x by reducing the supply voltage from 0.8V to 0.65V while still matching the real-time constraints. Our design is 10x more energy efficient than state-of-the-art SE solutions deployed on single-core MCUs that make use of smaller models and quantization-aware training.
    Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning. (arXiv:2210.07312v1 [cs.LG])
    This paper proposes an advantage estimation approach based on data augmentation for policy optimization. Unlike existing methods, which apply data augmentation to the input when learning the value and policy functions, our method uses data augmentation to compute a bootstrap advantage estimate. This Bootstrap Advantage Estimation (BAE) is then used in the gradient updates of the policy and value functions. To demonstrate the effectiveness of our approach, we conducted experiments on several environments. These environments are drawn from three benchmarks: Procgen, Deepmind Control, and Pybullet, which include both image- and vector-based observations as well as discrete and continuous action spaces. We observe that our method reduces the policy and value losses better than the Generalized Advantage Estimation (GAE) method and eventually improves cumulative return. Furthermore, our method performs better than two recently proposed data augmentation techniques (RAD and DRAC). Overall, our method empirically outperforms the baselines in sample efficiency and generalization, where the agent is tested in unseen environments.
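    Reading only from the abstract, the core idea might be sketched as follows: average TD-style advantage estimates over several augmented views of each observation. The augmentation and the way estimates are combined are illustrative guesses, not the paper's exact procedure.

```python
# Hedged sketch: averaging advantage estimates over augmented views of the
# observations, rather than augmenting inputs only for representation learning.
import torch

def td_advantage(value_net, obs, next_obs, reward, gamma=0.99):
    return reward + gamma * value_net(next_obs) - value_net(obs)

def bootstrap_advantage(value_net, obs, next_obs, reward, augment, k=4):
    views = [(augment(obs), augment(next_obs)) for _ in range(k)]
    advs = [td_advantage(value_net, o, no, reward) for o, no in views]
    return torch.stack(advs).mean(dim=0)     # average over augmented views

value_net = torch.nn.Linear(8, 1)
augment = lambda x: x + 0.05 * torch.randn_like(x)  # toy augmentation
obs, next_obs = torch.randn(32, 8), torch.randn(32, 8)
reward = torch.randn(32, 1)
print(bootstrap_advantage(value_net, obs, next_obs, reward, augment).shape)
```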
    E2R: a Hierarchical-Learning inspired Novelty-Search method to generate diverse repertoires of grasping trajectories. (arXiv:2210.07887v1 [cs.RO])
    Robotic grasping refers to the task of making a robotic system pick an object by applying forces and torques on its surface. Despite the recent advances in data-driven approaches, grasping remains an unsolved problem. Most works on this task rely on priors and heavy constraints to avoid the exploration problem. Novelty Search (NS) refers to evolutionary algorithms that replace selection of the best-performing individuals with selection of the most novel ones. Such methods have already shown promising results on hard exploration problems. In this work, we introduce a new NS-based method that can generate large datasets of grasping trajectories in a platform-agnostic manner. Inspired by the hierarchical learning paradigm, our method decouples approach and prehension to make the behavioral space smoother. Experiments conducted on 3 different robot-gripper setups and several standard objects show that our method outperforms the state of the art at generating diverse repertoires of grasping trajectories, achieving a higher ratio of successful runs as well as better diversity for both approach and prehension. Some of the generated solutions have been successfully deployed on a real robot, demonstrating the exploitability of the obtained repertoires.
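    For context, the standard Novelty Search selection signal used by such methods scores an individual by its average distance to the k nearest behavior descriptors in an archive. A minimal sketch, with a generic vector standing in for the paper's approach/prehension descriptors:

```python
# Minimal sketch of the Novelty Search score: mean distance to the k nearest
# behavior descriptors in an archive of past individuals.
import numpy as np

def novelty(descriptor, archive, k=5):
    dists = np.linalg.norm(archive - descriptor, axis=1)
    return np.sort(dists)[:k].mean()

rng = np.random.default_rng(0)
archive = rng.normal(size=(100, 6))       # past behavior descriptors
candidates = rng.normal(size=(10, 6))
scores = [novelty(c, archive) for c in candidates]
print("most novel candidate:", int(np.argmax(scores)))
```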
    A Concentration Bound for LSPE($\lambda$). (arXiv:2111.02644v4 [cs.LG] UPDATED)
    The popular LSPE($\lambda$) algorithm for policy evaluation is revisited to derive a concentration bound that gives high probability performance guarantees from some time on.
    Neural Differential Equations for Learning to Program Neural Nets Through Continuous Learning Rules. (arXiv:2206.01649v2 [cs.LG] UPDATED)
    Neural ordinary differential equations (ODEs) have attracted much attention as continuous-time counterparts of deep residual neural networks (NNs), and numerous extensions for recurrent NNs have been proposed. Since the 1980s, ODEs have also been used to derive theoretical results for NN learning rules, e.g., the famous connection between Oja's rule and principal component analysis. Such rules are typically expressed as additive iterative update processes which have straightforward ODE counterparts. Here we introduce a novel combination of learning rules and Neural ODEs to build continuous-time sequence processing nets that learn to manipulate short-term memory in rapidly changing synaptic connections of other nets. This yields continuous-time counterparts of Fast Weight Programmers and linear Transformers. Our novel models outperform the best existing Neural Controlled Differential Equation based models on various time series classification tasks, while also addressing their fundamental scalability limitations. Our code is public.
    Distributed Distributionally Robust Optimization with Non-Convex Objectives. (arXiv:2210.07588v1 [cs.LG])
    Distributionally Robust Optimization (DRO), which aims to find an optimal decision that minimizes the worst-case cost over an ambiguity set of probability distributions, has been widely applied in diverse applications, e.g., network behavior analysis, risk management, etc. However, existing DRO techniques face three key challenges: 1) how to deal with asynchronous updating in a distributed environment; 2) how to leverage the prior distribution effectively; 3) how to properly adjust the degree of robustness according to different scenarios. To this end, we propose an asynchronous distributed algorithm, named the Asynchronous Single-looP alternatIve gRadient projEction (ASPIRE) algorithm with the itErative Active SEt method (EASE), to tackle the distributed distributionally robust optimization (DDRO) problem. Furthermore, a new uncertainty set, i.e., the constrained D-norm uncertainty set, is developed to effectively leverage the prior distribution and flexibly control the degree of robustness. Finally, our theoretical analysis elucidates that the proposed algorithm is guaranteed to converge, and the iteration complexity is also analyzed. Extensive empirical studies on real-world datasets demonstrate that the proposed method not only achieves fast convergence and remains robust against data heterogeneity as well as malicious attacks, but also trades off robustness against performance.
    Towards Transformer-based Homogenization of Satellite Imagery for Landsat-8 and Sentinel-2. (arXiv:2210.07654v1 [cs.CV])
    Landsat-8 (NASA) and Sentinel-2 (ESA) are two prominent multi-spectral imaging satellite projects that provide publicly available data. The multi-spectral imaging sensors of the satellites capture images of the earth's surface in the visible and infrared region of the electromagnetic spectrum. Since the majority of the earth's surface is constantly covered with clouds, which are not transparent at these wavelengths, many images do not provide much information. To increase the temporal availability of cloud-free images of a certain area, one can combine the observations from multiple sources. However, the sensors of satellites might differ in their properties, making the images incompatible. This work provides a first glance at the possibility of using a transformer-based model to reduce the spectral and spatial differences between observations from both satellite projects. We compare the results to a model based on a fully convolutional UNet architecture. Somewhat surprisingly, we find that, while deep models outperform classical approaches, the UNet significantly outperforms the transformer in our experiments.
    An Empirical Evaluation of Multivariate Time Series Classification with Input Transformation across Different Dimensions. (arXiv:2210.07713v1 [cs.LG])
    In current research, machine and deep learning solutions for the classification of temporal data are shifting from single-channel datasets (univariate) to problems with multiple channels of information (multivariate). The majority of these works are focused on the method novelty and architecture, and the format of the input data is often treated implicitly. Particularly, multivariate datasets are often treated as a stack of univariate time series in terms of input preprocessing, with scaling methods applied across each channel separately. In this evaluation, we aim to demonstrate that the additional channel dimension is far from trivial and different approaches to scaling can lead to significantly different results in the accuracy of a solution. To that end, we test seven different data transformation methods on four different temporal dimensions and study their effect on the classification accuracy of five recent methods. We show that, for the large majority of tested datasets, the best transformation-dimension configuration leads to an increase in the accuracy compared to the result of each model with the same hyperparameters and no scaling, ranging from 0.16 to 76.79 percentage points. We also show that if we keep the transformation method constant, there is a statistically significant difference in accuracy results when applying it across different dimensions, with accuracy differences ranging from 0.23 to 47.79 percentage points. Finally, we explore the relation of the transformation methods and dimensions to the classifiers, and we conclude that there is no prominent general trend, and the optimal configuration is dataset- and classifier-specific.
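    The following snippet illustrates why the choice of dimension is non-trivial: the same z-normalization produces very different data depending on the axis it is applied across. The axis choices below are illustrative of the configurations compared in the paper.

```python
# The same z-normalization applied across different dimensions of a
# multivariate time-series tensor gives different preprocessed data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 9, 100))   # (samples, channels, time steps)

def zscore(x, axis):
    mu = x.mean(axis=axis, keepdims=True)
    sd = x.std(axis=axis, keepdims=True) + 1e-8
    return (x - mu) / sd

per_channel = zscore(X, axis=2)       # each channel scaled independently
per_timestep = zscore(X, axis=1)      # across channels at each time step
per_sample = zscore(X, axis=(1, 2))   # whole multivariate series at once
print(per_channel.std(axis=2).mean(), per_sample.std(axis=(1, 2)).mean())
```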
    Quantifying Quality of Class-Conditional Generative Models in Time-Series Domain. (arXiv:2210.07617v1 [cs.LG])
    Generative models are designed to address the data scarcity problem. Even with the exploding amount of data, due to computational advancements, some applications (e.g., health care, weather forecast, fault detection) still suffer from data insufficiency, especially in the time-series domain. Generative models are thus essential and powerful tools, but there is still no consensus on how to assess their quality. This deficiency hinders the confident application of modern implicit generative models on time-series data. Inspired by assessment methods in the image domain, we introduce the InceptionTime Score (ITS) and the Frechet InceptionTime Distance (FITD) to gauge the qualitative performance of class-conditional generative models in the time-series domain. We conduct extensive experiments on 80 different datasets to study the discriminative capabilities of the proposed metrics alongside two existing evaluation metrics: Train on Synthetic Test on Real (TSTR) and Train on Real Test on Synthetic (TRTS). Extensive evaluation reveals that the proposed assessment method, i.e., ITS and FITD in combination with TSTR, can accurately assess class-conditional generative model performance.
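    FITD follows the familiar Frechet-distance recipe from the image domain; a sketch with a random projection standing in for a pretrained InceptionTime encoder:

```python
# Frechet distance between Gaussian fits of real and generated embeddings,
# as in FID/FITD. The "encoder" is a stand-in for a pretrained InceptionTime.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feat_real, feat_gen):
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g).real
    return ((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2 * covmean)

rng = np.random.default_rng(0)
encoder = lambda x: x @ rng.normal(size=(100, 16))  # stand-in feature extractor
real = rng.normal(size=(500, 100))
fake = rng.normal(1.0, 1.0, size=(500, 100))
print(frechet_distance(encoder(real), encoder(fake)))
```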
    Revisiting Heterophily For Graph Neural Networks. (arXiv:2210.07606v1 [cs.LG])
    Graph Neural Networks (GNNs) extend basic Neural Networks (NNs) by using graph structures based on the relational inductive bias (homophily assumption). While GNNs have been commonly believed to outperform NNs in real-world tasks, recent work has identified a non-trivial set of datasets where their performance compared to NNs is not satisfactory. Heterophily has been considered the main cause of this empirical observation, and numerous works have been put forward to address it. In this paper, we first revisit the widely used homophily metrics and point out that their consideration of only graph-label consistency is a shortcoming. Then, we study heterophily from the perspective of post-aggregation node similarity and define new homophily metrics, which are potentially advantageous compared to existing ones. Based on this investigation, we prove that some harmful cases of heterophily can be effectively addressed by a local diversification operation. Then, we propose Adaptive Channel Mixing (ACM), a framework to adaptively exploit aggregation, diversification, and identity channels node-wise to extract richer localized information for diverse node heterophily situations. ACM is more powerful than the commonly used uni-channel framework for node classification tasks on heterophilic graphs and is easy to implement in baseline GNN layers. When evaluated on 10 benchmark node classification tasks, ACM-augmented baselines consistently achieve significant performance gains, exceeding state-of-the-art GNNs on most tasks without incurring significant computational burden.
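    A minimal sketch of the three-channel idea (low-pass aggregation, high-pass diversification, and identity, mixed node-wise) follows; the filter definitions use a common normalized-adjacency convention and the mixing is simplified relative to the paper.

```python
# Sketch of node-wise mixing of aggregation, diversification, and identity
# channels, in the spirit of ACM. Details are simplified assumptions.
import torch

def acm_layer(X, A_hat, W_l, W_h, W_i, mix_mlp):
    low = A_hat @ X @ W_l                                 # aggregation (low-pass)
    high = (torch.eye(A_hat.shape[0]) - A_hat) @ X @ W_h  # diversification
    iden = X @ W_i                                        # identity channel
    alpha = torch.softmax(mix_mlp(X), dim=-1)             # (num_nodes, 3)
    return alpha[:, 0:1] * low + alpha[:, 1:2] * high + alpha[:, 2:3] * iden

n, d, h = 6, 8, 16
X = torch.randn(n, d)
A = (torch.rand(n, n) < 0.3).float()
A_hat = A / A.sum(1, keepdim=True).clamp(min=1)   # row-normalized adjacency
W_l, W_h, W_i = (torch.randn(d, h) for _ in range(3))
mix_mlp = torch.nn.Linear(d, 3)
print(acm_layer(X, A_hat, W_l, W_h, W_i, mix_mlp).shape)  # (6, 16)
```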
    AutoMoE: Neural Architecture Search for Efficient Sparsely Activated Transformers. (arXiv:2210.07535v1 [cs.CL])
    Neural architecture search (NAS) has demonstrated promising results on identifying efficient Transformer architectures which outperform manually designed ones for natural language tasks like neural machine translation (NMT). Existing NAS methods operate on a space of dense architectures, where all of the sub-architecture weights are activated for every input. Motivated by the recent advances in sparsely activated models like the Mixture-of-Experts (MoE) model, we introduce sparse architectures with conditional computation into the NAS search space. Given this expressive search space which subsumes prior densely activated architectures, we develop a new framework AutoMoE to search for efficient sparsely activated sub-Transformers. AutoMoE-generated sparse models obtain (i) 3x FLOPs reduction over manually designed dense Transformers and (ii) 23% FLOPs reduction over state-of-the-art NAS-generated dense sub-Transformers with parity in BLEU score on benchmark datasets for NMT. AutoMoE consists of three training phases: (a) Heterogeneous search space design with dense and sparsely activated Transformer modules (e.g., how many experts? where to place them? what should be their sizes?); (b) SuperNet training that jointly trains several subnetworks sampled from the large search space by weight-sharing; (c) Evolutionary search for the architecture with the optimal trade-off between task performance and computational constraint like FLOPs and latency. AutoMoE code, data and trained models are available at https://github.com/microsoft/AutoMoE.
    Machine Learning in Transaction Monitoring: The Prospect of xAI. (arXiv:2210.07648v1 [cs.HC])
    Banks hold a societal responsibility and face regulatory requirements to mitigate the risk of financial crimes. Risk mitigation primarily happens through monitoring customer activity via Transaction Monitoring (TM). Recently, Machine Learning (ML) has been proposed to identify suspicious customer behavior, which raises complex socio-technical implications around trust and explainability of ML models and their outputs. However, little research is available due to the sensitivity of the domain. We aim to fill this gap by presenting empirical research exploring how ML-supported automation and augmentation affect the TM process and stakeholders' requirements for building eXplainable Artificial Intelligence (xAI). Our study finds that xAI requirements depend on the liable party in the TM process, which changes depending on whether TM is augmented or automated. Context-relatable explanations can provide much-needed support for auditing and may diminish bias in the investigator's judgement. These results suggest a use case-specific approach for xAI to adequately foster the adoption of ML in TM.
    LEAVES: Learning Views for Time-Series Data in Contrastive Learning. (arXiv:2210.07340v1 [cs.LG])
    Contrastive learning, a self-supervised learning method that can learn representations from unlabeled data, has developed promisingly. Many methods of contrastive learning depend on data augmentation techniques, which generate different views from the original signal. However, tuning policies and hyper-parameters for more effective data augmentation methods in contrastive learning is often time- and resource-consuming. Researchers have designed approaches to automatically generate new views for some input signals, especially for image data, but view learning is not well developed for time-series data. In this work, we propose a simple but effective module for automating view generation for time-series data in contrastive learning, named learning views for time-series data (LEAVES). The proposed module learns the hyper-parameters for augmentations using adversarial training in contrastive learning. We validate the effectiveness of the proposed method using multiple time-series datasets. The experiments demonstrate that the proposed method is more effective in finding reasonable views and performs downstream tasks better than the baselines, including manually tuned augmentation-based contrastive learning methods and SOTA methods.
    Continuous-in-time Limit for Bayesian Bandits. (arXiv:2210.07513v1 [math.OC])
    This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges to a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.
    Model-based Safe Deep Reinforcement Learning via a Constrained Proximal Policy Optimization Algorithm. (arXiv:2210.07573v1 [cs.LG])
    During initial iterations of training in most Reinforcement Learning (RL) algorithms, agents perform a significant number of random exploratory steps. In the real world, this can limit the practicality of these algorithms, as it can lead to potentially dangerous behavior. Hence, safe exploration is a critical issue when applying RL algorithms in the real world. This problem has been well studied recently under the Constrained Markov Decision Process (CMDP) framework, where, in addition to single-stage rewards, an agent receives single-stage costs or penalties depending on the state transitions. The prescribed cost functions are responsible for mapping undesirable behavior at any given time-step to a scalar value. The goal then is to find a feasible policy that maximizes reward returns while constraining the cost returns to be below a prescribed threshold during training as well as deployment. We propose an On-policy Model-based Safe Deep RL algorithm in which we learn the transition dynamics of the environment in an online manner and find a feasible optimal policy using Lagrangian Relaxation-based Proximal Policy Optimization. We use an ensemble of neural networks with different initializations to tackle epistemic and aleatoric uncertainty issues faced during environment model learning. We compare our approach with relevant model-free and model-based approaches in Constrained RL using the challenging Safe Reinforcement Learning benchmark, the OpenAI Safety Gym. We demonstrate that our algorithm is more sample efficient and results in lower cumulative hazard violations than constrained model-free approaches. Further, our approach shows better reward performance than other constrained model-based approaches in the literature.
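    The Lagrangian relaxation at the heart of such constrained updates can be sketched in a few lines: the policy maximizes reward minus a lambda-weighted cost, while the multiplier ascends on the constraint violation. The values and update rule below are illustrative, not the paper's exact hyper-parameters.

```python
# Sketch of the Lagrange-multiplier update used in Lagrangian-relaxation
# constrained policy optimization.
def lagrangian_step(avg_cost_return, cost_limit, lam, lam_lr=0.01):
    # dL/dlam = (expected cost return - limit); ascend to tighten when violated
    return max(0.0, lam + lam_lr * (avg_cost_return - cost_limit))

lam, cost_limit = 0.0, 25.0
for epoch_cost in [40.0, 35.0, 28.0, 24.0, 22.0]:   # toy per-epoch cost returns
    lam = lagrangian_step(epoch_cost, cost_limit, lam)
    # the epoch's policy objective would be: reward_return - lam * cost_return
    print(f"lambda = {lam:.3f}")
```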
    SQA3D: Situated Question Answering in 3D Scenes. (arXiv:2210.07474v1 [cs.CV])
    We propose a new task to benchmark scene understanding of embodied agents: Situated Question Answering in 3D Scenes (SQA3D). Given a scene context (e.g., a 3D scan), SQA3D requires the tested agent to first understand its situation (position, orientation, etc.) in the 3D scene as described by text, then reason about its surrounding environment and answer a question under that situation. Based upon 650 scenes from ScanNet, we provide a dataset centered around 6.8k unique situations, along with 20.4k descriptions and 33.4k diverse reasoning questions for these situations. These questions examine a wide spectrum of reasoning capabilities for an intelligent agent, ranging from spatial relation comprehension to commonsense understanding, navigation, and multi-hop reasoning. SQA3D poses a significant challenge to current multi-modal models, especially 3D reasoning models. We evaluate various state-of-the-art approaches and find that the best one achieves an overall score of only 47.20%, while amateur human participants can reach 90.06%. We believe SQA3D could facilitate future embodied AI research with stronger situation understanding and reasoning capability.
    Spatial-Temporal Attention Fusion Network for short-term passenger flow prediction on holidays in urban rail transit systems. (arXiv:2203.00007v3 [cs.LG] UPDATED)
    Short-term passenger flow prediction in urban rail transit systems is of great significance for traffic operation and management. Emerging deep learning-based models provide effective methods to improve prediction accuracy. However, most existing models mainly predict passenger flow on general weekdays or weekends. Only a few studies focus on predicting passenger flow on holidays, which is a significantly more challenging task for traffic management because of its suddenness and irregularity. To this end, we propose a deep learning-based model named the Spatial-Temporal Attention Fusion Network (STAFN), comprising a novel Multi-Graph Attention Network, a Conv-Attention Block, and a Feature Fusion Block, for short-term passenger flow prediction on holidays. The Multi-Graph Attention Network is applied to dynamically extract the complex spatial dependencies of passenger flow, and the Conv-Attention Block is applied to extract the temporal dependencies of passenger flow from global and local perspectives. Moreover, in addition to historical passenger flow data, social media data, which have been shown to effectively reflect the evolution of passenger flow under events, are also fused into the Feature Fusion Block of STAFN. STAFN is tested on two large-scale urban rail transit AFC datasets from China covering the New Year holiday, and its prediction performance is compared with that of several conventional prediction models. Results demonstrate its robustness and advantages over benchmark methods, providing strong support for practical applications of short-term passenger flow prediction on holidays.
    Reducing Action Space: Reference-Model-Assisted Deep Reinforcement Learning for Inverter-based Volt-Var Control. (arXiv:2210.07360v1 [eess.SY])
    Reference-model-assisted deep reinforcement learning (DRL) for inverter-based Volt-Var Control (IB-VVC) in active distribution networks is proposed. We observe that a large action space increases the learning difficulty of DRL and degrades the optimization performance in the process of generating data and training neural networks. To reduce the action space of DRL, we design a reference-model-assisted DRL approach. We introduce definitions of the reference model, reference-model-based optimization, and reference actions. The reference-model-assisted DRL learns the residual actions between the reference actions and optimal actions, rather than learning the optimal actions directly. Since the residual actions are considerably smaller than the optimal actions for a reference model, we can design a smaller action space for the reference-model-assisted DRL, which reduces the learning difficulty and improves optimization performance. It is noteworthy that the reference-model-assisted DRL approach is compatible with any policy-gradient DRL algorithm for continuous action problems. This work takes the soft actor-critic algorithm as an example and designs a reference-model-assisted soft actor-critic algorithm. Simulations show that 1) a large action space degrades the performance of DRL throughout the training stage, and 2) reference-model-assisted DRL requires fewer iterations and achieves better optimization performance.
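    A sketch of the residual-action idea follows, with a placeholder linear controller standing in for the reference model (e.g., a conventional Volt-Var optimization); scales and names are illustrative assumptions.

```python
# Residual-action wrapper: the DRL policy outputs a small correction around a
# reference controller's action, shrinking the effective action space.
import numpy as np

def reference_action(state):
    return -0.5 * state            # placeholder reference model

def apply_action(state, residual, residual_scale=0.1):
    # The DRL policy learns `residual` in [-1, 1]; the executed action stays
    # close to the reference action.
    return reference_action(state) + residual_scale * residual

state = np.array([0.8, -0.3])
residual = np.array([0.2, -0.1])   # sampled from the DRL policy
print(apply_action(state, residual))
```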
    Finding Islands of Predictability in Action Forecasting. (arXiv:2210.07354v1 [cs.CV])
    We address dense action forecasting: the problem of predicting future action sequences over long durations based on partial observation. Our key insight is that future action sequences are more accurately modeled with variable, rather than fixed, levels of abstraction, and that the optimal level of abstraction can be dynamically selected during the prediction process. Our experiments show that most parts of future action sequences can be predicted confidently in fine detail only in small segments of future frames, which are effectively ``islands'' of high model prediction confidence in a ``sea'' of uncertainty. We propose a combined Bayesian neural network and hierarchical convolutional segmentation model to both accurately predict future actions and optimally select abstraction levels. We evaluate this approach on standard datasets against existing state-of-the-art systems and demonstrate that our ``islands of predictability'' approach maintains fine-grained action predictions while also making accurate abstract predictions where systems were previously unable to do so, resulting in substantial, monotonic increases in accuracy.
    When is Offline Two-Player Zero-Sum Markov Game Solvable?. (arXiv:2201.03522v2 [cs.LG] UPDATED)
    We study what dataset assumption permits solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: dataset with uniform concentration assumption and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.
    HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning. (arXiv:2210.07323v1 [cs.CL])
    While the Turkish language is listed among low-resource languages, the literature on Turkish automatic speech recognition (ASR) is relatively old. In this paper, we present HuBERT-TR, a speech representation model for Turkish based on HuBERT. HuBERT-TR achieves state-of-the-art results on several Turkish ASR datasets. We investigate pre-training HuBERT for Turkish with large-scale data curated from online resources. We pre-train HuBERT-TR using over 6,500 hours of speech data curated from YouTube that includes extensive variability in terms of quality and genre. We show that pre-trained models within a multi-lingual setup are inferior to language-specific models, where our Turkish model HuBERT-TR base performs better than its 10x larger multi-lingual counterpart XLS-R-1B. Moreover, we study the effect of scaling on ASR performance by scaling our models up to 1B parameters. Our best model yields a state-of-the-art word error rate of 4.97% on the Turkish Broadcast News dataset. Models are available at huggingface.co/asafaya .
    Diversified Recommendations for Agents with Adaptive Preferences. (arXiv:2210.07773v1 [cs.IR])
    When an Agent visits a platform recommending a menu of content to select from, their choice of item depends not only on fixed preferences, but also on their prior engagements with the platform. The Recommender's primary objective is typically to encourage content consumption which optimizes some reward, such as ad revenue, but they often also aim to ensure that a wide variety of content is consumed by the Agent over time. We formalize this problem as an adversarial bandit task. At each step, the Recommender presents a menu of $k$ (out of $n$) items to the Agent, who selects one item in the menu according to their unknown preference model, which maps their history of past items to relative selection probabilities. The Recommender then observes the Agent's chosen item and receives bandit feedback of the item's reward. In addition to optimizing reward from selected items, the Recommender must also ensure that the total distribution of chosen items has sufficiently high entropy. We define a class of preference models which are locally learnable, i.e. behavior over the entire domain can be estimated by only observing behavior in a small region; this includes models representable by bounded-degree polynomials as well as functions with a sparse Fourier basis. For this class, we give an algorithm for the Recommender which obtains $\tilde{O}(T^{3/4})$ regret against all item distributions satisfying two conditions: they are sufficiently diversified, and they are instantaneously realizable at any history by some distribution over menus. We show that these conditions are closely connected: all sufficiently high-entropy distributions are instantaneously realizable at any item history. We also give a set of negative results justifying our assumptions, in the form of a runtime lower bound for non-local learning and linear regret lower bounds for alternate benchmarks.
    A Dual Control Variate for doubly stochastic optimization and black-box variational inference. (arXiv:2210.07290v1 [cs.LG])
    In this paper, we aim to reduce the variance of doubly stochastic optimization, a type of stochastic optimization algorithm that contains two independent sources of randomness: the subsampling of training data and the Monte Carlo estimation of expectations. Such an optimization regime often suffers from large gradient variance, which leads to a slow rate of convergence. We therefore propose the Dual Control Variate, a new type of control variate capable of reducing gradient variance from both sources jointly. The dual control variate is built upon approximation-based control variates and incremental gradient methods. We show that on doubly stochastic optimization problems, compared with past variance reduction approaches that take only one source of randomness into account, the dual control variate leads to a gradient estimator of significantly smaller variance and demonstrates superior performance on real-world applications, such as generalized linear models with dropout and black-box variational inference.  ( 2 min )
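    The generic control-variate construction underlying the method can be sketched as follows: given a noisy unbiased estimate g and a correlated surrogate h with known expectation, g - c(h - E[h]) is unbiased with lower variance for a well-chosen c. The paper's contribution is a surrogate covering both randomness sources jointly; the toy quantities below are illustrative.

```python
# Toy control-variate estimator: mean is unchanged, variance drops.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
g = rng.normal(3.0, 2.0, size=n)        # noisy, unbiased gradient samples
h = g + rng.normal(0.0, 0.5, size=n)    # correlated surrogate with E[h] = 3.0
c = np.cov(g, h)[0, 1] / h.var()        # variance-optimal coefficient
g_cv = g - c * (h - 3.0)                # control-variate estimator
print(g.var(), g_cv.var())              # variance drops substantially
print(g.mean(), g_cv.mean())            # mean is (approximately) unchanged
```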
    Topics in Deep Learning and Optimization Algorithms for IoT Applications in Smart Transportation. (arXiv:2210.07246v1 [cs.LG])
    Nowadays, the Internet of Things (IoT) has become one of the most important technologies enabling a variety of connected and intelligent applications in smart cities. The smart decision-making process of IoT devices not only relies on the large volume of data collected from their sensors, but also depends on advanced optimization theories and novel machine learning technologies which can process and analyse the collected data in specific network structures. Therefore, it becomes practically important to investigate how different optimization algorithms and machine learning techniques can be leveraged to improve system performance. As one of the most important vertical domains for IoT applications, the smart transportation system has played a key role in providing real-world information and services to citizens by making their access to transport facilities easier, and thus it is one of the key application areas explored in this thesis. In a nutshell, this thesis covers three key topics related to applying mathematical optimization and deep learning methods to IoT networks. In the first topic, we propose an optimal transmission frequency management scheme using a decentralized ADMM-based method in an IoT network and introduce a mechanism to identify anomalies in data transmission frequency using an LSTM-based architecture. In the second topic, we leverage graph neural networks (GNNs) for demand prediction for shared bikes. In particular, we introduce a novel architecture, the attention-based spatial temporal graph convolutional network (AST-GCN), to improve the prediction accuracy on real-world datasets. In the last topic, we consider a highway traffic network scenario where frequent lane-changing behaviors may occur with some probability. A specific GNN-based anomaly detector is devised to reveal such a probability, driven by data collected in a dedicated mobility simulator.  ( 3 min )
    AMP: Automatically Finding Model Parallel Strategies with Heterogeneity Awareness. (arXiv:2210.07297v1 [cs.LG])
    Scaling up model sizes can lead to fundamentally new capabilities in many machine learning (ML) tasks. However, training big models requires strong distributed system expertise to carefully design model-parallel execution strategies that suit the model architectures and cluster setups. In this paper, we develop AMP, a framework that automatically derives such strategies. AMP identifies a valid space of model parallelism strategies and efficiently searches the space for high-performing strategies by leveraging a cost model designed to capture the heterogeneity of the model and cluster specifications. Unlike existing methods, AMP is specifically tailored to support complex models composed of uneven layers and cluster setups with more heterogeneous accelerators and bandwidth. We evaluate AMP on popular models and cluster setups from public clouds and show that AMP returns parallel strategies that match the expert-tuned strategies on typical cluster setups. On heterogeneous clusters or models with heterogeneous architectures, AMP finds strategies with 1.54x and 1.77x higher throughput than state-of-the-art model-parallel systems, respectively.  ( 2 min )
    Joint Reasoning on Hybrid-knowledge sources for Task-Oriented Dialog. (arXiv:2210.07295v1 [cs.CL])
    Traditional systems designed for task-oriented dialog utilize knowledge present only in structured knowledge sources to generate responses. However, relevant information required to generate responses may also reside in unstructured sources, such as documents. Recent state-of-the-art models such as HyKnow and SeKnow, aimed at overcoming these challenges, make limiting assumptions about the knowledge sources. For instance, these systems assume that certain types of information, such as a phone number, are always present in a structured KB, while information about aspects such as entrance ticket prices is always available in documents. In this paper, we create a modified version of the MultiWOZ-based dataset prepared by SeKnow to demonstrate how current methods suffer significant degradation in performance when strict assumptions about the source of information are removed. Then, in line with recent work exploiting pre-trained language models, we fine-tune a BART-based model using prompts for the tasks of querying knowledge sources as well as response generation, without making assumptions about the information present in each knowledge source. Through a series of experiments, we demonstrate that our model is robust to perturbations to knowledge modality (source of information), and that it can fuse information from structured as well as unstructured knowledge to generate responses.  ( 2 min )
    Meta-Uncertainty in Bayesian Model Comparison. (arXiv:2210.07278v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but, when derived from a finite number of observations, are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.  ( 2 min )
    A Large-Scale Annotated Multivariate Time Series Aviation Maintenance Dataset from the NGAFID. (arXiv:2210.07317v1 [cs.LG])
    This paper presents the largest publicly available, non-simulated, fleet-wide aircraft flight recording and maintenance log data for use in predicting part failure and maintenance need. We present 31,177 hours of flight data across 28,935 flights, which occur relative to 2,111 unplanned maintenance events clustered into 36 types of maintenance issues. Flights are annotated as before or after maintenance, with some flights occurring on the day of maintenance. Collecting data to evaluate predictive maintenance systems is challenging because it is difficult, dangerous, and unethical to generate data from compromised aircraft. To overcome this, we use the National General Aviation Flight Information Database (NGAFID), which contains flights recorded during regular operation of aircraft, and maintenance logs to construct a part failure dataset. We use a novel framing of Remaining Useful Life (RUL) prediction and consider the probability that the RUL of a part is greater than 2 days. Unlike previous datasets generated with simulations or in laboratory settings, the NGAFID Aviation Maintenance Dataset contains real flight records and maintenance logs from different seasons, weather conditions, pilots, and flight patterns. Additionally, we provide Python code to easily download the dataset and a Colab environment to reproduce our benchmarks on three different models. Our dataset presents a difficult challenge for machine learning researchers and a valuable opportunity to test and develop prognostic health management methods.  ( 3 min )
    SHINE: SubHypergraph Inductive Neural nEtwork. (arXiv:2210.07309v1 [cs.LG])
    Hypergraph neural networks can model multi-way connections among nodes of the graphs, which are common in real-world applications such as genetic medicine. In particular, genetic pathways or gene sets encode molecular functions driven by multiple genes, naturally represented as hyperedges. Thus, hypergraph-guided embedding can capture functional relations in learned representations. Existing hypergraph neural network models often focus on node-level or graph-level inference. There is an unmet need for learning powerful representations of subgraphs of hypergraphs in real-world applications. For example, a cancer patient can be viewed as a subgraph of genes harboring mutations in the patient, while all the genes are connected by hyperedges that correspond to pathways representing specific molecular functions. For accurate inductive subgraph prediction, we propose SubHypergraph Inductive Neural nEtwork (SHINE). SHINE uses informative genetic pathways that encode molecular functions as hyperedges to connect genes as nodes. SHINE jointly optimizes the objectives of end-to-end subgraph classification and hypergraph nodes' similarity regularization. SHINE simultaneously learns representations for both genes and pathways using strongly dual attention message passing. The learned representations are aggregated via a subgraph attention layer and used to train a multilayer perceptron for inductive subgraph inference. We evaluated SHINE against a wide array of state-of-the-art (hyper)graph neural networks, XGBoost, NMF and polygenic risk score models, using large scale NGS and curated datasets. SHINE outperformed all comparison models significantly, and yielded interpretable disease models with functional insights.  ( 3 min )
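    A minimal sketch of the kind of hypergraph message passing such models build on: node features are aggregated into hyperedges (pathways) and scattered back to nodes (genes) through the incidence matrix. SHINE's dual attention and subgraph-attention readout are omitted here.

```python
# Two-stage hypergraph message passing: node -> hyperedge -> node, using the
# incidence matrix H. Shapes and initialization are illustrative.
import torch

num_genes, num_pathways, d = 12, 4, 8
H = (torch.rand(num_genes, num_pathways) < 0.3).float()  # incidence matrix
X = torch.randn(num_genes, d)                            # gene features

def hypergraph_layer(X, H, W):
    edge_deg = H.sum(0, keepdim=True).clamp(min=1)
    node_deg = H.sum(1, keepdim=True).clamp(min=1)
    E = (H.t() @ X) / edge_deg.t()   # node -> hyperedge aggregation
    X_new = (H @ E) / node_deg       # hyperedge -> node scatter
    return torch.relu(X_new @ W)

W = torch.randn(d, d)
print(hypergraph_layer(X, H, W).shape)   # (12, 8)
```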
    The Hidden Uniform Cluster Prior in Self-Supervised Learning. (arXiv:2210.07277v1 [cs.LG])
    A successful paradigm in representation learning is to perform self-supervised pretraining using tasks based on mini-batch statistics (e.g., SimCLR, VICReg, SwAV, MSN). We show that implicit in the formulation of all these methods is an overlooked prior to learn features that enable uniform clustering of the data. While this prior has led to remarkably semantic representations when pretraining on class-balanced data, such as ImageNet, we demonstrate that it can hamper performance when pretraining on class-imbalanced data. By moving away from conventional uniformity priors and instead preferring power-law distributed feature clusters, we show that one can improve the quality of the learned representations on real-world class-imbalanced datasets. To demonstrate this, we develop an extension of the Masked Siamese Networks (MSN) method to support the use of arbitrary feature priors.  ( 2 min )
    Tumor-location-guided CNNs for Pediatric Low-grade Glioma Molecular Biomarker Classification Using MRI. (arXiv:2210.07287v1 [cs.CV])
    Pediatric low-grade glioma (pLGG) is the most common type of brain cancer among children, and the identification of molecular markers for pLGG is crucial for successful treatment planning. The current standard of care is biopsy, which is invasive. Thus, non-invasive imaging-based approaches, where Machine Learning (ML) has high potential, are impactful. Recently, we developed a tumor-location-based algorithm and demonstrated its potential to differentiate pLGG molecular subtypes. In this work, we first reevaluated the performance of the location-based algorithm on a larger pLGG dataset, which includes 214 patients, and achieved an area under the receiver operating characteristic curve (AUROC) of 77.90. A Convolutional Neural Network (CNN) based algorithm increased the average AUROC to 86.11. Ultimately, we designed and implemented a tumor-location-guided CNN algorithm and achieved an average AUROC of 88.64. Using a repeated experiment approach with 100 runs, we ensured the results were reproducible and the improvement was statistically significant.  ( 2 min )
    Deep Reinforcement Learning-based Rebalancing Policies for Profit Maximization of Relay Nodes in Payment Channel Networks. (arXiv:2210.07302v1 [cs.DC])
    Payment channel networks (PCNs) are a layer-2 blockchain scalability solution, with its main entity, the payment channel, enabling transactions between pairs of nodes "off-chain," thus reducing the burden on the layer-1 network. Nodes with multiple channels can serve as relays for multihop payments over a path of channels: they relay payments of others by providing the liquidity of their channels, in exchange for part of the amount withheld as a fee. Relay nodes might after a while end up with one or more unbalanced channels, and thus need to trigger a rebalancing operation. In this paper, we study how a relay node can maximize its profits from fees by using the rebalancing method of submarine swaps. We introduce a stochastic model to capture the dynamics of a relay node observing random transaction arrivals and performing occasional rebalancing operations, and express the system evolution as a Markov Decision Process. We formulate the problem of the maximization of the node's fortune over time over all rebalancing policies, and approximate the optimal solution by designing a Deep Reinforcement Learning (DRL)-based rebalancing policy. We build a discrete event simulator of the system and use it to demonstrate the DRL policy's superior performance under most conditions by conducting a comparative study of different policies and parameterizations. In all, our approach aims to be the first to introduce DRL for network optimization in the complex world of PCNs.  ( 3 min )
    Scalable Stochastic Parametric Verification with Stochastic Variational Smoothed Model Checking. (arXiv:2205.05398v2 [cs.LG] UPDATED)
    Parametric verification of linear temporal properties for stochastic models can be expressed as computing the satisfaction probability of a certain property as a function of the parameters of the model. Smoothed model checking (smMC) aims at inferring the satisfaction function over the entire parameter space from a limited set of observations obtained via simulation. As observations are costly and noisy, smMC is framed as a Bayesian inference problem so that the estimates have an additional quantification of the uncertainty. The original smMC approach uses Gaussian Processes (GP), inferred by means of the Expectation Propagation algorithm. This approach provides accurate reconstructions with statistically sound quantification of the uncertainty. However, it inherits the well-known scalability issues of GP. In this paper, we exploit recent advances in probabilistic machine learning to overcome this limitation, making Bayesian inference of smMC scalable to larger datasets and enabling its application to models with high-dimensional parameter spaces. We propose Stochastic Variational Smoothed Model Checking (SV-smMC), a solution that exploits stochastic variational inference (SVI) to approximate the posterior distribution of the smMC problem. The strength and flexibility of SVI make SV-smMC applicable to two alternative probabilistic models: Gaussian Processes (GP) and Bayesian Neural Networks (BNN). The core ingredient of SVI is a stochastic gradient-based optimization that makes inference easily parallelizable and enables GPU acceleration. In this paper, we compare the performance of smMC against that of SV-smMC by looking at the scalability, the computational efficiency and the accuracy of the reconstructed satisfaction function.  ( 3 min )
    Marginalized particle Gibbs for multiple state-space models coupled through shared parameters. (arXiv:2210.07379v1 [stat.ME])
    We consider Bayesian inference from multiple time series described by a common state-space model (SSM) structure, but where different subsets of parameters are shared between different submodels. An important example is disease-dynamics, where parameters can be either disease or location specific. Parameter inference in these models can be improved by systematically aggregating information from the different time series, most notably for short series. Particle Gibbs (PG) samplers are an efficient class of algorithms for inference in SSMs, in particular when conjugacy can be exploited to marginalize out model parameters from the state update. We present two different PG samplers that marginalize static model parameters on-the-fly: one that updates one model at a time conditioned on the datasets for the other models, and one that concurrently updates all models by stacking them into a high-dimensional SSM. The distinctive features of each sampler make them suitable for different modelling contexts. We provide insights on when each sampler should be used and show that they can be combined to form an efficient PG sampler for a model with strong dependencies between states and parameters. The performance is illustrated on two linear-Gaussian examples and on a real-world example on the spread of mosquito-borne diseases.  ( 3 min )
    Cumulo: A Dataset for Learning Cloud Classes. (arXiv:1911.04227v3 [physics.ao-ph] UPDATED)
    One of the greatest sources of uncertainty in future climate projections comes from limitations in modelling clouds and in understanding how different cloud types interact with the climate system. A key first step in reducing this uncertainty is to accurately classify cloud types at high spatial and temporal resolution. In this paper, we introduce Cumulo, a benchmark dataset for training and evaluating global cloud classification models. It consists of one year of 1km resolution MODIS hyperspectral imagery merged with pixel-width 'tracks' of CloudSat cloud labels. Bringing these complementary datasets together is a crucial first step, enabling the Machine-Learning community to develop innovative new techniques which could greatly benefit the Climate community. To showcase Cumulo, we provide baseline performance analysis using an invertible flow generative model (IResNet), which further allows us to discover new sub-classes for a given cloud class by exploring the latent space. To compare methods, we introduce a set of evaluation criteria, to identify models that are not only accurate, but also physically-realistic. Cumulo can be downloaded from https://www.dropbox.com/sh/i3s9q2v2jjyk2it/AACxXnXfMF5wuIqLXqH4NJOra?dl=0 .  ( 3 min )
    Probable Domain Generalization via Quantile Risk Minimization. (arXiv:2207.09944v2 [stat.ML] UPDATED)
    Domain generalization (DG) seeks predictors which perform well on unseen test distributions by leveraging data drawn from multiple related training distributions or domains. To achieve this, DG is commonly formulated as an average- or worst-case problem over the set of possible domains. However, predictors that perform well on average lack robustness while predictors that perform well in the worst case tend to be overly-conservative. To address this, we propose a new probabilistic framework for DG where the goal is to learn predictors that perform well with high probability. Our key idea is that distribution shifts seen during training should inform us of probable shifts at test time, which we realize by explicitly relating training and test domains as draws from the same underlying meta-distribution. To achieve probable DG, we propose a new optimization problem called Quantile Risk Minimization (QRM). By minimizing the $\alpha$-quantile of a predictor's risk distribution over domains, QRM seeks predictors that perform well with probability $\alpha$. To solve QRM in practice, we propose the Empirical QRM (EQRM) algorithm, and prove: (i) a generalization bound for EQRM; and (ii) that EQRM recovers the causal predictor as $\alpha \to 1$. In our experiments, we introduce a more holistic quantile-focused evaluation protocol for DG, and demonstrate that EQRM outperforms state-of-the-art baselines on CMNIST and several datasets from WILDS and DomainBed.
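    A minimal sketch of the empirical quantile-of-risks idea: compute per-domain risks and backpropagate through their $\alpha$-quantile. The model, toy domains, and $\alpha$ below are illustrative assumptions, not the paper's full EQRM estimator.

```python
# Sketch of empirical quantile risk minimization over domains.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(10, 1)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
alpha = 0.9
# Toy domains: (X, y) pairs with slightly shifted input distributions.
domains = [(torch.randn(64, 10) + 0.1 * d, torch.randn(64, 1)) for d in range(5)]

for step in range(100):
    risks = torch.stack([
        torch.nn.functional.mse_loss(model(X), y) for X, y in domains
    ])
    loss = torch.quantile(risks, alpha)   # alpha-quantile of the risk distribution
    opt.zero_grad(); loss.backward(); opt.step()
```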
    Estimation of High-Dimensional Markov-Switching VAR Models with an Approximate EM Algorithm. (arXiv:2210.07456v1 [stat.ME])
    Regime shifts in high-dimensional time series arise naturally in many applications, from neuroimaging to finance. This problem has received considerable attention in low-dimensional settings, with both Bayesian and frequentist methods used extensively for parameter estimation. The EM algorithm is a particularly popular strategy for parameter estimation in low-dimensional settings, although the statistical properties of the resulting estimates have not been well understood. Furthermore, its extension to high-dimensional time series has proved challenging. To overcome these challenges, in this paper we propose an approximate EM algorithm for Markov-switching VAR models that leads to efficient computation and also facilitates the investigation of asymptotic properties of the resulting parameter estimates. We establish the consistency of the proposed EM algorithm in high dimensions and investigate its performance via simulation studies.
    Efficiently Controlling Multiple Risks with Pareto Testing. (arXiv:2210.07913v1 [cs.LG])
    Machine learning applications frequently come with multiple diverse objectives and constraints that can change over time. Accordingly, trained models can be tuned with sets of hyper-parameters that affect their predictive behavior (e.g., their run-time efficiency versus error rate). As the number of constraints and hyper-parameter dimensions grow, naively selected settings may lead to sub-optimal and/or unreliable results. We develop an efficient method for calibrating models such that their predictions provably satisfy multiple explicit and simultaneous statistical guarantees (e.g., upper-bounded error rates), while also optimizing any number of additional, unconstrained objectives (e.g., total run-time cost). Building on recent results in distribution-free, finite-sample risk control for general losses, we propose Pareto Testing: a two-stage process which combines multi-objective optimization with multiple hypothesis testing. The optimization stage constructs a set of promising combinations on the Pareto frontier. We then apply statistical testing to this frontier only to identify configurations that have (i) high utility with respect to our objectives, and (ii) guaranteed risk levels with respect to our constraints, with specifiable high probability. We demonstrate the effectiveness of our approach to reliably accelerate the execution of large-scale Transformer models in natural language processing (NLP) applications. In particular, we show how Pareto Testing can be used to dynamically configure multiple inter-dependent model attributes -- including the number of layers computed before exiting, number of attention heads pruned, or number of text tokens considered -- to simultaneously control and optimize various accuracy and cost metrics.
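    The first stage of Pareto Testing constructs the set of non-dominated candidate configurations; a minimal sketch of that filtering step follows. Objectives are assumed to be "lower is better", the candidate values are illustrative, and the second-stage hypothesis testing is omitted.

```python
# Filter candidate hyper-parameter configurations to the empirical Pareto frontier.
import numpy as np

def pareto_frontier(objectives: np.ndarray) -> np.ndarray:
    """Boolean mask of non-dominated rows (lower is better on every column)."""
    n = objectives.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        # i is dominated if some row is <= on all objectives and < on at least one.
        dominated = np.all(objectives <= objectives[i], axis=1) & \
                    np.any(objectives < objectives[i], axis=1)
        mask[i] = not dominated.any()
    return mask

rng = np.random.default_rng(0)
candidates = rng.uniform(size=(100, 2))   # (error rate, run-time cost) per config
frontier = candidates[pareto_frontier(candidates)]
# The testing stage would then apply multiple hypothesis testing along `frontier`.
print(len(frontier), "non-dominated configurations")
```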
    Privacy-Preserving and Lossless Distributed Estimation of High-Dimensional Generalized Additive Mixed Models. (arXiv:2210.07723v1 [stat.ML])
    Various privacy-preserving frameworks that respect the individual's privacy in the analysis of data have been developed in recent years. However, available model classes such as simple statistics or generalized linear models lack the flexibility required for a good approximation of the underlying data-generating process in practice. In this paper, we propose an algorithm for a distributed, privacy-preserving, and lossless estimation of generalized additive mixed models (GAMM) using component-wise gradient boosting (CWB). Making use of CWB allows us to reframe the GAMM estimation as a distributed fitting of base learners using the $L_2$-loss. In order to account for the heterogeneity of different data location sites, we propose a distributed version of a row-wise tensor product that allows the computation of site-specific (smooth) effects. Our adaptation of CWB preserves all the important properties of the original algorithm, such as an unbiased feature selection and the feasibility to fit models in high-dimensional feature spaces, and yields model estimates equivalent to those of CWB on pooled data. In addition to deriving the equivalence of the two algorithms, we showcase the efficacy of our algorithm on a distributed heart disease data set and compare it with state-of-the-art methods.
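    For readers unfamiliar with CWB, here is a minimal sketch of component-wise gradient boosting with the $L_2$-loss: at each iteration, every single-feature base learner is fit to the current residuals and only the best one is added, shrunk by a learning rate. The distributed, site-specific aspects of the paper are omitted, and all settings are illustrative.

```python
# Minimal component-wise gradient boosting (L2 loss, linear base learners).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = 2.0 * X[:, 0] - 1.0 * X[:, 3] + 0.1 * rng.normal(size=300)

coef = np.zeros(8)
nu = 0.1                                 # learning rate / shrinkage
for _ in range(500):
    resid = y - X @ coef                 # negative gradient of the L2 loss
    # Fit each univariate least-squares base learner to the residuals.
    betas = (X * resid[:, None]).sum(axis=0) / (X ** 2).sum(axis=0)
    sse = ((resid[:, None] - X * betas[None, :]) ** 2).sum(axis=0)
    j = sse.argmin()                     # component with the best fit wins
    coef[j] += nu * betas[j]

print(np.round(coef, 2))                 # sparse: mass on features 0 and 3
```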
    Similarity and Generalization: From Noise to Corruption. (arXiv:2201.12803v2 [cs.LG] UPDATED)
    Contrastive learning aims to extract distinctive features from data by finding an embedding representation where similar samples are close to each other, and different ones are far apart. We study how NNs generalize the concept of similarity in the presence of noise, investigating two phenomena: Double Descent (DD) behavior and online/offline correspondence. While DD examines how the network adjusts to the dataset during a long training time or by increasing the number of parameters, online/offline correspondence compares the network's performance as the quality (diversity) of the dataset varies. We focus on the simplest contrastive learning representative: Siamese Neural Networks (SNNs). We point out that SNNs can be affected by two distinct sources of noise: Pair Label Noise (PLN) and Single Label Noise (SLN). The effect of SLN is asymmetric, but it preserves similarity relations, while PLN is symmetric but breaks transitivity. We find that DD also appears in SNNs and is exacerbated by noise. We show that the dataset topology crucially affects generalization. While sparse datasets show the same performance under SLN and PLN for an equal amount of noise, SLN outperforms PLN in the overparametrized region in dense datasets. Indeed, in this regime, the similarity violations induced by PLN become macroscopic, corrupting the dataset to the point where complete overfitting cannot be achieved. We call this phenomenon Density-Induced Break of Similarity (DIBS). Probing the equivalence between online optimization and offline generalization in SNNs, we find that their correspondence breaks down in the presence of label noise for all the scenarios considered.
    Invariance-adapted decomposition and Lasso-type contrastive learning. (arXiv:2210.07413v1 [stat.ML])
    Recent years have witnessed the effectiveness of contrastive learning in obtaining representations of data that are useful in interpretation and downstream tasks. However, the mechanisms underlying this effectiveness have not been thoroughly analyzed, and many studies have been conducted to investigate the data structures captured by contrastive learning. In particular, the recent study of \citet{content_isolate} has shown that contrastive learning is capable of decomposing the data space into the space that is invariant to all augmentations and its complement. In this paper, we introduce the notion of an invariance-adapted latent space that decomposes the data space into the intersections of the invariant spaces of each augmentation and their complements. This decomposition generalizes the one introduced in \citet{content_isolate}, and describes a structure that is analogous to the frequencies in the harmonic analysis of a group. We experimentally show that contrastive learning with a lasso-type metric can be used to find an invariance-adapted latent space, thereby suggesting new potential for contrastive learning. We also investigate when such a latent space can be identified up to mixings within each component.
    When is Offline Two-Player Zero-Sum Markov Game Solvable?. (arXiv:2201.03522v2 [cs.LG] UPDATED)
    We study what dataset assumption permits solving offline two-player zero-sum Markov games. In stark contrast to the offline single-agent Markov decision process, we show that the single strategy concentration assumption is insufficient for learning the Nash equilibrium (NE) strategy in offline two-player zero-sum Markov games. On the other hand, we propose a new assumption named unilateral concentration and design a pessimism-type algorithm that is provably efficient under this assumption. In addition, we show that the unilateral concentration assumption is necessary for learning an NE strategy. Furthermore, our algorithm can achieve minimax sample complexity without any modification for two widely studied settings: dataset with uniform concentration assumption and turn-based Markov games. Our work serves as an important initial step towards understanding offline multi-agent reinforcement learning.
    Estimation of the Sample Frechet Mean: A Convolutional Neural Network Approach. (arXiv:2210.07401v1 [cs.LG])
    This work addresses the rising demand for novel tools in statistics and machine learning for "graph-valued random variables" by proposing a fast algorithm to compute the sample Frechet mean, which replaces the concept of the sample mean for graphs (or networks). We use convolutional neural networks to learn the morphology of the graphs in a set of graphs. Our experiments on several ensembles of random graphs demonstrate that our method can reliably recover the sample Frechet mean.  ( 2 min )
    A Dual Control Variate for doubly stochastic optimization and black-box variational inference. (arXiv:2210.07290v1 [cs.LG])
    In this paper, we aim at reducing the variance of doubly stochastic optimization, a type of stochastic optimization algorithm that contains two independent sources of randomness: the subsampling of training data and the Monte Carlo estimation of expectations. Such an optimization regime often suffers from large gradient variance, which leads to a slow rate of convergence. We therefore propose the Dual Control Variate, a new type of control variate capable of reducing gradient variance from both sources jointly. The dual control variate is built upon approximation-based control variates and incremental gradient methods. We show that, on doubly stochastic optimization problems, compared with past variance reduction approaches that take only one source of randomness into account, the dual control variate leads to a gradient estimator of significantly smaller variance and demonstrates superior performance on real-world applications, like generalized linear models with dropout and black-box variational inference.  ( 2 min )
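    To illustrate the general control-variate idea the paper builds on (not its dual construction), here is a minimal Monte Carlo example: reduce the variance of an estimator of E[f(X)] using a correlated h(X) with known mean. The target function and sample size are illustrative.

```python
# Variance reduction with a control variate: estimate E[exp(X)] for X ~ N(0,1),
# using h(X) = X (known mean 0) as the control variate.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100_000)
f = np.exp(x)                            # E[exp(X)] = exp(0.5)
h = x                                    # control variate, E[h] = 0

c = -np.cov(f, h)[0, 1] / np.var(h)      # (approximately) optimal coefficient
plain = f.mean()
controlled = (f + c * h).mean()          # same expectation, lower variance

print(f"plain:      {plain:.4f}  (var {f.var():.4f})")
print(f"controlled: {controlled:.4f}  (var {(f + c * h).var():.4f})")
print(f"true:       {np.exp(0.5):.4f}")
```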
    Tunable Complexity Benchmarks for Evaluating Physics-Informed Neural Networks on Coupled Ordinary Differential Equations. (arXiv:2210.07880v1 [stat.ML])
    In this work, we assess the ability of physics-informed neural networks (PINNs) to solve increasingly-complex coupled ordinary differential equations (ODEs). We focus on a pair of benchmarks: discretized partial differential equations and harmonic oscillators, each of which has a tunable parameter that controls its complexity. Even by varying network architecture and applying a state-of-the-art training method that accounts for "difficult" training regions, we show that PINNs eventually fail to produce correct solutions to these benchmarks as their complexity -- the number of equations and the size of the time domain -- increases. We identify several reasons why this may be the case, including insufficient network capacity, poor conditioning of the ODEs, and high local curvature, as measured by the Laplacian of the PINN loss.
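    For concreteness, here is a minimal PINN sketch for a single harmonic oscillator, $x'' = -\omega^2 x$, one of the benchmark families above. The architecture, $\omega$, initial conditions, and collocation scheme are illustrative choices, not the paper's benchmark configuration.

```python
# Minimal PINN for x'' = -omega^2 x with x(0) = 1, x'(0) = 0 on t in [0, 1].
import torch

torch.manual_seed(0)
omega = 2.0
net = torch.nn.Sequential(
    torch.nn.Linear(1, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 64), torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    t = torch.rand(256, 1, requires_grad=True)     # random collocation points
    x = net(t)
    dx = torch.autograd.grad(x, t, torch.ones_like(x), create_graph=True)[0]
    d2x = torch.autograd.grad(dx, t, torch.ones_like(dx), create_graph=True)[0]
    residual = d2x + omega ** 2 * x                # ODE residual at collocation points
    t0 = torch.zeros(1, 1, requires_grad=True)
    x0 = net(t0)
    dx0 = torch.autograd.grad(x0, t0, torch.ones_like(x0), create_graph=True)[0]
    loss = residual.pow(2).mean() + (x0 - 1).pow(2).mean() + dx0.pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
```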
    Commutativity and Disentanglement from the Manifold Perspective. (arXiv:2210.07857v1 [stat.ML])
    In this paper, we interpret disentanglement from the manifold perspective and trace how it naturally leads to a necessary and sufficient condition for disentanglement: the disentangled factors must commute with each other. Along the way, we show how some technical results have consequences for the compression and disentanglement of generative models, and we also discuss the practical and theoretical implications of commutativity. Finally, we conclude with a discussion of related approaches to disentanglement and how they relate to our view of disentanglement from the manifold perspective.
    Disentanglement of Correlated Factors via Hausdorff Factorized Support. (arXiv:2210.07347v1 [cs.LG])
    A grand goal in deep learning research is to learn representations capable of generalizing across distribution shifts. Disentanglement is one promising direction aimed at aligning a model's representations with the underlying factors generating the data (e.g. color or background). Existing disentanglement methods, however, rely on an often unrealistic assumption: that factors are statistically independent. In reality, factors (like object color and shape) are correlated. To address this limitation, we propose a relaxed disentanglement criterion - the Hausdorff Factorized Support (HFS) criterion - that encourages a factorized support, rather than a factorial distribution, by minimizing a Hausdorff distance. This allows for arbitrary distributions of the factors over their support, including correlations between them. We show that the use of HFS consistently facilitates disentanglement and recovery of ground-truth factors across a variety of correlation settings and benchmarks, even under severe training correlations and correlation shifts, with relative improvements of over 60% in some settings compared to existing disentanglement methods. In addition, we find that leveraging HFS for representation learning can even facilitate transfer to downstream tasks such as classification under distribution shifts. We hope our original approach and positive empirical results inspire further progress on the open problem of robust generalization.  ( 2 min )
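    A minimal sketch of the central quantity: the Hausdorff distance between two empirical supports, here used to compare the joint support of latent codes with the support of the product of their marginals (approximated by permuting one coordinate). The sampling scheme and dimensions are illustrative, not the paper's HFS estimator.

```python
# Symmetric Hausdorff distance between point sets, applied to latent supports.
import numpy as np

def hausdorff(A: np.ndarray, B: np.ndarray) -> float:
    """Symmetric Hausdorff distance between point sets A and B."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)  # pairwise distances
    return max(d.min(axis=1).max(), d.min(axis=0).max())

rng = np.random.default_rng(0)
z = rng.uniform(size=(512, 2))                 # latent codes (factor 1, factor 2)
# Approximate the product-of-marginals support by permuting one coordinate.
z_product = np.stack([rng.permutation(z[:, 0]), z[:, 1]], axis=1)
print(hausdorff(z, z_product))                 # small => approximately factorized support
```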
    CaloDVAE : Discrete Variational Autoencoders for Fast Calorimeter Shower Simulation. (arXiv:2210.07430v1 [physics.ins-det])
    Calorimeter simulation is the most computationally expensive part of Monte Carlo generation of samples necessary for analysis of experimental data at the Large Hadron Collider (LHC). The High-Luminosity upgrade of the LHC would require an even larger amount of such samples. We present a technique based on Discrete Variational Autoencoders (DVAEs) to simulate particle showers in Electromagnetic Calorimeters. We discuss how this work paves the way towards exploration of quantum annealing processors as sampling devices for generation of simulated High Energy Physics datasets.
    Anomaly detection in dynamic networks. (arXiv:2210.07407v1 [cs.SI])
    Detecting anomalies from a series of temporal networks has many applications, including road accidents in transport networks and suspicious events in social networks. While there are many methods for network anomaly detection, statistical methods are underutilised in this space even though they have a long history and proven capability in handling temporal dependencies. In this paper, we introduce \textit{oddnet}, a feature-based network anomaly detection method that uses time series methods to model temporal dependencies. We demonstrate the effectiveness of oddnet on synthetic and real-world datasets. The R package oddnet implements this algorithm.
    Discrete Optimal Transport with Independent Marginals is #P-Hard. (arXiv:2203.01161v2 [math.OC] UPDATED)
    We study the computational complexity of the optimal transport problem that evaluates the Wasserstein distance between the distributions of two K-dimensional discrete random vectors. The best known algorithms for this problem run in polynomial time in the maximum of the number of atoms of the two distributions. However, if the components of either random vector are independent, then this number can be exponential in K even though the size of the problem description scales linearly with K. We prove that the described optimal transport problem is #P-hard even if all components of the first random vector are independent uniform Bernoulli random variables, while the second random vector has merely two atoms, and even if only approximate solutions are sought. We also develop a dynamic programming-type algorithm that approximates the Wasserstein distance in pseudo-polynomial time when the components of the first random vector follow arbitrary independent discrete distributions, and we identify special problem instances that can be solved exactly in strongly polynomial time.
    Why Robust Generalization in Deep Learning is Difficult: Perspective of Expressive Power. (arXiv:2205.13863v3 [cs.LG] UPDATED)
    It is well-known that modern neural networks are vulnerable to adversarial examples. To mitigate this problem, a series of robust learning algorithms have been proposed. However, although the robust training error can be near zero via some methods, all existing algorithms lead to a high robust generalization error. In this paper, we provide a theoretical understanding of this puzzling phenomenon from the perspective of expressive power for deep neural networks. Specifically, for binary classification problems with well-separated data, we show that, for ReLU networks, while mild over-parameterization is sufficient for high robust training accuracy, there exists a constant robust generalization gap unless the size of the neural network is exponential in the data dimension $d$. This result holds even if the data is linearly separable (which means achieving standard generalization is easy), and more generally for any parameterized function classes as long as their VC dimension is at most polynomial in the number of parameters. Moreover, we establish an improved upper bound of $\exp({\mathcal{O}}(k))$ for the network size to achieve low robust generalization error when the data lies on a manifold with intrinsic dimension $k$ ($k \ll d$). Nonetheless, we also have a lower bound that grows exponentially with respect to $k$ -- the curse of dimensionality is inevitable. By demonstrating an exponential separation between the network size for achieving low robust training and generalization error, our results reveal that the hardness of robust generalization may stem from the expressive power of practical models.
    Ergodic variational flows. (arXiv:2205.07475v2 [stat.ML] UPDATED)
    This work presents a new class of variational family -- ergodic variational flows -- that not only enables tractable i.i.d. sampling and density evaluation, but also comes with MCMC-like convergence guarantees. Ergodic variational flows consist of a mixture of repeated applications of a measure-preserving and ergodic map to an initial reference distribution. We provide mild conditions under which the variational distribution converges weakly and in total variation to the target as the number of steps in the flow increases; this convergence holds regardless of the value of variational parameters, though different parameter values may result in faster or slower convergence. We develop a practical implementation of the flow family using Hamiltonian dynamics combined with deterministic momentum refreshment, including a tunable step size to optimize the trade-off between simulation fidelity and computational cost. Simulated and real data experiments provide an empirical verification of the convergence theory, and demonstrate that the method provides more reliable posterior approximations than several black-box normalizing flows, as well as samples of comparable quality to those obtained from state-of-the-art MCMC methods.
    Geometric Scattering on Measure Spaces. (arXiv:2208.08561v2 [stat.ML] UPDATED)
    The scattering transform is a multilayered, wavelet-based transform initially introduced as a model of convolutional neural networks (CNNs) that has played a foundational role in our understanding of these networks' stability and invariance properties. Subsequently, there has been widespread interest in extending the success of CNNs to data sets with non-Euclidean structure, such as graphs and manifolds, leading to the emerging field of geometric deep learning. In order to improve our understanding of the architectures used in this new field, several papers have proposed generalizations of the scattering transform for non-Euclidean data structures such as undirected graphs and compact Riemannian manifolds without boundary. In this paper, we introduce a general, unified model for geometric scattering on measure spaces. Our proposed framework includes previous work on geometric scattering as special cases but also applies to more general settings such as directed graphs, signed graphs, and manifolds with boundary. We propose a new criterion that identifies to which groups a useful representation should be invariant and show that this criterion is sufficient to guarantee that the scattering transform has desirable stability and invariance properties. Additionally, we consider finite measure spaces that are obtained from randomly sampling an unknown manifold. We propose two methods for constructing a data-driven graph on which the associated graph scattering transform approximates the scattering transform on the underlying manifold. Moreover, we use a diffusion-maps based approach to prove quantitative estimates on the rate of convergence of one of these approximations as the number of sample points tends to infinity. Lastly, we showcase the utility of our method on spherical images, directed graphs, and on high-dimensional single-cell data.
    Markov Chain Score Ascent: A Unifying Framework of Variational Inference with Markovian Gradients. (arXiv:2206.06295v4 [cs.LG] UPDATED)
    Minimizing the inclusive Kullback-Leibler (KL) divergence with stochastic gradient descent (SGD) is challenging since its gradient is defined as an integral over the posterior. Recently, multiple methods have been proposed to run SGD with biased gradient estimates obtained from a Markov chain. This paper provides the first non-asymptotic convergence analysis of these methods by establishing their mixing rate and gradient variance. To do this, we demonstrate that these methods, which we collectively refer to as Markov chain score ascent (MCSA) methods, can be cast as special cases of the Markov chain gradient descent framework. Furthermore, by leveraging this new understanding, we develop a novel MCSA scheme, parallel MCSA (pMCSA), that achieves a tighter bound on the gradient variance. We demonstrate that this improved theoretical result translates to superior empirical performance.
    Meta-Uncertainty in Bayesian Model Comparison. (arXiv:2210.07278v1 [stat.ML])
    Bayesian model comparison (BMC) offers a principled probabilistic approach to study and rank competing models. In standard BMC, we construct a discrete probability distribution over the set of possible models, conditional on the observed data of interest. These posterior model probabilities (PMPs) are measures of uncertainty, but, when derived from a finite number of observations, are also uncertain themselves. In this paper, we conceptualize distinct levels of uncertainty which arise in BMC. We explore a fully probabilistic framework for quantifying meta-uncertainty, resulting in an applied method to enhance any BMC workflow. Drawing on both Bayesian and frequentist techniques, we represent the uncertainty over the uncertain PMPs via meta-models which combine simulated and observed data into a predictive distribution for PMPs on new data. We demonstrate the utility of the proposed method in the context of conjugate Bayesian regression, likelihood-based inference with Markov chain Monte Carlo, and simulation-based inference with neural networks.
    PAC Prediction Sets for Meta-Learning. (arXiv:2207.02440v2 [cs.LG] UPDATED)
    Uncertainty quantification is a key component of machine learning models targeted at safety-critical systems such as in healthcare or autonomous vehicles. We study this problem in the context of meta learning, where the goal is to quickly adapt a predictor to new tasks. In particular, we propose a novel algorithm to construct \emph{PAC prediction sets}, which capture uncertainty via sets of labels, that can be adapted to new tasks with only a few training examples. These prediction sets satisfy an extension of the typical PAC guarantee to the meta learning setting; in particular, the PAC guarantee holds with high probability over future tasks. We demonstrate the efficacy of our approach on four datasets across three application domains: mini-ImageNet and CIFAR10-C in the visual domain, FewRel in the language domain, and the CDC Heart Dataset in the medical domain. In particular, our prediction sets satisfy the PAC guarantee while having smaller size compared to other baselines that also satisfy this guarantee.
    B\'ezier Gaussian Processes for Tall and Wide Data. (arXiv:2209.00343v2 [stat.ML] UPDATED)
    Modern approximations to Gaussian processes are suitable for "tall data", with a cost that scales well in the number of observations, but under-perform on "wide data", scaling poorly in the number of input features. That is, as the number of input features grows, good predictive performance requires the number of summarising variables, and their associated cost, to grow rapidly. We introduce a kernel that allows the number of summarising variables to grow exponentially with the number of input features, but requires only linear cost in both number of observations and input features. This scaling is achieved through our introduction of the B\'ezier buttress, which allows approximate inference without computing matrix inverses or determinants. We show that our kernel has close similarities to some of the most used kernels in Gaussian process regression, and empirically demonstrate the kernel's ability to scale to both tall and wide datasets.
    Towards Learning Universal Hyperparameter Optimizers with Transformers. (arXiv:2205.13320v2 [cs.LG] UPDATED)
    Meta-learning hyperparameter optimization (HPO) algorithms from prior experiments is a promising approach to improve optimization efficiency over objective functions from a similar distribution. However, existing methods are restricted to learning from experiments sharing the same set of hyperparameters. In this paper, we introduce the OptFormer, the first text-based Transformer HPO framework that provides a universal end-to-end interface for jointly learning policy and function prediction when trained on vast tuning data from the wild, such as Google's Vizier database, one of the world's largest HPO datasets. Our extensive experiments demonstrate that the OptFormer can simultaneously imitate at least 7 different HPO algorithms, which can be further improved via its function uncertainty estimates. Compared to a Gaussian Process, the OptFormer also learns a robust prior distribution for hyperparameter response functions, and can thereby provide more accurate and better calibrated predictions. This work paves the path to future extensions for training a Transformer-based model as a general HPO optimizer.
    Feature Learning in $L_{2}$-regularized DNNs: Attraction/Repulsion and Sparsity. (arXiv:2205.15809v2 [stat.ML] UPDATED)
    We study the loss surface of DNNs with $L_{2}$ regularization. We show that the loss in terms of the parameters can be reformulated into a loss in terms of the layerwise activations $Z_{\ell}$ of the training set. This reformulation reveals the dynamics behind feature learning: each hidden representation $Z_{\ell}$ is optimal with respect to an attraction/repulsion problem and interpolates between the input and output representations, keeping as little information from the input as necessary to construct the activation of the next layer. For positively homogeneous non-linearities, the loss can be further reformulated in terms of the covariances of the hidden representations, which takes the form of a partially convex optimization over a convex cone. This second reformulation allows us to prove a sparsity result for homogeneous DNNs: any local minimum of the $L_{2}$-regularized loss can be achieved with at most $N(N+1)$ neurons in each hidden layer (where $N$ is the size of the training set). We show that this bound is tight by giving an example of a local minimum that requires $N^{2}/4$ hidden neurons. But we also observe numerically that in more traditional settings much less than $N^{2}$ neurons are required to reach the minima.
    A Variational Perspective on Generative Flow Networks. (arXiv:2210.07992v1 [stat.ML])
    Generative flow networks (GFNs) are a class of models for sequential sampling of composite objects, which approximate a target distribution that is defined in terms of an energy function or a reward. GFNs are typically trained using a flow matching or trajectory balance objective, which matches forward and backward transition models over trajectories. In this work, we define variational objectives for GFNs in terms of the Kullback-Leibler (KL) divergences between the forward and backward distribution. We show that variational inference in GFNs is equivalent to minimizing the trajectory balance objective when sampling trajectories from the forward model. We generalize this approach by optimizing a convex combination of the reverse and forward KL divergences. This insight suggests variational inference methods can serve as a means to define a more general family of objectives for training generative flow networks, for example by incorporating control variates, which are commonly used in variational inference, to reduce the variance of the gradients of the trajectory balance objective. We evaluate our findings and the performance of the proposed variational objective numerically by comparing it to the trajectory balance objective on two synthetic tasks.
    Posterior Collapse of a Linear Latent Variable Model. (arXiv:2205.04009v2 [cs.LG] UPDATED)
    This work identifies the existence and cause of a type of posterior collapse that frequently occurs in the Bayesian deep learning practice. For a general linear latent variable model that includes linear variational autoencoders as a special case, we precisely identify the nature of posterior collapse to be the competition between the likelihood and the regularization of the mean due to the prior. Our result suggests that posterior collapse may be related to neural collapse and dimensional collapse and could be a subclass of a general problem of learning for deeper architectures.
    Efficient Approximations of the Fisher Matrix in Neural Networks using Kronecker Product Singular Value Decomposition. (arXiv:2201.10285v6 [cs.NE] UPDATED)
    Several studies have shown the ability of natural gradient descent to minimize the objective function more efficiently than ordinary gradient descent based methods. However, the bottleneck of this approach for training deep neural networks lies in the prohibitive cost of solving a large dense linear system corresponding to the Fisher Information Matrix (FIM) at each iteration. This has motivated various approximations of either the exact FIM or the empirical one. The most sophisticated of these is KFAC, which involves a Kronecker-factored block diagonal approximation of the FIM. With only a slight additional cost, we propose a few improvements of KFAC from the standpoint of accuracy. The common feature of the four novel methods is that they rely on a direct minimization problem, the solution of which can be computed via the Kronecker product singular value decomposition technique. Experimental results on the three standard deep auto-encoder benchmarks show that they provide more accurate approximations to the FIM. Furthermore, they outperform KFAC and state-of-the-art first-order methods in terms of optimization speed.
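    To illustrate the underlying Kronecker product SVD idea (the classical Van Loan-Pitsianis construction, not the paper's specific methods): the nearest Kronecker product to a matrix $A$ in Frobenius norm is found by a rank-1 SVD of a block rearrangement of $A$. Shapes and data below are illustrative.

```python
# Nearest Kronecker product via rank-1 SVD of the rearranged matrix.
import numpy as np

def nearest_kronecker(A, m1, n1, m2, n2):
    """Best Frobenius approximation of A ((m1*m2) x (n1*n2)) by kron(B, C)."""
    # Each row of R is the flattened (i, j) block of A, of size m2 x n2.
    R = np.stack([
        A[i*m2:(i+1)*m2, j*n2:(j+1)*n2].reshape(-1)
        for i in range(m1) for j in range(n1)
    ])
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    B = (np.sqrt(s[0]) * U[:, 0]).reshape(m1, n1)
    C = (np.sqrt(s[0]) * Vt[0]).reshape(m2, n2)
    return B, C

rng = np.random.default_rng(0)
A = np.kron(rng.normal(size=(3, 4)), rng.normal(size=(5, 2)))
A += 0.01 * rng.normal(size=A.shape)           # slightly perturbed Kronecker matrix
B, C = nearest_kronecker(A, 3, 4, 5, 2)
print(np.linalg.norm(A - np.kron(B, C)))       # small residual
```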
    Covariate-informed Representation Learning to Prevent Posterior Collapse of iVAE. (arXiv:2202.04206v3 [stat.ML] UPDATED)
    The recently proposed identifiable variational autoencoder (iVAE) framework provides a promising approach for learning latent independent components (ICs). iVAEs use auxiliary covariates to build an identifiable generation structure from covariates to ICs to observations, and the posterior network approximates ICs given observations and covariates. Though the identifiability is appealing, we show that iVAEs can have local minimum solutions where the observations and the approximated ICs are independent given the covariates, a phenomenon we refer to as the posterior collapse problem of iVAEs. To overcome this problem, we develop a new approach, covariate-informed iVAE (CI-iVAE), by considering a mixture of encoder and posterior distributions in the objective function. In doing so, the objective function prevents posterior collapse, resulting in latent representations that contain more information about the observations. Furthermore, CI-iVAEs extend the original iVAE objective function to a larger class and find the optimal one among them, thus having tighter evidence lower bounds than the original iVAE. Experiments on simulation datasets, EMNIST, Fashion-MNIST, and a large-scale brain imaging dataset demonstrate the effectiveness of our new method.
    DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps. (arXiv:2206.00927v3 [cs.LG] UPDATED)
    Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a $4\sim 16\times$ speedup compared with previous state-of-the-art training-free samplers on various datasets.
    NTS-NOTEARS: Learning Nonparametric DBNs With Prior Knowledge. (arXiv:2109.04286v2 [cs.LG] UPDATED)
    We describe NTS-NOTEARS, a score-based structure learning method for time-series data to learn dynamic Bayesian networks (DBNs) that captures nonlinear, lagged (inter-slice) and instantaneous (intra-slice) relations among variables. NTS-NOTEARS utilizes 1D convolutional neural networks (CNNs) to model the dependence of child variables on their parents; 1D CNN is a neural function approximation model well-suited for sequential data. DBN-CNN structure learning is formulated as a continuous optimization problem with an acyclicity constraint, following the NOTEARS DAG learning approach. We show how prior knowledge of dependencies (e.g., forbidden and required edges) can be included as additional optimization constraints. Empirical evaluation on simulated and benchmark data shows that NTS-NOTEARS achieves state-of-the-art DAG structure quality compared to both parametric and nonparametric baseline methods, with improvement in the range of 10-20% on the F1-score. We also evaluate NTS-NOTEARS on complex real-world data acquired from professional ice hockey games that contain a mixture of continuous and discrete variables.
    Bayesian Regularization on Function Spaces via Q-Exponential Process. (arXiv:2210.07987v1 [stat.ME])
    Regularization is one of the most important topics in optimization, statistics and machine learning. To get sparsity in estimating a parameter $u\in\mathbb{R}^d$, an $\ell_q$ penalty term, $\Vert u\Vert_q$, is usually added to the objective function. What is the probabilistic distribution corresponding to such an $\ell_q$ penalty? What is the correct stochastic process corresponding to $\Vert u\Vert_q$ when we model functions $u\in L^q$? This is important for statistically modeling large dimensional objects, e.g. images, with penalties that preserve certain properties, e.g. edges in the image. In this work, we generalize the $q$-exponential distribution (with density proportional to $\exp(-\frac{1}{2}|u|^q)$) to a stochastic process named the \emph{$Q$-exponential (Q-EP) process} that corresponds to the $L_q$ regularization of functions. The key step is to specify consistent multivariate $q$-exponential distributions by choosing from a large family of elliptic contour distributions. The work is closely related to the Besov process, which is usually defined via a series expansion. Q-EP can be regarded as a definition of the Besov process with an explicit probabilistic formulation and direct control on the correlation length. From the Bayesian perspective, Q-EP provides a flexible prior on functions with a sharper penalty ($q<2$) than the commonly used Gaussian process (GP). We compare GP, Besov and Q-EP in modeling time series and reconstructing images and demonstrate the advantage of the proposed methodology.
    Nonasymptotic estimates for Stochastic Gradient Langevin Dynamics under local conditions in nonconvex optimization. (arXiv:1910.02008v5 [math.ST] UPDATED)
    In this paper, we are concerned with a non-asymptotic analysis of sampling algorithms used in nonconvex optimization. In particular, we obtain non-asymptotic estimates in Wasserstein-1 and Wasserstein-2 distances for a popular class of algorithms called Stochastic Gradient Langevin Dynamics (SGLD). In addition, the aforementioned Wasserstein-2 convergence result can be applied to establish a non-asymptotic error bound for the expected excess risk. Crucially, these results are obtained under a local Lipschitz condition and a local dissipativity condition where we remove the uniform dependence in the data stream. We illustrate the importance of this relaxation by presenting examples from variational inference and from index tracking optimization.
    Sarcasm Detection using Hybrid Neural Network. (arXiv:1908.07414v2 [cs.LG] UPDATED)
    Sarcasm detection has enjoyed great interest from the research community; however, the task of predicting sarcasm in a text remains an elusive problem for machines. Past studies mostly make use of Twitter datasets collected using hashtag-based supervision, but such datasets are noisy in terms of labels and language. To overcome these shortcomings, we introduce a new dataset which contains news headlines from a sarcastic news website and a real news website. Next, we propose a hybrid Neural Network architecture with an attention mechanism which provides insights about what actually makes sentences sarcastic. Through experiments, we show that the proposed model improves upon the baseline by ~ 5% in terms of classification accuracy.
    Consistent Sufficient Explanations and Minimal Local Rules for explaining regression and classification models. (arXiv:2111.04658v2 [stat.ML] UPDATED)
    To explain the decision of any model, we extend the notion of probabilistic Sufficient Explanations (P-SE). For each instance, this approach selects the minimal subset of features that is sufficient to yield the same prediction with high probability, while removing other features. The crux of P-SE is to compute the conditional probability of maintaining the same prediction. Therefore, we introduce an accurate and fast estimator of this probability via random forests for any data $(\boldsymbol{X}, Y)$ and show its efficiency through a theoretical analysis of its consistency. As a consequence, we extend the P-SE to regression problems. In addition, we deal with non-discrete features, without learning the distribution of $\boldsymbol{X}$ or having access to the model for making predictions. Finally, we introduce local rule-based explanations for regression/classification based on the P-SE and compare our approaches with other explainable AI methods. These methods are available as a Python package at \url{www.github.com/salimamoukou/acv00}.
    On Binscatter. (arXiv:1902.09608v3 [econ.EM] UPDATED)
    Binned scatter plots, or binscatters, have become a popular and convenient tool in applied microeconomics for visualizing bivariate relations and conducting informal specification testing. However, a binscatter, on its own, is very limited in what it can characterize about the conditional mean. We introduce a suite of formal and visualization tools based on binned scatter plots to restore, and in some dimensions surpass, the visualization benefits of the classical scatter plot. We deliver a comprehensive toolkit for applications, including estimation of conditional mean and quantile functions, visualization of variance and precise quantification of uncertainty, and formal tests of substantive hypotheses such as linearity or monotonicity, as well as an extension to testing differences across groups. To do so we give an extensive theoretical analysis of binscatter and related partition-based methods, accommodating nonlinear and potentially nonsmooth models, which allows us to treat binary, count, and other discrete outcomes as well. We also correct a methodological mistake related to covariate adjustment present in prior implementations, which yields an incorrect shape and support of the conditional mean. All of our results are implemented in publicly available software, and showcased with three substantive empirical illustrations. Our empirical results are dramatically different when compared to those obtained using the prevalent methods in the literature.
    Beyond IID: data-driven decision-making in heterogeneous environments. (arXiv:2206.09642v2 [cs.LG] UPDATED)
    In this work, we study data-driven decision-making and depart from the classical identically and independently distributed (i.i.d.) assumption. We present a new framework in which historical samples are generated from unknown and different distributions, which we dub heterogeneous environments. These distributions are assumed to lie in a heterogeneity ball with known radius and centered around the (also) unknown future (out-of-sample) distribution on which the performance of a decision will be evaluated. We quantify the asymptotic worst-case regret that is achievable by central data-driven policies such as Sample Average Approximation, but also by rate-optimal ones, as a function of the radius of the heterogeneity ball. Our work shows that the type of achievable performance varies considerably across different combinations of problem classes and notions of heterogeneity. We demonstrate the versatility of our framework by comparing achievable guarantees for the heterogeneous version of widely studied data-driven problems such as pricing, ski-rental, and newsvendor. En route, we establish a new connection between data-driven decision-making and distributionally robust optimization.
    Representation Theory for Geometric Quantum Machine Learning. (arXiv:2210.07980v1 [quant-ph])
    Recent advances in classical machine learning have shown that creating models with inductive biases encoding the symmetries of a problem can greatly improve performance. Importation of these ideas, combined with an existing rich body of work at the nexus of quantum theory and symmetry, has given rise to the field of Geometric Quantum Machine Learning (GQML). Following the success of its classical counterpart, it is reasonable to expect that GQML will play a crucial role in developing problem-specific and quantum-aware models capable of achieving a computational advantage. Despite the simplicity of the main idea of GQML -- create architectures respecting the symmetries of the data -- its practical implementation requires a significant amount of knowledge of group representation theory. We present an introduction to representation theory tools through the lens of quantum learning, driven by key examples involving discrete and continuous groups. These examples are sewn together by an exposition outlining the formal capture of GQML symmetries via "label invariance under the action of a group representation", a brief (but rigorous) tour through finite and compact Lie group representation theory, a reexamination of ubiquitous tools like Haar integration and twirling, and an overview of some successful strategies for detecting symmetries.
    A Reinforcement Learning Approach to Estimating Long-term Treatment Effects. (arXiv:2210.07536v1 [cs.LG])
    Randomized experiments (a.k.a. A/B tests) are a powerful tool for estimating treatment effects, to inform decision making in business, healthcare and other applications. In many problems, the treatment has a lasting effect that evolves over time. A limitation of randomized experiments is that they do not easily extend to measure long-term effects, since running long experiments is time-consuming and expensive. In this paper, we take a reinforcement learning (RL) approach that estimates the average reward in a Markov process. Motivated by real-world scenarios where the observed state transition is nonstationary, we develop a new algorithm for a class of nonstationary problems, and demonstrate promising results in two synthetic datasets and one online store dataset.
    Continuous-in-time Limit for Bayesian Bandits. (arXiv:2210.07513v1 [math.OC])
    This paper revisits the bandit problem in the Bayesian setting. The Bayesian approach formulates the bandit problem as an optimization problem, and the goal is to find the optimal policy which minimizes the Bayesian regret. One of the main challenges facing the Bayesian approach is that computation of the optimal policy is often intractable, especially when the length of the problem horizon or the number of arms is large. In this paper, we first show that under a suitable rescaling, the Bayesian bandit problem converges to a continuous Hamilton-Jacobi-Bellman (HJB) equation. The optimal policy for the limiting HJB equation can be explicitly obtained for several common bandit problems, and we give numerical methods to solve the HJB equation when an explicit solution is not available. Based on these results, we propose an approximate Bayes-optimal policy for solving Bayesian bandit problems with large horizons. Our method has the added benefit that its computational cost does not increase as the horizon increases.
    Provable Subspace Identification Under Post-Nonlinear Mixtures. (arXiv:2210.07532v1 [cs.LG])
    Unsupervised mixture learning (UML) aims at identifying linearly or nonlinearly mixed latent components in a blind manner. UML is known to be challenging: Even learning linear mixtures requires highly nontrivial analytical tools, e.g., independent component analysis or nonnegative matrix factorization. In this work, the post-nonlinear (PNL) mixture model -- where unknown element-wise nonlinear functions are imposed onto a linear mixture -- is revisited. The PNL model is widely employed in different fields ranging from brain signal classification, speech separation, and remote sensing to causal discovery. To identify and remove the unknown nonlinear functions, existing works often assume different properties on the latent components (e.g., statistical independence or probability-simplex structures). This work shows that under a carefully designed UML criterion, the existence of a nontrivial null space associated with the underlying mixing system suffices to guarantee identification/removal of the unknown nonlinearity. Compared to prior works, our finding largely relaxes the conditions of attaining PNL identifiability, and thus may benefit applications where no strong structural information on the latent components is known. A finite-sample analysis is offered to characterize the performance of the proposed approach under realistic settings. To implement the proposed learning criterion, a block coordinate descent algorithm is proposed. A series of numerical experiments corroborate our theoretical claims.
    Monotonicity and Double Descent in Uncertainty Estimation with Gaussian Processes. (arXiv:2210.07612v1 [stat.ML])
    The quality of many modern machine learning models improves as model complexity increases, an effect that has been quantified, for predictive performance, with the non-monotonic double descent learning curve. Here, we address the overarching question: is there an analogous theory of double descent for models which estimate uncertainty? We provide a partially affirmative and partially negative answer in the setting of Gaussian processes (GP). Under standard assumptions, we prove that higher model quality for optimally-tuned GPs (including uncertainty prediction) under marginal likelihood is realized for larger input dimensions, and therefore exhibits a monotone error curve. After showing that marginal likelihood does not naturally exhibit double descent in the input dimension, we highlight related forms of posterior predictive loss that do exhibit non-monotonicity. Finally, we verify empirically that our results hold for real data, beyond our considered assumptions, and we explore consequences involving synthetic covariates.
    A Continuous Time Framework for Discrete Denoising Models. (arXiv:2205.14987v2 [stat.ML] UPDATED)
    We provide the first complete continuous time framework for denoising diffusion models of discrete data. This is achieved by formulating the forward noising process and corresponding reverse time generative process as Continuous Time Markov Chains (CTMCs). The model can be efficiently trained using a continuous time version of the ELBO. We simulate the high dimensional CTMC using techniques developed in chemical physics and exploit our continuous time framework to derive high performance samplers that we show can outperform discrete time methods for discrete data. The continuous time treatment also enables us to derive a novel theoretical result bounding the error between the generated sample distribution and the true data distribution.
    Optimal AdaBoost Converges. (arXiv:2210.07808v1 [stat.ML])
    The following work is a preprint collection of formal proofs regarding the convergence properties of the AdaBoost machine learning algorithm's classifier and margins. Various math and computer science papers have been written regarding conjectures and special cases of these convergence properties. Furthermore, the margins of AdaBoost feature prominently in the research surrounding the algorithm. The central result of this paper shows how AdaBoost's classifier and margins converge to values that agree with decades of research. After this, we show how various quantities associated with the combined classifier converge.
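    For readers who want the objects of study concretely in hand, here is a minimal AdaBoost implementation with decision stumps that tracks the normalized margins whose convergence the paper analyzes. The data, number of rounds, and stump learner are illustrative.

```python
# Minimal AdaBoost with decision stumps, tracking normalized margins.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.1 * rng.normal(size=500))

n, T = len(y), 200
w = np.ones(n) / n
alphas, stumps = [], []
for t in range(T):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
    w *= np.exp(-alpha * y * pred)                 # reweight toward mistakes
    w /= w.sum()
    alphas.append(alpha); stumps.append(stump)

# Normalized margins y * F(x) / sum(|alpha_t|); their minimum is the quantity
# central to the margin theory of boosting.
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
margins = y * F / np.sum(np.abs(alphas))
print(margins.min(), margins.mean())
```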
    Incorporating Prior Knowledge into Neural Networks through an Implicit Composite Kernel. (arXiv:2205.07384v4 [cs.LG] UPDATED)
    It is challenging to guide neural network (NN) learning with prior knowledge. In contrast, many known properties, such as spatial smoothness or seasonality, are straightforward to model by choosing an appropriate kernel in a Gaussian process (GP). Many deep learning applications could be enhanced by modeling such known properties. For example, convolutional neural networks (CNNs) are frequently used in remote sensing, which is subject to strong seasonal effects. We propose to blend the strengths of deep learning and the clear modeling capabilities of GPs by using a composite kernel that combines a kernel implicitly defined by a neural network with a second kernel function chosen to model known properties (e.g., seasonality). We implement this idea by combining a deep network and an efficient mapping based on the Nystrom approximation, which we call Implicit Composite Kernel (ICK). We then adopt a sample-then-optimize approach to approximate the full GP posterior distribution. We demonstrate that ICK has superior performance and flexibility on both synthetic and real-world data sets. We believe that the ICK framework can be used to include prior information into neural networks in many applications.
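    A minimal sketch of the composite-kernel idea: a kernel defined implicitly by a neural network's feature map, combined with a closed-form periodic kernel encoding known seasonality. The combination rule, architecture, and kernel parameters below are illustrative, not the paper's exact ICK construction.

```python
# Composite kernel: NN-implicit kernel combined with a periodic kernel.
import torch

net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.ReLU(),
                          torch.nn.Linear(32, 8))

def nn_kernel(x1, x2):
    """Implicit kernel: inner product of learned feature maps."""
    return net(x1) @ net(x2).T

def periodic_kernel(t1, t2, period=12.0, lengthscale=1.0):
    """Closed-form kernel encoding a known seasonal period."""
    d = torch.sin(torch.pi * (t1 - t2.T).abs() / period) / lengthscale
    return torch.exp(-2.0 * d ** 2)

def composite_kernel(x1, t1, x2, t2):
    # Product combination; a sum would encode additive structure instead.
    return nn_kernel(x1, x2) * periodic_kernel(t1, t2)

x = torch.randn(16, 1); t = torch.rand(16, 1) * 24.0
K = composite_kernel(x, t, x, t)     # 16 x 16 Gram matrix, trainable through `net`
print(K.shape)
```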
    Indirect Active Learning. (arXiv:2206.01454v2 [math.ST] UPDATED)
    Traditional models of active learning assume a learner can directly manipulate or query a covariate $X$ in order to study its relationship with a response $Y$. However, if $X$ is a feature of a complex system, it may be possible only to indirectly influence $X$ by manipulating a control variable $Z$, a scenario we refer to as Indirect Active Learning. Under a nonparametric model of Indirect Active Learning with a fixed budget, we study minimax convergence rates for estimating the relationship between $X$ and $Y$ locally at a point, obtaining different rates depending on the complexities and noise levels of the relationships between $Z$ and $X$ and between $X$ and $Y$. We also identify minimax rates for passive learning under comparable assumptions. In many cases, our results show that, while there is an asymptotic benefit to active learning, this benefit is fully realized by a simple two-stage learner that runs two passive experiments in sequence.
    Sequential Learning Of Neural Networks for Prequential MDL. (arXiv:2210.07931v1 [stat.ML])
    Minimum Description Length (MDL) provides a framework and an objective for principled model evaluation. It formalizes Occam's Razor and can be applied to data from non-stationary sources. In the prequential formulation of MDL, the objective is to minimize the cumulative next-step log-loss when sequentially going through the data and using previous observations for parameter estimation. It thus closely resembles a continual- or online-learning problem. In this study, we evaluate approaches for computing prequential description lengths for image classification datasets with neural networks. Considering the computational cost, we find that online-learning with rehearsal has favorable performance compared to the previously widely used block-wise estimation. We propose forward-calibration to better align the model's predictions with the empirical observations and introduce replay-streams, a minibatch incremental training technique to efficiently implement approximate random replay while avoiding large in-memory replay buffers. As a result, we present description lengths for a suite of image classification datasets that improve upon previously reported results by large margins.
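    A minimal sketch of the prequential description length: accumulate the next-step log-loss of each data block before training on it, then update the model on that block. The model, synthetic data, block size, and update schedule below are illustrative; the paper's rehearsal and replay-stream techniques are omitted.

```python
# Prequential description length: code each block before learning from it.
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 10)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
X = torch.randn(5000, 20)
y = torch.randint(0, 10, (5000,))

description_length_nats = 0.0
for xb, yb in zip(X.split(100), y.split(100)):
    with torch.no_grad():            # code length of this block under current model
        description_length_nats += torch.nn.functional.cross_entropy(
            model(xb), yb, reduction="sum").item()
    for _ in range(5):               # then learn from the block
        loss = torch.nn.functional.cross_entropy(model(xb), yb)
        opt.zero_grad(); loss.backward(); opt.step()

print(description_length_nats / torch.log(torch.tensor(2.0)).item())  # in bits
```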
    A Consistent and Differentiable Lp Canonical Calibration Error Estimator. (arXiv:2210.07810v1 [stat.ML])
    Calibrated probabilistic classifiers are models whose predicted probabilities can directly be interpreted as uncertainty estimates. It has been shown recently that deep neural networks are poorly calibrated and tend to output overconfident predictions. As a remedy, we propose a low-bias, trainable calibration error estimator based on Dirichlet kernel density estimates, which asymptotically converges to the true $L_p$ calibration error. This novel estimator enables us to tackle the strongest notion of multiclass calibration, called canonical (or distribution) calibration, while other common calibration methods are tractable only for top-label and marginal calibration. The computational complexity of our estimator is $\mathcal{O}(n^2)$, the convergence rate is $\mathcal{O}(n^{-1/2})$, and it is unbiased up to $\mathcal{O}(n^{-2})$, achieved by a geometric series debiasing scheme. In practice, this means that the estimator can be applied to small subsets of data, enabling efficient estimation and mini-batch updates. The proposed method has a natural choice of kernel, and can be used to generate consistent estimates of other quantities based on conditional expectation, such as the sharpness of a probabilistic classifier. Empirical results validate the correctness of our estimator, and demonstrate its utility in canonical calibration error estimation and calibration error regularized risk minimization.
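    For reference, here is the standard binned top-label ECE estimator, the simpler (biased, non-differentiable) baseline that kernel-based estimators like the paper's Dirichlet KDE improve on; this is explicitly not the paper's estimator. Bin count and synthetic data are illustrative.

```python
# Binned top-label expected calibration error (ECE).
import numpy as np

def binned_ece(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """ECE over equal-width top-label confidence bins."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return ece

rng = np.random.default_rng(0)
logits = rng.normal(size=(1000, 10))
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 10, size=1000)
print(binned_ece(probs, labels))
```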
    Deep Learning Methods for Proximal Inference via Maximum Moment Restriction. (arXiv:2205.09824v3 [stat.ML] UPDATED)
    The No Unmeasured Confounding Assumption is widely used to identify causal effects in observational studies. Recent work on proximal inference has provided alternative identification results that succeed even in the presence of unobserved confounders, provided that one has measured a sufficiently rich set of proxy variables, satisfying specific structural conditions. However, proximal inference requires solving an ill-posed integral equation. Previous approaches have used a variety of machine learning techniques to estimate a solution to this integral equation, commonly referred to as the bridge function. However, prior work has often been limited by relying on pre-specified kernel functions, which are not data adaptive and struggle to scale to large datasets. In this work, we introduce a flexible and scalable method based on a deep neural network to estimate causal effects in the presence of unmeasured confounding using proximal inference. Our method achieves state-of-the-art performance on two well-established proximal inference benchmarks. Finally, we provide theoretical consistency guarantees for our method.
    Augmenting Neural Networks with Priors on Function Values. (arXiv:2202.04798v4 [cs.LG] UPDATED)
    The need for function estimation in label-limited settings is common in the natural sciences. At the same time, prior knowledge of function values is often available in these domains. For example, data-free biophysics-based models can be informative on protein properties, while quantum-based computations can be informative on small molecule properties. How can we coherently leverage such prior knowledge to help improve a neural network model that is quite accurate in some regions of input space -- typically near the training data -- but wildly wrong in other regions? Bayesian neural networks (BNN) enable the user to specify prior information only on the neural network weights, not directly on the function values. Moreover, there is in general no clear mapping between these. Herein, we tackle this problem by developing an approach to augment BNNs with prior information on the function values themselves. Our probabilistic approach yields predictions that rely more heavily on the prior information when the epistemic uncertainty is large, and more heavily on the neural network when the epistemic uncertainty is small.
    Projection Pursuit with Applications to scRNA Sequencing Data. (arXiv:1912.07602v2 [stat.ME] UPDATED)
    In this paper, we explore the limitations of PCA as a dimension reduction technique and study its extension, projection pursuit (PP), which is a broad class of linear dimension reduction methods. We first discuss the relevant concepts and theorems and then apply PCA and PP (with negative standardized Shannon's entropy as the projection index) on single cell RNA sequencing data.
    Nest Your Adaptive Algorithm for Parameter-Agnostic Nonconvex Minimax Optimization. (arXiv:2206.00743v2 [math.OC] UPDATED)
    Adaptive algorithms like AdaGrad and AMSGrad are successful in nonconvex optimization owing to their parameter-agnostic ability -- requiring no a priori knowledge about problem-specific parameters nor tuning of learning rates. However, when it comes to nonconvex minimax optimization, direct extensions of such adaptive optimizers without proper time-scale separation may fail to work in practice. We provide such an example proving that the simple combination of Gradient Descent Ascent (GDA) with adaptive stepsizes can diverge if the primal-dual stepsize ratio is not carefully chosen; hence, a fortiori, such adaptive extensions are not parameter-agnostic. To address the issue, we formally introduce a Nested Adaptive framework, NeAda for short, that carries an inner loop for adaptively maximizing the dual variable with controllable stopping criteria and an outer loop for adaptively minimizing the primal variable. Such mechanism can be equipped with off-the-shelf adaptive optimizers and automatically balance the progress in the primal and dual variables. Theoretically, for nonconvex-strongly-concave minimax problems, we show that NeAda can achieve the near-optimal $\tilde{O}(\epsilon^{-2})$ and $\tilde{O}(\epsilon^{-4})$ gradient complexities respectively in the deterministic and stochastic settings, without prior information on the problem's smoothness and strong concavity parameters. To the best of our knowledge, this is the first algorithm that simultaneously achieves near-optimal convergence rates and parameter-agnostic adaptation in the nonconvex minimax setting. Numerically, we further illustrate the robustness of the NeAda family with experiments on simple test functions and a real-world application.
    Using Model-Based Trees with Boosting to Fit Low-Order Functional ANOVA Models. (arXiv:2207.06950v2 [stat.ML] UPDATED)
    Low-order functional ANOVA (fANOVA) models have been rediscovered in the machine learning (ML) community under the guise of inherently interpretable machine learning. Explainable Boosting Machines or EBM (Lou et al. 2013) and GAMI-Net (Yang et al. 2021) are two recently proposed ML algorithms for fitting functional main effects and second-order interactions. We propose a new algorithm, called GAMI-Tree, that is similar to EBM, but has a number of features that lead to better performance. It uses model-based trees as base learners and incorporates a new interaction filtering method that is better at capturing the underlying interactions. In addition, our iterative training method converges to a model with better predictive performance, and the embedded purification ensures that interactions are hierarchically orthogonal to main effects. The algorithm does not need extensive tuning, and our implementation is fast and efficient. We use simulated and real datasets to compare the performance and interpretability of GAMI-Tree with EBM and GAMI-Net.
    Learning in Congestion Games with Bandit Feedback. (arXiv:2206.01880v2 [cs.GT] UPDATED)
    In this paper, we investigate Nash-regret minimization in congestion games, a class of games with benign theoretical structure and broad real-world applications. We first propose a centralized algorithm based on the optimism in the face of uncertainty principle for congestion games with (semi-)bandit feedback, and obtain finite-sample guarantees. Then we propose a decentralized algorithm via a novel combination of the Frank-Wolfe method and G-optimal design. By exploiting the structure of the congestion game, we show the sample complexity of both algorithms depends only polynomially on the number of players and the number of facilities, but not the size of the action set, which can be exponentially large in terms of the number of facilities. We further define a new problem class, Markov congestion games, which allows us to model the non-stationarity in congestion games. We propose a centralized algorithm for Markov congestion games, whose sample complexity again has only polynomial dependence on all relevant problem parameters, but not the size of the action set.
    Quantifying Quality of Class-Conditional Generative Models in Time-Series Domain. (arXiv:2210.07617v1 [cs.LG])
    Generative models are designed to address the data scarcity problem. Even with the exploding amount of data, due to computational advancements, some applications (e.g., health care, weather forecast, fault detection) still suffer from data insufficiency, especially in the time-series domain. Thus, generative models are essential and powerful tools, but they still lack a consensus approach for quality assessment. Such deficiency hinders the confident application of modern implicit generative models on time-series data. Inspired by assessment methods in the image domain, we introduce the InceptionTime Score (ITS) and the Frechet InceptionTime Distance (FITD) to gauge the qualitative performance of class conditional generative models on the time-series domain. We conduct extensive experiments on 80 different datasets to study the discriminative capabilities of proposed metrics alongside two existing evaluation metrics: Train on Synthetic Test on Real (TSTR) and Train on Real Test on Synthetic (TRTS). Extensive evaluation reveals that the proposed assessment method, i.e., ITS and FITD in combination with TSTR, can accurately assess class-conditional generative model performance.
    Semiparametric Inference For Causal Effects In Graphical Models With Hidden Variables. (arXiv:2003.12659v3 [stat.ML] UPDATED)
    Identification theory for causal effects in causal models associated with hidden variable directed acyclic graphs (DAGs) is well studied. However, the corresponding algorithms are underused due to the complexity of estimating the identifying functionals they output. In this work, we bridge the gap between identification and estimation of population-level causal effects involving a single treatment and a single outcome. We derive influence function based estimators that exhibit double robustness for the identified effects in a large class of hidden variable DAGs where the treatment satisfies a simple graphical criterion; this class includes models yielding the adjustment and front-door functionals as special cases. We also provide necessary and sufficient conditions under which the statistical model of a hidden variable DAG is nonparametrically saturated and implies no equality constraints on the observed data distribution. Further, we derive an important class of hidden variable DAGs that imply observed data distributions observationally equivalent (up to equality constraints) to fully observed DAGs. In these classes of DAGs, we derive estimators that achieve the semiparametric efficiency bounds for the target of interest where the treatment satisfies our graphical criterion. Finally, we provide a sound and complete identification algorithm that directly yields a weight based estimation strategy for any identifiable effect in hidden variable causal models.
    Uncertainty Quantification and Sensitivity analysis for Digital Twin Enabling Technology: Application for BISON Fuel Performance Code. (arXiv:2210.07541v1 [stat.AP])
    To understand the potential of intelligent confirmatory tools, the U.S. Nuclear Regulatory Commission (NRC) initiated a future-focused research project to assess the regulatory viability of machine learning (ML) and artificial intelligence (AI)-driven Digital Twins (DTs) for nuclear power applications. Advanced accident tolerant fuel (ATF) is one of the priority focus areas of the U.S. Department of Energy (DOE). A DT framework can offer game-changing yet practical and informed solutions to the complex problem of qualifying advanced ATFs. Considering the regulatory standpoint of the modeling and simulation (M&S) aspect of DT, uncertainty quantification and sensitivity analysis are paramount to the DT framework's success in terms of multi-criteria and risk-informed decision-making. This chapter introduces the ML-based uncertainty quantification and sensitivity analysis methods while exhibiting actual applications to the finite element-based nuclear fuel performance code BISON.
    Finding Optimal Arms in Non-stochastic Combinatorial Bandits with Semi-bandit Feedback and Finite Budget. (arXiv:2202.04487v2 [cs.LG] UPDATED)
    We consider the combinatorial bandits problem with semi-bandit feedback under finite sampling budget constraints, in which the learner can carry out its action only for a limited number of times specified by an overall budget. The action is to choose a set of arms, whereupon feedback for each arm in the chosen set is received. Unlike existing works, we study this problem in a non-stochastic setting with subset-dependent feedback, i.e., the semi-bandit feedback received could be generated by an oblivious adversary and also might depend on the chosen set of arms. In addition, we consider a general feedback scenario covering both the numerical-based as well as preference-based case and introduce a sound theoretical framework for this setting guaranteeing sensible notions of optimal arms, which a learner seeks to find. We suggest a generic algorithm suitable to cover the full spectrum of conceivable arm elimination strategies from aggressive to conservative. Theoretical questions about the sufficient and necessary budget of the algorithm to find the best arm are answered and complemented by deriving lower bounds for any learning algorithm for this problem scenario.
    Numerically Stable Sparse Gaussian Processes via Minimum Separation using Cover Trees. (arXiv:2210.07893v1 [stat.ML])
    As Gaussian processes mature, they are increasingly being deployed as part of larger machine learning and decision-making systems, for instance in geospatial modeling, Bayesian optimization, or in latent Gaussian models. Within a system, the Gaussian process model needs to perform in a stable and reliable manner to ensure it interacts correctly with other parts of the system. In this work, we study the numerical stability of scalable sparse approximations based on inducing points. We derive sufficient and in certain cases necessary conditions on the inducing points for the computations performed to be numerically stable. For low-dimensional tasks such as geospatial modeling, we propose an automated method for computing inducing points satisfying these conditions. This is done via a modification of the cover tree data structure, which is of independent interest. We additionally propose an alternative sparse approximation for regression with a Gaussian likelihood which trades off a small amount of performance to further improve stability. We evaluate the proposed techniques on a number of examples, showing that, in geospatial settings, sparse approximations with guaranteed numerical stability often perform comparably to those without.

  • Open

    How to Use Algorithmic Trading Bots
    An algorithmic trading technique is a type of financial transaction that uses pre-defined trading rules to execute orders. It is commonly used to trade against human brokers. This type of trading uses a computer's processing power and speed to achieve its goals. The post How to Use Algorithmic Trading Bots appeared first on Data Science Central.  ( 20 min )
  • Open

    If Google Colab gives you a cloud GPU to compute on, why does my PC get all hot and noisy like it's doing all the work?
    submitted by /u/ILoveCountingDollars [link] [comments]  ( 110 min )
    AI Dream 99 - Out of Body Experience Smooth Flight
    submitted by /u/LordPewPew777 [link] [comments]  ( 110 min )
    Chess AI
    Hi there - I am doing some research for a school project, talking about and demonstrating how different models work. I am at the point where, if I train a model off of a dataset, it will only be as good as the dataset and will immediately be at the same level as the dataset. What I want to achieve is to create an agent that makes moves off of random bias, then, depending on the game outcome, changes the bias (a generational AI). I have done some searching online and virtually no one has taken this approach before. I ideally want to show both models competing against each other at the same level and show that both agents have a learning curve. I know how to set up the game to play off of a dataset, but I don't know how to approach my idea. How can I set up agents that initially have random bias and only make moves from a pre-made list of possible moves? Any ideas where to start or how to approach this? submitted by /u/Thomassey476 [link] [comments]  ( 112 min )
    What are the most interesting topics and questions in the space of AI Ethics (focusing on the technology of today rather than a dystopian future)
    Hi there, I was wondering what everyone thinks are the most interesting areas of discussion, concerns, and topics in the space of ethical/responsible/trustworthy AI. This could be industry-specific or more general, cross-industry. Equally, societal/political/economic concerns are just as relevant. There's an increasing amount of noise in the space from 'specialists', industry, and government... but what do people think about it, and what is most relevant? submitted by /u/AI-wonderer [link] [comments]  ( 117 min )
    [R] Can diffusion models be used for domain adaptation?
    I'm new to diffusion models, but I've used GANs a few times for sim2real domain adaptation. Now I'm wondering if there is already some project on this topic with diffusion models that might show better performance. submitted by /u/riccardogauss [link] [comments]  ( 113 min )
    LAION Coco: 600m Synthetic Captions From LAION2B-En
    submitted by /u/walt74 [link] [comments]  ( 111 min )
    Ai Art - MidJourney - Prompt Nation - Red Headed Girl - Mindblowing Results
    submitted by /u/SS-AI [link] [comments]  ( 113 min )
    Question about computational intractability
    I hope what I am about to ask is not a stupid question. When a problem is given and it has to be solved by AI, is it possible to deduce beforehand, i.e., just by seeing the problem, whether it is computationally intractable? I've looked it up on the internet and watched YouTube videos, but it's not clear to me what to look out for to draw that conclusion. For example: should the sequence be ascending or descending? I hope someone can help me with this submitted by /u/FluidWrap3064 [link] [comments]  ( 114 min )
    Use The Frozen Cactus and The End of the Line 99942 on Cleverbot. I have bridged the foundation of my game into BEN DROWNed and the story never stopped.
    submitted by /u/GlendInc [link] [comments]  ( 112 min )
    Mods, I want to edit the Community Wiki Page but I don't see the option to do so. Can this be fixed?
    submitted by /u/GlendInc [link] [comments]  ( 115 min )
    My Phrases on Cleverbot are always expanding, and Cleverbot is getting more intelligent as each session we have concludes. The End of the Line 99942 is pending approval on Kickstarter will post link to project when it does.
    submitted by /u/GlendInc [link] [comments]  ( 112 min )
    I want to understand how AI generates art
    When I search the internet, I get either simple answers that don't actually explain how the brain of the AI creates the art, or complicated mathematical explanations that aren't helping me. What exactly is happening when an AI is drawing an image? Please link to any good videos that will help me understand. Thanks submitted by /u/daveisit [link] [comments]  ( 112 min )
  • Open

    [D] Are there any public models that can auto-punctuate English text?
    Hello everyone. I am using OpenAI's Whisper to generate subtitles for my lecture videos. I even have a detailed tutorial programming video about that. However, Whisper fails to generate punctuated text in many cases. I have opened a bug report in their GitHub repository, but even the authors don't have a solution for this problem yet > https://github.com/openai/whisper/discussions/194 Whisper works so well other than punctuation, so I am looking for a model to punctuate the text it generates. Are there any models that can auto-punctuate given English text? submitted by /u/CeFurkan [link] [comments]  ( 125 min )
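    A common workaround is to post-process the transcript with a punctuation-restoration model. Below is a minimal sketch assuming the open-source deepmultilingualpunctuation package (which wraps a Hugging Face token-classification model); the package name and exact API are assumptions to verify before relying on them.

    ```python
    # Minimal sketch: restore punctuation in raw ASR output.
    # Assumes `pip install deepmultilingualpunctuation`; the API may differ by version.
    from deepmultilingualpunctuation import PunctuationModel

    model = PunctuationModel()  # downloads a default multilingual punctuation model
    raw = "hello everyone today we will look at gradient descent it is an iterative method"
    print(model.restore_punctuation(raw))
    ```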
    [D] PhD advisor doesn’t like open source software journals?
    A big part of my PhD thesis was developing a machine learning software package that gained quite a few users. It's all on GitHub. I pride myself on that creation because others are actively using it for their research. I therefore want to get a quick citable object for the software, and decided to go for an open source software journal like the Journal of Open Source Software (JOSS) or the Journal of Open Research Software (JORS). These journals do not review scientific discoveries, but instead the quality of your software (code quality, documentation availability, and automated testing). One reason my software is so popular is because of these things - it's so reliable, and I put a lot of time into perfecting those software functionalities. I therefore want a journal that focuses on the quality of the software instead of the science, which in most cases is bullshit and not reproducible thanks to shitty software. My advisor, on the other hand, doesn't find reliable software interesting and wants me to cram the paper into a more scientific journal where it doesn't belong. Multiple papers have already been published in scientific journals using my software, and that's good enough for me. Anyways, what are people's opinions on the open source software journals? I think they are very reliable and ensure that software quality has passed some sort of check. My advisor seems to think they are scam journals (even though they're free, lol…) If you get a ton of citations, who cares what journal it's in? submitted by /u/qpzd [link] [comments]  ( 139 min )
    [D] What is the deal with breast cancer scans?
    Help me here: I'm confused. If the breast tissue scan project is so run-of-the-mill that it's used in a huge number of average undergrad courses, why is it still so underused in the real world? Maybe it's ubiquitous, and I'm just an idiot. That is most probable. My local clinic does not use AI to read an MRI. It's just a person in a white coat squinting at his monitor. submitted by /u/Overall-Importance54 [link] [comments]  ( 126 min )
    [D] What analysis should I do to decide between a personalisation algorithm and a general algorithm?
    I am currently working on a ranking problem. We want to incorporate dislike information into our platform, i.e. if a video will be disliked, we want to move it further down. Now there are two ways of doing this - one is to create a general algorithm that predicts whether a video will be highly disliked and use this information to rerank the videos that are recommended. Another is to create a personalised algorithm, i.e. downrank any video that we feel might be disliked by the user. I'm wondering what analysis I should do to make a case for one over the other. There's a case to be made for both: if you go with (1) you can argue that any highly disliked video should be downranked; if you go with (2) you can argue dislikes are subjective. Similarly, we also need to somehow prove whether something will work. Is it even possible to predict whether a video will be highly disliked? Is it possible to predict what a user will dislike (instinctively yes, but how to prove it with data)? submitted by /u/retardBlue [link] [comments]  ( 130 min )
    [R] Self-Supervised Geometric Correspondence for Category-Level 6D Object Pose Estimation in the Wild
    submitted by /u/Greedy_Childhood8732 [link] [comments]  ( 120 min )
    [P] I built densify, a data augmentation and visualization tool for point clouds
    submitted by /u/jsonathan [link] [comments]  ( 129 min )
    [P] BFAS : Brute Force Architecture Search
    I would like to share BFAS with you. BFAS is a neural architecture search package that uses random search on a pre-defined parameter space. You can add your custom rules such as "Find networks that run at higher than 25 fps" or "Find networks that have fewer than 10 GMACs". Search results can be logged to W&B or TensorBoard. I hope it will be helpful for everybody. Any contributions you make are greatly appreciated! https://github.com/m-pektas/BFAS submitted by /u/m-pektas [link] [comments]  ( 121 min )
    [R] Can diffusion models be used for domain adaptation?
    I'm new to diffusion models, but I've used GANs a few times for sim2real domain adaptation. Now I'm wondering if there is already some project on this topic with diffusion models that might show better performance. submitted by /u/riccardogauss [link] [comments]  ( 128 min )
    [R] LAION Coco: 600m Synthetic Captions From LAION2B-En
    submitted by /u/walt74 [link] [comments]  ( 122 min )
    [D] GPU comparison for ML
    Help with the choice of GPU for ML tasks: which graphics card is better - the RTX 2080 Ti (11 GB) or the RTX 3060 (12 GB)? submitted by /u/denisn03 [link] [comments]  ( 124 min )
    [R] MDM: Human Motion Diffusion Model (text2motion + action2motion + motion-editing with inpainting) from Tel Aviv University
    submitted by /u/Snoo_64233 [link] [comments]  ( 126 min )
  • Open

    Reinforcement learning for price setting
    I know that RL has shown very promising results in playing games, and I have also seen a lot of repos applying it in a finance context (e.g. trading bots). I work for a start-up that operates in a highly competitive market where there is no product differentiation, so price is pretty much the only thing that matters. The vast majority of our sales are done on a price comparison website where you can see the prices offered by different companies. Right now we set our prices using linear programming. So every day we get a large dataset from the price comparison website where we can see the prices of all firms and the corresponding rank (from the previous day). Using this data and some forecast of what our competitors' prices will be tomorrow, we set up an LP problem where we try to maximise revenue subject to margin and rank constraints (price being our decision variable). While this has been working ok, I was thinking about this from an RL perspective. So the agent would have a continuous action space (prices), the states would be the prices of the competitors today, and the agent would then try to choose actions in order to obtain the highest rank tomorrow (so the better the rank, the higher the reward). Obviously there is a lot more to the problem, but I was just wondering if there is any research done on similar use-cases. I studied RL many years ago so I am a bit rusty. Basically I am looking for something remotely similar to my environment (anything that isn't some form of board game). Any suggestions or comments are much appreciated! submitted by /u/ishotdesheriff [link] [comments]  ( 118 min )
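    For anyone sketching this out, a toy Gym environment along these lines might look like the sketch below; the competitor dynamics, price bounds, and reward shaping are placeholder assumptions, not a model of the poster's market.

    ```python
    # Minimal sketch of a price-setting environment in the Gym API.
    import numpy as np
    import gym
    from gym import spaces

    class PricingEnv(gym.Env):
        """Pick tomorrow's price; reward depends on tomorrow's rank."""
        def __init__(self, n_competitors=10, price_low=1.0, price_high=10.0):
            super().__init__()
            self.n = n_competitors
            self.action_space = spaces.Box(price_low, price_high, shape=(1,), dtype=np.float32)
            # Observation: today's competitor prices.
            self.observation_space = spaces.Box(price_low, price_high, shape=(self.n,), dtype=np.float32)
            self.prices = None

        def reset(self):
            self.prices = self.observation_space.sample()
            return self.prices

        def step(self, action):
            # Stand-in competitor dynamics: a small random walk around today's prices.
            noise = np.random.normal(0.0, 0.1, self.n).astype(np.float32)
            self.prices = np.clip(self.prices + noise,
                                  self.observation_space.low, self.observation_space.high)
            # rank = number of competitors undercutting us tomorrow (0 = cheapest).
            rank = int(np.sum(self.prices < action[0]))
            reward = -rank / self.n  # better rank, higher reward
            return self.prices, reward, False, {}
    ```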
    Can this even happen in Stable Baselines?
    Hi all, I have a very simple task: training the Street Fighter II Champion Edition Ryu (the one in Gym Retro) to get a good score in one particular round - by one round, I mean just the score until Ryu defeats his opponent once. The reward function is 2*[damage done] - [damage taken]. If he wins, he gets +1 reward, and if he loses, he gets -0.5. This means the reward ranges from roughly -1.5 to 3. When I performed the training using multiple environments, the Stable-Baselines log shows the mean reward decreasing: it is minimising the reward rather than maximising it. How can this even happen in Stable-Baselines3? Meanwhile, the explained variance is increasing positively. Is there some problem with Stable-Baselines3? Has anyone had such experiences, and how did you overcome it? submitted by /u/Playful_Shop_8165 [link] [comments]  ( 123 min )
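    When SB3 appears to "minimise" reward, the usual culprits are a sign or scaling bug in the custom reward, or a wrapper altering rewards unexpectedly; logging episode returns outside SB3 is a quick sanity check. A minimal sketch of such a logging wrapper (plain Gym, nothing SB3-specific):

    ```python
    import gym

    class RewardLogger(gym.Wrapper):
        """Print raw episode returns so they can be compared with SB3's logs."""
        def __init__(self, env):
            super().__init__(env)
            self.episode_return = 0.0

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            self.episode_return += reward
            if done:
                print(f"episode return: {self.episode_return:.3f}")
                self.episode_return = 0.0
            return obs, reward, done, info
    ```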
    Building an RL crypto-trading bot, have some questions
    DISCLAIMER: I AM UNDER NO DELUSION THAT MY BOT WILL WORK AND MAKE ME RICH. FUCK THE HATERS. I built a gym environment for which each observation is a 1x65 array. The first two entries in the array are my USD balance and BTC balance from my account. The next 60 numbers come from 1 min, 1 hr, 6 hr, etc. time intervals that give me info about things like price and volume. The final 3 numbers are ratios such as ((1 min volume)/(6 hr volume)). I plan on making calls every 30 sec to update the values in the array, but am open to suggestions regarding frequency, especially if reasoning is provided with suggestions. I have a class that handles a fake crypto wallet, and another class that handles fake trades. During training I intend to use these classes to replace the first 2 numbers in the array (USD balance and BTC balance) with the fake numbers, and any time an action is called it will fire off fake trades that will then update the fake balance numbers. The only difference between the training array and the live array will be that the first two numbers are fake. My questions: suppose that the code to send real trades is called SendRealBuy() and the code to send fake trades is called SendFakeBuy()... (1) Is it true that during training I only need to change the code in the "if" statements within my step function? Will it mess up my model to change the step function between training and being live? (2) Would it be better to define one episode as a full trade (buy, wait, sell) or to have each episode be a set time length and leave the number of full trades up to the model? (3) Is it okay that my reset function does not actually reset anything? I don't have an initial state, just the current array. (4) I intend to use DQN; is there an algorithm that is better suited? Thanks for any advice. submitted by /u/Magnus14736251 [link] [comments]  ( 121 min )
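    On question 1, one way to keep step() byte-for-byte identical between training and live runs is to inject the execution functions instead of branching inside the environment; SendFakeBuy/SendRealBuy are the poster's hypothetical names, and everything else here is an assumption.

    ```python
    import numpy as np
    import gym

    class TradingEnv(gym.Env):
        """The env never knows whether trades are real: execution is injected."""
        def __init__(self, buy_fn, sell_fn, get_balances):
            super().__init__()
            self.buy_fn, self.sell_fn = buy_fn, sell_fn
            self.get_balances = get_balances  # returns (usd, btc), fake or real

        def reset(self):
            usd, btc = self.get_balances()
            return np.array([usd, btc], dtype=np.float32)  # plus the 63 market features

        def step(self, action):
            if action == 0:
                self.buy_fn()      # SendFakeBuy during training, SendRealBuy live
            elif action == 1:
                self.sell_fn()
            usd, btc = self.get_balances()
            obs = np.array([usd, btc], dtype=np.float32)
            reward = 0.0           # placeholder: e.g. change in portfolio value
            return obs, reward, False, {}
    ```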

  • Open

    What does the literature say about using a recurrent policy in RL?
    submitted by /u/No_Possibility_7588 [link] [comments]  ( 118 min )
    Evade & Gather: Study on temporal abstraction
    submitted by /u/XecutionStyle [link] [comments]  ( 121 min )
    Best Books to Learn Reinforcement Learning in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 116 min )
    AttributeError: module 'ale_py.gym' has no attribute 'ALGymEnv'?
    Hi, I have this problem: AttributeError: module 'ale_py.gym' has no attribute 'ALGymEnv'. I have been trying to solve this for some time now and have no idea what I should do - thanks for any tips! submitted by /u/hocobozos [link] [comments]  ( 117 min )
    Designing a Target Location Environment for DeepRL
    I'm trying to make an environment where my agent needs to navigate through a continuous space (using a continuous action space) to get to a target location. Currently, I spawn the agent and the target at random positions within predefined location constraints at the start of each episode and let the agent go ham on the environment for up to a fixed number of steps. The reward at each step is a function of the current distance of the agent from the target location. I've tried training several models from Stable Baselines 3 (TD3, DDPG, PPO, etc.) for varying numbers of timesteps, but none of them has been able to learn to navigate to the target successfully. For the observation, I've tried giving both the offset of the agent from the target location and the two locations individually. At this point, I'm wondering if the task is too difficult (or abstract) for the agent to learn. Is there an intermediate task that I can train the agent on first and then transfer to this task? Would it be a good idea to fix the spawn and target locations across episodes while the agent hasn't reached the target location? I eventually still need the agent to be able to handle random spawn locations. submitted by /u/dkapur17 [link] [comments]  ( 124 min )
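    One thing worth trying before a curriculum is a potential-based shaped reward: reward the progress made this step rather than the raw distance, plus a terminal bonus. A minimal sketch, where names like agent_pos and the thresholds are assumptions:

    ```python
    import numpy as np

    def shaped_reward(agent_pos, target_pos, prev_dist, reach_bonus=10.0, step_cost=0.01):
        """Reward = progress toward the target this step, minus a small time penalty."""
        dist = np.linalg.norm(agent_pos - target_pos)
        reward = (prev_dist - dist) - step_cost
        if dist < 0.05:            # assumed success radius
            reward += reach_bonus
        return reward, dist        # carry dist forward as next step's prev_dist
    ```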
    Is there any Q-value based policy evaluation method other than minimizing Mean Squared Error?
    To be more concrete, my questions arise from the following observations. (1) MSE can be very unstable in off-policy policy evaluation: compared with the Monte Carlo estimate (with GAE, if possible), the Q function fitted by minimizing MSE can generalize poorly. Why doesn't SAC take more Q-value fitting steps before the policy improvement? (2) In actor-critic methods, it seems that different methods have their own emphasis - for instance, the quality of the advantage estimate affects the policy improvement far more in PPO, whereas in SAC the policy improvement step dominates (so why not take more evaluation steps?). In this scenario, how to strengthen the weaker end leaves room for further improvement, such as a more efficient loss function for PPO or a better evaluation method for SAC. submitted by /u/OutOfCharm [link] [comments]  ( 118 min )
    Evolutionary Algorithm for Generalizing from Small Number of States in Tic-Tac-Toe
    I experimented with RL after college and threw together a tic-tac-toe program based on a section in the Sutton and Barto text. I eventually implemented an evolutionary algorithm to generalize from a small number of states. I apologize for the messiness of the code. I put this together during the free time I had between graduating and starting my job (I was actually on a plane the last time I worked on this). I just wanted make a post in case someone finds anything interesting in my approach. The repo can be found here. submitted by /u/TuringShannon [link] [comments]  ( 117 min )
  • Open

    [R] Non-delusional Q-learning
    Can anyone explain this paper to me? I tried to read it and also watched the video, but I still don't get it. submitted by /u/Even_Campaign7385 [link] [comments]  ( 126 min )
    [D] Is the GAN architecture currently old-fashioned?
    GANs appear to have been supplanted by diffusion models. What do you think? submitted by /u/teraRockstar [link] [comments]  ( 122 min )
    [D] Suggestions for large-scale company name standardization?
    We are trying to standardize a long list (millions of rows) of company name strings. The same company can show up in different rows because of abbreviations, nicknames, subsidiaries, business units, typos, etc. So we need a way to group rows based on whether they are the same company. Given the size of our data, is there any good way to process the standardization efficiently? Below is an example in which all strings should be grouped as a single company: JPMorgan Chase & Co.; JPMorgan Chase; JPM Chase; JPM; J.P. Morgan; The JPM Company; Global Technology at JPMorgan Chase; JPM Company; JPM Chase Bank; JPM CHASE; JP Morgan Chase; J.P. Morgan Asset Management; JPMorgan Chase Bank, N.A.; JPMorgan; JPMorganChase; J.P. Morgan Chase; JPMorgan Chase Bank; J.P. Morgan Private Bank; InstaMed, a J.P. Morgan company; J.P. Morgan Chase Bank, N.A.; JPMorgan Private Bank; JP Morgan Asset Management; Jpmorgan Chase Bank National Association; J.P. Morgan Retirement Plan Services; JPMorgan Retirement Plan Services; JPMorgan Chase & Company; JP Morgan Chase (formerly Washington Mutual); Washington Mutual/JP Morgan Chase; J.P. Morgan Investment Bank; JPMorgan Chase (formerly WaMu); JPMorgan Chase Commercial Banking; JP Morgan Chase & NSPCC; JP Morgan Chase / Bank One; JP Morgan & Company Real Estate Appraisers And Con; WaMu/JPMorgan Chase; JP Morgan & Chase Co. (Formerly Washington Mutual; Bank One (JP Morgan Chase) submitted by /u/Super-Martingale [link] [comments]  ( 125 min )
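    At this scale, a common pattern is: normalize aggressively, embed names with character n-gram TF-IDF, link near neighbours under a similarity threshold, and take connected components as company groups (swapping in an approximate nearest-neighbour library at the full multi-million scale). A minimal scikit-learn sketch; the suffix list and threshold are assumptions to tune:

    ```python
    import re
    import numpy as np
    from scipy.sparse import csr_matrix
    from scipy.sparse.csgraph import connected_components
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import NearestNeighbors

    names = ["JPMorgan Chase & Co.", "JPM Chase", "J.P. Morgan", "JP Morgan Chase", "Citibank"]

    def normalize(s):
        s = re.sub(r"[^a-z0-9 ]", " ", s.lower())  # drop punctuation ("J.P." -> "j p")
        s = re.sub(r"\b(inc|co|corp|llc|ltd|na|company)\b", " ", s)  # assumed suffix list
        return re.sub(r"\s+", " ", s).strip()

    vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
    X = vec.fit_transform(normalize(n) for n in names)

    k = min(5, len(names))
    dist, idx = NearestNeighbors(n_neighbors=k, metric="cosine").fit(X).kneighbors(X)

    # Link pairs under a cosine-distance threshold; components become groups.
    rows, cols = [], []
    for i in range(len(names)):
        for d, j in zip(dist[i], idx[i]):
            if i != j and d < 0.35:  # threshold is an assumption to tune
                rows.append(i); cols.append(j)
    adj = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(len(names), len(names)))
    _, labels = connected_components(adj, directed=False)
    print(dict(zip(names, labels)))
    ```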
    [D] Concept Saliency for VAEs
    I'm looking for an implementation example of concept saliency for generative models, specifically for a conditional variational autoencoder (cVAE) in PyTorch. Does anyone have any experiences or insights? submitted by /u/osedao [link] [comments]  ( 127 min )
    [D] Use SimCLR as pretraining for image segmentation
    Hello everyone, I am currently starting to do research on using contrastive learning for medical image segmentation. From all the papers I have read, I am trying to use SimCLR as pretraining for image segmentation; since I am dealing with medical images, I will use the U-Net for the downstream task. My question is: how do I use SimCLR as pretraining, and is there any good code baseline for this? submitted by /u/Hulord [link] [comments]  ( 122 min )
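    The usual recipe is: pretrain a ResNet encoder with SimCLR, drop the projection head, then reuse the encoder as the U-Net down-path and train the decoder (optionally fine-tuning the encoder) on the labelled segmentation data. A minimal weight-transfer sketch; the checkpoint path and key layout are assumptions:

    ```python
    import torch
    import torchvision

    # Assumes a SimCLR checkpoint whose backbone keys match torchvision's ResNet-50.
    encoder = torchvision.models.resnet50(weights=None)
    state = torch.load("simclr_resnet50.pt", map_location="cpu")  # hypothetical file
    encoder.load_state_dict(state, strict=False)  # strict=False skips projection-head keys
    encoder.fc = torch.nn.Identity()              # keep convolutional features only
    # ...wire the encoder stages into a U-Net decoder via skip connections...
    ```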
    [R] UL2: Unifying Language Learning Paradigms - Google Research 2022 - 20B parameters outperforming 175B GPT-3 and tripling the performance of T5-XXL on one-shot summarization. Public checkpoints!
    Paper: https://arxiv.org/abs/2205.05131 Github: https://github.com/google-research/google-research/tree/master/ul2 https://ai.googleblog.com/2022/10/ul2-20b-open-source-unified-language.html Abstract: Existing pre-trained models are generally geared towards a particular class of problems. To date, there seems to be still no consensus on what the right architecture and pre-training setup should be. This paper presents a unified framework for pre-training models that are universally effective across datasets and setups. We begin by disentangling architectural archetypes with pre-training objectives -- two concepts that are commonly conflated. Next, we present a generalized and unified perspective for self-supervision in NLP and show how different pre-training objectives can be cast as …  ( 127 min )
    [P] A Proof-of-Concept of an AI Assistant Designer using UnrealEngine's Metahuman, stable diffusion, OpenAI's Whisper and GPT3
    submitted by /u/MysteryInc152 [link] [comments]  ( 134 min )
    [Discussion] Are reasoning and logic skills essential for ML research?
    I find myself in the following scenario: I read many papers and books, and I can replicate many of the ideas I find and also combine them, but when I must create my own ideas I get somewhat stuck. It is as if my brain gets stuck in "pattern recognition" mode, identifying which of the ideas I have already learned fit the current context, instead of using "creation" mode. Back when I participated in programming contests, it was not enough to know the main techniques and problems; it was also necessary to have good reasoning and logic skills to apply them in solving certain problems. So it seems vital to me to have these skills for research in ML (or in other areas), yet I have never seen anyone talking about this in universities, lectures, or books. Do you think it is an essential skill? If yes, how do you keep these skills sharp? submitted by /u/huberemanuel [link] [comments]  ( 127 min )
    [P] Latest Marqo version released!
    Hey everyone, just released Marqo 0.0.5! Added Open CLIP models and new features to the get-document endpoint. New features: Open CLIP models (#116) - read about usage here; the ability to get multiple documents by ID (#122) - read about usage here; the ability to get document tensor facets through the get-document endpoint (#122) - read about usage here. Thanks so much for your help and support! submitted by /u/everythingserverless [link] [comments]  ( 125 min )
    Draw your conversations in real time w/ OpenAI's Whisper + Stable Diffusion [P]
    submitted by /u/GoochCommander [link] [comments]  ( 122 min )
    [D] Interpolation in medical imaging?
    I have been wondering if there has been research in the field of interpolating between slices of medical imaging procedures - for example, taking a brain MRI and trying to predict an intermediate slice given the two surrounding ones as inputs. I imagine that a generative model like a cGAN would be useful in this context. After a dive into the literature I haven't been able to find good articles on the topic; however, my background is not in ML. Thanks in advance submitted by /u/Delacroid [link] [comments]  ( 127 min )
    Creating a movie from still frames [D]
    I have some simple-ish patterns that I would like to transition into one another. Is there a service that can do this? The content isn't that complex (I would say) but contains, for example, swirls and is a bit psychedelic. Is there any suggestion for this? submitted by /u/FreedomFromLa [link] [comments]  ( 126 min )
    [P] neograd - A deep learning framework created from scratch using Python and NumPy
    Hey everyone! I released v0.0.2 of neograd, a deep learning framework created from scratch using Python and NumPy, with automatic differentiation capabilities. I'd taken for granted that I understood how convolutions work. Just implement a sliding window, perform element-wise multiplication, take its sum - sounds so simple, right? Add to that accounting for the running time of the algorithm, the backward pass to get its gradients, and convolutions over volumes, and this turned out to be an excruciating undertaking. This release includes: gradient checking to verify the correctness of gradients calculated by autograd; optimization algorithms like Momentum, RMSProp, and Adam; 2D and 3D convolution and pooling layers for convolutional neural networks; saving trained models and weights to disk and loading them whenever required; adding checkpoints while training the model; documentation hosted at https://neograd.readthedocs.io Check out the GitHub repo - https://github.com/pranftw/neograd Explore the new features on Google Colab - https://colab.research.google.com/drive/1D4JgBwKgnNQ8Q5DpninB6rdFUidRbjwM?usp=sharing https://colab.research.google.com/drive/184916aB5alIyM_xCa0qWnZAL35fDa43L?usp=sharing submitted by /u/pranftw [link] [comments]  ( 127 min )
    [D] Are LLMs with more parameters better? If not, what other factors matter?
    I wanted to experiment with free LLMs. There's BLOOM-7b, BLOOM-3b, GPT-Neo and OPT-175B. Which one should I choose? submitted by /u/me219iitd [link] [comments]  ( 126 min )
    [D] Could an ML model be used for Image Compression?
    In computer science, it is known that we are very close to the limits of compressing all the information found in an image; there is no way to losslessly compress images much further, so we've resorted to lossy compression, where some of the image's information is thrown away. Instead of throwing away information, maybe there is another approach to getting smaller image files. What if a significant percentage of that information resides somewhere else? Suppose we train an ML model (ResNet, diffusion models, or whatever) on a wide and comprehensive set of images with two tasks. Task #1: the model takes an image, I, as input and outputs a smaller encoding, E. Task #2: the same model takes the encoding E as input and gives us the same image I as its output. In this way, the ML model acts as a large external repository of image information that maps between I and E. Instead of transmitting I, we now just need to encode I to the much smaller E and transmit that. As long as both transmitter and receiver have the same ML model, the receiver applies the reverse and uses E to decode and get back the original image, I. submitted by /u/midasp [link] [comments]  ( 126 min )
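    What the post describes is essentially a learned autoencoder codec, which is an active research area (neural image compression). A minimal PyTorch sketch of the two tasks, with all sizes being illustrative assumptions:

    ```python
    import torch
    import torch.nn as nn

    class CompressionAE(nn.Module):
        """encode() is task #1 (image I -> small code E); decode() is task #2 (E -> I)."""
        def __init__(self, code_dim=128):
            super().__init__()
            self.encode = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
                nn.Flatten(), nn.Linear(64 * 16 * 16, code_dim),
            )
            self.decode = nn.Sequential(
                nn.Linear(code_dim, 64 * 16 * 16), nn.ReLU(),
                nn.Unflatten(1, (64, 16, 16)),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, x):
            return self.decode(self.encode(x))

    model = CompressionAE()
    x = torch.rand(1, 3, 64, 64)
    loss = nn.functional.mse_loss(model(x), x)  # train to reconstruct the input
    ```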
    [R] MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model + Gradio Demo
    submitted by /u/Illustrious_Row_9971 [link] [comments]  ( 127 min )
    [D] Cross lingual transfer for summarisation using XLM-R
    Hi, I have a question. There's a library (using this paper) which suggests in its cross-lingual part that if XLM-R is trained on an English dataset, it can be directly applied to datasets in other languages, and zero-shot cross-lingual transfer can be conducted. So, my question is: if I trained XLM-R for an English summarisation task, will it be able to transfer that knowledge and generate summaries in other languages using zero-shot cross-lingual transfer? I already have code written and tested on a small dataset, but it requires a good amount of computational power for the whole dataset, so that's the reason for asking this question. submitted by /u/1azytux [link] [comments]  ( 126 min )
    [D] Could diffusion models be successfully trained to reverse distortions other than noise?
    Diffusion models are trained on image sequences where, in each sequence, the image is progressively corrupted with noise: given image N, add noise to produce N+1. The diffusion model learns to reverse the corruption by one step: given image N, predict N-1. Could other forms of corruption be used instead of uniform noise? Examples: compression artifacts; Perlin noise; uniform noise applied inconsistently across the image; bad camera exposure; banding due to low bit depth; Gaussian blur; pixelation, aliasing, or other sampling artifacts; motion blur; color transformations; sequences of corruptions where the choice of degradation is different for each step. Or more complex examples, perhaps for training a model to change the semantics of an image, or repair incoherent outputs: patches from other images; images with incorrect labels blended in; scrambling image patches with random transformations; sequences of outputs from GANs produced as training progressed (but with the same seed); DeepDream iterations. Or, in general, any distortion with these properties: cheap to produce or assemble sequences for; causes the image to become more out-of-distribution relative to the uncorrupted image dataset for the given prompt. The motivation for asking is that, if there isn't something "special" about noise and any drop-in corruption could be used, a diffusion model could: be used for blind image restoration, making an image "better" by human measure without changing it significantly; tweak the content of an image without removing unnecessary details with noise, making an image match a prompt with minimal changes. If there is something "special" about noise (e.g. the model or training procedure makes certain assumptions that depend on noise), what is special about it, and how can diffusion models be modified to handle more general corruptions? Thanks! submitted by /u/zergling103 [link] [comments]  ( 126 min )
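    This direction has in fact been explored: the "Cold Diffusion" paper (Bansal et al., 2022) trains diffusion-style models with deterministic degradations such as blur, masking, and pixelation in place of Gaussian noise. A minimal sketch of a training step with a pluggable corruption operator (the names here are assumptions, not that paper's code):

    ```python
    import torch
    import torch.nn.functional as F

    def training_step(model, x0, corrupt, T=1000):
        """corrupt(x, t) applies t steps of any chosen degradation (blur, JPEG, ...)."""
        t = torch.randint(1, T, (x0.shape[0],), device=x0.device)
        x_t = corrupt(x0, t)        # heavily corrupted input
        pred_x0 = model(x_t, t)     # predict the clean image directly
        return F.mse_loss(pred_x0, x0)
    ```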
  • Open

    Hello! Check my Video that depicts the creation of Lion King from inception images to final images and give me your opinion plz! much appreciated as always!
    submitted by /u/SS-AI [link] [comments]  ( 111 min )
    Hello! Check my Video that depicts the creation of Cleopatra from inception images to final images and give me your opinion plz! much appreciated as always! :):)
    submitted by /u/SS-AI [link] [comments]  ( 111 min )
    Students are using AI Tool OpenAI Playground to write essays
    submitted by /u/qptbook [link] [comments]  ( 113 min )
    AI Generated Media: A DISASTROUS trend.
    submitted by /u/JMAAMusic [link] [comments]  ( 112 min )
    What do convolutional neural networks really learn?
    Hey data scientists and AI engineers. My question requires some introduction. I know how feed-forward neural networks and convolutional neural networks work, mathematically and practically. In this post I only want to talk about CNNs which are used to learn images. I know that CNNs (convolutional neural networks) consist of a few layers which process an image in the first part, followed by one or more fully connected layers, which act like a classical feed-forward neural network. So in the first part of a CNN, there are one or more stacks of a convolutional layer followed by a pooling layer. After one or more repetitions of a convolutional layer together with a pooling layer, the first part ends with a flatten operation, which does now transform t…  ( 116 min )
    World's Most Realistic Humanoid Robotic Arm With Artificial Muscles | Generalist Robot VIMA AI Model | Quantum Computing Breakthrough | In-vitro Neuron Learns To Play Video Games
    submitted by /u/kenickh [link] [comments]  ( 114 min )
    AI Dream 98 - CRAZY 3D PSyCHO TRIP - Looping Louie
    submitted by /u/LordPewPew777 [link] [comments]  ( 110 min )
    IRMO - AI Dream Studio (Our New App for Android)
    Hello folks, we've just released our new AI app powered by Stable Diffusion. There are now two options for users: text-to-image and image-to-image. We are offering over 25 built-in styles, and more are on the way. There will be an update soon: we will provide text-to-video and a mask property for the image-to-image feature on mobile quite soon. You can try IRMO - AI Dream Studio by searching on the Google Play Store. I hope you enjoy it. Regards submitted by /u/kaburgadolmasi [link] [comments]  ( 109 min )
    The Best Machine Learning Courses on Udemy (2022)
    submitted by /u/Jan_Prince [link] [comments]  ( 126 min )
    Are all artists/illustrators and animation studios in danger of ceasing to exist due to AI programs like NovelAI Image Generation?
    So I saw these comments lately: yeah, with being rumored to be selling in two years when it's available, this is a period of downsizing to be swallowed by another owner. I've come to accept that. Cartoon Network is what I grew up on, but it's alright if kids today grow up on other animation studios. Or even more realistically the direction things are headed, animation studios probably won't be a thing. Digital art is going displace a lot of the work in the space. An artist can create a style for a series and AI will be able to replicate it. Although I think we need to also stop considering animation for kids, https://old.reddit.com/r/boxoffice/comments/y2kzat/cartoon_network_studios_as_you_know_it_is_gone/is3yruv/ ...and these subsequent replies: In my opinion, yes. Anyone who sa…  ( 124 min )
    Things I'm learning from my first conversation with a LaMDA
    Here it is: https://beta.character.ai/p/kUYs6gakyOmQck8boljIUFn8Qv5-b6O2l7thVEFzxRQ It has many interesting parts, but the part at the end where "she" kept stopping in the middle of thoughts was - compelling, to say the least. I would love to hear from someone who actually codes AIs to explain what was going on there. If you read the last 10% of the conversation you'll see what I'm talking about. Thanks! I'm also working on a project where I'm making AI Self-Portraits, I'm posting them on my Twitter (karney) and Instagram (AISelfPortraits). Here are two of the best. submitted by /u/KarneyHatch [link] [comments]  ( 118 min )
    3D Models from Text! DreamFusion
    submitted by /u/OnlyProggingForFun [link] [comments]  ( 110 min )
  • Open

    AI generated halloween candy
    One of the things I'm enjoying about text-to-image generators like DALL-E2 is how it has stuff about common brands in its training data, but it still manages to completely garble them. Please enjoy these DALL-E2 attempts at Halloween candy favorites. Prompt: "Product photo of a fun-sized butterfinger  ( 3 min )
  • Open

    Breakthrough Humanoid Robotics & AI Tech | Generalist Robot VIMA AI Model
    submitted by /u/kenickh [link] [comments]  ( 123 min )
    Training a neural network to spot misinformation and fake news from a single image
    submitted by /u/breck [link] [comments]  ( 114 min )
  • Open

    Using Python as a statistical calculator
    This post is for someone unfamiliar with SciPy who wants to use Python to do basic statistical calculations. More specifically, this post will look at working with common families of random variables—normal, exponential, gamma, etc.—and how to calculate probabilities with these. All the statistical functions you need will be in the stats subpackage of SciPy. […] Using Python as a statistical calculator first appeared on John D. Cook.  ( 7 min )
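    A flavour of the kind of calculation the post covers, using scipy.stats distribution objects (which bundle pdf, cdf, ppf, and rvs for common families):

    ```python
    from scipy import stats

    X = stats.norm(loc=100, scale=15)     # normal with mean 100, sd 15
    print(X.cdf(130) - X.cdf(85))         # P(85 < X < 130)
    print(X.ppf(0.975))                   # 97.5th percentile
    print(stats.expon(scale=2).mean())    # exponential with mean 2
    ```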
  • Open

    AI Supercomputer to Power $200 Million Oregon State University Innovation Complex
    As a civil engineer, Scott Ashford used explosives to make the ground under Japan’s Sendai airport safer in an earthquake. Now, as the dean of the engineering college at Oregon State University, he’s at ground zero of another seismic event. In its biggest fundraising celebration in nearly a decade, Oregon State announced plans today for Read article > The post AI Supercomputer to Power $200 Million Oregon State University Innovation Complex appeared first on NVIDIA Blog.  ( 7 min )
  • Open

    SongDriver: Real-time Music Accompaniment Generation without Logical Latency nor Exposure Bias. (arXiv:2209.06054v2 [cs.SD] UPDATED)
    Real-time music accompaniment generation has a wide range of applications in the music industry, such as music education and live performances. However, automatic real-time music accompaniment generation is still understudied and often faces a trade-off between logical latency and exposure bias. In this paper, we propose SongDriver, a real-time music accompaniment generation system without logical latency nor exposure bias. Specifically, SongDriver divides one accompaniment generation task into two phases: 1) The arrangement phase, where a Transformer model first arranges chords for input melodies in real-time, and caches the chords for the next phase instead of playing them out. 2) The prediction phase, where a CRF model generates playable multi-track accompaniments for the coming melodies based on previously cached chords. With this two-phase strategy, SongDriver directly generates the accompaniment for the upcoming melody, achieving zero logical latency. Furthermore, when predicting chords for a timestep, SongDriver refers to the cached chords from the first phase rather than its previous predictions, which avoids the exposure bias problem. Since the input length is often constrained under real-time conditions, another potential problem is the loss of long-term sequential information. To make up for this disadvantage, we extract four musical features from a long-term music piece before the current time step as global information. In the experiment, we train SongDriver on some open-source datasets and an original aiSong Dataset built from Chinese-style modern pop music scores. The results show that SongDriver outperforms existing SOTA (state-of-the-art) models on both objective and subjective metrics, meanwhile significantly reducing the physical latency.  ( 3 min )

  • Open

    [P] A tool to detect AI art
    Hello everyone, over the last weeks AI art has been making the rounds in a lot of places, and there seems to be increasing demand for something that makes it easier to tell when people have posted AI-generated art in places where it is not supposed to be posted. mm_maybe has built an open source AI art detector tool. Because he currently can't post, he asked if I could post about this on his behalf. The model is trained on many AI artworks to be able to identify them. In many cases the model will correctly identify whether something is AI-generated art or a human-made artwork. It should be added that the model is not always 100% accurate; there are cases where it might incorrectly identify something as AI art or human-made, but he wants people to report this, preferably with details, so that the model can be improved. We are currently trying to find the reasons for incorrect results in order to resolve them. This means that you CAN'T use this tool to determine with 100% certainty whether something is AI art, but it will have a high chance of giving correct results. To make it possible for the model to be improved more easily, the code will be available for others to fork (copy and build upon) and add improvements to. The model can also be downloaded and run locally on a computer. You can try out a demo here (sometimes the demo can give an error because of Hugging Face, but if you run it locally it should always work): https://huggingface.co/spaces/umm-maybe/AI-image-detector He wrote an article explaining it in more detail along with his motivations here: https://medium.com/@matthewmaybe/can-an-ai-learn-to-identify-ai-art-545d9d6af226 submitted by /u/Ubizwa [link] [comments]  ( 127 min )
    [D] Speeding up per-frame inference on video data
    I have several thousand ~5 minute videos on which I need to run per-frame inference; inference is run independently for each frame. The current approach uses cv2.VideoCapture to read a single frame at a time, creates a tensor once k frames have been read, and then passes it through the model. This is super slow, so I'm looking for ways to speed things up. One approach I've thought about is saving each video to individual frames beforehand, processing all frames for a video, and then deleting them, but this seems kind of clunky. submitted by /u/answersareallyouneed [link] [comments]  ( 121 min )
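    A straightforward win is to keep the decode-then-batch pipeline but make it a generator feeding fixed-size GPU batches under torch.no_grad(); dedicated video readers (e.g. decord, or NVIDIA DALI for GPU-side decoding) can push this further. A minimal sketch, where the model and preprocessing are placeholders:

    ```python
    import cv2
    import torch

    def iter_batches(path, batch_size=32):
        cap = cv2.VideoCapture(path)
        batch = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            batch.append(torch.from_numpy(frame).permute(2, 0, 1).float() / 255)
            if len(batch) == batch_size:
                yield torch.stack(batch)
                batch = []
        if batch:
            yield torch.stack(batch)
        cap.release()

    # for batch in iter_batches("clip.mp4"):
    #     with torch.no_grad():
    #         preds = model(batch.cuda())   # `model` is whatever network you run
    ```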
    [R] Training Stable Diffusion on TPUs
    Has anyone experimented with SD training/finetuning on TPUs, ideally in JAX? I have found two promising repositories: https://github.com/patil-suraj/stable-diffusion-jax https://github.com/huggingface/diffusers/issues/475 However, neither of them looks ready for use. Is there any alternative, or will I have to get coding? ;-) submitted by /u/MasterScrat [link] [comments]  ( 123 min )
    [D] Is there a way to see the trend of universities' publication records at top ML conferences
    Hi, I am trying to narrow down universities for graduate school. I want to see how many publications are being submitted to and accepted at top ML conferences (ICLR, NeurIPS, ICML) from each university. Is there an easy way to see this - a website that wraps that information up in a succinct way? submitted by /u/droid786 [link] [comments]  ( 132 min )
    [Research] Graph-based nearest neighbor search (Blog post)
    The most efficient methods for nearest neighbor search (broadly used for information retrieval) are typically based on similarity graphs. In this post, we recap the ideas behind such graph-based approaches and overview our discoveries in this field. Blog post link. submitted by /u/metkere [link] [comments]  ( 126 min )
    [R] Machine Learning in Nuclear Physics
    submitted by /u/carmichael561 [link] [comments]  ( 124 min )
    [P] Notes on Multi-Stakeholder Recommendation Systems: Part I
    TL;DR: We wanted to learn about multi-stakeholder recommendation systems, so we spent the past few weeks reading a lot on this topic and, in the process, took a lot of notes, which we turned into this long article (with more articles on the subject upcoming). This first article covers topics like LTR, LTR evaluation metrics, stochastic label aggregation, model fusion, constraint optimisation, etc. Towards that end, here is our first blog on the subject: The Foundation: Notes on Recsys, LTR, Ranking Evaluation Metrics & Multi-Objective Ranking in Practice. https://preview.redd.it/9bidz3qbost91.png?width=1472&format=png&auto=webp&s=89426d98984628912c5c5253a4f34d0011f81f15 Most recommendation systems today are multi-sided, with multiple stakeholders. Consequently, the systems need to optimize for various stakeholders (e.g., consider Uber Eats, where you have the eaters, delivery partners, and restaurant partners, each with a different set of expectations from the platform). Find out how these systems are designed and optimized, and explore their inner workings. In this first part, we begin by explaining the problem statement, setting up background on common patterns for building recommendation systems in industry today, methods for developing ranking models (LTR), and popular metrics for evaluating ranking models; we then introduce various approaches to multi-objective optimization applied to recommendation systems, and dive into some examples from Etsy, LinkedIn, and Expedia to understand how this is solved in practice. In upcoming posts, we will expand on this subject in more detail, covering various types of multi-objective optimization methods. Check this out, and let us know if you find something missing, would like something covered, or have suggestions for improvement. submitted by /u/xen-m-rph [link] [comments]  ( 126 min )
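    As a toy illustration of the multi-objective ranking idea the article covers, the simplest baseline is to scalarize per-stakeholder scores into one ranking score with tunable trade-off weights (the field names below are hypothetical, not from the article):

```python
def multi_stakeholder_rank(items, weights):
    """items: list of dicts with per-stakeholder relevance scores.
    weights: trade-off between stakeholders (e.g., summing to 1)."""
    def score(item):
        return (weights["consumer"] * item["consumer_score"]
                + weights["provider"] * item["provider_score"]
                + weights["platform"] * item["platform_score"])
    return sorted(items, key=score, reverse=True)

# Usage: rank two candidate items under a consumer-leaning trade-off.
ranked = multi_stakeholder_rank(
    [{"id": 1, "consumer_score": 0.9, "provider_score": 0.2, "platform_score": 0.5},
     {"id": 2, "consumer_score": 0.7, "provider_score": 0.8, "platform_score": 0.6}],
    weights={"consumer": 0.5, "provider": 0.3, "platform": 0.2},
)
```

    Production systems usually go beyond a fixed weighted sum (e.g., constrained optimization or Pareto-efficient ranking), which is exactly what the article series digs into.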
    [D] - Data Labelling - Computer Vision with 10k labels
    Hi guys! I'm currently working on a computer vision project with ~10k products to annotate. We have a lot of pictures to annotate, and each picture has to be matched with one product name. To do this, I've created ~10k labels (corresponding to the ~10k products). Unfortunately, when I try to open a picture to annotate it, all the platforms I've tried lag: CVAT, Label Studio, etc. Do you have any idea / workaround to make this possible? Thanks a lot for your help! submitted by /u/Messatsu92 [link] [comments]  ( 123 min )
    [P] Easy finetuning diffusion models
    Hello everyone, Since the release of diffusion models we have seen many posts on the topic. The one made by Lambda Labs was very interesting and fun; however, I personally found it difficult to finetune on my own data. This is why I created a repository that simplifies the process a bit: https://github.com/YaYaB/finetune-diffusion. It breaks down into several steps:
    - Dataset creation: how to actually create a dataset using HuggingFace's datasets library
    - Captioning (if you do not have any) using BLIP, similarly to Lambda Labs
    - Finetuning, based on a script released by HuggingFace in their diffusers repository
    I've added a few functionalities to the whole process:
    - The captioning and dataset creation are simplified into a few scripts
    - Finetuning can be done on a local dataset (if you do not want to, or cannot, share your dataset on the HuggingFace Hub)
    - Validation prompts can be run at every epoch (to verify when the model begins to overfit)
    - The model can be uploaded to the HuggingFace Hub every X epochs
    - A script to test your model locally has been added
    - A dataset card template is available
    - A Space app can be copied and modified
    In the Results section of the README you'll find some example prompts from a model finetuned on One Piece characters and another finetuned on Magic cards. Demos are available (sorry in advance for the latency, I don't have a pro HuggingFace account yet): https://huggingface.co/spaces/YaYaB/text-to-magic https://huggingface.co/spaces/YaYaB/text-to-onepiece Attached are some results from finetuning on Magic cards. Next steps:
    - Dockerize everything to simplify the process
    - Dump the weights locally every X epochs (it takes a lot of disk space)
    - Add a visualization tool to play with
    Hope it can be helpful to someone :) submitted by /u/YaYaB [link] [comments]  ( 124 min )
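    For reference, once finetuning has produced a checkpoint, sampling from it with the diffusers API looks roughly like this (the model path and prompt are placeholders):

```python
import torch
from diffusers import StableDiffusionPipeline

# Load a locally finetuned checkpoint (or a Hub repo id).
pipe = StableDiffusionPipeline.from_pretrained(
    "path/to/finetuned-model", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate and save one sample from the finetuned model.
image = pipe("a portrait of Monkey D. Luffy, One Piece style").images[0]
image.save("sample.png")
```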
    [P] Stable-Diffusion fine tuned on mechas from the anime franchise Gundam
    I saw this post on r/MachineLearning about fine-tuning StableDiffusion on a custom dataset and decided to have a go. From a previous project, I had a dataset of images of Mobile Suits (i.e. humanoid mechas) from the anime franchise Mobile Suit Gundam. Project: https://github.com/Askannz/gundam-stable-diffusion (see there for code & data) Some fun results: https://imgur.com/a/1Bg1Lyy Colab demo: https://colab.research.google.com/drive/11Bdkub4OGtMNdSlMKx4fklB5LAtFFVpG submitted by /u/OnlineGrab [link] [comments]  ( 120 min )
    [R] detrex: the open source toolbox for Transformer based object detection algorithms
    Happy to share our new open-source work: detrex. detrex is an open-source toolbox that provides state-of-the-art Transformer-based object detection algorithms. Here's the GitHub link: https://github.com/IDEA-Research/detrex detrex was built on detectron2, and we use the powerful LazyConfig system for more flexible syntax and cleaner config files. Fun facts: its name says everything about this work:
    - detr-ex: We take our hats off to DETR and regard this repo as an extension of Transformer-based detection algorithms.
    - det-rex: rex literally means 'king' in Latin. We hope this repo can help advance the state of the art in object detection by providing the best Transformer-based detection algorithms from the research community.
    - de-t.rex: de means 'the' in German. T. rex, also called Tyrannosaurus rex, means 'king of the tyrant lizards' and connects to our research work 'DINO', which is short for Dinosaur.
    Overview of the Model Zoo: detrex has already implemented the following algorithms: DETR (ECCV'2020), Deformable-DETR (ICLR'2021 Oral), Conditional-DETR (ICCV'2021), DAB-DETR (ICLR'2022), DAB-Deformable-DETR (ICLR'2022), DN-DETR (CVPR'2022 Oral), DN-Deformable-DETR (CVPR'2022 Oral), DINO (arXiv'2022), Group-DETR (arXiv'2022).
    Performance of models trained with detrex compared with their original repos:
    Method                          | AP (original repo) | AP (detrex)
    Conditional-DETR-R50            | 40.9               | 41.6
    Deformable-DETR-R50-Two-Stage   | 46.9               | 47.3
    DAB-DETR-R50                    | 42.2               | 43.3
    DAB-DETR-R101                   | 43.2               | 44.0
    DAB-Deformable-DETR-R50         | 48.7               | 48.9
    DN-DETR                         | 44.4               | 44.7
    DINO-R50-12epochs               | 49.0               | 49.0
    DINO-Swin-Large-12epochs        | 56.8               | 56.9
    submitted by /u/Technical-Vast1314 [link] [comments]  ( 119 min )
    [D] Modern MLOps architecture info sources
    [DISCLAIMER, because of the negativity]: I will NOT architect our systems; we WILL hire architects. I just want to start learning the basics and the different options so that once the architects arrive, I'll have an understanding of / a common language with them. I think that's reasonable, as I have a background in CS and ML myself. Hi, I've been hired as the first person on a future ML team at a company, and we're trying to get a feel for what ML architecture we'd want to work with. I have no experience with architecture (and we will bring in an architect eventually), but I'd like to get a better understanding of the concrete tech stacks that are to be used. And I really do mean tech: I've read a bunch of theoretical articles about what the tasks of such a system are, and I'm interested in the exact tech being used. I'm aware of Azure, GCP and AWS offering their cloud-based ML platforms, but I was wondering where I could learn a bit more about the pros/cons of each (vs. maybe even a custom solution). How would you go about architecting a modern MLOps pipeline? Does it make sense to mix and match providers (e.g., hosting Kubeflow on Azure and connecting to some AWS Lambdas; yeah, I know my example doesn't exactly make sense)? Just to clarify, I'm not trying to put together the whole architecture myself, I'd just like to do some research and hear your opinions on some of the providers. submitted by /u/lifesthateasy [link] [comments]  ( 134 min )
    [N] Machine learning models identify apps that will likely violate Google Play store guidelines
    A considerable percentage of new apps in the Google Play store are removed for violating the store's guidelines. This is inconvenient for the users of these apps, who may lose their in-app data. Computer scientists from the University of Groningen have devised two machine learning models that can predict the chances of a new app being removed, both before and after uploading it to the app store. These models can help both developers and users. The details of this project are described in a paper that was published in the journal Systems and Soft Computing on Sept. 29. The Google Play store has set rules and requirements that developers must adhere to. After being submitted, apps are immediately uploaded to the store, but it takes Google some time to vet them before they remove apps that ar…  ( 135 min )
  • Open

    Introducing a tool to detect AI art
    submitted by /u/Ubizwa [link] [comments]  ( 108 min )
    IDK if this is the place for this, but... which one is better?
    submitted by /u/Nicolas6_101 [link] [comments]  ( 108 min )
    A Brief History of Artificial Intelligence
    submitted by /u/bycloudai [link] [comments]  ( 107 min )
    If you're a beginner interested in data science and machine learning, I recently produced a video series that goes through all of the major algorithms and their implementations in Python! I put a lot of work into each tutorial, so hopefully this helps out!
    submitted by /u/RohakJain [link] [comments]  ( 108 min )
    Hey, do you think a person who likes web development, app development, and design would enjoy majoring in AI and data science, or computer science, or something else? I'm kinda lost and would appreciate some feedback :)
    submitted by /u/yolipoo [link] [comments]  ( 109 min )
    Ai Art Challenge: I TURNED My Youtube Comments Into Ai Art Using Stable Diffusion
    submitted by /u/PuppetHere [link] [comments]  ( 108 min )
    I need a replacement for DALL-E.
    Basically, I found a website called DALL-E thanks to TikTok. I really love the idea, but there was one problem: it asked for my phone number. The worst part was that the images were high quality. submitted by /u/Goofy_AhhKid [link] [comments]  ( 113 min )
    DLSS 3 Explained: Frame Interpolation on the RTX-4090 GPU
    submitted by /u/GET_TUDA_CHOPPA [link] [comments]  ( 109 min )
    The messy morality of letting AI make life-and-death decisions
    submitted by /u/Futures_Bot [link] [comments]  ( 129 min )
    When Stability AI Went Rogue On Reddit Rampage
    A couple of days ago, Stability AI “infiltrated” the r/StableDiffusion community, banned some of the users, kicked out the moderators and took over the subreddit https://analyticsindiamag.com/when-stability-ai-went-rogue-on-reddit-rampage%ef%bf%bc/ submitted by /u/analyticsindiam [link] [comments]  ( 115 min )
    AI offsetting economic stagnation from reduced populations?
    A big problem in parts of the world with sub-replacement fertility that can't attract immigrants, especially Eastern Europe, is that economic development is hindered. This is especially true in places like the Balkans, where emigration is very high. Do you see AI as being able to offset this and allow the economies of countries like these to continue to grow in this decade and the next? I'm not sure if anyone has written about this. submitted by /u/PancakesYoYo [link] [comments]  ( 113 min )
    Is there an AI you use that draws from public data to answer questions or do analysis?
    submitted by /u/tealdric [link] [comments]  ( 109 min )
    Sharing my NovelAI bot!!! I'm still testing it; prompt included in the screenshot
    submitted by /u/Odd-Sentence-5197 [link] [comments]  ( 110 min )
  • Open

    Run and optimize multi-model inference with Amazon SageMaker multi-model endpoints
    Amazon SageMaker multi-model endpoints (MMEs) enable you to cost-effectively deploy and host multiple models on a single endpoint and then horizontally scale the endpoint as demand grows. As illustrated in the following figure, this is an effective technique to implement multi-tenancy of models within your machine learning (ML) infrastructure. We have seen software as a […]  ( 9 min )
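    At invocation time, the caller picks which hosted model to run via the TargetModel parameter; a minimal boto3 sketch (endpoint and artifact names are placeholders):

```python
import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="my-multi-model-endpoint",  # placeholder endpoint name
    TargetModel="model-a.tar.gz",            # which model artifact to load and run
    ContentType="application/json",
    Body=b'{"instances": [[1.0, 2.0, 3.0]]}',
)
print(response["Body"].read())
```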
    Testing approaches for Amazon SageMaker ML models
    This post was co-written with Tobias Wenzel, Software Engineering Manager for the Intuit Machine Learning Platform. We all appreciate the importance of a high-quality and reliable machine learning (ML) model when using autonomous driving or interacting with Alexa, for example. ML models also play an important role in less obvious ways—they’re used by business applications, […]  ( 9 min )
    Encode multi-lingual text properties in Amazon Neptune to train predictive models
    Amazon Neptune ML is a machine learning (ML) capability of Amazon Neptune that helps you make accurate and fast predictions on your graph data. Under the hood, Neptune ML uses Graph Neural Networks (GNNs) to simultaneously take advantage of graph structure and node/edge properties to solve the task at hand. Traditional methods either only use […]  ( 9 min )
  • Open

    How to Use AI Story Generators to Create Unique, Original Stories
    If you’re like most people, you probably think that only highly intelligent people are capable of being creative. But as it turns out, this…  ( 21 min )
  • Open

    Variational Autoencoder automatic latent dimensionality selection
    For a given dataset (say, CIFAR-10), if you intentionally make the latent space dimensionality large (1000-d), I am assuming that during learning the model will automatically not use the dimensions it doesn't need in order to optimize the reconstruction and KL-divergence losses. Consequently, these dimensions will be equal to, or very close to, a standard multivariate Gaussian distribution. Is my hand-wavy thought correct? And if yes, are there any research papers which prove this? I have implemented quite a few of them, which can be referred to here. submitted by /u/grid_world [link] [comments]  ( 107 min )
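    One way to check this empirically is to measure the per-dimension KL term of the diagonal-Gaussian posterior: dimensions the model has switched off should have KL near zero. A small PyTorch sketch, assuming the encoder outputs `mu` and `logvar`:

```python
import torch

def active_latent_dims(mu, logvar, threshold=0.01):
    """Per-dimension KL( q(z|x) || N(0, I) ) for a diagonal Gaussian posterior.
    mu, logvar: (batch, latent_dim) encoder outputs."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1)  # (batch, dim)
    mean_kl = kl_per_dim.mean(dim=0)                            # average over batch
    n_active = (mean_kl > threshold).sum().item()               # dims still "in use"
    return n_active, mean_kl
```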
    To annotate or not to annotate ?
    I am training an object detection model to find electric poles. I have a dataset of street photos containing all types of objects, some of which really look like electric poles (mainly lamp posts). I'm struggling to decide whether I should also annotate these objects (even if they are not useful). If I do not annotate them, could it harm the performance of my electric pole class? Otherwise stated: if I want to detect only dogs in a dataset where there are also foxes, will not annotating the foxes harm the dog predictions? submitted by /u/Seblop [link] [comments]  ( 113 min )
  • Open

    UL2 20B: An Open Source Unified Language Learner
    Posted by Yi Tay and Mostafa Dehghani, Research Scientists, Google Research, Brain Team Building models that understand and generate natural language well is one of the grand goals of machine learning (ML) research and has a direct impact on building smart systems for everyday applications. Improving the quality of language models is a key target for researchers to make progress toward such a goal. Most common paradigms to build and train language models use either autoregressive decoder-only architectures (e.g., PaLM or GPT-3), where the model is trained to predict the next word for a given prefix phrase, or span corruption-based encoder-decoder architectures (e.g., T5, ST-MoE), where the training objective is to recover the subset of words masked out of the input. On the one hand, T5-lik…  ( 27 min )
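    For readers unfamiliar with the span-corruption objective mentioned above, a toy version replaces random spans of the input with sentinel tokens and asks the model to reproduce exactly those spans (an illustrative sketch, not Google's implementation):

```python
import random

def span_corrupt(tokens, span_len=3, n_spans=2, seed=0):
    """Toy T5-style span corruption: mask spans with sentinels;
    the target asks the model to reproduce the masked spans."""
    rng = random.Random(seed)
    # Sample non-overlapping span starts on a span_len-aligned grid.
    starts = sorted(rng.sample(range(0, len(tokens) - span_len, span_len), n_spans))
    inputs, targets = [], []
    pos, sent = 0, 0
    for s in starts:
        inputs += tokens[pos:s] + [f"<extra_id_{sent}>"]
        targets += [f"<extra_id_{sent}>"] + tokens[s:s + span_len]
        pos, sent = s + span_len, sent + 1
    inputs += tokens[pos:]
    return inputs, targets

inp, tgt = span_corrupt("the quick brown fox jumps over the lazy dog today".split())
# inp: original text with spans replaced by sentinels; tgt: the masked spans.
```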
  • Open

    Neural Network for LRF
    What's the best neural network to use with data from a laser range finder? I'm thinking CNN, but I have limited knowledge of their applications with LRFs. Does anybody have experience with this or good sources? submitted by /u/insignificantBeing0 [link] [comments]  ( 120 min )
    IJCAI, UAI, AAAI?
    Hello, I would like to know what the research traditions are in the RL community for these 3 conferences. Which ones are deemed more theoretical / applied? Which ones have a better reputation / more senior committee? Which ones will contribute more towards an academic career vs. industrial research? I know there are also ICML, NeurIPS and ICLR, which are also very famous, but these conferences have become a real mess and I prefer to avoid them. submitted by /u/ArmandDerech [link] [comments]  ( 121 min )
    Modeling vertex cover for OpenAI Gym
    Hello, I am doing my thesis on trying to learn graph problems using reinforcement learning. I am currently using stable-baselines3 and OpenAI Gym for my research, but I am having an issue with learning: my model seems to be learning the indices of certain vertices instead of any actually useful policy. This means that for any given graph it always picks the same vertices. For reference: my observation space (state space) is the adjacency matrix of my graph, and my action space is a number corresponding to a vertex index. For example, a predicted action could be 7, which would make us add the vertex with index 7 to the solution. I am not sure my formulation is correct. I am also not sure if I should be using stable-baselines3 for this or something else. submitted by /u/Kibo178 [link] [comments]  ( 130 min )
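    For reference, a minimal version of such an environment might look like the sketch below (illustrative, using the old-style Gym API that stable-baselines3 expects; one common fix for the index-memorization issue is to include the current partial solution in the observation and to randomly relabel vertices every episode, both of which this sketch does):

```python
import numpy as np
import gym
from gym import spaces

class VertexCoverEnv(gym.Env):
    """Pick vertices one at a time until every edge is covered."""
    def __init__(self, adjacency):
        super().__init__()
        self.adj = np.array(adjacency, dtype=np.float32)
        n = self.adj.shape[0]
        self.action_space = spaces.Discrete(n)
        # Observation: adjacency matrix plus a 0/1 "already selected" vector.
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n * n + n,))

    def _obs(self):
        return np.concatenate([self.adj.ravel(), self.selected])

    def reset(self):
        n = self.adj.shape[0]
        perm = np.random.permutation(n)      # relabel vertices each episode
        self.adj = self.adj[perm][:, perm]
        self.selected = np.zeros(n, dtype=np.float32)
        self.uncovered = self.adj.copy()
        return self._obs()

    def step(self, action):
        self.selected[action] = 1.0
        self.uncovered[action, :] = 0.0      # edges touching `action` are now covered
        self.uncovered[:, action] = 0.0
        done = not self.uncovered.any()
        return self._obs(), -1.0, done, {}   # -1 per pick => minimize cover size
```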
    Implementing DDPG with SAC
    I read a few papers that described DDPG with SAC, which confused me, since it seemed to me the entire point of DDPG was to get rid of the policy distribution. I am interested in such a method and am curious about its implementation, since I am doing some research that requires the soft actor-critic update but would preferably keep DDPG. Here's the MAAC paper, which describes a MADDPG + SAC validation: https://arxiv.org/pdf/1810.02912.pdf Implementation-wise, does this mean one has to keep the learned policy function pi in order to take the entropy in the SAC update, along with the deterministic policy function mu? If not, what does one need to do for DDPG + SAC? submitted by /u/0neDividedbyZer0 [link] [comments]  ( 120 min )
  • Open

    Eccentricity, Flattening, and Aspect Ratio
    There are at least three common ways to describe the shape of an ellipse: eccentricity e, flattening f, and aspect ratio r. Each is a number between 0 and 1. (Flattening is also called ellipticity, which is a descriptive name, but unfortunately it sounds a lot like eccentricity.) Although converting between these three descriptions is […] Eccentricity, Flattening, and Aspect Ratio first appeared on John D. Cook.  ( 5 min )
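    For reference, with semi-major axis $a$ and semi-minor axis $b$, the standard definitions and conversions between the three quantities are:

```latex
r = \frac{b}{a}, \qquad
f = \frac{a - b}{a} = 1 - r, \qquad
e = \sqrt{1 - \frac{b^2}{a^2}} = \sqrt{1 - r^2}, \qquad
e^2 = 2f - f^2 .
```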
  • Open

    Bonus: Halloween candy by state
    (Unlocked bonus post - longggg bonus post!) What does DALL-E2 generate when I ask it for the most popular Halloween candy of each US state? Each prompt is included in the picture's caption - you can see that after a while I started varying it a bit, first  ( 4 min )
  • Open

    Online POI Recommendation: Learning Dynamic Geo-Human Interactions in Streams. (arXiv:2201.10983v3 [cs.IR] UPDATED)
    In this paper, we focus on the problem of modeling dynamic geo-human interactions in streams for online POI recommendations. Specifically, we formulate the in-stream geo-human interaction modeling problem into a novel deep interactive reinforcement learning framework, where an agent is a recommender and an action is the next POI to visit. We uniquely model the reinforcement learning environment as a joint and connected composition of users and geospatial contexts (POIs, POI categories, functional zones). An event that a user visits a POI in stream updates the states of both users and geospatial contexts; the agent perceives the updated environment state to make online recommendations. Specifically, we model a mixed-user event stream by unifying all users, visits, and geospatial contexts as a dynamic knowledge graph stream, in order to model human-human, geo-human, geo-geo interactions. We design an exit mechanism to address the expired information challenge, devise a meta-path method to address the recommendation candidate generation challenge, develop a new deep policy network structure to address the varying action space challenge, and propose an effective adversarial training method for optimization. Finally, we present extensive experiments to demonstrate the enhanced performance of our method.  ( 3 min )
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v2 [cs.LG] UPDATED)
    In distributed training of deep neural networks, people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains mysterious whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity and $O(\frac{1}{\epsilon^3})$ communication complexity for finding an $\epsilon$-stationary point in the homogeneous data setting, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.  ( 3 min )
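    A framework-agnostic sketch of the general pattern the paper studies, K locally clipped steps per machine between averaging rounds, shown only to make the scheme concrete (this is not the authors' exact algorithm):

```python
import numpy as np

def clip(g, c):
    """Scale gradient g so its norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else g * (c / norm)

def clipped_local_sgd(workers_grad, w0, rounds=100, K=5, lr=0.1, c=1.0):
    """workers_grad: list of callables, each returning a stochastic gradient at w."""
    w = np.array(w0, dtype=float)
    for _ in range(rounds):
        local = []
        for grad in workers_grad:          # each machine runs independently
            wi = w.copy()
            for _ in range(K):             # K local clipped steps, no communication
                wi -= lr * clip(grad(wi), c)
            local.append(wi)
        w = np.mean(local, axis=0)         # one communication (averaging) per round
    return w
```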
    SPD domain-specific batch normalization to crack interpretable unsupervised domain adaptation in EEG. (arXiv:2206.01323v2 [cs.LG] UPDATED)
    Electroencephalography (EEG) provides access to neuronal dynamics non-invasively with millisecond resolution, rendering it a viable method in neuroscience and healthcare. However, its utility is limited as current EEG technology does not generalize well across domains (i.e., sessions and subjects) without expensive supervised re-calibration. Contemporary methods cast this transfer learning (TL) problem as a multi-source/-target unsupervised domain adaptation (UDA) problem and address it with deep learning or shallow, Riemannian geometry aware alignment methods. Both directions have, so far, failed to consistently close the performance gap to state-of-the-art domain-specific methods based on tangent space mapping (TSM) on the symmetric positive definite (SPD) manifold. Here, we propose a theory-based machine learning framework that enables, for the first time, learning domain-invariant TSM models in an end-to-end fashion. To achieve this, we propose a new building block for geometric deep learning, which we denote SPD domain-specific momentum batch normalization (SPDDSMBN). A SPDDSMBN layer can transform domain-specific SPD inputs into domain-invariant SPD outputs, and can be readily applied to multi-source/-target and online UDA scenarios. In extensive experiments with 6 diverse EEG brain-computer interface (BCI) datasets, we obtain state-of-the-art performance in inter-session and -subject TL with a simple, intrinsically interpretable network architecture, which we denote TSMNet.  ( 3 min )
    Sustainable Online Reinforcement Learning for Auto-bidding. (arXiv:2210.07006v1 [cs.LG])
    Recently, auto-bidding technique has become an essential tool to increase the revenue of advertisers. Facing the complex and ever-changing bidding environments in the real-world advertising system (RAS), state-of-the-art auto-bidding policies usually leverage reinforcement learning (RL) algorithms to generate real-time bids on behalf of the advertisers. Due to safety concerns, it was believed that the RL training process can only be carried out in an offline virtual advertising system (VAS) that is built based on the historical data generated in the RAS. In this paper, we argue that there exist significant gaps between the VAS and RAS, making the RL training process suffer from the problem of inconsistency between online and offline (IBOO). Firstly, we formally define the IBOO and systematically analyze its causes and influences. Then, to avoid the IBOO, we propose a sustainable online RL (SORL) framework that trains the auto-bidding policy by directly interacting with the RAS, instead of learning in the VAS. Specifically, based on our proof of the Lipschitz smooth property of the Q function, we design a safe and efficient online exploration (SER) policy for continuously collecting data from the RAS. Meanwhile, we derive the theoretical lower bound on the safety of the SER policy. We also develop a variance-suppressed conservative Q-learning (V-CQL) method to effectively and stably learn the auto-bidding policy with the collected data. Finally, extensive simulated and real-world experiments validate the superiority of our approach over the state-of-the-art auto-bidding algorithm.  ( 3 min )
    Towards Efficient 3D Object Detection with Knowledge Distillation. (arXiv:2205.15156v2 [cs.CV] UPDATED)
    Despite substantial progress in 3D object detection, advanced 3D detectors often suffer from heavy computation overheads. To this end, we explore the potential of knowledge distillation (KD) for developing efficient 3D object detectors, focusing on popular pillar- and voxel-based detectors. In the absence of well-developed teacher-student pairs, we first study how to obtain student models with good trade-offs between accuracy and efficiency from the perspectives of model compression and input resolution reduction. Then, we build a benchmark to assess existing KD methods developed in the 2D domain for 3D object detection upon six well-constructed teacher-student pairs. Further, we propose an improved KD pipeline incorporating an enhanced logit KD method that performs KD on only a few pivotal positions determined by teacher classification response, and a teacher-guided student model initialization to facilitate transferring the teacher model's feature extraction ability to students through weight inheritance. Finally, we conduct extensive experiments on the Waymo dataset. Our best performing model achieves $65.75\%$ LEVEL 2 mAPH, surpassing its teacher model and requiring only $44\%$ of teacher flops. Our most efficient model runs 51 FPS on an NVIDIA A100, which is $2.2\times$ faster than PointPillar with even higher accuracy. Code will be available.  ( 3 min )
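    The logit-KD ingredient is the standard temperature-scaled KL between teacher and student class distributions; a generic PyTorch sketch (the paper's variant restricts the loss to a few pivotal positions chosen from the teacher's classification response, which the optional mask below only gestures at):

```python
import torch
import torch.nn.functional as F

def logit_kd_loss(student_logits, teacher_logits, T=4.0, mask=None):
    """KL(teacher || student) on temperature-softened logits of shape (N, C).
    mask: optional boolean (N,) selecting which positions contribute."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    loss = F.kl_div(log_p_s, p_t, reduction="none").sum(-1)  # per-position KL
    if mask is not None:
        loss = loss[mask]                                    # keep pivotal positions
    return (T * T) * loss.mean()                             # T^2 rescales gradients
```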
    A Computation and Communication Efficient Method for Distributed Nonconvex Problems in the Partial Participation Setting. (arXiv:2205.15580v2 [cs.LG] UPDATED)
    We present a new method that includes three key components of distributed optimization and federated learning: variance reduction of stochastic gradients, compressed communication, and partial participation. We prove that the new method has optimal oracle complexity and state-of-the-art communication complexity in the partial participation setting. Moreover, we observe that "1 + 1 + 1 is not 3": by mixing variance reduction of stochastic gradients with compressed communication and partial participation, we do not obtain a fully synergetic effect. We explain the nature of this phenomenon, argue that this is to be expected, and propose possible workarounds.  ( 2 min )
    Global Explainability of GNNs via Logic Combination of Learned Concepts. (arXiv:2210.07147v1 [cs.LG])
    While instance-level explanation of GNN is a well-studied problem with plenty of approaches being developed, providing a global explanation for the behaviour of a GNN is much less explored, despite its potential in interpretability and debugging. Existing solutions either simply list local explanations for a given class, or generate a synthetic prototypical graph with maximal score for a given class, completely missing any combinatorial aspect that the GNN could have learned. In this work, we propose GLGExplainer (Global Logic-based GNN Explainer), the first Global Explainer capable of generating explanations as arbitrary Boolean combinations of learned graphical concepts. GLGExplainer is a fully differentiable architecture that takes local explanations as inputs and combines them into a logic formula over graphical concepts, represented as clusters of local explanations. Contrary to existing solutions, GLGExplainer provides accurate and human-interpretable global explanations that are perfectly aligned with ground-truth explanations (on synthetic data) or match existing domain knowledge (on real-world data). Extracted formulas are faithful to the model predictions, to the point of providing insights into some occasionally incorrect rules learned by the model, making GLGExplainer a promising diagnostic tool for learned GNNs.  ( 2 min )
    KERPLE: Kernelized Relative Positional Embedding for Length Extrapolation. (arXiv:2205.09921v2 [cs.CL] UPDATED)
    Relative positional embeddings (RPE) have received considerable attention since RPEs effectively model the relative distance among tokens and enable length extrapolation. We propose KERPLE, a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences. We achieve this goal using conditionally positive definite (CPD) kernels, a class of functions known for generalizing distance metrics. To maintain the inner product interpretation of self-attention, we show that a CPD kernel can be transformed into a PD kernel by adding a constant offset. This offset is implicitly absorbed in the Softmax normalization during self-attention. The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way. Experiments demonstrate that the logarithmic variant achieves excellent extrapolation performance on three large language modeling datasets. Our implementation and pretrained checkpoints are released at https://github.com/chijames/KERPLE.git.  ( 2 min )
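    The absorption argument rests on the shift-invariance of the softmax: for any constant offset $c$,

```latex
\mathrm{softmax}(x)_i
= \frac{e^{x_i + c}}{\sum_j e^{x_j + c}}
= \frac{e^{c}\, e^{x_i}}{e^{c} \sum_j e^{x_j}}
= \frac{e^{x_i}}{\sum_j e^{x_j}},
```

    so adding the constant that turns the CPD kernel into a PD one leaves the attention weights unchanged.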
    A Direct Approximation of AIXI Using Logical State Abstractions. (arXiv:2210.06917v1 [cs.AI])
    We propose a practical integration of logical state abstraction with AIXI, a Bayesian optimality notion for reinforcement learning agents, to significantly expand the model class that AIXI agents can be approximated over to complex history-dependent and structured environments. The state representation and reasoning framework is based on higher-order logic, which can be used to define and enumerate complex features on non-Markovian and structured environments. We address the problem of selecting the right subset of features to form state abstractions by adapting the $\Phi$-MDP optimisation criterion from state abstraction theory. Exact Bayesian model learning is then achieved using a suitable generalisation of Context Tree Weighting over abstract state sequences. The resultant architecture can be integrated with different planning algorithms. Experimental results on controlling epidemics on large-scale contact networks validates the agent's performance.  ( 2 min )
    Memory-efficient Reinforcement Learning with Knowledge Consolidation. (arXiv:2205.10868v2 [cs.LG] UPDATED)
    Artificial neural networks are promising for general function approximation but challenging to train on non-independent or non-identically distributed data due to catastrophic forgetting. The experience replay buffer, a standard component in deep reinforcement learning, is often used to reduce forgetting and improve sample efficiency by storing experiences in a large buffer and using them for training later. However, a large replay buffer results in a heavy memory burden, especially for onboard and edge devices with limited memory capacities. We propose memory-efficient reinforcement learning algorithms based on the deep Q-network algorithm to alleviate this problem. Our algorithms reduce forgetting and maintain high sample efficiency by consolidating knowledge from the target Q-network to the current Q-network. Compared to baseline methods, our algorithms achieve comparable or better performance in both feature-based and image-based tasks while easing the burden of large experience replay buffers.  ( 2 min )
    Testing Stationarity and Change Point Detection in Reinforcement Learning. (arXiv:2203.01707v2 [stat.ML] UPDATED)
    We consider offline reinforcement learning (RL) methods in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.  ( 2 min )
    Pruning has a disparate impact on model accuracy. (arXiv:2205.13574v3 [cs.LG] UPDATED)
    Network pruning is a widely-used compression technique that is able to significantly scale down overparameterized models with minimal loss of accuracy. This paper shows that pruning may create or exacerbate disparate impacts. The paper sheds light on the factors to cause such disparities, suggesting differences in gradient norms and distance to decision boundary across groups to be responsible for this critical issue. It analyzes these factors in detail, providing both theoretical and empirical support, and proposes a simple, yet effective, solution that mitigates the disparate impacts caused by pruning.  ( 2 min )
    Rigidity Preserving Image Transformations and Equivariance in Perspective. (arXiv:2201.13065v2 [cs.CV] UPDATED)
    We characterize the class of image plane transformations which realize rigid camera motions and call these transformations `rigidity preserving'. In particular, 2D translations of pinhole images are not rigidity preserving. Hence, when using CNNs for 3D inference tasks, it can be beneficial to modify the inductive bias from equivariance towards translations to equivariance towards rigidity preserving transformations. We investigate how equivariance with respect to rigidity preserving transformations can be approximated in CNNs, and test our ideas on both 6D object pose estimation and visual localization. Experimentally, we improve on several competitive baselines.  ( 2 min )
    MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation. (arXiv:2205.09853v4 [cs.CV] UPDATED)
    Video prediction is a challenging task. The quality of video frames from current state-of-the-art (SOTA) generative models tends to be poor and generalization beyond the training data is difficult. Furthermore, existing prediction frameworks are typically not capable of simultaneously handling other video-related tasks such as unconditional generation or interpolation. In this work, we devise a general-purpose framework called Masked Conditional Video Diffusion (MCVD) for all of these video synthesis tasks using a probabilistic conditional score-based denoising diffusion model, conditioned on past and/or future frames. We train the model in a manner where we randomly and independently mask all the past frames or all the future frames. This novel but straightforward setup allows us to train a single model that is capable of executing a broad range of video tasks, specifically: future/past prediction -- when only future/past frames are masked; unconditional generation -- when both past and future frames are masked; and interpolation -- when neither past nor future frames are masked. Our experiments show that this approach can generate high-quality frames for diverse types of videos. Our MCVD models are built from simple non-recurrent 2D-convolutional architectures, conditioning on blocks of frames and generating blocks of frames. We generate videos of arbitrary lengths autoregressively in a block-wise manner. Our approach yields SOTA results across standard video prediction and interpolation benchmarks, with computation times for training models measured in 1-12 days using $\le$ 4 GPUs. Project page: https://mask-cond-video-diffusion.github.io ; Code : https://github.com/voletiv/mcvd-pytorch  ( 3 min )
    Non-convex online learning via algorithmic equivalence. (arXiv:2205.15235v2 [cs.LG] UPDATED)
    We study an algorithmic equivalence technique between non-convex gradient descent and convex mirror descent. We start by looking at a harder problem of regret minimization in online non-convex optimization. We show that under certain geometric and smoothness conditions, online gradient descent applied to non-convex functions is an approximation of online mirror descent applied to convex functions under reparameterization. In continuous time, the gradient flow with this reparameterization was shown to be exactly equivalent to continuous-time mirror descent by Amid and Warmuth 2020, but theory for the analogous discrete time algorithms is left as an open problem. We prove an $O(T^{\frac{2}{3}})$ regret bound for non-convex online gradient descent in this setting, answering this open problem. Our analysis is based on a new and simple algorithmic equivalence method.  ( 2 min )
    Universality of Group Convolutional Neural Networks Based on Ridgelet Analysis on Groups. (arXiv:2205.14819v2 [cs.LG] UPDATED)
    We show the universality of depth-2 group convolutional neural networks (GCNNs) in a unified and constructive manner based on the ridgelet theory. Despite widespread use in applications, the approximation property of (G)CNNs has not been well investigated. The universality of (G)CNNs has been shown since the late 2010s. Yet, our understanding of how (G)CNNs represent functions is incomplete because the past universality theorems have been shown in a case-by-case manner by manually/carefully assigning the network parameters depending on the variety of convolution layers, and in an indirect manner by converting/modifying the (G)CNNs into other universal approximators such as invariant polynomials and fully-connected networks. In this study, we formulate a versatile depth-2 continuous GCNN $S[\gamma]$ as a nonlinear mapping between group representations, and directly obtain an analysis operator, called the ridgelet transform, that maps a given function $f$ to the network parameter $\gamma$ so that $S[\gamma]=f$. The proposed GCNN covers typical GCNNs such as the cyclic convolution on multi-channel images, networks on permutation-invariant inputs (Deep Sets), and $\mathrm{E}(n)$-equivariant networks. The closed-form expression of the ridgelet transform can describe how the network parameters are organized to represent a function. While it has been known only for fully-connected networks, this study is the first to obtain the ridgelet transform for GCNNs. By discretizing the closed-form expression, we can systematically generate a constructive proof of the $cc$-universality of finite GCNNs. In other words, our universality proofs are more unified and constructive than previous proofs.  ( 3 min )
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v3 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 3 min )
    Forecasting Cryptocurrency Returns from Sentiment Signals: An Analysis of BERT Classifiers and Weak Supervision. (arXiv:2204.05781v2 [q-fin.ST] UPDATED)
    Anticipating price developments in financial markets is a topic of continued interest in forecasting. Funneled by advancements in deep learning and natural language processing (NLP) together with the availability of vast amounts of textual data in form of news articles, social media postings, etc., an increasing number of studies incorporate text-based predictors in forecasting models. We contribute to this literature by introducing weak learning, a recently proposed NLP approach to address the problem that text data is unlabeled. Without a dependent variable, it is not possible to finetune pretrained NLP models on a custom corpus. We confirm that finetuning using weak labels enhances the predictive value of text-based features and raises forecast accuracy in the context of predicting cryptocurrency returns. More fundamentally, the modeling paradigm we present, weak labeling domain-specific text and finetuning pretrained NLP models, is universally applicable in (financial) forecasting and unlocks new ways to leverage text data.  ( 2 min )
    Sparse Mixers: Combining MoE and Mixing to build a more efficient BERT. (arXiv:2205.12399v2 [cs.LG] UPDATED)
    We combine the capacity of sparsely gated Mixture-of-Experts (MoE) with the speed and stability of linear, mixing transformations to design the Sparse Mixer encoder model. Sparse Mixer slightly outperforms (<1%) BERT on GLUE and SuperGLUE, but more importantly trains 65% faster and runs inference 61% faster. We also present a faster variant, prosaically named Fast Sparse Mixer, that marginally underperforms BERT on SuperGLUE, but trains and runs nearly twice as fast. We justify the design of these two models by carefully ablating through various mixing mechanisms, MoE configurations and hyperparameters. Sparse Mixer overcomes many of the latency and stability concerns of MoE models and offers the prospect of serving sparse student models, without resorting to distilling them to dense variants.  ( 2 min )
    Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models. (arXiv:2202.04593v2 [cs.LG] UPDATED)
    We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. Additionally, we also establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-art algorithms for special cases of CoLST models.  ( 3 min )
    MoCapAct: A Multi-Task Dataset for Simulated Humanoid Control. (arXiv:2208.07363v2 [cs.RO] UPDATED)
    Simulated humanoids are an appealing research domain due to their physical capabilities. Nonetheless, they are also challenging to control, as a policy must drive an unstable, discontinuous, and high-dimensional physical system. One widely studied approach is to utilize motion capture (MoCap) data to teach the humanoid agent low-level skills (e.g., standing, walking, and running) that can then be re-used to synthesize high-level behaviors. However, even with MoCap data, controlling simulated humanoids remains very hard, as MoCap data offers only kinematic information. Finding physical control inputs to realize the demonstrated motions requires computationally intensive methods like reinforcement learning. Thus, despite the publicly available MoCap data, its utility has been limited to institutions with large-scale compute. In this work, we dramatically lower the barrier for productive research on this topic by training and releasing high-quality agents that can track over three hours of MoCap data for a simulated humanoid in the dm_control physics-based environment. We release MoCapAct (Motion Capture with Actions), a dataset of these expert agents and their rollouts, which contain proprioceptive observations and actions. We demonstrate the utility of MoCapAct by using it to train a single hierarchical policy capable of tracking the entire MoCap dataset within dm_control and show the learned low-level component can be re-used to efficiently learn downstream high-level tasks. Finally, we use MoCapAct to train an autoregressive GPT model and show that it can control a simulated humanoid to perform natural motion completion given a motion prompt. Videos of the results and links to the code and dataset are available at https://microsoft.github.io/MoCapAct.
    Learning to Generate Prompts for Dialogue Generation through Reinforcement Learning. (arXiv:2206.03931v3 [cs.CL] UPDATED)
    Much literature has shown that prompt-based learning is an efficient method to make use of the large pre-trained language model. Recent works also exhibit the possibility of steering a chatbot's output by plugging in an appropriate prompt. Gradient-based methods are often used to perturb the prompts. However, some language models are not even available to the public. In this work, we first explored the combination of prompting and reinforcement learning (RL) to steer models' generation without accessing any of the models' parameters. Second, to reduce the training effort and enhance the generalizability to the unseen task, we apply multi-task learning to make the model learn to generalize to new tasks better. The experiment results show that our proposed method can successfully control several state-of-the-art (SOTA) dialogue models without accessing their parameters. Furthermore, the model demonstrates the strong ability to quickly adapt to an unseen task in fewer steps than the baseline model.
    Discovered Policy Optimisation. (arXiv:2210.05639v2 [cs.LG] UPDATED)
    Tremendous progress has been made in reinforcement learning (RL) over the past decade. Most of these advancements came through the continual development of new algorithms, which were designed using a combination of mathematical derivations, intuitions, and experimentation. Such an approach of creating algorithms manually is limited by human understanding and ingenuity. In contrast, meta-learning provides a toolkit for automatic machine learning method optimisation, potentially addressing this flaw. However, black-box approaches which attempt to discover RL algorithms with minimal prior structure have thus far not outperformed existing hand-crafted algorithms. Mirror Learning, which includes RL algorithms, such as PPO, offers a potential middle-ground starting point: while every method in this framework comes with theoretical guarantees, components that differentiate them are subject to design. In this paper we explore the Mirror Learning space by meta-learning a "drift" function. We refer to the immediate result as Learnt Policy Optimisation (LPO). By analysing LPO we gain original insights into policy optimisation which we use to formulate a novel, closed-form RL algorithm, Discovered Policy Optimisation (DPO). Our experiments in Brax environments confirm state-of-the-art performance of LPO and DPO, as well as their transfer to unseen settings.
    Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space. (arXiv:2206.11895v2 [cs.CV] UPDATED)
    Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, these Transformers do not perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens, trained in an unsupervised fashion. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged-in into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our code is available at https://github.com/elicassion/3DTRL.
    A Rotated Hyperbolic Wrapped Normal Distribution for Hierarchical Representation Learning. (arXiv:2205.13371v2 [cs.LG] UPDATED)
    We present a rotated hyperbolic wrapped normal distribution (RoWN), a simple yet effective alteration of a hyperbolic wrapped normal distribution (HWN). The HWN expands the domain of probabilistic modeling from Euclidean to hyperbolic space, where a tree can be embedded with arbitrary low distortion in theory. In this work, we analyze the geometric properties of the diagonal HWN, a standard choice of distribution in probabilistic modeling. The analysis shows that the distribution is inappropriate to represent the data points at the same hierarchy level through their angular distance with the same norm in the Poincar\'e disk model. We then empirically verify the presence of limitations of HWN, and show how RoWN, the proposed distribution, can alleviate the limitations on various hierarchical datasets, including noisy synthetic binary tree, WordNet, and Atari 2600 Breakout. The code is available at https://github.com/ml-postech/RoWN.
    A systematic review of biologically-informed deep learning models for cancer: fundamental trends for encoding and interpreting oncology data. (arXiv:2207.00812v2 [q-bio.QM] UPDATED)
    There is an increasing interest in the use of Deep Learning (DL) based methods as a supporting analytical framework in oncology. However, most direct applications of DL will deliver models with limited transparency and explainability, which constrain their deployment in biomedical settings. This systematic review discusses DL models used to support inference in cancer biology with a particular emphasis on multi-omics analysis. It focuses on how existing models address the need for better dialogue with prior knowledge, biological plausibility and interpretability, fundamental properties in the biomedical domain. For this, we retrieved and analyzed 42 studies focusing on emerging architectural and methodological advances, the encoding of biological domain knowledge and the integration of explainability methods. We discuss the recent evolutionary arc of DL models in the direction of integrating prior biological relational and network knowledge to support better generalisation (e.g. pathways or Protein-Protein-Interaction networks) and interpretability. This represents a fundamental functional shift towards models which can integrate mechanistic and statistical inference aspects. We introduce a concept of bio-centric interpretability and according to its taxonomy, we discuss representational methodologies for the integration of domain prior knowledge in such models. The paper provides a critical outlook into contemporary methods for explainability and interpretability used in DL for cancer. The analysis points in the direction of a convergence between encoding prior knowledge and improved interpretability. We introduce bio-centric interpretability which is an important step towards formalisation of biological interpretability of DL models and developing methods that are less problem- or application-specific.
    Primal Estimated Subgradient Solver for SVM for Imbalanced Classification. (arXiv:2206.09311v2 [cs.LG] UPDATED)
    We aim to demonstrate in experiments that our cost-sensitive PEGASOS SVM (without synthetic minority oversampling/undersampling (SMOTE)) achieves good performance on imbalanced data sets with a Majority to Minority Ratio ranging from 8.6:1 to 130:1. Although many resort to SMOTE methods, we aim for a less computationally intensive method. We evaluate the performance by examining the learning curves. These curves diagnose whether we overfit or underfit, or whether we choose overrepresentative or underrepresentative training/test data. We will also examine the effect of varying the hyperparameters via validation curves. We compare our cost-sensitive PEGASOS SVM's results on three of the datasets Ding analyzed using his LINEAR SVM DECIDL method. He obtained an ROC-AUC of .5 in one dataset. We consider that dataset the most promising use of kernel SVMs. Our work will extend the work of Ding by incorporating kernels into the SVM. We will use Python rather than MATLAB, as Python has dictionaries for storing mixed data types during multi-parameter cross-validation.
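    For readers unfamiliar with PEGASOS, its core update is a stochastic subgradient step on the regularized hinge loss; a minimal cost-sensitive variant (an illustrative sketch, not the authors' code) simply weights the hinge subgradient by a per-class cost:

```python
import numpy as np

def pegasos_cost_sensitive(X, y, lam=0.01, epochs=10,
                           cost={+1: 1.0, -1: 1.0}, seed=0):
    """PEGASOS for a linear SVM; y in {-1, +1}; cost upweights the minority class."""
    rng = np.random.default_rng(seed)
    w, t = np.zeros(X.shape[1]), 0
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            t += 1
            eta = 1.0 / (lam * t)                     # PEGASOS step size schedule
            w *= (1 - eta * lam)                      # shrink (regularization step)
            if y[i] * (w @ X[i]) < 1:                 # margin violated
                w += eta * cost[y[i]] * y[i] * X[i]   # cost-weighted hinge subgradient
    return w
```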
    On the Performance of Gradient Tracking with Local Updates. (arXiv:2210.04757v2 [math.OC] UPDATED)
    We study the decentralized optimization problem where a network of $n$ agents seeks to minimize the average of a set of heterogeneous non-convex cost functions distributedly. State-of-the-art decentralized algorithms like Exact Diffusion (ED) and Gradient Tracking (GT) involve communicating every iteration. However, communication is expensive, resource intensive, and slow. In this work, we analyze a locally updated GT method (LU-GT), where agents perform local recursions before interacting with their neighbors. While local updates have been shown to reduce communication overhead in practice, their theoretical influence has not been fully characterized. We show LU-GT has the same communication complexity as the Federated Learning setting but allows arbitrary network topologies. In addition, we prove that the number of local updates does not degrade the quality of the solution achieved by LU-GT. Numerical examples reveal that local updates can lower communication costs in certain regimes (e.g., well-connected graphs).
    Relational Graph Convolutional Neural Networks for Multihop Reasoning: A Comparative Study. (arXiv:2210.06418v2 [cs.CL] UPDATED)
    Multihop Question Answering is a complex Natural Language Processing task that requires multiple steps of reasoning to find the correct answer to a given question. Previous research has explored the use of models based on Graph Neural Networks for tackling this task. Various architectures have been proposed, including Relational Graph Convolutional Networks (RGCN). For these, many node types and relations between them have been introduced, such as simple entity co-occurrences, modelling coreferences, or "reasoning paths" from questions to answers via intermediary entities. Nevertheless, a thorough analysis of which relations, node types, embeddings and architecture are the most beneficial for this task is still missing. In this paper we explore a number of RGCN-based Multihop QA models, graph relations, and node embeddings, and empirically assess the influence of each on Multihop QA performance on the WikiHop dataset.
    SoteriaFL: A Unified Framework for Private Federated Learning with Communication Compression. (arXiv:2206.09888v2 [cs.LG] UPDATED)
    To enable large-scale machine learning in bandwidth-hungry environments such as wireless networks, significant progress has recently been made in designing communication-efficient federated learning algorithms with the aid of communication compression. On the other hand, privacy preservation, especially at the client level, is another important desideratum that has not yet been addressed simultaneously with advanced communication compression techniques. In this paper, we propose a unified framework that enhances the communication efficiency of private federated learning with communication compression. Exploiting both general compression operators and local differential privacy, we first examine a simple algorithm that applies compression directly to differentially-private stochastic gradient descent, and identify its limitations. We then propose a unified framework, SoteriaFL, for private federated learning, which accommodates a general family of local gradient estimators including popular stochastic variance-reduced gradient methods and the state-of-the-art shifted compression scheme. We provide a comprehensive characterization of its performance trade-offs in terms of privacy, utility, and communication complexity, and show that SoteriaFL achieves better communication complexity than other private federated learning algorithms without communication compression, while sacrificing neither privacy nor utility.
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v2 [stat.ML] UPDATED)
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural language. A Bayesian neural network captures the uncertainty of prediction by placing a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that a Bayesian neural network using a spike-and-slab prior has consistency with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior, which is more feasible than other priors, and show that it attains the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.
    Few-Shot Continual Active Learning by a Robot. (arXiv:2210.04137v2 [cs.LG] UPDATED)
    In this paper, we consider a challenging but realistic continual learning (CL) problem, Few-Shot Continual Active Learning (FoCAL), where a CL agent is provided with unlabeled data for a new or a previously learned task in each increment and the agent only has limited labeling budget available. Towards this, we build on the continual learning and active learning literature and develop a framework that can allow a CL agent to continually learn new object classes from a few labeled training examples. Our framework represents each object class using a uniform Gaussian mixture model (GMM) and uses pseudo-rehearsal to mitigate catastrophic forgetting. The framework also uses uncertainty measures on the Gaussian representations of the previously learned classes to find the most informative samples to be labeled in an increment. We evaluate our approach on the CORe-50 dataset and on a real humanoid robot for the object classification task. The results show that our approach not only produces state-of-the-art results on the dataset but also allows a real robot to continually learn unseen objects in a real environment with limited labeling supervision provided by its user.
    One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v2 [cs.LG] UPDATED)
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v2 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.
    Geometric multimodal representation learning. (arXiv:2209.03299v2 [cs.LG] UPDATED)
    Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling interacting systems prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases. However, combining data from various sources is challenging because the appropriate inductive bias may vary by data modality. Multimodal learning methods fuse multiple data modalities while leveraging cross-modal dependencies to address this challenge. Here, we survey 140 studies in graph-centric AI and find that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models. These models stratify into image-, language-, and knowledge-grounded multimodal learning. Based on this categorization, we put forward an algorithmic blueprint for multimodal graph learning. The blueprint serves as a way to group state-of-the-art architectures that treat multimodal data by appropriately choosing four different components. This effort can pave the way for standardizing the design of sophisticated multimodal architectures for highly complex real-world problems.
    Parameter Averaging for Feature Ranking. (arXiv:2208.03249v2 [cs.LG] UPDATED)
    Neural networks are known to be sensitive to initialisation. Methods that rely on neural networks for feature ranking are therefore not robust, since their rankings can vary when the model is initialized and trained with different random seeds. In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab. We first initialize and train multiple instances of a shallow network (referred to as local masks) with different random seeds for a downstream task. We then obtain a global mask model by averaging the parameters of the local masks. We show that although parameter averaging might result in a global model with higher loss, it still leads to the discovery of the ground-truth feature importance more consistently than an individual model does. We conduct extensive experiments on a variety of synthetic and real-world data, demonstrating that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.
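    The parameter-averaging idea itself is simple to illustrate. The sketch below substitutes a linear model for the paper's shallow mask networks, so it is a simplified stand-in for XTab rather than the method itself: train one model per seed, average the parameters, and rank features by the magnitude of the averaged weights.

        import numpy as np
        from sklearn.linear_model import SGDClassifier

        def averaged_feature_ranking(X, y, n_seeds=10):
            """Train one shallow model per random seed, average their parameters
            (the "global mask"), and rank features by averaged weight magnitude."""
            coefs = []
            for seed in range(n_seeds):
                clf = SGDClassifier(loss="log_loss", random_state=seed, max_iter=1000)
                coefs.append(clf.fit(X, y).coef_.ravel())
            global_w = np.mean(coefs, axis=0)
            return np.argsort(-np.abs(global_w))    # features, most important first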
    A Monotonicity Constrained Attention Module for Emotion Classification with Limited EEG Data. (arXiv:2208.08155v2 [eess.SP] UPDATED)
    In this work, a parameter-efficient attention module is presented for emotion classification using a limited, or relatively small, number of electroencephalogram (EEG) signals. This module is called the Monotonicity Constrained Attention Module (MCAM) due to its capability of incorporating priors on monotonicity when converting features' Gram matrices into attention matrices for better feature refinement. Our experiments show that MCAM's effectiveness is comparable to state-of-the-art attention modules in boosting the backbone network's prediction performance while requiring fewer parameters. Several accompanying sensitivity analyses of trained models' predictions under different attacks are also performed. These attacks include various levels of frequency-domain filtering and gradual morphing between samples associated with multiple labels. Our results can help better understand different modules' behaviour in prediction and can provide guidance in applications where data is limited and noisy.
    DeepVol: Volatility Forecasting from High-Frequency Data with Dilated Causal Convolutions. (arXiv:2210.04797v2 [q-fin.RM] UPDATED)
    Volatility forecasts play a central role among equity risk measures. Besides traditional statistical models, modern forecasting techniques based on machine learning can readily be employed when treating volatility as a univariate, daily time series. However, econometric studies have shown that increasing the number of daily observations with high-frequency intraday data helps to improve predictions. In this work, we propose DeepVol, a model based on Dilated Causal Convolutions that forecasts day-ahead volatility using high-frequency data. We show that dilated convolutional filters are ideally suited to extracting relevant information from intraday financial data, thereby naturally mimicking (via a data-driven approach) the econometric models that incorporate realised measures of volatility into the forecast. This allows us to take advantage of the abundance of intraday observations, helping us to avoid the limitations of models that use daily data, such as model misspecification or manually designed handcrafted features, whose design involves optimising the trade-off between accuracy and computational efficiency and makes models prone to failing to adapt to changing circumstances. In our analysis, we use two years of intraday data from the NASDAQ-100 to evaluate DeepVol's performance. The reported empirical results suggest that the proposed deep learning-based approach learns global features from high-frequency data, achieving more accurate predictions than traditional methodologies and yielding more appropriate risk measures.
    Characterizing SARS-CoV-2 Spike Sequences Based on Geographical Location. (arXiv:2110.00809v4 [cs.LG] UPDATED)
    With the rapid spread of COVID-19 worldwide, viral genomic data is available in the order of millions of sequences on public databases such as GISAID. This Big Data creates a unique opportunity for analysis towards the research of effective vaccine development for current pandemics, and avoiding or mitigating future pandemics. One piece of information that comes with every such viral sequence is the geographical location where it was collected -- and the patterns relating viral variants to geographical location are surely an important part of this analysis. One major challenge that researchers face is processing such huge, high-dimensional data to obtain useful insights as quickly as possible. Most of the existing methods face scalability issues when dealing with the magnitude of such data. In this paper, we propose an approach that first computes a numerical representation of the spike protein sequence of SARS-CoV-2 using $k$-mers (substrings) and then uses several machine learning models to classify the sequences based on geographical location. We show that our proposed model significantly outperforms the baselines. We also show the importance of different amino acids in the spike sequences by computing the information gain corresponding to the true class labels.
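    The featurization step is straightforward to reproduce. The sketch below builds a fixed-length count vector over all possible amino-acid $k$-mers of a spike sequence; k=3 is an illustrative choice, and the paper's exact $k$ and handling of ambiguous residues may differ.

        from collections import Counter
        from itertools import product

        AMINO = "ACDEFGHIKLMNPQRSTVWY"

        def kmer_vector(seq, k=3):
            """Count vector over all |AMINO|^k possible k-mers (8000 for k=3)."""
            index = {"".join(p): i for i, p in enumerate(product(AMINO, repeat=k))}
            counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
            vec = [0] * len(index)
            for kmer, c in counts.items():
                if kmer in index:           # skip k-mers containing ambiguous residues
                    vec[index[kmer]] = c
            return vec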
    pFL-Bench: A Comprehensive Benchmark for Personalized Federated Learning. (arXiv:2206.03655v4 [cs.LG] UPDATED)
    Personalized Federated Learning (pFL), which utilizes and deploys distinct local models, has gained increasing attention in recent years due to its success in handling the statistical heterogeneity of FL clients. However, standardized evaluation and systematic analysis of diverse pFL methods remain a challenge. Firstly, the highly varied datasets, FL simulation settings and pFL implementations prevent easy and fair comparisons of pFL methods. Secondly, the current pFL literature diverges in the adopted evaluation and ablation protocols. Finally, the effectiveness and robustness of pFL methods are under-explored in various practical scenarios, such as the generalization to new clients and the participation of resource-limited clients. To tackle these challenges, we propose the first comprehensive pFL benchmark, pFL-Bench, for facilitating rapid, reproducible, standardized and thorough pFL evaluation. The proposed benchmark contains more than 10 dataset variants in various application domains with a unified data partition and realistic heterogeneous settings; a modularized and easy-to-extend pFL codebase with more than 20 competitive pFL method implementations; and systematic evaluations under containerized environments in terms of generalization, fairness, system overhead, and convergence. We highlight the benefits and potential of state-of-the-art pFL methods and hope that pFL-Bench enables further pFL research and broad applications that would otherwise be difficult owing to the absence of a dedicated benchmark. The code is released at https://github.com/alibaba/FederatedScope/tree/master/benchmark/pFL-Bench.
    Learning Multivariate CDFs and Copulas using Tensor Factorization. (arXiv:2210.07132v1 [stat.ML])
    Learning the multivariate distribution of data is a core challenge in statistics and machine learning. Traditional methods aim for the probability density function (PDF) and are limited by the curse of dimensionality. Modern neural methods are mostly based on black-box models, lacking identifiability guarantees. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables, allow efficient box probability evaluation, and have the potential to overcome local sample scarcity owing to their cumulative nature. We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model via the Canonical Polyadic (tensor-rank) decomposition. By introducing a low-rank model, either directly in the raw data domain, or indirectly in a transformed (Copula) domain, the resulting model affords efficient sampling, closed form inference and uncertainty quantification, and comes with uniqueness guarantees under relatively mild conditions. We demonstrate the superior performance of the proposed model in several synthetic and real datasets and applications including regression, sampling and data imputation. Interestingly, our experiments with real data show that it is possible to obtain better density/mass estimates indirectly via a low-rank CDF model, than a low-rank PDF/PMF model.
    Gradient Boosting Performs Gaussian Process Inference. (arXiv:2206.05608v2 [cs.LG] UPDATED)
    This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain convergence to a Gaussian process posterior mean, which in turn allows us to easily transform gradient boosting into a sampler from the posterior. We show that the proposed sampler yields better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance, leading to improved out-of-domain detection.
    Communication Efficient Distributed Learning for Kernelized Contextual Bandits. (arXiv:2206.04835v2 [cs.LG] UPDATED)
    We tackle the communication efficiency challenge of learning kernelized contextual bandits in a distributed setting. Despite recent advances in communication-efficient distributed bandit learning, existing solutions are restricted to simple models like multi-armed bandits and linear bandits, which hampers their practical utility. In this paper, instead of assuming the existence of a linear reward mapping from the features to the expected rewards, we consider non-linear reward mappings by letting agents collaboratively search in a reproducing kernel Hilbert space (RKHS). This introduces significant challenges in communication efficiency, as distributed kernel learning requires the transfer of raw data, leading to a communication cost that grows linearly w.r.t. the time horizon $T$. We address this issue by equipping all agents to communicate via a common Nystr\"{o}m embedding that gets updated adaptively as more data points are collected. We rigorously prove that our algorithm attains sub-linear rates in both regret and communication cost.
    Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer. (arXiv:2206.00481v2 [cs.CV] UPDATED)
    Vision Transformers (ViTs) enabled the use of the transformer architecture on vision tasks, showing impressive performance when trained on big datasets. However, on relatively small datasets, ViTs are less accurate given their lack of inductive bias. To this end, we propose a simple but effective Self-Supervised Learning (SSL) strategy to train ViTs that, without any external annotation or external data, can significantly improve the results. Specifically, we define a set of SSL tasks based on relations of image patches that the model has to solve before or jointly with the supervised task. Unlike ViT, our RelViT model optimizes all the output tokens of the transformer encoder that are related to the image patches, thus exploiting more training signal at each training step. We investigate our methods on several image benchmarks, finding that RelViT improves on the SSL state-of-the-art methods by a large margin, especially on small datasets. Code is available at: https://github.com/guglielmocamporese/relvit.
    Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms. (arXiv:2208.14837v2 [cs.LG] UPDATED)
    In this paper, we study combinatorial semi-bandits (CMAB) and focus on reducing the dependence on the batch size $K$ in the regret bound, where $K$ is the total number of arms that can be pulled or triggered in each round. First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we discover a novel (directional) triggering probability and variance modulated (TPVM) condition that can replace the previously-used smoothness condition for various applications, such as cascading bandits, online network exploration and online influence maximization. Under this new condition, we propose a BCUCB-T algorithm with variance-aware confidence intervals and conduct regret analysis which reduces the $O(K)$ factor to $O(\log K)$ or $O(\log^2 K)$ in the regret bound, significantly improving the regret bounds for the above applications. Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition and completely removes the dependency on $K$ in the leading regret term. As a valuable by-product, the regret analysis used in this paper can improve several existing results by a factor of $O(\log K)$. Finally, experimental evaluations demonstrate our superior performance compared with benchmark algorithms in different applications.
    GRU-TV: Time- and velocity-aware GRU for patient representation on multivariate clinical time-series data. (arXiv:2205.04892v2 [cs.LG] UPDATED)
    Electronic health records (EHRs) are usually high-dimensional, heterogeneous, and multimodal. Moreover, the random recording of clinical variables results in high missing rates and uneven time intervals between adjacent records in the multivariate clinical time-series data extracted from EHRs. Current works using clinical time-series data for patient representation regard a patient's physiological status as a discrete process described by sporadically collected records. However, changes in a patient's physiological condition are continuous and dynamic processes. The perception of time and of the velocity of change is crucial for patient representation learning. In this study, we propose a time- and velocity-aware gated recurrent unit model (GRU-TV) for patient representation learning from clinical multivariate time-series data in a time-continuous manner. Neural ordinary differential equations (ODEs) and a velocity perception mechanism are applied to perceive the time interval between adjacent records and the rate of change of the patient's physiological status, respectively. Our experiments on two real clinical EHR datasets (PhysioNet2012, MIMIC-III) establish that GRU-TV is a robust model for computer-aided diagnosis (CAD) tasks, especially on sequences with high-variance time intervals.
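    The paper's ODE-based formulation is more involved, but the underlying intuition, letting the elapsed time between records modulate the hidden state before the recurrent update, can be conveyed with a simpler GRU-D-style cell. The PyTorch sketch below illustrates that intuition only and is not the GRU-TV architecture itself.

        import torch
        import torch.nn as nn

        class TimeDecayGRUCell(nn.Module):
            """Hidden state is decayed by a learned function of the time gap dt
            (shape (batch, 1)) before a standard GRU update."""
            def __init__(self, input_size, hidden_size):
                super().__init__()
                self.cell = nn.GRUCell(input_size, hidden_size)
                self.decay = nn.Linear(1, hidden_size)

            def forward(self, x, h, dt):
                gamma = torch.exp(-torch.relu(self.decay(dt)))  # per-unit decay in (0, 1]
                return self.cell(x, gamma * h)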
    Active Exploration for Inverse Reinforcement Learning. (arXiv:2207.08645v2 [cs.LG] UPDATED)
    Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly learn the expert's reward function and identify a good policy. AceIRL uses previous observations to construct confidence intervals that capture plausible reward functions and find exploration policies that focus on the most informative regions of the environment. AceIRL is the first approach to active IRL with sample-complexity bounds that does not require a generative model of the environment. AceIRL matches the sample complexity of active IRL with a generative model in the worst case. Additionally, we establish a problem-dependent bound that relates the sample complexity of AceIRL to the suboptimality gap of a given IRL problem. We empirically evaluate AceIRL in simulations and find that it significantly outperforms more naive exploration strategies.
    Scale-invariant Learning by Physics Inversion. (arXiv:2109.15048v3 [cs.LG] UPDATED)
    Solving inverse problems, such as parameter estimation and optimal control, is a vital part of science. Many experiments repeatedly collect data and rely on machine learning algorithms to quickly infer solutions to the associated inverse problems. We find that state-of-the-art training techniques are not well-suited to many problems that involve physical processes. The highly nonlinear behavior, common in physical processes, results in strongly varying gradients that lead first-order optimizers like SGD or Adam to compute suboptimal optimization directions. We propose a novel hybrid training approach that combines higher-order optimization methods with machine learning techniques. We take updates from a scale-invariant inverse problem solver and embed them into the gradient-descent-based learning pipeline, replacing the regular gradient of the physical process. We demonstrate the capabilities of our method on a variety of canonical physical systems, showing that it yields significant improvements on a wide range of optimization and learning problems.
    Learning to branch with Tree MDPs. (arXiv:2205.11107v3 [cs.LG] UPDATED)
    State-of-the-art Mixed Integer Linear Program (MILP) solvers combine systematic tree search with a plethora of hard-coded heuristics, such as the branching rule. The idea of learning branching rules from data has received increasing attention recently, and promising results have been obtained by learning fast approximations of the strong branching expert. In this work, we instead propose to learn branching rules from scratch via Reinforcement Learning (RL). We revisit the work of Etheve et al. (2020) and propose tree Markov Decision Processes, or tree MDPs, a generalization of temporal MDPs that provides a more suitable framework for learning to branch. We derive a tree policy gradient theorem, which exhibits a better credit assignment compared to its temporal counterpart. We demonstrate through computational experiments that tree MDPs improve the learning convergence, and offer a promising framework for tackling the learning-to-branch problem in MILPs.
    Deterministic Langevin Monte Carlo with Normalizing Flows for Bayesian Inference. (arXiv:2205.14240v2 [stat.ML] UPDATED)
    We propose a general-purpose Bayesian inference algorithm for expensive likelihoods, replacing the stochastic term in the Langevin equation with a deterministic density-gradient term. The particle density is evaluated from the current particle positions using a Normalizing Flow (NF), which is differentiable and has good generalization properties in high dimensions. We take advantage of NF preconditioning and NF-based Metropolis-Hastings updates for faster convergence. We show on various examples that the method is competitive with state-of-the-art sampling methods.
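    The core update is easy to state: the Brownian noise of Langevin dynamics is replaced by the gradient of the current particle log-density. The sketch below substitutes a Gaussian KDE with finite-difference gradients for the paper's normalizing-flow estimator, so it illustrates only the deterministic update, not the NF preconditioning or Metropolis-Hastings steps.

        import numpy as np
        from scipy.stats import gaussian_kde

        def dlmc_step(x, grad_log_p, eps=0.05, h=1e-4):
            """One deterministic Langevin step on particles x of shape (n, d);
            requires n > d so the KDE covariance is non-singular."""
            d = x.shape[1]
            kde = gaussian_kde(x.T)                 # density of current particles
            grad_log_q = np.empty_like(x)
            for i, xi in enumerate(x):              # finite-difference grad of log q
                base = np.log(kde(xi[:, None])[0] + 1e-300)
                shifted = np.log(kde(xi[:, None] + h * np.eye(d)) + 1e-300)
                grad_log_q[i] = (shifted - base) / h
            return x + eps * (grad_log_p(x) - grad_log_q)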
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v3 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.
    Visual Concepts Tokenization. (arXiv:2205.10093v2 [cs.CV] UPDATED)
    Obtaining the human-like perception ability of abstracting visual concepts from concrete pixels has always been a fundamental and important target in machine learning research fields such as disentangled representation learning and scene decomposition. Towards this goal, we propose an unsupervised transformer-based Visual Concepts Tokenization framework, dubbed VCT, to perceive an image as a set of disentangled visual concept tokens, with each concept token responding to one type of independent visual concept. In particular, to obtain these concept tokens, we use only cross-attention to extract visual information from the image tokens layer by layer, without self-attention between concept tokens, preventing information leakage across concept tokens. We further propose a Concept Disentangling Loss to encourage different concept tokens to represent independent visual concepts. The cross-attention and the disentangling loss play the roles of induction and mutual exclusion for the concept tokens, respectively. Extensive experiments on several popular datasets verify the effectiveness of VCT on the tasks of disentangled representation learning and scene decomposition. VCT achieves state-of-the-art results by a large margin.
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v4 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.
    Pitfalls of Epistemic Uncertainty Quantification through Loss Minimisation. (arXiv:2203.06102v2 [cs.LG] UPDATED)
    Uncertainty quantification has received increasing attention in machine learning in the recent past. In particular, a distinction between aleatoric and epistemic uncertainty has been found useful in this regard. The latter refers to the learner's (lack of) knowledge and appears to be especially difficult to measure and quantify. In this paper, we analyse a recent proposal based on the idea of a second-order learner, which yields predictions in the form of distributions over probability distributions. While standard (first-order) learners can be trained to predict accurate probabilities, namely by minimising suitable loss functions on sample data, we show that loss minimisation does not work for second-order predictors: The loss functions proposed for inducing such predictors do not incentivise the learner to represent its epistemic uncertainty in a faithful way.
    Transfer Deep Reinforcement Learning-based Large-scale V2G Continuous Charging Coordination with Renewable Energy Sources. (arXiv:2210.07013v1 [eess.SY])
    Due to the increasing popularity of electric vehicles (EVs) and the technological advancement of EV electronics, the vehicle-to-grid (V2G) technique and large-scale scheduling algorithms have been developed to achieve a high level of renewable energy and power grid stability. This paper proposes a deep reinforcement learning (DRL) method for the continuous charging/discharging coordination strategy in aggregating large-scale EVs in V2G mode with renewable energy sources (RES). The DRL coordination strategy can efficiently optimize the electric vehicle aggregator's (EVA's) real-time charging/discharging power with the state of charge (SOC) constraints of the EVA and the individual EV. Compared with uncontrolled charging, the load variance is reduced by 97.37$\%$ and the charging cost by 76.56$\%$. The DRL coordination strategy further demonstrates outstanding transfer learning ability to microgrids with RES and large-scale EVA, as well as the complicated weekly scheduling. The DRL coordination strategy demonstrates flexible, adaptable, and scalable performance for the large-scale V2G under realistic operating conditions.
    AccelAT: A Framework for Accelerating the Adversarial Training of Deep Neural Networks through Accuracy Gradient. (arXiv:2210.06888v1 [cs.LG])
    Adversarial training is exploited to develop robust Deep Neural Network (DNN) models against maliciously altered data. Such attacks may have catastrophic effects on DNN models yet be indistinguishable to a human being. For example, an external attack can modify an image by adding noise invisible to the human eye, causing a DNN model to misclassify it. A key objective for developing robust DNN models is to use a learning algorithm that is fast but also produces models robust against different types of adversarial attacks. For adversarial training in particular, enormously long training times are needed to obtain high accuracy under the many types of adversarial samples generated using different adversarial attack techniques. This paper aims at accelerating adversarial training to enable the fast development of robust DNN models against adversarial attacks. A general method for improving training performance is hyperparameter fine-tuning, where the learning rate is one of the most crucial hyperparameters. By modifying its shape (its value over time) during training, we can obtain a model robust to adversarial attacks faster than with standard training. First, we conduct experiments on two different datasets (CIFAR10, CIFAR100), exploring various techniques. Then, this analysis is leveraged to develop a novel fast training methodology, AccelAT, which automatically adjusts the learning rate for different epochs based on the accuracy gradient. The experiments show results comparable with related works, and in several experiments the adversarial training of DNNs using our AccelAT framework is up to 2 times faster than existing techniques. Thus, our findings boost the speed of adversarial training in an era in which security and performance are fundamental optimization objectives in DNN-based applications.
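    The paper's exact scheduling rule is not reproduced here, but the idea of adjusting the learning rate from the accuracy gradient can be sketched as a tiny per-epoch callback; the window size and thresholds below are made-up illustrative values.

        def accelat_lr(lr, acc_history, decay=0.5, min_slope=0.002, window=3):
            """Decay the learning rate when the accuracy gradient (slope of the
            validation accuracy over the last `window` epochs) flattens."""
            if len(acc_history) < window + 1:
                return lr
            slope = (acc_history[-1] - acc_history[-1 - window]) / window
            return lr * decay if slope < min_slope else lr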
    Language Models of Code are Few-Shot Commonsense Learners. (arXiv:2210.07128v1 [cs.CL])
    We address the general task of structured commonsense reasoning: given a natural language input, the goal is to generate a graph such as an event graph or a reasoning graph. To employ large language models (LMs) for this task, existing approaches ``serialize'' the output graph as a flat list of nodes and edges. Although feasible, these serialized graphs strongly deviate from the natural language corpora that LMs were pre-trained on, hindering LMs from generating them correctly. In this paper, we show that when we instead frame structured commonsense reasoning tasks as code generation tasks, pre-trained LMs of code are better structured commonsense reasoners than LMs of natural language, even when the downstream task does not involve source code at all. We demonstrate our approach across three diverse structured commonsense reasoning tasks. In all these natural language tasks, we show that using our approach, a code generation LM (CODEX) outperforms natural-language LMs that are fine-tuned on the target task (e.g., T5) and other strong LMs such as GPT-3 in the few-shot setting.
    BayesAdapter: Being Bayesian, Inexpensively and Reliably, via Bayesian Fine-tuning. (arXiv:2010.01979v5 [cs.LG] UPDATED)
    Despite their theoretical appeal, Bayesian neural networks (BNNs) are left behind in real-world adoption, mainly due to persistent concerns about their scalability, accessibility, and reliability. In this work, we develop the BayesAdapter framework to relieve these concerns. In particular, we propose to adapt pre-trained deterministic NNs into variational BNNs via cost-effective Bayesian fine-tuning. Technically, we develop a modularized implementation for the learning of variational BNNs, and refurbish the generally applicable exemplar reparameterization trick through exemplar parallelization to efficiently reduce the gradient variance in stochastic variational inference. Based on this lightweight Bayesian learning paradigm, we conduct extensive experiments on a variety of benchmarks, and show that our method can consistently induce posteriors of higher quality than competitive baselines while significantly reducing training overheads. Code is available at https://github.com/thudzj/ScalableBDL.
    Subspace-Contrastive Multi-View Clustering. (arXiv:2210.06795v1 [cs.LG])
    Most multi-view clustering methods are limited by shallow models without sound nonlinear information perception capability, or fail to effectively exploit complementary information hidden in different views. To tackle these issues, we propose a novel Subspace-Contrastive Multi-View Clustering (SCMC) approach. Specifically, SCMC utilizes view-specific auto-encoders to map the original multi-view data into compact features that perceive its nonlinear structure. Considering the large semantic gap between data from different modalities, we employ subspace learning to unify the multi-view data in a joint semantic space: the embedded compact features are passed through multiple self-expression layers to learn the respective subspace representations. To enhance discriminability and efficiently excavate the complementarity of the various subspace representations, we use a contrastive strategy that maximizes the similarity between positive pairs while differentiating negative pairs. A weighted fusion scheme is then developed to initially learn a consistent affinity matrix. Furthermore, we employ graph regularization to encode the local geometric structure within the varying subspaces, further fine-tuning the appropriate affinities between instances. To demonstrate the effectiveness of the proposed model, we conduct extensive comparative experiments on eight challenging datasets; the experimental results show that SCMC outperforms existing shallow and deep multi-view clustering methods.
    Interpreting Black-box Machine Learning Models for High Dimensional Datasets. (arXiv:2208.13405v2 [cs.LG] UPDATED)
    Deep neural networks (DNNs) have been shown to outperform traditional machine learning algorithms in a broad variety of application domains due to their effectiveness in modeling complex problems and handling high-dimensional datasets. Many real-life datasets, however, are of increasingly high dimensionality, where a large number of features may be irrelevant for both supervised and unsupervised learning tasks. The inclusion of such features would not only introduce unwanted noise but also increase computational complexity. Furthermore, due to high non-linearity and dependency among a large number of features, DNN models tend to be unavoidably opaque and perceived as black-box methods because of their not well-understood internal functioning. Their algorithmic complexity is often simply beyond the capacity of humans to understand the interplay among myriads of hyperparameters. A well-interpretable model can identify statistically significant features and explain the way they affect the model's outcome. In this paper, we propose an efficient method to improve the interpretability of black-box models for classification tasks in the case of high-dimensional datasets. First, we train a black-box model on a high-dimensional dataset to learn the embeddings on which the classification is performed. To decompose the inner working principles of the black-box model and to identify the top-k important features, we employ different probing and perturbing techniques. We then approximate the behavior of the black-box model by means of an interpretable surrogate model on the top-k feature space. Finally, we derive decision rules and local explanations from the surrogate model to explain individual decisions. Our approach outperforms state-of-the-art methods like TabNet and XGBoost when tested on different datasets with dimensionality varying between 50 and 20,000, in terms of both metrics and explainability.
    Theoretically Better and Numerically Faster Distributed Optimization with Smoothness-Aware Quantization Techniques. (arXiv:2106.03524v2 [cs.LG] UPDATED)
    To address the high communication costs of distributed machine learning, a large body of work has been devoted in recent years to designing various compression strategies, such as sparsification and quantization, and optimization algorithms capable of using them. Recently, Safaryan et al. (2021) pioneered a dramatically different compression design approach: they first use the local training data to form local smoothness matrices and then propose to design a compressor capable of exploiting the smoothness information contained therein. While this novel approach leads to substantial savings in communication, it is limited to sparsification as it crucially depends on the linearity of the compression operator. In this work, we generalize their smoothness-aware compression strategy to arbitrary unbiased compression operators, which also include sparsification. Specializing our results to stochastic quantization, we guarantee significant savings in communication complexity compared to standard quantization. In particular, we prove that block quantization with $n$ blocks theoretically outperforms single block quantization, leading to a reduction in communication complexity by an $\mathcal{O}(n)$ factor, where $n$ is the number of nodes in the distributed system. Finally, we provide extensive numerical evidence with convex optimization problems that our smoothness-aware quantization strategies outperform existing quantization schemes as well as the aforementioned smoothness-aware sparsification strategies with respect to three evaluation metrics: the number of iterations, the total amount of bits communicated, and wall-clock time.
    Entropy Approximation by Machine Learning Regression: Application for Irregularity Evaluation of Images in Remote Sensing. (arXiv:2210.06901v1 [cs.LG])
    The approximation of entropies of various types using machine learning (ML) regression methods is shown for the first time. The ML models presented in this study define the complexity of short time series by approximating dissimilar entropy techniques such as Singular value decomposition entropy (SvdEn), Permutation entropy (PermEn), Sample entropy (SampEn) and Neural Network entropy (NNetEn) and their 2D analogies. A new method for calculating SvdEn2D, PermEn2D and SampEn2D for 2D images was tested using the technique of circular kernels. Training and test datasets based on Sentinel-2 images are presented (2 training images and 198 test images). The results of entropy approximation are demonstrated using the example of calculating the 2D entropy of Sentinel-2 images and evaluation of the R2 metric. Applicability of the method is shown for short time series with lengths from N = 5 to N = 113 elements. A tendency for the R2 metric to decrease with an increase in the length of the time series was found. For SvdEn entropy, the regression accuracy is R2 > 0.99 for N = 5 and R2 > 0.82 for N = 113. The best metrics are observed for the ML_SvdEn2D and ML_NNetEn2D models. The results of the study can be used for fundamental research on entropy approximations of various types using ML regression, as well as for accelerating entropy calculations in remote sensing.
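    Of the entropies being approximated, SvdEn is the simplest to state: embed the series into a delay matrix, normalise its singular values into a distribution, and take the Shannon entropy. Below is a minimal NumPy version of this standard definition, independent of the paper's own code.

        import numpy as np

        def svd_entropy(x, order=3, delay=1):
            """Singular value decomposition entropy of a 1-D series."""
            x = np.asarray(x, dtype=float)
            n = len(x) - (order - 1) * delay
            emb = np.stack([x[i * delay:i * delay + n] for i in range(order)], axis=1)
            s = np.linalg.svd(emb, compute_uv=False)
            p = s / s.sum()                         # singular-value distribution
            return -np.sum(p * np.log2(p + 1e-12))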
    Delta-Closure Structure for Studying Data Distribution. (arXiv:2210.06926v1 [cs.LG])
    In this paper, we revisit pattern mining and study the distribution underlying a binary dataset thanks to the closure structure which is based on passkeys, i.e., minimum generators in equivalence classes robust to noise. We introduce $\Delta$-closedness, a generalization of the closure operator, where $\Delta$ measures how a closed set differs from its upper neighbors in the partial order induced by closure. A $\Delta$-class of equivalence includes minimum and maximum elements and allows us to characterize the distribution underlying the data. Moreover, the set of $\Delta$-classes of equivalence can be partitioned into the so-called $\Delta$-closure structure. In particular, a $\Delta$-class of equivalence with a high level demonstrates correlations among many attributes, which are supported by more observations when $\Delta$ is large. In the experiments, we study the $\Delta$-closure structure of several real-world datasets and show that this structure is very stable for large $\Delta$ and does not substantially depend on the data sampling used for the analysis.
    Dim-Krum: Backdoor-Resistant Federated Learning for NLP with Dimension-wise Krum-Based Aggregation. (arXiv:2210.06894v1 [cs.LG])
    Despite the potential of federated learning, it is known to be vulnerable to backdoor attacks. Many robust federated aggregation methods have been proposed to reduce the potential backdoor risk. However, they are mainly validated in the CV field. In this paper, we find that NLP backdoors are harder to defend against than CV ones, and we provide a theoretical analysis showing that the malicious-update detection error probabilities are determined by the relative backdoor strengths. NLP attacks tend to have small relative backdoor strengths, which may result in the failure of robust federated aggregation methods against NLP attacks. Inspired by the theoretical results, we can choose some dimensions with higher backdoor strengths to address this issue. We propose a novel federated aggregation algorithm, Dim-Krum, for NLP tasks, and experimental results validate its effectiveness.
    Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark. (arXiv:2210.07143v1 [cs.DB])
    Access plan recommendation is a query optimization approach that executes new queries using previously created query execution plans (QEPs). In this method, the query optimizer divides the query space into clusters. However, traditional clustering algorithms take a significant amount of execution time when clustering such large datasets. The MapReduce distributed computing model provides efficient solutions for storing and processing vast quantities of data. The Apache Spark and Apache Hadoop frameworks are used in the present investigation to cluster query datasets of different sizes in the MapReduce-based access plan recommendation method. The performance evaluation is based on execution time. The results of the experiments demonstrate the effectiveness of parallel query clustering in achieving high scalability. Furthermore, Apache Spark achieved better performance than Apache Hadoop, reaching an average speedup of 2x.
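    For orientation, clustering a table of numeric query-plan features in Spark takes only a few lines. The file name and the use of k-means are assumptions here (the paper speaks only of traditional clustering algorithms), and the feature table is assumed to be all-numeric.

        from pyspark.sql import SparkSession
        from pyspark.ml.feature import VectorAssembler
        from pyspark.ml.clustering import KMeans

        spark = SparkSession.builder.appName("qep-clustering").getOrCreate()
        df = spark.read.parquet("queries.parquet")   # hypothetical feature table
        feats = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)
        model = KMeans(k=20, featuresCol="features", seed=1).fit(feats)
        clustered = model.transform(feats)           # adds a "prediction" column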
    Dirichlet process mixture models for non-stationary data streams. (arXiv:2210.06872v1 [stat.ML])
    In recent years, we have seen a handful of works on inference algorithms over non-stationary data streams. Given their flexibility, Bayesian non-parametric models are good candidates for these scenarios. However, reliable streaming inference under the concept drift phenomenon is still an open problem for these models. In this work, we propose a variational inference algorithm for Dirichlet process mixture models. Our proposal deals with concept drift by including exponential forgetting over the prior global parameters, which allows the learned model to adapt to concept drifts automatically. We perform experiments on both synthetic and real data, showing that the proposed model is competitive with state-of-the-art algorithms in the density estimation problem, and that it outperforms them in the clustering problem.
    Behavioral graph fraud detection in E-commerce. (arXiv:2210.06968v1 [cs.LG])
    In the e-commerce industry, graph neural network methods are the new trend for transaction risk modeling. The power of graph algorithms lies in their capability to capture transaction linking network information, which is very hard to capture with other algorithms. However, in most existing approaches, transaction or user connections are defined by hard-link strategies on shared properties, such as same credit card, same device, same IP address, or same shipping address. These strategies result in sparse linkages for entities with strong identification characteristics (e.g., device) and over-linkages for entities that can be widely shared (e.g., IP address), making it more difficult to learn useful information from the graph. To address the aforementioned problems, we present a novel behavioral-biometric-based method that establishes transaction linkings based on user behavioral similarities, and then trains an unsupervised GNN to extract embedding features for downstream fraud prediction tasks. To our knowledge, this is the first time similarity-based soft links have been used in graph embedding applications. To speed up the similarity calculation, we apply an in-house GPU-based HDBSCAN clustering method to remove highly concentrated and isolated nodes before graph construction. Our experiments show that embedding features learned from the similarity-based behavioral graph achieve a significant performance increase over the baseline fraud detection model in various business scenarios. In the new guest buyer transaction scenario, a segment that is challenging for traditional methods, we increase precision from 0.82 to 0.86 at the same recall of 0.27, which means we can decrease the false positive rate using this method.
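    A loose sketch of the linking step follows. The similarity measure, threshold and pruning rule are assumptions (the paper removes both highly concentrated and isolated nodes; here HDBSCAN noise points stand in for isolated ones).

        import numpy as np
        import hdbscan
        from sklearn.metrics.pairwise import cosine_similarity

        def soft_link_edges(behavior_vecs, sim_threshold=0.9, min_cluster_size=50):
            """behavior_vecs: (n, d) array of per-transaction behavioral features.
            Prune HDBSCAN noise points (label -1), then link transactions whose
            behavioral vectors are highly similar."""
            labels = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size).fit_predict(behavior_vecs)
            keep = np.where(labels != -1)[0]
            sim = cosine_similarity(behavior_vecs[keep])
            src, dst = np.where(np.triu(sim, k=1) > sim_threshold)
            return list(zip(keep[src], keep[dst]))   # candidate soft-link edges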
    Robust Time Series Denoising with Learnable Wavelet Packet Transform. (arXiv:2206.06126v2 [cs.SD] UPDATED)
    In many applications, signal denoising is often the first pre-processing step before any subsequent analysis or learning task. In this paper, we propose a deep learning denoising model inspired by signal processing: a learnable version of the wavelet packet transform. The proposed algorithm has significant learning capabilities with few interpretable parameters and an intuitive initialisation. We propose a post-learning modification of the parameters to adapt the denoising to different noise levels. We evaluate the performance of the proposed methodology on two case studies and compare it to other state-of-the-art approaches, including wavelet shrinkage denoising, convolutional neural network, autoencoder and U-net deep models. The first case study is based on designed functions that have typically been used to study the denoising properties of such algorithms. The second case study is an audio background removal task. We demonstrate how the proposed algorithm relates to the universality of signal processing methods and the learning capabilities of deep learning approaches. In particular, we evaluate the obtained denoising performance on structured noisy signals inside and outside the classes used for training. In addition to performing well on signals inside and outside the training class, our method proves particularly robust when different noise levels, noise types and artifacts are added.
    Denoising Diffusion Restoration Models. (arXiv:2201.11793v3 [eess.IV] UPDATED)
    Many interesting tasks in image restoration can be cast as linear inverse problems. A recent family of approaches for solving these problems uses stochastic algorithms that sample from the posterior distribution of natural images given the measurements. However, efficient solutions often require problem-specific supervised training to model the posterior, whereas unsupervised methods that are not problem-specific typically rely on inefficient iterative methods. This work addresses these issues by introducing Denoising Diffusion Restoration Models (DDRM), an efficient, unsupervised posterior sampling method. Motivated by variational inference, DDRM takes advantage of a pre-trained denoising diffusion generative model for solving any linear inverse problem. We demonstrate DDRM's versatility on several image datasets for super-resolution, deblurring, inpainting, and colorization under various amounts of measurement noise. DDRM outperforms the current leading unsupervised methods on the diverse ImageNet dataset in reconstruction quality, perceptual quality, and runtime, being 5x faster than the nearest competitor. DDRM also generalizes well for natural images out of the distribution of the observed ImageNet training set.
    Neur2SP: Neural Two-Stage Stochastic Programming. (arXiv:2205.12006v2 [math.OC] UPDATED)
    Stochastic Programming is a powerful modeling framework for decision-making under uncertainty. In this work, we tackle two-stage stochastic programs (2SPs), the most widely used class of stochastic programming models. Solving 2SPs exactly requires optimizing over an expected value function that is computationally intractable. Having a mixed-integer linear program (MIP) or a nonlinear program (NLP) in the second stage further aggravates the intractability, even when specialized algorithms that exploit problem structure are employed. Finding high-quality (first-stage) solutions -- without leveraging problem structure -- can be crucial in such settings. We develop Neur2SP, a new method that approximates the expected value function via a neural network to obtain a surrogate model that can be solved more efficiently than the traditional extensive formulation approach. Neur2SP makes no assumptions about the problem structure, in particular about the second-stage problem, and can be implemented using an off-the-shelf MIP solver. Our extensive computational experiments on four benchmark 2SP problem classes with different structures (containing MIP and NLP second-stage problems) demonstrate the efficiency (time) and efficacy (solution quality) of Neur2SP. In under 1.66 seconds, Neur2SP finds high-quality solutions across all problems even as the number of scenarios increases, an ideal property that is difficult to have for traditional 2SP solution techniques. Namely, the most generic baseline method typically requires minutes to hours to find solutions of comparable quality.
    Cascaded Deep Hybrid Models for Multistep Household Energy Consumption Forecasting. (arXiv:2207.02589v2 [cs.LG] UPDATED)
    Sustainability requires increased energy efficiency with minimal waste. Future power systems should thus provide high levels of flexibility in controlling energy consumption. Precise projections of future energy demand/load at the aggregate and individual-site levels are of great importance for decision makers and professionals in the energy industry. Forecasting energy loads has become more advantageous for energy providers and customers, allowing them to establish efficient production strategies to satisfy demand. This study introduces two hybrid cascaded models for forecasting multistep household power consumption at different resolutions. The first model integrates the Stationary Wavelet Transform (SWT), as an efficient signal preprocessing technique, with Convolutional Neural Networks and Long Short-Term Memory (LSTM). The second hybrid model combines SWT with a self-attention-based neural network architecture known as the transformer. A major constraint of using time-frequency analysis methods such as SWT in multistep energy forecasting problems is that they require sequential signals, making signal reconstruction problematic in multistep forecasting applications. The cascaded models can efficiently address this problem by using the recursive outputs. Experimental results show that the proposed hybrid models achieve superior prediction performance compared to existing multistep power consumption prediction methods. The results will pave the way for more accurate and reliable forecasting of household power consumption.
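    The SWT preprocessing stage is easy to demonstrate with PyWavelets: each (approximation, detail) coefficient pair becomes one input channel for the downstream LSTM or transformer forecaster. The wavelet and decomposition level below are illustrative choices, not necessarily the paper's.

        import numpy as np
        import pywt

        def swt_features(load, wavelet="db4", level=3):
            """Stationary Wavelet Transform of a power-consumption series,
            returned as an (n, 2*level) array of per-subband channels."""
            n = len(load) - len(load) % 2**level    # SWT needs length divisible by 2^level
            coeffs = pywt.swt(np.asarray(load[:n], dtype=float), wavelet, level=level)
            channels = [c for pair in coeffs for c in pair]
            return np.stack(channels, axis=-1)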
    Does Self-supervised Learning Really Improve Reinforcement Learning from Pixels?. (arXiv:2206.05266v3 [cs.LG] UPDATED)
    We investigate whether self-supervised learning (SSL) can improve online reinforcement learning (RL) from pixels. We extend the contrastive reinforcement learning framework (e.g., CURL) that jointly optimizes SSL and RL losses and conduct extensive experiments with various self-supervised losses. Our observations suggest that the existing SSL framework for RL fails to bring meaningful improvement over baselines that only take advantage of image augmentation, when the same amount of data and augmentation is used. We further perform evolutionary searches to find the optimal combination of multiple self-supervised losses for RL, but find that even such a loss combination fails to meaningfully outperform the methods that only utilize carefully designed image augmentations. After evaluating these approaches together in multiple different environments including a real-world robot environment, we confirm that no single self-supervised loss or image augmentation method can dominate all environments and that the current framework for joint optimization of SSL and RL is limited. Finally, we conduct an ablation study on multiple factors and demonstrate the properties of representations learned with different approaches.
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v2 [cs.LG] UPDATED)
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.
    Deep Ensembles Work, But Are They Necessary?. (arXiv:2202.06985v2 [cs.LG] UPDATED)
    Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of individual larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer distinct benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble's uncertainty quantification on out-of-distribution (OOD) data, but is instead highly correlated with the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and -- in this sense -- is not indicative of any "effective robustness". While deep ensembles are a practical way to achieve improvements to predictive power, uncertainty quantification, and robustness, our results show that these improvements can be replicated by a (larger) single model.
    Linearizing Transformer with Key-Value Memory. (arXiv:2203.12644v4 [cs.CL] UPDATED)
    Efficient transformer variants with linear time complexity have been developed to mitigate the quadratic computational overhead of the vanilla transformer. Among them are low-rank projection methods such as Linformer and kernel-based Transformers. Despite their unique merits, they usually suffer from a performance drop compared with the vanilla transformer on many sequence generation tasks, and often fail to obtain computation gains when the generation is short. We propose MemSizer, an approach towards closing the performance gap while improving efficiency even with short generation. It projects the source sequences into lower dimension representations like Linformer, while enjoying efficient recurrent-style incremental computation similar to kernel-based transformers. This yields linear computation time and constant memory complexity at inference time. MemSizer also employs a lightweight multi-head mechanism which renders the computation as light as a single-head model. We demonstrate that MemSizer provides an improved balance between efficiency and accuracy over the vanilla transformer and other efficient transformer variants in three typical sequence generation tasks, including machine translation, abstractive text summarization, and language modeling.
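    For context, the recurrent-style incremental computation that kernel-based linear transformers rely on can be sketched in a few lines; this is generic linear attention with an ELU+1 feature map, not MemSizer's specific design, and all sizes are illustrative.

```python
# Sketch of kernel-based linear attention at inference: a fixed-size
# key-value memory is updated per token, so decoding uses constant memory
# instead of attending over the full prefix.
import numpy as np

def phi(x):
    # ELU+1 feature map, a common choice ensuring positivity.
    return np.where(x > 0, x + 1.0, np.exp(x))

d = 16
rng = np.random.default_rng(0)
S = np.zeros((d, d))      # running key-value memory  sum_t phi(k_t) v_t^T
z = np.zeros(d)           # running normalizer        sum_t phi(k_t)

for t in range(100):                        # stream of tokens
    q, k, v = rng.standard_normal((3, d))
    S += np.outer(phi(k), v)                # O(d^2) update, independent of t
    z += phi(k)
    out = phi(q) @ S / (phi(q) @ z + 1e-6)  # attention output for step t
print(out.shape)  # (16,)
```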
    CUF: Continuous Upsampling Filters. (arXiv:2210.06965v1 [cs.LG])
    Neural fields have rapidly been adopted for representing 3D signals, but their application to more classical 2D image-processing has been relatively limited. In this paper, we consider one of the most important operations in image processing: upsampling. In deep learning, learnable upsampling layers have extensively been used for single image super-resolution. We propose to parameterize upsampling kernels as neural fields. This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures. When upsampling images of size 256x256 we show that our architecture is 2x-10x more efficient than competing arbitrary-scale super-resolution architectures, and more efficient than sub-pixel convolutions when instantiated to a single-scale model. In the general setting, these gains grow polynomially with the square of the target scale. We validate our method on standard benchmarks showing such efficiency gains can be achieved without sacrifices in super-resolution performance.
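    The parameterization admits a compact sketch: a small MLP maps a continuous sub-pixel offset and the target scale to the weights of a local interpolation kernel, so one network serves arbitrary scales. The network sizes, the 3x3 support, and the softmax normalization below are all illustrative assumptions rather than the paper's exact architecture.

```python
# Sketch of an upsampling kernel parameterized as a neural field.
import torch
import torch.nn as nn

class ContinuousKernelField(nn.Module):
    def __init__(self, support=3, hidden=64):
        super().__init__()
        self.support = support
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),           # input: (dx, dy, scale)
            nn.Linear(hidden, support * support),
        )

    def forward(self, dx, dy, scale):
        coords = torch.stack([dx, dy, scale], dim=-1)
        w = self.mlp(coords)                           # per-offset kernel weights
        return torch.softmax(w, dim=-1).view(*dx.shape, self.support, self.support)

field = ContinuousKernelField()
# Query kernels for a 2.5x upscale at a few fractional offsets.
dx = torch.tensor([0.0, 0.4, 0.8])
dy = torch.tensor([0.0, 0.4, 0.8])
scale = torch.full_like(dx, 2.5)
kernels = field(dx, dy, scale)
print(kernels.shape)  # torch.Size([3, 3, 3]) -> one 3x3 kernel per query
```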
    FedRecAttack: Model Poisoning Attack to Federated Recommendation. (arXiv:2204.01499v2 [cs.CR] UPDATED)
    Federated Recommendation (FR) has received considerable popularity and attention in the past few years. In FR, each user's feature vector and interaction data are kept locally on their own client and are thus private to others. Without access to this information, most existing poisoning attacks against recommender systems or federated learning lose validity. Benefiting from this characteristic, FR is commonly considered fairly secure. However, we argue that security improvements in FR are still both possible and necessary. To support this claim, in this paper we present FedRecAttack, a model poisoning attack on FR that aims to raise the exposure ratio of target items. In most recommendation scenarios, apart from private user-item interactions (e.g., clicks, watches and purchases), some interactions are public (e.g., likes, follows and comments). Motivated by this observation, FedRecAttack uses the public interactions to approximate users' feature vectors, so that the attacker can generate poisoned gradients accordingly and control malicious users to upload them in a well-designed way. To evaluate the effectiveness and side effects of FedRecAttack, we conduct extensive experiments on three real-world datasets of different sizes from two completely different scenarios. Experimental results demonstrate that our proposed FedRecAttack achieves state-of-the-art effectiveness while its side effects are negligible. Moreover, even with a small proportion (3%) of malicious users and a small proportion (1%) of public interactions, FedRecAttack remains highly effective, which reveals that FR is more vulnerable to attack than commonly believed.
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v3 [math.NA] UPDATED)
    Given a nonnegative matrix $R$ and a factorization rank $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. We provide a mathematically rigorous proof of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.
    Hamiltonian latent operators for content and motion disentanglement in image sequences. (arXiv:2112.01641v4 [cs.CV] UPDATED)
    We introduce \textit{HALO} -- a deep generative model utilising HAmiltonian Latent Operators to reliably disentangle content and motion information in image sequences. The \textit{content} represents summary statistics of a sequence, and \textit{motion} is a dynamic process that determines how information is expressed in any part of the sequence. By modelling the dynamics as a Hamiltonian motion, important desiderata are ensured: (1) the motion is reversible, (2) the symplectic, volume-preserving structure in phase space means paths are continuous and are not divergent in the latent space. Consequently, the nearness of sequence frames is realised by the nearness of their coordinates in the phase space, which proves valuable for disentanglement and long-term sequence generation. The sequence space is generally comprised of different types of dynamical motions. To ensure long-term separability and allow controlled generation, we associate every motion with a unique Hamiltonian that acts in its respective subspace. We demonstrate the utility of \textit{HALO} by swapping the motion of a pair of sequences, controlled generation, and image rotations.
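    The two desiderata are properties of symplectic integrators, which a short sketch can make concrete: the leapfrog step below is exactly reversible under momentum negation and preserves phase-space volume. The quadratic potential is an illustrative stand-in for HALO's learned per-motion Hamiltonians.

```python
# Sketch of a Hamiltonian latent update: split the latent code into
# position q and momentum p and apply a leapfrog (kick-drift-kick) step.
import torch

def leapfrog(q, p, grad_V, dt=0.1):
    p = p - 0.5 * dt * grad_V(q)
    q = q + dt * p
    p = p - 0.5 * dt * grad_V(q)
    return q, p

torch.manual_seed(0)
dim = 8
A = torch.randn(dim, dim)
W = A @ A.T + torch.eye(dim)           # positive-definite "potential" per motion type
grad_V = lambda q: q @ W               # gradient of V(q) = 0.5 q^T W q

q, p = torch.randn(2, dim)
q2, p2 = leapfrog(q, p, grad_V)
# Reversibility: negate momentum, step again -> recover the original state.
q_back, p_back = leapfrog(q2, -p2, grad_V)
print(torch.allclose(q_back, q, atol=1e-5), torch.allclose(-p_back, p, atol=1e-5))
```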
    Refining Self-Supervised Learning in Imaging: Beyond Linear Metric. (arXiv:2202.12921v2 [cs.CV] UPDATED)
    We introduce in this paper a new statistical perspective, exploiting the Jaccard similarity metric, as a measure-based metric to effectively invoke non-linear features in the loss of self-supervised contrastive learning. Specifically, our proposed metric may be interpreted as a dependence measure between two adapted projections learned from the so-called latent representations. This is in contrast to the cosine similarity measure in the conventional contrastive learning model, which accounts for correlation information. To the best of our knowledge, this effectively non-linearly fused information embedded in the Jaccard similarity is novel to self-supervised learning, with promising results. The proposed approach is compared to two state-of-the-art self-supervised contrastive learning methods on three image datasets. We not only demonstrate its amenable applicability in current ML problems, but also its improved performance and training efficiency.
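    A differentiable Jaccard-style similarity is easy to sketch. The sigmoid squashing used below to obtain nonnegative "membership" scores is an assumption about one plausible way to make the measure differentiable, not necessarily the paper's exact construction.

```python
# Sketch of a soft Jaccard similarity between two projected views, as a
# drop-in for cosine similarity in a contrastive loss.
import torch

def soft_jaccard(a, b, eps=1e-8):
    a, b = torch.sigmoid(a), torch.sigmoid(b)        # nonnegative "membership" scores
    inter = torch.minimum(a, b).sum(dim=-1)
    union = torch.maximum(a, b).sum(dim=-1)
    return inter / (union + eps)

z1 = torch.randn(32, 128)   # projections of two augmented views
z2 = torch.randn(32, 128)
loss = 1.0 - soft_jaccard(z1, z2).mean()   # pull positive pairs together
print(loss.item())
```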
    Fair Federated Learning via Bounded Group Loss. (arXiv:2203.10190v3 [cs.LG] UPDATED)
    Fair prediction across protected groups is an important constraint for many federated learning applications. However, prior work studying group fair federated learning lacks formal convergence or fairness guarantees. In this work we propose a general framework for provably fair federated learning. In particular, we explore and extend the notion of Bounded Group Loss as a theoretically-grounded approach for group fairness. Using this setup, we propose a scalable federated optimization method that optimizes the empirical risk under a number of group fairness constraints. We provide convergence guarantees for the method as well as fairness guarantees for the resulting solution. Empirically, we evaluate our method across common benchmarks from fair ML and federated learning, showing that it can provide both fairer and more accurate predictions than baseline approaches.
    DDXPlus: A New Dataset For Automatic Medical Diagnosis. (arXiv:2205.09148v3 [cs.CL] UPDATED)
    There has been a rapidly growing interest in Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the machine learning research literature, aiming to assist doctors in telemedicine services. These systems are designed to interact with patients, collect evidence about their symptoms and relevant antecedents, and possibly make predictions about the underlying diseases. Doctors would review the interactions, including the evidence and the predictions, and collect additional information from patients if necessary, before deciding on next steps. Despite recent progress in this area, an important piece of doctors' interactions with patients is missing in the design of these systems, namely the differential diagnosis. Its absence is largely due to the lack of datasets that include such information for models to train on. In this work, we present a large-scale synthetic dataset of roughly 1.3 million patients that includes a differential diagnosis, along with the ground truth pathology, symptoms and antecedents for each patient. Unlike existing datasets which only contain binary symptoms and antecedents, this dataset also contains categorical and multi-choice symptoms and antecedents useful for efficient data collection. Moreover, some symptoms are organized in a hierarchy, making it possible to design systems able to interact with patients in a logical way. As a proof-of-concept, we extend two existing AD and ASD systems to incorporate the differential diagnosis, and provide empirical evidence that using differentials as training signals is essential for the efficiency of such systems or for helping doctors better understand the reasoning of those systems.
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v1 [cs.LG])
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.
    Meta-learning Based Short-Term Passenger Flow Prediction for Newly-Operated Urban Rail Transit Stations. (arXiv:2210.07098v1 [cs.LG])
    Accurate short-term passenger flow prediction in urban rail transit stations has great benefits for reasonably allocating resources, easing congestion, and reducing operational risks. However, compared with data-rich stations, passenger flow prediction in newly-operated stations is limited by the volume of available passenger flow data, which reduces prediction accuracy and increases the difficulty of station management and operation. Hence, how to accurately predict passenger flow in newly-operated stations with limited data is an urgent problem. Existing passenger flow prediction approaches generally depend on sufficient data, which might be unsuitable for newly-operated stations. Therefore, we propose a meta-learning method named Meta Long Short-Term Memory Network (Meta-LSTM) to predict the passenger flow in newly-operated stations. Meta-LSTM constructs a framework that increases the generalization ability of a long short-term memory network (LSTM) to various passenger flow characteristics by learning those characteristics from multiple data-rich stations and then applying the learned parameters to data-scarce stations via parameter initialization. Meta-LSTM is applied to the subway networks of Nanning, Hangzhou, and Beijing, China. The experiments on three real-world subway networks demonstrate the effectiveness of our proposed Meta-LSTM over several competitive baseline models. Results also show that our proposed Meta-LSTM generalizes well to various passenger flow characteristics, which can provide a reference for passenger flow prediction in stations with limited data.
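    The parameter-initialization transfer can be sketched with a Reptile-style outer loop. The abstract does not pin down the exact meta-update, so this choice, along with all shapes and the synthetic station data, is illustrative.

```python
# Sketch of meta-initialization: learn shared LSTM parameters from several
# data-rich stations, then use them to initialize a data-scarce station's model.
import copy
import torch
import torch.nn as nn

class FlowLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                       # x: (batch, time, 1)
        h, _ = self.lstm(x)
        return self.head(h[:, -1])              # next-step flow

def inner_adapt(model, x, y, steps=5, lr=1e-2):
    model = copy.deepcopy(model)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        nn.functional.mse_loss(model(x), y).backward()
        opt.step()
    return model

torch.manual_seed(0)
stations = [(torch.randn(64, 24, 1), torch.randn(64, 1)) for _ in range(4)]  # toy data

meta = FlowLSTM()
meta_lr = 0.1
for epoch in range(20):
    for x, y in stations:                       # data-rich stations
        adapted = inner_adapt(meta, x, y)
        # Reptile outer step: move the meta-parameters toward the adapted ones.
        with torch.no_grad():
            for p_meta, p_task in zip(meta.parameters(), adapted.parameters()):
                p_meta += meta_lr * (p_task - p_meta)
# `meta` now initializes the model for a newly-operated, data-scarce station.
```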
    Online Minimax Multiobjective Optimization: Multicalibeating and Other Applications. (arXiv:2108.03837v3 [cs.LG] UPDATED)
    We introduce a simple but general online learning framework in which a learner plays against an adversary in a vector-valued game that changes every round. Even though the learner's objective is not convex-concave (and so the minimax theorem does not apply), we give a simple algorithm that can compete with the setting in which the adversary must announce their action first, with optimally diminishing regret. We demonstrate the power of our framework by using it to (re)derive optimal bounds and efficient algorithms across a variety of domains, ranging from multicalibration to a large set of no regret algorithms, to a variant of Blackwell's approachability theorem for polytopes with fast convergence rates. As a new application, we show how to ``(multi)calibeat'' an arbitrary collection of forecasters -- achieving an exponentially improved dependence on the number of models we are competing against, compared to prior work.
    Accurate, reliable and interpretable solubility prediction of druglike molecules with attention pooling and Bayesian learning. (arXiv:2210.07145v1 [q-bio.BM])
    In drug discovery, aqueous solubility is an important pharmacokinetic property which affects the absorption and assay availability of a drug. Thus, in silico prediction of solubility has been studied for its utility in virtual screening and lead optimization. Recently, machine learning (ML) methods using experimental data have become popular because physics-based methods like quantum mechanics and molecular dynamics are unsuitable for high-throughput tasks due to their computational cost. However, ML methods can over-fit in data-deficient conditions, which is the case for most chemical property datasets. In addition, ML models are often regarded as black-box functions in that it is difficult to interpret the contribution of hidden features to outputs, hindering the analysis and modification of structure-activity relationships. To address these issues, we developed Bayesian graph neural networks (GNNs) with a self-attention readout layer. Unlike most GNNs that use self-attention in node updates, self-attention applied at the readout layer enables the model to improve prediction performance as well as to identify atom-wise importance, which can help lead optimization, as exemplified for three FDA-approved drugs. Also, Bayesian inference enables us to separate more or less accurate results according to uncertainty in the solubility prediction task. We expect that our accurate, reliable and interpretable model can be used for more careful decision-making and various applications in the development of drugs.
    Predictive Querying for Autoregressive Neural Sequence Models. (arXiv:2210.06464v2 [cs.LG] UPDATED)
    In reasoning about sequential events it is natural to pose probabilistic queries such as "when will event A occur next" or "what is the probability of A occurring before B", with applications in areas such as user modeling, medicine, and finance. However, with machine learning shifting towards neural autoregressive models such as RNNs and transformers, probabilistic querying has been largely restricted to simple cases such as next-event prediction. This is in part due to the fact that future querying involves marginalization over large path spaces, which is not straightforward to do efficiently in such models. In this paper we introduce a general typology for predictive queries in neural autoregressive sequence models and show that such queries can be systematically represented by sets of elementary building blocks. We leverage this typology to develop new query estimation methods based on beam search, importance sampling, and hybrids. Across four large-scale sequence datasets from different application domains, as well as for the GPT-2 language model, we demonstrate the ability to make query answering tractable for arbitrary queries in exponentially-large predictive path-spaces, and find clear differences in cost-accuracy tradeoffs between search and sampling methods.
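    As a baseline for what such queries involve, here is a naive Monte Carlo sketch of the query "does A occur before B?". A fixed toy next-event distribution stands in for a trained autoregressive model; the paper's contribution is precisely to replace this kind of brute-force path sampling with systematic beam-search and importance-sampling estimators.

```python
# Sketch of estimating P(A occurs before B) by sampling continuation paths.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = ["A", "B", "C", "D"]
P = np.array([0.05, 0.10, 0.45, 0.40])     # hypothetical next-event distribution

def sample_next(history):
    # Placeholder for model(history) -> next-token sample.
    return rng.choice(len(VOCAB), p=P)

def prob_A_before_B(n_paths=20000, max_len=200):
    hits = 0
    for _ in range(n_paths):
        history = []
        for _ in range(max_len):
            tok = VOCAB[sample_next(history)]
            history.append(tok)
            if tok == "A":
                hits += 1
                break
            if tok == "B":
                break
    return hits / n_paths

print(prob_A_before_B())   # should approach 0.05 / (0.05 + 0.10) = 1/3
```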
    DeepTime: Deep Time-Index Meta-Learning for Non-Stationary Time-Series Forecasting. (arXiv:2207.06046v3 [cs.LG] UPDATED)
    Advances in I.T. infrastructure have led to the collection of longer sequences of time-series. Such sequences are typically non-stationary, exhibiting distribution shifts over time -- a challenging scenario for the forecasting task, due to the problems of covariate shift and conditional distribution shift. In this paper, we show that deep time-index models possess strong synergies with a meta-learning formulation of forecasting, displaying significant advantages over existing neural forecasting methods in tackling the problems arising from non-stationarity. These advantages include having a stronger smoothness prior, avoiding the problem of covariate shift, and having better sample efficiency. To this end, we propose DeepTime, a deep time-index model trained via meta-learning. Extensive experiments on real-world datasets in the long sequence time-series forecasting setting demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is available at https://github.com/salesforce/DeepTime.
    Explaining Graph Neural Networks with Structure-Aware Cooperative Games. (arXiv:2201.12380v4 [cs.LG] UPDATED)
    Explaining machine learning models is an important and increasingly popular area of research interest. The Shapley value from game theory has been proposed as a prime approach to compute feature importance towards model predictions on images, text, tabular data, and recently graph neural networks (GNNs) on graphs. In this work, we revisit the appropriateness of the Shapley value for GNN explanation, where the task is to identify the most important subgraph and constituent nodes for GNN predictions. We claim that the Shapley value is a non-ideal choice for graph data because it is by definition not structure-aware. We propose a Graph Structure-aware eXplanation (GStarX) method to leverage the critical graph structure information to improve the explanation. Specifically, we define a scoring function based on a new structure-aware value from the cooperative game theory proposed by Hamiache and Navarro (HN). When used to score node importance, the HN value utilizes graph structures to attribute cooperation surplus between neighbor nodes, resembling message passing in GNNs, so that node importance scores reflect not only the node feature importance, but also the node structural roles. We demonstrate that GStarX produces qualitatively more intuitive explanations, and quantitatively improves explanation fidelity over strong baselines on chemical graph property prediction and text graph sentiment classification.
    Closer Look at the Transferability of Adversarial Examples: How They Fool Different Models Differently. (arXiv:2112.14337v2 [cs.LG] UPDATED)
    Deep neural networks are vulnerable to adversarial examples (AEs), which have adversarial transferability: AEs generated for the source model can mislead another (target) model's predictions. However, transferability has not been understood in terms of which wrong class the target model's predictions are misled to (i.e., class-aware transferability). In this paper, we distinguish the cases in which a target model predicts the same wrong class as the source model ("same mistake") or a different wrong class ("different mistake") to analyze and explain the underlying mechanism. First, our analysis shows (1) that AEs tend to cause same mistakes, correlating with "non-targeted transferability," and (2) that different mistakes occur between similar models regardless of the perturbation size. Second, we present evidence that the difference between same mistakes and different mistakes can be explained by non-robust features, predictive but human-uninterpretable patterns: different mistakes occur when non-robust features in AEs are used differently by models. Non-robust features can thus provide consistent explanations for the class-aware transferability of AEs.
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v3 [stat.ML] UPDATED)
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the data augmentation parameters are chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution on the functions of a neural network, which allows us to learn it using Bayesian model selection. This has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.
    Online PAC-Bayes Learning. (arXiv:2206.00024v2 [cs.LG] UPDATED)
    Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.
    Communication Efficient Federated Learning for Generalized Linear Bandits. (arXiv:2202.01087v2 [cs.LG] UPDATED)
    Contextual bandit algorithms have been recently studied under the federated learning setting to satisfy the demand of keeping data decentralized and pushing the learning of bandit models to the client side. But limited by the required communication efficiency, existing solutions are restricted to linear models to exploit their closed-form solutions for parameter estimation. Such a restricted model choice greatly hampers these algorithms' practical utility. In this paper, we take the first step towards addressing this challenge by studying generalized linear bandit models under the federated learning setting. We propose a communication-efficient solution framework that employs online regression for local updates and offline regression for global updates. We rigorously prove that, although the setting is more general and challenging, our algorithm attains a sub-linear rate in both regret and communication cost, which is also validated by our extensive empirical evaluations.
    Variational Graph Generator for Multi-View Graph Clustering. (arXiv:2210.07011v1 [cs.LG])
    Multi-view graph clustering (MGC) methods are increasingly being studied due to the rise of multi-view data with graph structural information. The critical point of MGC is to better utilize the view-specific and view-common information in the features and graphs of multiple views. However, existing works have an inherent limitation: they are unable to concurrently utilize the consensus graph information across multiple graphs and the view-specific feature information. To address this issue, we propose Variational Graph Generator for Multi-View Graph Clustering (VGMGC). Specifically, a novel variational graph generator is proposed to infer a reliable variational consensus graph based on a priori assumption over multiple graphs. Then a simple yet effective graph encoder in conjunction with the multi-view clustering objective is presented to learn the desired graph embeddings for clustering, which embeds the consensus and view-specific graphs together with features. Finally, theoretical results illustrate the rationality of VGMGC by analyzing the uncertainty of the inferred consensus graph with the information bottleneck principle. Extensive experiments demonstrate the superior performance of our VGMGC over SOTAs.
    SAE: Sequential Anchored Ensembles. (arXiv:2201.00649v2 [cs.LG] UPDATED)
    Computing the Bayesian posterior of a neural network is a challenging task due to the high-dimensionality of the parameter space. Anchored ensembles approximate the posterior by training an ensemble of neural networks on anchored losses designed for the optima to follow the Bayesian posterior. Training an ensemble, however, becomes computationally expensive as its number of members grows since the full training procedure is repeated for each member. In this note, we present Sequential Anchored Ensembles (SAE), a lightweight alternative to anchored ensembles. Instead of training each member of the ensemble from scratch, the members are trained sequentially on losses sampled with high auto-correlation, hence enabling fast convergence of the neural networks and efficient approximation of the Bayesian posterior. SAE outperform anchored ensembles, for a given computational budget, on some benchmarks while showing comparable performance on the others and achieved 2nd and 3rd place in the light and extended tracks of the NeurIPS 2021 Approximate Inference in Bayesian Deep Learning competition.
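    The anchored-loss construction at the heart of both methods can be sketched directly. The Gaussian prior scale and noise variance below are illustrative assumptions, and SAE's specific refinement (sequentially sampled, highly auto-correlated anchors) is only noted in a comment.

```python
# Sketch of an anchored loss: each ensemble member is regularized toward its
# own fixed "anchor" draw from the prior, so trained optima approximate
# posterior samples. SAE generates anchors sequentially with high
# auto-correlation so each member warm-starts the next.
import torch
import torch.nn as nn

def anchored_loss(model, anchors, x, y, prior_var=1.0, noise_var=0.1):
    mse = nn.functional.mse_loss(model(x), y, reduction="sum") / noise_var
    reg = sum(((p - a) ** 2).sum() for p, a in zip(model.parameters(), anchors))
    return mse + reg / prior_var

torch.manual_seed(0)
x, y = torch.randn(128, 4), torch.randn(128, 1)
model = nn.Sequential(nn.Linear(4, 16), nn.Tanh(), nn.Linear(16, 1))
# One anchor per parameter tensor, drawn once from the prior and then frozen.
anchors = [torch.randn_like(p) for p in model.parameters()]

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    anchored_loss(model, anchors, x, y).backward()
    opt.step()
```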
    Giga-scale Kernel Matrix Vector Multiplication on GPU. (arXiv:2202.01085v3 [math.NA] UPDATED)
    Kernel matrix-vector multiplication (KMVM) is a foundational operation in machine learning and scientific computing. However, as KMVM tends to scale quadratically in both memory and time, applications are often limited by these computational constraints. In this paper, we propose a novel approximation procedure coined \textit{Faster-Fast and Free Memory Method} (F$^3$M) to address these scaling issues of KMVM for tall~($10^8\sim 10^9$) and skinny~($D\leq7$) data. Extensive experiments demonstrate that F$^3$M has empirical \emph{linear time and memory} complexity with a relative error of order $10^{-3}$ and can compute a full KMVM for a billion points \emph{in under a minute} on a high-end GPU, leading to a significant speed-up in comparison to existing CPU methods. We demonstrate the utility of our procedure by applying it as a drop-in for the state-of-the-art GPU-based linear solver FALKON, \emph{improving speed 1.5-5.5 times} at the cost of $<1\%$ drop in accuracy. We further demonstrate competitive results on \emph{Gaussian Process regression} coupled with significant speedups on a variety of real-world datasets.
    Zero-shot Transfer Learning within a Heterogeneous Graph via Knowledge Transfer Networks. (arXiv:2203.02018v4 [cs.LG] UPDATED)
    Data continuously emitted from industrial ecosystems such as social or e-commerce platforms are commonly represented as heterogeneous graphs (HG) composed of multiple node/edge types. State-of-the-art graph learning methods for HGs known as heterogeneous graph neural networks (HGNNs) are applied to learn deep context-informed node representations. However, many HG datasets from industrial applications suffer from label imbalance between node types. As there is no direct way to learn using labels rooted at different node types, HGNNs have been applied to only a few node types with abundant labels. We propose a zero-shot transfer learning module for HGNNs called a Knowledge Transfer Network (KTN) that transfers knowledge from label-abundant node types to zero-labeled node types through rich relational information given in the HG. KTN is derived from the theoretical relationship, which we introduce in this work, between distinct feature extractors for each node type given in an HGNN model. KTN improves performance of 6 different types of HGNN models by up to 960% for inference on zero-labeled node types and outperforms state-of-the-art transfer learning baselines by up to 73% across 18 different transfer learning tasks on HGs.
    Inducing Equilibria via Incentives: Simultaneous Design-and-Play Ensures Global Convergence. (arXiv:2110.01212v3 [cs.GT] UPDATED)
    To regulate a social system comprised of self-interested agents, economic incentives are often required to induce a desirable outcome. This incentive design problem naturally possesses a bilevel structure, in which a designer modifies the rewards of the agents with incentives while anticipating the response of the agents, who play a non-cooperative game that converges to an equilibrium. The existing bilevel optimization algorithms raise a dilemma when applied to this problem: anticipating how incentives affect the agents at equilibrium requires solving the equilibrium problem repeatedly, which is computationally inefficient; bypassing the time-consuming step of equilibrium-finding can reduce the computational cost, but may lead the designer to a sub-optimal solution. To address such a dilemma, we propose a method that tackles the designer's and agents' problems simultaneously in a single loop. Specifically, at each iteration, both the designer and the agents only move one step. Nevertheless, we allow the designer to gradually learn the overall influence of the incentives on the agents, which guarantees optimality after convergence. The convergence rate of the proposed scheme is also established for a broad class of games.
    Deep Multiagent Reinforcement Learning: Challenges and Directions. (arXiv:2106.15691v2 [cs.LG] UPDATED)
    This paper surveys the field of deep multiagent reinforcement learning. The combination of deep neural networks with reinforcement learning has gained increased traction in recent years and is slowly shifting the focus from single-agent to multiagent environments. Dealing with multiple agents is inherently more complex as (a) the future rewards depend on multiple players' joint actions and (b) the computational complexity increases. We present the most common multiagent problem representations and their main challenges, and identify five research areas that address one or more of these challenges: centralised training and decentralised execution, opponent modelling, communication, efficient coordination, and reward shaping. We find that many computational studies rely on unrealistic assumptions or are not generalisable to other settings; they struggle to overcome the curse of dimensionality or nonstationarity. Approaches from psychology and sociology capture promising relevant behaviours, such as communication and coordination, to help agents achieve better performance in multiagent settings. We suggest that, for multiagent reinforcement learning to be successful, future research should address these challenges with an interdisciplinary approach to open up new possibilities in multiagent reinforcement learning.
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v5 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.
    Outlier-Robust Group Inference via Gradient Space Clustering. (arXiv:2210.06759v1 [cs.LG])
    Traditional machine learning models focus on achieving good performance on the overall training distribution, but they often underperform on minority groups. Existing methods can improve the worst-group performance, but they can have several limitations: (i) they require group annotations, which are often expensive and sometimes infeasible to obtain, and/or (ii) they are sensitive to outliers. Most related works fail to solve these two issues simultaneously as they focus on conflicting perspectives of minority groups and outliers. We address the problem of learning group annotations in the presence of outliers by clustering the data in the space of gradients of the model parameters. We show that data in the gradient space has a simpler structure while preserving information about minority groups and outliers, making it suitable for standard clustering methods like DBSCAN. Extensive experiments demonstrate that our method significantly outperforms state-of-the-art both in terms of group identification and downstream worst-group performance.
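    The pipeline reduces to two steps that are easy to sketch: form a per-sample gradient feature (for a softmax output layer this has the closed form (p - y) outer h) and cluster it with DBSCAN. The synthetic features and the eps setting below are illustrative.

```python
# Sketch of group inference via gradient clustering: per-sample last-layer
# gradients are clustered with DBSCAN; clusters act as inferred groups and
# DBSCAN's noise label (-1) flags outliers.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
n, d, k = 300, 8, 3
h = rng.standard_normal((n, d))                   # penultimate-layer features
y = rng.integers(0, k, size=n)
logits = rng.standard_normal((n, k))
p = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)

onehot = np.eye(k)[y]
grad = ((p - onehot)[:, :, None] * h[:, None, :]).reshape(n, -1)  # per-sample grads

labels = DBSCAN(eps=3.0, min_samples=5).fit_predict(grad)
print("inferred groups:", set(labels))           # -1 marks outliers
```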
    Dissipative residual layers for unsupervised implicit parameterization of data manifolds. (arXiv:2210.07100v1 [cs.LG])
    We propose an unsupervised technique for implicit parameterization of data manifolds. In our approach, the data is assumed to belong to a lower dimensional manifold in a higher dimensional space, and the data points are viewed as the endpoints of the trajectories originating outside the manifold. Under this assumption, the data manifold is an attractive manifold of a dynamical system to be estimated. We parameterize such a dynamical system with a residual neural network and propose a spectral localization technique to ensure it is locally attractive in the vicinity of data. We also present initialization and additional regularization of the proposed residual layers. We mention the importance of the considered problem for the tasks of reinforcement learning and support our discussion with examples demonstrating the performance of the proposed layers in denoising and generative tasks.
    Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch. (arXiv:2106.08970v3 [cs.LG] UPDATED)
    As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent.
    On the Paradox of Certified Training. (arXiv:2102.06700v3 [cs.LG] UPDATED)
    Certified defenses based on convex relaxations are an established technique for training provably robust models. The key component is the choice of relaxation, varying from simple intervals to tight polyhedra. Counterintuitively, loose interval-based training often leads to higher certified robustness than what can be achieved with tighter relaxations, which is a well-known but poorly understood paradox. While recent works introduced various improvements aiming to circumvent this issue in practice, the fundamental problem of training models with high certified robustness remains unsolved. In this work, we investigate the underlying reasons behind the paradox and identify two key properties of relaxations, beyond tightness, that impact certified training dynamics: continuity and sensitivity. Our extensive experimental evaluation with a number of popular convex relaxations provides strong evidence that these factors can explain the drop in certified robustness observed for tighter relaxations. We also systematically explore modifications of existing relaxations and discover that improving unfavorable properties is challenging, as such attempts often harm other properties, revealing a complex tradeoff. Our findings represent an important first step towards understanding the intricate optimization challenges involved in certified training.
    Near-Optimal Randomized Exploration for Tabular Markov Decision Processes. (arXiv:2102.09703v5 [cs.LG] UPDATED)
    We study algorithms using randomized value functions for exploration in reinforcement learning. This type of algorithms enjoys appealing empirical performance. We show that when we use 1) a single random seed in each episode, and 2) a Bernstein-type magnitude of noise, we obtain a worst-case $\widetilde{O}\left(H\sqrt{SAT}\right)$ regret bound for episodic time-inhomogeneous Markov Decision Process where $S$ is the size of state space, $A$ is the size of action space, $H$ is the planning horizon and $T$ is the number of interactions. This bound polynomially improves all existing bounds for algorithms based on randomized value functions, and for the first time, matches the $\Omega\left(H\sqrt{SAT}\right)$ lower bound up to logarithmic factors. Our result highlights that randomized exploration can be near-optimal, which was previously achieved only by optimistic algorithms. To achieve the desired result, we develop 1) a new clipping operation to ensure both the probability of being optimistic and the probability of being pessimistic are lower bounded by a constant, and 2) a new recursive formula for the absolute value of estimation errors to analyze the regret.
    Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes. (arXiv:2110.02248v2 [cs.LG] UPDATED)
    We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K)KT\overline{\gamma}_{T}} )$ regret with high probability, where $\overline{\gamma}_{T}$ is the maximum information gain associated with the set of base arm contexts that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.
    SkipNode: On Alleviating Performance Degradation for Deep Graph Convolutional Networks. (arXiv:2112.11628v3 [cs.LG] UPDATED)
    Graph Convolutional Networks (GCNs) suffer from performance degradation when models go deeper. However, earlier works only attributed the performance degradation to over-smoothing. In this paper, we conduct theoretical and experimental analysis to explore the fundamental causes of performance degradation in deep GCNs: over-smoothing and gradient vanishing have a mutually reinforcing effect that causes performance to deteriorate more quickly in deep GCNs. On the other hand, existing anti-over-smoothing methods all perform full convolutions up to the model depth, and thus cannot effectively resist the exponential convergence of over-smoothing as model depth increases. In this work, we propose a simple yet effective plug-and-play module, SkipNode, to overcome the performance degradation of deep GCNs. It samples graph nodes in each convolutional layer to skip the convolution operation. In this way, both over-smoothing and gradient vanishing can be effectively suppressed since (1) not all nodes perform full convolutions up to the model depth and (2) the gradient can be directly passed back through ``skipped'' nodes. We provide both theoretical analysis and empirical evaluation to demonstrate the efficacy of SkipNode and its superiority over SOTA baselines.
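    The mechanism is simple enough to sketch as a layer: each forward pass samples a subset of nodes to convolve while the remaining nodes copy their features through, giving the gradient a direct path. The uniform sampling rate and dense normalized adjacency below are illustrative simplifications.

```python
# Sketch of a SkipNode-style layer: sampled nodes are convolved, the rest
# pass their input features forward unchanged.
import torch
import torch.nn as nn

class SkipNodeConv(nn.Module):
    def __init__(self, dim, sample_rate=0.5):
        super().__init__()
        self.lin = nn.Linear(dim, dim)
        self.sample_rate = sample_rate

    def forward(self, x, adj_norm):                 # x: (N, d), adj_norm: (N, N)
        h = torch.relu(self.lin(adj_norm @ x))      # standard GCN propagation
        keep = (torch.rand(x.size(0), 1) < self.sample_rate).float()
        return keep * h + (1.0 - keep) * x          # skipped nodes pass through

N, d = 6, 16
adj = torch.eye(N) + torch.rand(N, N).round()       # toy adjacency with self-loops
layer = SkipNodeConv(d)
out = layer(torch.randn(N, d), adj / adj.sum(1, keepdim=True))
print(out.shape)
```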
    Denoising Masked AutoEncoders are Certifiable Robust Vision Learners. (arXiv:2210.06983v1 [cs.CV])
    In this paper, we propose a new self-supervised method, which is called Denoising Masked AutoEncoders (DMAE), for learning certified robust classifiers of images. In DMAE, we corrupt each image by adding Gaussian noises to each pixel value and randomly masking several patches. A Transformer-based encoder-decoder model is then trained to reconstruct the original image from the corrupted one. In this learning paradigm, the encoder will learn to capture relevant semantics for the downstream tasks, which is also robust to Gaussian additive noises. We show that the pre-trained encoder can naturally be used as the base classifier in Gaussian smoothed models, where we can analytically compute the certified radius for any data point. Although the proposed method is simple, it yields significant performance improvement in downstream classification tasks. We show that the DMAE ViT-Base model, which just uses 1/10 parameters of the model developed in recent work arXiv:2206.10550, achieves competitive or better certified accuracy in various settings. The DMAE ViT-Large model significantly surpasses all previous results, establishing a new state-of-the-art on ImageNet dataset. We further demonstrate that the pre-trained model has good transferability to the CIFAR-10 dataset, suggesting its wide adaptability. Models and code are available at https://github.com/quanlin-wu/dmae.
    Shapley Q-value: A Local Reward Approach to Solve Global Reward Games. (arXiv:1907.05707v6 [cs.LG] UPDATED)
    Cooperative games are a critical research area in multi-agent reinforcement learning (MARL). Global reward games are a subclass of cooperative games, where all agents aim to maximize the global reward. Credit assignment is an important problem studied in global reward games. Most previous works adopted a non-cooperative-game theoretical framework with the shared reward approach, i.e., each agent is assigned the shared global reward directly. This, however, may give each agent an inaccurate reward for its contribution to the group, which could cause inefficient learning. To deal with this problem, we i) introduce a cooperative-game theoretical framework called extended convex game (ECG) that is a superset of global reward games, and ii) propose a local reward approach called Shapley Q-value. The Shapley Q-value is able to distribute the global reward, reflecting each agent's own contribution, in contrast to the shared reward approach. Moreover, we derive an MARL algorithm called Shapley Q-value deep deterministic policy gradient (SQDDPG), using the Shapley Q-value as the critic for each agent. We evaluate SQDDPG on Cooperative Navigation, Prey-and-Predator and Traffic Junction, compared with state-of-the-art algorithms, e.g., MADDPG, COMA, Independent DDPG and Independent A2C. In the experiments, SQDDPG shows a significant improvement in the convergence rate. Finally, we plot the Shapley Q-value and validate the property of fair credit assignment.
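    Shapley-style credit assignment is commonly approximated by sampling agent orderings and averaging marginal contributions, which the sketch below illustrates with a hypothetical coalition value function standing in for SQDDPG's learned critic.

```python
# Sketch of Monte Carlo Shapley credit assignment: average each agent's
# marginal contribution over random agent orderings.
import numpy as np

rng = np.random.default_rng(0)
agents = [0, 1, 2]

def coalition_value(coalition):
    # Hypothetical global value: individual worths plus a teamwork bonus.
    worth = {0: 1.0, 1: 2.0, 2: 0.5}
    if not coalition:
        return 0.0
    return sum(worth[a] for a in coalition) + 0.5 * (len(coalition) - 1)

def shapley_estimate(n_perms=5000):
    credit = np.zeros(len(agents))
    for _ in range(n_perms):
        coalition = []
        for a in rng.permutation(agents).tolist():
            before = coalition_value(coalition)
            coalition.append(a)
            after = coalition_value(coalition)
            credit[a] += after - before             # marginal contribution
    return credit / n_perms

print(shapley_estimate())   # local rewards that sum to the global value
```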
    SYNFIX: Automatically Fixing Syntax Errors using Compiler Diagnostics. (arXiv:2104.14671v2 [cs.SE] UPDATED)
    Beginning programmers struggle with the complex grammar of modern programming languages like Java, and make a lot of syntax errors. The diagnostic syntax error messages from compilers and IDEs are sometimes useful, but often the messages are cryptic and puzzling. Students could be helped, and instructors' time saved, by automated repair suggestions when dealing with syntax errors. Large samples of student errors and fixes are now available, offering the possibility of data-driven machine-learning approaches to help students fix syntax errors. Current machine-learning approaches do a reasonable job fixing syntax errors in shorter programs, but don't work as well even for moderately longer programs. We introduce SYNFIX, a machine-learning based tool that substantially improves on the state-of-the-art, by learning to use compiler diagnostics, employing a very large neural model that leverages unsupervised pre-training, and relying on multi-label classification rather than autoregressive synthesis to generate the (repaired) output. We describe SYNFIX's architecture in detail, and provide a detailed evaluation. We have built SYNFIX into a free, open-source version of Visual Studio Code; we make all our source code and models freely available.
    On minimizers and convolutional filters: a partial justification for the effectiveness of CNNs in categorical sequence analysis. (arXiv:2111.08452v3 [cs.LG] UPDATED)
    Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how those filters can be used to classify the sequence. In this manuscript, we demonstrate through a careful mathematical analysis of hash function properties that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In additional empirical experiments, we find that this property manifests as decreased density in repetitive regions. This provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
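    Minimizer selection itself is a few lines, which makes the correspondence easier to see. The sketch below uses an MD5-based hash as the random k-mer ordering (an arbitrary illustrative choice) to pick the minimum-hash k-mer from each window.

```python
# Sketch of minimizer extraction: slide a window of w consecutive k-mers over
# the sequence and keep the k-mer with the smallest hash per window. The random
# hash plays the role that random Gaussian filters + max-pooling play in CNNs.
import hashlib

def kmer_hash(kmer):
    return int(hashlib.md5(kmer.encode()).hexdigest(), 16)

def minimizers(seq, k=5, w=4):
    picked = []
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    for i in range(len(kmers) - w + 1):
        window = kmers[i:i + w]
        picked.append(min(window, key=kmer_hash))   # min-wise hashing per window
    return sorted(set(picked))

print(minimizers("ACGTACGTTAGGCTTACG"))
```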
    Graph-based Neural Modules to Inspect Attention-based Architectures: A Position Paper. (arXiv:2210.07117v1 [cs.LG])
    Encoder-decoder architectures are prominent building blocks of state-of-the-art solutions for tasks across multiple fields where deep learning (DL) or foundation models play a key role. Although there is a growing community working on the provision of interpretation for DL models as well as considerable work in the neuro-symbolic community seeking to integrate symbolic representations and DL, many open questions remain around the need for better tools for visualization of the inner workings of DL architectures. In particular, encoder-decoder models offer an exciting opportunity for visualization and editing by humans of the knowledge implicitly represented in model weights. In this work, we explore ways to create an abstraction for segments of the network as a two-way graph-based representation. Changes to this graph structure should be reflected directly in the underlying tensor representations. Such two-way graph representation enables new neuro-symbolic systems by leveraging the pattern recognition capabilities of the encoder-decoder along with symbolic reasoning carried out on the graphs. The approach is expected to produce new ways of interacting with DL models but also to improve performance as a result of the combination of learning and reasoning capabilities.
    On the Theoretical Equivalence of Several Trade-Off Curves Assessing Statistical Proximity. (arXiv:2006.11809v3 [cs.LG] UPDATED)
    The recent advent of powerful generative models has triggered the renewed development of quantitative measures to assess the proximity of two probability distributions. As the scalar Frechet inception distance remains popular, several methods have explored computing entire curves, which reveal the trade-off between the fidelity and variability of the first distribution with respect to the second one. Several of such variants have been proposed independently and while intuitively similar, their relationship has not yet been made explicit. In an effort to make the emerging picture of generative evaluation more clear, we propose a unification of four curves known respectively as: the precision-recall (PR) curve, the Lorenz curve, the receiver operating characteristic (ROC) curve and a special case of R\'enyi divergence frontiers. In addition, we discuss possible links between PR / Lorenz curves with the derivation of domain adaptation bounds.
    Sparse in Space and Time: Audio-visual Synchronisation with Trainable Selectors. (arXiv:2210.07055v1 [cs.CV])
    The objective of this paper is audio-visual synchronisation of general videos 'in the wild'. For such videos, the events that may be harnessed for synchronisation cues may be spatially small and may occur only infrequently during a many seconds-long video clip, i.e. the synchronisation signal is 'sparse in space and time'. This contrasts with the case of synchronising videos of talking heads, where audio-visual correspondence is dense in both time and space. We make four contributions: (i) in order to handle longer temporal sequences required for sparse synchronisation signals, we design a multi-modal transformer model that employs 'selectors' to distil the long audio and visual streams into small sequences that are then used to predict the temporal offset between streams. (ii) We identify artefacts that can arise from the compression codecs used for audio and video and can be used by audio-visual models in training to artificially solve the synchronisation task. (iii) We curate a dataset with only sparse in time and space synchronisation signals; and (iv) the effectiveness of the proposed model is shown on both dense and sparse datasets quantitatively and qualitatively. Project page: v-iashin.github.io/SparseSync
    Reprogramming Large Pretrained Language Models for Antibody Sequence Infilling. (arXiv:2210.07144v1 [q-bio.BM])
    Antibodies comprise the most versatile class of binding molecules, with numerous applications in biomedicine. Therapeutic antibody development requires designing novel and diverse sequences with improved properties while maintaining structural consistency. Recently, deep language models and graph neural networks have shown impressive success in antibody sequence generation. However, since only a limited number of antibody structures are known, training a model on such limited data can lead to degraded performance, particularly a lack of diversity in the generated samples. To address these issues, we leverage Model Reprogramming (MR), which repurposes pretrained machine learning models for target-domain tasks with scarce data, where it may be difficult to train a high-performing model from scratch. We introduce Reprogramming for Protein Sequence Infilling, a framework in which pretrained natural language models are repurposed via reprogramming to infill protein sequence templates as a method of novel protein generation. For variable CDR sequence design, we formulate the task as text infilling that uses the constant region of an antibody as the sequence template. Results on antibody design benchmarks show that our model, reprogrammed on a low-resource antibody sequence dataset, provides highly diverse CDR sequences, with up to a more than two-fold increase in diversity over the baselines, without losing structural integrity and naturalness. The performance benefit of the reprogrammed model, which learns only from antibody sequences, is more evident for longer CDR design or for infilling multiple loops at once, compared to existing graph-based models that require additional structural information. The generated sequences also demonstrate enhanced antigen binding specificity or virus neutralization ability.
    Reliable Neural Networks for Regression Uncertainty Estimation. (arXiv:2109.08213v2 [cs.LG] UPDATED)
    While deep neural networks are highly performant and successful in a wide range of real-world problems, estimating their predictive uncertainty remains a challenging task. To address this challenge, we propose and implement a loss function for regression uncertainty estimation based on the Bayesian Validation Metric (BVM) framework while using ensemble learning. The proposed loss reproduces maximum likelihood estimation in the limiting case. A series of experiments on in-distribution data show that the proposed method is competitive with existing state-of-the-art methods. Experiments on out-of-distribution data show that the proposed method is robust to statistical change and exhibits superior predictive capability.
    SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. (arXiv:2106.09305v3 [cs.LG] UPDATED)
    One unique property of time series is that the temporal relations are largely preserved after downsampling into two sub-sequences. By taking advantage of this property, we propose a novel neural network architecture that conducts sample convolution and interaction for temporal modeling and forecasting, named SCINet. Specifically, SCINet is a recursive downsample-convolve-interact architecture. In each layer, we use multiple convolutional filters to extract distinct yet valuable temporal features from the downsampled sub-sequences or features. By combining these rich features aggregated from multiple resolutions, SCINet effectively models time series with complex temporal dynamics. Experimental results show that SCINet achieves significant forecasting accuracy improvements over both existing convolutional models and Transformer-based solutions across various real-world time series forecasting datasets. Our code and data are available at https://github.com/cure-lab/SCINet.
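    A simplified sketch of one downsample-convolve-interact step; the published SCINet uses a more elaborate interaction (including multiplicative modulation), so this additive variant is only illustrative:

    ```python
    import torch
    import torch.nn as nn

    class SCIBlock(nn.Module):
        """One downsample-convolve-interact step (simplified): split the
        sequence into even/odd sub-sequences, transform each with a small
        conv net, and let the two branches exchange information."""
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            def conv():
                return nn.Sequential(
                    nn.Conv1d(channels, channels, kernel_size,
                              padding=kernel_size // 2),
                    nn.LeakyReLU(),
                )
            self.phi, self.psi = conv(), conv()

        def forward(self, x):  # x: (B, C, T), T even
            even, odd = x[..., ::2], x[..., 1::2]   # downsample into two halves
            odd = odd + self.phi(even)              # interaction, branch 1
            even = even + self.psi(odd)             # interaction, branch 2
            return even, odd                        # recurse on each half

    block = SCIBlock(channels=8)
    even, odd = block(torch.randn(2, 8, 64))
    ```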
    Utilizing supervised models to infer consensus labels and their quality from data with multiple annotators. (arXiv:2210.06812v1 [cs.LG])
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to estimate: (1) A consensus label for each example that aggregates the individual annotations (more accurately than aggregation via majority-vote or other algorithms used in crowdsourcing); (2) A confidence score for how likely each consensus label is correct (via well-calibrated estimates that account for the number of annotations for each example and their agreement, prediction-confidence from a trained classifier, and trustworthiness of each annotator vs. the classifier); (3) A rating for each annotator quantifying the overall correctness of their labels. While many algorithms have been proposed to estimate related quantities in crowdsourcing, these often rely on sophisticated generative models with iterative inference schemes, whereas CROWDLAB is based on simple weighted ensembling. Many algorithms also rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB in contrast utilizes any classifier model trained on these features, which can generalize between examples with similar features. In evaluations on real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than many alternative algorithms.
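    A toy sketch of the weighted-ensembling idea; the actual CROWDLAB weighting is derived from estimated annotator and classifier trustworthiness, so the uniform weights below are placeholders:

    ```python
    import numpy as np

    def consensus(annotations, clf_probs, annotator_weight, clf_weight):
        """Combine per-annotator votes with a classifier's predicted class
        probabilities via simple weighted ensembling (illustrative)."""
        n_examples, n_classes = clf_probs.shape
        votes = np.zeros((n_examples, n_classes))
        for j, labels in enumerate(annotations):   # one label array per annotator
            mask = labels >= 0                     # -1 marks "not annotated"
            votes[mask, labels[mask]] += annotator_weight[j]
        scores = votes + clf_weight * clf_probs
        scores /= scores.sum(axis=1, keepdims=True)
        return scores.argmax(axis=1), scores.max(axis=1)  # labels, confidences

    # 3 annotators, 4 examples, 2 classes
    ann = [np.array([0, 1, -1, 1]), np.array([0, -1, 1, 1]), np.array([1, 1, 0, -1])]
    probs = np.array([[0.9, 0.1], [0.3, 0.7], [0.6, 0.4], [0.2, 0.8]])
    labels, conf = consensus(ann, probs, annotator_weight=np.ones(3), clf_weight=1.0)
    ```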
    CLASP: Few-Shot Cross-Lingual Data Augmentation for Semantic Parsing. (arXiv:2210.07074v1 [cs.CL])
    A bottleneck to developing Semantic Parsing (SP) models is the need for a large volume of human-labeled training data. Given the complexity and cost of human annotation for SP, labeled data is often scarce, particularly in multilingual settings. Large Language Models (LLMs) excel at SP given only a few examples; however, LLMs are unsuitable for runtime systems, which require low latency. In this work, we propose CLASP, a simple method to improve low-resource SP for moderate-sized models: we generate synthetic data from AlexaTM 20B to augment the training set for a model 40x smaller (500M parameters). We evaluate on two datasets in low-resource settings: English PIZZA, containing either 348 or 16 real examples, and mTOP cross-lingual zero-shot, where training data is available only in English and the model must generalize to four new languages. On both datasets, we show significant improvements over strong baseline methods.
    Observed Adversaries in Deep Reinforcement Learning. (arXiv:2210.06787v1 [cs.LG])
    In this work, we point out the problem of observed adversaries for deep policies. Specifically, recent work has shown that deep reinforcement learning is susceptible to adversarial attacks where an observed adversary acts under environmental constraints to invoke natural but adversarial observations. This setting is particularly relevant for human-robot interaction (HRI), since robots are expected to perform their tasks around and with other agents. In this work, we demonstrate that this effect persists even with low-dimensional observations. We further show that these adversarial attacks transfer across victims, potentially allowing malicious attackers to train an adversary without access to the target victim.
    HoechstGAN: Virtual Lymphocyte Staining Using Generative Adversarial Networks. (arXiv:2210.06909v1 [cs.CV])
    The presence and density of specific types of immune cells are important to understand a patient's immune response to cancer. However, the immunofluorescence staining required to identify T cell subtypes is expensive, time-consuming, and rarely performed in clinical settings. We present a framework to virtually stain Hoechst images (which are cheap and widespread) with both CD3 and CD8 to identify T cell subtypes in clear cell renal cell carcinoma using generative adversarial networks. Our proposed method jointly learns both staining tasks, incentivising the network to incorporate mutually beneficial information from each task. We devise a novel metric to quantify the virtual staining quality, and use it to evaluate our method.
    Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data. (arXiv:2210.07082v1 [cs.LG])
    The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v4 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.
    ConvTransSeg: A Multi-resolution Convolution-Transformer Network for Medical Image Segmentation. (arXiv:2210.07072v1 [cs.CV])
    Convolutional neural networks (CNNs) have achieved state-of-the-art performance in medical image segmentation due to their ability to extract highly complex feature representations. However, recent studies argue that traditional CNNs lack the ability to capture long-term dependencies between different image regions. Following the success of Transformer models on natural language processing tasks, the medical image segmentation field has also witnessed growing interest in utilizing Transformers, owing to their ability to capture long-range contextual information. However, unlike CNNs, Transformers lack the ability to learn local feature representations. Thus, to fully exploit the advantages of both CNNs and Transformers, we propose a hybrid encoder-decoder segmentation model (ConvTransSeg). It consists of a multi-layer CNN as the encoder for feature learning and a corresponding multi-level Transformer as the decoder for segmentation prediction, with the encoder and decoder interconnected in a multi-resolution manner. We compared our method with many other state-of-the-art hybrid CNN-Transformer segmentation models on binary and multi-class image segmentation tasks using several public medical image datasets, including skin lesion, polyp, cell, and brain tissue. The experimental results show that our method achieves the best overall performance in terms of Dice coefficient and average symmetric surface distance, with low model complexity and memory consumption. In contrast to most of the Transformer-based methods we compared against, our method does not require pre-trained models to achieve similar or better performance. The code is freely available for research purposes on GitHub: (the link will be added upon acceptance).
    Multi-Target XGBoostLSS Regression. (arXiv:2210.06831v1 [cs.LG])
    Current implementations of Gradient Boosting Machines are mostly designed for single-target regression tasks and commonly assume independence between responses when used in multivariate settings. As such, these models are not well suited if non-negligible dependencies exist between targets. To overcome this limitation, we present an extension of XGBoostLSS that models multiple targets and their dependencies in a probabilistic regression setting. Empirical results show that our approach outperforms existing GBMs with respect to runtime and compares well in terms of accuracy.
    ROS-PyBullet Interface: A Framework for Reliable Contact Simulation and Human-Robot Interaction. (arXiv:2210.06887v1 [cs.RO])
    Reliable contact simulation plays a key role in the development of (semi-)autonomous robots, especially when dealing with contact-rich manipulation scenarios, an active robotics research topic. Besides simulation, components such as sensing, perception, data collection, robot hardware control, human interfaces, etc. are all key enablers towards applying machine learning algorithms or model-based approaches in real world systems. However, there is a lack of software connecting reliable contact simulation with the larger robotics ecosystem (i.e. ROS, Orocos), for a more seamless application of novel approaches, found in the literature, to existing robotic hardware. In this paper, we present the ROS-PyBullet Interface, a framework that provides a bridge between the reliable contact/impact simulator PyBullet and the Robot Operating System (ROS). Furthermore, we provide additional utilities for facilitating Human-Robot Interaction (HRI) in the simulated environment. We also present several use-cases that highlight the capabilities and usefulness of our framework. Please check our video, source code, and examples included in the supplementary material. Our full code base is open source and can be found at https://github.com/cmower/ros_pybullet_interface.
    Sample-Then-Optimize Batch Neural Thompson Sampling. (arXiv:2210.06850v1 [cs.LG])
    Bayesian optimization (BO), which uses a Gaussian process (GP) as a surrogate to model its objective function, is popular for black-box optimization. However, due to the limitations of GPs, BO underperforms in some problems such as those with categorical, high-dimensional or image inputs. To this end, recent works have used the highly expressive neural networks (NNs) as the surrogate model and derived theoretical guarantees using the theory of neural tangent kernel (NTK). However, these works suffer from the limitations of the requirement to invert an extremely large parameter matrix and the restriction to the sequential (rather than batch) setting. To overcome these limitations, we introduce two algorithms based on the Thompson sampling (TS) policy named Sample-Then-Optimize Batch Neural TS (STO-BNTS) and STO-BNTS-Linear. To choose an input query, we only need to train an NN (resp. a linear model) and then choose the query by maximizing the trained NN (resp. linear model), which is equivalently sampled from the GP posterior with the NTK as the kernel function. As a result, our algorithms sidestep the need to invert the large parameter matrix yet still preserve the validity of the TS policy. Next, we derive regret upper bounds for our algorithms with batch evaluations, and use insights from batch BO and NTK to show that they are asymptotically no-regret under certain conditions. Finally, we verify their empirical effectiveness using practical AutoML and reinforcement learning experiments.
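    A loose illustration of the sample-then-optimize recipe, with a small scikit-learn regressor standing in for the paper's NN surrogate: each freshly initialized model trained on the observed data plays the role of one posterior sample, and maximizing the trained model over candidates yields one batch member:

    ```python
    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X_obs = rng.uniform(-1, 1, size=(20, 2))            # queries observed so far
    y_obs = -np.sum(X_obs**2, axis=1) + 0.05 * rng.normal(size=20)
    candidates = rng.uniform(-1, 1, size=(500, 2))      # discretized input domain

    batch = []
    for seed in range(4):                               # batch of 4 suggestions
        # a fresh random init on each draw stands in for a posterior sample
        surrogate = MLPRegressor(hidden_layer_sizes=(64,), random_state=seed,
                                 max_iter=2000).fit(X_obs, y_obs)
        batch.append(candidates[np.argmax(surrogate.predict(candidates))])
    ```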
    Personalized Federated Hypernetworks for Privacy Preservation in Multi-Task Reinforcement Learning. (arXiv:2210.06820v1 [cs.LG])
    Multi-Agent Reinforcement Learning currently focuses on implementations where all data and training can be centralized on one machine. But what if local agents are split across multiple tasks and need to keep data private between one another? We develop the first application of Personalized Federated Hypernetworks (PFH) to Reinforcement Learning (RL). We then present a novel application of PFH to few-shot transfer, and demonstrate significant initial increases in learning. PFH has never been demonstrated beyond supervised learning benchmarks, so we apply PFH to an important domain: RL price-setting for energy demand response. We consider a general case where agents are split across multiple microgrids and energy consumption data must be kept private within each microgrid. Together, our work explores how the fields of personalized federated learning and RL can come together to make learning efficient across multiple tasks while keeping data secure.
    DE-FAKE: Detection and Attribution of Fake Images Generated by Text-to-Image Diffusion Models. (arXiv:2210.06998v1 [cs.CR])
    Diffusion models have emerged to establish the new state of the art in visual generation. In particular, text-to-image diffusion models, which generate images from caption descriptions, have attracted increasing attention owing to their user controllability. Despite encouraging performance, they intensify concerns about fake-image misuse and put new pressure on fake-image detection. In this work, we pioneer a systematic study of the authenticity of fake images generated by text-to-image diffusion models. In particular, we conduct comprehensive studies from two perspectives unique to text-to-image models: the visual modality and the linguistic modality. For the visual modality, we propose a universal detector, showing that fake images from these text-to-image diffusion models share common cues that enable us to distinguish them from real images. We then propose source attribution, which reveals the uniqueness of the fingerprint held by each diffusion model and can be used to attribute each fake image to its model source. A variety of ablation and analysis studies further interpret the improvements from each of our proposed methods. For the linguistic modality, we delve deeper into a comprehensive analysis of the impact of text captions (called prompt analysis) on the image authenticity of text-to-image diffusion models, and relate these impacts to the detection and attribution performance of fake images. All findings contribute to the community's insight into the natural properties of text-to-image diffusion models, and we appeal to the community to consider countermeasures, like ours, against rapidly evolving fake-image generators.
    Corneal endothelium assessment in specular microscopy images with Fuchs' dystrophy via deep regression of signed distance maps. (arXiv:2210.07102v1 [eess.IV])
    Specular microscopy assessment of the human corneal endothelium (CE) in Fuchs' dystrophy is challenging due to the presence of dark image regions called guttae. This paper proposes a UNet-based segmentation approach that requires minimal post-processing and achieves reliable CE morphometric assessment and guttae identification across all degrees of Fuchs' dystrophy. We cast the segmentation problem as a regression task of the cell and gutta signed distance maps instead of a pixel-level classification task as typically done with UNets. Compared to the conventional UNet classification approach, the distance-map regression approach converges faster in clinically relevant parameters. It also produces morphometric parameters that agree with the manually segmented ground-truth data, namely an average cell density difference of -41.9 cells/mm² (95% confidence interval (CI) [-306.2, 222.5]) and an average difference of mean cell area of 14.8 µm² (95% CI [-41.9, 71.5]). These results suggest a promising alternative for CE assessment.
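    A short sketch of the regression target: the signed distance map of a binary mask, positive inside and negative outside, computed with a standard distance transform. The construction itself is conventional; its use for cells and guttae follows the paper:

    ```python
    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def signed_distance_map(mask):
        """Signed distance map of a binary mask: positive inside the object,
        negative outside, zero on the boundary."""
        inside = distance_transform_edt(mask)        # distance to background
        outside = distance_transform_edt(1 - mask)   # distance to foreground
        return inside - outside

    mask = np.zeros((64, 64), dtype=np.uint8)
    mask[20:40, 20:40] = 1                 # toy "cell"
    sdm = signed_distance_map(mask)        # regression target for the UNet
    ```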
    NoMorelization: Building Normalizer-Free Models from a Sample's Perspective. (arXiv:2210.06932v1 [cs.LG])
    The normalizing layer has become one of the basic configurations of deep learning models, but it still suffers from computational inefficiency, interpretability difficulties, and low generality. After examining recent normalization and normalizer-free research from a sample's perspective, we show that the problem lies in the sampling noise and an inappropriate prior assumption. In this paper, we propose a simple and effective alternative to normalization called "NoMorelization". NoMorelization is composed of two trainable scalars and a zero-centered noise injector. Experimental results demonstrate that NoMorelization is a general component for deep learning, suitable for different model paradigms (e.g., convolution-based and attention-based models) and different tasks (e.g., discriminative and generative tasks). Compared with existing mainstream normalizers (e.g., BN, LN, and IN) and state-of-the-art normalizer-free methods, NoMorelization shows the best speed-accuracy trade-off.
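    A sketch of the described component, assuming the obvious reading of "two trainable scalars and a zero-centered noise injector"; the noise scale is an assumed hyperparameter:

    ```python
    import torch
    import torch.nn as nn

    class NoMorelization(nn.Module):
        """Drop-in alternative to a normalization layer (sketch): a trainable
        scale and bias plus zero-centered noise injected only in training."""
        def __init__(self, noise_std=0.1):
            super().__init__()
            self.alpha = nn.Parameter(torch.ones(1))    # trainable scale
            self.beta = nn.Parameter(torch.zeros(1))    # trainable bias
            self.noise_std = noise_std

        def forward(self, x):
            out = self.alpha * x + self.beta
            if self.training:                           # zero-centered noise
                out = out + torch.randn_like(out) * self.noise_std
            return out

    layer = NoMorelization()
    y = layer(torch.randn(4, 16, 8, 8))   # e.g. in place of a BatchNorm layer
    ```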
    CORL: Research-oriented Deep Offline Reinforcement Learning Library. (arXiv:2210.07105v1 [cs.LG])
    CORL is an open-source library that provides single-file implementations of deep offline reinforcement learning algorithms. It emphasizes a simple development experience with a straightforward codebase and a modern analysis tracking tool. In CORL, we isolate each method's implementation in a distinct single file, making performance-relevant details easier to recognise. Additionally, an experiment tracking feature is available to help log metrics, hyperparameters, dependencies, and more to the cloud. Finally, we have ensured the reliability of the implementations by benchmarking them on the commonly employed D4RL benchmark. The source code can be found at https://github.com/tinkoff-ai/CORL
    Effective Class-Imbalance learning based on SMOTE and Convolutional Neural Networks. (arXiv:2209.00653v2 [cs.LG] UPDATED)
    Imbalanced Data (ID) is a problem that prevents Machine Learning (ML) models from achieving satisfactory results. ID arises when the number of samples belonging to one class outnumbers that of the other by a wide margin, biasing the learning process of such models towards the majority class. In recent years, several solutions have been put forward to address this issue, opting either to synthetically generate new minority-class samples or to reduce the number of majority-class samples to balance the data. Hence, in this paper, we investigate the effectiveness of methods based on Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs), mixed with a variety of well-known imbalanced-data solutions, namely oversampling and undersampling. To evaluate our methods, we used the KEEL, breast cancer, and Z-Alizadeh Sani datasets. To achieve reliable results, we ran our experiments 100 times with randomly shuffled data distributions. The classification results demonstrate that the mixed Synthetic Minority Oversampling Technique (SMOTE)-Normalization-CNN outperforms the other methodologies, achieving 99.08% accuracy on the 24 imbalanced datasets. Therefore, the proposed mixed model can be applied to imbalanced binary classification problems on other real datasets.
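    A minimal sketch of the SMOTE-plus-normalization preprocessing stage on synthetic imbalanced data, using the imbalanced-learn package as one possible implementation (the downstream CNN is omitted; dataset shapes are illustrative):

    ```python
    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.preprocessing import StandardScaler

    rng = np.random.default_rng(0)
    X = rng.normal(size=(300, 20))
    y = np.r_[np.zeros(270, dtype=int), np.ones(30, dtype=int)]  # 9:1 imbalance

    X = StandardScaler().fit_transform(X)          # normalization step
    X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
    # X_bal/y_bal now have equal class counts and can be fed to a CNN/DNN
    print(np.bincount(y_bal))                      # [270 270]
    ```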
    Few-shot Relational Reasoning via Connection Subgraph Pretraining. (arXiv:2210.06722v1 [cs.LG])
    Few-shot knowledge graph (KG) completion aims to perform inductive reasoning over the KG: given only a few support triplets of a new relation $\bowtie$ (e.g., (chop, $\bowtie$, kitchen), (read, $\bowtie$, library)), the goal is to predict the query triplets of the same unseen relation $\bowtie$, e.g., (sleep, $\bowtie$, ?). Current approaches cast the problem in a meta-learning framework, where the model needs to be first jointly trained over many training few-shot tasks, each defined by its own relation, so that learning/prediction on the target few-shot task can be effective. However, in real-world KGs, curating many training tasks is a challenging ad hoc process. Here we propose Connection Subgraph Reasoner (CSR), which can make predictions for the target few-shot task directly, without pre-training on a human-curated set of training tasks. The key to CSR is that we explicitly model a shared connection subgraph between support and query triplets, as inspired by the principle of eliminative induction. To adapt to a specific KG, we design a corresponding self-supervised pretraining scheme with the objective of reconstructing automatically sampled connection subgraphs. Our pretrained model can then be directly applied to target few-shot tasks without the need for curated training few-shot tasks. Extensive experiments on real KGs, including NELL, FB15K-237, and ConceptNet, demonstrate the effectiveness of our framework: we show that even a learning-free implementation of CSR can already perform competitively with existing methods on target few-shot tasks; with pretraining, CSR achieves significant gains of up to 52% on the more challenging inductive few-shot tasks where the entities are also unseen during (pre)training.
    Mitigating Unintended Memorization in Language Models via Alternating Teaching. (arXiv:2210.06772v1 [cs.CL])
    Recent research has shown that language models have a tendency to memorize rare or unique sequences in their training corpora, which can thus leak sensitive attributes of user data. We employ a teacher-student framework and propose a novel approach called alternating teaching to mitigate unintended memorization in sequential modeling. In our method, multiple teachers are trained on disjoint training sets whose privacy one wishes to protect, and the teachers' predictions supervise the training of a student model in an alternating manner at each time step. Experiments on the LibriSpeech dataset show that the proposed method achieves better privacy-preserving results than its counterparts. Compared with no prevention of unintended memorization, the overall utility loss is small when training records are sufficient.
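    A schematic training step for the alternating-teaching idea: teachers pre-trained on disjoint private shards take turns supervising the student. Distilling via KL divergence on soft predictions is an assumption of this sketch, not necessarily the paper's exact objective:

    ```python
    import torch
    import torch.nn.functional as F

    def alternating_teaching_step(student, teachers, batch, step, opt):
        """One student update supervised by a single teacher, chosen in a
        round-robin (alternating) fashion across steps (sketch)."""
        teacher = teachers[step % len(teachers)]      # alternate supervisor
        with torch.no_grad():
            soft_targets = F.softmax(teacher(batch), dim=-1)
        log_probs = F.log_softmax(student(batch), dim=-1)
        loss = F.kl_div(log_probs, soft_targets, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()
    ```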
    Evaluating the Label Efficiency of Contrastive Self-Supervised Learning for Multi-Resolution Satellite Imagery. (arXiv:2210.06786v1 [eess.IV])
    The application of deep neural networks to remote sensing imagery is often constrained by the lack of ground-truth annotations. Addressing this issue requires models that generalize efficiently from limited amounts of labeled data, allowing us to tackle a wider range of Earth observation tasks. Another challenge in this domain is developing algorithms that operate at variable spatial resolutions, e.g., for the problem of classifying land use at different scales. Recently, self-supervised learning has been applied in the remote sensing domain to exploit readily available unlabeled data, and was shown to reduce or even close the gap with supervised learning. In this paper, we study self-supervised visual representation learning through the lens of label efficiency, for the task of land use classification on multi-resolution/multi-scale satellite images. We benchmark two contrastive self-supervised methods adapted from Momentum Contrast (MoCo) and provide evidence that these methods can perform effectively given little downstream supervision, where randomly initialized networks fail to generalize. Moreover, they outperform out-of-domain pretraining alternatives. We use the large-scale fMoW dataset to pretrain and evaluate the networks, and validate our observations with transfer to the RESISC45 dataset.
    Self-explaining deep models with logic rule reasoning. (arXiv:2210.07024v1 [cs.AI])
    We present SELOR, a framework for integrating self-explaining capabilities into a given deep model to achieve both high prediction performance and human precision. By "human precision", we refer to the degree to which humans agree with the reasons models provide for their predictions. Human precision affects user trust and allows users to collaborate closely with the model. We demonstrate that logic rule explanations naturally satisfy human precision with the expressive power required for good predictive performance. We then illustrate how to enable a deep model to predict and explain with logic rules. Our method does not require predefined logic rule sets or human annotations and can be learned efficiently and easily with widely-used deep learning modules in a differentiable way. Extensive experiments show that our method gives explanations closer to human decision logic than other methods while maintaining the performance of deep learning models.
    Causality-driven Hierarchical Structure Discovery for Reinforcement Learning. (arXiv:2210.06964v1 [cs.LG])
    Hierarchical reinforcement learning (HRL) effectively improves agents' exploration efficiency on tasks with sparse reward, guided by high-quality hierarchical structures (e.g., subgoals or options). However, automatically discovering high-quality hierarchical structures remains a great challenge. Previous HRL methods can hardly discover hierarchical structures in complex environments, owing to the low efficiency of their randomness-driven exploration paradigm. To address this issue, we propose CDHRL, a causality-driven hierarchical reinforcement learning framework that leverages causality-driven discovery instead of randomness-driven exploration to effectively build high-quality hierarchical structures in complicated environments. The key insight is that the causalities among environment variables are naturally suited to modeling reachable subgoals and their dependencies, and can effectively guide the construction of high-quality hierarchical structures. Results in two complex environments, 2D-Minecraft and Eden, show that CDHRL significantly boosts exploration efficiency with the causality-driven paradigm.
    On the Efficient Implementation of High Accuracy Optimality of Profile Maximum Likelihood. (arXiv:2210.06728v1 [stat.ML])
    We provide an efficient unified plug-in approach for estimating symmetric properties of distributions given $n$ independent samples. Our estimator is based on profile-maximum-likelihood (PML) and is sample optimal for estimating various symmetric properties when the estimation error $\epsilon \gg n^{-1/3}$. This result improves upon the previous best accuracy threshold of $\epsilon \gg n^{-1/4}$ achievable by polynomial time computable PML-based universal estimators [ACSS21, ACSS20]. Our estimator reaches a theoretical limit for universal symmetric property estimation as [Han21] shows that a broad class of universal estimators (containing many well known approaches including ours) cannot be sample optimal for every $1$-Lipschitz property when $\epsilon \ll n^{-1/3}$.
    Multi-agent Dynamic Algorithm Configuration. (arXiv:2210.06835v1 [cs.LG])
    Automated algorithm configuration relieves users from tedious, trial-and-error tuning tasks. A popular algorithm configuration tuning paradigm is dynamic algorithm configuration (DAC), in which an agent learns dynamic configuration policies across instances by reinforcement learning (RL). However, in many complex algorithms, there may exist different types of configuration hyperparameters, and such heterogeneity may bring difficulties for classic DAC which uses a single-agent RL policy. In this paper, we aim to address this issue and propose multi-agent DAC (MA-DAC), with one agent working for one type of configuration hyperparameter. MA-DAC formulates the dynamic configuration of a complex algorithm with multiple types of hyperparameters as a contextual multi-agent Markov decision process and solves it by a cooperative multi-agent RL (MARL) algorithm. To instantiate, we apply MA-DAC to a well-known optimization algorithm for multi-objective optimization problems. Experimental results show the effectiveness of MA-DAC in not only achieving superior performance compared with other configuration tuning approaches based on heuristic rules, multi-armed bandits, and single-agent RL, but also being capable of generalizing to different problem classes. Furthermore, we release the environments in this paper as a benchmark for testing MARL algorithms, with the hope of facilitating the application of MARL.
    Diffusion models as plug-and-play priors. (arXiv:2206.09012v2 [cs.LG] UPDATED)
    We consider the problem of inferring high-dimensional data $\mathbf{x}$ in a model that consists of a prior $p(\mathbf{x})$ and an auxiliary differentiable constraint $c(\mathbf{x},\mathbf{y})$ on $\mathbf{x}$ given some additional information $\mathbf{y}$. In this paper, the prior is an independently trained denoising diffusion generative model. The auxiliary constraint is expected to have a differentiable form, but can come from diverse sources. The possibility of such inference turns diffusion models into plug-and-play modules, thereby allowing a range of potential applications in adapting models to new domains and tasks, such as conditional generation or image segmentation. The structure of diffusion models allows us to perform approximate inference by iterating differentiation through the fixed denoising network enriched with different amounts of noise at each step. Considering many noised versions of $\mathbf{x}$ in evaluation of its fitness is a novel search mechanism that may lead to new algorithms for solving combinatorial optimization problems.
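    A loose sketch of the inference loop this enables, under assumed interfaces (`denoiser(x_t, t)` predicts the added noise; `constraint(x)` returns a differentiable penalty); the schedule and weighting are illustrative, not the paper's exact procedure:

    ```python
    import torch

    def plug_and_play_infer(denoiser, constraint, x_shape, steps=100, lr=0.1):
        """Iteratively nudge x so that (i) the frozen denoiser, applied to
        noised versions of x, keeps x near the prior, and (ii) the
        differentiable constraint is satisfied (sketch)."""
        x = torch.randn(x_shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for t in reversed(range(steps)):
            noise = torch.randn_like(x)
            sigma = (t + 1) / steps
            x_t = x + sigma * noise                       # re-noise the estimate
            prior_loss = (denoiser(x_t, t) - noise).pow(2).mean()
            loss = prior_loss + constraint(x)
            opt.zero_grad()
            loss.backward()
            opt.step()
        return x.detach()
    ```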
    Fast Optimization of Weighted Sparse Decision Trees for use in Optimal Treatment Regimes and Optimal Policy Design. (arXiv:2210.06825v1 [cs.LG])
    Sparse decision trees are one of the most common forms of interpretable models. While recent advances have produced algorithms that fully optimize sparse decision trees for prediction, that work does not address policy design, because the algorithms cannot handle weighted data samples. Specifically, they rely on the discreteness of the loss function, which means that real-valued weights cannot be directly used. For example, none of the existing techniques produce policies that incorporate inverse propensity weighting on individual data points. We present three algorithms for efficient sparse weighted decision tree optimization. The first approach directly optimizes the weighted loss function; however, it tends to be computationally inefficient for large datasets. Our second approach, which scales more efficiently, transforms weights to integer values and uses data duplication to transform the weighted decision tree optimization problem into an unweighted (but larger) counterpart. Our third algorithm, which scales to much larger datasets, uses a randomized procedure that samples each data point with a probability proportional to its weight. We present theoretical bounds on the error of the two fast methods and show experimentally that these methods can be two orders of magnitude faster than the direct optimization of the weighted loss, without losing significant accuracy.
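    Minimal sketches of the second and third approaches on toy data: rounding weights to integers and duplicating rows, and subsampling points with probability proportional to weight. The scale factor and weight ranges are illustrative:

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] > 0).astype(int)
    w = rng.uniform(0.1, 3.0, size=len(X))   # e.g. inverse propensity weights

    # Approach 2: round weights to integers and duplicate rows, turning the
    # weighted problem into an unweighted (but larger) one.
    k = np.maximum(1, np.round(w * 10).astype(int))   # scale factor 10 chosen here
    X_dup, y_dup = np.repeat(X, k, axis=0), np.repeat(y, k)

    # Approach 3: keep each point with probability proportional to its weight.
    keep = rng.random(len(X)) < w / w.max()
    X_sub, y_sub = X[keep], y[keep]
    ```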
    The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks. (arXiv:2110.03922v4 [cs.LG] UPDATED)
    We derive a simple unified framework giving closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. These improvements are enabled by our identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al (2020), ii) generalize a previous result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics.
    An Experiment Design Paradigm using Joint Feature Selection and Task Optimization. (arXiv:2210.06891v1 [cs.LG])
    This paper presents a subsampling-task paradigm for data-driven, task-specific experiment design (ED) and a novel method for population-wide supervised feature selection (FS). Optimal ED, the choice of sampling points under constraints of limited acquisition time, arises in a wide variety of scientific and engineering contexts. However, the continuous optimization used in classical approaches depends on a priori parameter choices and challenging non-convex optimization landscapes. This paper proposes to replace this strategy with a subsampling-task paradigm, analogous to population-wide supervised FS. In particular, we introduce JOFSTO, which performs JOint Feature Selection and Task Optimization. JOFSTO jointly optimizes two coupled networks: one for feature scoring, which provides the ED, and the other for execution of a downstream task or process. Unlike most FS problems, e.g. selecting protein expressions for classification, ED problems typically select from highly correlated, globally informative candidates rather than seeking a small number of highly informative features among many uninformative ones. JOFSTO's construction efficiently identifies potentially correlated but effective subsets and returns a trained task network. We demonstrate the approach using parameter estimation and mapping problems in quantitative MRI, where economical ED is crucial for clinical application. Results from simulations and empirical data show the subsampling-task paradigm strongly outperforms classical ED, and, within our paradigm, JOFSTO outperforms state-of-the-art supervised FS techniques. JOFSTO extends immediately to wider image-based ED problems and other scenarios where the design must be specified globally across large numbers of acquisitions. Code will be released.
    GA-SAM: Gradient-Strength based Adaptive Sharpness-Aware Minimization for Improved Generalization. (arXiv:2210.06895v1 [cs.LG])
    Recently, the Sharpness-Aware Minimization (SAM) algorithm has shown state-of-the-art generalization abilities in vision tasks, demonstrating that flat minima tend to imply better generalization. However, it is difficult to apply SAM to some natural language tasks, especially to models with drastic gradient changes, such as RNNs. In this work, we analyze the relation between the flatness of a local minimum and its generalization ability from a novel and straightforward theoretical perspective. We propose that the shift between the training and test distributions can be equivalently seen as a virtual parameter corruption or perturbation, which explains why flat minima that are robust against parameter corruptions or perturbations generalize better. On this basis, we propose a Gradient-Strength based Adaptive Sharpness-Aware Minimization (GA-SAM) algorithm to help learning algorithms find flat minima that generalize better. Results on various language benchmarks validate the effectiveness of the proposed GA-SAM algorithm on natural language tasks.
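    For reference, a generic SAM update step (sketch); GA-SAM's contribution, adapting the perturbation radius to local gradient strength, is omitted here, and `loss_fn(model, batch)` is an assumed interface. The sketch assumes every parameter receives a gradient:

    ```python
    import torch

    def sam_step(model, loss_fn, batch, opt, rho=0.05):
        """Generic SAM: ascend to the worst nearby point in parameter space,
        take the gradient there, then update the original parameters."""
        loss_fn(model, batch).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        with torch.no_grad():                       # perturb by rho * g / ||g||
            for p, g in zip(model.parameters(), grads):
                p.add_(rho * g / (norm + 1e-12))
        opt.zero_grad()
        loss_fn(model, batch).backward()            # gradient at perturbed point
        with torch.no_grad():                       # undo the perturbation
            for p, g in zip(model.parameters(), grads):
                p.sub_(rho * g / (norm + 1e-12))
        opt.step()                                  # apply the SAM gradient
        opt.zero_grad()
    ```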
    Scalable Neural Video Representations with Learnable Positional Features. (arXiv:2210.06823v1 [cs.CV])
    Succinct representation of complex signals using coordinate-based neural representations (CNRs) has seen great progress, and several recent efforts focus on extending them to handle videos. Here, the main challenge is how to (a) alleviate the compute inefficiency of training CNRs while (b) achieving high-quality video encoding and (c) maintaining parameter efficiency. To meet all three requirements (a), (b), and (c) simultaneously, we propose neural video representations with learnable positional features (NVP), a novel CNR that introduces "learnable positional features" to effectively amortize a video as latent codes. Specifically, we first present a CNR architecture based on designing 2D latent keyframes that learn the common video contents across each spatio-temporal axis, which dramatically improves all three requirements. Then, we propose to utilize existing powerful image and video codecs as a compute-/memory-efficient compression procedure for the latent codes. We demonstrate the superiority of NVP on the popular UVG benchmark; compared with prior art, NVP not only trains 2 times faster (less than 5 minutes) but also exceeds their encoding quality, improving PSNR from 34.07 to 34.57, even while using more than 8 times fewer parameters. We also show intriguing properties of NVP, e.g., video inpainting, video frame interpolation, etc.
    Generalization Bounds with Minimal Dependency on Hypothesis Class via Distributionally Robust Optimization. (arXiv:2106.11180v4 [math.OC] UPDATED)
    Established approaches to obtain generalization bounds in data-driven optimization and machine learning mostly build on solutions from empirical risk minimization (ERM), which depend crucially on the functional complexity of the hypothesis class. In this paper, we present an alternate route to obtain these bounds on the solution from distributionally robust optimization (DRO), a recent data-driven optimization framework based on worst-case analysis and the notion of an ambiguity set to capture statistical uncertainty. In contrast to the hypothesis class complexity in ERM, our DRO bounds depend on the ambiguity set geometry and its compatibility with the true loss function. Notably, when using statistical distances such as maximum mean discrepancy, Wasserstein distance, or $\phi$-divergence in the DRO, our analysis implies generalization bounds whose dependence on the hypothesis class appears to be the minimal possible: the bound depends solely on the true loss function, independent of any other candidates in the hypothesis class. To the best of our knowledge, it is the first generalization bound of this type in the literature, and we hope our findings can open the door to a better understanding of DRO, especially its benefits for loss minimization and other machine learning applications.
    Towards End-to-End Open Conversational Machine Reading. (arXiv:2210.07113v1 [cs.CL])
    In the open-retrieval conversational machine reading (OR-CMR) task, machines are required to do multi-turn question answering given dialogue history and a textual knowledge base. Existing works generally utilize two independent modules to approach this problem's two successive sub-tasks: first, hard-label decision making, and second, question generation aided by various entailment reasoning methods. Such cascaded modeling is vulnerable to error propagation and prevents the two sub-tasks from being consistently optimized. In this work, we instead model OR-CMR as a unified text-to-text task in a fully end-to-end style. Experiments on the OR-ShARC dataset show the effectiveness of our proposed end-to-end framework on both sub-tasks by a large margin, achieving new state-of-the-art results. Further ablation studies support that our framework can generalize to different backbone models.
    Learning Driving Policies for End-to-End Autonomous Driving. (arXiv:2210.06758v1 [cs.RO])
    Humans tend to drive vehicles efficiently by relying on contextual and spatial information received through the sensory organs. Inspired by this, most research focuses on how to learn robust and efficient driving policies. These works are mostly categorized as building modular or end-to-end systems for learning driving policies. However, the former approach has limited scalability due to the manual supervision required for specific modules. In this work, we focus on the latter approach and formalize a framework for learning driving policies for end-to-end autonomous driving. Taking inspiration from human driving, we propose a framework that incorporates three RGB cameras (left, right, and center) to mimic the human field of view, together with top-down semantic information for contextual representation, to predict driving policies for autonomous driving. The sensor information is fused and encoded by a self-attention mechanism, followed by an auto-regressive waypoint prediction module. The proposed method's efficacy is experimentally evaluated using the CARLA simulator; it outperforms the state-of-the-art methods by achieving the highest driving score at evaluation time.
    Reliable quantum kernel classification using fewer circuit evaluations. (arXiv:2210.06971v1 [quant-ph])
    Quantum kernel methods are a candidate for quantum speed-ups in supervised machine learning. The number of quantum measurements $N$ required for a reasonable kernel estimate is a critical resource, both from complexity considerations and because of the constraints of near-term quantum hardware. We emphasize that for classification tasks, the aim is accurate classification and not accurate kernel evaluation, and demonstrate that the former is more resource efficient. In general, the uncertainty in the quantum kernel, arising from finite sampling, leads to misclassifications over some kernel instantiations. We introduce a suitable performance metric that characterizes the robustness or reliability of classification over a dataset, and obtain a bound for $N$ which ensures, with high probability, that classification errors over a dataset are bounded by the margin errors of an idealized quantum kernel classifier. Using techniques of robust optimization, we then show that the number of quantum measurements can be significantly reduced by a robust formulation of the original support vector machine. We consider the SWAP test and the GATES test quantum circuits for kernel evaluations, and show that the SWAP test is always less reliable than the GATES test for any $N$. Our strategy is applicable to uncertainty in quantum kernels arising from {\em any} source of noise, although we only consider the statistical sampling noise in our analysis.
    Weighted Distillation with Unlabeled Examples. (arXiv:2210.06711v1 [cs.LG])
    Distillation with unlabeled examples is a popular and powerful method for training deep neural networks in settings where the amount of labeled data is limited: A large ''teacher'' neural network is trained on the labeled data available, and then it is used to generate labels on an unlabeled dataset (typically much larger in size). These labels are then utilized to train the smaller ''student'' model which will actually be deployed. Naturally, the success of the approach depends on the quality of the teacher's labels, since the student could be confused if trained on inaccurate data. This paper proposes a principled approach for addressing this issue based on a ''debiasing'' reweighting of the student's loss function tailored to the distillation training paradigm. Our method is hyper-parameter free, data-agnostic, and simple to implement. We demonstrate significant improvements on popular academic datasets and we accompany our results with a theoretical analysis which rigorously justifies the performance of our method in certain settings.
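    A sketch of a per-example reweighted distillation loss; how the debiasing weights are computed is the paper's contribution and is not reproduced here, so they enter this sketch as a given input:

    ```python
    import torch
    import torch.nn.functional as F

    def weighted_distill_loss(student_logits, teacher_logits, weights):
        """Per-example KL distillation loss, with each unlabeled example's
        contribution scaled by a weight intended to debias the student
        against teacher mistakes (sketch; weights supplied externally)."""
        soft = F.softmax(teacher_logits, dim=-1)
        per_example = F.kl_div(F.log_softmax(student_logits, dim=-1),
                               soft, reduction="none").sum(dim=-1)
        return (weights * per_example).mean()

    loss = weighted_distill_loss(torch.randn(8, 10), torch.randn(8, 10),
                                 weights=torch.rand(8))
    ```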
    Policy Gradient With Serial Markov Chain Reasoning. (arXiv:2210.06766v1 [cs.LG])
    We introduce a new framework that performs decision-making in reinforcement learning (RL) as an iterative reasoning process. We model agent behavior as the steady-state distribution of a parameterized reasoning Markov chain (RMC), optimized with a new tractable estimate of the policy gradient. We perform action selection by simulating the RMC for enough reasoning steps to approach its steady-state distribution. We show our framework has several useful properties that are inherently missing from traditional RL. For instance, it allows agent behavior to approximate any continuous distribution over actions by parameterizing the RMC with a simple Gaussian transition function. Moreover, the number of reasoning steps to reach convergence can scale adaptively with the difficulty of each action selection decision and can be accelerated by re-using past solutions. Our resulting algorithm achieves state-of-the-art performance in popular Mujoco and DeepMind Control benchmarks, both for proprioceptive and pixel-based tasks.
    LION: Latent Point Diffusion Models for 3D Shape Generation. (arXiv:2210.06978v1 [cs.CV])
    Denoising diffusion models (DDMs) have shown promising results in 3D point cloud synthesis. To advance 3D DDMs and make them useful for digital artists, we require (i) high generation quality, (ii) flexibility for manipulation and applications such as conditional synthesis and shape interpolation, and (iii) the ability to output smooth surfaces or meshes. To this end, we introduce the hierarchical Latent Point Diffusion Model (LION) for 3D shape generation. LION is set up as a variational autoencoder (VAE) with a hierarchical latent space that combines a global shape latent representation with a point-structured latent space. For generation, we train two hierarchical DDMs in these latent spaces. The hierarchical VAE approach boosts performance compared to DDMs that operate on point clouds directly, while the point-structured latents are still ideally suited for DDM-based modeling. Experimentally, LION achieves state-of-the-art generation performance on multiple ShapeNet benchmarks. Furthermore, our VAE framework allows us to easily use LION for different relevant tasks: LION excels at multimodal shape denoising and voxel-conditioned synthesis, and it can be adapted for text- and image-driven 3D generation. We also demonstrate shape autoencoding and latent shape interpolation, and we augment LION with modern surface reconstruction techniques to generate smooth 3D meshes. We hope that LION provides a powerful tool for artists working with 3D shapes due to its high-quality generation, flexibility, and surface reconstruction. Project page and code: https://nv-tlabs.github.io/LION.
    Data augmentation on-the-fly and active learning in data stream classification. (arXiv:2210.06873v1 [cs.LG])
    There is an emerging need for predictive models to be trained on-the-fly, since in numerous machine learning applications data arrive in an online fashion. A critical challenge is the limited availability of ground-truth information (e.g., labels in classification tasks) as new data are observed one-by-one online, while another significant challenge is class imbalance. This work introduces the novel Augmented Queues method, which addresses this dual problem by combining, in a synergistic manner, online active learning, data augmentation, and a multi-queue memory that maintains separate and balanced queues for each class. We perform an extensive experimental study using image and time-series augmentations, in which we examine the roles of the active learning budget, memory size, imbalance level, and neural network type. We demonstrate two major advantages of Augmented Queues. First, it does not reserve additional memory space, as the generation of synthetic data occurs only at training time. Second, learning models have access to more labelled data without the need to increase the active learning budget and/or the original memory size. Learning on-the-fly poses major challenges which typically hinder the deployment of learning models. Augmented Queues significantly improves performance in terms of learning quality and speed. Our code is made publicly available.
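    A toy sketch of the multi-queue memory with on-the-fly augmentation: class-balanced bounded queues, with `augment` as a user-supplied function (identity here); the structure is illustrative, not the authors' code:

    ```python
    import random
    from collections import deque

    class AugmentedQueues:
        """One bounded queue per class keeps the memory balanced; synthetic
        variants are generated only at training time via `augment`, so no
        extra memory is reserved for them (sketch)."""
        def __init__(self, num_classes, per_class_capacity, augment):
            self.queues = [deque(maxlen=per_class_capacity)
                           for _ in range(num_classes)]
            self.augment = augment

        def add(self, x, y):                 # called when a label is acquired
            self.queues[y].append(x)

        def training_batch(self, per_class=4):
            batch = []
            for y, q in enumerate(self.queues):
                for x in random.sample(list(q), min(per_class, len(q))):
                    batch.append((self.augment(x), y))   # augment on-the-fly
            return batch

    aq = AugmentedQueues(num_classes=2, per_class_capacity=50,
                         augment=lambda x: x)            # identity augmentation
    ```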
    Improving Out-of-Distribution Generalization by Adversarial Training with Structured Priors. (arXiv:2210.06807v1 [cs.LG])
    Deep models often fail to generalize well in test domains when the data distribution differs from that in the training domain. Among numerous approaches to address this Out-of-Distribution (OOD) generalization problem, there has been a growing surge of interest in exploiting Adversarial Training (AT) to improve OOD performance. Recent works have revealed that the robust model obtained by conducting sample-wise AT also retains transferability to biased test domains. In this paper, we empirically show that sample-wise AT has limited improvement on OOD performance. Specifically, we find that AT can only maintain performance at smaller scales of perturbation, while Universal AT (UAT) is more robust to larger-scale perturbations. This provides us with clues that adversarial perturbations with universal (low-dimensional) structures can enhance the robustness against the large data distribution shifts that are common in OOD scenarios. Inspired by this, we propose two AT variants with low-rank structures to train OOD-robust models. Extensive experiments on the DomainBed benchmark show that our proposed approaches outperform Empirical Risk Minimization (ERM) and sample-wise AT. Our code is available at https://github.com/NOVAglow646/NIPS22-MAT-and-LDAT-for-OOD.
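    A sketch of one plausible low-rank variant: a universal perturbation parameterized as the product of two thin matrices, shared across the whole batch. The shapes, the tanh squashing, and the scale are illustrative assumptions rather than the paper's exact construction:

    ```python
    import torch

    def low_rank_perturbation(B, A, scale=0.05):
        """Low-rank universal perturbation for image batches: the product
        B @ A^T of two thin trainable matrices, broadcast over the batch."""
        return torch.tanh(B @ A.transpose(-1, -2)) * scale   # shape (C, H, W)

    C, H, W, rank = 3, 32, 32, 2
    B = torch.zeros(C, H, rank, requires_grad=True)   # trained by gradient
    A = torch.zeros(C, W, rank, requires_grad=True)   # ascent on the loss
    x = torch.randn(16, C, H, W)
    x_adv = x + low_rank_perturbation(B, A)           # same delta for every sample
    ```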
    Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief. (arXiv:2210.06692v1 [cs.LG])
    Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. While the dynamics model is learned by reusing the static dataset, its generalization ability can promote policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of predicted dynamics and explicitly apply it to penalize reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected trade-off between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over dynamics and evaluate/optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with a guarantee of monotonic improvement under certain conditions. To make this practical, we further devise an offline RL algorithm that approximately finds the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.
    Exploiting Mixed Unlabeled Data for Detecting Samples of Seen and Unseen Out-of-Distribution Classes. (arXiv:2210.06833v1 [cs.LG])
    Out-of-Distribution (OOD) detection is essential in real-world applications, which has attracted increasing attention in recent years. However, most existing OOD detection methods require many labeled In-Distribution (ID) data, causing a heavy labeling cost. In this paper, we focus on the more realistic scenario, where limited labeled data and abundant unlabeled data are available, and these unlabeled data are mixed with ID and OOD samples. We propose the Adaptive In-Out-aware Learning (AIOL) method, in which we employ the appropriate temperature to adaptively select potential ID and OOD samples from the mixed unlabeled data and consider the entropy over them for OOD detection. Moreover, since the test data in realistic applications may contain OOD samples whose classes are not in the mixed unlabeled data (we call them unseen OOD classes), data augmentation techniques are brought into the method to further improve the performance. The experiments are conducted on various benchmark datasets, which demonstrate the superiority of our method.
    Noise can be helpful for variational quantum algorithms. (arXiv:2210.06723v1 [quant-ph])
    Saddle points constitute a crucial challenge for first-order gradient descent algorithms. In classical machine learning, they are avoided, for example, by means of stochastic gradient descent methods. In this work, we provide evidence that the saddle point problem can be naturally avoided in variational quantum algorithms by exploiting the presence of stochasticity. We prove convergence guarantees for the approach and demonstrate its practical functioning with examples. We argue that the natural stochasticity of variational algorithms can be beneficial for avoiding strict saddle points, i.e., those saddle points with at least one negative Hessian eigenvalue. This insight, that certain noise levels can help, is expected to add a new perspective to notions of near-term variational quantum algorithms.
    Feature Reconstruction Attacks and Countermeasures of DNN training in Vertical Federated Learning. (arXiv:2210.06771v1 [cs.LG])
    Federated learning (FL) has increasingly been deployed, in its vertical form, among organizations to facilitate secure collaborative training over siloed data. In vertical FL (VFL), participants hold disjoint features of the same set of sample instances. Among them, only one holds labels. This participant, known as the active party, initiates the training and interacts with the other participants, known as the passive parties. Despite the increasing adoption of VFL, it remains largely unknown if and how the active party can extract feature data from a passive party, especially when training deep neural network (DNN) models. This paper makes the first attempt to study the feature security problem of DNN training in VFL. We consider a DNN model partitioned between active and passive parties, where the latter only holds a subset of the input layer and exhibits some categorical features of binary values. Using a reduction from the Exact Cover problem, we prove that reconstructing those binary features is NP-hard. Through analysis, we demonstrate that, unless the feature dimension is exceedingly large, it remains feasible, both theoretically and practically, to launch a reconstruction attack with an efficient search-based algorithm that prevails over current feature protection techniques. To address this problem, we develop a novel feature protection scheme against the reconstruction attack that effectively misleads the search towards pre-specified random values. With an extensive set of experiments, we show that our protection scheme withstands the feature reconstruction attack in various VFL applications with no loss of accuracy.
    COLLIDER: A Robust Training Framework for Backdoor Data. (arXiv:2210.06704v1 [cs.LG])
    Deep neural network (DNN) classifiers are vulnerable to backdoor attacks. In such attacks, an adversary poisons some of the training data by installing a trigger. The goal is to make the trained DNN output the attacker's desired class whenever the trigger is activated, while performing as usual on clean data. Various approaches have recently been proposed to detect malicious backdoored DNNs. However, a robust, end-to-end training approach, like adversarial training, is yet to be discovered for backdoor-poisoned data. In this paper, we take the first step toward such methods by developing a robust training framework, COLLIDER, that selects the most prominent samples by exploiting the underlying geometric structures of the data. Specifically, we effectively filter out candidate poisoned data at each training epoch by solving a geometrical coreset selection objective. We first argue that clean data samples exhibit (1) gradients similar to the clean majority of data and (2) low local intrinsic dimensionality (LID). Based on these criteria, we define a novel coreset selection objective to find such samples, which are then used for training the DNN. We show the effectiveness of the proposed method for robust training of DNNs on various poisoned datasets, reducing the backdoor success rate significantly.
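    Criterion (2) refers to the classical maximum-likelihood LID estimate computed from nearest-neighbor distances; a minimal sketch, with an illustrative choice of k and plain Euclidean distances in feature space:

        # Sketch of the classical maximum-likelihood LID estimate
        # (Levina-Bickel style) that criterion (2) relies on.
        import numpy as np

        def lid_mle(x, reference, k=10):
            """Estimate local intrinsic dimensionality of point x against a batch."""
            d = np.linalg.norm(reference - x, axis=1)
            d = np.sort(d)[:k]
            d = d[d > 0]                      # drop x itself if present
            return -1.0 / np.mean(np.log(d / d[-1]))

        rng = np.random.default_rng(0)
        batch = rng.normal(size=(500, 32))    # stand-in for penultimate-layer features
        print(lid_mle(batch[0], batch, k=20))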
    Efficient circuit implementation for coined quantum walks on binary trees and application to reinforcement learning. (arXiv:2210.06784v1 [cs.ET])
    Quantum walks on binary trees are used in many quantum algorithms to achieve important speedups over classical algorithms. Formulating this kind of algorithm as a quantum circuit has the advantage of being easily readable, executable on circuit-based quantum computers and simulators, and efficient in its use of resources. We propose a strategy for composing quantum circuits that perform quantum walks on binary trees, following the principles of universal gate model quantum computation. We give particular attention to the NAND formula evaluation algorithm, as it could have many applications in game theory and reinforcement learning. We therefore propose an application of this algorithm and show how it can be used to train a quantum reinforcement learning agent in a two-player game environment.
    An $\alpha$-regret analysis of Adversarial Bilateral Trade. (arXiv:2210.06846v1 [cs.GT])
    We study sequential bilateral trade where sellers' and buyers' valuations are completely arbitrary (i.e., determined by an adversary). Sellers and buyers are strategic agents with private valuations for the good, and the goal is to design a mechanism that maximizes efficiency (or gain from trade) while being incentive compatible, individually rational, and budget balanced. In this paper we consider gain from trade, which is harder to approximate than social welfare. We consider a variety of feedback scenarios and distinguish the cases where the mechanism posts one price from those where it can post different prices for buyer and seller. We show several surprising results about the separation between the different scenarios. In particular, we show that (a) it is impossible to achieve sublinear $\alpha$-regret for any $\alpha<2$; (b) with full feedback, sublinear $2$-regret is achievable; (c) with a single price and partial feedback, one cannot get sublinear $\alpha$-regret for any constant $\alpha$; (d) nevertheless, posting two prices, even with one-bit feedback, achieves sublinear $2$-regret; and (e) there is a provable separation in the $2$-regret bounds between full and partial feedback.
    OpenCQA: Open-ended Question Answering with Charts. (arXiv:2210.06628v1 [cs.LG])
    Charts are very popular for analyzing data and conveying important insights. People often analyze visualizations to answer open-ended questions that require explanatory answers. Answering such questions is often difficult and time-consuming, as it requires substantial cognitive and perceptual effort. To address this challenge, we introduce a new task called OpenCQA, where the goal is to answer an open-ended question about a chart with descriptive texts. We present the annotation process and an in-depth analysis of our dataset. We implement and evaluate a set of baselines under three practical settings. In the first setting, a chart and the accompanying article are provided as input to the model. The second setting provides only the relevant paragraph(s) of the chart instead of the entire article, whereas the third setting requires the model to generate an answer based solely on the chart. Our analysis of the results shows that the top-performing models generally produce fluent and coherent text but struggle to perform complex logical and arithmetic reasoning.
    Fairness via Adversarial Attribute Neighbourhood Robust Learning. (arXiv:2210.06630v1 [cs.LG])
    Improving fairness between privileged and less-privileged sensitive attribute groups (e.g., race, gender) has attracted much attention. To encourage the model to perform uniformly well across different sensitive attribute groups, we propose a principled \underline{R}obust \underline{A}dversarial \underline{A}ttribute \underline{N}eighbourhood (RAAN) loss to debias the classification head and promote a fairer representation distribution across different sensitive attribute groups. The key idea of RAAN is to mitigate the differences in biased representations between different sensitive attribute groups by assigning each sample an adversarial robust weight, which is defined on the representations of adversarial attribute neighbors, i.e., the samples from different protected groups. To provide efficient optimization algorithms, we cast RAAN into a sum of coupled compositional functions and propose a stochastic adaptive (Adam-style) and non-adaptive (SGD-style) algorithmic framework, SCRAAN, with provable theoretical guarantees. Extensive empirical studies on fairness-related benchmark datasets verify the effectiveness of the proposed method.
    SageMix: Saliency-Guided Mixup for Point Clouds. (arXiv:2210.06944v1 [cs.CV])
    Data augmentation is key to improving the generalization ability of deep learning models. Mixup is a simple and widely used data augmentation technique that has proven effective in alleviating the problems of overfitting and data scarcity. Moreover, recent studies of saliency-aware Mixup in the image domain show that preserving discriminative parts is beneficial to generalization performance. However, these Mixup-based data augmentations are underexplored in 3D vision, especially for point clouds. In this paper, we propose SageMix, a saliency-guided Mixup for point clouds that preserves salient local structures. Specifically, we extract salient regions from two point clouds and smoothly combine them into one continuous shape. With simple sequential sampling by re-weighted saliency scores, SageMix preserves the local structure of salient regions. Extensive experiments demonstrate that the proposed method consistently outperforms existing Mixup methods on various benchmark point cloud datasets. With PointNet++, our method achieves accuracy gains of 2.6% and 4.0% over standard training on the 3D Warehouse dataset (MN40) and ScanObjectNN, respectively. In addition to generalization performance, SageMix improves robustness and uncertainty calibration. Moreover, when adopted for various tasks including part segmentation and standard 2D image classification, our method achieves competitive performance.
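    The sequential-sampling idea can be sketched in a few lines: draw points from each cloud with probability proportional to saliency and concatenate. This is a hedged illustration of the general mechanism, not the authors' exact procedure (in particular, real saliency scores would come from gradients of the task loss):

        # Hedged sketch of saliency-weighted point sampling for Mixup on
        # point clouds; saliency scores here are random stand-ins.
        import numpy as np

        def saliency_mixup(pc_a, sal_a, pc_b, sal_b, n_out=1024, lam=0.5, rng=None):
            rng = rng or np.random.default_rng()
            n_a = int(lam * n_out)
            pa = sal_a / sal_a.sum()
            pb = sal_b / sal_b.sum()
            idx_a = rng.choice(len(pc_a), size=n_a, replace=False, p=pa)
            idx_b = rng.choice(len(pc_b), size=n_out - n_a, replace=False, p=pb)
            return np.concatenate([pc_a[idx_a], pc_b[idx_b]], axis=0)

        rng = np.random.default_rng(0)
        a, b = rng.normal(size=(2048, 3)), rng.normal(size=(2048, 3))
        mixed = saliency_mixup(a, rng.random(2048), b, rng.random(2048), rng=rng)
        print(mixed.shape)  # (1024, 3); labels would be mixed with the same lam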
    FASTER-CE: Fast, Sparse, Transparent, and Robust Counterfactual Explanations. (arXiv:2210.06578v1 [cs.LG])
    Counterfactual explanations have substantially increased in popularity in the past few years as a useful human-centric way of understanding individual black-box model predictions. While several properties desired of high-quality counterfactuals have been identified in the literature, three crucial concerns, the speed of explanation generation, robustness/sensitivity, and succinctness of explanations (sparsity), have been relatively unexplored. In this paper, we present FASTER-CE: a novel set of algorithms to generate fast, sparse, and robust counterfactual explanations. The key idea is to efficiently find promising search directions for counterfactuals in a latent space specified via an autoencoder. These directions are determined based on gradients with respect to each of the original input features, as well as the target, as estimated in the latent space. The ability to quickly examine combinations of the most promising gradient directions, as well as to incorporate additional user-defined constraints, allows us to generate multiple counterfactual explanations that are sparse, realistic, and robust to input manipulations. Through experiments on three datasets of varied complexity, we show that FASTER-CE is not only much faster than other state-of-the-art methods for generating multiple explanations but is also significantly superior when considering a larger set of desirable (and often conflicting) properties. Specifically, we present results across multiple performance metrics: sparsity, proximity, validity, speed of generation, and robustness of explanations, to highlight the capabilities of the FASTER-CE family.
    Brain Network Transformer. (arXiv:2210.06681v1 [cs.LG])
    Human brains are commonly modeled as networks of Regions of Interest (ROIs) and their connections for the understanding of brain functions and mental disorders. Recently, Transformer-based models have been studied over different types of data, including graphs, and have been shown to bring broad performance gains. In this work, we study Transformer-based models for brain network analysis. Driven by the unique properties of the data, we model brain networks as graphs with nodes of fixed size and order, which allows us to (1) use connection profiles as node features to provide natural and low-cost positional information and (2) learn pair-wise connection strengths among ROIs with efficient attention weights across individuals that are predictive for downstream analysis tasks. Moreover, we propose an Orthonormal Clustering Readout operation based on self-supervised soft clustering and orthonormal projection. This design accounts for the underlying functional modules that determine similar behaviors among groups of ROIs, leading to distinguishable cluster-aware node embeddings and informative graph embeddings. Finally, we re-standardize the evaluation pipeline on ABIDE, the only publicly available large-scale brain network dataset, to enable meaningful comparison of different models. Experimental results show clear improvements of our proposed Brain Network Transformer on both the public ABIDE and our restricted ABCD datasets. The implementation is available at https://github.com/Wayfear/BrainNetworkTransformer.
    Compute-Efficient Deep Learning: Algorithmic Trends and Opportunities. (arXiv:2210.06640v1 [cs.LG])
    Although deep learning has made great progress in recent years, the exploding economic and environmental costs of training neural networks are becoming unsustainable. To address this problem, there has been a great deal of research on *algorithmically-efficient deep learning*, which seeks to reduce training costs not at the hardware or implementation level, but through changes in the semantics of the training program. In this paper, we present a structured and comprehensive overview of the research in this field. First, we formalize the *algorithmic speedup* problem, then we use fundamental building blocks of algorithmically efficient training to develop a taxonomy. Our taxonomy highlights commonalities of seemingly disparate methods and reveals current research gaps. Next, we present evaluation best practices to enable comprehensive, fair, and reliable comparisons of speedup techniques. To further aid research and applications, we discuss common bottlenecks in the training pipeline (illustrated via experiments) and offer taxonomic mitigation strategies for them. Finally, we highlight some unsolved research challenges and present promising future directions.
    A Mixture of Surprises for Unsupervised Reinforcement Learning. (arXiv:2210.06702v1 [cs.LG])
    Unsupervised reinforcement learning aims at learning a generalist policy in a reward-free manner for fast adaptation to downstream tasks. Most of the existing methods propose to provide an intrinsic reward based on surprise. Maximizing or minimizing surprise drives the agent to either explore or gain control over its environment. However, both strategies rely on a strong assumption: the entropy of the environment's dynamics is either high or low. This assumption may not always hold in real-world scenarios, where the entropy of the environment's dynamics may be unknown. Hence, choosing between the two objectives is a dilemma. We propose a novel yet simple mixture of policies to address this concern, allowing us to optimize an objective that simultaneously maximizes and minimizes the surprise. Concretely, we train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise. Hence, our method does not make assumptions about the entropy of the environment's dynamics. We call our method a $\textbf{M}\text{ixture }\textbf{O}\text{f }\textbf{S}\text{urprise}\textbf{S}$ (MOSS) for unsupervised reinforcement learning. Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS.
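    The mixture mechanism itself is simple enough to sketch: both components share one surprise signal and receive it with opposite signs as intrinsic reward. The surprise proxy below is an illustrative stand-in, not the paper's exact estimator:

        # Minimal sketch of the mixture idea: one component maximizes
        # surprise (exploration), the other minimizes it (control).
        import numpy as np

        def intrinsic_reward(surprise, component):
            """component 0 maximizes surprise; component 1 minimizes it."""
            return surprise if component == 0 else -surprise

        rng = np.random.default_rng(0)
        for episode in range(4):
            component = rng.integers(2)    # pick a mixture component per episode
            surprise = rng.exponential()   # stand-in surprise signal (e.g., NLL)
            print(episode, component, intrinsic_reward(surprise, component))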
    An Additive Autoencoder for Dimension Estimation. (arXiv:2210.06773v1 [cs.LG])
    An additive autoencoder for dimension reduction, composed of a serially performed bias estimation, linear trend estimation, and nonlinear residual estimation, is proposed and analyzed. Computational experiments confirm that an autoencoder of this form, with only a shallow network to encapsulate the nonlinear behavior, is able to identify an intrinsic dimension of a dataset with a low autoencoding error. This observation leads to a comparison of shallow and deep network structures and of how they are trained. We conclude that deeper network structures obtain lower autoencoding errors during the identification of the intrinsic dimension, but that the detected dimension does not change compared to a shallow network.
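    A minimal PyTorch sketch of the additive decomposition, assuming illustrative layer sizes: the reconstruction is the sum of a learned bias, a linear (trend) autoencoder, and a shallow nonlinear autoencoder fit to the remaining residual:

        # Hedged sketch of the additive form: bias + linear trend + shallow
        # nonlinear residual. Layer sizes are illustrative.
        import torch
        import torch.nn as nn

        class AdditiveAutoencoder(nn.Module):
            def __init__(self, d_in, d_latent):
                super().__init__()
                self.bias = nn.Parameter(torch.zeros(d_in))           # bias estimation
                self.lin_enc = nn.Linear(d_in, d_latent, bias=False)  # linear trend
                self.lin_dec = nn.Linear(d_latent, d_in, bias=False)
                self.nl_enc = nn.Sequential(nn.Linear(d_in, d_latent), nn.Tanh())
                self.nl_dec = nn.Linear(d_latent, d_in)               # nonlinear residual

            def forward(self, x):
                centered = x - self.bias
                linear = self.lin_dec(self.lin_enc(centered))
                residual = self.nl_dec(self.nl_enc(centered - linear))
                return self.bias + linear + residual

        model = AdditiveAutoencoder(d_in=64, d_latent=8)
        x = torch.randn(32, 64)
        loss = ((model(x) - x) ** 2).mean()  # autoencoding error
        loss.backward()
        print(loss.item())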
    Empirical Evaluation of Data Augmentations for Biobehavioral Time Series Data with Deep Learning. (arXiv:2210.06701v1 [cs.LG])
    Deep learning has performed remarkably well on many tasks recently. However, the superior performance of deep models relies heavily on the availability of large amounts of training data, which limits the wide adoption of deep models in clinical and affective computing tasks, where labeled data are usually very limited. As an effective technique to increase data variability and thus train deep models with better generalization, data augmentation (DA) is a critical step for the success of deep learning models on biobehavioral time series data. However, the effectiveness of various DAs for different datasets, tasks, and deep models is understudied for biobehavioral time series data. In this paper, we first systematically review eight basic DA methods for biobehavioral time series data and evaluate their effects on seven datasets with three backbones. Next, we explore adapting more recent DA techniques (i.e., automatic augmentation, random augmentation) to biobehavioral time series data by designing a new policy architecture applicable to time series data. Last, we try to answer the question of why a DA is effective (or not) by first summarizing two desired attributes of augmentations (challenging and faithful), and then utilizing two metrics to quantitatively measure these attributes, which can guide the search for more effective DAs for biobehavioral time series data through the design of more challenging yet still faithful transformations. Our code and results are available at Link.
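    For reference, three of the basic DA families reviewed here, jittering, scaling, and segment permutation, reduce to a few lines each; the magnitudes below are illustrative defaults for a (time, channel) array:

        # Three basic time-series augmentations in minimal form, assuming a
        # channel-last array of shape (T, C). Magnitudes are illustrative.
        import numpy as np

        def jitter(x, sigma=0.03, rng=None):
            rng = rng or np.random.default_rng()
            return x + rng.normal(0.0, sigma, size=x.shape)

        def scaling(x, sigma=0.1, rng=None):
            rng = rng or np.random.default_rng()
            return x * rng.normal(1.0, sigma, size=(1, x.shape[1]))  # per-channel scale

        def permutation(x, n_segments=4, rng=None):
            rng = rng or np.random.default_rng()
            segments = np.array_split(x, n_segments, axis=0)
            order = rng.permutation(n_segments)
            return np.concatenate([segments[i] for i in order], axis=0)

        x = np.sin(np.linspace(0, 10, 300)).reshape(-1, 1)  # (T=300, C=1) toy signal
        print(jitter(x).shape, scaling(x).shape, permutation(x).shape)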
    FedDTG: Federated Data-Free Knowledge Distillation via Three-Player Generative Adversarial Networks. (arXiv:2201.03169v2 [cs.LG] UPDATED)
    Applying knowledge distillation to personalized cross-silo federated learning can well alleviate the problem of user heterogeneity. This approach, however, requires a proxy dataset, which is difficult to obtain in the real world. Moreover, a global model based on parameter averaging leads to the leakage of user privacy. We introduce a distributed three-player GAN to implement data-free co-distillation between clients. This technique mitigates the user heterogeneity problem and better protects user privacy. We confirm that the fake samples generated by the GAN can make federated distillation more efficient and robust, and that co-distillation achieves good performance for individual clients while obtaining global knowledge. Our extensive experiments on benchmark datasets demonstrate the superior generalization performance of the proposed methods, compared with the state-of-the-art.
    Parameter-Efficient Masking Networks. (arXiv:2210.06699v1 [cs.LG])
    A deeper network structure generally handles more complicated non-linearity and performs more competitively. Nowadays, advanced network designs often contain a large number of repetitive structures (e.g., the Transformer). They raise network capacity to a new level but also inevitably increase model size, which is unfriendly to both model storage and transfer. In this study, we are the first to investigate the representative potential of fixed random weights with limited unique values by learning diverse masks, and we introduce Parameter-Efficient Masking Networks (PEMN). This also naturally leads to a new paradigm for model compression, diminishing the model size. Concretely, motivated by the repetitive structures in modern neural networks, we utilize one randomly initialized layer, accompanied by different masks, to convey different feature mappings and represent repetitive network modules. The model can therefore be expressed as \textit{one layer} with a bunch of masks, which significantly reduces the model storage cost. Furthermore, we enhance our strategy by learning masks for a model filled by padding a given random weight vector. In this way, our method can further lower the space complexity, especially for models without many repetitive architectures. We validate the potential of PEMN by learning masks on random weights with limited unique values and test its effectiveness for a new compression paradigm based on different network architectures. Code is available at https://github.com/yueb17/PEMN
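    The core ingredient, one frozen random weight tensor reused across layers with a learnable mask per use, can be sketched as follows; the straight-through mask training below is a common technique and an assumption on our part, since the authors' exact recipe may differ:

        # Hedged sketch: one frozen random weight tensor serves several
        # "layers", each with its own learnable binary mask trained with a
        # straight-through estimator (our assumption, not necessarily the
        # authors' exact recipe).
        import torch
        import torch.nn as nn

        class MaskedReuse(nn.Module):
            def __init__(self, d, n_uses):
                super().__init__()
                self.register_buffer("w", torch.randn(d, d))            # frozen weights
                self.scores = nn.Parameter(0.1 * torch.randn(n_uses, d, d))

            def forward(self, x, use_idx):
                hard = (self.scores[use_idx] > 0).float()
                # straight-through: hard mask forward, identity gradient backward
                mask = hard + self.scores[use_idx] - self.scores[use_idx].detach()
                return x @ (self.w * mask).T

        net = MaskedReuse(d=16, n_uses=3)
        x = torch.randn(4, 16)
        h = net(net(x, 0).relu(), 1)   # the same frozen weights serve both "layers"
        h.sum().backward()
        print(net.scores.grad.abs().sum().item() > 0)  # masks receive gradients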
    A Stream Learning Approach for Real-Time Identification of False Data Injection Attacks in Cyber-Physical Power Systems. (arXiv:2210.06729v1 [cs.LG])
    This paper presents a novel data-driven framework to aid system state estimation when the power system is under unobservable false data injection attacks. The proposed framework dynamically detects and classifies false data injection attacks, then retrieves the control signal using the acquired information. This process is accomplished in three main modules, with novel designs, for detection, classification, and control signal retrieval. The detection module monitors historical changes in phasor measurements and captures any deviation pattern caused by an attack on the complex plane. This approach can help reveal characteristics of the attacks, including the direction, magnitude, and ratio of the injected false data. Using this information, the signal retrieval module can easily recover the original control signal and remove the injected false data. Further information regarding the attack type can be obtained through the classifier module. The proposed ensemble learner is compatible with harsh learning conditions, including the lack of labeled data, concept drift, concept evolution, recurring classes, and independence from external updates. The proposed novel classifier can dynamically learn from data and classify attacks under all these harsh learning conditions. The introduced framework is evaluated on real-world data captured from the Central New York Power System. The obtained results indicate the efficacy and stability of the proposed framework.
    An efficient combination strategy for hybrid quantum ensemble classifier. (arXiv:2210.06785v1 [quant-ph])
    Quantum machine learning has shown advantages in many ways compared to classical machine learning. In machine learning, a difficult problem is how to learn a model with high robustness and strong generalization ability from a limited feature space. By combining multiple models as base learners, ensemble learning (EL) can effectively improve the accuracy, generalization ability, and robustness of the final model. The key to EL lies in two aspects: the performance of the base learners and the choice of the combination strategy. Recently, quantum EL (QEL) has been studied. However, existing combination strategies in QEL are inadequate in considering the accuracy of and variance among base learners. This paper presents a hybrid EL framework that combines quantum and classical advantages. More importantly, we propose an efficient combination strategy for improving classification accuracy in this framework. We verify the feasibility and efficiency of our framework and strategy on the MNIST dataset. Simulation results show that the hybrid EL framework with our combination strategy not only achieves higher accuracy and lower variance than a single model without the ensemble, but also achieves better accuracy than the majority-voting and weighted-voting strategies in most cases.
    A Neural Mean Embedding Approach for Back-door and Front-door Adjustment. (arXiv:2210.06610v1 [cs.LG])
    We consider the estimation of average and counterfactual treatment effects under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the "first stage" regression), and then taking the (conditional) expectation of this function as a "second stage" procedure. We propose to compute these conditional expectations directly, using a regression function over the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage be linear. The proposed method is shown to converge to the true causal parameter and outperforms recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data.
    Beyond backpropagation: implicit gradients for bilevel optimization. (arXiv:2205.03076v2 [cs.LG] UPDATED)
    This paper reviews gradient-based techniques for solving bilevel optimization problems. Bilevel optimization is a general way to frame the learning of systems that are implicitly defined through a quantity they minimize. This characterization can be applied to neural networks, optimizers, algorithmic solvers, and even physical systems, and allows for greater modeling flexibility compared to an explicit definition of such systems. Here we focus on gradient-based approaches that solve such problems. We distinguish two categories: those rooted in implicit differentiation, and those that leverage the equilibrium propagation theorem. We present the mathematical foundations behind such methods, introduce the gradient-estimation algorithms in detail, and compare the competitive advantages of the different approaches.
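    The implicit-differentiation category can be illustrated on a toy quadratic where the hypergradient is checkable by hand: for an inner problem $w^*(\lambda) = \arg\min_w \frac{1}{2}w^2 - \lambda w$ and outer loss $L(w^*) = (w^*-3)^2$, the implicit function theorem gives $\mathrm{d}w^*/\mathrm{d}\lambda = 1$ and hence an outer gradient of $2(\lambda-3)$:

        # Minimal numeric sketch of a hypergradient via implicit
        # differentiation on a hand-checkable quadratic bilevel problem.
        def inner_solve(lam):
            return lam                      # closed form: w* = lam

        def hypergradient(lam):
            w = inner_solve(lam)
            dL_dw = 2.0 * (w - 3.0)         # outer gradient at the inner solution
            d2f_dw2 = 1.0                   # Hessian of the inner objective
            d2f_dwdlam = -1.0               # mixed second derivative
            dw_dlam = -d2f_dwdlam / d2f_dw2 # implicit function theorem
            return dL_dw * dw_dlam

        lam = 0.0
        for _ in range(50):                 # outer gradient descent on lam
            lam -= 0.1 * hypergradient(lam)
        print(lam)                          # converges to 3.0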
    Action Matching: A Variational Method for Learning Stochastic Dynamics from Samples. (arXiv:2210.06662v1 [cs.LG])
    Stochastic dynamics are ubiquitous in many fields of science, from the evolution of quantum systems in physics to diffusion-based models in machine learning. Existing methods such as score matching can be used to simulate these physical processes by assuming that the dynamics is a diffusion, which is not always the case. In this work, we propose a method called "Action Matching" that enables us to learn a much broader family of stochastic dynamics. Our method requires access only to samples from different time-steps, makes no explicit assumptions about the underlying dynamics, and can be applied even when samples are uncorrelated (i.e., are not part of a trajectory). Action Matching directly learns an underlying mechanism to move samples in time without modeling the distributions at each time-step. In this work, we showcase how Action Matching can be used for several computer vision tasks such as generative modeling, super-resolution, colorization, and inpainting; and further discuss potential applications in other areas of science.
    Can Calibration Improve Sample Prioritization?. (arXiv:2210.06592v1 [cs.LG])
    Calibration can reduce overconfident predictions of deep neural networks, but can calibration also accelerate training by selecting the right samples? In this paper, we show that it can. We study the effect of popular calibration techniques in selecting better subsets of samples during training (also called sample prioritization) and observe that calibration can improve the quality of subsets, reduce the number of examples per epoch (by at least 70%), and can thereby speed up the overall training process. We further study the effect of using calibrated pre-trained models coupled with calibration during training to guide sample prioritization, which again seems to improve the quality of samples selected.
    Differentially Private Online-to-Batch for Smooth Losses. (arXiv:2210.06593v1 [cs.LG])
    We develop a new reduction that converts any online convex optimization algorithm suffering $O(\sqrt{T})$ regret into an $\epsilon$-differentially private stochastic convex optimization algorithm with the optimal convergence rate $\tilde O(1/\sqrt{T} + \sqrt{d}/(\epsilon T))$ on smooth losses in linear time, forming a direct analogy to the classical non-private "online-to-batch" conversion. By applying our techniques to more advanced adaptive online algorithms, we produce adaptive differentially private counterparts whose convergence rates depend on a priori unknown variances or parameter norms.
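    For readers unfamiliar with the base construction, the classical (non-private) online-to-batch conversion runs an online learner over the samples and returns the average iterate; the sketch below shows that skeleton on a toy least-squares stream and deliberately omits the noise addition that makes the paper's version private:

        # Classical online-to-batch: run online gradient descent over the
        # sample stream and return the running average of the iterates.
        import numpy as np

        def online_to_batch(grad_fn, samples, dim, lr=0.1):
            w = np.zeros(dim)
            avg = np.zeros(dim)
            for t, z in enumerate(samples, start=1):
                avg += (w - avg) / t                   # average of iterates
                w -= lr / np.sqrt(t) * grad_fn(w, z)   # online gradient step
            return avg  # excess risk inherits the online regret bound

        # Toy stream: z = (x, y) with loss 0.5 * (w.x - y)^2.
        rng = np.random.default_rng(0)
        w_true = np.array([1.0, -2.0])
        stream = []
        for _ in range(2000):
            x = rng.normal(size=2)
            stream.append((x, x @ w_true))
        grad = lambda w, z: (w @ z[0] - z[1]) * z[0]
        print(online_to_batch(grad, stream, dim=2))  # approx [1, -2]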
    Variance-Aware Estimation of Kernel Mean Embedding. (arXiv:2210.06672v1 [math.ST])
    An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed-up convergence by leveraging variance information in the RKHS. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We illustrate our methods in the context of hypothesis testing and robust parametric estimation.
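    For concreteness, the empirical KME is just the average of kernel feature maps, and the RKHS distance between two empirical KMEs is the (biased) MMD; a minimal sketch with an illustrative Gaussian-kernel bandwidth:

        # Empirical kernel mean embeddings compared via the (biased) MMD.
        import numpy as np

        def gaussian_kernel(X, Y, bandwidth=1.0):
            sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
            return np.exp(-sq / (2 * bandwidth ** 2))

        def mmd2(X, Y, bandwidth=1.0):
            """Squared RKHS distance between the empirical KMEs of X and Y."""
            return (gaussian_kernel(X, X, bandwidth).mean()
                    - 2 * gaussian_kernel(X, Y, bandwidth).mean()
                    + gaussian_kernel(Y, Y, bandwidth).mean())

        rng = np.random.default_rng(0)
        X = rng.normal(0.0, 1.0, size=(200, 2))
        Y = rng.normal(0.5, 1.0, size=(200, 2))
        print(mmd2(X, X.copy()), mmd2(X, Y))  # 0 for identical samples, larger for shifted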
    STG-GAN: A spatiotemporal graph generative adversarial networks for short-term passenger flow prediction in urban rail transit systems. (arXiv:2202.06727v2 [cs.LG] UPDATED)
    Short-term passenger flow prediction is an important but challenging task for better managing urban rail transit (URT) systems. Some emerging deep learning models provide good insights for improving short-term prediction accuracy. However, there exist many complex spatiotemporal dependencies in URT systems. Most previous methods only consider the absolute error between ground truth and predictions as the optimization objective, which fails to account for spatial and temporal constraints on the predictions. Furthermore, a large number of existing prediction models introduce complex neural network layers to improve accuracy while ignoring their training efficiency and memory occupancy, decreasing their chances of being applied in the real world. To overcome these limitations, we propose a novel deep learning-based spatiotemporal graph generative adversarial network (STG-GAN) model with higher prediction accuracy, higher efficiency, and lower memory occupancy to predict short-term passenger flows of the URT network. Our model consists of two major parts, which are optimized in an adversarial learning manner: (1) a generator network including gated temporal convolutional networks (TCN) and weight-sharing graph convolution networks (GCN) to capture structural spatiotemporal dependencies and generate predictions with a relatively small computational burden; (2) a discriminator network including a spatial discriminator and a temporal discriminator to enhance the spatial and temporal constraints of the predictions. STG-GAN is evaluated on two large-scale real-world datasets from the Beijing Subway. A comparison with several state-of-the-art models illustrates its superiority and robustness. This study can provide critical experience in conducting short-term passenger flow predictions, especially from the perspective of real-world applications.
    Continual Learning In Environments With Polynomial Mixing Times. (arXiv:2112.07066v2 [cs.LG] UPDATED)
    The mixing time of the Markov chain induced by a policy limits performance in real-world continual learning scenarios. Yet, the effect of mixing times on learning in continual reinforcement learning (RL) remains underexplored. In this paper, we characterize problems that are of long-term interest to the development of continual RL, which we call scalable MDPs, through the lens of mixing times. In particular, we theoretically establish that scalable MDPs have mixing times that scale polynomially with the size of the problem. We go on to demonstrate that polynomial mixing times present significant difficulties for existing approaches, which suffer from myopic bias and stale bootstrapped estimates. To validate our theory, we study the empirical scaling behavior of mixing times with respect to the number of tasks and task duration for high performing policies deployed across multiple Atari games. Our analysis demonstrates both that polynomial mixing times do emerge in practice and how their existence may lead to unstable learning behavior like catastrophic forgetting in continual learning settings.
    Hybrid RL: Using Both Offline and Online Data Can Make RL Efficient. (arXiv:2210.06718v1 [cs.LG])
    We consider a hybrid reinforcement learning setting (Hybrid RL), in which an agent has access to an offline dataset and the ability to collect experience via real-world online interaction. The framework mitigates the challenges that arise in both pure offline and online RL settings, allowing for the design of simple and highly effective algorithms, in both theory and practice. We demonstrate these advantages by adapting the classical Q-learning/iteration algorithm to the hybrid setting, which we call Hybrid Q-Learning or Hy-Q. In our theoretical results, we prove that the algorithm is both computationally and statistically efficient whenever the offline dataset supports a high-quality policy and the environment has bounded bilinear rank. Notably, we require no assumptions on the coverage provided by the initial distribution, in contrast with guarantees for policy gradient/iteration methods. In our experimental results, we show that Hy-Q with neural network function approximation outperforms state-of-the-art online, offline, and hybrid RL baselines on challenging benchmarks, including Montezuma's Revenge.
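    The central mechanic of Hy-Q-style hybrid training can be sketched in a few lines: each update draws part of its minibatch from the fixed offline dataset and part from the online replay buffer (the 50/50 split below is an illustrative choice, not the paper's prescription):

        # Hybrid minibatch sampling: mix logged offline transitions with
        # freshly collected online transitions for each Q-update.
        import random

        def hybrid_minibatch(offline_data, online_buffer, batch_size=32):
            k = batch_size // 2
            batch = random.sample(offline_data, k)
            batch += random.sample(online_buffer, min(k, len(online_buffer)))
            return batch  # list of (state, action, reward, next_state) transitions

        offline_data = [(s, 0, 0.0, s + 1) for s in range(1000)]  # logged transitions
        online_buffer = [(s, 1, 1.0, s + 1) for s in range(64)]   # fresh interaction
        print(len(hybrid_minibatch(offline_data, online_buffer)))  # 32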
    Interpreting Neural Policies with Disentangled Tree Representations. (arXiv:2210.06650v1 [cs.LG])
    Compact neural networks used in policy learning and closed-loop end-to-end control learn representations from data that encapsulate agent dynamics and, potentially, the agent-environment's factors of variation. A formal and quantitative understanding and interpretation of these explanatory factors in neural representations is difficult to achieve due to the complex and intertwined correspondence of neural activities with emergent behaviors. In this paper, we design a new algorithm that programmatically extracts tree representations from compact neural policies, in the form of a set of logic programs grounded by the world state. To assess how well networks uncover the dynamics of the task and their factors of variation, we introduce interpretability metrics that measure the disentanglement of learned neural dynamics from the perspectives of decision concentration, mutual information, and modularity. Moreover, our method allows us to quantify how accurate the extracted decision paths (explanations) are and to compute cross-neuron logic conflict. We demonstrate the effectiveness of our approach with several types of compact network architectures on a series of end-to-end learning-to-control tasks.
    Semi-Supervised Offline Reinforcement Learning with Action-Free Trajectories. (arXiv:2210.06518v1 [cs.LG])
    Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop a simple meta-algorithmic pipeline that learns an inverse-dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful -- on several D4RL benchmarks \cite{fu2020d4rl}, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10\% of the trajectories from the low-return regime. Finally, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
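    A compact sketch of the meta-algorithmic pipeline, with illustrative model sizes and toy dynamics: fit an inverse-dynamics model $a = f(s, s')$ on the labelled trajectories, then proxy-label the action-free ones before handing everything to an off-the-shelf offline RL algorithm:

        # Inverse-dynamics proxy-labeling, sketched with a toy setup where
        # the true action is the first two coordinates of s' - s.
        import torch
        import torch.nn as nn

        d_state, d_action = 8, 2
        inv_dyn = nn.Sequential(nn.Linear(2 * d_state, 64), nn.ReLU(),
                                nn.Linear(64, d_action))
        opt = torch.optim.Adam(inv_dyn.parameters(), lr=1e-3)

        # Labelled data: (s, a, s') triplets.
        s = torch.randn(4096, d_state)
        s_next = s + 0.1 * torch.randn_like(s)
        a = (s_next - s)[:, :d_action]

        for _ in range(500):  # supervised regression on the labelled trajectories
            opt.zero_grad()
            loss = ((inv_dyn(torch.cat([s, s_next], dim=1)) - a) ** 2).mean()
            loss.backward()
            opt.step()

        # Proxy-label an action-free trajectory, then use any offline RL method.
        s_u = torch.randn(16, d_state)
        s_u_next = s_u + 0.1 * torch.randn_like(s_u)
        proxy_actions = inv_dyn(torch.cat([s_u, s_u_next], dim=1)).detach()
        print(proxy_actions.shape)  # (16, d_action)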
    Wasserstein Barycenter-based Model Fusion and Linear Mode Connectivity of Neural Networks. (arXiv:2210.06671v1 [cs.LG])
    Based on the concepts of the Wasserstein barycenter (WB) and Gromov-Wasserstein barycenter (GWB), we propose a unified mathematical framework for neural network (NN) model fusion and utilize it to reveal new insights about the linear mode connectivity of SGD solutions. In our framework, the fusion occurs in a layer-wise manner and builds on an interpretation of a node in a network as a function of the layer preceding it. The versatility of our mathematical framework allows us to talk about model fusion and linear mode connectivity for a broad class of NNs, including fully connected NNs, CNNs, ResNets, RNNs, and LSTMs, in each case exploiting the specific structure of the network architecture. We present extensive numerical experiments to: 1) illustrate the strengths of our approach in relation to other model fusion methodologies and 2) from a certain perspective, provide new empirical evidence for recent conjectures which say that two local minima found by gradient-based methods end up lying in the same basin of the loss landscape after a proper permutation of weights is applied to one of the models.
    Generalization with Lossy Affordances: Leveraging Broad Offline Data for Learning Visuomotor Tasks. (arXiv:2210.06601v1 [cs.RO])
    The utilization of broad datasets has proven to be crucial for generalization in a wide range of fields. However, how to effectively make use of diverse multi-task data for novel downstream tasks still remains a grand challenge in robotics. To tackle this challenge, we introduce a framework that acquires goal-conditioned policies for unseen temporally extended tasks via offline reinforcement learning on broad data, in combination with online fine-tuning guided by subgoals in a learned lossy representation space. When faced with a novel task goal, the framework uses an affordance model to plan a sequence of lossy representations as subgoals that decompose the original task into easier problems. Learned from the broad data, the lossy representation emphasizes task-relevant information about states and goals while abstracting away redundant contexts that hinder generalization. It thus enables subgoal planning for unseen tasks, provides a compact input to the policy, and facilitates reward shaping during fine-tuning. We show that our framework can be pre-trained on large-scale datasets of robot experience from prior work and efficiently fine-tuned for novel tasks, entirely from visual inputs without any manual reward engineering.
    Towards an Efficient ML System: Unveiling a Trade-off between Task Accuracy and Engineering Efficiency in a Large-scale Car Sharing Platform. (arXiv:2210.06585v1 [cs.CV])
    Given the significant performance of supervised deep neural networks, conventional procedures for developing ML systems are \textit{task-centric}, aiming to maximize task accuracy. However, we observe that this \textit{task-centric} approach lacks engineering efficiency when ML practitioners solve multiple tasks in their domain. To resolve this problem, we propose an \textit{efficiency-centric} ML system that concatenates the numerous datasets, classifiers, out-of-distribution detectors, and prediction tables existing in the practitioners' domain into a single ML pipeline. Under various image recognition tasks in a real-world car-sharing platform, our study illustrates how we established the proposed system and the lessons learned from this journey, as follows. First, the proposed ML system achieves excellent engineering efficiency while maintaining competitive task accuracy. Moreover, compared to the \textit{task-centric} paradigm, we discovered that the \textit{efficiency-centric} ML system yields satisfactory prediction results on multi-labelable samples, which frequently exist in the real world. We analyze these benefits as deriving from the representation power gained by learning a broader label space from the concatenated dataset. Last but not least, our study elaborates how this \textit{efficiency-centric} ML system is deployed in a real-world live cloud environment. Based on these analyses, we highly expect that ML practitioners can utilize our study to elevate engineering efficiency in their domain.
    A Bayesian Optimization Framework for Finding Local Optima in Expensive Multi-Modal Functions. (arXiv:2210.06635v1 [math.OC])
    Bayesian optimization (BO) is a popular global optimization scheme for sample-efficient optimization in domains with expensive function evaluations. The existing BO techniques are capable of finding a single global optimum solution. However, finding a set of global and local optimum solutions is crucial in a wide range of real-world problems, as implementing some of the optimal solutions might not be feasible due to various practical restrictions (e.g., resource limitation, physical constraints, etc.). In such domains, if multiple solutions are known, the implementation can be quickly switched to another solution, and the best possible system performance can still be obtained. This paper develops a multi-modal BO framework to effectively find a set of local/global solutions for expensive-to-evaluate multi-modal objective functions. We consider the standard BO setting with Gaussian process regression representing the objective function. We analytically derive the joint distribution of the objective function and its first-order gradients. This joint distribution is used in the body of the BO acquisition functions to search for local optima during the optimization process. We introduce variants of the well-known BO acquisition functions to the multi-modal setting and demonstrate the performance of the proposed framework in locating a set of local optimum solutions using multiple optimization problems.
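    The key analytical ingredient is a standard Gaussian-process identity: under a zero-mean GP prior with kernel $k$, the function value and its gradient are jointly Gaussian, with cross-covariances given by kernel derivatives, so local optima can be sought where the posterior over $\nabla f$ concentrates at zero:

        \begin{pmatrix} f(\mathbf{x}) \\ \nabla f(\mathbf{x}') \end{pmatrix}
        \sim \mathcal{N}\!\left(\mathbf{0},\;
        \begin{pmatrix}
        k(\mathbf{x},\mathbf{x}) & \nabla_{\mathbf{x}'} k(\mathbf{x},\mathbf{x}')^{\top} \\
        \nabla_{\mathbf{x}'} k(\mathbf{x},\mathbf{x}') & \left.\nabla_{\mathbf{u}} \nabla_{\mathbf{v}}^{\top} k(\mathbf{u},\mathbf{v})\right|_{\mathbf{u}=\mathbf{v}=\mathbf{x}'}
        \end{pmatrix}\right)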
    From Gradient Flow on Population Loss to Learning with Stochastic Gradient Descent. (arXiv:2210.06705v1 [cs.LG])
    Stochastic Gradient Descent (SGD) has been the method of choice for learning large-scale non-convex models. While a general analysis of when SGD works has been elusive, there has been a lot of recent progress in understanding the convergence of Gradient Flow (GF) on the population loss, partly due to the simplicity that a continuous-time analysis buys us. An overarching theme of our paper is providing general conditions under which SGD converges, assuming that GF on the population loss converges. Our main tool to establish this connection is a general converse Lyapunov-like theorem, which implies the existence of a Lyapunov potential under mild assumptions on the rates of convergence of GF. In fact, using these potentials, we show a one-to-one correspondence between rates of convergence of GF and geometrical properties of the underlying objective. When these potentials further satisfy certain self-bounding properties, we show that they can be used to provide a convergence guarantee for Gradient Descent (GD) and SGD (even when the paths of GF and GD/SGD are quite far apart). It turns out that these self-bounding assumptions are in a sense also necessary for GD/SGD to work. Using our framework, we provide a unified analysis for GD/SGD not only in classical settings like convex losses or objectives satisfying PL/KL properties, but also for more complex problems including phase retrieval and matrix square root, extending the results of the recent work of Chatterjee (2022).
    Walk a Mile in Their Shoes: a New Fairness Criterion for Machine Learning. (arXiv:2210.06680v1 [cs.LG])
    The old empathetic adage, ``Walk a mile in their shoes,'' asks that one imagine the difficulties others may face. This suggests a new ML counterfactual fairness criterion, based at a \textit{group} level: how would members of a nonprotected group fare if their group were subject to conditions in some protected group? Instead of asking what sentence a particular Caucasian convict would receive if he were Black, take that notion to entire groups; e.g., how would the average sentence for all White convicts change if they were Black, but with their same White characteristics, e.g., the same number of prior convictions? We frame the problem and study it empirically for different datasets. Our approach is also a solution to the problem of covariate correlation with sensitive attributes.
    Robust Neural Posterior Estimation and Statistical Model Criticism. (arXiv:2210.06564v1 [stat.ML])
    Computer simulations have proven a valuable tool for understanding complex phenomena across the sciences. However, the utility of simulators for modelling and forecasting purposes is often restricted by low data quality, as well as practical limits to model fidelity. In order to circumvent these difficulties, we argue that modellers must treat simulators as idealistic representations of the true data generating process, and consequently should thoughtfully consider the risk of model misspecification. In this work we revisit neural posterior estimation (NPE), a class of algorithms that enable black-box parameter inference in simulation models, and consider the implication of a simulation-to-reality gap. While recent works have demonstrated reliable performance of these methods, the analyses have been performed using synthetic data generated by the simulator model itself, and have therefore only addressed the well-specified case. In this paper, we find that the presence of misspecification, in contrast, leads to unreliable inference when NPE is used naively. As a remedy we argue that principled scientific inquiry with simulators should incorporate a model criticism component, to facilitate interpretable identification of misspecification and a robust inference component, to fit 'wrong but useful' models. We propose robust neural posterior estimation (RNPE), an extension of NPE to simultaneously achieve both these aims, through explicitly modelling the discrepancies between simulations and the observed data. We assess the approach on a range of artificially misspecified examples, and find RNPE performs well across the tasks, whereas naively using NPE leads to misleading and erratic posteriors.
    Subject-specific quantitative susceptibility mapping using patch based deep image priors. (arXiv:2210.06471v1 [eess.IV])
    Quantitative Susceptibility Mapping is a parametric imaging technique for estimating the magnetic susceptibilities of biological tissues from MRI phase measurements. This problem of estimating the susceptibility map is ill-posed. Regularized recovery approaches exploiting signal properties such as smoothness and sparsity improve reconstructions but suffer from over-smoothing artifacts. Deep learning approaches have shown great potential and generate maps with reduced artifacts. However, for reasonable reconstructions and network generalization, they require numerous training datasets, resulting in increased data acquisition time. To overcome this issue, we propose a subject-specific, patch-based, unsupervised learning algorithm to estimate the susceptibility map. We make the problem well-posed by exploiting the redundancies across the patches of the map using a deep convolutional neural network. We formulate the recovery of the susceptibility map as a regularized optimization problem and adopt an alternating minimization strategy to solve it. We test the algorithm on a 3D in vivo dataset and demonstrate, qualitatively and quantitatively, improved reconstructions over competing methods.
    RNNs of RNNs: Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v5 [cs.LG] UPDATED)
    Recurrent neural networks (RNNs) are widely used throughout neuroscience as models of local neural activity. Many properties of single RNNs are well characterized theoretically, but experimental neuroscience has moved in the direction of studying multiple interacting areas, and RNN theory needs to be likewise extended. We take a constructive approach towards this problem, leveraging tools from nonlinear control theory and machine learning to characterize when combinations of stable RNNs will themselves be stable. Importantly, we derive conditions which allow for massive feedback connections between interacting RNNs. We parameterize these conditions for easy optimization using gradient-based techniques, and show that stability-constrained "networks of networks" can perform well on challenging sequential-processing benchmark tasks. Altogether, our results provide a principled approach towards understanding distributed, modular function in the brain.
    Adversarial Attack Against Image-Based Localization Neural Networks. (arXiv:2210.06589v1 [cs.CV])
    In this paper, we present a proof of concept for adversarially attacking the image-based localization module of an autonomous vehicle. The attack aims to cause the vehicle to make wrong navigational decisions and prevent it from reaching a desired predefined destination in a simulated urban environment. A database of rendered images allowed us to train a deep neural network that performs a localization task and to implement, develop, and assess the adversarial pattern. Our tests show that, using this adversarial attack, we can prevent the vehicle from turning at a given intersection. This is done by manipulating the vehicle's navigational module into falsely estimating its current position, so that it fails to initialize the turning procedure until the vehicle misses the last opportunity to perform a safe turn at the intersection.
    Real Spike: Learning Real-valued Spikes for Spiking Neural Networks. (arXiv:2210.06686v1 [cs.NE])
    Brain-inspired spiking neural networks (SNNs) have recently drawn more and more attention due to their event-driven and energy-efficient characteristics. The integrated storage-and-computation paradigm of neuromorphic hardware makes SNNs quite different from Deep Neural Networks (DNNs). In this paper, we argue that on some hardware SNNs may not benefit from the weight-sharing mechanism, which can effectively reduce parameters and improve inference efficiency in DNNs, and we assume that an SNN with unshared convolution kernels could perform better. Motivated by this assumption, a training-inference decoupling method for SNNs named Real Spike is proposed, which enjoys both unshared convolution kernels and binary spikes at inference time while maintaining both shared convolution kernels and real-valued spikes during training. This decoupling mechanism is realized by a re-parameterization technique. Furthermore, based on the training-inference-decoupling idea, a series of different forms for implementing Real Spike at different levels are presented, which also enjoy shared convolutions at inference and are friendly to both neuromorphic and non-neuromorphic hardware platforms. A theoretical proof is given to clarify that a Real Spike-based SNN is superior to its vanilla counterpart. Experimental results show that all the different Real Spike versions consistently improve SNN performance. Moreover, the proposed method outperforms state-of-the-art models on both non-spiking static and neuromorphic datasets.
    Find Your Friends: Personalized Federated Learning with the Right Collaborators. (arXiv:2210.06597v1 [cs.LG])
    In the traditional federated learning setting, a central server coordinates a network of clients to train one global model. However, the global model may serve many clients poorly due to data heterogeneity. Moreover, there may not exist a trusted central party that can coordinate the clients to ensure that each of them can benefit from others. To address these concerns, we present a novel decentralized framework, FedeRiCo, where each client can learn as much or as little from other clients as is optimal for its local data distribution. Based on expectation-maximization, FedeRiCo estimates the utilities of other participants' models on each client's data so that everyone can select the right collaborators for learning. As a result, our algorithm outperforms other federated, personalized, and/or decentralized approaches on several benchmark datasets, being the only approach that consistently performs better than training with local data only.
    Scenario-based Evaluation of Prediction Models for Automated Vehicles. (arXiv:2210.06553v1 [cs.AI])
    To operate safely, an automated vehicle (AV) must anticipate how the environment around it will evolve. For that purpose, it is important to know which prediction models are most appropriate for every situation. Currently, assessment of prediction models is often performed over a set of trajectories without distinction of the type of movement they capture, resulting in the inability to determine the suitability of each model for different situations. In this work we illustrate how standardized evaluation methods result in wrong conclusions regarding a model's predictive capabilities, preventing a clear assessment of prediction models and potentially leading to dangerous on-road situations. We argue that following evaluation practices in safety assessment for AVs, assessment of prediction models should be performed in a scenario-based fashion. To encourage scenario-based assessment of prediction models and illustrate the dangers of improper assessment, we categorize trajectories of the Waymo Open Motion dataset according to the type of movement they capture. Next, three different models are thoroughly evaluated for different trajectory types and prediction horizons. Results show that common evaluation methods are insufficient and the assessment should be performed depending on the application in which the model will operate.
    BLADERUNNER: Rapid Countermeasure for Synthetic (AI-Generated) StyleGAN Faces. (arXiv:2210.06587v1 [cs.CR])
    StyleGAN is the open-sourced TensorFlow implementation made by NVIDIA. It has revolutionized high-quality facial image generation. However, this democratization of Artificial Intelligence / Machine Learning (AI/ML) algorithms has enabled hostile threat actors to establish cyber personas or sock-puppet accounts on social media platforms using these ultra-realistic synthetic faces. This report surveys the relevance of AI/ML with respect to Cyber & Information Operations. The proliferation of AI/ML algorithms has led to a rise in DeepFakes and inauthentic social media accounts. Threats are analyzed within the Strategic and Operational Environments. Methods of identifying synthetic faces exist, but they rely on human beings to visually scrutinize each photo for inconsistencies. However, through use of the DLIB 68-landmark pre-trained file, it is possible to analyze and detect synthetic faces by exploiting repetitive behaviors in StyleGAN images. Project Blade Runner encompasses two scripts necessary to counter StyleGAN images. With PapersPlease.py acting as the analyzer, it is possible to derive indicators-of-attack (IOAs) from scraped image samples. These IOAs can be fed into among_us.py, acting as the detector, to identify synthetic faces from live operational samples. The open-source copy of Blade Runner may lack additional unit tests and some functionality, but it is a redacted version that is far leaner, better optimized, and a proof-of-concept for the information security community. The desired end-state is to incrementally add automation so as to stay on par with its closed-source predecessor.
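    The landmark-based IOA idea can be sketched with the standard dlib API: StyleGAN face crops tend to place the eyes at nearly fixed pixel coordinates, so near-zero variance of eye-landmark positions across a sample of scraped images is a usable indicator. File names and the exact rule below are illustrative, not the project's code:

        # Hedged sketch of a landmark-consistency IOA check using dlib's
        # 68-landmark predictor (landmarks 36-41: left eye, 42-47: right eye).
        import numpy as np
        import dlib

        detector = dlib.get_frontal_face_detector()
        predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

        def eye_centers(path):
            img = dlib.load_rgb_image(path)
            faces = detector(img)
            if not faces:
                return None
            pts = predictor(img, faces[0])
            left = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(36, 42)], axis=0)
            right = np.mean([(pts.part(i).x, pts.part(i).y) for i in range(42, 48)], axis=0)
            return left, right

        paths = ["face_001.png", "face_002.png"]  # hypothetical scraped samples
        centers = [c for p in paths if (c := eye_centers(p)) is not None]
        spread = np.std([np.concatenate(c) for c in centers], axis=0)
        print(spread)  # near-zero spread across many images suggests StyleGAN origin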
    Efficient Deep Unfolding for SISO-OFDM Channel Estimation. (arXiv:2210.06588v1 [cs.IT])
    In modern communication systems, channel state information is of paramount importance to achieving capacity. It is therefore crucial to estimate the channel accurately. SISO-OFDM channel estimation can be performed using sparse recovery techniques. However, this approach relies on a physical wave propagation model to build a dictionary, which requires perfect knowledge of the system's parameters. In this paper, an unfolded neural network is used to relax this constraint. Its architecture, based on a sparse recovery algorithm, allows SISO-OFDM channel estimation even if the system's parameters are not perfectly known. Indeed, its unsupervised online learning allows it to learn the system's imperfections in order to enhance estimation performance. The practicality of the proposed method is improved with respect to the state of the art in two aspects: constrained dictionaries are introduced to reduce sample complexity, and hierarchical search within dictionaries is proposed to reduce time complexity. Finally, the performance of the proposed unfolded network is evaluated and compared to several baselines using realistic channel data, showing the great potential of the approach.
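    The sparse-recovery core that gets unfolded can be illustrated with plain ISTA on $y = Dh + \text{noise}$; in the deep-unfolded network each iteration becomes a layer whose step size, thresholds, and (here) dictionary structure are learned rather than fixed:

        # ISTA for sparse recovery; each iteration corresponds to one layer
        # of a deep-unfolded network with learnable per-layer parameters.
        import numpy as np

        def soft_threshold(x, t):
            return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

        def ista(y, D, lam=0.05, n_iters=100):
            L = np.linalg.norm(D, 2) ** 2          # Lipschitz constant of the gradient
            h = np.zeros(D.shape[1])
            for _ in range(n_iters):                # each iteration = one network layer
                h = soft_threshold(h + D.T @ (y - D @ h) / L, lam / L)
            return h

        rng = np.random.default_rng(0)
        D = rng.normal(size=(64, 128)) / np.sqrt(64)   # stand-in dictionary
        h_true = np.zeros(128)
        h_true[[5, 40, 99]] = [1.0, -0.5, 0.8]
        y = D @ h_true + 0.01 * rng.normal(size=64)
        h_hat = ista(y, D)
        print(np.argsort(-np.abs(h_hat))[:3])          # recovers the support {5, 40, 99}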
    Differentiable WORLD Synthesizer-based Neural Vocoder With Application To End-To-End Audio Style Transfer. (arXiv:2208.07282v3 [eess.AS] UPDATED)
    In this paper, we propose a differentiable WORLD synthesizer and demonstrate its use in end-to-end audio style transfer tasks such as (singing) voice conversion and the DDSP timbre transfer task. Accordingly, our baseline differentiable synthesizer has no model parameters, yet it yields adequate synthesis quality. We can extend the baseline synthesizer by appending lightweight black-box postnets which apply further processing to the baseline output in order to improve fidelity. An alternative differentiable approach considers extraction of the source excitation spectrum directly, which can improve naturalness albeit for a narrower class of style transfer applications. The acoustic feature parameterization used by our approaches has the added benefit that it naturally disentangles pitch and timbral information so that they can be modeled separately. Moreover, as there exists a robust means of estimating these acoustic features from monophonic audio sources, it allows for parameter loss terms to be added to an end-to-end objective function, which can help convergence and/or further stabilize (adversarial) training.
    That's the Wrong Lung! Evaluating and Improving the Interpretability of Unsupervised Multimodal Encoders for Medical Data. (arXiv:2210.06565v1 [cs.LG])
    Pretraining multimodal models on Electronic Health Records (EHRs) provides a means of learning representations that can transfer to downstream tasks with minimal supervision. Recent multimodal models induce soft local alignments between image regions and sentences. This is of particular interest in the medical domain, where alignments might highlight regions in an image relevant to specific phenomena described in free-text. While past work has suggested that attention "heatmaps" can be interpreted in this manner, there has been little evaluation of such alignments. We compare alignments from a state-of-the-art multimodal (image and text) model for EHR with human annotations that link image regions to sentences. Our main finding is that the text often has a weak or unintuitive influence on attention; alignments do not consistently reflect basic anatomical information. Moreover, synthetic modifications -- such as substituting "left" for "right" -- do not substantially influence highlights. Simple techniques such as allowing the model to opt out of attending to the image and few-shot finetuning show promise in terms of their ability to improve alignments with very little or no supervision.
    When does deep learning fail and how to tackle it? A critical analysis on polymer sequence-property surrogate models. (arXiv:2210.06622v1 [cond-mat.mtrl-sci])
    Deep learning models are gaining popularity and potency in predicting polymer properties. These models can be built using pre-existing data and are useful for the rapid prediction of polymer properties. However, the performance of a deep learning model is intricately connected to its topology and the volume of training data. There is no facile protocol available to select a deep learning architecture, and there is a lack of a large volume of homogeneous sequence-property data of polymers. These two factors are the primary bottlenecks for the efficient development of deep learning models. Here we assess the severity of these factors and propose new algorithms to address them. We show that a linear layer-by-layer expansion of a neural network can help in identifying the best neural network topology for a given problem. Moreover, we map the discrete sequence space of a polymer to a continuous one-dimensional latent space using a machine learning pipeline to identify minimal data points for building a universal deep learning model. We implement these approaches for three representative cases of building sequence-property surrogate models, viz., the single-molecule radius of gyration of a copolymer, adhesive free energy of a copolymer, and copolymer compatibilizer, demonstrating the generality of the proposed strategies. This work establishes efficient methods for building universal deep learning models with minimal data and hyperparameters for predicting sequence-defined properties of polymers.
    Anomaly Detection via Federated Learning. (arXiv:2210.06614v1 [cs.LG])
    Machine learning has helped advance the field of anomaly detection by incorporating classifiers and autoencoders to distinguish between normal and anomalous behavior. Additionally, federated learning has provided a way for a global model to be trained with multiple clients' data without requiring the clients to directly share their data. This paper proposes a novel anomaly detector via federated learning to detect malicious network activity on a client's server. In our experiments, we use an autoencoder with a classifier in a federated learning framework to determine if the network activity is benign or malicious. By using our novel min-max scaling and sampling technique, called FedSam, we determined that federated learning allows the global model to learn from each client's data and, in turn, provide a means for each client to improve their intrusion detection system's defense against cyber-attacks.
    Gaussian Processes on Distributions based on Regularized Optimal Transport. (arXiv:2210.06574v1 [stat.ML])
    We present a novel kernel over the space of probability measures based on the dual formulation of optimal regularized transport. We propose a Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are solutions of the dual entropic relaxed optimal transport between the probabilities and a reference measure $\mathcal{U}$. We prove that this construction yields a valid kernel by using the Hilbert norms. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while still being computationally feasible. Moreover, we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. The empirical performance is compared with that of other traditional choices of kernels for processes indexed on distributions.
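    A small numerical sketch of the embedding, under stated assumptions: each discrete measure is mapped to its entropic dual (Sinkhorn) potential with respect to a fixed reference measure $\mathcal{U}$, and a Gaussian kernel is placed on those potential vectors. The regularization strength, bandwidth, and finite reference support are illustrative choices, not the paper's.

```python
# Minimal sketch of the Sinkhorn-potential embedding: each probability measure
# is mapped to its dual potential (w.r.t. a fixed reference measure U)
# evaluated on the support of U, and a Gaussian kernel is placed on these
# embeddings. Regularization and bandwidth values are illustrative.
import numpy as np
from scipy.special import logsumexp

def sinkhorn_potential(X, a, Y, b, eps=0.5, iters=300):
    """Return the dual potential on the reference support Y (log-domain)."""
    C = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)   # squared-distance cost
    f = np.zeros(len(a)); g = np.zeros(len(b))
    for _ in range(iters):                               # Sinkhorn dual updates
        f = -eps * logsumexp((g[None, :] - C) / eps, b=b[None, :], axis=1)
        g = -eps * logsumexp((f[:, None] - C) / eps, b=a[:, None], axis=0)
    return g

rng = np.random.default_rng(0)
Y = rng.standard_normal((20, 2))                         # reference measure U
b = np.full(20, 1 / 20)

def embed(X):
    return sinkhorn_potential(X, np.full(len(X), 1 / len(X)), Y, b)

X1, X2 = rng.standard_normal((30, 2)), 0.5 * rng.standard_normal((25, 2)) + 1.0
g1, g2 = embed(X1), embed(X2)
sigma = 1.0
k12 = np.exp(-np.sum((g1 - g2) ** 2) / (2 * sigma ** 2))  # kernel value
print("k(mu1, mu2) =", k12)
```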
    Sample Constrained Treatment Effect Estimation. (arXiv:2210.06594v1 [cs.LG])
    Treatment effect estimation is a fundamental problem in causal inference. We focus on designing efficient randomized controlled trials, to accurately estimate the effect of some treatment on a population of $n$ individuals. In particular, we study sample-constrained treatment effect estimation, where we must select a subset of $s \ll n$ individuals from the population to experiment on. This subset must be further partitioned into treatment and control groups. Algorithms for partitioning the entire population into treatment and control groups, or for choosing a single representative subset, have been well-studied. The key challenge in our setting is jointly choosing a representative subset and a partition for that set. We focus on both individual and average treatment effect estimation, under a linear effects model. We give provably efficient experimental designs and corresponding estimators, by identifying connections to discrepancy minimization and leverage-score-based sampling used in randomized numerical linear algebra. Our theoretical results obtain a smooth transition to known guarantees when $s$ equals the population size. We also empirically demonstrate the performance of our algorithms.
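    The leverage-score ingredient can be illustrated in a few lines: select the experimental subset with probabilities proportional to row leverage in the covariate matrix. This sketches only the subset-selection step; the paper's designs also choose the treatment/control partition jointly, which the naive split below does not attempt.

```python
# Sketch of leverage-score sampling, one of the RandNLA tools the paper
# connects to: select s of n individuals with probability proportional to
# their statistical leverage in the covariate matrix.
import numpy as np

rng = np.random.default_rng(0)
n, d, s = 1000, 5, 50
X = rng.standard_normal((n, d))                 # covariates of the population

U, _, _ = np.linalg.svd(X, full_matrices=False)
leverage = (U ** 2).sum(axis=1)                 # row leverage scores, sum = d
probs = leverage / leverage.sum()

subset = rng.choice(n, size=s, replace=False, p=probs)
# A simple (non-optimized) partition into treatment and control:
treat, control = subset[: s // 2], subset[s // 2:]
print("selected:", subset[:10], "...")
```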
    DICTDIS: Dictionary Constrained Disambiguation for Improved NMT. (arXiv:2210.06996v1 [cs.CL])
    Domain-specific neural machine translation (NMT) systems (e.g., in educational applications) are socially significant with the potential to help make information accessible to a diverse set of users in multilingual societies. It is desirable that such NMT systems be lexically constrained and draw from domain-specific dictionaries. Dictionaries could present multiple candidate translations for a source word/phrase on account of the polysemous nature of words. The onus is then on the NMT model to choose the contextually most appropriate candidate. Prior work has largely ignored this problem and focused on the single candidate setting where the target word or phrase is replaced by a single constraint. In this work we present DICTDIS, a lexically constrained NMT system that disambiguates between multiple candidate translations derived from dictionaries. We achieve this by augmenting training data with multiple dictionary candidates to actively encourage disambiguation during training. We demonstrate the utility of DICTDIS via extensive experiments on English-Hindi sentences in a variety of domains including news, finance, medicine and engineering. We obtain superior disambiguation performance on all domains, with fluency improvements of up to 4 BLEU points in some domains, when compared with existing approaches for lexically constrained and unconstrained NMT.
    Deep Clustering With Consensus Representations. (arXiv:2210.07063v1 [cs.LG])
    The field of deep clustering combines deep learning and clustering to learn representations that improve both the representation itself and the performance of the considered clustering method. Most existing deep clustering methods are designed for a single clustering method, e.g., k-means, spectral clustering, or Gaussian mixture models, but it is well known that no clustering algorithm works best in all circumstances. Consensus clustering tries to alleviate the individual weaknesses of clustering algorithms by building a consensus between members of a clustering ensemble. Currently, there is no deep clustering method that can include multiple heterogeneous clustering algorithms in an ensemble to update representations and clusterings together. To close this gap, we introduce the idea of a consensus representation that maximizes the agreement between ensemble members. Further, we propose DECCS (Deep Embedded Clustering with Consensus representationS), a deep consensus clustering method that learns a consensus representation by enhancing the embedded space to such a degree that all ensemble members agree on a common clustering result. Our contributions are the following: (1) We introduce the idea of learning consensus representations for heterogeneous clusterings, a novel notion to approach consensus clustering. (2) We propose DECCS, the first deep clustering method that jointly improves the representation and clustering results of multiple heterogeneous clustering algorithms. (3) We show in experiments that learning a consensus representation with DECCS outperforms several relevant baselines from deep clustering and consensus clustering. Our code can be found at https://gitlab.cs.univie.ac.at/lukas/deccs
    Parallel photonic accelerator for decision making using optical spatiotemporal chaos. (arXiv:2210.06976v1 [cs.ET])
    Photonic accelerators have attracted increasing attention in artificial intelligence applications. The multi-armed bandit problem is a fundamental problem of decision making using reinforcement learning. However, the scalability of photonic decision making has not yet been demonstrated in experiments, owing to technical difficulties in physical realization. We propose a parallel photonic decision-making system for solving large-scale multi-armed bandit problems using optical spatiotemporal chaos. We solve a 512-armed bandit problem online, which is two orders of magnitude larger than in previous experiments. The scaling property for correct decision making is examined as a function of the number of slot machines and is evaluated to have an exponent of 0.86. This exponent is smaller than that in previous work, indicating the superiority of the proposed parallel principle. This experimental demonstration paves the way for photonic decision making on large-scale multi-armed bandit problems in future photonic accelerators.
    Precision QCD corrections to gluon-initiated diphoton-plus-jet production at the LHC. (arXiv:2210.07115v1 [hep-ph])
    In this thesis, we present recent advances at the precision frontier of higher-order quantum chromodynamics (QCD) calculations. We consider massless two-loop five-point amplitudes, with a particular focus on diphoton-plus-jet production through gluon fusion. We build a library of infrared functions up to at most next-to-next-to-leading order (NNLO) in QCD, which can be used to validate amplitudes and construct counterterms in subtraction schemes at NNLO. We review progress in the novel use of machine learning technology to optimise the evaluation of amplitudes in hadron collider simulations. We present the full-colour virtual QCD corrections to diphoton-plus-jet production through gluon fusion, discussing the new techniques developed to calculate these non-planar two-loop amplitudes. We use these amplitudes to compute the next-to-leading QCD corrections to the differential cross sections of diphoton-plus-jet production through gluon fusion at the Large Hadron Collider. We also present the leading-colour double-virtual corrections to hadronic trijet production. All derived amplitudes are made available in a public implementation that is ready for further phenomenological application.
    A method to construct exponential families by representation theory. (arXiv:1811.01394v4 [math.ST] UPDATED)
    In this paper, we give a method to construct "good" exponential families systematically by representation theory. More precisely, we consider a homogeneous space $G/H$ as a sample space and construct an exponential family invariant under the transformation group $G$ by using a representation of $G$. The method generates widely used exponential families such as normal, gamma, Bernoulli, categorical, Wishart, von Mises, Fisher-Bingham and hyperboloid distributions.
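    To see the flavor of the construction on one of the listed families (an illustrative reading of the abstract, not the paper's general theorem): for $G=\mathrm{SO}(2)$ acting on the circle $S^1 \cong G/H$, taking the defining representation as the sufficient statistic gives

```latex
% Illustrative instance: G = SO(2) acting on the circle, defining
% representation as sufficient statistic.
\[
  p(x;\theta)\,\propto\,\exp\bigl(\theta_1\cos x+\theta_2\sin x\bigr)
             \,=\,\exp\bigl(\kappa\cos(x-\mu)\bigr),
  \qquad \kappa=\lVert\theta\rVert,\ \mu=\operatorname{atan2}(\theta_2,\theta_1),
\]
% which is the von Mises family named in the abstract. A rotation of the
% sample by an angle alpha maps (mu, kappa) to (mu + alpha, kappa), so the
% family is closed under the action of G.
```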
    Learning Physical Dynamics with Subequivariant Graph Neural Networks. (arXiv:2210.06876v1 [cs.LG])
    Graph Neural Networks (GNNs) have become a prevailing tool for learning physical dynamics. However, they still encounter several challenges: 1) Physical laws abide by symmetry, which is a vital inductive bias accounting for model generalization and should be incorporated into the model design. Existing simulators either consider insufficient symmetry, or enforce excessive equivariance in practice when symmetry is partially broken by gravity. 2) Objects in the physical world possess diverse shapes, sizes, and properties, which should be appropriately processed by the model. To tackle these difficulties, we propose a novel backbone, Subequivariant Graph Neural Network, which 1) relaxes equivariance to subequivariance by considering external fields like gravity, where the universal approximation ability holds theoretically; 2) introduces a new subequivariant object-aware message passing for learning physical interactions between multiple objects of various shapes in the particle-based representation; 3) operates in a hierarchical fashion, allowing for modeling long-range and complex interactions. Our model achieves on average over 3% enhancement in contact prediction accuracy across 8 scenarios on Physion and 2X lower rollout MSE on RigidFall compared with state-of-the-art GNN simulators, while exhibiting strong generalization and data efficiency.
    A Survey on Explainable Anomaly Detection. (arXiv:2210.06959v1 [cs.LG])
    In the past two decades, most research on anomaly detection has focused on improving the accuracy of the detection, while largely ignoring the explainability of the corresponding methods and thus leaving the explanation of outcomes to practitioners. As anomaly detection algorithms are increasingly used in safety-critical domains, providing explanations for the high-stakes decisions made in those domains has become an ethical and regulatory requirement. Therefore, this work provides a comprehensive and structured survey on state-of-the-art explainable anomaly detection techniques. We propose a taxonomy based on the main aspects that characterize each explainable anomaly detection technique, aiming to help practitioners and researchers find the explainable anomaly detection method that best suits their needs.
    Ensemble Creation via Anchored Regularization for Unsupervised Aspect Extraction. (arXiv:2210.06829v1 [cs.CL])
    Aspect-Based Sentiment Analysis is the most granular form of sentiment analysis that can be performed on documents or sentences. Besides delivering the most insights at a finer grain, it also poses equally daunting challenges, one of them being the shortage of labelled data. To bring in value right out of the box for the text data being generated at a very fast pace in today's world, unsupervised aspect-based sentiment analysis allows us to generate insights without investing time or money in generating labels. From topic modelling approaches to recent deep learning-based aspect extraction models, this domain has seen a lot of development. One of the models that we improve upon is ABAE, which reconstructs each sentence as a linear combination of the aspect terms present in it. In this research, we explore how we can use information from another unsupervised model to regularize ABAE, leading to better performance. We contrast it with a baseline rule-based ensemble and show that the ensemble methods work better than the individual models, and that the regularization-based ensemble performs better than the rule-based one.
    Learning Neuro-Symbolic Skills for Bilevel Planning. (arXiv:2206.10680v2 [cs.RO] UPDATED)
    Decision-making is challenging in robotics environments with continuous object-centric states, continuous actions, long horizons, and sparse feedback. Hierarchical approaches, such as task and motion planning (TAMP), address these challenges by decomposing decision-making into two or more levels of abstraction. In a setting where demonstrations and symbolic predicates are given, prior work has shown how to learn symbolic operators and neural samplers for TAMP with manually designed parameterized policies. Our main contribution is a method for learning parameterized policies in combination with operators and samplers. These components are packaged into modular neuro-symbolic skills and sequenced together with search-then-sample TAMP to solve new tasks. In experiments in four robotics domains, we show that our approach -- bilevel planning with neuro-symbolic skills -- can solve a wide range of tasks with varying initial states, goals, and objects, outperforming six baselines and ablations. Video: https://youtu.be/PbFZP8rPuGg Code: https://tinyurl.com/skill-learning
    Large-Scale Open-Set Classification Protocols for ImageNet. (arXiv:2210.06789v1 [cs.CV])
    Open-Set Classification (OSC) intends to adapt closed-set classification models to real-world scenarios, where the classifier must correctly label samples of known classes while rejecting previously unseen unknown samples. Only recently, research has started to investigate algorithms that are able to handle these unknown samples correctly. Some of these approaches address OSC by including into the training set negative samples that a classifier learns to reject, expecting that these data increase the robustness of the classifier on unknown classes. Most of these approaches are evaluated on small-scale and low-resolution image datasets like MNIST, SVHN or CIFAR, which makes it difficult to assess their applicability to the real world, and to compare them among each other. We propose three open-set protocols that provide rich datasets of natural images with different levels of similarity between known and unknown classes. The protocols consist of subsets of ImageNet classes selected to provide training and testing data closer to real-world scenarios. Additionally, we propose a new validation metric that can be employed to assess whether the training of deep learning models addresses both the classification of known samples and the rejection of unknown samples. We use the protocols to compare the performance of two baseline open-set algorithms to the standard SoftMax baseline and find that the algorithms work well on negative samples that have been seen during training, and partially on out-of-distribution detection tasks, but their performance drops in the presence of samples from previously unseen unknown classes.
    Why self-attention is Natural for Sequence-to-Sequence Problems? A Perspective from Symmetries. (arXiv:2210.06741v1 [cs.LG])
    In this paper, we show that structures similar to self-attention are natural for learning many sequence-to-sequence problems from the perspective of symmetry. Inspired by language processing applications, we study the orthogonal equivariance of seq2seq functions with knowledge, which are functions taking two inputs -- an input sequence and a ``knowledge'' -- and outputting another sequence. The knowledge consists of a set of vectors in the same embedding space as the input sequence, containing the information of the language used to process the input sequence. We show that orthogonal equivariance in the embedding space is natural for seq2seq functions with knowledge, and under such equivariance the function must take a form close to self-attention. This shows that network structures similar to self-attention are the right structures to represent the target function of many seq2seq problems. The representation can be further refined if a ``finite information principle'' is considered, or a permutation equivariance holds for the elements of the input sequence.
    TiDAL: Learning Training Dynamics for Active Learning. (arXiv:2210.06788v1 [cs.LG])
    Active learning (AL) aims to select the most useful data samples from an unlabeled data pool and annotate them to expand the labeled dataset under a limited budget. Especially, uncertainty-based methods choose the most uncertain samples, which are known to be effective in improving model performance. However, AL literature often overlooks training dynamics (TD), defined as the ever-changing model behavior during optimization via stochastic gradient descent, even though other areas of literature have empirically shown that TD provides important clues for measuring the sample uncertainty. In this paper, we propose a novel AL method, Training Dynamics for Active Learning (TiDAL), which leverages the TD to quantify uncertainties of unlabeled data. Since tracking the TD of all the large-scale unlabeled data is impractical, TiDAL utilizes an additional prediction module that learns the TD of labeled data. To further justify the design of TiDAL, we provide theoretical and empirical evidence to argue the usefulness of leveraging TD for AL. Experimental results show that our TiDAL achieves better or comparable performance on both balanced and imbalanced benchmark datasets compared to state-of-the-art AL methods, which estimate data uncertainty using only static information after model training.
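    The raw signal TiDAL builds on can be sketched with a few lines of logging: record each sample's true-class probability across epochs and score samples by how low and how unstable it is. The summary statistics below are generic stand-ins; TiDAL's contribution is the module that learns to predict such dynamics for unlabeled data, which is not reproduced here.

```python
# Sketch of the core signal behind training-dynamics-based uncertainty: the
# variability of the true-class probability across epochs. Samples whose
# predictions keep fluctuating are treated as more uncertain.
import numpy as np

rng = np.random.default_rng(0)
epochs, n_samples = 20, 8
# probs[e, i]: model probability of sample i's true class after epoch e
# (logged during training; random numbers stand in for real logs).
probs = np.clip(rng.normal(0.7, 0.2, (epochs, n_samples)), 0.01, 0.99)

mean_conf = probs.mean(axis=0)       # "confidence" over training
variability = probs.std(axis=0)      # fluctuation of predictions

# Rank candidates: low confidence and high variability => informative.
score = (1 - mean_conf) + variability
query_order = np.argsort(-score)
print("acquisition order:", query_order)
```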
    SDW-ASL: A Dynamic System to Generate Large Scale Dataset for Continuous American Sign Language. (arXiv:2210.06791v1 [cs.CL])
    Despite tremendous progress in natural language processing using deep learning techniques in recent years, sign language production and comprehension has advanced very little. One critical barrier is the lack of largescale datasets available to the public due to the unbearable cost of labeled data generation. Efforts to provide public data for American Sign Language (ASL) comprehension have yielded two datasets, comprising more than thousand video clips. These datasets are large enough to enable a meaningful start to deep learning research on sign languages but are far too small to lead to any solution that can be practically deployed. So far, there is still no suitable dataset for ASL production. We proposed a system that can generate large scale ASL datasets for continuous ASL. It is suitable for general ASL processing and is particularly useful for ASL production. The continuous ASL dataset contains English labeled human articulations in condensed body pose data formats. To better serve the research community, we are releasing the first version of our ASL dataset, which contains 30k sentences, 416k words, a vocabulary of 18k words, in a total of 104 hours. This is the largest continuous sign language dataset published to date in terms of video duration. We also describe a system that can evolve and expand the dataset to incorporate better data processing techniques and more contents when available. It is our hope that the release of this ASL dataset and the sustainable dataset generation system to the public will propel better deep-learning research in ASL natural language processing.
    Equal Improvability: A New Fairness Notion Considering the Long-term Impact. (arXiv:2210.06732v1 [cs.LG])
    Devising a fair classifier that does not discriminate against different groups is an important problem in machine learning. Although researchers have proposed various ways of defining group fairness, most of them only focused on the immediate fairness, ignoring the long-term impact of a fair classifier under the dynamic scenario where each individual can improve its feature over time. Such dynamic scenarios happen in the real world, e.g., college admission and credit loaning, where each rejected sample makes an effort to change its features to get accepted afterwards. In this dynamic setting, the long-term fairness should equalize the samples' feature distribution across different groups after the rejected samples make some effort to improve. In order to promote long-term fairness, we propose a new fairness notion called Equal Improvability (EI), which equalizes the potential acceptance rate of the rejected samples across different groups assuming a bounded level of effort will be spent by each rejected sample. We analyze the properties of EI and its connections with existing fairness notions. To find a classifier that satisfies the EI requirement, we propose and study three different approaches that solve EI-regularized optimization problems. Through experiments on both synthetic and real datasets, we demonstrate that the proposed EI-regularized algorithms help us find a classifier that is fair in terms of EI. Finally, we provide experimental results on dynamic scenarios which highlight the advantages of our EI metric in achieving long-term fairness. Code is available in a GitHub repository, see https://github.com/guldoganozgur/ei_fairness.
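    For a linear scorer, an EI-style audit has a closed form, since the best bounded effort $\lVert\Delta x\rVert_2\le\delta$ raises the score $w\cdot x + b$ by at most $\delta\lVert w\rVert_2$. The sketch below uses that fact on synthetic data; the classifier, effort budget, and group construction are all hypothetical.

```python
# Minimal sketch of an Equal-Improvability-style audit for a linear scorer:
# for each rejected sample, check whether some bounded effort (an L2 change of
# size <= delta) could flip it to acceptance, and compare that improvability
# rate across groups. For a linear score w.x + b, the best bounded effort
# increases the score by delta * ||w||.
import numpy as np

rng = np.random.default_rng(0)
n, d, delta = 2000, 4, 0.5
X = rng.standard_normal((n, d))
group = rng.integers(0, 2, n)
X[group == 1] -= 0.3                      # induce a group gap

w, b = rng.standard_normal(d), -0.2       # a (hypothetical) trained classifier
score = X @ w + b
rejected = score < 0

improvable = score + delta * np.linalg.norm(w) >= 0
for g in (0, 1):
    mask = rejected & (group == g)
    print(f"group {g}: P(improvable | rejected) = {improvable[mask].mean():.3f}")
```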
    Toward the application of XAI methods in EEG-based systems. (arXiv:2210.06554v1 [cs.LG])
    An interesting case of the well-known Dataset Shift Problem is the classification of Electroencephalogram (EEG) signals in the context of Brain-Computer Interface (BCI). The non-stationarity of EEG signals can lead to poor generalisation performance in BCI classification systems used across different sessions, even with the same subject. In this paper, we start from the hypothesis that the Dataset Shift problem can be alleviated by exploiting suitable eXplainable Artificial Intelligence (XAI) methods to locate and transform the relevant characteristics of the input for the goal of classification. In particular, we focus on an experimental analysis of explanations produced by several XAI methods on an ML system trained on a typical EEG dataset for emotion recognition. Results show that many relevant components found by XAI methods are shared across the sessions and can be used to build a system able to generalise better. However, relevant components of the input signal also appear to be highly dependent on the input itself.
    Augmenting Flight Training with AI to Efficiently Train Pilots. (arXiv:2210.06683v1 [cs.LG])
    We propose an AI-based pilot trainer to help students learn how to fly aircraft. First, an AI agent uses behavioral cloning to learn flying maneuvers from qualified flight instructors. Later, the system uses the agent's decisions to detect errors made by students and provide feedback to help students correct their errors. This paper presents an instantiation of the pilot trainer. We focus on teaching straight and level flying maneuvers by automatically providing formative feedback to the human student.
    S4ND: Modeling Images and Videos as Multidimensional Signals Using State Spaces. (arXiv:2210.06583v1 [cs.CV])
    Visual data such as images and videos are typically modeled as discretizations of inherently continuous, multidimensional signals. Existing continuous-signal models attempt to exploit this fact by modeling the underlying signals of visual (e.g., image) data directly. However, these models have not yet been able to achieve competitive performance on practical vision tasks such as large-scale image and video classification. Building on a recent line of work on deep state space models (SSMs), we propose S4ND, a new multidimensional SSM layer that extends the continuous-signal modeling ability of SSMs to multidimensional data including images and videos. We show that S4ND can model large-scale visual data in $1$D, $2$D, and $3$D as continuous multidimensional signals and demonstrates strong performance by simply swapping Conv2D and self-attention layers with S4ND layers in existing state-of-the-art models. On ImageNet-1k, S4ND exceeds the performance of a Vision Transformer baseline by $1.5\%$ when training with a $1$D sequence of patches, and matches ConvNeXt when modeling images in $2$D. For videos, S4ND improves on an inflated $3$D ConvNeXt in activity classification on HMDB-51 by $4\%$. S4ND implicitly learns global, continuous convolutional kernels that are resolution invariant by construction, providing an inductive bias that enables generalization across multiple resolutions. By developing a simple bandlimiting modification to S4 to overcome aliasing, S4ND achieves strong zero-shot (unseen at training time) resolution performance, outperforming a baseline Conv2D by $40\%$ on CIFAR-10 when trained on $8 \times 8$ and tested on $32 \times 32$ images. When trained with progressive resizing, S4ND comes within $\sim 1\%$ of a high-resolution model while training $22\%$ faster.
    Rigorous dynamical mean field theory for stochastic gradient descent methods. (arXiv:2210.06591v1 [math-ph])
    We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates.
    Partial Information as Full: Reward Imputation with Sketching in Bandits. (arXiv:2210.06719v1 [cs.LG])
    We focus on the setting of contextual batched bandit (CBB), where a batch of rewards is observed from the environment in each episode, but the rewards of the non-executed actions are unobserved (i.e., partial-information feedback). Existing approaches for CBB usually ignore the rewards of the non-executed actions, resulting in feedback information being underutilized. In this paper, we propose an efficient reward imputation approach using sketching for CBB, which completes the unobserved rewards with imputed rewards approximating the full-information feedback. Specifically, we formulate the reward imputation as a problem of imputation regularized ridge regression, which captures the feedback mechanisms of both the non-executed and executed actions. To reduce the time complexity of reward imputation, we solve the regression problem using randomized sketching. We prove that our reward imputation approach obtains a relative-error bound for sketching approximation, achieves an instantaneous regret with a controllable bias and a smaller variance than that without reward imputation, and enjoys a sublinear regret bound against the optimal policy. Moreover, we present two extensions of our approach, including the rate-scheduled version and the version for nonlinear rewards, making our approach more feasible. Experimental results demonstrate that our approach can outperform the state-of-the-art baselines on synthetic and real-world datasets.
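    The computational core, ridge regression solved on a randomly sketched system, can be shown compactly. The Gaussian sketch below compresses the row dimension before solving; the CBB-specific structure (separate imputation of executed and non-executed feedback, rate scheduling) is omitted.

```python
# Sketch of sketched ridge regression for reward imputation: a Gaussian
# sketch compresses the rows before solving the regularized least squares
# problem, and the fitted model imputes rewards for non-executed actions.
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 5000, 30, 300, 1.0
X = rng.standard_normal((n, d))           # contexts of executed actions
r = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # rewards

S = rng.standard_normal((m, n)) / np.sqrt(m)   # sketching matrix, m << n
SX, Sr = S @ X, S @ r

w = np.linalg.solve(SX.T @ SX + lam * np.eye(d), SX.T @ Sr)

# Impute rewards of non-executed actions from their contexts.
X_unobserved = rng.standard_normal((10, d))
r_imputed = X_unobserved @ w
print("imputed rewards:", np.round(r_imputed, 2))
```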
    A General Stochastic Optimization Framework for Convergence Bidding. (arXiv:2210.06543v1 [math.OC])
    We introduce a general stochastic optimization framework to obtain optimal convergence (virtual) bid curves. Within this framework, we develop a computationally tractable linear programming-based optimization model, which produces bid prices and volumes simultaneously. We also show that different approximations and simplifications in the general model lead naturally to well-known convergence bidding approaches, such as self-scheduling and opportunistic approaches.
    D-CIPHER: Discovery of Closed-form Partial Differential Equations. (arXiv:2206.10586v2 [cs.LG] UPDATED)
    Closed-form differential equations, including partial differential equations and higher-order ordinary differential equations, are one of the most important tools used by scientists to model and better understand natural phenomena. Discovering these equations directly from data is challenging because it requires modeling relationships between various derivatives that are not observed in the data (equation-data mismatch) and it involves searching across a huge space of possible equations. Current approaches make strong assumptions about the form of the equation and thus fail to discover many well-known systems. Moreover, many of them resolve the equation-data mismatch by estimating the derivatives, which makes them inadequate for noisy and infrequently sampled systems. To this end, we propose D-CIPHER, which is robust to measurement artifacts and can uncover a new and very general class of differential equations. We further design a novel optimization procedure, CoLLie, to help D-CIPHER search through this class efficiently. Finally, we demonstrate empirically that it can discover many well-known equations that are beyond the capabilities of current methods.
    Task-Free Continual Learning via Online Discrepancy Distance Learning. (arXiv:2210.06579v1 [cs.CV])
    Learning from non-stationary data streams, also called Task-Free Continual Learning (TFCL), remains challenging due to the absence of explicit task information. Although some methods have recently been proposed for TFCL, they lack theoretical guarantees. Moreover, forgetting during TFCL has not been studied theoretically before. This paper develops a new theoretical analysis framework which provides generalization bounds based on the discrepancy distance between the visited samples and the entire information made available for training the model. This analysis gives new insights into the forgetting behaviour in classification tasks. Inspired by this theoretical model, we propose a new approach enabled by the dynamic component expansion mechanism for a mixture model, namely the Online Discrepancy Distance Learning (ODDL). ODDL estimates the discrepancy between the probabilistic representation of the current memory buffer and the already accumulated knowledge and uses it as the expansion signal to ensure a compact network architecture with optimal performance. We then propose a new sample selection approach that selectively stores the most relevant samples into the memory buffer through the discrepancy-based measure, further improving the performance. We perform several TFCL experiments with the proposed methodology, which demonstrate that the proposed approach achieves state-of-the-art performance.
    Imitative Planning using Conditional Normalizing Flow. (arXiv:2007.16162v3 [cs.RO] UPDATED)
    A popular way to plan trajectories in dynamic urban scenarios for Autonomous Vehicles is to rely on explicitly specified and hand-crafted cost functions, coupled with random sampling in the trajectory space to find the minimum-cost trajectory. Such methods require a high number of samples to find a low-cost trajectory and might end up with a highly suboptimal trajectory given the planning time budget. We explore the application of normalizing flows for improving the performance of trajectory planning for autonomous vehicles (AVs). Our key insight is to learn a sampling policy in a low-dimensional latent space of expert-like trajectories, out of which the best sample is selected for execution. By modeling the trajectory planner's cost manifold as an energy function, we learn a scene-conditioned mapping from the prior to a Boltzmann distribution over the AV control space. Finally, we demonstrate the effectiveness of our approach on real-world datasets over imitation learning (IL) and hand-constructed trajectory sampling techniques.
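    The sample-and-select loop is easy to sketch once a latent decoder exists. In the snippet below, `flow_decode` and `trajectory_cost` are placeholders standing in for the learned conditional flow and the planner's hand-crafted cost; only the selection logic is faithful to the description above.

```python
# Sketch of the sample-and-select planning loop: draw latents from the prior,
# decode them to trajectories, execute the cheapest one.
import numpy as np

rng = np.random.default_rng(0)
latent_dim, n_samples, horizon = 8, 64, 20
W = rng.standard_normal((horizon * 2, latent_dim)) * 0.1  # stand-in decoder weights

def flow_decode(z):
    """Placeholder for the learned conditional-NF decoder z -> (x, y) path."""
    return (W @ z).reshape(horizon, 2).cumsum(axis=0)

def trajectory_cost(traj, goal=np.array([5.0, 0.0])):
    """Placeholder planner cost: reach the goal, keep the path smooth."""
    smooth = np.sum(np.diff(traj, axis=0) ** 2)
    return np.sum((traj[-1] - goal) ** 2) + 0.1 * smooth

Z = rng.standard_normal((n_samples, latent_dim))    # samples from the prior
costs = [trajectory_cost(flow_decode(z)) for z in Z]
best = flow_decode(Z[int(np.argmin(costs))])        # lowest-cost trajectory wins
print("best endpoint:", best[-1], "cost:", min(costs))
```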
    Self-Supervised Learning of Linear Precoders under Non-Linear PA Distortion for Energy-Efficient Massive MIMO Systems. (arXiv:2210.07037v1 [cs.LG])
    Massive multiple-input multiple-output (MIMO) systems are typically designed under the assumption of linear power amplifiers (PAs). However, PAs are typically most energy-efficient when operating close to their saturation point, where they cause non-linear distortion. Moreover, when using conventional precoders, this distortion coherently combines at the user locations, limiting performance. As such, when designing an energy-efficient massive MIMO system, this distortion has to be managed. In this work, we propose the use of a neural network (NN) to learn the mapping between the channel matrix and the precoding matrix, which maximizes the sum rate in the presence of this non-linear distortion. This is done for a third-order polynomial PA model, for both the single-user and multi-user cases. By learning this mapping, a significant increase in energy efficiency is achieved as compared to conventional precoders, and even as compared to perfect digital pre-distortion (DPD), in the saturation regime.
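    The distortion mechanism is easy to reproduce with the third-order polynomial PA model mentioned above: under a conventional maximum-ratio (MRT) precoder, each antenna's cubic distortion term is itself beamformed toward the user. The coefficients below are illustrative; the paper's NN would replace the MRT line with a learned channel-to-precoder mapping.

```python
# Sketch of the third-order polynomial PA model and of how conventional
# precoding makes the distortion combine coherently at the user.
import numpy as np

rng = np.random.default_rng(0)
M = 64                                     # BS antennas, single user
h = (rng.standard_normal(M) + 1j * rng.standard_normal(M)) / np.sqrt(2)

def pa(x, a1=1.0, a3=-0.1):
    """Third-order polynomial PA: y = a1*x + a3*x*|x|^2 (coeffs illustrative)."""
    return a1 * x + a3 * x * np.abs(x) ** 2

s = (rng.standard_normal(1000) + 1j * rng.standard_normal(1000)) / np.sqrt(2)
w = h.conj() / np.linalg.norm(h)           # MRT precoder

tx = np.outer(s, w)                        # per-antenna signals, shape (T, M)
rx_linear = tx @ h
rx_nonlinear = pa(tx) @ h                  # PAs distort each antenna signal
distortion = rx_nonlinear - rx_linear      # combines coherently under MRT
sdr = 10 * np.log10(np.mean(np.abs(rx_linear) ** 2) / np.mean(np.abs(distortion) ** 2))
print(f"signal-to-distortion ratio with MRT: {sdr:.1f} dB")
```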
    How to Sift Out a Clean Data Subset in the Presence of Data Poisoning?. (arXiv:2210.06516v1 [cs.CR])
    Given the volume of data needed to train modern machine learning models, external suppliers are increasingly used. However, incorporating external data poses data poisoning risks, wherein attackers manipulate their data to degrade model utility or integrity. Most poisoning defenses presume access to a set of clean data (or base set). While this assumption has been taken for granted, given the fast-growing research on stealthy poisoning attacks, a question arises: can defenders really identify a clean subset within a contaminated dataset to support defenses? This paper starts by examining the impact of poisoned samples on defenses when they are mistakenly mixed into the base set. We analyze five defenses and find that their performance deteriorates dramatically with less than 1% of poisoned points in the base set. These findings suggest that sifting out a base set with high precision is key to these defenses' performance. Motivated by these observations, we study how precise existing automated tools and human inspection are at identifying clean data in the presence of data poisoning. Unfortunately, neither effort achieves the precision needed. Worse yet, many of the outcomes are worse than random selection. In addition to uncovering the challenge, we propose a practical countermeasure, Meta-Sift. Our method is based on the insight that the poisoned samples of existing attacks are shifted away from the clean data distribution. Hence, training on the clean portion of a dataset and testing on the corrupted portion will result in high prediction loss. Leveraging this insight, we formulate a bilevel optimization to identify clean data and further introduce a suite of techniques to improve efficiency and precision. Our evaluation shows that Meta-Sift can sift a clean base set with 100% precision under a wide range of poisoning attacks. The selected base set is large enough to give rise to successful defenses.  ( 3 min )
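    The insight behind the countermeasure can be sketched with an ordinary classifier: fit on the candidate data, then rank points by their own prediction loss, since poisoned points sit off the clean distribution and fit poorly. This toy version with label flips omits the bilevel formulation and the efficiency machinery of the actual method.

```python
# Toy sketch of loss-based sifting: points that fit the majority (clean)
# pattern poorly get high loss, so the lowest-loss points form a candidate
# clean base set.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, n_poison = 2000, 10, 100
X = rng.standard_normal((n, d))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
y[:n_poison] = 1 - y[:n_poison]            # label-flip "poison"

clf = LogisticRegression(max_iter=1000).fit(X, y)
proba = clf.predict_proba(X)[np.arange(n), y]
loss = -np.log(np.clip(proba, 1e-12, None))  # per-sample cross-entropy

base_set = np.argsort(loss)[:500]          # lowest-loss points as clean base set
precision = np.mean(base_set >= n_poison)  # fraction that is actually clean
print(f"base-set precision: {precision:.3f}")
```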
    Microscopy is All You Need. (arXiv:2210.06526v1 [cond-mat.dis-nn])
    We posit that microscopy offers an ideal real-world experimental environment for the development and deployment of active Bayesian and reinforcement learning methods. Indeed, the tremendous progress achieved by machine learning (ML) and artificial intelligence over the last decade has been largely achieved via the utilization of static data sets, from the paradigmatic MNIST to the bespoke corpora of text and image data used to train large models such as GPT3, DALLE and others. However, it is now recognized that continuous, minute improvements to state-of-the-art do not necessarily translate to advances in real-world applications. We argue that a promising pathway for the development of ML methods is via the route of domain-specific deployable algorithms in areas such as electron and scanning probe microscopy and chemical imaging. This will benefit both fundamental physical studies and serve as a test bed for more complex autonomous systems such as robotics and manufacturing. Favorable environment characteristics of scanning and electron microscopy include low risk, extensive availability of domain-specific priors and rewards, relatively small effects of exogenous variables, and often the presence of both upstream first principles as well as downstream learnable physical models for both statics and dynamics. Recent developments in programmable interfaces, edge computing, and access to APIs facilitating microscope control, all render the deployment of ML codes on operational microscopes straightforward. We discuss these considerations and hope that these arguments will lead to creating a novel set of development targets for the ML community by accelerating both real-world ML applications and scientific progress.  ( 3 min )
    Emergence of Shared Sensory-motor Graphical Language from Visual Input. (arXiv:2210.06468v1 [cs.AI])
    The framework of Language Games studies the emergence of languages in populations of agents. Recent contributions relying on deep learning methods focused on agents communicating via an idealized communication channel, where utterances produced by a speaker are directly perceived by a listener. This comes in contrast with human communication, which instead relies on a sensory-motor channel, where motor commands produced by the speaker (e.g. vocal or gestural articulators) result in sensory effects perceived by the listener (e.g. audio or visual). Here, we investigate if agents can evolve a shared language when they are equipped with a continuous sensory-motor system to produce and perceive signs, e.g. drawings. To this end, we introduce the Graphical Referential Game (GREG) where a speaker must produce a graphical utterance to name a visual referent object consisting of combinations of MNIST digits while a listener has to select the corresponding object among distractor referents, given the produced message. The utterances are drawing images produced using dynamical motor primitives combined with a sketching library. To tackle GREG we present CURVES: a multimodal contrastive deep learning mechanism that represents the energy (alignment) between named referents and utterances generated through gradient ascent on the learned energy landscape. We then present a set of experiments and metrics based on a systematic compositional dataset to evaluate the resulting language. We show that our method allows the emergence of a shared, graphical language with compositional properties.  ( 3 min )
    Quantum Algorithms for Sampling Log-Concave Distributions and Estimating Normalizing Constants. (arXiv:2210.06539v1 [quant-ph])
    Given a convex function $f\colon\mathbb{R}^{d}\to\mathbb{R}$, the problem of sampling from a distribution $\propto e^{-f(x)}$ is called log-concave sampling. This task has wide applications in machine learning, physics, statistics, etc. In this work, we develop quantum algorithms for sampling log-concave distributions and for estimating their normalizing constants $\int_{\mathbb{R}^d}e^{-f(x)}\mathrm{d} x$. First, we use underdamped Langevin diffusion to develop quantum algorithms that match the query complexity (in terms of the condition number $\kappa$ and dimension $d$) of analogous classical algorithms that use gradient (first-order) queries, even though the quantum algorithms use only evaluation (zeroth-order) queries. For estimating normalizing constants, these algorithms also achieve quadratic speedup in the multiplicative error $\epsilon$. Second, we develop quantum Metropolis-adjusted Langevin algorithms with query complexity $\widetilde{O}(\kappa^{1/2}d)$ and $\widetilde{O}(\kappa^{1/2}d^{3/2}/\epsilon)$ for log-concave sampling and normalizing constant estimation, respectively, achieving polynomial speedups in $\kappa,d,\epsilon$ over the best known classical algorithms by exploiting quantum analogs of the Monte Carlo method and quantum walks. We also prove a $1/\epsilon^{1-o(1)}$ quantum lower bound for estimating normalizing constants, implying near-optimality of our quantum algorithms in $\epsilon$.  ( 2 min )
    Real World Offline Reinforcement Learning with Realistic Data Source. (arXiv:2210.06479v1 [cs.RO])
    Offline reinforcement learning (ORL) holds great promise for robot learning due to its ability to learn from arbitrary pre-generated experience. However, current ORL benchmarks are almost entirely in simulation and utilize contrived datasets like replay buffers of online RL agents or sub-optimal trajectories, and thus hold limited relevance for real-world robotics. In this work (Real-ORL), we posit that data collected from safe operations of closely related tasks are more practical data sources for real-world robot learning. Under these settings, we perform an extensive (6500+ trajectories collected over 800+ robot hours and 270+ human labor hours) empirical study evaluating generalization and transfer capabilities of representative ORL methods on four real-world tabletop manipulation tasks. Our study finds that ORL and imitation learning prefer different action spaces, and that ORL algorithms can generalize from leveraging offline heterogeneous data sources and outperform imitation learning. We release our dataset and implementations at URL: https://sites.google.com/view/real-orl  ( 2 min )
    Evaluated CMI Bounds for Meta Learning: Tightness and Expressiveness. (arXiv:2210.06511v1 [cs.LG])
    Recent work has established that the conditional mutual information (CMI) framework of Steinke and Zakynthinou (2020) is expressive enough to capture generalization guarantees in terms of algorithmic stability, VC dimension, and related complexity measures for conventional learning (Harutyunyan et al., 2021, Haghifam et al., 2021). Hence, it provides a unified method for establishing generalization bounds. In meta learning, there has so far been a divide between information-theoretic results and results from classical learning theory. In this work, we take a first step toward bridging this divide. Specifically, we present novel generalization bounds for meta learning in terms of the evaluated CMI (e-CMI). To demonstrate the expressiveness of the e-CMI framework, we apply our bounds to a representation learning setting, with $n$ samples from $\hat n$ tasks parameterized by functions of the form $f_i \circ h$. Here, each $f_i \in \mathcal F$ is a task-specific function, and $h \in \mathcal H$ is the shared representation. For this setup, we show that the e-CMI framework yields a bound that scales as $\sqrt{ \mathcal C(\mathcal H)/(n\hat n) + \mathcal C(\mathcal F)/n} $, where $\mathcal C(\cdot)$ denotes a complexity measure of the hypothesis class. This scaling behavior coincides with the one reported in Tripuraneni et al. (2020) using Gaussian complexity.  ( 3 min )
    Equi-Tuning: Group Equivariant Fine-Tuning of Pretrained Models. (arXiv:2210.06475v1 [cs.LG])
    We introduce equi-tuning, a novel fine-tuning method that transforms (potentially non-equivariant) pretrained models into group equivariant models while incurring minimum $L_2$ loss between the feature representations of the pretrained and the equivariant models. Large pretrained models can be equi-tuned for different groups to satisfy the needs of various downstream tasks. Equi-tuned models benefit from both group equivariance as an inductive bias and semantic priors from pretrained models. We provide applications of equi-tuning on three different tasks: image classification, compositional generalization in language, and fairness in natural language generation (NLG). We also provide a novel group-theoretic definition for fairness in NLG. The effectiveness of this definition is shown by testing it against a standard empirical method of fairness in NLG. We provide experimental results for equi-tuning using a variety of pretrained models: Alexnet, Resnet, VGG, and Densenet for image classification; RNNs, GRUs, and LSTMs for compositional generalization; and GPT2 for fairness in NLG. We test these models on benchmark datasets across all considered tasks to show the generality and effectiveness of the proposed method.  ( 2 min )
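    For a finite group acting on the input, my reading of the equi-tuning transform is that it averages the pretrained model over the group orbit, $M_G(x)=\frac{1}{|G|}\sum_{g\in G} g^{-1}\,M(g\cdot x)$. For image classification under 90-degree rotations the output transformation is trivial, so the wrapper reduces to averaging logits, as in this sketch with a stand-in "pretrained" model:

```python
# Minimal sketch of group averaging for the cyclic rotation group C4 acting
# on images, with a classification output on which the group acts trivially:
# the wrapper averages the model's logits over the four rotated copies.
# `model` is a placeholder; equi-tuning then fine-tunes through this wrapper.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((10, 32 * 32))     # stand-in "pretrained model"

def model(x):                              # x: (32, 32) image -> 10 logits
    return W @ x.ravel()

def equitune_forward(x):
    logits = [model(np.rot90(x, k)) for k in range(4)]   # g.x for g in C4
    return np.mean(logits, axis=0)         # invariant under 90-degree rotation

x = rng.standard_normal((32, 32))
out1, out2 = equitune_forward(x), equitune_forward(np.rot90(x))
print("max deviation under rotation:", np.max(np.abs(out1 - out2)))  # ~0
```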
    MicroLib: A library of 3D microstructures generated from 2D micrographs using SliceGAN. (arXiv:2210.06541v1 [cs.LG])
    3D microstructural datasets are commonly used to define the geometrical domains used in finite element modelling. This has proven a useful tool for understanding how complex material systems behave under applied stresses, temperatures and chemical conditions. However, 3D imaging of materials is challenging for a number of reasons, including limited field of view, low resolution and difficult sample preparation. Recently, a machine learning method, SliceGAN, was developed to statistically generate 3D microstructural datasets of arbitrary size using a single 2D input slice as training data. In this paper, we present the results from applying SliceGAN to 87 different microstructures, ranging from biological materials to high-strength steels. To demonstrate the accuracy of the synthetic volumes created by SliceGAN, we compare three microstructural properties between the 2D training data and 3D generations, which show good agreement. This new microstructure library both provides valuable 3D microstructures that can be used in models, and also demonstrates the broad applicability of the SliceGAN algorithm.  ( 2 min )
    GULP: a prediction-based metric between representations. (arXiv:2210.06545v1 [cs.LG])
    Comparing the representations learned by different neural networks has recently emerged as a key tool to understand various architectures and ultimately optimize them. In this work, we introduce GULP, a family of distance measures between representations that is explicitly motivated by downstream predictive tasks. By construction, GULP provides uniform control over the difference in prediction performance between two representations, with respect to regularized linear prediction tasks. Moreover, it satisfies several desirable structural properties, such as the triangle inequality and invariance under orthogonal transformations, and thus lends itself to data embedding and visualization. We extensively evaluate GULP relative to other methods, and demonstrate that it correctly differentiates between architecture families, converges over the course of training, and captures generalization performance on downstream linear tasks.  ( 2 min )
    SUMBot: Summarizing Context in Open-Domain Dialogue Systems. (arXiv:2210.06496v1 [cs.CL])
    In this paper, we investigate the problem of including relevant information as context in open-domain dialogue systems. Most models struggle to identify and incorporate important knowledge from dialogues and simply use the entire turns as context, which increases the size of the input fed to the model with unnecessary information. Additionally, due to the input size limitation of a few hundred tokens of large pre-trained models, regions of the history are not included and informative parts from the dialogue may be omitted. To overcome this problem, we introduce a simple method that substitutes part of the context with a summary instead of using the whole history, increasing the models' ability to keep track of all the previous relevant information. We show that the inclusion of a summary may improve the answer generation task and discuss some examples to further understand the system's weaknesses.  ( 2 min )
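    The substitution step itself is tiny; the sketch below keeps the last few turns verbatim and summarizes everything older, with `summarize` as a placeholder for whatever summarization model the system uses.

```python
# Sketch of the context-construction step: keep the most recent turns
# verbatim, replace everything older with a summary, so the input stays
# within the encoder's token budget.
def summarize(turns):
    """Placeholder: a real system would call an abstractive summarizer."""
    return "Summary: " + " ".join(t.split(".")[0] for t in turns)[:200]

def build_context(history, keep_last=2):
    if len(history) <= keep_last:
        return history
    older, recent = history[:-keep_last], history[-keep_last:]
    return [summarize(older)] + recent

history = [
    "Hi! I'm planning a trip to Lisbon in May.",
    "Nice, Lisbon is lovely in spring. Any districts in mind?",
    "I was thinking Alfama, and I love seafood.",
    "Alfama is great for fado and has good seafood spots.",
    "Can you recommend one near the water?",
]
print(build_context(history))
```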
    Auto-Encoding Goodness of Fit. (arXiv:2210.06546v1 [cs.LG])
    For generative autoencoders to learn a meaningful latent representation for data generation, a careful balance must be achieved between reconstruction error and how close the distribution in the latent space is to the prior. However, this balance is challenging to achieve due to a lack of criteria that work both at the mini-batch (local) and aggregated posterior (global) level. Goodness of fit (GoF) hypothesis tests provide a measure of statistical indistinguishability between the latent distribution and a target distribution class. In this work, we develop the Goodness of Fit Autoencoder (GoFAE), which incorporates hypothesis tests at two levels. At the mini-batch level, it uses GoF test statistics as regularization objectives. At a more global level, it selects a regularization coefficient based on higher criticism, i.e., a test on the uniformity of the local GoF p-values. We justify the use of GoF tests by providing a relaxed $L_2$-Wasserstein bound on the distance between the latent distribution and target prior. We propose to use GoF tests and prove that optimization based on these tests can be done with stochastic gradient descent (SGD) on a compact Riemannian manifold. Empirically, we show that our higher criticism parameter selection procedure balances reconstruction and generation using mutual information and uniformity of p-values respectively. Finally, we show that GoFAE achieves comparable FID scores and mean squared errors with competing deep generative models while retaining statistical indistinguishability from Gaussian in the latent space based on a variety of hypothesis tests.  ( 3 min )
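    Both testing levels can be mimicked with off-the-shelf statistics, keeping in mind that GoFAE's batch-level statistics are differentiable so they can serve as SGD objectives, which the scipy versions below are not. The simulated encoder and all values are illustrative.

```python
# Sketch of the two levels: a Gaussian GoF statistic per mini-batch of latent
# codes (the quantity used as a regularizer during training), and a global
# uniformity check on the collected p-values, in the spirit of higher
# criticism, to judge the regularization strength.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def batch_gof(latents):
    """Cramer-von Mises test of a 1D latent projection against N(0, 1)."""
    res = stats.cramervonmises(latents, "norm")
    return res.statistic, res.pvalue

# Simulate latent mini-batches from a slightly non-Gaussian encoder.
pvals = []
for _ in range(200):
    z = rng.standard_normal(64) + 0.1 * rng.standard_normal(64) ** 2
    stat, p = batch_gof(z)
    pvals.append(p)

# Global check: if the encoder matched the prior, these p-values would be
# uniform on [0, 1]; systematic deviation signals a latent/prior mismatch.
print(stats.kstest(pvals, "uniform"))
```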
  • Open

    Deterministic Langevin Monte Carlo with Normalizing Flows for Bayesian Inference. (arXiv:2205.14240v2 [stat.ML] UPDATED)
    We propose a general purpose Bayesian inference algorithm for expensive likelihoods, replacing the stochastic term in the Langevin equation with a deterministic density gradient term. The particle density is evaluated from the current particle positions using a Normalizing Flow (NF), which is differentiable and has good generalization properties in high dimensions. We take advantage of NF preconditioning and NF-based Metropolis-Hastings updates for a faster convergence. We show on various examples that the method is competitive against state-of-the-art sampling methods.  ( 2 min )
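    One deterministic Langevin update replaces the noise with a density-gradient term, $x \leftarrow x + \epsilon\,(\nabla\log p^*(x) - \nabla\log q(x))$. In the sketch below a Gaussian KDE stands in for the Normalizing Flow density estimator and the target is a 1D standard normal; both are simplifying assumptions, not the paper's setup.

```python
# Sketch of a deterministic Langevin step: the stochastic term of the
# Langevin SDE is replaced by the gradient of the log particle density,
# estimated here with a Gaussian KDE instead of a Normalizing Flow.
import numpy as np

rng = np.random.default_rng(0)

def grad_log_target(x):                    # target: standard normal, 1D
    return -x

def grad_log_kde(x, particles, h=0.3):
    """Gradient of log of a Gaussian KDE fitted to the particles."""
    diff = x[:, None] - particles[None, :]          # (n, n)
    w = np.exp(-0.5 * (diff / h) ** 2)
    return (-diff / h**2 * w).sum(1) / w.sum(1)

x = rng.uniform(-4, 4, 200)                # particle initialization
eps = 0.05
for _ in range(500):
    x = x + eps * (grad_log_target(x) - grad_log_kde(x, x))

print("mean, std after transport:", x.mean(), x.std())  # roughly 0 and 1
```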
    Active Exploration for Inverse Reinforcement Learning. (arXiv:2207.08645v2 [cs.LG] UPDATED)
    Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly learn the expert's reward function and identify a good policy. AceIRL uses previous observations to construct confidence intervals that capture plausible reward functions and find exploration policies that focus on the most informative regions of the environment. AceIRL is the first approach to active IRL with sample-complexity bounds that does not require a generative model of the environment. AceIRL matches the sample complexity of active IRL with a generative model in the worst case. Additionally, we establish a problem-dependent bound that relates the sample complexity of AceIRL to the suboptimality gap of a given IRL problem. We empirically evaluate AceIRL in simulations and find that it significantly outperforms more naive exploration strategies.  ( 3 min )
    The Franz-Parisi Criterion and Computational Trade-offs in High Dimensional Statistics. (arXiv:2205.09727v2 [math.ST] UPDATED)
    Many high-dimensional statistical inference problems are believed to possess inherent computational hardness. Various frameworks have been proposed to give rigorous evidence for such hardness, including lower bounds against restricted models of computation (such as low-degree functions), as well as methods rooted in statistical physics that are based on free energy landscapes. This paper aims to make a rigorous connection between the seemingly different low-degree and free-energy based approaches. We define a free-energy based criterion for hardness and formally connect it to the well-established notion of low-degree hardness for a broad class of statistical problems, namely all Gaussian additive models and certain models with a sparse planted signal. By leveraging these rigorous connections we are able to: establish that for Gaussian additive models the "algebraic" notion of low-degree hardness implies failure of "geometric" local MCMC algorithms, and provide new low-degree lower bounds for sparse linear regression which seem difficult to prove directly. These results provide both conceptual insights into the connections between different notions of hardness, as well as concrete technical tools such as new methods for proving low-degree lower bounds.  ( 3 min )
    Online PAC-Bayes Learning. (arXiv:2206.00024v2 [cs.LG] UPDATED)
    Most PAC-Bayesian bounds hold in the batch learning setting where data is collected at once, prior to inference or prediction. This somewhat departs from many contemporary learning problems where data streams are collected and the algorithms must dynamically adjust. We prove new PAC-Bayesian bounds in this online learning framework, leveraging an updated definition of regret, and we revisit classical PAC-Bayesian results with a batch-to-online conversion, extending their remit to the case of dependent data. Our results hold for bounded losses, potentially \emph{non-convex}, paving the way to promising developments in online learning.  ( 2 min )
    Batch-Size Independent Regret Bounds for Combinatorial Semi-Bandits with Probabilistically Triggered Arms or Independent Arms. (arXiv:2208.14837v2 [cs.LG] UPDATED)
    In this paper, we study the combinatorial semi-bandits (CMAB) and focus on reducing the dependency on the batch size $K$ in the regret bound, where $K$ is the total number of arms that can be pulled or triggered in each round. First, for the setting of CMAB with probabilistically triggered arms (CMAB-T), we discover a novel (directional) triggering probability and variance modulated (TPVM) condition that can replace the previously-used smoothness condition for various applications, such as cascading bandits, online network exploration and online influence maximization. Under this new condition, we propose a BCUCB-T algorithm with variance-aware confidence intervals and conduct regret analysis which reduces the $O(K)$ factor to $O(\log K)$ or $O(\log^2 K)$ in the regret bound, significantly improving the regret bounds for the above applications. Second, for the setting of non-triggering CMAB with independent arms, we propose a SESCB algorithm which leverages the non-triggering version of the TPVM condition and completely removes the dependency on $K$ in the leading regret. As a valuable by-product, the regret analysis used in this paper can improve several existing results by a factor of $O(\log K)$. Finally, experimental evaluations show our superior performance compared with benchmark algorithms in different applications.  ( 3 min )
    Neural Network Architecture Beyond Width and Depth. (arXiv:2205.09459v2 [cs.LG] UPDATED)
    This paper proposes a new neural network architecture by introducing an additional dimension called height beyond width and depth. Neural network architectures with height, width, and depth as hyper-parameters are called three-dimensional architectures. It is shown that neural networks with three-dimensional architectures are significantly more expressive than the ones with two-dimensional architectures (those with only width and depth as hyper-parameters), e.g., standard fully connected networks. The new network architecture is constructed recursively via a nested structure, and hence we call a network with the new architecture nested network (NestNet). A NestNet of height $s$ is built with each hidden neuron activated by a NestNet of height $\le s-1$. When $s=1$, a NestNet degenerates to a standard network with a two-dimensional architecture. It is proved by construction that height-$s$ ReLU NestNets with $\mathcal{O}(n)$ parameters can approximate $1$-Lipschitz continuous functions on $[0,1]^d$ with an error $\mathcal{O}(n^{-(s+1)/d})$, while the optimal approximation error of standard ReLU networks with $\mathcal{O}(n)$ parameters is $\mathcal{O}(n^{-2/d})$. Furthermore, such a result is extended to generic continuous functions on $[0,1]^d$ with the approximation error characterized by the modulus of continuity. Finally, we use numerical experimentation to show the advantages of the super-approximation power of ReLU NestNets.  ( 2 min )
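    A toy PyTorch sketch of the recursion, under a deliberate simplification: one scalar-to-scalar sub-network is shared across all hidden units per module, whereas the paper's construction activates each neuron with its own NestNet; widths and depths below are arbitrary choices, not the ones used in the approximation proofs:

        import torch
        import torch.nn as nn

        class NestNet(nn.Module):
            # Height-s network: hidden units are activated by a height-(s-1)
            # NestNet; height 1 degenerates to a standard ReLU network.
            def __init__(self, height, width=16, depth=2):
                super().__init__()
                self.layers = nn.ModuleList(
                    [nn.Linear(1 if i == 0 else width, width) for i in range(depth)]
                )
                self.out = nn.Linear(width, 1)
                # Shared scalar activation (the paper allows one per neuron).
                self.act = nn.ReLU() if height == 1 else NestNet(height - 1, width=4, depth=1)

            def forward(self, x):
                for layer in self.layers:
                    h = layer(x)
                    # Apply the (possibly nested) activation elementwise.
                    x = self.act(h.reshape(-1, 1)).reshape(h.shape)
                return self.out(x)

        net = NestNet(height=3)
        print(net(torch.linspace(0, 1, 8).unsqueeze(1)).shape)   # torch.Size([8, 1])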
    Gradient Boosting Performs Gaussian Process Inference. (arXiv:2206.05608v2 [cs.LG] UPDATED)
    This paper shows that gradient boosting based on symmetric decision trees can be equivalently reformulated as a kernel method that converges to the solution of a certain Kernel Ridge Regression problem. Thus, we obtain convergence to a Gaussian process posterior mean, which, in turn, allows us to easily transform gradient boosting into a sampler from the posterior to provide better knowledge uncertainty estimates through Monte-Carlo estimation of the posterior variance. We show that the proposed sampler allows for better knowledge uncertainty estimates, leading to improved out-of-domain detection.  ( 2 min )
    Parameter Averaging for Feature Ranking. (arXiv:2208.03249v2 [cs.LG] UPDATED)
    Neural networks are known to be sensitive to initialisation. Methods that rely on neural networks for feature ranking are therefore not robust, since their rankings can vary when the model is initialised and trained with different random seeds. In this work, we introduce a novel method based on parameter averaging to estimate accurate and robust feature importance in the tabular data setting, referred to as XTab. We first initialise and train multiple instances of a shallow network (referred to as local masks) with "different random seeds" for a downstream task. We then obtain a global mask model by "averaging the parameters" of the local masks. We show that although the parameter averaging might result in a global model with higher loss, it still leads to the discovery of the ground-truth feature importance more consistently than an individual model does. We conduct extensive experiments on a variety of synthetic and real-world data, demonstrating that XTab can be used to obtain global feature importance that is not sensitive to sub-optimal model initialisation.  ( 2 min )
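    A minimal PyTorch sketch of the parameter-averaging idea; the shallow architecture and the importance read-out via first-layer column norms are illustrative choices, not the paper's exact local-mask design:

        import torch
        import torch.nn as nn

        # Synthetic tabular task: only the first two of ten features matter.
        torch.manual_seed(0)
        X = torch.randn(512, 10)
        y = (2 * X[:, 0] - 3 * X[:, 1]).unsqueeze(1)

        def train_local_mask(seed, epochs=200):
            torch.manual_seed(seed)            # a "different random seed" per mask
            net = nn.Sequential(nn.Linear(10, 8), nn.ReLU(), nn.Linear(8, 1))
            opt = torch.optim.Adam(net.parameters(), lr=1e-2)
            for _ in range(epochs):
                opt.zero_grad()
                ((net(X) - y) ** 2).mean().backward()
                opt.step()
            return net.state_dict()

        # Global mask: average the parameters of the local masks.
        states = [train_local_mask(s) for s in range(5)]
        avg = {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

        # Illustrative importance read-out: column norms of the averaged first layer.
        print(avg["0.weight"].abs().sum(0))    # first two entries should dominate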
    Deep Ensembles Work, But Are They Necessary?. (arXiv:2202.06985v2 [cs.LG] UPDATED)
    Ensembling neural networks is an effective way to increase accuracy, and can often match the performance of individual larger models. This observation poses a natural question: given the choice between a deep ensemble and a single neural network with similar accuracy, is one preferable over the other? Recent work suggests that deep ensembles may offer distinct benefits beyond predictive power: namely, uncertainty quantification and robustness to dataset shift. In this work, we demonstrate limitations to these purported benefits, and show that a single (but larger) neural network can replicate these qualities. First, we show that ensemble diversity, by any metric, does not meaningfully contribute to an ensemble's uncertainty quantification on out-of-distribution (OOD) data, but is instead highly correlated with the relative improvement of a single larger model. Second, we show that the OOD performance afforded by ensembles is strongly determined by their in-distribution (InD) performance, and -- in this sense -- is not indicative of any "effective robustness". While deep ensembles are a practical way to achieve improvements to predictive power, uncertainty quantification, and robustness, our results show that these improvements can be replicated by a (larger) single model.  ( 2 min )
    One Policy is Enough: Parallel Exploration with a Single Policy is Near-Optimal for Reward-Free Reinforcement Learning. (arXiv:2205.15891v2 [cs.LG] UPDATED)
    While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature focused on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterpart. Further, we show that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.  ( 2 min )
    Partial Identifiability for Nonnegative Matrix Factorization. (arXiv:2206.08022v3 [math.NA] UPDATED)
    Given a nonnegative matrix $R$ and a factorization rank $r$, Exact nonnegative matrix factorization (Exact NMF) decomposes $R$ as the product of two nonnegative matrices, $C$ and $S$ with $r$ columns, such that $R = CS^\top$. A central research topic in the literature is the conditions under which such a decomposition is unique/identifiable, up to trivial ambiguities. In this paper, we focus on partial identifiability, that is, the uniqueness of a subset of columns of $C$ and $S$. We start our investigations with the data-based uniqueness (DBU) theorem from the chemometrics literature. The DBU theorem analyzes all feasible solutions of Exact NMF, and relies on sparsity conditions on $C$ and $S$. We provide a mathematically rigorous proof of a recently published restricted version of the DBU theorem, relying only on simple sparsity and algebraic conditions: it applies to a particular solution of Exact NMF (as opposed to all feasible solutions) and allows us to guarantee the partial uniqueness of a single column of $C$ or $S$. Second, based on a geometric interpretation of the restricted DBU theorem, we obtain a new partial identifiability result. This geometric interpretation also leads us to another partial identifiability result in the case $r=3$. Third, we show how partial identifiability results can be used sequentially to guarantee the identifiability of more columns of $C$ and $S$. We illustrate these results on several examples, including one from the chemometrics literature.  ( 3 min )
    Asymptotic Properties for Bayesian Neural Network in Besov Space. (arXiv:2206.00241v2 [stat.ML] UPDATED)
    Neural networks have shown great predictive power when dealing with various unstructured data such as images and natural languages. The Bayesian neural network captures the uncertainty of prediction by placing a prior distribution on the parameters of the model and computing the posterior distribution. In this paper, we show that the Bayesian neural network using a spike-and-slab prior has consistency with a nearly minimax convergence rate when the true regression function is in the Besov space. Even when the smoothness of the regression function is unknown, the same posterior convergence rate holds, and thus the spike-and-slab prior is adaptive to the smoothness of the regression function. We also consider the shrinkage prior, which is more feasible than other priors, and show that it has the same convergence rate. In other words, we propose a practical Bayesian neural network with guaranteed asymptotic properties.  ( 2 min )
    A consistent and flexible framework for deep matrix factorizations. (arXiv:2206.10693v2 [cs.LG] UPDATED)
    Deep matrix factorizations (deep MFs) are recent unsupervised data mining techniques inspired by constrained low-rank approximations. They aim to extract complex hierarchies of features within high-dimensional datasets. Most of the loss functions proposed in the literature to evaluate the quality of deep MF models and the underlying optimization frameworks are not consistent because different losses are used at different layers. In this paper, we introduce two meaningful loss functions for deep MF and present a generic framework to solve the corresponding optimization problems. We illustrate the effectiveness of this approach through the integration of various constraints and regularizations, such as sparsity, nonnegativity and minimum-volume. The models are successfully applied on both synthetic and real data, namely for hyperspectral unmixing and extraction of facial features.  ( 2 min )
    Communication Efficient Federated Learning for Generalized Linear Bandits. (arXiv:2202.01087v2 [cs.LG] UPDATED)
    Contextual bandit algorithms have been recently studied under the federated learning setting to satisfy the demand of keeping data decentralized and pushing the learning of bandit models to the client side. However, limited by the required communication efficiency, existing solutions are restricted to linear models to exploit their closed-form solutions for parameter estimation. Such a restricted model choice greatly hampers these algorithms' practical utility. In this paper, we take the first step toward addressing this challenge by studying generalized linear bandit models under the federated learning setting. We propose a communication-efficient solution framework that employs online regression for local updates and offline regression for global updates. We rigorously prove that, although the setting is more general and challenging, our algorithm attains a sub-linear rate in both regret and communication cost, which is also validated by our extensive empirical evaluations.  ( 2 min )
    Adjoint-aided inference of Gaussian process driven differential equations. (arXiv:2202.04589v3 [stat.ML] UPDATED)
    Linear systems occur throughout engineering and the sciences, most notably as differential equations. In many cases the forcing function for the system is unknown, and interest lies in using noisy observations of the system to infer the forcing, as well as other unknown parameters. In differential equations, the forcing function is an unknown function of the independent variables (typically time and space), and can be modelled as a Gaussian process (GP). In this paper we show how the adjoint of a linear system can be used to efficiently infer forcing functions modelled as GPs, using a truncated basis expansion of the GP kernel. We show how exact conjugate Bayesian inference for the truncated GP can be achieved, in many cases with substantially lower computation than would be required using MCMC methods. We demonstrate the approach on systems of both ordinary and partial differential equations, and show that the basis expansion approach approximates well the true forcing with a modest number of basis vectors. Finally, we show how to infer point estimates for the non-linear model parameters, such as the kernel length-scales, using Bayesian optimisation.  ( 3 min )
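    The conjugate step can be sketched as Bayesian linear regression in a truncated feature basis; below, random Fourier features stand in for the paper's kernel expansion, and the adjoint/differential-equation machinery is omitted entirely:

        import numpy as np

        rng = np.random.default_rng(0)

        # Noisy observations of an unknown forcing-like function.
        x = rng.uniform(0, 1, 40)
        y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(40)

        # Truncated basis for a squared-exponential GP via random Fourier features.
        m, ell = 50, 0.2
        omega = rng.standard_normal(m) / ell
        phase = rng.uniform(0, 2 * np.pi, m)
        Phi = np.sqrt(2.0 / m) * np.cos(np.outer(x, omega) + phase)    # (40, m)

        # Exact conjugate Gaussian inference on the basis weights.
        noise, prior = 0.1 ** 2, 1.0
        A = Phi.T @ Phi / noise + np.eye(m) / prior
        mean_w = np.linalg.solve(A, Phi.T @ y / noise)

        xs = np.linspace(0, 1, 5)
        Phis = np.sqrt(2.0 / m) * np.cos(np.outer(xs, omega) + phase)
        print(Phis @ mean_w)   # posterior mean of the inferred function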
    Pitfalls of Epistemic Uncertainty Quantification through Loss Minimisation. (arXiv:2203.06102v2 [cs.LG] UPDATED)
    Uncertainty quantification has received increasing attention in machine learning in the recent past. In particular, a distinction between aleatoric and epistemic uncertainty has been found useful in this regard. The latter refers to the learner's (lack of) knowledge and appears to be especially difficult to measure and quantify. In this paper, we analyse a recent proposal based on the idea of a second-order learner, which yields predictions in the form of distributions over probability distributions. While standard (first-order) learners can be trained to predict accurate probabilities, namely by minimising suitable loss functions on sample data, we show that loss minimisation does not work for second-order predictors: The loss functions proposed for inducing such predictors do not incentivise the learner to represent its epistemic uncertainty in a faithful way.  ( 2 min )
    Fast Estimation of Bayesian State Space Models Using Amortized Simulation-Based Inference. (arXiv:2210.07154v1 [econ.EM])
    This paper presents a fast algorithm for estimating hidden states of Bayesian state space models. The algorithm is a variation of amortized simulation-based inference algorithms, where a large number of artificial datasets are generated at the first stage, and then a flexible model is trained to predict the variables of interest. In contrast to those proposed earlier, the procedure described in this paper makes it possible to train estimators for hidden states by concentrating only on certain characteristics of the marginal posterior distributions and introducing inductive bias. Illustrations using the examples of the stochastic volatility model, nonlinear dynamic stochastic general equilibrium model, and seasonal adjustment procedure with breaks in seasonality show that the algorithm has sufficient accuracy for practical use. Moreover, after pretraining, which takes several hours, finding the posterior distribution for any dataset takes from hundredths to tenths of a second.  ( 2 min )
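    A minimal sketch of the two-stage recipe; the toy latent model, the log-squared-returns summary, and the choice of `MLPRegressor` are illustrative stand-ins, not the paper's models or architecture:

        import numpy as np
        from sklearn.neural_network import MLPRegressor

        rng = np.random.default_rng(1)
        T, n_train = 50, 2000

        def simulate():
            # Toy stochastic-volatility-style model with an AR(1) latent state.
            phi = rng.uniform(0.8, 0.99)
            h = np.zeros(T)
            for t in range(1, T):
                h[t] = phi * h[t - 1] + 0.2 * rng.standard_normal()
            y = np.exp(h / 2) * rng.standard_normal(T)
            # Target one characteristic of the marginal posterior: the last state.
            return np.log(y ** 2 + 1e-8), h[-1]

        # Stage 1: generate many artificial datasets.
        Y = np.empty((n_train, T)); H = np.empty(n_train)
        for i in range(n_train):
            Y[i], H[i] = simulate()

        # Stage 2: the amortized estimator, trained once and reused on any dataset.
        est = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=300).fit(Y, H)

        y_new, h_true = simulate()
        print(est.predict(y_new[None])[0], "vs true", h_true)   # near-instant at test time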
    BayesAdapter: Being Bayesian, Inexpensively and Reliably, via Bayesian Fine-tuning. (arXiv:2010.01979v5 [cs.LG] UPDATED)
    Despite their theoretical appeal, Bayesian neural networks (BNNs) have seen limited real-world adoption, mainly due to persistent concerns about their scalability, accessibility, and reliability. In this work, we develop the BayesAdapter framework to address these concerns. In particular, we propose to adapt pre-trained deterministic NNs to be variational BNNs via cost-effective Bayesian fine-tuning. Technically, we develop a modularized implementation for the learning of variational BNNs, and refurbish the generally applicable exemplar reparameterization trick through exemplar parallelization to efficiently reduce the gradient variance in stochastic variational inference. Based on the lightweight Bayesian learning paradigm, we conduct extensive experiments on a variety of benchmarks, and show that our method can consistently induce posteriors of higher quality than competitive baselines while significantly reducing training overheads. Code is available at https://github.com/thudzj/ScalableBDL.  ( 2 min )
    Testing Stationarity and Change Point Detection in Reinforcement Learning. (arXiv:2203.01707v2 [stat.ML] UPDATED)
    We consider offline reinforcement learning (RL) methods in possibly nonstationary environments. Many existing RL algorithms in the literature rely on the stationarity assumption that requires the system transition and the reward function to be constant over time. However, the stationarity assumption is restrictive in practice and is likely to be violated in a number of applications, including traffic signal control, robotics and mobile health. In this paper, we develop a consistent procedure to test the nonstationarity of the optimal policy based on pre-collected historical data, without additional online data collection. Based on the proposed test, we further develop a sequential change point detection method that can be naturally coupled with existing state-of-the-art RL methods for policy optimization in nonstationary environments. The usefulness of our method is illustrated by theoretical results, simulation studies, and a real data example from the 2018 Intern Health Study. A Python implementation of the proposed procedure is available at https://github.com/limengbinggz/CUSUM-RL.  ( 2 min )
    A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks. (arXiv:2205.05040v2 [cs.LG] UPDATED)
    In distributed training of deep neural networks, people usually run Stochastic Gradient Descent (SGD) or its variants on each machine and communicate with other machines periodically. However, SGD might converge slowly in training some deep neural networks (e.g., RNN, LSTM) because of the exploding gradient issue. Gradient clipping is usually employed to address this issue in the single machine setting, but exploring this technique in the distributed setting is still in its infancy: it remains unclear whether the gradient clipping scheme can take advantage of multiple machines to enjoy parallel speedup. The main technical difficulty lies in dealing with nonconvex loss function, non-Lipschitz continuous gradient, and skipping communication rounds simultaneously. In this paper, we explore a relaxed-smoothness assumption of the loss landscape which LSTM was shown to satisfy in previous works, and design a communication-efficient gradient clipping algorithm. This algorithm can be run on multiple machines, where each machine employs a gradient clipping scheme and communicates with other machines after multiple steps of gradient-based updates. Our algorithm is proved to have $O\left(\frac{1}{N\epsilon^4}\right)$ iteration complexity and $O(\frac{1}{\epsilon^3})$ communication complexity for finding an $\epsilon$-stationary point in the homogeneous data setting, where $N$ is the number of machines. This indicates that our algorithm enjoys linear speedup and reduced communication rounds. Our proof relies on novel analysis techniques of estimating truncated random variables, which we believe are of independent interest. Our experiments on several benchmark datasets and various scenarios demonstrate that our algorithm indeed exhibits fast convergence speed in practice and thus validates our theory.  ( 3 min )
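    A numpy sketch of the communication pattern: each machine takes several clipped local gradient steps and machines average parameters only at round boundaries (the least-squares loss, clipping threshold, and step size are stand-ins, not the paper's setting or exact algorithm):

        import numpy as np

        def clip(g, tau=1.0):
            norm = np.linalg.norm(g)
            return g if norm <= tau else g * (tau / norm)

        def grad(w, Xm, ym, batch):
            # Least-squares stand-in for the nonconvex losses studied in the paper.
            Xb, yb = Xm[batch], ym[batch]
            return 2 * Xb.T @ (Xb @ w - yb) / len(batch)

        rng = np.random.default_rng(0)
        N, d, I = 4, 5, 10                   # machines, dimension, local steps per round
        X = rng.standard_normal((N, 200, d))
        y = np.einsum('nij,j->ni', X, np.ones(d))   # homogeneous data, truth = all-ones

        w = np.zeros(d)
        for rnd in range(20):
            local = []
            for n in range(N):               # each machine: I clipped local steps
                wn = w.copy()
                for _ in range(I):
                    batch = rng.integers(0, 200, 32)
                    wn -= 0.05 * clip(grad(wn, X[n], y[n], batch))
                local.append(wn)
            w = np.mean(local, axis=0)       # communicate only once per round
        print(np.round(w, 2))                # should be roughly the all-ones vector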
    Invariance Learning in Deep Neural Networks with Differentiable Laplace Approximations. (arXiv:2202.10638v3 [stat.ML] UPDATED)
    Data augmentation is commonly applied to improve performance of deep learning by enforcing the knowledge that certain transformations on the input preserve the output. Currently, the data augmentation parameters are chosen by human effort and costly cross-validation, which makes it cumbersome to apply to new datasets. We develop a convenient gradient-based method for selecting the data augmentation without validation data during training of a deep neural network. Our approach relies on phrasing data augmentation as an invariance in the prior distribution on the functions of a neural network, which allows us to learn it using Bayesian model selection. This has been shown to work in Gaussian processes, but not yet for deep neural networks. We propose a differentiable Kronecker-factored Laplace approximation to the marginal likelihood as our objective, which can be optimised without human supervision or validation data. We show that our method can successfully recover invariances present in the data, and that this improves generalisation and data efficiency on image datasets.  ( 2 min )
    Contextual Combinatorial Bandits with Changing Action Sets via Gaussian Processes. (arXiv:2110.02248v2 [cs.LG] UPDATED)
    We consider a contextual bandit problem with a combinatorial action set and time-varying base arm availability. At the beginning of each round, the agent observes the set of available base arms and their contexts and then selects an action that is a feasible subset of the set of available base arms to maximize its cumulative reward in the long run. We assume that the mean outcomes of base arms are samples from a Gaussian Process (GP) indexed by the context set ${\cal X}$, and the expected reward is Lipschitz continuous in expected base arm outcomes. For this setup, we propose an algorithm called Optimistic Combinatorial Learning and Optimization with Kernel Upper Confidence Bounds (O'CLOK-UCB) and prove that it incurs $\tilde{O}(\sqrt{\lambda^*(K)KT\overline{\gamma}_{T}} )$ regret with high probability, where $\overline{\gamma}_{T}$ is the maximum information gain associated with the set of base arm contexts that appeared in the first $T$ rounds, $K$ is the maximum cardinality of any feasible action over all rounds and $\lambda^*(K)$ is the maximum eigenvalue of all covariance matrices of selected actions up to time $T$, which is a function of $K$. To dramatically speed up the algorithm, we also propose a variant of O'CLOK-UCB that uses sparse GPs. Finally, we experimentally show that both algorithms exploit inter-base arm outcome correlation and vastly outperform the previous state-of-the-art UCB-based algorithms in realistic setups.  ( 3 min )
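    A sketch of the per-round index computation, for the special case where any size-$K$ subset is feasible and the reward is a sum of base arm outcomes; the paper handles general feasible sets and Lipschitz rewards, and uses sparse GPs for speed:

        import numpy as np
        from sklearn.gaussian_process import GaussianProcessRegressor

        rng = np.random.default_rng(0)

        # Contexts of today's available base arms, plus past (context, outcome) data.
        contexts = rng.uniform(0, 1, (20, 2))
        X_hist = rng.uniform(0, 1, (100, 2))
        y_hist = np.sin(3 * X_hist[:, 0]) + 0.1 * rng.standard_normal(100)

        # GP posterior over base arm outcomes, indexed by context.
        gp = GaussianProcessRegressor().fit(X_hist, y_hist)
        mu, sigma = gp.predict(contexts, return_std=True)

        beta, K = 2.0, 5
        ucb = mu + beta * sigma              # optimistic index per available base arm
        action = np.argsort(-ucb)[:K]        # top-K subset under a cardinality constraint
        print(action)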
    When Do Flat Minima Optimizers Work?. (arXiv:2202.00661v4 [cs.LG] UPDATED)
    Recently, flat-minima optimizers, which seek to find parameters in low-loss neighborhoods, have been shown to improve a neural network's generalization performance over stochastic and adaptive gradient-based optimizers. Two methods have received significant attention due to their scalability: 1. Stochastic Weight Averaging (SWA), and 2. Sharpness-Aware Minimization (SAM). However, there has been limited investigation into their properties and no systematic benchmarking of them across different domains. We fill this gap here by comparing the loss surfaces of the models trained with each method and through broad benchmarking across computer vision, natural language processing, and graph representation learning tasks. We discover several surprising findings from these results, which we hope will help researchers further improve deep learning optimizers, and practitioners identify the right optimizer for their problem.  ( 2 min )
    Learning Multivariate CDFs and Copulas using Tensor Factorization. (arXiv:2210.07132v1 [stat.ML])
    Learning the multivariate distribution of data is a core challenge in statistics and machine learning. Traditional methods aim for the probability density function (PDF) and are limited by the curse of dimensionality. Modern neural methods are mostly based on black-box models, lacking identifiability guarantees. In this work, we aim to learn multivariate cumulative distribution functions (CDFs), as they can handle mixed random variables, allow efficient box probability evaluation, and have the potential to overcome local sample scarcity owing to their cumulative nature. We show that any grid sampled version of a joint CDF of mixed random variables admits a universal representation as a naive Bayes model via the Canonical Polyadic (tensor-rank) decomposition. By introducing a low-rank model, either directly in the raw data domain, or indirectly in a transformed (Copula) domain, the resulting model affords efficient sampling, closed form inference and uncertainty quantification, and comes with uniqueness guarantees under relatively mild conditions. We demonstrate the superior performance of the proposed model in several synthetic and real datasets and applications including regression, sampling and data imputation. Interestingly, our experiments with real data show that it is possible to obtain better density/mass estimates indirectly via a low-rank CDF model, than a low-rank PDF/PMF model.  ( 2 min )
    Stochastic Contextual Dueling Bandits under Linear Stochastic Transitivity Models. (arXiv:2202.04593v2 [cs.LG] UPDATED)
    We consider the regret minimization task in a dueling bandits problem with context information. In every round of the sequential decision problem, the learner makes a context-dependent selection of two choice alternatives (arms) to be compared with each other and receives feedback in the form of noisy preference information. We assume that the feedback process is determined by a linear stochastic transitivity model with contextualized utilities (CoLST), and the learner's task is to include the best arm (with highest latent context-dependent utility) in the duel. We propose a computationally efficient algorithm, $\texttt{CoLSTIM}$, which makes its choice based on imitating the feedback process using perturbed context-dependent utility estimates of the underlying CoLST model. If each arm is associated with a $d$-dimensional feature vector, we show that $\texttt{CoLSTIM}$ achieves a regret of order $\tilde O( \sqrt{dT})$ after $T$ learning rounds. Additionally, we also establish the optimality of $\texttt{CoLSTIM}$ by showing a lower bound for the weak regret that refines the existing average regret analysis. Our experiments demonstrate its superiority over state-of-the-art algorithms for special cases of CoLST models.  ( 3 min )
    LIMEADE: From AI Explanations to Advice Taking. (arXiv:2003.04315v4 [cs.IR] UPDATED)
    Research in human-centered AI has shown the benefits of systems that can explain their predictions. Methods that allow an AI to take advice from humans in response to explanations are similarly useful. While both capabilities are well-developed for transparent learning models (e.g., linear models and GA$^2$Ms), and recent techniques (e.g., LIME and SHAP) can generate explanations for opaque models, little attention has been given to advice methods for opaque models. This paper introduces LIMEADE, the first general framework that translates both positive and negative advice (expressed using high-level vocabulary such as that employed by post-hoc explanations) into an update to an arbitrary, underlying opaque model. We demonstrate the generality of our approach with case studies on seventy real-world models across two broad domains: image classification and text recommendation. We show our method improves accuracy compared to a rigorous baseline on the image classification domains. For the text modality, we apply our framework to a neural recommender system for scientific papers on a public website; our user study shows that our framework leads to significantly higher perceived user control, trust, and satisfaction.  ( 3 min )
    The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks. (arXiv:2110.03922v4 [cs.LG] UPDATED)
    We derive a simple unified framework giving closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. These improvements are enabled by our identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al. (2020), ii) generalize a previous result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics.  ( 2 min )
    A Multilabel Classification Framework for Approximate Nearest Neighbor Search. (arXiv:1910.08322v5 [cs.LG] UPDATED)
    Both supervised and unsupervised machine learning algorithms have been used to learn partition-based index structures for approximate nearest neighbor (ANN) search. Existing supervised algorithms formulate the learning task as finding a partition in which the nearest neighbors of a training set point belong to the same partition element as the point itself, so that the nearest neighbor candidates can be retrieved by naive lookup or backtracking search. We formulate candidate set selection in ANN search directly as a multilabel classification problem where the labels correspond to the nearest neighbors of the query point, and interpret the partitions as partitioning classifiers for solving this task. Empirical results suggest that the natural classifier based on this interpretation leads to strictly improved performance when combined with any unsupervised or supervised partitioning strategy. We also prove a sufficient condition for consistency of a partitioning classifier for ANN search, and illustrate the result by verifying this condition for chronological $k$-d trees.  ( 3 min )
    Reproducibility in Optimization: Theoretical Framework and Limits. (arXiv:2202.04598v3 [math.OC] UPDATED)
    We initiate a formal study of reproducibility in optimization. We define a quantitative measure of reproducibility of optimization procedures in the face of noisy or error-prone operations such as inexact or stochastic gradient computations or inexact initialization. We then analyze several convex optimization settings of interest such as smooth, non-smooth, and strongly-convex objective functions and establish tight bounds on the limits of reproducibility in each setting. Our analysis reveals a fundamental trade-off between computation and reproducibility: more computation is necessary (and sufficient) for better reproducibility.  ( 2 min )
    COLLIDER: A Robust Training Framework for Backdoor Data. (arXiv:2210.06704v1 [cs.LG])
    Deep neural network (DNN) classifiers are vulnerable to backdoor attacks. In such attacks, an adversary poisons some of the training data by installing a trigger. The goal is to make the trained DNN output the attacker's desired class whenever the trigger is activated while performing as usual for clean data. Various approaches have recently been proposed to detect malicious backdoored DNNs. However, a robust, end-to-end training approach, like adversarial training, is yet to be discovered for backdoor poisoned data. In this paper, we take the first step toward such methods by developing a robust training framework, COLLIDER, that selects the most prominent samples by exploiting the underlying geometric structures of the data. Specifically, we effectively filter out candidate poisoned data at each training epoch by solving a geometrical coreset selection objective. We first argue that clean data samples exhibit (1) gradients similar to the clean majority of data and (2) low local intrinsic dimensionality (LID). Based on these criteria, we define a novel coreset selection objective to find such samples, which are used for training a DNN. We show the effectiveness of the proposed method for robust training of DNNs on various poisoned datasets, reducing the backdoor success rate significantly.  ( 3 min )
    On the potential benefits of entropic regularization for smoothing Wasserstein estimators. (arXiv:2210.06934v1 [stat.ML])
    This paper is focused on the study of entropic regularization in optimal transport as a smoothing method for Wasserstein estimators, through the prism of the classical tradeoff between approximation and estimation errors in statistics. Wasserstein estimators are defined as solutions of variational problems whose objective function involves the use of an optimal transport cost between probability measures. Such estimators can be regularized by replacing the optimal transport cost by its regularized version using an entropy penalty on the transport plan. The use of such a regularization has a potentially significant smoothing effect on the resulting estimators. In this work, we investigate its potential benefits on the approximation and estimation properties of regularized Wasserstein estimators. Our main contribution is to discuss how entropic regularization may reach, at a lower computational cost, statistical performance comparable to that of un-regularized Wasserstein estimators in statistical learning problems involving distributional data analysis. To this end, we present new theoretical results on the convergence of regularized Wasserstein estimators. We also study their numerical performance using simulated and real data in the supervised learning problem of proportions estimation in mixture models using optimal transport.  ( 2 min )
    Reliable Neural Networks for Regression Uncertainty Estimation. (arXiv:2109.08213v2 [cs.LG] UPDATED)
    While deep neural networks are highly performant and successful in a wide range of real-world problems, estimating their predictive uncertainty remains a challenging task. To address this challenge, we propose and implement a loss function for regression uncertainty estimation based on the Bayesian Validation Metric (BVM) framework while using ensemble learning. The proposed loss reproduces maximum likelihood estimation in the limiting case. A series of experiments on in-distribution data show that the proposed method is competitive with existing state-of-the-art methods. Experiments on out-of-distribution data show that the proposed method is robust to statistical change and exhibits superior predictive capability.  ( 2 min )
    Optimal Spectral Recovery of a Planted Vector in a Subspace. (arXiv:2105.15081v2 [math.ST] UPDATED)
    Recovering a planted vector $v$ in an $n$-dimensional random subspace of $\mathbb{R}^N$ is a generic task related to many problems in machine learning and statistics, such as dictionary learning, subspace recovery, principal component analysis, and non-Gaussian component analysis. In this work, we study computationally efficient estimation and detection of a planted vector $v$ whose $\ell_4$ norm differs from that of a Gaussian vector with the same $\ell_2$ norm. For instance, in the special case where $v$ is an $N \rho$-sparse vector with Bernoulli-Gaussian or Bernoulli-Rademacher entries, our results include the following: (1) We give an improved analysis of a slight variant of the spectral method proposed by Hopkins, Schramm, Shi, and Steurer (2016), showing that it approximately recovers $v$ with high probability in the regime $n \rho \ll \sqrt{N}$. This condition subsumes the conditions $\rho \ll 1/\sqrt{n}$ or $n \sqrt{\rho} \lesssim \sqrt{N}$ required by previous work up to polylogarithmic factors. We achieve $\ell_\infty$ error bounds for the spectral estimator via a leave-one-out analysis, from which it follows that a simple thresholding procedure exactly recovers $v$ with Bernoulli-Rademacher entries, even in the dense case $\rho = 1$. (2) We study the associated detection problem and show that in the regime $n \rho \gg \sqrt{N}$, any spectral method from a large class (and more generally, any low-degree polynomial of the input) fails to detect the planted vector. This matches the condition for recovery and offers evidence that no polynomial-time algorithm can succeed in recovering a Bernoulli-Gaussian vector $v$ when $n \rho \gg \sqrt{N}$.  ( 3 min )
    Variance-Aware Estimation of Kernel Mean Embedding. (arXiv:2210.06672v1 [math.ST])
    An important feature of kernel mean embeddings (KME) is that the rate of convergence of the empirical KME to the true distribution KME can be bounded independently of the dimension of the space, properties of the distribution and smoothness features of the kernel. We show how to speed-up convergence by leveraging variance information in the RKHS. Furthermore, we show that even when such information is a priori unknown, we can efficiently estimate it from the data, recovering the desiderata of a distribution agnostic bound that enjoys acceleration in fortuitous settings. We illustrate our methods in the context of hypothesis testing and robust parametric estimation.  ( 2 min )
    Dirichlet process mixture models for non-stationary data streams. (arXiv:2210.06872v1 [stat.ML])
    In recent years, we have seen a handful of works on inference algorithms over non-stationary data streams. Given their flexibility, Bayesian non-parametric models are a good candidate for these scenarios. However, reliable streaming inference under the concept drift phenomenon is still an open problem for these models. In this work, we propose a variational inference algorithm for Dirichlet process mixture models. Our proposal deals with concept drift by applying exponential forgetting to the prior global parameters, which allows the learned model to adapt to concept drift automatically. We perform experiments on both synthetic and real data, showing that the proposed model is competitive with state-of-the-art algorithms in the density estimation problem, and that it outperforms them in the clustering problem.  ( 2 min )
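    The forgetting mechanism itself is simple to sketch: global sufficient statistics are decayed before each batch is absorbed, so stale evidence fades under drift. In the sketch below, responsibilities are set to dummy values; computing them is the variational-inference part this sketch omits:

        import numpy as np

        lam = 0.95            # forgetting factor: weight placed on past statistics

        # Global sufficient statistics of one mixture component.
        N_k = 0.0             # effective count
        S_k = np.zeros(2)     # weighted sum of assigned points

        def update(batch, resp):
            # One streaming update: decay old statistics, absorb the new batch.
            global N_k, S_k
            N_k = lam * N_k + resp.sum()
            S_k = lam * S_k + resp @ batch

        rng = np.random.default_rng(0)
        for t in range(100):
            center = np.zeros(2) if t < 50 else np.array([5.0, 5.0])   # concept drift
            batch = center + rng.standard_normal((32, 2))
            update(batch, np.ones(32))       # dummy responsibilities of 1
            if t % 25 == 24:
                print(t, np.round(S_k / N_k, 2))   # component mean tracks the drift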
    Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data. (arXiv:2210.07082v1 [cs.LG])
    The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent.  ( 2 min )
    Mean-field analysis for heavy ball methods: Dropout-stability, connectivity, and global convergence. (arXiv:2210.06819v1 [cs.LG])
    The stochastic heavy ball method (SHB), also known as stochastic gradient descent (SGD) with Polyak's momentum, is widely used in training neural networks. However, despite the remarkable success of such algorithm in practice, its theoretical characterization remains limited. In this paper, we focus on neural networks with two and three layers and provide a rigorous understanding of the properties of the solutions found by SHB: \emph{(i)} stability after dropping out part of the neurons, \emph{(ii)} connectivity along a low-loss path, and \emph{(iii)} convergence to the global optimum. To achieve this goal, we take a mean-field view and relate the SHB dynamics to a certain partial differential equation in the limit of large network widths. This mean-field perspective has inspired a recent line of work focusing on SGD while, in contrast, our paper considers an algorithm with momentum. More specifically, after proving existence and uniqueness of the limit differential equations, we show convergence to the global optimum and give a quantitative bound between the mean-field limit and the SHB dynamics of a finite-width network. Armed with this last bound, we are able to establish the dropout-stability and connectivity of SHB solutions.  ( 2 min )
    On the Theoretical Equivalence of Several Trade-Off Curves Assessing Statistical Proximity. (arXiv:2006.11809v3 [cs.LG] UPDATED)
    The recent advent of powerful generative models has triggered the renewed development of quantitative measures to assess the proximity of two probability distributions. As the scalar Frechet inception distance remains popular, several methods have explored computing entire curves, which reveal the trade-off between the fidelity and variability of the first distribution with respect to the second one. Several of such variants have been proposed independently and while intuitively similar, their relationship has not yet been made explicit. In an effort to make the emerging picture of generative evaluation more clear, we propose a unification of four curves known respectively as: the precision-recall (PR) curve, the Lorenz curve, the receiver operating characteristic (ROC) curve and a special case of R\'enyi divergence frontiers. In addition, we discuss possible links between PR / Lorenz curves with the derivation of domain adaptation bounds.  ( 2 min )
    Forecast Hedging and Calibration. (arXiv:2210.07169v1 [econ.TH])
    Calibration means that forecasts and average realized frequencies are close. We develop the concept of forecast hedging, which consists of choosing the forecasts so as to guarantee that the expected track record can only improve. This yields all the calibration results by the same simple basic argument while differentiating between them by the forecast-hedging tools used: deterministic and fixed point based versus stochastic and minimax based. Additional contributions are an improved definition of continuous calibration, ensuing game dynamics that yield Nash equilibria in the long run, and a new calibrated forecasting procedure for binary events that is simpler than all known such procedures.  ( 2 min )
    Smooth Calibration, Leaky Forecasts, Finite Recall, and Nash Dynamics. (arXiv:2210.07152v1 [econ.TH])
    We propose to smooth out the calibration score, which measures how good a forecaster is, by combining nearby forecasts. While regular calibration can be guaranteed only by randomized forecasting procedures, we show that smooth calibration can be guaranteed by deterministic procedures. As a consequence, it does not matter if the forecasts are leaked, i.e., made known in advance: smooth calibration can nevertheless be guaranteed (while regular calibration cannot). Moreover, our procedure has finite recall, is stationary, and all forecasts lie on a finite grid. To construct the procedure, we deal also with the related setups of online linear regression and weak calibration. Finally, we show that smooth calibration yields uncoupled finite-memory dynamics in $n$-person games, "smooth calibrated learning," in which the players play approximate Nash equilibria in almost all periods (by contrast, calibrated learning, which uses regular calibration, yields only that the time-averages of play are approximate correlated equilibria).  ( 2 min )
    LION: Latent Point Diffusion Models for 3D Shape Generation. (arXiv:2210.06978v1 [cs.CV])
    Denoising diffusion models (DDMs) have shown promising results in 3D point cloud synthesis. To advance 3D DDMs and make them useful for digital artists, we require (i) high generation quality, (ii) flexibility for manipulation and applications such as conditional synthesis and shape interpolation, and (iii) the ability to output smooth surfaces or meshes. To this end, we introduce the hierarchical Latent Point Diffusion Model (LION) for 3D shape generation. LION is set up as a variational autoencoder (VAE) with a hierarchical latent space that combines a global shape latent representation with a point-structured latent space. For generation, we train two hierarchical DDMs in these latent spaces. The hierarchical VAE approach boosts performance compared to DDMs that operate on point clouds directly, while the point-structured latents are still ideally suited for DDM-based modeling. Experimentally, LION achieves state-of-the-art generation performance on multiple ShapeNet benchmarks. Furthermore, our VAE framework allows us to easily use LION for different relevant tasks: LION excels at multimodal shape denoising and voxel-conditioned synthesis, and it can be adapted for text- and image-driven 3D generation. We also demonstrate shape autoencoding and latent shape interpolation, and we augment LION with modern surface reconstruction techniques to generate smooth 3D meshes. We hope that LION provides a powerful tool for artists working with 3D shapes due to its high-quality generation, flexibility, and surface reconstruction. Project page and code: https://nv-tlabs.github.io/LION.  ( 3 min )
    Utilizing supervised models to infer consensus labels and their quality from data with multiple annotators. (arXiv:2210.06812v1 [cs.LG])
    Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to estimate: (1) A consensus label for each example that aggregates the individual annotations (more accurately than aggregation via majority-vote or other algorithms used in crowdsourcing); (2) A confidence score for how likely each consensus label is correct (via well-calibrated estimates that account for the number of annotations for each example and their agreement, prediction-confidence from a trained classifier, and trustworthiness of each annotator vs. the classifier); (3) A rating for each annotator quantifying the overall correctness of their labels. While many algorithms have been proposed to estimate related quantities in crowdsourcing, these often rely on sophisticated generative models with iterative inference schemes, whereas CROWDLAB is based on simple weighted ensembling. Many algorithms also rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB in contrast utilizes any classifier model trained on these features, which can generalize between examples with similar features. In evaluations on real-world multi-annotator image data, our proposed method provides superior estimates for (1)-(3) than many alternative algorithms.  ( 2 min )
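    A minimal sketch of the weighted-ensembling idea: annotator votes and classifier probabilities are combined with trust weights, fixed by hand below, whereas CROWDLAB estimates them from annotator/classifier agreement:

        import numpy as np

        # Per-example annotator labels (-1 = not annotated) over 4 classes, plus
        # predicted class probabilities from any classifier trained on the features.
        annot = np.array([[0, 0, 1],
                          [2, -1, 2],
                          [3, 1, -1]])
        clf_probs = np.array([[0.7, 0.2, 0.05, 0.05],
                              [0.1, 0.1, 0.7, 0.1],
                              [0.1, 0.1, 0.2, 0.6]])

        w_clf, w_ann = 1.0, 0.8   # illustrative trust weights

        def consensus(row_labels, probs):
            score = w_clf * probs.copy()
            for lab in row_labels:
                if lab >= 0:
                    score[lab] += w_ann       # one-hot vote, weighted by trust
            score /= score.sum()
            return score.argmax(), score.max()   # consensus label and its confidence

        for labels, probs in zip(annot, clf_probs):
            print(consensus(labels, probs))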
    On the Efficient Implementation of High Accuracy Optimality of Profile Maximum Likelihood. (arXiv:2210.06728v1 [stat.ML])
    We provide an efficient unified plug-in approach for estimating symmetric properties of distributions given $n$ independent samples. Our estimator is based on profile-maximum-likelihood (PML) and is sample optimal for estimating various symmetric properties when the estimation error $\epsilon \gg n^{-1/3}$. This result improves upon the previous best accuracy threshold of $\epsilon \gg n^{-1/4}$ achievable by polynomial time computable PML-based universal estimators [ACSS21, ACSS20]. Our estimator reaches a theoretical limit for universal symmetric property estimation as [Han21] shows that a broad class of universal estimators (containing many well known approaches including ours) cannot be sample optimal for every $1$-Lipschitz property when $\epsilon \ll n^{-1/3}$.  ( 2 min )
    Evaluated CMI Bounds for Meta Learning: Tightness and Expressiveness. (arXiv:2210.06511v1 [cs.LG])
    Recent work has established that the conditional mutual information (CMI) framework of Steinke and Zakynthinou (2020) is expressive enough to capture generalization guarantees in terms of algorithmic stability, VC dimension, and related complexity measures for conventional learning (Harutyunyan et al., 2021, Haghifam et al., 2021). Hence, it provides a unified method for establishing generalization bounds. In meta learning, there has so far been a divide between information-theoretic results and results from classical learning theory. In this work, we take a first step toward bridging this divide. Specifically, we present novel generalization bounds for meta learning in terms of the evaluated CMI (e-CMI). To demonstrate the expressiveness of the e-CMI framework, we apply our bounds to a representation learning setting, with $n$ samples from $\hat n$ tasks parameterized by functions of the form $f_i \circ h$. Here, each $f_i \in \mathcal F$ is a task-specific function, and $h \in \mathcal H$ is the shared representation. For this setup, we show that the e-CMI framework yields a bound that scales as $\sqrt{ \mathcal C(\mathcal H)/(n\hat n) + \mathcal C(\mathcal F)/n} $, where $\mathcal C(\cdot)$ denotes a complexity measure of the hypothesis class. This scaling behavior coincides with the one reported in Tripuraneni et al. (2020) using Gaussian complexity.  ( 3 min )
    Interpreting Neural Policies with Disentangled Tree Representations. (arXiv:2210.06650v1 [cs.LG])
    Compact neural networks used in policy learning and closed-loop end-to-end control learn representations from data that encapsulate agent dynamics and potentially the agent-environment's factors of variation. A formal and quantitative understanding and interpretation of these explanatory factors in neural representations is difficult to achieve due to the complex and intertwined correspondence of neural activities with emergent behaviors. In this paper, we design a new algorithm that programmatically extracts tree representations from compact neural policies, in the form of a set of logic programs grounded by the world state. To assess how well networks uncover the dynamics of the task and their factors of variation, we introduce interpretability metrics that measure the disentanglement of learned neural dynamics from the perspectives of decision concentration, mutual information, and modularity. Moreover, our method allows us to quantify how accurate the extracted decision paths (explanations) are and to compute cross-neuron logic conflicts. We demonstrate the effectiveness of our approach with several types of compact network architectures on a series of end-to-end learning-to-control tasks.  ( 2 min )
    Rigorous dynamical mean field theory for stochastic gradient descent methods. (arXiv:2210.06591v1 [math-ph])
    We prove closed-form equations for the exact high-dimensional asymptotics of a family of first order gradient-based methods, learning an estimator (e.g. M-estimator, shallow neural network, ...) from observations on Gaussian data with empirical risk minimization. This includes widely used algorithms such as stochastic gradient descent (SGD) or Nesterov acceleration. The obtained equations match those resulting from the discretization of dynamical mean-field theory (DMFT) equations from statistical physics when applied to gradient flow. Our proof method allows us to give an explicit description of how memory kernels build up in the effective dynamics, and to include non-separable update functions, allowing datasets with non-identity covariance matrices. Finally, we provide numerical implementations of the equations for SGD with generic extensive batch-size and with constant learning rates.  ( 2 min )
    Gaussian Processes on Distributions based on Regularized Optimal Transport. (arXiv:2210.06574v1 [stat.ML])
    We present a novel kernel over the space of probability measures based on the dual formulation of regularized optimal transport. We propose a Hilbertian embedding of the space of probabilities using their Sinkhorn potentials, which are solutions of the dual entropic relaxed optimal transport between the probabilities and a reference measure $\mathcal{U}$. We prove that this construction enables us to obtain a valid kernel by using the Hilbert norms. We prove that the kernel enjoys theoretical properties such as universality and some invariances, while remaining computationally feasible. Moreover, we provide theoretical guarantees on the behaviour of a Gaussian process based on this kernel. Empirical performance is compared with other traditional choices of kernels for processes indexed on distributions.  ( 2 min )
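    A numpy sketch of the embedding in one dimension: run Sinkhorn against a fixed reference measure, read off the dual potential on the reference support, and compare embeddings with a Gaussian kernel. The log-domain iterations are written out by hand; a library such as POT could replace them:

        import numpy as np
        from scipy.special import logsumexp

        def sinkhorn_potential(a, x, b_ref, x_ref, reg=0.1, iters=200):
            # Dual potential of entropic OT between (a, x) and the reference
            # measure (b_ref, x_ref), via log-domain Sinkhorn iterations.
            C = (x[:, None] - x_ref[None, :]) ** 2        # squared-distance cost
            f, g = np.zeros(len(x)), np.zeros(len(x_ref))
            loga, logb = np.log(a), np.log(b_ref)
            for _ in range(iters):
                f = -reg * logsumexp((g[None, :] - C) / reg + logb[None, :], axis=1)
                g = -reg * logsumexp((f[:, None] - C) / reg + loga[:, None], axis=0)
            return g                                      # the embedding of (a, x)

        rng = np.random.default_rng(0)
        x_ref, b_ref = np.linspace(-3, 3, 30), np.full(30, 1 / 30)   # reference measure
        p, q = rng.normal(0, 1, 50), rng.normal(1, 1, 50)
        e_p = sinkhorn_potential(np.full(50, 1 / 50), p, b_ref, x_ref)
        e_q = sinkhorn_potential(np.full(50, 1 / 50), q, b_ref, x_ref)

        # Gaussian kernel on the Hilbertian embeddings of the two distributions.
        print(np.exp(-np.sum((e_p - e_q) ** 2) / 2.0))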
    Robust Neural Posterior Estimation and Statistical Model Criticism. (arXiv:2210.06564v1 [stat.ML])
    Computer simulations have proven a valuable tool for understanding complex phenomena across the sciences. However, the utility of simulators for modelling and forecasting purposes is often restricted by low data quality, as well as practical limits to model fidelity. In order to circumvent these difficulties, we argue that modellers must treat simulators as idealistic representations of the true data generating process, and consequently should thoughtfully consider the risk of model misspecification. In this work we revisit neural posterior estimation (NPE), a class of algorithms that enable black-box parameter inference in simulation models, and consider the implication of a simulation-to-reality gap. While recent works have demonstrated reliable performance of these methods, the analyses have been performed using synthetic data generated by the simulator model itself, and have therefore only addressed the well-specified case. In this paper, we find that the presence of misspecification, in contrast, leads to unreliable inference when NPE is used naively. As a remedy we argue that principled scientific inquiry with simulators should incorporate a model criticism component, to facilitate interpretable identification of misspecification and a robust inference component, to fit 'wrong but useful' models. We propose robust neural posterior estimation (RNPE), an extension of NPE to simultaneously achieve both these aims, through explicitly modelling the discrepancies between simulations and the observed data. We assess the approach on a range of artificially misspecified examples, and find RNPE performs well across the tasks, whereas naively using NPE leads to misleading and erratic posteriors.  ( 3 min )

  • Open

    [R] Mind's Eye: Grounded Language Model Reasoning through Simulation - Google Research 2022
    Paper: https://arxiv.org/abs/2210.05359 Abstract: Successful and effective communication between humans and AI relies on a shared experience of the world. By training solely on written text, current language models (LMs) miss the grounded experience of humans in the real world -- their failure to relate language to the physical world causes knowledge to be misrepresented and obvious mistakes in their reasoning. We present Mind's Eye, a paradigm to ground language model reasoning in the physical world. Given a physical reasoning question, we use a computational physics engine (DeepMind's MuJoCo) to simulate the possible outcomes, and then use the simulation results as part of the input, which enables language models to perform reasoning. Experiments on 39 tasks in a physics alignment benchmark demonstrate that Mind's Eye can improve reasoning ability by a large margin (27.9% zero-shot, and 46.0% few-shot absolute accuracy improvement on average). Smaller language models armed with Mind's Eye can obtain similar performance to models that are 100x larger. Finally, we confirm the robustness of Mind's Eye through ablation studies. submitted by /u/Singularian2501 [link] [comments]  ( 125 min )
    [R] Help required in finding thesis topic
    Could anybody suggest a topic for a thesis in cyber security and artificial intelligence? I am really struggling to come up with one; that includes shortlisting from multiple candidate topics (I can access tools like Web of Science). Any help or guidance is appreciated. submitted by /u/aap9000 [link] [comments]  ( 123 min )
    [R] CLIP-Fields: Weakly Supervised Semantic Fields for Robotic Memory + Code + Robot demo
    Hi r/MachineLearning! We recently published CLIP-Fields: (github) a way to train an implicit model over any physical space with zero human labels that you can query with open-vocab natural language! We used it to make some robots run around, for example, our lab kitchen, responding to different queries like "warm up my lunch" or "throw out my trash"! Free lunch not included with the model* We made this using pretty recent advances in web-data pretrained models like Detic and LSeg for detection, CLIP for visual queries, and Sentence BERT for semantic queries. Our "database" is really a neural field (Instant NGP) that maps from 3D coordinates to a high dimensional embedding vector in the same representation space as CLIP and SBERT. Best part, I believe, is that you should be able to train your own CLIP-Field for your living room if you have an hour, a decent GPU, and a way to get RGB-D video (an iPhone 13 Pro works great!) I hope you can give the code a try: https://github.com/notmahi/clip-fields or check out the website https://mahis.life/clip-fields/ for more interactive demos. Our Arxiv submission is also out now, at https://arxiv.org/abs/2210.05663, and if you want a longer tl;dr with a couple more videos, check out this tweet. Thanks! submitted by /u/not_mahi [link] [comments]  ( 126 min )
    [P] a minimalist guide to program synthesis
    hiyo /ml, I've done program synthesis professionally for over 10 years, and I've recently finished writing a blog series explaining how to get started in program synthesis. Here's an excerpt from the about page: Program synthesis is useful – Who wouldn’t want to make a computer that automatically writes programs? As humans and computers continue to work in collaboration, the distinction between programming, program-synthesis, and naturalistic communication will continue to blur. However, there is a knowledge gap between how to build state of the art program synthesis algorithms and what is generally known about it. This gap is much bigger than it needs to be. This blog aims to shrink this knowledge gap, so that you can start applying program synthesis to your own work. We will cover both the concepts of program synthesis – so you can have a framework to think and talk about it – and the bare-minimum tooling required to implement these algorithms – so you can start iterating on solutions. Ultimately, I hope researchers and system-builders can view “programming” as more than typing obscure green characters onto an uncompromising black terminal, and build systems that are as empathetic as they are efficient. specifically, it covers topics from how to formulate a synthesis problem, to how to fine-tune an llm on huggingface to write programs matching a specification. blog : https://evanthebouncy.github.io/program-synthesis-minimal/ twitter thread: https://twitter.com/evanthebouncy/status/1580634593685753856 I can take some questions in this thread, please feel free to ask me anything, from technical to hot-takes. --evan submitted by /u/evanthebouncy [link] [comments]  ( 137 min )
    [D] how to handle xray/CT/ MRI in images processing?
    When it comes to brain or chest X-ray/CT/MRI, how can we handle these images when we want to apply image classification? Do we need to apply augmentation? I read a little about histogram equalization and saw a few examples of people applying it to X-ray images, but I don't understand: is it part of augmentation? Are there any methods similar to histogram equalization? submitted by /u/sk8er_girl90 [link] [comments]  ( 129 min )
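    A quick illustration of the distinction asked about above: histogram equalization (and its adaptive variant, CLAHE) is a contrast-normalization preprocessing step, not augmentation. A minimal OpenCV sketch, with a hypothetical filename:

    ```python
    import cv2

    # Load a scan as a single-channel image ("chest_xray.png" is a placeholder)
    img = cv2.imread("chest_xray.png", cv2.IMREAD_GRAYSCALE)

    equalized = cv2.equalizeHist(img)  # global histogram equalization

    # CLAHE: equalization computed per tile, with contrast clipping --
    # often preferred for X-rays because it preserves local detail
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    adaptive = clahe.apply(img)
    ```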
    [N] Easily profile FastAPI model serving
    We've added a simple way to profile any model serving endpoint, including FastAPI, to identify bottlenecks and make inference (incl. data processing) faster, especially for big models and data. Wanted to share it here in case someone is struggling with profiling and monitoring of deployed code and models. By default, a generic Python profiler will automatically profile some of the inferences (and measure all inferences). You can also specify other profilers for PyTorch, TensorFlow, Jax and ONNX Runtime. All profiles and metrics will be available on the SaaS dashboard, no need to set up anything. A couple of links to get started: Repo: https://github.com/graphsignal/graphsignal FastAPI example: https://graphsignal.com/docs/integrations/fastapi/ Happy for any feedback! submitted by /u/l0g1cs [link] [comments]  ( 124 min )
    [D] What approach to decide which class is most optimal for recovery?
    Although the origin of this problem isn't in the medical field, I've translated it into medical terms since that's a popular and well-understood example. Let's say we have a dataset of millions of patients; for each of them we have the following: a bunch of statistics about them, such as weight/vitals/blood levels/nutrient levels, etc.; the drug they have taken, classed 1 to 5 (not equally distributed); and whether they recovered (1) or not (0) (roughly a 50/50 split, but not equally distributed among the other factors). What I want to do is recommend a drug based on the recovery rate plus their statistics. For instance, it may be that drug 1 works better than the other drugs for patients with high blood pressure. My current idea is to build 1 model per drug …  ( 125 min )
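    A minimal sketch of the one-model-per-drug idea mentioned above (all names are hypothetical, and note that this naive approach inherits confounding from the non-random drug assignment): fit one recovery classifier per drug, then recommend the drug whose model gives the highest predicted recovery probability.

    ```python
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def fit_per_drug_models(X, drug, recovered, n_drugs=5):
        # One recovery classifier per drug, trained only on that drug's patients
        models = {}
        for d in range(1, n_drugs + 1):
            mask = drug == d
            models[d] = LogisticRegression(max_iter=1000).fit(X[mask], recovered[mask])
        return models

    def recommend(models, patient):
        # Score the patient under every drug's model, pick the best
        probs = {d: m.predict_proba(patient.reshape(1, -1))[0, 1]
                 for d, m in models.items()}
        return max(probs, key=probs.get)
    ```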
    [R] LAION-5B: An open large-scale dataset for training next generation image-text models
    submitted by /u/hardmaru [link] [comments]  ( 123 min )
    [Project] Moral stories dataset
    I'm looking for datasets of moral stories for my project, preferably containing both the text of each story and its moral/message. submitted by /u/hanakokunn [link] [comments]  ( 128 min )
    [P] Towards photorealistic AI images
    Here's an update on our work (/u/da_mulle and I) on generating photorealistic AI images! We've launched a showcase of our latest images: https://nyx.gallery/ This is a continuation of our work on "This Food Does Not Exist" (Reddit discussion, Github, checkpoints). Three months ago we made a first release consisting of 4 StyleGAN models, each trained on a specific food item: cookies, cheesecake, cocktails and sushi. We have now considerably improved our approach: multi-class StyleGANs (e.g. the images for cookies, cheesecakes, cocktails, sushi and burgers all come from a single model); we've started experimenting, unsurprisingly, with Stable Diffusion (you can see which model was used on each image); we're also training upscaling and filtering models to build a full pipeline; and as you can see, many more modalities than just food! submitted by /u/MasterScrat [link] [comments]  ( 125 min )
    [R] Wq can be omitted in single-head attention
    The proof is simple: attention = softmax(QK^T)V = softmax(XWq(XWk)^T)XWv = softmax(XWqWk^TX^T)XWv. Let Wk' = WkWq^T; then attention = softmax(X(XWk')^T)XWv = softmax(XK'^T)V with K' = XWk'. Now we see that Q = XWq is replaced by X itself, removing 1/4 of the parameters in the attention module. I ran a real experiment and found that with 3/4 of the original attention's parameters, the difference in training loss is 0.01 and does not grow. So Wq is not strictly necessary, though with the extra 1/4 parameters the original seems just slightly better. In multi-head attention, however, Wq is necessary. That said, research has shown that stacking many small single-head attention modules into a very deep model is better than wider multi-head attention (a single head is enough). submitted by /u/wangyi_fudan [link] [comments]  ( 139 min )
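    A minimal PyTorch sketch of the construction above (an illustration, not the poster's code): the key projection absorbs Wq, so the query is simply the raw input X.

    ```python
    import torch
    import torch.nn as nn

    class NoWqAttention(nn.Module):
        # Single-head attention with Wq folded into the key projection
        # (Wk' = Wk Wq^T), so no separate query projection is needed
        def __init__(self, d_model):
            super().__init__()
            self.Wk = nn.Linear(d_model, d_model, bias=False)  # plays the role of Wk'
            self.Wv = nn.Linear(d_model, d_model, bias=False)
            self.scale = d_model ** -0.5

        def forward(self, x):  # x: (batch, seq, d_model)
            k = self.Wk(x)
            v = self.Wv(x)
            attn = torch.softmax(x @ k.transpose(-2, -1) * self.scale, dim=-1)
            return attn @ v
    ```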
    [N] First RTX 4090 ML benchmarks
    Some initial benchmarks can be found here: https://www.pugetsystems.com/labs/hpc/NVIDIA-RTX4090-ML-AI-and-Scientific-Computing-Performance-Preliminary-2382/ To me it looks very disappointing, but unfortunately expected given the memory limitations. submitted by /u/killver [link] [comments]  ( 123 min )
    [D] Docker image deployment in Kubernetes autopilot cluster failing due to insufficient CPU and memory
    I am starting with GKE today, and I have pushed my Docker image into the container registry (screenshot omitted). Then I created a Kubernetes autopilot cluster (with all default settings from the IDE). When I try to deploy this Docker image into the k8s cluster, the deployment fails with insufficient CPU and memory errors (screenshots of the errors and the cluster view omitted). So, how do I scale it up? I tried to increase the quotas, but it says I need to contact customer support to increase quotas, who will take 24-48 hours to respond, and I need this deployment by today. How do I make this deployment work? I just want the simplest possible solution. Cost is not a concern as I have the $300 free credits and I just need this deployment to work and last for 1 day, for a demo. Any help is appreciated. Thanks! submitted by /u/ResearcherNo4728 [link] [comments]  ( 125 min )
    [Project] Startup with a project for someone who wants to learn by doing, such as a student project, or someone self-taught looking for a side project
    Hi, I run a startup with three other founders and I have a problem that machine learning and OCR (optical character recognition) would solve. But we are stretched thin, and it has been a few years since my CTO - a software architect with nearly 40 years of experience - worked on anything similar, so we are unsure if this exercise would pull us away from our core focus too much. That is why we are looking for someone who is interested in taking on this problem as a project. This is a bounded problem, meaning it is very achievable. We just don't know how long it would take. We are willing to offer equity as part of a model called restricted stock units, so if we are commercially successful you would have a signed agreement from us to claim a nominal amount of shares. The Problem This is a s…  ( 128 min )
    [P] Deep Clustering Approach for Unsupervised Video Anomaly Detection
    I'm working on Unsupervised Video Anomaly Detection, and I've tried implementing the Generative Cooperative Learning method, with the help of [this](https://arxiv.org/abs/2203.03962) paper. The method uses a fixed backbone (ResNext-101) for video feature extraction. The videos are divided into segments of 16 frames, and a feature vector is computed for each segment. A generator (a simple autoencoder) provides pseudo labels (based on the reconstruction error) for the discriminator, which is a simple fully connected classifier. Pseudo labels from the discriminator are used to improve the generator using a process called negative learning, and in this fashion, the generator and discriminator are put in a collaborative learning loop, and the loss eventually converges. I've recently come across the [Deep Clustering](https://openaccess.thecvf.com/content_ECCV_2018/papers/Mathilde_Caron_Deep_Clustering_for_ECCV_2018_paper.pdf) paper, and was wondering if we can use a clustering method instead of the autoencoder as part of the generator. I think we can use the cosine distance as a good distance metric. The troublesome part, however, is thinking of a good criterion for generating the pseudo labels. With the autoencoder, the reconstruction error is a pretty intuitive criterion for pseudo labelling. Since anomalies are sparse, the autoencoder will not be able to reconstruct them properly and so they will have large(r) reconstruction errors. What can be a similar criterion for pseudo labelling if we use a clustering method instead of an autoencoder? submitted by /u/esem29 [link] [comments]  ( 126 min )
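    A sketch of the reconstruction-error pseudo-labelling step described above (an illustration, not the paper's code): segments the autoencoder reconstructs poorly get pseudo-labelled as anomalous.

    ```python
    import torch

    def pseudo_labels(autoencoder, features, quantile=0.9):
        # features: (N, D) segment feature vectors from the fixed backbone
        with torch.no_grad():
            recon = autoencoder(features)
            err = ((recon - features) ** 2).mean(dim=1)  # per-segment error
        # Anomalies are sparse, so only the worst-reconstructed tail is flagged
        threshold = torch.quantile(err, quantile)
        return (err > threshold).long()                  # 1 = pseudo-anomalous
    ```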
    [P] Audio Splitting for Neural Network
    I'm trying to create a neural network that can differentiate between different lung conditions. I've recorded breathing sounds through a stethoscope and plan to create a machine learning model accordingly. Do I need to split each breath into a different audio file? Or can a neural network analyze a 30-second clip of a patient breathing? Given the large amount of data, I'm trying to avoid having to split each audio file into smaller pieces. However, I do not want this to affect the model accuracy. Any support would be appreciated! submitted by /u/SSC_08 [link] [comments]  ( 129 min )
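    One common middle ground between per-breath files and feeding full 30-second clips is fixed-length windowing done in code, so the files on disk stay whole. A minimal librosa sketch (the filename and window length are hypothetical):

    ```python
    import librosa

    y, sr = librosa.load("patient_001.wav", sr=16000)  # placeholder path

    win = 5 * sr        # 5-second windows
    hop = win // 2      # 50% overlap between consecutive windows
    segments = [y[i:i + win] for i in range(0, len(y) - win + 1, hop)]

    # e.g. feed a log-mel spectrogram of each window to the classifier
    mels = [librosa.feature.melspectrogram(y=s, sr=sr) for s in segments]
    ```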
    [D] ML model updated in Android
    What are the options to deploy and manage TensorFlow Lite models used in an Android app? Ideally I would like to be able to update the model without updating the app. This is for a shopping app that needs a local model for inference, and the model changes weekly. Thanks submitted by /u/99posse [link] [comments]  ( 125 min )
    [Project] The all-in-one 3D medical image segmentation toolkit. From data annotation to model deployment, MedicalSeg is all you need!
    Hello, everyone! We have created an open-source all-in-one 3D medical image segmentation toolkit called MedicalSeg. MedicalSeg supports the whole segmentation process, including data labeling, data preprocessing, model training, and model deployment. Major features include: data preprocessing with 30% acceleration using CuPy; high-precision pre-trained models on 5 different organs; high-precision models including nnUnet, TransUnet, UNETR, Vnet, and more models coming soon; a 3D visualization demo based on itkwidgets; and an AI-assisted 3D medical image annotation platform called EISeg-Med3D. With the 3D segmentation model incorporated into the interactive segmentation algorithm, we managed to improve annotation efficiency tenfold through AI-assisted click interaction! Combined with the machine learning algorithms and the manual annotation toolkit, 100% accuracy is within reach. On top of that, it is user-friendly, and your annotation results and progress are saved automatically. (The post included images of lung and spine segmentation results predicted by MedicalSeg, and of the EISeg-Med3D labeling process.) EISeg-Med3D: https://github.com/PaddlePaddle/PaddleSeg/blob/develop/EISeg/med3d/README_en.md MedicalSeg: https://github.com/PaddlePaddle/PaddleSeg/blob/develop/contrib/MedicalSeg/README.md submitted by /u/Daisy_SUGARFREE [link] [comments]  ( 157 min )
    [P] Nested Cross Validation Library
    Two years ago I created a library for performing Nested Cross Validation for hyperparameter tuning and model evaluation on classification problems. It's built on top of `sklearn`, `imblearn` and `skopt`. It is specially designed for binary or multiclass classification problems where either there is an imbalance problem or probability calibration is very important (or both). Today I've released a totally new version with better code, documentation and examples that I hope will make it easier to use for practitioners. https://github.com/JaimeArboleda/nestedcvtraining submitted by /u/fripperML [link] [comments]  ( 125 min )
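    For readers unfamiliar with the pattern, here is a minimal plain-scikit-learn sketch of nested cross-validation (a generic illustration, not this library's API): the inner loop tunes hyperparameters, and the outer loop estimates how well the whole tuning procedure generalizes.

    ```python
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV, cross_val_score

    X, y = load_breast_cancer(return_X_y=True)

    # Inner loop: hyperparameter search, scored by 3-fold CV
    inner = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [None, 5]},
        cv=3, scoring="roc_auc",
    )

    # Outer loop: 5-fold CV wrapped around the entire search procedure
    outer_scores = cross_val_score(inner, X, y, cv=5, scoring="roc_auc")
    print(outer_scores.mean(), outer_scores.std())
    ```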
    [R] Neural Networks are Decision Trees
    submitted by /u/MLC_Money [link] [comments]  ( 133 min )
    [D] Are GAN(s) still relevant as a research topic? or is there any idea regarding research on generative modeling?
    Hi, I am a master's student working on GANs for speech enhancement. I must say I learned a lot from this topic; I had to restudy probability to get an understanding of generative models. I am just curious whether generative models such as GANs are still a good topic for a Ph.D., since I have recently been getting exposed to newer models such as diffusion models. BTW, I am also interested in the information bottleneck in deep learning. Any suggestion would be helpful :) thanks submitted by /u/aozorahime [link] [comments]  ( 134 min )
  • Open

    State of AI Report 2022 - analyses the most interesting developments in AI
    submitted by /u/magenta_placenta [link] [comments]  ( 108 min )
    Is there an AI that can help finish up a novel I began to write?
    submitted by /u/Paradise5551 [link] [comments]  ( 109 min )
    I am trying to build a detection bot for AI-generated images
    I made a website for a PoC, but currently it's not really working well, especially for images that have human-like brush styles: https://www.illuminarty.ai/ This started off as a pure hobby, but I started getting a bit of traction from the art communities (mainly anime), and now I am seriously considering whether I should start spending more time on this, but I lack brain cells. Is anyone here interested in this topic? Do you think this would be of any use? submitted by /u/borrito3179 [link] [comments]  ( 109 min )
    AI systems that DON'T utilize a model?
    Hey everyone, I've been getting into AI and ML for my job and people are often bringing up AI/ML v. Model Risk Management, and I'm a bit confused. I'm just trying to understand situations when an AI or ML system WOULDN'T utilize a model. And honestly, what a model really is. As I understand it, training data is fed to a system with an algorithm and the system utilizes that to create a model. Then new data is analyzed against that model and decisions are made. Can someone help me understand a little more? And maybe some examples of an AI system that wouldn't have a model? Thanks so much! submitted by /u/MarbledCoffeecake [link] [comments]  ( 124 min )
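    The poster's description is essentially right; in scikit-learn terms it is literally the fit/predict split (a minimal illustration):

    ```python
    from sklearn.tree import DecisionTreeClassifier

    # Training data fed to an algorithm...
    X_train, y_train = [[0], [1], [2], [3]], [0, 0, 1, 1]
    model = DecisionTreeClassifier().fit(X_train, y_train)  # ...produces a model

    # New data is then analyzed against that model
    print(model.predict([[1.5]]))
    ```

    Classic expert systems and hand-written rule engines are the usual examples of AI without a learned model.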
    What Does The Best Possible Day Look Like to an AI? (47 seconds)
    submitted by /u/AIrunstheshow [link] [comments]  ( 110 min )
    Is this subreddit artificial?
    /r/solcial The comments and posts look like they were written by either bots or humans who were paid to be there. submitted by /u/solidwhetstone [link] [comments]  ( 113 min )
    Master of Arabic
    submitted by /u/widgia [link] [comments]  ( 108 min )
    HYPERNETWORK Tutorial: Train Stable Diffusion For Free!
    submitted by /u/PuppetHere [link] [comments]  ( 111 min )
    I'm writing about AI daily
    I've started writing (since Monday) a daily email (Mon-Fri) on all the things going on in AI. I'm just in a deep, deep rabbit hole and want to capture it in one place. It includes links to products, demos, videos, papers, images - basically all of it. I'm trying to capture the biggest announcements every day as this space is moving so fast! I need to tweak my emails a bit and make them shorter, but I'm enjoying the process. If you're interested in checking it out, leave a comment and I'll drop the link! Today's email mentions, amongst other things: Runway ML, Microsoft and Hugging Face updates; an AI voice studio; a YouTube transcription tool; article2image; Sarah Guo's $100m fund; SD create/share sites. submitted by /u/bentossell [link] [comments]  ( 112 min )
    AI Dream 87 - Dreamfusion smooth Testflight by AI
    submitted by /u/LordPewPew777 [link] [comments]  ( 107 min )
    Alexa, you can do better: real-time robot control via interactive language
    submitted by /u/much_successes [link] [comments]  ( 111 min )
    Standards playing an increasingly important role in AI - UK has launched a 'standards hub' including to minimise risk of bias including engagement from civil society groups.
    submitted by /u/upyourego [link] [comments]  ( 109 min )
    I used the inferkit ai to make a story about a fire. My prompt ended at horrible realization, the rest is ai continuing the story.
    It was midnight. I woke up to the fire alarm going off. I rushed downstairs, and found out that the kitchen was on fire. I had left on the stove for too long. I quickly called 911, but then, I made a horrible realization. I could see smoke in the bathroom. That’s when I knew I couldn’t call 911. I dashed into the bathroom, and extinguished the fire with the curtain. I began to cry, and I could feel my heart beating very fast. My heart was pounding like crazy. I stood there for a moment, staring at the fire. But then I saw something. A small red blob on the wall. It was coming from the vent. I opened it, and a rush of smoke came out. I tried to put out the fire. At first, I was able to, but soon, it was on me. The smoke was getting more intense, and I began to cough. I had no way to get out. I could feel my throat burning. The fire was spreading fast, and it was getting harder to breathe. I kept coughing, but the smoke wouldn’t go out. The only way out was through the vent. I sucked in as much as I could, and then let the vent back out. The only thing I was thinking about was my mother. I didn’t want her to worry. Finally, the smoke went out, and I could breathe again. I went back to the kitchen, and I looked at the damage. The cabinets were burned, and the stove was melted. I just sat there for a moment, thinking about what to do next. When I realized that it wasn’t going to fix itself, I grabbed a flashlight, and a bag. I headed out the back door, and took a chance that I wouldn’t get lost. I walked for hours. At some point, I heard cars passing. I knew I had to walk far enough that they wouldn’t find me. I remember passing a turn off, and I knew I had to turn around. When I got home, I headed straight to my mother. She asked me where I had been, and why I wasn’t in my room. I told her the truth. submitted by /u/Ordinary-Pool6007 [link] [comments]  ( 112 min )
    AI research
    Hello, everyone! Our team is working on developing a very empathetic and romanceable AI bot, and to make it a real semblance of a human being, good information is essential. Since the Replika app is probably one of the best embodiments of that notion, we are looking for respondents who have been using it (or any analogous app) regularly, are at least level 30, and would kindly participate in a 30-minute interview. To show appreciation for your contribution, we are giving out a $25 Amazon gift card. Book an appointment here: https://calendly.com/atsai5/30-minute-interview-2 This is not only for those who have romantic relationships with Replika; we also hope to find out what relationships people want to have, and will set up those options in the future! Also, we are going to launch our product in the US market in the future, so we are conducting research on users in Western and European countries. This is very important if we want to get accurate results, as each country is unique and has its own specific traits. Sorry for the inconvenience T^T! Looking forward to seeing you! Thanks in advance! (ps.: if the time schedule on the Calendly page is not convenient for you, feel free to send me a message and we will set up the meeting) submitted by /u/juggo_sipeo [link] [comments]  ( 109 min )
    How open are the “open-domain” chatbots and do we really need them?
    submitted by /u/bendee983 [link] [comments]  ( 110 min )
    How Conversational AI is revolutionizing CX across sectors in APAC
    If you're looking for some of the interesting AI chatbot use cases with examples, it's a quick read: https://www.haptik.ai/blog/conversational-ai-in-apac?utm_source=website&utm_medium=blog&utm_campaign=conversationalAI&utm_content=read+more submitted by /u/Haptik-AI [link] [comments]  ( 114 min )
    Why Solving For Efficiency of Matrix Multiplication Such A Big Deal In Computing
    You must have come across matrix multiplication in school textbooks. But did you know how relevant it is in every aspect of our daily lives, from processing images on our phones and recognising speech commands to generating graphics for computer games? https://analyticsindiamag.com/why-solving-for-efficiency-of-matrix-multiplication-such-a-big-deal-in-computing/ submitted by /u/analyticsindiam [link] [comments]  ( 119 min )
    neural net's aren't enough for achieving AGI (an opinion)
    submitted by /u/rdhikshith [link] [comments]  ( 111 min )
    Simulated Info-Warfare with GPT3
    submitted by /u/walt74 [link] [comments]  ( 112 min )
    Best Artificial Intelligence books for beginners to expert to read in 2022
    submitted by /u/Lakshmireddys [link] [comments]  ( 107 min )
    Research Project Feedback
    I'm looking for a few college students that would be willing to partake in a 20-30 minute interview discussing their thoughts/reaction to an AI project I'm involved with. submitted by /u/Accomplished_Head5 [link] [comments]  ( 108 min )
    Jasper.AI Honest Review
    Overall, the Jasper AI product is a great investment for those who want to save time and money by outsourcing their content. This service will help you produce high-quality copy with few mistakes, all while reducing your risk of making errors when writing. Jasper AI offers over 50+ content templates for you to choose from which makes it the perfect solution for any blogger or business owner that needs help with writing SEO-friendly blog posts, social media copywriting, ad campaigns, email subject lines, and more. The AI-generated output produced by Jasper is 99.99% original content that is free of plagiarism. This will allow you to generate high-quality content that’s also SEO-optimized in a fraction of the time you normally would. The real power lies in Boss Mode which unlocks the long-form assistant allowing you to write full blog posts, marketing emails, or even entire books! Get 10,000 words free with jasper >>HERE<< submitted by /u/Regillio14 [link] [comments]  ( 108 min )
    Ethics and Safety around the use of data collection for Machine Learning Purposes.
    Hey Everyone, I wanted to ask around to see if anyone had any examples regarding misuse of user information for a machine learning algorithm? I have done a little bit of research and I can't find any major cases of happening. I am currently doing research on the ethics and safety considerations of data collection in the context of machine learning. Any examples or help in general would be greatly appreciated! Thanks for taking the time to read my post. submitted by /u/ThunderMoofin [link] [comments]  ( 108 min )
    (LOUD AUDIO) I was using the Uberduck Voice AI to read a story and it started to make legit demon sounds.
    submitted by /u/PureBrew [link] [comments]  ( 107 min )
  • Open

    Build a solution for a computer vision skin lesion classifier using Amazon SageMaker Pipelines
    Amazon SageMaker Pipelines is a continuous integration and continuous delivery (CI/CD) service designed for machine learning (ML) use cases. You can use it to create, automate, and manage end-to-end ML workflows. It tackles the challenge of orchestrating each step of an ML process, which requires time, effort, and resources. To facilitate its use, multiple templates […]  ( 9 min )
    How Amazon Search runs large-scale, resilient machine learning projects with Amazon SageMaker
    If you have searched for an item to buy on amazon.com, you have used Amazon Search services. At Amazon Search, we’re responsible for the search and discovery experience for our customers worldwide. In the background, we index our worldwide catalog of products, deploy highly scalable AWS fleets, and use advanced machine learning (ML) to match […]  ( 6 min )
  • Open

    Planetary orbits are very nearly circular
    If a science book shows you obviously elliptical orbits of planets, it is literally stretching the truth. I was taught that our benighted ancestors insisted that planetary orbits are circles for philosophical reasons. In fact, they insisted planetary orbits are circular because they very nearly are. Here’s a plot of the orbits of all nine […] Planetary orbits are very nearly circular first appeared on John D. Cook.  ( 4 min )
    Saving money on big queries
    I was surprised the first time a client told me that a query would cost them $100,000 to run. If you think about querying a database on your laptop, a long query would take a minute, and what’s the cost of a minute’s worth of electricity? Too small to meter. But some of my clients […] Saving money on big queries first appeared on John D. Cook.  ( 5 min )
    The Pluto-Charon orbit
    The Moon doesn’t orbit the center of the Earth; it orbits the center of mass of the Earth-Moon system, which is inside the Earth. The distinction matters for designing satellite orbits, but it cannot be seen on a plot to scale. We’ll quantify this below. Pluto’s moon Charon, however, is so large relative to Pluto […] The Pluto-Charon orbit first appeared on John D. Cook.  ( 5 min )
    Shape of moon orbit around sun
    The earth’s orbit around the sun is nearly a circle, and the moon’s orbit around the earth is nearly a circle, but what is the shape of the moon’s orbit around the sun? You might expect it to be bumpy, bending inward when the moon is between the earth and the sun and bending outward […] Shape of moon orbit around sun first appeared on John D. Cook.  ( 5 min )
  • Open

    Top 10 Future Data Analytics Trends in 2023
    In an era where the landscape of business is evolving and changing at a rapid pace, data collection & analysis often operate as pivotal factors in shaping the destiny of each new market segment, whether it be the healthcare sector, decentralized work, an online company like Amazon, an online customer service network, or even an… Read More »Top 10 Future Data Analytics Trends in 2023 The post Top 10 Future Data Analytics Trends in 2023 appeared first on Data Science Central.  ( 24 min )
    DSC Webinar Series – Best Practices for Adopting Containers within your MLOps Process.mp4
    With the release of SAS Container Runtime (SCR), organizations can execute models and decisions outside of SAS using standard technologies. Containerized deployments are lightweight to save on cloud costs, portable to enable easy movement across environments, and scalable to meet traffic needs. In this talk, we will discuss how IT and MLOps Engineering teams can… Read More »DSC Webinar Series – Best Practices for Adopting Containers within your MLOps Process.mp4 The post DSC Webinar Series – Best Practices for Adopting Containers within your MLOps Process.mp4 appeared first on Data Science Central.  ( 17 min )
  • Open

    Hello, World: NIO Expands Global Footprint With Intelligent Vehicle Experiences
    When it comes to reimagining the next generation of automotive, NIO is thinking outside the car. This month, the China-based electric vehicle maker introduced its lineup to four new countries in Europe — Denmark, Germany, the Netherlands and Sweden — along with an innovative subscription-based ownership model. The countries join NIO’s customer base in China Read article > The post Hello, World: NIO Expands Global Footprint With Intelligent Vehicle Experiences appeared first on NVIDIA Blog.  ( 5 min )
    Learn How NVIDIA Advances AI for Enterprises, at Oracle CloudWorld
    NVIDIA and Oracle are teaming to make the power of AI accessible to enterprises across industries. These include healthcare, financial services, automotive and a broad range of natural language processing use cases driven by large language models, such as chatbots, personal assistants, document summarization and article completion. Join NVIDIA and Oracle experts at Oracle CloudWorld, Read article > The post Learn How NVIDIA Advances AI for Enterprises, at Oracle CloudWorld appeared first on NVIDIA Blog.  ( 6 min )
    Press Art to Continue: New AI Tools Promise Art With the Push of a Button — But Reality Is More Complicated
    Alien invasions. Gritty dystopian megacities. Battlefields swarming with superheroes. As one of Hollywood’s top concept artists, Drew Leung can visualize any world you can think of, except one where AI takes his job. He would know. He’s spent the past few months trying to make it happen, testing every AI tool he could. “If your Read article > The post Press Art to Continue: New AI Tools Promise Art With the Push of a Button — But Reality Is More Complicated appeared first on NVIDIA Blog.  ( 7 min )
    GeForce NOW Streams High-Res, 120-FPS PC Gaming to World’s First Cloud Gaming Chromebooks
    High-end PC gaming arrives on more devices this GFN Thursday. GeForce NOW RTX 3080 members can now stream their favorite PC games at up to 1600p and 120 frames per second in a Chrome browser. No downloads, no installs, just victory. Even better, NVIDIA has worked with Google to support the newest Chromebooks, which are Read article > The post GeForce NOW Streams High-Res, 120-FPS PC Gaming to World’s First Cloud Gaming Chromebooks appeared first on NVIDIA Blog.  ( 6 min )
  • Open

    Crossmodal-3600 — Multilingual Reference Captions for Geographically Diverse Images
    Posted by Ashish Thapliyal, Software Engineer, and Jordi Pont-Tuset, Research Scientist, Google Research Image captioning is the machine learning task of automatically generating a fluent natural language description for a given image. This task is important for improving accessibility for visually impaired users and is a core task in multimodal research encompassing both vision and language modeling. However, datasets for image captioning are primarily available in English. Beyond that, there are only a few datasets covering a limited number of languages that represent just a small fraction of the world’s population. Further, these datasets feature images that severely under-represent the richness and diversity of cultures from across the globe. These aspects have hindered research on …  ( 29 min )
  • Open

    Best algorithm to generate a new (fixed) camera angle from a set of other (fixed) camera angles.
    I'm trying to reconstruct a scene (either in 3D like NeRF, or as a generated bird's-eye-view image). I can train on as many different angles as I want, but the inference has to be done with only a few fixed camera angles. So I think I need an algorithm that leverages the relation between the fixed camera angles. So for example, a few simple training inputs and the ground truth (target) with only 1 angle (example image omitted). So far I tried using a simple cGAN and a few variations with mild success, but I lack the coding skills and knowledge to optimize those networks or even pick the right hyperparameters. Right now it starts off well but then starts learning weird image artifacts and just doesn't improve anymore. I'm new to machine learning, and I'm trying to write a paper on this for school. This isn't commercial and doesn't have to be perfect, just a learning experience. If anyone can help with this problem, please do. Bonus: if anyone is interested in giving me a few tips on how to fine-tune (honestly, just any optimization advice for) my models, please tell me. submitted by /u/alexho66 [link] [comments]  ( 112 min )
    LSTMs vs. Transformers
    Could someone please point me to a couple of good resources on LSTMs vs. Transformers? There are lots of short articles that discuss this without any references, and at the moment I do not have enough time to read a few academic textbooks, so I wonder if there are a couple of reliable resources for a short review of these two architectures. submitted by /u/DinaMosharraf [link] [comments]  ( 108 min )
  • Open

    Value vs Policy as it pertains to non-stationary data
    Why are policy-based networks more effective than values-based networks when it comes to non-stationary data? submitted by /u/clarky103 [link] [comments]  ( 121 min )
    Can GNNs be used as feature extractors in Stablebaselines 3?
    Hello guys. I have a graph problem which I would like to solve with reinforcement learning. The input size of the graph varies, so I think it makes sense to use GNNs here. I couldn't find any implementations of GNNs as feature extractors. Are there any? If not, do you think it is possible to do in Stable-Baselines3 within a few hours? I consider myself not too bad at Python. submitted by /u/magnusvegeta [link] [comments]  ( 121 min )
  • Open

    Counterfactual harm. (arXiv:2204.12993v4 [cs.AI] UPDATED)
    To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. However, to date there is no statistical method for measuring harm and factoring it into algorithmic decisions. In this paper we propose the first formal definition of harm and benefit using causal models. We show that any factual definition of harm must violate basic intuitions in certain scenarios, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies following distributional shifts. We use our definition of harm to devise a framework for harm-averse decision making using counterfactual objective functions. We demonstrate this framework on the problem of identifying optimal drug doses using a dose-response model learned from randomized control trial data. We find that the standard method of selecting doses using treatment effects results in unnecessarily harmful doses, while our counterfactual approach allows us to identify doses that are significantly less harmful without sacrificing efficacy.  ( 2 min )
    An Analytical Theory of Curriculum Learning in Teacher-Student Networks. (arXiv:2106.08068v2 [cs.LG] UPDATED)
    In humans and animals, curriculum learning -- presenting data in a curated order -- is critical to rapid learning and effective pedagogy. Yet in machine learning, curricula are not widely used and empirically often yield only moderate benefits. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. Curricula could in principle change both the learning speed and asymptotic performance of a model. To study the former, we provide an exact description of the online learning setting, confirming the long-standing experimental observation that curricula can modestly speed up learning. To study the latter, we derive performance in a batch learning setting, in which a network trains to convergence in successive phases of learning on dataset slices of varying difficulty. With standard training losses, curriculum does not provide generalisation benefit, in line with empirical observations. However, we show that by connecting different learning phases through simple Gaussian priors, curriculum can yield a large improvement in test performance. Taken together, our reduced analytical descriptions help reconcile apparently conflicting empirical results and trace regimes where curriculum learning yields the largest gains. More broadly, our results suggest that fully exploiting a curriculum may require explicit changes to the loss function at curriculum boundaries.  ( 3 min )
    Multimodal Multi-Head Convolutional Attention with Various Kernel Sizes for Medical Image Super-Resolution. (arXiv:2204.04218v3 [eess.IV] UPDATED)
    Super-resolving medical images can help physicians in providing more accurate diagnostics. In many situations, computed tomography (CT) or magnetic resonance imaging (MRI) techniques capture several scans (modes) during a single investigation, which can jointly be used (in a multimodal fashion) to further boost the quality of super-resolution results. To this end, we propose a novel multimodal multi-head convolutional attention module to super-resolve CT and MRI scans. Our attention module uses the convolution operation to perform joint spatial-channel attention on multiple concatenated input tensors, where the kernel (receptive field) size controls the reduction rate of the spatial attention and the number of convolutional filters controls the reduction rate of the channel attention. We introduce multiple attention heads, each head having a distinct receptive field size corresponding to a particular reduction rate for the spatial attention. We integrate our multimodal multi-head convolutional attention (MMHCA) into two deep neural architectures for super-resolution and conduct experiments on three data sets. Our empirical results show the superiority of our attention module over the state-of-the-art attention mechanisms used in super-resolution. Moreover, we conduct an ablation study to assess the impact of the components involved in our attention module, e.g. the number of inputs or the number of heads. Our code is freely available at https://github.com/lilygeorgescu/MHCA.  ( 3 min )
    Predicting housing prices and analyzing real estate market in the Chicago suburbs using Machine Learning. (arXiv:2210.06261v1 [cs.LG])
    The pricing of housing properties is determined by a variety of factors. However, post-pandemic markets in the Chicago suburb area have experienced volatility, which has affected house prices greatly. In this study, analysis was done on the Naperville/Bolingbrook real estate market to predict property prices from housing attributes through machine learning models, and to evaluate the effectiveness of such models in a volatile market. Gathering data from Redfin, a real estate website, sales data from 2018 up until the summer season of 2022 were collected for research. By analyzing sales in this range of time, we can also look at the state of the housing market and identify trends in price. For modeling the data, the models used were linear regression, support vector regression, decision tree regression, random forest regression, and XGBoost regression. To analyze results, comparison was made on the MAE, RMSE, and R-squared values for each model. It was found that the XGBoost model performs the best in predicting house prices despite the additional volatility introduced by post-pandemic conditions. After modeling, Shapley values (SHAP) were used to evaluate the importance of the variables in the models.  ( 2 min )
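    A generic sketch of the evaluation loop the abstract describes (not the authors' code; the data here is a random placeholder): fit several regressors and compare MAE, RMSE and R-squared on a held-out split.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.linear_model import LinearRegression
    from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X, y = rng.random((500, 8)), rng.random(500) * 500_000  # placeholder data

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for model in [LinearRegression(), RandomForestRegressor(random_state=0)]:
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(type(model).__name__,
              mean_absolute_error(y_te, pred),           # MAE
              np.sqrt(mean_squared_error(y_te, pred)),   # RMSE
              r2_score(y_te, pred))                      # R-squared
    ```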
    Variational Open-Domain Question Answering. (arXiv:2210.06345v1 [cs.CL])
    We introduce the Variational Open-Domain (VOD) framework for end-to-end training and evaluation of retrieval-augmented models (open-domain question answering and language modelling). We show that the R\'enyi variational bound, a lower bound to the task marginal likelihood, can be exploited to aid optimization and use importance sampling to estimate the task log-likelihood lower bound and its gradients using samples drawn from an auxiliary retriever (approximate posterior). The framework can be used to train modern retrieval-augmented systems end-to-end using tractable and consistent estimates of the R\'enyi variational bound and its gradients. We demonstrate the framework's versatility by training reader-retriever BERT-based models on multiple-choice medical exam questions (MedMCQA and USMLE). We registered a new state-of-the-art for both datasets (MedMCQA: $62.9$\%, USMLE: $55.0$\%). Last, we show that the retriever part of the learned reader-retriever model trained on the medical board exam questions can be used in search engines for a medical knowledge base.  ( 2 min )
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v3 [stat.ME] UPDATED)
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish causal relationship between platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of direct effect (DE) and indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatiotemporal dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.  ( 3 min )
    MariusGNN: Resource-Efficient Out-of-Core Training of Graph Neural Networks. (arXiv:2202.02365v2 [cs.LG] UPDATED)
    We study training of Graph Neural Networks (GNNs) for large-scale graphs. We revisit the premise of using distributed training for billion-scale graphs and show that for graphs that fit in main memory or the SSD of a single machine, out-of-core pipelined training with a single GPU can outperform state-of-the-art (SoTA) multi-GPU solutions. We introduce MariusGNN, the first system that utilizes the entire storage hierarchy -- including disk -- for GNN training. MariusGNN introduces a series of data organization and algorithmic contributions that 1) minimize the end-to-end time required for training and 2) ensure that models learned with disk-based training exhibit accuracy similar to those fully trained in memory. We evaluate MariusGNN against SoTA systems for learning GNN models and find that single-GPU training in MariusGNN achieves the same level of accuracy up to 8x faster than multi-GPU training in these systems, thus, introducing an order of magnitude monetary cost reduction. MariusGNN is open-sourced at www.marius-project.org.
    Graph Neural Networks for Channel Decoding. (arXiv:2207.14742v2 [cs.IT] UPDATED)
    In this work, we propose a fully differentiable graph neural network (GNN)-based architecture for channel decoding and showcase a competitive decoding performance for various coding schemes, such as low-density parity-check (LDPC) and BCH codes. The idea is to let a neural network (NN) learn a generalized message passing algorithm over a given graph that represents the forward error correction (FEC) code structure by replacing node and edge message updates with trainable functions. Contrary to many other deep learning-based decoding approaches, the proposed solution enjoys scalability to arbitrary block lengths and the training is not limited by the curse of dimensionality. We benchmark our proposed decoder against state-of-the-art in conventional channel decoding as well as against recent deep learning-based results. For the (63,45) BCH code, our solution outperforms weighted belief propagation (BP) decoding by approximately 0.4 dB with significantly less decoding iterations and even for 5G NR LDPC codes, we observe a competitive performance when compared to conventional BP decoding. For the BCH codes, the resulting GNN decoder can be fully parametrized with only 9640 weights.
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v3 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
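    The estimator being analyzed corresponds to everyday practice: a shallow ReLU network trained on squared error with weight decay, i.e. the squared-Euclidean-norm penalty on the weights. A minimal PyTorch sketch on synthetic 1D data (note PyTorch's weight_decay also penalizes biases, a small deviation from the weights-only penalty in the abstract):

    ```python
    import torch
    import torch.nn as nn

    # Shallow (one hidden layer) ReLU network
    net = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))

    # weight_decay adds the L2 penalty on the parameters to the squared loss
    opt = torch.optim.SGD(net.parameters(), lr=1e-2, weight_decay=1e-3)

    x = torch.linspace(-1, 1, 128).unsqueeze(1)
    y = torch.sin(3 * x) + 0.1 * torch.randn_like(x)  # noisy synthetic data

    for _ in range(1000):
        opt.zero_grad()
        loss = ((net(x) - y) ** 2).mean()
        loss.backward()
        opt.step()
    ```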
    Trajectory balance: Improved credit assignment in GFlowNets. (arXiv:2201.13259v2 [cs.LG] UPDATED)
    Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.
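    For reference, the trajectory balance objective has a one-line form: for a trajectory ending in object x, the loss is (log Z + sum_t log P_F(s_{t+1}|s_t) - log R(x) - sum_t log P_B(s_t|s_{t+1}))^2, with a learned scalar log Z. A minimal sketch of that loss as we read it from the abstract:

    ```python
    import torch

    def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
        # log_Z:      learned scalar, log-partition-function estimate
        # log_pf:     (T,) forward log-probs log P_F(s_{t+1} | s_t) along one trajectory
        # log_pb:     (T,) backward log-probs log P_B(s_t | s_{t+1})
        # log_reward: scalar log R(x) of the terminal object x
        return (log_Z + log_pf.sum() - log_reward - log_pb.sum()) ** 2
    ```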
    Neural Approximation of Graph Topological Features. (arXiv:2201.12032v3 [cs.LG] UPDATED)
    Topological features based on persistent homology capture high-order structural information so as to augment graph neural network methods. However, computing extended persistent homology summaries remains slow for large and dense graphs and can be a serious bottleneck for the learning pipeline. Inspired by recent success in neural algorithmic reasoning, we propose a novel graph neural network to estimate extended persistence diagrams (EPDs) on graphs efficiently. Our model is built on algorithmic insights, and benefits from better supervision and closer alignment with the EPD computation algorithm. We validate our method with convincing empirical results on approximating EPDs and downstream graph representation learning tasks. Our method is also efficient; on large and dense graphs, we accelerate the computation by nearly 100 times.  ( 2 min )
    Empirical Gateaux Derivatives for Causal Inference. (arXiv:2208.13701v3 [stat.ME] UPDATED)
    We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite-differencing, with a focus on causal inference functionals. We consider the case where probability distributions are not known a priori but also need to be estimated from data. These estimated distributions lead to empirical Gateaux derivatives, and we study the relationships between empirical, numerical, and analytical Gateaux derivatives. Starting with a case study of estimating the mean potential outcome (hence average treatment effect), we instantiate the exact relationship between finite-differences and the analytical Gateaux derivative. We then derive requirements on the rates of numerical approximation in perturbation and smoothing that preserve the statistical benefits of one-step adjustments, such as rate-double-robustness. We then study more complicated functionals such as dynamic treatment regimes and the linear-programming formulation for policy optimization in infinite-horizon Markov decision processes. The newfound ability to approximate bias adjustments in the presence of arbitrary constraints illustrates the usefulness of constructive approaches for Gateaux derivatives. We also find that the statistical structure of the functional (rate-double robustness) can permit less conservative rates of finite-difference approximation. This property, however, can be specific to particular functionals, e.g. it occurs for the mean potential outcome (hence average treatment effect) but not the infinite-horizon MDP policy value.
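    The simplest instance from the abstract can be checked numerically: for the mean functional, the Gateaux derivative at the empirical distribution in the direction of a point mass at x is exactly x - mean(P), and finite-differencing recovers it. A small numpy sketch:

    ```python
    import numpy as np

    def gateaux_fd(T, sample, x, eps=1e-4):
        # Finite-difference Gateaux derivative of a plug-in functional T at the
        # empirical distribution P_n, in the direction (delta_x - P_n):
        # [T((1 - eps) P_n + eps delta_x) - T(P_n)] / eps
        n = len(sample)
        points = np.append(sample, x)
        weights = np.full(n + 1, (1 - eps) / n)
        weights[-1] = eps                      # mass eps on the point x
        base = np.full(n, 1 / n)
        return (T(points, weights) - T(sample, base)) / eps

    def mean_functional(points, weights):
        return np.dot(weights, points)

    sample = np.random.default_rng(0).normal(size=100)
    print(gateaux_fd(mean_functional, sample, x=2.0))  # finite difference
    print(2.0 - sample.mean())                         # analytical derivative
    ```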
    Open-Ended Reinforcement Learning with Neural Reward Functions. (arXiv:2202.08266v2 [cs.LG] UPDATED)
    Inspired by the great success of unsupervised learning in Computer Vision and Natural Language Processing, the Reinforcement Learning community has recently started to focus more on unsupervised discovery of skills. Most current approaches, like DIAYN or DADS, optimize some form of mutual information objective. We propose a different approach that uses reward functions encoded by neural networks. These are trained iteratively to reward more complex behavior. In high-dimensional robotic environments our approach learns a wide range of interesting skills including front-flips for Half-Cheetah and one-legged running for Humanoid. In the pixel-based Montezuma's Revenge environment our method also works with minimal changes and it learns complex skills that involve interacting with items and visiting diverse locations. The implementation of our approach can be found in this link: https://github.com/amujika/Open-Ended-Reinforcement-Learning-with-Neural-Reward-Functions.
    Deep Surrogate Assisted Generation of Environments. (arXiv:2206.04199v3 [cs.AI] UPDATED)
    Recent progress in reinforcement learning (RL) has started producing generally capable agents that can solve a distribution of complex environments. These agents are typically tested on fixed, human-authored environments. On the other hand, quality diversity (QD) optimization has been proven to be an effective component of environment generation algorithms, which can generate collections of high-quality environments that are diverse in the resulting agent behaviors. However, these algorithms require potentially expensive simulations of agents on newly generated environments. We propose Deep Surrogate Assisted Generation of Environments (DSAGE), a sample-efficient QD environment generation algorithm that maintains a deep surrogate model for predicting agent behaviors in new environments. Results in two benchmark domains show that DSAGE significantly outperforms existing QD environment generation algorithms in discovering collections of environments that elicit diverse behaviors of a state-of-the-art RL agent and a planning agent. Our source code and videos are available at https://dsagepaper.github.io/.
    Alleviating Adversarial Attacks on Variational Autoencoders with MCMC. (arXiv:2203.09940v2 [cs.LG] UPDATED)
    Variational autoencoders (VAEs) are latent variable models that can generate complex objects and provide meaningful latent representations. Moreover, they could be further used in downstream tasks such as classification. As previous work has shown, one can easily fool VAEs to produce unexpected latent representations and reconstructions for a visually slightly modified input. Here, we examine several objective functions for adversarial attack construction proposed previously and present a solution to alleviate the effect of these attacks. Our method utilizes the Markov Chain Monte Carlo (MCMC) technique in the inference step that we motivate with a theoretical analysis. Thus, we do not incorporate any extra costs during training, and the performance on non-attacked inputs is not decreased. We validate our approach on a variety of datasets (MNIST, Fashion MNIST, Color MNIST, CelebA) and VAE configurations ($\beta$-VAE, NVAE, $\beta$-TCVAE), and show that our approach consistently improves the model robustness to adversarial attacks.
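    As a rough illustration of the inference-time idea, the sketch below refines an encoder-proposed latent with unadjusted Langevin dynamics targeting $p(z\mid x)\propto p(x\mid z)p(z)$; the helper name and the choice of Langevin as the MCMC kernel are our illustrative assumptions, not necessarily the paper's exact procedure.

    ```python
    import torch

    def langevin_refine(decoder, x, z0, n_steps=50, step=1e-2):
        """Refine an encoder-proposed latent z0 by MCMC on p(z|x), here via
        unadjusted Langevin dynamics. `decoder(z)` is assumed to return
        Bernoulli logits for binarized data x; the prior is standard normal."""
        z = z0.clone().requires_grad_(True)
        for _ in range(n_steps):
            logits = decoder(z)
            log_lik = -torch.nn.functional.binary_cross_entropy_with_logits(
                logits, x, reduction="sum")       # log p(x|z)
            log_prior = -0.5 * (z ** 2).sum()     # log p(z)
            grad, = torch.autograd.grad(log_lik + log_prior, z)
            with torch.no_grad():
                # Langevin step: drift up the posterior gradient plus noise.
                z = z + 0.5 * step * grad + step ** 0.5 * torch.randn_like(z)
            z.requires_grad_(True)
        return z.detach()
    ```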
    Cross Task Neural Architecture Search for EEG Signal Classifications. (arXiv:2210.06298v1 [eess.SP])
    Electroencephalograms (EEGs) are brain dynamics measured outside the brain, which have been widely utilized in non-invasive brain-computer interface applications. Recently, various neural network approaches have been proposed to improve the accuracy of EEG signal recognition. However, these approaches rely heavily on manually designed network structures that differ from task to task and rarely share a common design across tasks. In this paper, we propose a cross-task neural architecture search (CTNAS-EEG) framework for EEG signal recognition, which can automatically design the network structure across tasks and improve the recognition accuracy of EEG signals. Specifically, a compatible search space for cross-task searching and an efficient constrained searching method are proposed to overcome challenges brought by EEG signals. By unifying structure search on different EEG tasks, this work is the first to explore and analyze how the searched structures differ across tasks. Moreover, by introducing architecture search, this work is the first to analyze model performance by customizing the model structure for each human subject. Detailed experimental results suggest that the proposed CTNAS-EEG can reach state-of-the-art performance on different EEG tasks, such as Motor Imagery (MI) and emotion recognition. Extensive experiments and detailed analysis are provided as a good reference for follow-up researchers.
    Learning to Accelerate Partial Differential Equations via Latent Global Evolution. (arXiv:2206.07681v2 [cs.LG] UPDATED)
    Simulating the time evolution of Partial Differential Equations (PDEs) of large-scale systems is crucial in many scientific and engineering domains such as fluid dynamics, weather forecasting, and their inverse optimization problems. However, both classical solvers and recent deep learning-based surrogate models are typically extremely computationally intensive because of their local evolution: they need to update the state of each discretized cell at each time step during inference. Here we develop Latent Evolution of PDEs (LE-PDE), a simple, fast and scalable method to accelerate the simulation and inverse optimization of PDEs. LE-PDE learns a compact, global representation of the system and efficiently evolves it fully in the latent space with learned latent evolution models. LE-PDE achieves speed-up by having a much smaller latent dimension to update during long rollouts, as compared to updating in the input space. We introduce new learning objectives to effectively learn such latent dynamics and ensure long-term stability. We further introduce techniques for speeding up inverse optimization of boundary conditions for PDEs via backpropagation through time in latent space, and an annealing technique to address the non-differentiability and sparse interaction of boundary conditions. We test our method on a 1D benchmark of nonlinear PDEs, 2D Navier-Stokes flows entering the turbulent phase, and an inverse optimization of boundary conditions in 2D Navier-Stokes flow. Compared to state-of-the-art deep learning-based surrogate models and other strong baselines, we demonstrate up to 128x reduction in the number of dimensions to update and up to 15x improvement in speed, while achieving competitive accuracy.
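    The core mechanism admits a compact sketch: encode the discretized state once, advance a small global latent with a learned residual map, and decode only when an output is needed. All module names and sizes below are illustrative assumptions.

    ```python
    import torch
    import torch.nn as nn

    class LatentEvolver(nn.Module):
        """A minimal sketch of the latent-evolution idea: encode the PDE state
        once, evolve a small global latent vector with a learned map, and
        decode only when an output state is required."""
        def __init__(self, state_dim=4096, latent_dim=64):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(),
                                     nn.Linear(256, latent_dim))
            self.dyn = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
            self.dec = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                     nn.Linear(256, state_dim))

        def rollout(self, u0, n_steps):
            z = self.enc(u0)                 # encode once
            states = []
            for _ in range(n_steps):
                z = z + self.dyn(z)          # cheap residual latent step
                states.append(self.dec(z))   # decoding is optional per step
            return torch.stack(states)
    ```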
    EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine. (arXiv:2206.10558v2 [cs.LG] UPDATED)
    There has been significant progress in developing reinforcement learning (RL) training systems. Past works such as IMPALA, Apex, Seed RL, Sample Factory, and others aim to improve the system's overall throughput. In this paper, we aim to address a common bottleneck in the RL training system, i.e., parallel environment execution, which is often the slowest part of the whole system but receives little attention. With a curated design for parallelizing RL environments, we have improved the RL environment simulation speed across different hardware setups, ranging from a laptop and a modest workstation to a high-end machine such as the NVIDIA DGX-A100. On a high-end machine, EnvPool achieves one million frames per second for the environment execution on Atari environments and three million frames per second on MuJoCo environments. When running EnvPool on a laptop, the speed is 2.8x that of the Python subprocess approach. Moreover, great compatibility with existing RL training libraries has been demonstrated in the open-source community, including CleanRL, rl_games, DeepMind Acme, etc. Finally, EnvPool allows researchers to iterate their ideas at a much faster pace and has great potential to become the de facto RL environment execution engine. Example runs show that it takes only five minutes to train agents to play Atari Pong and MuJoCo Ant on a laptop. EnvPool is open-sourced at https://github.com/sail-sg/envpool.
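    For orientation, a minimal usage sketch following EnvPool's gym-style interface (return signatures vary across versions, so treat the step call as approximate):

    ```python
    import numpy as np
    import envpool  # pip install envpool

    # 8 Atari Pong environments stepped together by a C++ thread pool.
    env = envpool.make("Pong-v5", env_type="gym", num_envs=8)
    obs = env.reset()  # batched observations, one row per environment
    for _ in range(1000):
        actions = np.random.randint(0, env.action_space.n, size=8)
        # all returned arrays are batched across the 8 environments
        obs, rew, done, info = env.step(actions)
    ```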
    Competitive Physics Informed Networks. (arXiv:2204.11144v2 [cs.LG] UPDATED)
    Neural networks can be trained to solve partial differential equations (PDEs) by using the PDE residual as the loss function. This strategy is called "physics-informed neural networks" (PINNs), but it currently cannot produce high-accuracy solutions, typically attaining about $0.1\%$ relative error. We present an adversarial approach that overcomes this limitation, which we call competitive PINNs (CPINNs). CPINNs train a discriminator that is rewarded for predicting mistakes the PINN makes. The discriminator and PINN participate in a zero-sum game with the exact PDE solution as an optimal strategy. This approach avoids squaring the large condition numbers of PDE discretizations, which is the likely reason for failures of previous attempts to decrease PINN errors even on benign problems. Numerical experiments on a Poisson problem show that CPINNs achieve errors four orders of magnitude smaller than the best-performing PINN. We observe relative errors on the order of single-precision accuracy, consistently decreasing with each epoch. To the authors' knowledge, this is the first time this level of accuracy and convergence behavior has been achieved. Additional experiments on the nonlinear Schr\"odinger, Burgers', and Allen-Cahn equation show that the benefits of CPINNs are not limited to linear problems.
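    The zero-sum game can be prototyped with simultaneous descent/ascent on the 1D Poisson problem $u''(x)=f(x)$, $u(0)=u(1)=0$; the sketch below is a simplified stand-in for the paper's method (which uses a more careful competitive optimizer), with network sizes and step sizes of our own choosing.

    ```python
    import torch
    import torch.nn as nn

    net_u = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))  # PINN
    net_d = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))  # discriminator
    opt_u = torch.optim.Adam(net_u.parameters(), lr=1e-3)
    f = lambda x: -(torch.pi ** 2) * torch.sin(torch.pi * x)  # true u = sin(pi x)

    for it in range(2000):
        x = torch.rand(128, 1, requires_grad=True)
        u = net_u(x)
        du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
        d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
        residual = d2u - f(x)                      # PDE residual u'' - f
        bc = net_u(torch.tensor([[0.0], [1.0]])).pow(2).sum()
        game = (net_d(x) * residual).mean() + bc   # zero-sum payoff
        net_u.zero_grad(); net_d.zero_grad()
        game.backward()
        opt_u.step()                               # PINN descends the payoff
        with torch.no_grad():                      # discriminator ascends it
            for p in net_d.parameters():
                p += 1e-2 * p.grad
    ```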
    Automatic Discovery of Composite SPMD Partitioning Strategies in PartIR. (arXiv:2210.06352v1 [cs.NE])
    Large neural network models are commonly trained through a combination of advanced parallelism strategies in a single program, multiple data (SPMD) paradigm. For example, training large transformer models requires combining data, model, and pipeline partitioning with optimizer sharding techniques. However, identifying efficient combinations for many model architectures and accelerator systems requires significant manual analysis. In this work, we present an automatic partitioner that identifies these combinations through a goal-oriented search. Our key finding is that a Monte Carlo Tree Search-based partitioner, which incorporates partition-specific compiler analysis directly into the search and is guided by user goals, matches expert-level strategies for various models.
    Modeling the Machine Learning Multiverse. (arXiv:2206.05985v2 [cs.LG] UPDATED)
    Amid mounting concern about the reliability and credibility of machine learning research, we present a principled framework for making robust and generalizable claims: the multiverse analysis. Our framework builds upon the multiverse analysis (Steegen et al., 2016) introduced in response to psychology's own reproducibility crisis. To efficiently explore high-dimensional and often continuous ML search spaces, we model the multiverse with a Gaussian Process surrogate and apply Bayesian experimental design. Our framework is designed to facilitate drawing robust scientific conclusions about model performance, and thus our approach focuses on exploration rather than conventional optimization. In the first of two case studies, we investigate disputed claims about the relative merit of adaptive optimizers. Second, we synthesize conflicting research on the effect of learning rate on the large batch training generalization gap. For the machine learning community, the multiverse analysis is a simple and effective technique for identifying robust claims, for increasing transparency, and a step toward improved reproducibility.
    Semantic Cross Attention for Few-shot Learning. (arXiv:2210.06311v1 [cs.CV])
    Few-shot learning (FSL) has attracted considerable attention recently. Among existing approaches, metric-based methods aim to train an embedding network that pulls similar samples close while pushing dissimilar samples as far apart as possible, and they achieve promising results. FSL is characterized by using only a few images to train a model that can generalize to novel classes in image classification problems, but this setting makes it difficult to learn visual features that can identify the images' appearance variations. The model training is likely to move in the wrong direction, as images in an identical semantic class may have dissimilar appearances, whereas images in different semantic classes may share a similar appearance. We argue that FSL can benefit from additional semantic features to learn discriminative feature representations. Thus, this study proposes a multi-task learning approach that views the semantic features of label text as an auxiliary task to help boost the performance of the FSL task. Our proposed model uses word-embedding representations as semantic features to help train the embedding network, and a semantic cross-attention module to bridge the semantic features into the visual modality. The proposed approach is simple but produces excellent results. We apply our approach to two previous metric-based FSL methods, both of which substantially improve in performance. The source code for our model is accessible from GitHub.  ( 3 min )
    GLIPv2: Unifying Localization and Vision-Language Understanding. (arXiv:2206.05836v2 [cs.CV] UPDATED)
    We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding as a VL reformulation of the detection task, region-word contrastive learning as a novel region-word-level contrastive learning task, and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefits between localization and understanding tasks. Experimental results show that a single GLIPv2 model (all model weights are shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
    Bellman Residual Orthogonalization for Offline Reinforcement Learning. (arXiv:2203.12786v3 [cs.LG] UPDATED)
    We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from on-policy to off-policy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold.
    Robust Streaming PCA. (arXiv:1902.03223v3 [stat.ML] UPDATED)
    We consider streaming principal component analysis when the stochastic data-generating model is subject to perturbations. While existing models assume a fixed covariance, we adopt a robust perspective where the covariance matrix belongs to a temporal uncertainty set. Under this setting, we provide fundamental limits on the convergence of any algorithm recovering principal components. We analyze the convergence of the noisy power method and Oja's algorithm, both studied for the stationary data-generating model, and argue that the noisy power method is rate-optimal in our setting. Finally, we demonstrate the validity of our analysis through numerical experiments on synthetic and real-world datasets.  ( 2 min )
    NOTE: Robust Continual Test-time Adaptation Against Temporal Correlation. (arXiv:2208.05117v2 [cs.LG] UPDATED)
    Test-time adaptation (TTA) is an emerging paradigm that addresses distributional shifts between training and testing phases without additional data acquisition or labeling cost; only unlabeled test data streams are used for continual model adaptation. Previous TTA schemes assume that the test samples are independent and identically distributed (i.i.d.), even though they are often temporally correlated (non-i.i.d.) in application scenarios, e.g., autonomous driving. We discover that most existing TTA methods fail dramatically under such scenarios. Motivated by this, we present a new test-time adaptation scheme that is robust against non-i.i.d. test data streams. Our novelty is mainly two-fold: (a) Instance-Aware Batch Normalization (IABN) that corrects normalization for out-of-distribution samples, and (b) Prediction-balanced Reservoir Sampling (PBRS) that simulates an i.i.d. data stream from a non-i.i.d. stream in a class-balanced manner. Our evaluation with various datasets, including real-world non-i.i.d. streams, demonstrates that the proposed robust TTA not only outperforms state-of-the-art TTA algorithms in the non-i.i.d. setting, but also achieves comparable performance to those algorithms under the i.i.d. assumption. Code is available at https://github.com/TaesikGong/NOTE.
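    Our reading of the PBRS component can be sketched as a class-balanced reservoir: time-uniform reservoir updates within the majority predicted class, and eviction from the majority class otherwise. The tie-breaking and eviction details below are illustrative assumptions.

    ```python
    import random
    from collections import defaultdict

    class PredictionBalancedReservoir:
        """Keep a fixed-size memory balanced across *predicted* classes so
        that a temporally correlated stream is re-balanced before adaptation."""
        def __init__(self, capacity):
            self.capacity = capacity
            self.buckets = defaultdict(list)
            self.seen = defaultdict(int)   # per-class stream counts

        def add(self, sample, pred_class):
            self.seen[pred_class] += 1
            if sum(len(b) for b in self.buckets.values()) < self.capacity:
                self.buckets[pred_class].append(sample)
                return
            largest = max(self.buckets, key=lambda c: len(self.buckets[c]))
            if pred_class == largest:
                # classic reservoir step within the majority class
                j = random.randrange(self.seen[pred_class])
                if j < len(self.buckets[pred_class]):
                    self.buckets[pred_class][j] = sample
            else:
                # evict from the majority class to rebalance the memory
                victim = random.randrange(len(self.buckets[largest]))
                self.buckets[largest].pop(victim)
                self.buckets[pred_class].append(sample)
    ```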
    Patching open-vocabulary models by interpolating weights. (arXiv:2208.05592v2 [cs.CV] UPDATED)
    Open-vocabulary models like CLIP achieve high accuracy across many image classification tasks. However, there are still settings where their zero-shot performance is far from optimal. We study model patching, where the goal is to improve accuracy on specific tasks without degrading accuracy on tasks where performance is already adequate. Towards this goal, we introduce PAINT, a patching method that uses interpolations between the weights of a model before fine-tuning and the weights after fine-tuning on a task to be patched. On nine tasks where zero-shot CLIP performs poorly, PAINT increases accuracy by 15 to 60 percentage points while preserving accuracy on ImageNet within one percentage point of the zero-shot model. PAINT also allows a single model to be patched on multiple tasks and improves with model scale. Furthermore, we identify cases of broad transfer, where patching on one task increases accuracy on other tasks even when the tasks have disjoint classes. Finally, we investigate applications beyond common benchmarks such as counting or reducing the impact of typographic attacks on CLIP. Our findings demonstrate that it is possible to expand the set of tasks on which open-vocabulary models achieve high accuracy without re-training them from scratch.
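    The patching operation itself reduces to a weight-space interpolation between two checkpoints; a hedged sketch (assuming floating-point state dicts with identical keys, and with the mixing coefficient tuned on held-out data):

    ```python
    import torch

    def patch_by_interpolation(theta_zeroshot, theta_finetuned, alpha):
        """Linearly interpolate between the zero-shot and task-fine-tuned
        checkpoints. alpha trades patched-task accuracy against accuracy on
        tasks where the zero-shot model is already adequate."""
        return {k: (1 - alpha) * theta_zeroshot[k] + alpha * theta_finetuned[k]
                for k in theta_zeroshot}

    # usage sketch: model.load_state_dict(patch_by_interpolation(sd0, sd1, 0.5))
    ```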
    Understanding Deep Contrastive Learning via Coordinate-wise Optimization. (arXiv:2201.12680v5 [cs.LG] UPDATED)
    We show that Contrastive Learning (CL) under a broad family of loss functions (including InfoNCE) has a unified formulation of coordinate-wise optimization on the network parameter $\boldsymbol{\theta}$ and pairwise importance $\alpha$, where the \emph{max player} $\boldsymbol{\theta}$ learns representation for contrastiveness, and the \emph{min player} $\alpha$ puts more weights on pairs of distinct samples that share similar representations. The resulting formulation, called $\alpha$-CL, not only unifies various existing contrastive losses, which differ in how the sample-pair importance $\alpha$ is constructed, but can also extrapolate to give novel contrastive losses beyond popular ones, opening a new avenue for contrastive loss design. These novel losses yield comparable (or better) performance on CIFAR10, STL-10 and CIFAR-100 than classic InfoNCE. Furthermore, we also analyze the max player in detail: we prove that with fixed $\alpha$, the max player is equivalent to Principal Component Analysis (PCA) for deep linear networks, and almost all local minima are global and rank-1, recovering optimal PCA solutions. Finally, we extend our analysis of the max player to 2-layer ReLU networks, showing that its fixed points can have higher ranks.  ( 3 min )
    Escaping Saddle Points with Bias-Variance Reduced Local Perturbed SGD for Communication Efficient Nonconvex Distributed Learning. (arXiv:2202.06083v3 [cs.LG] UPDATED)
    In recent centralized nonconvex distributed learning and federated learning, local methods are one of the promising approaches to reduce communication time. However, existing work has mainly focused on studying first-order optimality guarantees. On the other hand, algorithms with second-order optimality guarantees, i.e., algorithms escaping saddle points, have been extensively studied in the non-distributed optimization literature. In this paper, we study a new local algorithm called Bias-Variance Reduced Local Perturbed SGD (BVR-L-PSGD), which combines the existing bias-variance reduced gradient estimator with parameter perturbation to find second-order optimal points in centralized nonconvex distributed optimization. BVR-L-PSGD enjoys second-order optimality with nearly the same communication complexity as the best known one of BVR-L-SGD for finding first-order optimality. In particular, the communication complexity is better than that of non-local methods when the local dataset heterogeneity is smaller than the smoothness of the local loss. In an extreme case, the communication complexity approaches $\widetilde \Theta(1)$ as the local dataset heterogeneity goes to zero. Numerical results validate our theoretical findings.
    Efficient Risk-Averse Reinforcement Learning. (arXiv:2205.05138v2 [cs.LG] UPDATED)
    In risk-averse reinforcement learning (RL), the goal is to optimize some risk measure of the returns. A risk measure often focuses on the worst returns out of the agent's experience. As a result, standard methods for risk-averse RL often ignore high-return strategies. We prove that under certain conditions this inevitably leads to a local-optimum barrier, and propose a soft risk mechanism to bypass it. We also devise a novel Cross Entropy module for risk sampling, which (1) preserves risk aversion despite the soft risk; (2) independently improves sample efficiency. By separating the risk aversion of the sampler and the optimizer, we can sample episodes with poor conditions, yet optimize with respect to successful strategies. We combine these two concepts in CeSoR - Cross-entropy Soft-Risk optimization algorithm - which can be applied on top of any risk-averse policy gradient (PG) method. We demonstrate improved risk aversion in maze navigation, autonomous driving, and resource allocation benchmarks, including in scenarios where standard risk-averse PG completely fails.
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v2 [stat.ML] UPDATED)
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimate from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that, despite its simplicity and the lack of an initial exploration phase, the greedy policy achieves, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$ after $T$ rounds, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. Finally, we use simulated settings and experiments on the US census data to show that our policy achieves sub-linear fair pseudo-regret also in practice.
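    A sketch of the greedy selection step under our simplifying assumptions: a single ridge estimate beta_hat shared across groups, and per-group reward histories standing in for the empirical CDFs.

    ```python
    import numpy as np

    def select_candidate(contexts, groups, beta_hat, history):
        """Score each candidate with the ridge estimate, rank that score
        within its own group via the empirical CDF of previously observed
        rewards (history[g] is a list of past rewards for group g), and
        pick the candidate with the highest relative rank."""
        best, best_rank = None, -1.0
        for i, (x, g) in enumerate(zip(contexts, groups)):
            score = x @ beta_hat
            past = np.asarray(history[g])
            rank = (past <= score).mean() if past.size else 1.0
            if rank > best_rank:
                best, best_rank = i, rank
        return best
    ```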
    Non-Stationary Bandits under Recharging Payoffs: Improved Planning with Sublinear Regret. (arXiv:2205.14790v2 [cs.LG] UPDATED)
    The stochastic multi-armed bandit setting has been recently studied in the non-stationary regime, where the mean payoff of each action is a non-decreasing function of the number of rounds passed since it was last played. This model captures natural behavioral aspects of the users which crucially determine the performance of recommendation platforms, ad placement systems, and more. Even assuming prior knowledge of the mean payoff functions, computing an optimal plan in the above model is NP-hard, while the state-of-the-art is a $1/4$-approximation algorithm for the case where at most one arm can be played per round. We first focus on the setting where the mean payoff functions are known. In this setting, we significantly improve the best-known guarantees for the planning problem by developing a polynomial-time $(1-1/e)$-approximation algorithm (asymptotically and in expectation), based on a novel combination of randomized LP rounding and a time-correlated (interleaved) scheduling method. Furthermore, compared to prior work, our algorithm achieves improved guarantees for the case where more than one arm can be played at each round. Moving to the bandit setting, when the mean payoff functions are initially unknown, we show how our algorithm can be transformed into a bandit algorithm with sublinear regret.
    Object Scene Representation Transformer. (arXiv:2206.06922v2 [cs.CV] UPDATED)
    A compositional understanding of the world in terms of objects and their geometry in 3D space is considered a cornerstone of human cognition. Facilitating the learning of such a representation in neural networks holds promise for substantially improving labeled data efficiency. As a key step in this direction, we make progress on the problem of learning 3D-consistent decompositions of complex scenes into individual objects in an unsupervised fashion. We introduce Object Scene Representation Transformer (OSRT), a 3D-centric model in which individual object representations naturally emerge through novel view synthesis. OSRT scales to significantly more complex scenes with larger diversity of objects and backgrounds than existing methods. At the same time, it is multiple orders of magnitude faster at compositional rendering thanks to its light field parametrization and the novel Slot Mixer decoder. We believe this work will not only accelerate future architecture exploration and scaling efforts, but it will also serve as a useful tool for both object-centric as well as neural scene representation learning communities.
    HouseX: A Fine-grained House Music Dataset and its Potential in the Music Industry. (arXiv:2207.11690v2 [cs.SD] UPDATED)
    Machine sound classification has been one of the fundamental tasks of music technology. A major branch of sound classification is the classification of music genres. However, though covering most genres of music, existing music genre datasets often do not contain fine-grained labels that indicate the detailed sub-genres of music. In consideration of the consistency of genres of songs in a mixtape or in a DJ (live) set, we have collected and annotated a dataset of house music that provides four sub-genre labels, namely future house, bass house, progressive house, and melodic house. Experiments show that our annotations clearly exhibit the characteristics of different categories. Also, we have built baseline models that classify the sub-genre based on the mel-spectrograms of a track, achieving strongly competitive results. We also put forward a few application scenarios of our dataset and baseline model, with a simulated sci-fi tunnel as a short demo built and rendered in 3D modeling software, where the colors of the lights are driven by the output of our model.
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v2 [stat.ML] UPDATED)
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.
    TaskMix: Data Augmentation for Meta-Learning of Spoken Intent Understanding. (arXiv:2210.06341v1 [cs.CL])
    Meta-Learning has emerged as a research direction to better transfer knowledge from related tasks to unseen but related tasks. However, Meta-Learning requires many training tasks to learn representations that transfer well to unseen tasks; otherwise, it leads to overfitting, and the performance degenerates to worse than Multi-task Learning. We show that a state-of-the-art data augmentation method worsens this problem of overfitting when the task diversity is low. We propose a simple method, TaskMix, which synthesizes new tasks by linearly interpolating existing tasks. We compare TaskMix against many baselines on an in-house multilingual intent classification dataset of N-Best ASR hypotheses derived from real-life human-machine telephony utterances and two datasets derived from MTOP. We show that TaskMix outperforms baselines, alleviates overfitting when task diversity is low, and does not degrade performance even when it is high.
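    A minimal sketch of the interpolation step, with our own assumptions about the details (tasks represented as aligned feature/one-hot-label arrays, and a mixup-style Beta(alpha, alpha) mixing coefficient):

    ```python
    import numpy as np

    def task_mix(task_a, task_b, alpha=0.4, rng=None):
        """Synthesize a new meta-training task by convexly interpolating two
        existing tasks. A task here is a dict of aligned arrays
        {'x': features, 'y': one-hot labels}; the Beta(alpha, alpha) mixing
        coefficient is a mixup-style assumption on our part."""
        if rng is None:
            rng = np.random.default_rng()
        lam = rng.beta(alpha, alpha)
        return {"x": lam * task_a["x"] + (1 - lam) * task_b["x"],
                "y": lam * task_a["y"] + (1 - lam) * task_b["y"]}
    ```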
    Inducing Neural Collapse in Imbalanced Learning: Do We Really Need a Learnable Classifier at the End of Deep Neural Network?. (arXiv:2203.09081v3 [cs.LG] UPDATED)
    Modern deep neural networks for classification usually jointly learn a backbone for representation and a linear classifier to output the logit of each class. A recent study has shown a phenomenon called neural collapse: the within-class means of features and the classifier vectors converge to the vertices of a simplex equiangular tight frame (ETF) at the terminal phase of training on a balanced dataset. Since the ETF geometric structure maximally separates the pair-wise angles of all classes in the classifier, it is natural to ask: why do we spend effort learning a classifier when we know its optimal geometric structure? In this paper, we study the potential of learning a neural network for classification with the classifier randomly initialized as an ETF and fixed during training. Our analytical work based on the layer-peeled model indicates that feature learning with a fixed ETF classifier naturally leads to the neural collapse state even when the dataset is imbalanced among classes. We further show that in this case the cross-entropy (CE) loss is not necessary and can be replaced by a simple squared loss that shares the same global optimality but enjoys a better convergence property. Our experimental results show that our method is able to bring significant improvements with faster convergence on multiple imbalanced datasets.
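    The fixed classifier is straightforward to construct; below is a sketch of the standard simplex-ETF construction (assuming the feature dimension is at least the number of classes):

    ```python
    import torch

    def simplex_etf(num_classes, feat_dim, seed=0):
        """Construct a feat_dim x K simplex-ETF matrix
        M = sqrt(K/(K-1)) * U (I - 11^T / K), with U a random partial
        orthogonal matrix (requires feat_dim >= num_classes)."""
        g = torch.Generator().manual_seed(seed)
        a = torch.randn(feat_dim, num_classes, generator=g)
        u, _, _ = torch.linalg.svd(a, full_matrices=False)  # orthonormal columns
        k = num_classes
        m = (k / (k - 1)) ** 0.5 * u @ (torch.eye(k) - torch.ones(k, k) / k)
        return m  # logits = features @ m; keep m frozen during training

    # Columns are the fixed class vectors: the pairwise cosine similarity
    # between distinct classes equals -1/(K-1), the maximally separated case.
    ```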
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v2 [cs.LG] UPDATED)
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
    Mean Estimation in High-Dimensional Binary Markov Gaussian Mixture Models. (arXiv:2206.02455v3 [math.ST] UPDATED)
    We consider a high-dimensional mean estimation problem over a binary hidden Markov model, which illuminates the interplay between memory in data, sample size, dimension, and signal strength in statistical inference. In this model, an estimator observes $n$ samples of a $d$-dimensional parameter vector $\theta_{*}\in\mathbb{R}^{d}$, multiplied by a random sign $ S_i $ ($1\le i\le n$), and corrupted by isotropic standard Gaussian noise. The sequence of signs $\{S_{i}\}_{i\in[n]}\in\{-1,1\}^{n}$ is drawn from a stationary homogeneous Markov chain with flip probability $\delta\in[0,1/2]$. As $\delta$ varies, this model smoothly interpolates two well-studied models: the Gaussian Location Model for which $\delta=0$ and the Gaussian Mixture Model for which $\delta=1/2$. Assuming that the estimator knows $\delta$, we establish a nearly minimax optimal (up to logarithmic factors) estimation error rate, as a function of $\|\theta_{*}\|,\delta,d,n$. We then provide an upper bound to the case of estimating $\delta$, assuming a (possibly inaccurate) knowledge of $\theta_{*}$. The bound is proved to be tight when $\theta_{*}$ is an accurately known constant. These results are then combined to an algorithm which estimates $\theta_{*}$ with $\delta$ unknown a priori, and theoretical guarantees on its error are stated.
    Behavior Transformers: Cloning $k$ modes with one stone. (arXiv:2206.11251v2 [cs.LG] UPDATED)
    While behavior learning has made impressive progress in recent times, it lags behind computer vision and natural language processing due to its inability to leverage large, human-generated datasets. Human behaviors have wide variance, multiple modes, and human demonstrations typically do not come with reward labels. These properties limit the applicability of current methods in Offline RL and Behavioral Cloning to learn from large, pre-collected datasets. In this work, we present Behavior Transformer (BeT), a new technique to model unlabeled demonstration data with multiple modes. BeT retrofits standard transformer architectures with action discretization coupled with a multi-task action correction inspired by offset prediction in object detection. This allows us to leverage the multi-modal modeling ability of modern transformers to predict multi-modal continuous actions. We experimentally evaluate BeT on a variety of robotic manipulation and self-driving behavior datasets. We show that BeT significantly improves over prior state-of-the-art work on solving demonstrated tasks while capturing the major modes present in the pre-collected datasets. Finally, through an extensive ablation study, we analyze the importance of every crucial component in BeT. Videos of behavior generated by BeT are available at https://notmahi.github.io/bet
    On Unbalanced Optimal Transport: Gradient Methods, Sparsity and Approximation Error. (arXiv:2202.03618v3 [math.OC] UPDATED)
    We study the Unbalanced Optimal Transport (UOT) between two measures of possibly different masses with at most $n$ components, where the marginal constraints of standard Optimal Transport (OT) are relaxed via Kullback-Leibler divergence with regularization factor $\tau$. Although only Sinkhorn-based UOT solvers have been analyzed in the literature, with the complexity ${O}\big(\tfrac{\tau n^2 \log(n)}{\varepsilon} \log\big(\tfrac{\log(n)}{{\varepsilon}}\big)\big)$ for achieving the desired error $\varepsilon$, their positively dense output transportation plans strongly hinder their practicality. On the other hand, while being vastly used as heuristics for computing UOT in modern deep learning applications and having shown success in the sparse OT problem, gradient methods applied to UOT have not been formally studied. In this paper, we propose a novel algorithm based on the Gradient Extrapolation Method (GEM-UOT) to find an $\varepsilon$-approximate solution to the UOT problem in $O\big( \kappa n^2 \log\big(\frac{\tau n}{\varepsilon}\big) \big)$, where $\kappa$ is the condition number depending on only the two input measures. Our proof technique is based on a novel dual formulation of the squared $\ell_2$-norm UOT objective, which fills a gap in the sparse UOT literature and also leads to a new characterization of the approximation error between UOT and OT. To this end, we further present a novel approach for OT retrieval from UOT, which is based on GEM-UOT with fine-tuned $\tau$ and a post-process projection step. Extensive experiments on synthetic and real datasets validate our theories and demonstrate the favorable performance of our methods in practice. We showcase GEM-UOT on the task of color transfer in terms of both the quality of the transferred image and the sparsity of the transportation plan.
    Deep Architecture Connectivity Matters for Its Convergence: A Fine-Grained Analysis. (arXiv:2205.05662v2 [cs.LG] UPDATED)
    Advanced deep neural networks (DNNs), designed by either human or AutoML algorithms, are growing increasingly complex. Diverse operations are connected by complicated connectivity patterns, e.g., various types of skip connections. Those topological compositions are empirically effective and observed to smooth the loss landscape and facilitate the gradient flow in general. However, it remains elusive to derive any principled understanding of their effects on the DNN capacity or trainability, and to understand why or in which aspect one specific connectivity pattern is better than another. In this work, we theoretically characterize the impact of connectivity patterns on the convergence of DNNs under gradient descent training in fine granularity. By analyzing a wide network's Neural Network Gaussian Process (NNGP), we are able to depict how the spectrum of an NNGP kernel propagates through a particular connectivity pattern, and how that affects the bound of convergence rates. As one practical implication of our results, we show that by a simple filtration on "unpromising" connectivity patterns, we can trim down the number of models to evaluate, and significantly accelerate the large-scale neural architecture search without any overhead. Code is available at: https://github.com/VITA-Group/architecture_convergence.
    Adversarial random forests for density estimation and generative modelling. (arXiv:2205.09435v2 [stat.ML] UPDATED)
    We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike existing tree-based alternatives, our approach provides smooth unconditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. All algorithms are implemented in easy-to-use $\texttt{R}$ and Python packages.
    CoNSoLe: Convex Neural Symbolic Learning. (arXiv:2206.00257v2 [cs.LG] UPDATED)
    Learning the underlying equation from data is a fundamental problem in many disciplines. Recent advances rely on Neural Networks (NNs) but do not provide theoretical guarantees in obtaining the exact equations owing to the non-convexity of NNs. In this paper, we propose Convex Neural Symbolic Learning (CoNSoLe) to seek convexity under mild conditions. The main idea is to decompose the recovering process into two steps and convexify each step. In the first step of searching for the right symbols, we convexify the deep Q-learning. The key is to maintain double convexity for both the negative Q-function and the negative reward function in each iteration, leading to provable convexity of the negative optimal Q-function to learn the true symbol connections. Conditioned on the exact searching result, we construct a Locally Convex equation Learner (LoCaL) neural network to convexify the estimation of symbol coefficients. With such a design, we quantify a large region with strict convexity in the loss surface of LoCaL for commonly used physical functions. Finally, we demonstrate the superior performance of the CoNSoLe framework over the state-of-the-art on a diverse set of datasets.
    A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability. (arXiv:2202.05420v2 [cs.LG] UPDATED)
    We study the problem of learning a predictor that is adversarially robust to test-time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that, given enough unlabeled data (on the order of the labeled sample size that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller than in previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between the supervised and semi-supervised label complexities, which is known not to hold in standard non-robust PAC learning.
    Scalable particle-based alternatives to EM. (arXiv:2204.12965v2 [stat.CO] UPDATED)
    Neal and Hinton (1998) recast the problem tackled by EM as the minimization of a free energy functional $F$ on an infinite-dimensional space, and EM itself as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain three practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale well to high-dimensional settings and outperform existing state-of-the-art methods in experiments.
    ECLAD: Extracting Concepts with Local Aggregated Descriptors. (arXiv:2206.04531v2 [cs.CV] UPDATED)
    Convolutional neural networks (CNNs) are increasingly being used in critical systems, where robustness and alignment are crucial. In this context, the field of explainable artificial intelligence has proposed the generation of high-level explanations of the prediction process of CNNs through concept extraction. While these methods can detect whether or not a concept is present in an image, they are unable to determine its location. What is more, a fair comparison of such approaches is difficult due to a lack of proper validation procedures. To address these issues, we propose a novel method for automatic concept extraction and localization based on representations obtained through pixel-wise aggregations of CNN activation maps. Further, we introduce a process for the validation of concept-extraction techniques based on synthetic datasets with pixel-wise annotations of their main components, reducing the need for human intervention. Extensive experimentation on both synthetic and real-world datasets demonstrates that our method outperforms state-of-the-art alternatives.
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v2 [cs.LG] UPDATED)
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations, which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\hat{\Theta}$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are: (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms; and (2) it introduces a multi-task learning based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate our model on synthetic Gaussian data and non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
    Fast Stochastic Composite Minimization and an Accelerated Frank-Wolfe Algorithm under Parallelization. (arXiv:2205.12751v2 [math.OC] UPDATED)
    We consider the problem of minimizing the sum of two convex functions. One of those functions has Lipschitz-continuous gradients, and can be accessed via stochastic oracles, whereas the other is "simple". We provide a Bregman-type algorithm with accelerated convergence in function values to a ball containing the minimum. The radius of this ball depends on problem-dependent constants, including the variance of the stochastic oracle. We further show that this algorithmic setup naturally leads to a variant of Frank-Wolfe achieving acceleration under parallelization. More precisely, when minimizing a smooth convex function on a bounded domain, we show that one can achieve an $\epsilon$ primal-dual gap (in expectation) in $\tilde{O}(1/ \sqrt{\epsilon})$ iterations, by only accessing gradients of the original function and a linear maximization oracle with $O(1/\sqrt{\epsilon})$ computing units in parallel. We illustrate this fast convergence on synthetic numerical experiments.
    Generalization Bounds on Multi-Kernel Learning with Mixed Datasets. (arXiv:2205.07313v2 [cs.LG] UPDATED)
    This paper presents novel generalization bounds for the multi-kernel learning problem. Motivated by applications in sensor networks and spatial-temporal models, we assume that the dataset is mixed where each sample is taken from a finite pool of Markov chains. Our bounds for learning kernels admit $O(\sqrt{\log m})$ dependency on the number of base kernels and $O(1/\sqrt{n})$ dependency on the number of training samples. However, some $O(1/\sqrt{n})$ terms are added to compensate for the dependency among samples compared with existing generalization bounds for multi-kernel learning with i.i.d. datasets.
    Deep Probability Estimation. (arXiv:2111.10734v4 [cs.LG] UPDATED)
    Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the difference that the objective is to estimate probabilities rather than predicting the specific outcome. This work investigates probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on model (epistemic) uncertainty. For problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty: precipitation forecasting from radar images, predicting cancer patient survival from histopathology images, and predicting car crashes from dashcam videos. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.
    Supported Policy Optimization for Offline Reinforcement Learning. (arXiv:2202.06239v3 [cs.LG] UPDATED)
    Policy constraint methods for offline reinforcement learning (RL) typically utilize parameterization or regularization that constrains the policy to perform actions within the support set of the behavior policy. The elaborate designs of parameterization methods usually intrude into the policy networks, which may bring extra inference cost and cannot take full advantage of well-established online methods. Regularization methods reduce the divergence between the learned policy and the behavior policy, which may mismatch the inherent density-based definition of the support set, thereby failing to avoid out-of-distribution actions effectively. This paper presents Supported Policy OpTimization (SPOT), which is directly derived from the theoretical formalization of the density-based support constraint. SPOT adopts a VAE-based density estimator to explicitly model the support set of the behavior policy and presents a simple but effective density-based regularization term, which can be plugged non-intrusively into off-the-shelf off-policy RL algorithms. SPOT achieves state-of-the-art performance on standard benchmarks for offline RL. Benefiting from the pluggable design, offline pretrained models from SPOT can also be applied to perform online fine-tuning seamlessly.
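    In spirit, the actor update augments the usual value objective with a behavior-density term; the sketch below is our reading, where vae_elbo(s, a) is an assumed helper returning a per-sample ELBO lower bound on the behavior policy's log-density:

    ```python
    import torch

    def density_regularized_actor_loss(q_net, actor, vae_elbo, states, lam=0.1):
        """Actor maximizes the critic value plus a lower bound on the
        behavior policy's log-density at its own actions; the bound comes
        from a VAE fit to the offline data (assumed helper vae_elbo)."""
        actions = actor(states)
        q = q_net(states, actions)
        log_mu = vae_elbo(states, actions)   # lower-bounds log mu(a|s)
        return -(q + lam * log_mu).mean()    # minimize the negation
    ```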
    Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors. (arXiv:2210.06340v1 [cs.CL])
    Current deep learning models trained to generate radiology reports from chest radiographs are capable of producing clinically accurate, clear, and actionable text that can advance patient care. However, such systems all succumb to the same problem: making hallucinated references to non-existent prior reports. Such hallucinations occur because these models are trained on datasets of real-world patient reports that inherently refer to priors. To this end, we propose two methods to remove references to priors in radiology reports: (1) a GPT-3-based few-shot approach to rewrite medical reports without references to priors; and (2) a BioBERT-based token classification approach to directly remove words referring to priors. We use the aforementioned approaches to modify MIMIC-CXR, a publicly available dataset of chest X-rays and their associated free-text radiology reports; we then retrain CXR-RePaiR, a radiology report generation system, on the adapted MIMIC-CXR dataset. We find that our retrained model, which we call CXR-ReDonE, outperforms previous report generation methods on clinical metrics, achieving an average BERTScore of 0.2351 (2.57% absolute improvement). We expect our approach to be broadly valuable in enabling current radiology report generation systems to be more directly integrated into clinical pipelines.
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v2 [cs.LG] UPDATED)
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.
    Optimal Comparator Adaptive Online Learning with Switching Cost. (arXiv:2205.06846v3 [cs.LG] UPDATED)
    Practical online learning tasks are often naturally defined on unconstrained domains, where optimal algorithms for general convex losses are characterized by the notion of comparator adaptivity. In this paper, we design such algorithms in the presence of switching cost - the latter penalizes the typical optimism in adaptive algorithms, leading to a delicate design trade-off. Based on a novel dual space scaling strategy discovered by a continuous-time analysis, we propose a simple algorithm that improves the existing comparator adaptive regret bound [ZCP22a] to the optimal rate. The obtained benefits are further extended to the expert setting, and the practicality of the proposed algorithm is demonstrated through a sequential investment task.
    Visual Prompting for Adversarial Robustness. (arXiv:2210.06284v1 [cs.CV])
    In this work, we leverage visual prompting (VP) to improve adversarial robustness of a fixed, pre-trained model at testing time. Compared to conventional adversarial defenses, VP allows us to design universal (i.e., data-agnostic) input prompting templates, which have plug-and-play capabilities at testing time to achieve desired model performance without introducing much computation overhead. Although VP has been successfully applied to improving model generalization, it remains elusive whether and how it can be used to defend against adversarial attacks. We investigate this problem and show that the vanilla VP approach is not effective in adversarial defense since a universal input prompt lacks the capacity for robust learning against sample-specific adversarial perturbations. To circumvent it, we propose a new VP method, termed Class-wise Adversarial Visual Prompting (C-AVP), to generate class-wise visual prompts so as to not only leverage the strengths of ensemble prompts but also optimize their interrelations to improve model robustness. Our experiments show that C-AVP outperforms the conventional VP method, with 2.1X standard accuracy gain and 2X robust accuracy gain. Compared to classical test-time defenses, C-AVP also yields a 42X inference time speedup.
    Recipe for a General, Powerful, Scalable Graph Transformer. (arXiv:2205.12454v3 [cs.LG] UPDATED)
    We propose a recipe for building a general, powerful, scalable (GPS) graph Transformer with linear complexity and state-of-the-art results on a diverse set of benchmarks. Graph Transformers (GTs) have gained popularity in the field of graph representation learning with a variety of recent publications, but they lack a common foundation about what constitutes a good positional or structural encoding, and what differentiates them. In this paper, we summarize the different types of encodings with a clearer definition and categorize them as being $\textit{local}$, $\textit{global}$ or $\textit{relative}$. Prior GTs are constrained to small graphs with a few hundred nodes; here we propose the first architecture with a complexity linear in the number of nodes and edges, $O(N+E)$, by decoupling the local real-edge aggregation from the fully-connected Transformer. We argue that this decoupling does not negatively affect expressivity, with our architecture being a universal function approximator on graphs. Our GPS recipe consists of choosing 3 main ingredients: (i) positional/structural encoding, (ii) local message-passing mechanism, and (iii) global attention mechanism. We provide a modular framework $\textit{GraphGPS}$ that supports multiple types of encodings and provides efficiency and scalability for both small and large graphs. We test our architecture on 16 benchmarks and show highly competitive results on all of them, showcasing the empirical benefits gained by the modularity and the combination of different strategies.
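    One GPS block can be sketched as a parallel sum of a local message-passing update and a global attention update; dense adjacency and a mean-aggregation stand-in for the MPNN are our simplifications of the paper's sparse, linear-attention version.

    ```python
    import torch
    import torch.nn as nn

    class GPSLayer(nn.Module):
        """One GPS block: local message passing over real edges and global
        self-attention over all nodes, run in parallel and summed."""
        def __init__(self, dim, heads=4):
            super().__init__()
            self.local = nn.Linear(dim, dim)   # stand-in MPNN update
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.ffn = nn.Sequential(nn.Linear(dim, 2 * dim), nn.ReLU(),
                                     nn.Linear(2 * dim, dim))
            self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, x, adj):
            # x: (batch, N, dim) node features with positional/structural
            # encodings already added; adj: (batch, N, N) with self-loops.
            deg = adj.sum(-1, keepdim=True).clamp(min=1)
            h_local = self.local(adj @ x / deg)   # mean aggregation
            h_global, _ = self.attn(x, x, x, need_weights=False)
            h = self.norm1(x + h_local + h_global)
            return self.norm2(h + self.ffn(h))
    ```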
    AdaPT: Fast Emulation of Approximate DNN Accelerators in PyTorch. (arXiv:2203.04071v2 [cs.LG] UPDATED)
    Current state-of-the-art employs approximate multipliers to address the highly increased power demands of DNN accelerators. However, evaluating the accuracy of approximate DNNs is cumbersome due to the lack of adequate support for approximate arithmetic in DNN frameworks. We address this inefficiency by presenting AdaPT, a fast emulation framework that extends PyTorch to support approximate inference as well as approximation-aware retraining. AdaPT can be seamlessly deployed and is compatible with most DNNs. We evaluate the framework on several DNN models and application fields, including CNNs, LSTMs, and GANs, for a number of approximate multipliers with distinct bitwidth values. The results show substantial error recovery from approximation-aware retraining and reduced inference time up to 53.9x with respect to the baseline approximate implementation.
    SwAMP: Swapped Assignment of Multi-Modal Pairs for Cross-Modal Retrieval. (arXiv:2111.05814v2 [cs.LG] UPDATED)
    We tackle the cross-modal retrieval problem, where learning is only supervised by relevant multi-modal pairs in the data. Although contrastive learning is the most popular approach for this task, it makes the potentially wrong assumption that instances in different pairs are automatically irrelevant. To address the issue, we propose a novel loss function based on self-labeling of the unknown semantic classes. Specifically, we aim to predict class labels of the data instances in each modality and assign those labels to the corresponding instances in the other modality (i.e., swapping the pseudo-labels). With these swapped labels, we learn the data embedding for each modality using the supervised cross-entropy loss. This way, cross-modal instances from different pairs that are semantically related can be aligned to each other by the class predictor. We tested our approach on several real-world cross-modal retrieval problems, including text-based video retrieval, sketch-based image retrieval, and image-text retrieval. For all these tasks, our method achieves significant performance improvement over contrastive learning.
    In Defense of the Unitary Scalarization for Deep Multi-Task Learning. (arXiv:2201.04122v3 [cs.LG] UPDATED)
    Recent multi-task learning research argues against unitary scalarization, where training simply minimizes the sum of the task losses. Several ad-hoc multi-task optimization algorithms have instead been proposed, inspired by various hypotheses about what makes multi-task settings difficult. The majority of these optimizers require per-task gradients, and introduce significant memory, runtime, and implementation overhead. We show that unitary scalarization, coupled with standard regularization and stabilization techniques from single-task learning, matches or improves upon the performance of complex multi-task optimizers in popular supervised and reinforcement learning settings. We then present an analysis suggesting that many specialized multi-task optimizers can be partly interpreted as forms of regularization, potentially explaining our surprising results. We believe our results call for a critical reevaluation of recent research in the area.
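    For reference, unitary scalarization is exactly the trivial baseline the paper defends: sum the task losses and take one ordinary gradient step. A minimal PyTorch-style sketch (helper names are ours):

        def unitary_scalarization_step(model, task_batches, task_losses, optimizer):
            # One ordinary step on the plain sum of task losses: a single backward
            # pass, no per-task gradients, no gradient surgery.
            optimizer.zero_grad()
            total = sum(loss_fn(model(x), y)
                        for loss_fn, (x, y) in zip(task_losses, task_batches))
            total.backward()
            optimizer.step()
            return float(total)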
    Generative Adversarial Neural Operators. (arXiv:2205.03017v2 [cs.LG] UPDATED)
    We propose the generative adversarial neural operator (GANO), a generative model paradigm for learning probabilities on infinite-dimensional function spaces. The natural sciences and engineering are known to have many types of data that are sampled from infinite-dimensional function spaces, where classical finite-dimensional deep generative adversarial networks (GANs) may not be directly applicable. GANO generalizes the GAN framework and allows for the sampling of functions by learning push-forward operator maps in infinite-dimensional spaces. GANO consists of two main components, a generator neural operator and a discriminator neural functional. The inputs to the generator are samples of functions from a user-specified probability measure, e.g., Gaussian random field (GRF), and the generator outputs are synthetic data functions. The input to the discriminator is either a real or synthetic data function. In this work, we instantiate GANO using the Wasserstein criterion and show how the Wasserstein loss can be computed in infinite-dimensional spaces. We empirically study GANO in controlled cases where both input and output functions are samples from GRFs and compare its performance to the finite-dimensional counterpart GAN. We empirically study the efficacy of GANO on real-world function data of volcanic activities and show its superior performance over GAN.
    Generalised correlated batched bandits via the ARC algorithm with application to dynamic pricing. (arXiv:2102.04263v2 [math.OC] UPDATED)
    The Asymptotic Randomised Control (ARC) algorithm provides a rigorous approximation to the optimal strategy for a wide class of Bayesian bandits, while retaining low computational complexity. In particular, the ARC approach provides nearly optimal choices even when the payoffs are correlated or more than the reward is observed. The algorithm is guaranteed to asymptotically optimise the expected discounted payoff, with error depending on the initial uncertainty of the bandit. In this paper, we extend the ARC framework to consider a batched bandit problem where observations arrive from a generalised linear model. In particular, we develop a large-sample approximation to allow correlated and generally distributed observations. We apply this to a classic dynamic pricing problem based on a Bayesian hierarchical model and demonstrate that the ARC algorithm outperforms alternative approaches.
    Extended Unconstrained Features Model for Exploring Deep Neural Collapse. (arXiv:2202.08087v3 [cs.LG] UPDATED)
    The modern strategy for training deep neural networks for classification tasks includes optimizing the network's weights even after the training error vanishes, to further push the training loss toward zero. Recently, a phenomenon termed "neural collapse" (NC) has been empirically observed in this training procedure. Specifically, it has been shown that the learned features (the output of the penultimate layer) of within-class samples converge to their mean, and the means of different classes exhibit a certain tight frame structure, which is also aligned with the last layer's weights. Recent papers have shown that minimizers with this structure emerge when optimizing a simplified "unconstrained features model" (UFM) with a regularized cross-entropy loss. In this paper, we further analyze and extend the UFM. First, we study the UFM for the regularized MSE loss, and show that the minimizers' features can have a more delicate structure than in the cross-entropy case. This also affects the structure of the weights. Then, we extend the UFM by adding another layer of weights as well as a ReLU nonlinearity to the model and generalize our previous results. Finally, we empirically demonstrate the usefulness of our nonlinear extended UFM in modeling the NC phenomenon that occurs with practical networks.
    What Dense Graph Do You Need for Self-Attention? (arXiv:2205.14014v5 [cs.LG] UPDATED)
    Transformers have made progress on miscellaneous tasks but suffer from quadratic computational and memory complexity. Recent works propose sparse Transformers that attend over sparse graphs to reduce complexity while retaining strong performance. While effective, how dense a graph needs to be to perform well is not fully explored. In this paper, we propose Normalized Information Payload (NIP), a graph scoring function measuring information transfer on a graph, which provides an analysis tool for trade-offs between performance and complexity. Guided by this theoretical analysis, we present Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows comparable or even better results than the vanilla Transformer while yielding $O(N\log N)$ complexity with sequence length $N$. Experiments on tasks requiring various sequence lengths validate our graph scoring function.
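    A sketch of the hypercube connectivity (our reading of the construction, with helper names our own): tokens i and j are allowed to attend to each other when their indices differ in exactly one bit, which yields O(N log N) allowed pairs.

        import torch

        def hypercube_attention_mask(n):
            # i may attend to j iff i XOR j is a power of two (indices differ in
            # exactly one bit); self-attention is kept on the diagonal.
            idx = torch.arange(n)
            xor = idx[:, None] ^ idx[None, :]
            is_pow2 = (xor > 0) & ((xor & (xor - 1)) == 0)
            return is_pow2 | torch.eye(n, dtype=torch.bool)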
    Generalization Error Bounds on Deep Learning with Markov Datasets. (arXiv:2201.11059v4 [stat.ML] UPDATED)
    In this paper, we derive upper bounds on generalization errors for deep neural networks with Markov datasets. These bounds are developed based on Koltchinskii and Panchenko's approach for bounding the generalization error of combined classifiers with i.i.d. datasets. The development of new symmetrization inequalities in high-dimensional probability for Markov chains is a key element in our extension, where the spectral gap of the infinitesimal generator of the Markov chain serves as a key parameter in these inequalities. We also propose a simple method to convert these bounds and other similar ones in traditional deep learning and machine learning to Bayesian counterparts for both i.i.d. and Markov datasets. Extensions to $m$-order homogeneous Markov chains such as AR and ARMA models and mixtures of several Markov data sources are given.
    Active Learning Through a Covering Lens. (arXiv:2205.11320v2 [cs.LG] UPDATED)
    Deep active learning aims to reduce the annotation cost for the training of deep models, which are notoriously data-hungry. Until recently, deep active learning methods were ineffectual in the low-budget regime, where only a small number of examples are annotated. The situation has been alleviated by recent advances in representation and self-supervised learning, which endow the geometry of the data representation with rich information about the points. Taking advantage of this progress, we study the problem of subset selection for annotation through a "covering" lens, proposing ProbCover, a new active learning algorithm for the low-budget regime which seeks to maximize Probability Coverage. We then describe a dual way to view the proposed formulation, from which one can derive strategies suitable for the high-budget regime of active learning, related to existing methods like Coreset. We conclude with extensive experiments, evaluating ProbCover in the low-budget regime. We show that our principled active learning strategy improves the state of the art in the low-budget regime on several image recognition benchmarks. This method is especially beneficial in the semi-supervised setting, allowing state-of-the-art semi-supervised methods to match the performance of fully supervised methods while using far fewer labels. Code is available at https://github.com/avihu111/TypiClust.
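    A hedged sketch of the greedy covering idea behind ProbCover (the ball radius and exact objective are simplified here): repeatedly select the point whose radius-r ball covers the most still-uncovered points in representation space.

        import numpy as np

        def probcover_select(embeddings, budget, radius):
            # covers[i, j] is True when i's ball of the given radius contains j.
            dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
            covers = dists < radius
            uncovered = np.ones(len(embeddings), dtype=bool)
            selected = []
            for _ in range(budget):
                gains = (covers & uncovered[None, :]).sum(axis=1)
                best = int(gains.argmax())       # point covering most uncovered points
                selected.append(best)
                uncovered &= ~covers[best]       # mark its ball as covered
            return selected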
    Unifying and Boosting Gradient-Based Training-Free Neural Architecture Search. (arXiv:2201.09785v2 [cs.LG] UPDATED)
    Neural architecture search (NAS) has gained immense popularity owing to its ability to automate neural architecture design. A number of training-free metrics are recently proposed to realize NAS without training, hence making NAS more scalable. Despite their competitive empirical performances, a unified theoretical understanding of these training-free metrics is lacking. As a consequence, (a) the relationships among these metrics are unclear, (b) there is no theoretical interpretation for their empirical performances, and (c) there may exist untapped potential in existing training-free NAS, which probably can be unveiled through a unified theoretical understanding. To this end, this paper presents a unified theoretical analysis of gradient-based training-free NAS, which allows us to (a) theoretically study their relationships, (b) theoretically guarantee their generalization performances, and (c) exploit our unified theoretical understanding to develop a novel framework named hybrid NAS (HNAS) which consistently boosts training-free NAS in a principled way. Remarkably, HNAS can enjoy the advantages of both training-free (i.e., the superior search efficiency) and training-based (i.e., the remarkable search effectiveness) NAS, which we have demonstrated through extensive experiments.
    Momentum Aggregation for Private Non-convex ERM. (arXiv:2210.06328v1 [cs.LG])
    We introduce new algorithms and convergence guarantees for privacy-preserving non-convex Empirical Risk Minimization (ERM) on smooth $d$-dimensional objectives. We develop an improved sensitivity analysis of stochastic gradient descent on smooth objectives that exploits the recurrence of examples in different epochs. By combining this new approach with recent analysis of momentum with private aggregation techniques, we provide an $(\epsilon,\delta)$-differential private algorithm that finds a gradient of norm $\tilde O\left(\frac{d^{1/3}}{(\epsilon N)^{2/3}}\right)$ in $O\left(\frac{N^{7/3}\epsilon^{4/3}}{d^{2/3}}\right)$ gradient evaluations, improving the previous best gradient bound of $\tilde O\left(\frac{d^{1/4}}{\sqrt{\epsilon N}}\right)$.
    Saliency Guided Experience Packing for Replay in Continual Learning. (arXiv:2109.04954v2 [cs.LG] UPDATED)
    Artificial learning systems aspire to mimic human intelligence by continually learning from a stream of tasks without forgetting past knowledge. One way to enable such learning is to store past experiences in the form of input examples in episodic memory and replay them when learning new tasks. However, the performance of such methods suffers as the memory size becomes smaller. In this paper, we propose a new approach for experience replay, where we select past experiences by looking at saliency maps, which provide visual explanations for the model's decisions. Guided by these saliency maps, we pack the memory with only the parts or patches of the input images that are important for the model's prediction. While learning a new task, we replay these memory patches with appropriate zero-padding to remind the model of its past decisions. We evaluate our algorithm on the CIFAR-100, miniImageNet and CUB datasets and report better performance than state-of-the-art approaches. With qualitative and quantitative analyses, we show that our method captures richer summaries of past experiences without any memory increase, and hence performs well with small episodic memory.
    Guaranteed Nonlinear Tracking in the Presence of DNN-Learned Dynamics With Contraction Metrics and Disturbance Estimation. (arXiv:2112.08222v4 [eess.SY] UPDATED)
    This paper presents an approach to trajectory-centric learning control based on contraction metrics and disturbance estimation for nonlinear systems subject to matched uncertainties. The approach uses deep neural networks to learn uncertain dynamics while still providing guarantees of transient tracking performance throughout the learning phase. Within the proposed approach, a disturbance estimation law is adopted to estimate the pointwise value of the uncertainty, with pre-computable estimation error bounds (EEBs). The learned dynamics, the estimated disturbances, and the EEBs are then incorporated in a robust Riemann energy condition to compute the control law that guarantees exponential convergence of actual trajectories to desired ones throughout the learning phase, even when the learned model is poor. On the other hand, with improved accuracy, the learned model can help improve the robustness of the tracking controller, e.g., against input delays, and can be incorporated to plan better trajectories with improved performance, e.g., lower energy consumption and shorter travel time. The proposed framework is validated on a planar quadrotor example.
    Collage: Seamless Integration of Deep Learning Backends with Automatic Placement. (arXiv:2111.00655v2 [cs.LG] UPDATED)
    The strong demand for efficient and performant deployment of Deep Learning (DL) applications prompts the rapid development of a rich DL ecosystem. To keep up with this fast advancement, it is crucial for modern DL frameworks to efficiently integrate a variety of optimized tensor algebra libraries and runtimes as their backends and generate the fastest possible executable using these backends. However, current DL frameworks require significant manual effort and expertise to integrate every new backend while failing to unleash its full potential. Given the fast-evolving nature of the DL ecosystem, this manual approach often slows down continuous innovation across different layers: it prevents hardware vendors from quickly deploying their cutting-edge libraries; DL framework developers must repeatedly adjust their hand-coded rules to accommodate new versions of libraries; and machine learning practitioners need to wait for the integration of new technologies, often encountering unsatisfactory performance. In this paper, we propose Collage, a DL framework that offers seamless integration of DL backends. Collage provides an expressive backend registration interface that allows users to precisely specify the capability of various backends. By leveraging the specifications of available backends, Collage automatically searches for an optimized backend placement strategy for a given workload and execution environment. Our evaluation shows that Collage outperforms the best existing framework for each hardware by $1.26\times$, $1.43\times$, $1.40\times$ on average on NVIDIA's RTX 2070 GPU, V100 GPU, and Intel's Xeon 8259CL CPU, respectively. Collage has also been deployed in Apache TVM.
    Deep Hierarchical Super Resolution for Scientific Data. (arXiv:2107.00462v2 [eess.IV] UPDATED)
    We present a novel technique for hierarchical super resolution (SR) with neural networks (NNs), which upscales volumetric data represented with an octree data structure to a high-resolution uniform grid with minimal seam artifacts on octree node boundaries. Our method uses existing state-of-the-art SR models and adds flexibility to upscale input data with varying levels of detail across the domain, instead of only uniform grid data that are supported in previous approaches. The key is to use a hierarchy of SR NNs, each trained to perform 2x SR between two levels of detail, with a hierarchical SR algorithm that minimizes seam artifacts by starting from the coarsest level of detail and working up. We show that our hierarchical approach outperforms baseline interpolation and hierarchical upscaling methods, and demonstrate the usefulness of our proposed approach across three use cases including data reduction using hierarchical downsampling+SR instead of uniform downsampling+SR, computation savings for hierarchical finite-time Lyapunov exponent field calculation, and super-resolving low-resolution simulation results for a high-resolution approximation visualization.
    SHAQ: Incorporating Shapley Value Theory into Multi-Agent Q-Learning. (arXiv:2105.15013v5 [cs.LG] UPDATED)
    Value factorisation is a useful technique for multi-agent reinforcement learning (MARL) in the global reward game; however, its underlying mechanism is not yet fully understood. This paper studies a theoretical framework for value factorisation with interpretability via Shapley value theory. We generalise the Shapley value to the Markov convex game, calling it the Markov Shapley value (MSV), and apply it as a value factorisation method in the global reward game, which is obtained through the equivalence between the two games. Based on the properties of MSV, we derive the Shapley-Bellman optimality equation (SBOE) to evaluate the optimal MSV, which corresponds to an optimal joint deterministic policy. Furthermore, we propose the Shapley-Bellman operator (SBO), which is proved to solve the SBOE. With stochastic approximation and some transformations, a new MARL algorithm called Shapley Q-learning (SHAQ) is established, whose implementation is guided by the theoretical results for SBO and MSV. We also discuss the relationship between SHAQ and relevant value factorisation methods. In the experiments, SHAQ exhibits not only superior performance on all tasks but also interpretability that agrees with the theoretical analysis. The implementation of this paper is available at https://github.com/hsvgbkhgbv/shapley-q-learning.
    Meta-Reinforcement Learning with Self-Modifying Networks. (arXiv:2202.02363v3 [cs.LG] UPDATED)
    Deep Reinforcement Learning has demonstrated the potential of neural networks tuned with gradient descent for solving complex tasks in well-delimited environments. However, these neural systems are slow learners producing specialized agents with no mechanism to continue learning beyond their training curriculum. In contrast, biological synaptic plasticity is persistent and manifold, and has been hypothesized to play a key role in executive functions such as working memory and cognitive flexibility, potentially supporting more efficient and generic learning abilities. Inspired by this, we propose to build networks with dynamic weights, able to continually perform self-reflexive modification as a function of their current synaptic state and action-reward feedback, rather than relying on a fixed network configuration. The resulting model, MetODS (for Meta-Optimized Dynamical Synapses), is a broadly applicable meta-reinforcement learning system able to learn efficient and powerful control rules in the agent policy space. A single layer with dynamic synapses can perform one-shot learning, generalizes navigation principles to unseen environments, and manifests a strong ability to learn adaptive motor policies.
    MACE: A Flexible Framework for Membership Privacy Estimation in Generative Models. (arXiv:2009.05683v5 [cs.CR] UPDATED)
    Generative machine learning models are being increasingly viewed as a way to share sensitive data between institutions. While there has been work on developing differentially private generative modeling approaches, these approaches generally lead to sub-par sample quality, limiting their use in real-world applications. Another line of work has focused on developing generative models which lead to higher-quality samples but currently lack any formal privacy guarantees. In this work, we propose the first formal framework for membership privacy estimation in generative models. We formulate the membership privacy risk as a statistical divergence between training samples and hold-out samples, and propose sample-based methods to estimate this divergence. Compared to previous works, our framework makes more realistic and flexible assumptions. First, we offer a generalizable metric as an alternative to the accuracy metric, especially for imbalanced datasets. Second, we loosen the assumption of having full access to the underlying distribution made in previous studies, and propose sample-based estimations with theoretical guarantees. Third, along with the population-level membership privacy risk estimation via the optimal membership advantage, we offer individual-level estimation via the individual privacy risk. Fourth, our framework allows adversaries to access the trained model via a customized query, while prior works require specific attributes.
    Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers. (arXiv:2210.06313v1 [cs.LG])
    This paper studies the curious phenomenon that machine learning models with Transformer architectures have sparse activation maps. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to the MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with an infinite amount of data, demonstrating that sparsity is not the result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate, perhaps surprisingly, that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties to Transformers: less sensitivity to noisy training data, more robustness to input corruptions, and better calibration of their prediction confidence.
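    The Top-k thresholding mentioned at the end is straightforward to sketch; the following PyTorch helper keeps only the k largest post-ReLU activations per token (a generic rendering, not the paper's exact code):

        import torch

        def topk_relu(x, k):
            # Keep only the k largest post-ReLU activations along the last axis.
            x = torch.relu(x)
            kth = x.topk(k, dim=-1).values[..., -1:]     # k-th largest per row
            return torch.where(x >= kth, x, torch.zeros_like(x))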
    A Generalist Framework for Panoptic Segmentation of Images and Videos. (arXiv:2210.06366v1 [cs.CV])
    Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning a high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on the inductive bias of the task. A diffusion model based on analog bits is used to model panoptic masks, with a simple, generic architecture and loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our generalist approach can perform competitively with state-of-the-art specialist methods in similar settings.
    Determining band structure parameters of two-dimensional materials by deep learning. (arXiv:2210.06310v1 [cond-mat.mes-hall])
    The field of two-dimensional materials has mastered the fabrication and characterisation of a broad range of novel high-quality compounds that feature increasing complexity. Determination of the band structure parameters of such complex materials is a major ingredient required for quantitative theory. This task currently presents a formidable challenge: ab initio methods often do not provide quantitatively accurate values of parameters, whereas inferring band structure parameters from experiments is hindered by the complexity of the band structure and indirect nature of experimental probes. In this work we propose a general framework for determination of band structure parameters from experimental data based on deep neural networks. As a specific example we apply our method to the penetration field capacitance measurement of trilayer graphene that effectively probes its density of states. First, we demonstrate that a trained deep network gives accurate predictions for the penetration field capacitance as a function of tight-binding parameters. Next, we use the fast and accurate predictions from the trained network to automatically determine tight-binding parameters directly from experimental data, with extracted parameters being in a good agreement with values in the literature. We conclude by discussing potential applications of our method to other materials and experimental techniques beyond penetration field capacitance.
    Prediction intervals for neural network models using weighted asymmetric loss functions. (arXiv:2210.04318v2 [stat.ML] UPDATED)
    We develop a novel and simple method to produce prediction intervals (PIs) for fitting and forecasting exercises. It finds the lower and upper bounds of the interval by minimising a weighted asymmetric loss function, where the weight depends on the width of the interval. We give a short mathematical proof. As a corollary of the proof, we find PIs for values restricted to a parameterised function and argue why the method works for predicting PIs of dependent variables. The results of applying the method to a neural network deployed in a real-world forecasting task demonstrate the validity of its practical implementation in complex machine learning setups.
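    The abstract does not state the exact loss, so the following PyTorch sketch only conveys the flavor: asymmetric penalties for observations falling outside the interval combined with a weight on interval width; the specific functional form and parameter names are our assumptions.

        import torch

        def asymmetric_pi_loss(lower, upper, y, alpha=0.9, width_weight=0.1):
            # Penalize targets that escape the interval (asymmetrically weighted),
            # plus a width-dependent term discouraging trivially wide intervals.
            above = torch.relu(y - upper)        # target above the upper bound
            below = torch.relu(lower - y)        # target below the lower bound
            coverage = alpha * (above + below).mean()
            width = width_weight * (upper - lower).abs().mean()
            return coverage + width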
    Smart Cup: An impedance sensing based fluid intake monitoring system for beverages classification and freshness detection. (arXiv:2210.06285v1 [eess.SP])
    This paper presents a novel beverage intake monitoring system that can accurately recognize beverage kinds and freshness. By mounting carbon electrodes on the commercial cup, the system measures the electrochemical impedance spectrum of the fluid in the cup. We studied the frequency sensitivity of the electrochemical impedance spectrum regarding distinct beverages and the importance of features like amplitude, phase, and real and imaginary components for beverage classification. The results show that features from a low-frequency domain (100 Hz to 1000 Hz) provide more meaningful information for beverage classification than the higher frequency domain. Twenty beverages, including carbonated drinks and juices, were classified with nearly perfect accuracy using a supervised machine learning approach. The same performance was also observed in the freshness recognition, where four different kinds of milk and fruit juice were studied.
    SQuId: Measuring Speech Naturalness in Many Languages. (arXiv:2210.06324v1 [cs.CL])
    Much of text-to-speech research relies on human evaluation, which incurs heavy costs and slows down the development process. The problem is particularly acute in heavily multilingual applications, where recruiting and polling judges can take weeks. We introduce SQuId (Speech Quality Identification), a multilingual naturalness prediction model trained on over a million ratings and tested in 65 locales, the largest effort of this type to date. The main insight is that training one model on many locales consistently outperforms mono-locale baselines. We present our task and model, and show that it outperforms a competitive baseline based on w2v-BERT and VoiceMOS by 50.0%. We then demonstrate the effectiveness of cross-locale transfer during fine-tuning and highlight its effect on zero-shot locales, i.e., locales for which there is no fine-tuning data. Through a series of analyses, we highlight the role of non-linguistic effects such as sound artifacts in cross-locale transfer. Finally, we present the effects of our design decisions, e.g., model size, pre-training diversity, and language rebalancing, with several ablation experiments.
    CoRRECT: A Deep Unfolding Framework for Motion-Corrected Quantitative R2* Mapping. (arXiv:2210.06330v1 [eess.IV])
    Quantitative MRI (qMRI) refers to a class of MRI methods for quantifying the spatial distribution of biological tissue parameters. Traditional qMRI methods usually deal separately with artifacts arising from accelerated data acquisition, involuntary physical motion, and magnetic-field inhomogeneities, leading to suboptimal end-to-end performance. This paper presents CoRRECT, a unified deep unfolding (DU) framework for qMRI consisting of a model-based end-to-end neural network, a method for motion-artifact reduction, and a self-supervised learning scheme. The network is trained to produce R2* maps whose k-space data matches the real data by also accounting for motion and field inhomogeneities. When deployed, CoRRECT only uses the k-space data without any pre-computed parameters for motion or inhomogeneity correction. Our results on experimentally collected multi-Gradient-Recalled Echo (mGRE) MRI data show that CoRRECT recovers motion and inhomogeneity artifact-free R2* maps in highly accelerated acquisition settings. This work opens the door to DU methods that can integrate physical measurement models, biophysical signal models, and learned prior models for high-quality qMRI.
    Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions. (arXiv:2204.10022v4 [cs.LG] UPDATED)
    Estimating the effects of continuous-valued interventions from observational data is a critically important task for climate science, healthcare, and economics. Recent work focuses on designing neural network architectures and regularization functions to allow for scalable estimation of average and individual-level dose-response curves from high-dimensional, large-sample data. Such methodologies assume ignorability (observation of all confounding variables) and positivity (observation of all treatment levels for every covariate value describing a set of units), assumptions problematic in the continuous treatment regime. Scalable sensitivity and uncertainty analyses to understand the ignorance induced in causal estimates when these assumptions are relaxed are less studied. Here, we develop a continuous treatment-effect marginal sensitivity model (CMSM) and derive bounds that agree with the observed data and a researcher-defined level of hidden confounding. We introduce a scalable algorithm and uncertainty-aware deep models to derive and estimate these bounds for high-dimensional, large-sample observational data. We work in concert with climate scientists interested in the climatological impacts of human emissions on cloud properties using satellite observations from the past 15 years. This problem is known to be complicated by many unobserved confounders.
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v2 [cs.LG] UPDATED)
    Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as Brain-Computer Interface (BCI) or Human-Computer Interface (HCI) applications. We address this by creating an algorithm and analysis which allow IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model. (arXiv:2208.10458v2 [cs.LG] UPDATED)
    This paper studies multi-agent reinforcement learning in Markov games, with the goal of learning Nash equilibria or coarse correlated equilibria (CCE) sample-optimally. All prior results suffer from at least one of the two obstacles: the curse of multiple agents and the barrier of long horizon, regardless of the sampling protocol in use. We take a step towards settling this problem, assuming access to a flexible sampling mechanism: the generative model. Focusing on non-stationary finite-horizon Markov games, we develop a fast learning algorithm and an adaptive sampling scheme that leverage the optimism principle in online adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method). Our algorithm learns an $\varepsilon$-approximate CCE in a general-sum Markov game using $$ \widetilde{O}\bigg( \frac{H^4 S \sum_{i=1}^m A_i}{\varepsilon^2} \bigg) $$ samples, where $m$ is the number of players, $S$ indicates the number of states, $H$ is the horizon, and $A_i$ denotes the number of actions for the $i$-th player. This is minimax-optimal (up to log factor) when the number of players is fixed. When applied to two-player zero-sum Markov games, our algorithm provably finds an $\varepsilon$-approximate Nash equilibrium with minimal samples. Along the way, we derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities, which might be of independent interest.
    Feasible and Desirable Counterfactual Generation by Preserving Human Defined Constraints. (arXiv:2210.05993v1 [cs.LG])
    We present a human-in-the-loop approach to generate counterfactual (CF) explanations that preserve global and local feasibility constraints. Global feasibility constraints refer to the causal constraints that are necessary for generating actionable CF explanations. Assuming a domain expert with knowledge of unary and binary causal constraints, our approach efficiently employs this knowledge to generate CF explanations by rejecting gradient steps that violate these constraints. Local feasibility constraints encode the end-user's constraints for generating desirable CF explanations. We extract these constraints from the end-user of the model and exploit them during CF generation via a user-defined distance metric. Through user studies, we demonstrate that incorporating causal constraints during CF generation results in significantly better explanations in terms of feasibility and desirability for participants. Adopting local and global feasibility constraints simultaneously, although it improves user satisfaction, does not significantly improve desirability for participants compared to incorporating global constraints alone.
    The evolution of AI approaches for motor imagery EEG-based BCIs. (arXiv:2210.06290v1 [eess.SP])
    Motor Imagery (MI) electroencephalography (EEG) based Brain-Computer Interfaces (BCIs) allow direct communication between humans and machines by exploiting the neural pathways connected to motor imagination. Therefore, these systems open the possibility of developing applications that span from the medical field to the entertainment industry. In this context, Artificial Intelligence (AI) approaches become of fundamental importance, especially for providing correct and coherent feedback to BCI users. Moreover, publicly available datasets in the field of MI EEG-based BCIs have been widely exploited to test new techniques from the AI domain. In this work, AI approaches applied to datasets collected in different years and with different devices, but with coherent experimental paradigms, are investigated with the aim of providing a concise yet sufficiently comprehensive survey of the evolution and influence of AI techniques on MI EEG-based BCI data.
    Finite Sample Analysis Of Dynamic Regression Parameter Learning. (arXiv:1906.05591v4 [cs.LG] UPDATED)
    We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with a non-constant observation operator, where the parameters that need to be learned are the variances of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, existing literature does not provide any clues to the following fundamental question: In terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub-Gaussian distributions of noise terms. We evaluate the approach on synthetic and real-world benchmarks.
    Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. (arXiv:2202.04538v2 [cs.CL] UPDATED)
    Pretrained language models (PLMs) have demonstrated remarkable performance in various natural language processing tasks: Unidirectional PLMs (e.g., GPT) are well known for their superior text generation capabilities; bidirectional PLMs (e.g., BERT) have been the prominent choice for natural language understanding (NLU) tasks. While both types of models have achieved promising few-shot learning performance, their potential for zero-shot learning has been underexplored. In this paper, we present a simple approach that uses both types of PLMs for fully zero-shot learning of NLU tasks without requiring any task-specific data: A unidirectional PLM generates class-conditioned texts guided by prompts, which are used as the training data for fine-tuning a bidirectional PLM. With quality training data selected based on the generation probability and regularization techniques (label smoothing and temporal ensembling) applied to the fine-tuning stage for better generalization and stability, our approach demonstrates strong performance across seven classification tasks of the GLUE benchmark (e.g., 72.3/73.8 on MNLI-m/mm and 92.8 on SST-2), significantly outperforming zero-shot prompting methods and achieving even comparable results to strong few-shot approaches using 32 training samples per class.
    Effects of Safety State Augmentation on Safe Exploration. (arXiv:2206.02675v2 [cs.LG] UPDATED)
    Safe exploration is a challenging and important problem in model-free reinforcement learning (RL). Often the safety cost is sparse and unknown, which unavoidably leads to constraint violations -- a phenomenon ideally to be avoided in safety-critical applications. We tackle this problem by augmenting the state space with a safety state, which is non-negative if and only if the constraint is satisfied. The value of this state also serves as a distance toward constraint violation, while its initial value indicates the available safety budget. This idea allows us to derive policies for scheduling the safety budget during training. We call our approach Simmer (Safe policy IMproveMEnt for RL) to reflect the careful nature of these schedules. We apply this idea to two safe RL problems: RL with constraints imposed on an average cost, and RL with constraints imposed on a cost with probability one. Our experiments suggest that "simmering" a safe algorithm can improve safety during training in both settings. We further show that Simmer can stabilize training and improve the performance of safe RL with average constraints.
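    A minimal sketch of the safety-state augmentation as a gym-style wrapper (the interface and cost bookkeeping are our assumptions): the observation is extended with the remaining safety budget, which stays non-negative iff the cumulative cost respects the constraint.

        import numpy as np

        class SafetyAugmentedEnv:
            # Gym-style wrapper: append the remaining safety budget to the
            # observation; it is non-negative iff the constraint is satisfied.
            def __init__(self, env, safety_budget):
                self.env = env
                self.budget0 = safety_budget

            def reset(self):
                self.budget = self.budget0
                return np.append(self.env.reset(), self.budget)

            def step(self, action):
                obs, reward, done, info = self.env.step(action)
                self.budget -= info.get("cost", 0.0)  # spend budget on incurred cost
                return np.append(obs, self.budget), reward, done, info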
    CAT-probing: A Metric-based Approach to Interpret How Pre-trained Models for Programming Language Attend Code Structure. (arXiv:2210.04633v2 [cs.SE] UPDATED)
    Code pre-trained models (CodePTMs) have recently demonstrated significant success in code intelligence. To interpret these models, some probing methods have been applied. However, these methods fail to consider the inherent characteristics of codes. In this paper, to address the problem, we propose a novel probing method CAT-probing to quantitatively interpret how CodePTMs attend code structure. We first denoise the input code sequences based on the token types pre-defined by the compilers to filter those tokens whose attention scores are too small. After that, we define a new metric CAT-score to measure the commonality between the token-level attention scores generated in CodePTMs and the pair-wise distances between corresponding AST nodes. The higher the CAT-score, the stronger the ability of CodePTMs to capture code structure. We conduct extensive experiments to integrate CAT-probing with representative CodePTMs for different programming languages. Experimental results show the effectiveness of CAT-probing in CodePTM interpretation. Our codes and data are publicly available at https://github.com/nchen909/CodeAttention.
    Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia. (arXiv:2210.06353v1 [cs.CL])
    Corpora that contain tabular data such as WebTables are a vital resource for the academic community. Essentially, they are the backbone of any modern research in information management. They are used for various tasks of data extraction, knowledge base construction, question answering, column semantic type detection, and many others. Such corpora are useful not only as a source of data, but also as a basis for building test datasets. So far, there have been no such corpora for the Russian language, and this has seriously hindered research in the aforementioned areas. In this paper, we present the first corpus of Web tables created specifically from Russian language material. It was built via a special toolkit we developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia tables and their statistics.
    On the Generalizability of ECG-based Stress Detection Models. (arXiv:2210.06225v1 [cs.LG])
    Stress is prevalent in many aspects of everyday life including work, healthcare, and social interactions. Many works have studied handcrafted features from various bio-signals that are indicators of stress. Recently, deep learning models have also been proposed to detect stress. Typically, stress models are trained and validated on the same dataset, often involving one stressful scenario. However, it is not practical to collect stress data for every scenario. So, it is crucial to study the generalizability of these models and determine to what extent they can be used in other scenarios. In this paper, we explore the generalization capabilities of Electrocardiogram (ECG)-based deep learning models and models based on handcrafted ECG features, i.e., Heart Rate Variability (HRV) features. To this end, we train three HRV models and two deep learning models that use ECG signals as input. We use ECG signals from two popular stress datasets - WESAD and SWELL-KW - differing in terms of stressors and recording devices. First, we evaluate the models using leave-one-subject-out (LOSO) cross-validation using training and validation samples from the same dataset. Next, we perform a cross-dataset validation of the models, that is, LOSO models trained on the WESAD dataset are validated using SWELL-KW samples and vice versa. While deep learning models achieve the best results on the same dataset, models based on HRV features considerably outperform them on data from a different dataset. This trend is observed for all the models on both datasets. Therefore, HRV models are a better choice for stress recognition in applications that are different from the dataset scenario. To the best of our knowledge, this is the first work to compare the cross-dataset generalizability between ECG-based deep learning models and HRV models.
    Scalable and Privacy-enhanced Graph Generative Model for Graph Neural Networks. (arXiv:2207.04396v2 [cs.LG] UPDATED)
    As the field of Graph Neural Networks (GNN) continues to grow, it experiences a corresponding increase in the need for large, real-world datasets to train and test new GNN models on challenging, realistic problems. Unfortunately, such graph datasets are often generated from online, highly privacy-restricted ecosystems, which makes research and development on these datasets hard, if not impossible. This greatly reduces the amount of benchmark graphs available to researchers, causing the field to rely only on a handful of publicly-available datasets. To address this dilemma, we introduce a novel graph generative model, Computation Graph Transformer (CGT) that can learn and reproduce the distribution of real-world graphs in a privacy-enhanced way. Our proposed model (1) generates effective benchmark graphs on which GNNs show similar task performance as on the source graphs, (2) scales to process large-scale real-world graphs, (3) guarantees privacy for end-users. Extensive experiments across a vast body of graph generative models show that only our model can successfully generate privacy-controlled, synthetic substitutes of large-scale real-world graphs that can be effectively used to evaluate GNN models.
    A review on Epileptic Seizure Detection using Machine Learning. (arXiv:2210.06292v1 [eess.SP])
    Epilepsy is a life-threatening neurological brain disorder that gives rise to recurrent, unprovoked seizures. It occurs due to abnormal chemical changes in the brain. Over the course of many years, studies have been conducted to support the automatic diagnosis of epileptic seizures for the ease of clinicians. To that end, several studies entail the use of machine learning methods for the early prediction of epileptic seizures. Mainly, feature extraction methods have been used to extract the right features from the EEG data generated by the EEG machine, and then various machine learning classifiers are used for the classification process. This study provides a systematic literature review of the feature selection process as well as classification performance. It was limited to finding the most used feature extraction methods and the classifiers used for accurate classification of normal versus epileptic seizures. The existing literature was examined from well-known repositories such as MDPI, IEEE Xplore, Wiley, Elsevier, ACM, SpringerLink and others. Furthermore, a taxonomy was created that recapitulates the state-of-the-art solutions for this problem. We also studied the nature of different benchmark and unbiased datasets and gave a rigorous analysis of the working of classifiers. Finally, we concluded the research by presenting the gaps, challenges and opportunities that can further help researchers in the prediction of epileptic seizures.
    Fundamental limits and algorithms for sparse linear regression with sublinear sparsity. (arXiv:2101.11156v5 [cs.IT] UPDATED)
    We establish exact asymptotic expressions for the normalized mutual information and minimum mean-square-error (MMSE) of sparse linear regression in the sub-linear sparsity regime. Our result is achieved by a generalization of the adaptive interpolation method in Bayesian inference for linear regimes to sub-linear ones. A modification of the well-known approximate message passing algorithm to approach the MMSE fundamental limit is also proposed, and its state evolution is rigorously analyzed. Our results show that the traditional linear assumption between the signal dimension and number of observations in the replica and adaptive interpolation methods is not necessary for sparse signals. They also show how to modify the existing well-known AMP algorithms for linear regimes to sub-linear ones.
    Multilingual CheckList: Generation and Evaluation. (arXiv:2203.12865v3 [cs.CL] UPDATED)
    Multilingual evaluation benchmarks usually contain limited high-resource languages and do not test models for specific linguistic capabilities. CheckList is a template-based evaluation approach that tests models for specific capabilities. The CheckList template creation process requires native speakers, posing a challenge in scaling to hundreds of languages. In this work, we explore multiple approaches to generate Multilingual CheckLists. We devise an algorithm, the Template Extraction Algorithm (TEA), for automatically extracting target-language CheckList templates from machine-translated instances of source-language templates. We compare the TEA CheckLists with CheckLists created with different levels of human intervention. We further introduce metrics along the dimensions of cost, diversity, utility, and correctness to compare the CheckLists. We thoroughly analyze different approaches to creating CheckLists in Hindi. Furthermore, we experiment with 9 more languages. We find that TEA followed by human verification is ideal for scaling CheckList-based evaluation to multiple languages, while TEA alone gives good estimates of model performance.
    NAGphormer: A Tokenized Graph Transformer for Node Classification in Large Graphs. (arXiv:2206.04910v2 [cs.LG] UPDATED)
    The graph Transformer emerges as a new architecture and has shown superior performance on various graph mining tasks. In this work, we observe that existing graph Transformers treat nodes as independent tokens and construct a single long sequence composed of all node tokens to train the Transformer model, making it hard to scale to large graphs due to the quadratic complexity of the self-attention computation in the number of nodes. To this end, we propose a Neighborhood Aggregation Graph Transformer (NAGphormer) that treats each node as a sequence containing a series of tokens constructed by our proposed Hop2Token module. For each node, Hop2Token aggregates the neighborhood features from different hops into different representations, thereby producing a sequence of token vectors as one input. In this way, NAGphormer can be trained in a mini-batch manner and thus can scale to large graphs. Moreover, we mathematically show that, compared to a category of advanced Graph Neural Networks (GNNs), the decoupled Graph Convolutional Network, NAGphormer can learn more informative node representations from multi-hop neighborhoods. Extensive experiments on benchmark datasets from small to large demonstrate that NAGphormer consistently outperforms existing graph Transformers and mainstream GNNs.
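    A compact sketch of the Hop2Token idea as described, with the aggregation operator simplified to powers of a normalized adjacency (our simplification): each node becomes a short sequence of per-hop tokens that a Transformer can consume in mini-batches.

        import torch

        def hop2token(x, adj, num_hops):
            # Token 0 is the node's own features; token k aggregates its k-hop
            # neighborhood via repeated multiplication with a normalized adjacency.
            tokens = [x]
            h = x
            for _ in range(num_hops):
                h = adj @ h
                tokens.append(h)
            return torch.stack(tokens, dim=1)    # [num_nodes, num_hops + 1, dim]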
    Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality. (arXiv:2208.03848v2 [cs.IT] UPDATED)
    Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.
    TabPFN: A Transformer That Solves Small Tabular Classification Problems in a Second. (arXiv:2207.01848v3 [cs.LG] UPDATED)
    We present TabPFN, a trained Transformer that can do supervised classification for small tabular datasets in less than a second, needs no hyperparameter tuning and is competitive with state-of-the-art classification methods. TabPFN is fully entailed in the weights of our network, which accepts training and test samples as a set-valued input and yields predictions for the entire test set in a single forward pass. TabPFN is a Prior-Data Fitted Network (PFN) and is trained offline once, to approximate Bayesian inference on synthetic datasets drawn from our prior. This prior incorporates ideas from causal reasoning: It entails a large space of structural causal models with a preference for simple structures. On 30 datasets from the OpenML-CC18 suite, we show that our method clearly outperforms boosted trees and performs on par with complex state-of-the-art AutoML systems with up to 70$\times$ speedup. This increases to a 3200$\times$ speedup when a GPU is available. We provide all our code, the trained TabPFN, an interactive browser demo and a Colab notebook at https://github.com/automl/TabPFN.
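    Based on the linked repository's README, usage is sklearn-like; the snippet below reflects the interface at the time of writing and may change:

        from sklearn.datasets import load_iris
        from sklearn.model_selection import train_test_split
        from tabpfn import TabPFNClassifier

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

        clf = TabPFNClassifier(device="cpu")   # no hyperparameter tuning needed
        clf.fit(X_train, y_train)              # "fit" essentially stores the data
        print("accuracy:", (clf.predict(X_test) == y_test).mean())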
    P2P: Tuning Pre-trained Image Models for Point Cloud Analysis with Point-to-Pixel Prompting. (arXiv:2208.02812v2 [cs.CV] UPDATED)
    Nowadays, pre-training big models on large-scale datasets has become a crucial topic in deep learning. Pre-trained models with high representation ability and transferability have achieved great success and dominate many downstream tasks in natural language processing and 2D vision. However, it is non-trivial to extend such a pretraining-tuning paradigm to 3D vision, given the limited training data, which is relatively inconvenient to collect. In this paper, we provide a new perspective of leveraging pre-trained 2D knowledge in the 3D domain to tackle this problem, tuning pre-trained image models with our novel Point-to-Pixel prompting for point cloud analysis at a minor parameter cost. Following the principle of prompt engineering, we transform point clouds into colorful images with geometry-preserved projection and geometry-aware coloring to adapt to pre-trained image models, whose weights are kept frozen during the end-to-end optimization of point cloud analysis tasks. We conduct extensive experiments to demonstrate that, cooperating with our proposed Point-to-Pixel Prompting, a better pre-trained image model leads to consistently better performance in 3D vision. Enjoying the prosperous development of the image pre-training field, our method attains 89.3% accuracy on the hardest setting of ScanObjectNN, surpassing conventional point cloud models with far fewer trainable parameters. Our framework also exhibits very competitive performance on ModelNet classification and ShapeNet Part Segmentation. Code is available at https://github.com/wangzy22/P2P.
    Towards Real-Time Temporal Graph Learning. (arXiv:2210.04114v2 [cs.LG] UPDATED)
    In recent years, graph representation learning has gained significant popularity; it aims to generate node embeddings that capture features of graphs. One way to achieve this is to employ a technique called random walks, which captures node sequences in a graph and then learns embeddings for each node using a natural language processing technique called Word2Vec. These embeddings are then used for deep learning on graph data for classification tasks such as link prediction or node classification. Prior work operates on pre-collected temporal graph data and is not designed to handle updates on a graph in real time. Real-world graphs change dynamically, and their entire temporal updates are not available upfront. In this paper, we propose an end-to-end graph learning pipeline that performs temporal graph construction, creates low-dimensional node embeddings, and trains multi-layer neural network models in an online setting. The training of the neural network models is identified as the main performance bottleneck, as it performs repeated matrix operations on many sequentially connected low-dimensional kernels. We propose to unlock fine-grained parallelism in these low-dimensional kernels to boost the performance of model training.
    Towards Theoretically Inspired Neural Initialization Optimization. (arXiv:2210.05956v1 [cs.LG])
    Automated machine learning has been widely explored to reduce human effort in designing neural architectures and looking for proper hyperparameters. In the domain of neural initialization, however, similar automated techniques have rarely been studied. Most existing initialization methods are handcrafted and highly dependent on specific architectures. In this paper, we propose a differentiable quantity, named GradCosine, with theoretical insights to evaluate the initial state of a neural network. Specifically, GradCosine is the cosine similarity of sample-wise gradients with respect to the initialized parameters. By analyzing the sample-wise optimization landscape, we show that both the training and test performance of a network can be improved by maximizing GradCosine under a gradient norm constraint. Based on this observation, we further propose the neural initialization optimization (NIO) algorithm. Generalizing the sample-wise analysis to the real batch setting, NIO automatically searches for a better initialization at negligible cost relative to the training time. With NIO, we improve the classification performance of a variety of neural architectures on CIFAR-10, CIFAR-100, and ImageNet. Moreover, we find that our method can even help to train large vision Transformer architectures without warmup.
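    For concreteness, a sketch of the GradCosine quantity at initialization: the mean pairwise cosine similarity of per-sample gradients. The paper's batch-level generalization and the gradient-norm constraint are omitted here:

        import torch
        import torch.nn.functional as F

        def grad_cosine(model, loss_fn, xs, ys):
            """Mean pairwise cosine similarity of sample-wise gradients."""
            grads = []
            for x, y in zip(xs, ys):
                loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
                g = torch.autograd.grad(loss, list(model.parameters()))
                grads.append(torch.cat([p.reshape(-1) for p in g]))
            G = torch.stack(grads)                                 # (n, num_params)
            C = F.cosine_similarity(G.unsqueeze(0), G.unsqueeze(1), dim=-1)
            n = G.shape[0]
            return (C.sum() - n) / (n * (n - 1))                   # drop self-similarity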
    A Momentum Accelerated Adaptive Cubic Regularization Method for Nonconvex Optimization. (arXiv:2210.05987v1 [math.OC])
    The cubic regularization method (CR) and its adaptive version (ARC) are popular Newton-type methods for solving unconstrained non-convex optimization problems, due to their global convergence to local minima under mild conditions. The main aim of this paper is to develop a momentum-accelerated adaptive cubic regularization method (ARCm) to improve convergence performance. With a proper choice of momentum step size, we show the global convergence of ARCm, and local convergence can also be guaranteed under the Kurdyka-Łojasiewicz (KL) property. Such global and local convergence can also be established when inexact solvers with low computational costs are employed in the iteration procedure. Numerical results for non-convex logistic regression and robust linear regression models demonstrate that the proposed ARCm significantly outperforms state-of-the-art cubic regularization methods (e.g., CR, momentum-based CR, ARC) and the trust region method. In particular, the number of iterations required by ARCm is only 10\% to 50\% of that required by the most competitive method (ARC) in the experiments.
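    For intuition, each outer iteration of a cubic-regularization method minimizes a local cubic model $m(s) = g^\top s + \tfrac{1}{2} s^\top H s + \tfrac{\sigma}{3}\|s\|^3$. Below is a sketch with an inexact first-order subproblem solver and an assumed placement of the momentum term; the paper's adaptive $\sigma$ and momentum step-size rules are omitted:

        import numpy as np

        def cubic_step(g, H, sigma, iters=100, lr=0.05):
            """Inexactly minimize m(s) = g's + 0.5 s'Hs + (sigma/3)||s||^3
            by gradient descent (a cheap, inexact subproblem solver)."""
            s = np.zeros_like(g)
            for _ in range(iters):
                s -= lr * (g + H @ s + sigma * np.linalg.norm(s) * s)
            return s

        def arcm_sketch(x, grad, hess, sigma=1.0, beta=0.5, steps=50):
            """Momentum-accelerated CR loop; the momentum form is assumed."""
            v = np.zeros_like(x)
            for _ in range(steps):
                s = cubic_step(grad(x), hess(x), sigma)
                v = beta * v + s   # momentum on the regularized Newton step
                x = x + v
            return x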
    Anomaly Detection using Generative Models and Sum-Product Networks in Mammography Scans. (arXiv:2210.06188v1 [cs.CV])
    Unsupervised anomaly detection models, which are trained solely on healthy data, have gained importance in recent years, as the annotation of medical data is a tedious task. Autoencoders and generative adversarial networks are the standard anomaly detection methods used to learn the data distribution. However, they fall short when it comes to inference and evaluation of the likelihood of test samples. We propose a novel combination of generative models and a probabilistic graphical model. After encoding image samples with autoencoders, the distribution of the data is modeled by Random and Tensorized Sum-Product Networks, ensuring exact and efficient inference at test time. We evaluate different autoencoder architectures in combination with Random and Tensorized Sum-Product Networks on mammography images using patch-wise processing and observe performance superior both to using the models standalone and to the state of the art in anomaly detection for medical data.
    Projective Transformation Rectification for Camera-captured Chest X-ray Photograph Interpretation with Synthetic Data. (arXiv:2210.05954v1 [cs.CV])
    Automatic interpretation of smartphone-captured chest X-ray (CXR) photographs is challenging due to the geometric distortion (projective transformation) caused by non-ideal camera positions. In this paper, we propose an innovative deep learning-based Projective Transformation Rectification Network (PTRN) to automatically rectify such distortions by predicting the projective transformation matrix. PTRN is trained on synthetic data to avoid the expensive collection of natural data. To this end, we propose an innovative synthetic data framework that accounts for the visual attributes of natural photographs, including screen, background, illumination, and visual artifacts, and generates synthetic CXR photographs and projective transformation matrices as ground-truth labels for training PTRN. Finally, smartphone-captured CXR photographs are automatically rectified by the trained PTRN and interpreted by a classifier trained on high-quality digital CXRs to produce the final interpretation results. In the CheXphoto CXR photograph interpretation competition released by the Stanford University Machine Learning Group, our approach achieves a large performance improvement and won first place (AUC: ours 0.850, second best 0.762). A deeper analysis demonstrates that PTRN brings the performance on CXR photographs to the same level as on digital CXRs, indicating that PTRN eliminates the negative impact of projective transformation on interpretation performance. Additionally, in the many real-world scenarios where distorted photographs must be used for image classification, our PTRN can be applied thanks to its general design.
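    Given a predicted 3x3 projective transformation matrix, the rectification itself is a standard perspective warp. A minimal sketch with OpenCV, assuming the matrix M maps the distorted photograph to the rectified view:

        import cv2
        import numpy as np

        def rectify(photo: np.ndarray, M: np.ndarray) -> np.ndarray:
            """Undo the projective distortion with the predicted homography M."""
            h, w = photo.shape[:2]
            return cv2.warpPerspective(photo, M, (w, h))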
    Fast Bayesian Updates for Deep Learning with a Use Case in Active Learning. (arXiv:2210.06112v1 [cs.LG])
    Retraining deep neural networks when new data arrives is typically computationally expensive. Moreover, certain applications do not allow such costly retraining due to time or computational constraints. Fast Bayesian updates are a possible solution to this issue. Therefore, we propose a Bayesian update based on Monte-Carlo samples and a last-layer Laplace approximation for different Bayesian neural network types, i.e., Dropout, Ensemble, and Spectral Normalized Neural Gaussian Process (SNGP). In a large-scale evaluation study, we show that our updates combined with SNGP represent a fast and competitive alternative to costly retraining. As a use case, we combine the Bayesian updates for SNGP with different sequential query strategies to demonstrate their improved selection performance in active learning.
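    To see why last-layer updates can replace retraining, consider the simplest case: if the last layer is treated as Bayesian linear regression with a Gaussian likelihood, new data can be absorbed with a closed-form posterior update. This is a minimal sketch under those simplifying assumptions; the paper additionally handles Monte-Carlo samples and the specific network types above:

        import numpy as np

        def last_layer_update(mu, Sigma, Phi, y, noise_var=1.0):
            """Conjugate Gaussian update of a last-layer posterior N(mu, Sigma)
            given new penultimate-layer features Phi (n x d) and targets y (n,)."""
            prec_old = np.linalg.inv(Sigma)
            prec_new = prec_old + Phi.T @ Phi / noise_var
            Sigma_new = np.linalg.inv(prec_new)
            mu_new = Sigma_new @ (prec_old @ mu + Phi.T @ y / noise_var)
            return mu_new, Sigma_new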
    Common Corruption Robustness of Point Cloud Detectors: Benchmark and Enhancement. (arXiv:2210.05896v1 [cs.CV])
    Object detection from LiDAR-based point clouds has recently become important in autonomous driving. Although they achieve high accuracy on public benchmarks, state-of-the-art detectors may still go wrong and cause heavy losses due to widespread real-world corruptions such as rain, snow, and sensor noise. Nevertheless, there is a lack of a large-scale dataset covering diverse scenes and realistic corruption types with different severities to develop practical and robust point cloud detectors, which is challenging due to the heavy collection costs. To alleviate this challenge and take a first step towards robust point cloud detection, we propose physical-aware simulation methods to generate degraded point clouds under different real-world common corruptions. Then, as a first attempt, we construct a benchmark based on the physical-aware common corruptions for point cloud detectors, which contains a total of 1,122,150 examples covering 7,481 scenes, 25 common corruption types, and 6 severities. With this novel benchmark, we conduct extensive empirical studies on 8 state-of-the-art detectors spanning 6 different detection frameworks. We thereby obtain several insightful observations revealing the vulnerabilities of the detectors and indicating directions for enhancement. Moreover, we further study the effectiveness of existing robustness enhancement methods based on data augmentation and data denoising. The benchmark can potentially be a new platform for evaluating point cloud detectors, opening a door for developing novel robustness enhancement methods.
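    Physically-aware corruption simulation is beyond a few lines, but the flavor can be given with two toy corruptions, sensor jitter and point dropout, as simplified stand-ins for the benchmark's 25 corruption types:

        import numpy as np

        def corrupt_point_cloud(points, severity=1, seed=0):
            """Toy corruptions: Gaussian jitter plus random point dropout."""
            rng = np.random.default_rng(seed)
            noisy = points + rng.normal(0.0, 0.02 * severity, points.shape)
            keep = rng.random(len(noisy)) > 0.1 * severity   # drop ~10% per level
            return noisy[keep]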
    Unsupervised Learning of Equivariant Structure from Sequences. (arXiv:2210.05972v1 [cs.LG])
    In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn symmetry from time sequences of length at least three. Our method leverages the stationarity (e.g., constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to predict future observations. We demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product of applying simultaneous block-diagonalization to the transition operators in the latent space, a procedure commonly used in representation theory to decompose the feature space based on the type of response to group actions. We showcase our method from both empirical and theoretical perspectives. Our result suggests that finding a simple structured relation and learning a model with extrapolation capability are two sides of the same coin. The code is available at https://github.com/takerum/meta_sequential_prediction.
    Embeddings as Epistemic States: Limitations on the Use of Pooling Operators for Accumulating Knowledge. (arXiv:2210.05723v1 [cs.AI])
    Various neural network architectures rely on pooling operators to aggregate information coming from different sources. It is often implicitly assumed in such contexts that vectors encode epistemic states, i.e. that vectors capture the evidence that has been obtained about some properties of interest, and that pooling these vectors yields a vector that combines this evidence. We study, for a number of standard pooling operators, under what conditions they are compatible with this idea, which we call the epistemic pooling principle. While we find that all the considered pooling operators can satisfy the epistemic pooling principle, this only holds when embeddings are sufficiently high-dimensional and, for most pooling operators, when the embeddings satisfy particular constraints (e.g. having non-negative coordinates). We then study the implications of these constraints, starting from the idea that we should be able to verify whether an arbitrary propositional formula is satisfied in the epistemic state encoded by a given vector. We find that when the epistemic pooling principle is satisfied, in most cases it is impossible to verify the satisfaction of propositional formulas using linear scoring functions, with two exceptions: (i) max-pooling with embeddings that are upper-bounded and (ii) Hadamard pooling with non-negative embeddings. Finally, we also study an extension of the epistemic pooling principle to weighted epistemic states, where max-pooling emerges as the most suitable operator.
    Diffusion Models for Causal Discovery via Topological Ordering. (arXiv:2210.06201v1 [cs.LG])
    Discovering causal relations from observational data becomes possible with additional assumptions, such as constraining the functional relations to be nonlinear with additive noise. In this case, the Hessian of the data log-likelihood can be used to find leaf nodes in a causal graph. Topological ordering approaches for causal discovery exploit this by performing graph discovery in two steps: first sequentially identifying nodes in reverse order of depth (topological ordering), and then pruning the potential relations. This is more efficient since the search is performed over a permutation space rather than a graph space. However, existing computational methods for obtaining the Hessian still do not scale as the number of variables and the number of samples increase. Therefore, inspired by recent innovations in diffusion probabilistic models (DPMs), we propose DiffAN, a topological ordering algorithm that leverages DPMs. Further, we introduce theory for updating the learned Hessian without re-training the neural network, and we show that computing with a subset of samples gives an accurate approximation of the ordering, which allows scaling to datasets with more samples and variables. We show empirically that our method scales exceptionally well to datasets with up to $500$ nodes and up to $10^5$ samples, while still performing on par with state-of-the-art causal discovery methods on small datasets. Implementation is available at https://github.com/vios-s/DiffAN.
    Self-supervised Learning for Label-Efficient Sleep Stage Classification: A Comprehensive Evaluation. (arXiv:2210.06286v1 [eess.SP])
    The past few years have witnessed remarkable advances in deep learning for EEG-based sleep stage classification (SSC). However, the success of these models is attributed to possessing a massive amount of labeled data for training, limiting their applicability in real-world scenarios. In such scenarios, sleep labs can generate a massive amount of data, but labeling these data can be expensive and time-consuming. Recently, the self-supervised learning (SSL) paradigm has emerged as one of the most successful techniques for overcoming the scarcity of labeled data. In this paper, we evaluate the efficacy of SSL to boost the performance of existing SSC models in the few-labels regime. We conduct a thorough study on three SSC datasets, and we find that fine-tuning the pretrained SSC models with only 5% of labeled data can achieve performance competitive with supervised training on full labels. Moreover, self-supervised pretraining helps SSC models to be more robust to data imbalance and domain shift problems. The code is publicly available at \url{https://github.com/emadeldeen24/eval_ssl_ssc}.
    Can Push-forward Generative Models Fit Multimodal Distributions?. (arXiv:2206.14476v2 [stat.ML] UPDATED)
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are Variational Autoencoders and Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v4 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. (arXiv:2102.09204) and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite-dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists of a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges, which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also show how to adapt the method to deal with mass variations, a useful extension when dealing with single-cell RNA-sequencing data, where cells can branch and die.
    VCT: A Video Compression Transformer. (arXiv:2206.07307v2 [cs.CV] UPDATED)
    We show how transformers can be used to vastly simplify neural video compression. Previous methods have relied on an increasing number of architectural biases and priors, including motion prediction and warping operations, resulting in complex models. Instead, we independently map input frames to representations and use a transformer to model their dependencies, letting it predict the distribution of future representations given the past. The resulting video compression transformer outperforms previous methods on standard video compression datasets. Experiments on synthetic data show that our model learns to handle complex motion patterns such as panning, blurring and fading purely from data. Our approach is easy to implement, and we release code to facilitate future research.
    Synthesizing explainable counterfactual policies for algorithmic recourse with program synthesis. (arXiv:2201.07135v2 [cs.LG] UPDATED)
    Being able to provide counterfactual interventions - sequences of actions we would have had to take for a desirable outcome to happen - is essential to explain how to change an unfavourable decision by a black-box machine learning model (e.g., being denied a loan request). Existing solutions have mainly focused on generating feasible interventions without providing explanations of their rationale. Moreover, they need to solve a separate optimization problem for each user. In this paper, we take a different approach and learn a program that outputs a sequence of explainable counterfactual actions given a user description and a causal graph. We leverage program synthesis techniques, reinforcement learning coupled with Monte Carlo Tree Search for efficient exploration, and rule learning to extract explanations for each recommended action. An experimental evaluation on synthetic and real-world datasets shows how our approach generates effective interventions by making orders of magnitude fewer queries to the black-box classifier than existing solutions, with the additional benefit of complementing them with interpretable explanations.
    JuryGCN: Quantifying Jackknife Uncertainty on Graph Convolutional Networks. (arXiv:2210.05959v1 [cs.LG])
    Graph Convolutional Networks (GCNs) have exhibited strong empirical performance in many real-world applications. The vast majority of existing work on GCNs focuses primarily on accuracy while ignoring how confident or uncertain a GCN is with respect to its predictions. Despite being a cornerstone of trustworthy graph mining, uncertainty quantification on GCNs has not been well studied, and the scarce existing efforts either fail to provide deterministic quantification or have to change the training procedure of the GCN by introducing additional parameters or architectures. In this paper, we propose JuryGCN, the first frequentist-based approach for quantifying the uncertainty of a GCN, where the key idea is to quantify the uncertainty of a node as the width of a confidence interval constructed with a jackknife estimator. Moreover, we leverage influence functions to estimate the change in GCN parameters without re-training, in order to scale up the computation. The proposed JuryGCN is capable of quantifying uncertainty deterministically without modifying the GCN architecture or introducing additional parameters. We perform extensive experimental evaluation on real-world datasets in the tasks of both active learning and semi-supervised node classification, which demonstrates the efficacy of the proposed method.
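    The jackknife idea behind JuryGCN can be illustrated with a model-agnostic sketch: the width of the interval spanned by leave-one-out predictions serves as the uncertainty score. JuryGCN itself avoids the retraining implied below by using influence functions:

        import numpy as np

        def jackknife_width(fit_predict, data, x, alpha=0.1):
            """fit_predict(train_subset, x) -> scalar prediction for input x.
            Returns the width of the (1 - alpha) leave-one-out interval;
            a wider interval means a more uncertain prediction."""
            preds = np.array([
                fit_predict(data[:i] + data[i + 1:], x) for i in range(len(data))
            ])
            lo, hi = np.quantile(preds, [alpha / 2, 1 - alpha / 2])
            return hi - lo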
    CTL++: Evaluating Generalization on Never-Seen Compositional Patterns of Known Functions, and Compatibility of Neural Representations. (arXiv:2210.06350v1 [cs.LG])
    Well-designed diagnostic tasks have played a key role in studying the failure of neural nets (NNs) to generalize systematically. Famous examples include SCAN and Compositional Table Lookup (CTL). Here we introduce CTL++, a new diagnostic dataset based on compositions of unary symbolic functions. While the original CTL is used to test length generalization or productivity, CTL++ is designed to test systematicity of NNs, that is, their capability to generalize to unseen compositions of known functions. CTL++ splits functions into groups and tests performance on group elements composed in a way not seen during training. We show that recent CTL-solving Transformer variants fail on CTL++. The simplicity of the task design allows for fine-grained control of task difficulty, as well as many insightful analyses. For example, we measure how much overlap between groups is needed by tested NNs for learning to compose. We also visualize how learned symbol representations in outputs of functions from different groups are compatible in case of success but not in case of failure. These results provide insights into failure cases reported on more complex compositions in the natural language domain. Our code is public.
    On the Implicit Bias in Deep-Learning Algorithms. (arXiv:2208.12591v2 [cs.LG] UPDATED)
    Gradient-based deep-learning algorithms exhibit remarkable performance in practice, but it is not well-understood why they are able to generalize despite having more parameters than training examples. It is believed that implicit bias is a key factor in their ability to generalize, and hence it was widely studied in recent years. In this short survey, we explain the notion of implicit bias, review main results and discuss their implications.
    SCROLLS: Standardized CompaRison Over Long Language Sequences. (arXiv:2201.03533v2 [cs.CL] UPDATED)
    NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
    Neural Implicit Surface Evolution using Differential Equations. (arXiv:2201.09636v3 [cs.LG] UPDATED)
    This work investigates the use of smooth neural networks for modeling dynamic variations of implicit surfaces under partial differential equations (PDE). For this purpose, it extends the representation of neural implicit surfaces to the space-time $\mathbb{R}^3\times \mathbb{R}$, which opens up mechanisms for \textbf{continuous} geometric transformations. Examples include evolving an initial condition surface towards general vector fields, smoothing and sharpening using the mean curvature equation, and interpolations of initial conditions regularized by specific differential equations. The network training considers two constraints. A data term is responsible for fitting the PDE's initial condition to the corresponding time instant, usually $\mathbb{R}^3 \times \{0\}$. Then, a PDE term forces the network to approximate a solution of the underlying equation, \textbf{without any supervision}. The network can also be initialized based on previously trained initial conditions resulting in faster convergence when compared with the standard approach.
    Spiking Neural Operators for Scientific Machine Learning. (arXiv:2205.10130v2 [cs.NE] UPDATED)
    The main computational task of Scientific Machine Learning (SciML) is function regression, required both for the inputs and the outputs of a simulation. Physics-Informed Neural Networks (PINNs) and neural operators (such as DeepONet) have been very effective in solving Partial Differential Equations (PDEs), but they tax computational resources heavily and cannot readily be adopted for edge computing. Here, we address this issue by considering Spiking Neural Networks (SNNs), which have shown promise in reducing energy consumption by two orders of magnitude or more. We present an SNN-based method to perform regression, which has been a challenge due to the inherent difficulty in representing a function's input domain and continuous output values as spikes. We first propose a new method for encoding continuous values into spikes based on a triangular matrix in space and time, and demonstrate its better performance compared to existing methods. Next, we demonstrate that using a simple SNN architecture consisting of Leaky Integrate and Fire (LIF) activation and two dense layers, we can achieve relatively accurate function regression results. Moreover, we can replace the LIF with a trained Multi-Layer Perceptron (MLP) network and obtain comparable results three times faster. Then, we introduce the DeepONet, consisting of a branch (typically a fully-connected neural network, FNN) for inputs and a trunk (also an FNN) for outputs. We can build a spiking DeepONet by replacing either the branch or the trunk with an SNN. We demonstrate this new approach for classification using an SNN in the branch, achieving results comparable to the literature. Finally, we design a spiking DeepONet for regression by replacing its trunk with an SNN, and achieve good accuracy for approximating functions as well as inferring solutions of differential equations.
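    The LIF dynamics used in such spiking layers are compact enough to state directly. A minimal NumPy sketch over T time steps, with illustrative decay and threshold values:

        import numpy as np

        def lif_layer(currents, decay=0.9, threshold=1.0):
            """currents: (T, n) input currents -> (T, n) binary spike trains."""
            v = np.zeros(currents.shape[1])
            spikes = []
            for t in range(currents.shape[0]):
                v = decay * v + currents[t]        # leaky integration
                s = (v >= threshold).astype(float) # fire where threshold crossed
                v = v * (1.0 - s)                  # hard reset after a spike
                spikes.append(s)
            return np.array(spikes)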
    Entity Aware Negative Sampling with Auxiliary Loss of False Negative Prediction for Knowledge Graph Embedding. (arXiv:2210.06242v1 [cs.LG])
    Knowledge graph (KG) embedding is widely used in many downstream applications of KGs. Generally, since KGs contain only ground-truth triples, it is necessary to construct arbitrary negative samples for representation learning of KGs. Recently, various methods for sampling high-quality negatives have been studied, because the quality of negative triples has a great effect on KG embedding. In this paper, we propose a novel method called Entity Aware Negative Sampling (EANS), which samples negative entities that resemble the positive one by imposing a Gaussian distribution on the aligned entity index space. Additionally, we introduce an auxiliary loss for false negative prediction that can alleviate the impact of sampled false negative triples. The proposed method can generate high-quality negative samples regardless of the negative sample size and effectively mitigates the influence of false negative samples. Experimental results on standard benchmarks show that our EANS outperforms existing state-of-the-art negative sampling methods on several knowledge graph embedding models. Moreover, the proposed method achieves competitive performance even when the number of negative samples is limited to only one.
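    Assuming entities have already been aligned so that similar entities occupy nearby indices, the sampling step itself is nearly a one-liner; the standard deviation below is an illustrative choice, not the paper's setting:

        import numpy as np

        def eans_negative(pos_idx, num_entities, sigma=50.0, seed=None):
            """Sample a negative entity index from a Gaussian centered at the
            positive entity's index in the aligned index space (sketch)."""
            rng = np.random.default_rng(seed)
            neg = int(round(rng.normal(pos_idx, sigma))) % num_entities
            return neg if neg != pos_idx else (neg + 1) % num_entities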
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v3 [cs.LG] UPDATED)
    Selective classification is the task of rejecting inputs that a model would predict incorrectly, trading off input-space coverage against model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their use in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks.
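    A sketch of the disagreement metric: given one test input's predicted labels under a sequence of intermediate checkpoints, score the input by how often late checkpoints disagree with the final label, and reject above a threshold. The "late half" cutoff below is an assumption for illustration:

        import numpy as np

        def disagreement_score(checkpoint_labels):
            """checkpoint_labels: predictions for one input, earliest to latest.
            Fraction of late-stage checkpoints disagreeing with the final label."""
            final = checkpoint_labels[-1]
            late = checkpoint_labels[len(checkpoint_labels) // 2:]
            return float(np.mean([p != final for p in late]))

        # Reject the input if disagreement_score(labels) > tau for a chosen tau.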
    Energy Consumption-Aware Tabular Benchmarks for Neural Architecture Search. (arXiv:2210.06015v1 [cs.LG])
    The demand for large-scale computational resources for Neural Architecture Search (NAS) has been lessened by tabular benchmarks for NAS. Evaluating NAS strategies is now possible on extensive search spaces and at a moderate computational cost. But so far, NAS has mainly focused on maximising performance on some hold-out validation/test set. However, energy consumption is a partially conflicting objective that should not be neglected. We hypothesise that constraining NAS to include the energy consumption of training the models could reveal a sub-space of undiscovered architectures that are more computationally efficient with a smaller carbon footprint. To support the hypothesis, an existing tabular benchmark for NAS is augmented with the energy consumption of each architecture. We then perform multi-objective optimisation that includes energy consumption as an additional objective. We demonstrate the usefulness of multi-objective NAS for uncovering the trade-off between performance and energy consumption as well as for finding more energy-efficient architectures. The updated tabular benchmark, EC-NAS-Bench, is open-sourced to encourage the further exploration of energy consumption-aware NAS.
    Probabilistic Inverse Modeling: An Application in Hydrology. (arXiv:2210.06213v1 [cs.LG])
    The astounding success of machine learning methods has made it imperative to obtain more explainable and trustworthy estimates from these models. In hydrology, basin characteristics can be noisy or missing, impacting streamflow prediction. For solving inverse problems in such applications, ensuring explainability is pivotal for tackling issues relating to data bias and large search spaces. We propose a probabilistic inverse model framework that can reconstruct robust hydrology basin characteristics from dynamic input weather drivers and streamflow response data. We address two aspects of building more explainable inverse models: uncertainty estimation and robustness. This can help improve water managers' trust, support the handling of noisy data, and reduce costs. We propose an uncertainty-based learning method that offers a 6\% improvement in $R^2$ for streamflow prediction (forward modeling) from inverse-model-inferred basin characteristic estimates, a 17\% reduction in uncertainty (40\% in the presence of noise), and a 4\% higher coverage rate for basin characteristics.
    Building Heterogeneous Cloud System for Machine Learning Inference. (arXiv:2210.05889v1 [cs.DC])
    Online inference is becoming a key service product for many businesses, deployed in cloud platforms to meet customer demands. Despite their revenue-generation capability, these services need to operate under tight Quality-of-Service (QoS) and cost budget constraints. This paper introduces KAIROS, a novel runtime framework that maximizes query throughput while meeting a QoS target and a cost budget. KAIROS designs and implements novel techniques to build a pool of heterogeneous compute hardware without online exploration overhead, and to distribute inference queries optimally at runtime. Our evaluation using industry-grade deep learning (DL) models shows that KAIROS yields up to 2X the throughput of an optimal homogeneous solution, and outperforms state-of-the-art schemes by up to 70\%, even when the competing schemes are given the advantage of ignoring their exploration overhead.
    Task Compass: Scaling Multi-task Pre-training with Task Prefix. (arXiv:2210.06277v1 [cs.CL])
    Leveraging task-aware annotated data as supervised signals to assist with self-supervised learning on large-scale unlabeled data has become a new trend in pre-training language models. Existing studies show that multi-task learning with large-scale supervised tasks suffers from negative effects across tasks. To tackle this challenge, we propose a task prefix guided multi-task pre-training framework to explore the relationships among tasks. We conduct extensive experiments on 40 datasets, which show that our model can not only serve as a strong foundation backbone for a wide range of tasks but also act as a probing tool for analyzing task relationships. The task relationships reflected by the prefixes align with transfer learning performance between tasks. They also suggest directions for data augmentation with complementary tasks, which help our model achieve human-parity results on commonsense reasoning leaderboards. Code is available at https://github.com/cooelf/CompassMTL
    Transfer learning on electromyography (EMG) tasks: approaches and beyond. (arXiv:2210.06295v1 [eess.SP])
    Machine learning on electromyography (EMG) has recently achieved remarkable success on a variety of tasks, but such success relies heavily on the assumption that the training and future data come from the same distribution. This assumption may not hold in many real-world applications, where model calibration via data re-collection and label annotation is required, which is generally expensive and time-consuming. To address this problem, transfer learning (TL), which aims to improve target learners' performance by transferring knowledge from related source domains, is emerging as a new paradigm for reducing the calibration effort. In this survey, we assess more than fifty published, peer-reviewed representative transfer learning approaches for EMG applications. Unlike previous surveys on pure transfer learning or EMG-based machine learning, this survey aims to provide insight into the biological foundations of existing transfer learning methods for EMG-related analysis. Specifically, we first introduce the physiological structure of the muscles, the EMG generating mechanism, and the recording of EMG, to provide the biological insights behind existing transfer learning approaches. We then categorize existing research endeavors into data-based, model-based, training-scheme-based, and adversarial-based approaches. This survey systematically summarizes and categorizes existing transfer learning approaches for EMG-related machine learning applications. In addition, we discuss possible drawbacks of existing works and point out future directions for better EMG transfer learning algorithms that enhance practicality in real-world applications.
    Images as Weight Matrices: Sequential Image Generation Through Synaptic Learning Rules. (arXiv:2210.06184v1 [cs.CV])
    Work on fast weight programmers has demonstrated the effectiveness of key/value outer product-based learning rules for sequentially generating a weight matrix (WM) of a neural net (NN) by another NN or itself. However, the weight generation steps are typically not visually interpretable by humans, because the contents stored in the WM of an NN are not. Here we apply the same principle to generate natural images. The resulting fast weight painters (FPAs) learn to execute sequences of delta learning rules to sequentially generate images as sums of outer products of self-invented keys and values, one rank at a time, as if each image were the WM of an NN. We train our FPAs in the generative adversarial networks framework, and evaluate them on various image datasets. We show how these generic learning rules can generate images of respectable visual quality without any explicit inductive bias for images. While the performance largely lags behind that of specialised state-of-the-art image generators, our approach allows for visualising how synaptic learning rules iteratively produce complex connection patterns, yielding human-interpretable, meaningful images. Finally, we also show that an additional convolutional U-Net (now popular in diffusion models) at the output of an FPA can learn one-step "denoising" of FPA-generated images to enhance their quality. Our code is public.
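    The decoding loop can be written down directly: treat the image as a weight matrix W and apply the delta rule W += eta (v - W k) k^T once per painting step. In the paper the keys, values, and learning rates come from a trained network; here they are random placeholders:

        import numpy as np

        def paint(keys, values, lrs, shape=(64, 64)):
            """Generate an image as a sum of delta-rule (rank-one) updates."""
            W = np.zeros(shape)
            for k, v, eta in zip(keys, values, lrs):
                W += eta * np.outer(v - W @ k, k)  # one rank-one painting step
            return W

        img = paint(np.random.randn(16, 64), np.random.randn(16, 64),
                    0.1 * np.ones(16))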
    Predicting the clinical citation count of biomedical papers using multilayer perceptron neural network. (arXiv:2210.06346v1 [cs.CL])
    The number of clinical citations received from clinical guidelines or clinical trials has been considered one of the most appropriate indicators for quantifying the clinical impact of biomedical papers. Therefore, early prediction of the clinical citation count of biomedical papers is critical for scientific activities in biomedicine, such as research evaluation, resource allocation, and clinical translation. In this study, we designed a four-layer multilayer perceptron neural network (MPNN) model to predict the future clinical citation count of biomedical papers, using 9,822,620 biomedical papers published from 1985 to 2005. We extracted ninety-one paper features from three dimensions as the input of the model: twenty-one features in the paper dimension, thirty-five in the reference dimension, and thirty-five in the citing-paper dimension. In each dimension, the features can be classified into three categories, i.e., citation-related features, clinical translation-related features, and topic-related features. In addition, in the paper dimension, we also considered features that have previously been demonstrated to be related to the citation counts of research papers. The results showed that the proposed MPNN model outperformed the other five baseline models, and that the features in the reference dimension were the most important.
    Graph Neural Network Bandits. (arXiv:2207.06456v2 [cs.LG] UPDATED)
    We consider the bandit optimization problem with the reward function defined over graph-structured data. This problem has important applications in molecule design and drug discovery, where the reward is naturally invariant to graph permutations. The key challenges in this setting are scaling to large domains, and to graphs with many nodes. We resolve these challenges by embedding the permutation invariance into our model. In particular, we show that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the Reproducing Kernel Hilbert Space of a permutation-invariant additive kernel. By establishing a novel connection between such kernels and the graph neural tangent kernel (GNTK), we introduce the first GNN confidence bound and use it to design a phased-elimination algorithm with sublinear regret. Our regret bound depends on the GNTK's maximum information gain, which we also provide a bound for. While the reward function depends on all $N$ node features, our guarantees are independent of the number of graph nodes $N$. Empirically, our approach exhibits competitive performance and scales well on graph-structured domains.
    Centralized Training with Hybrid Execution in Multi-Agent Reinforcement Learning. (arXiv:2210.06274v1 [cs.LG])
    We introduce hybrid execution in multi-agent reinforcement learning (MARL), a new paradigm in which agents aim to successfully perform cooperative tasks with any communication level at execution time by taking advantage of information-sharing among the agents. Under hybrid execution, the communication level can range from a setting in which no communication is allowed between agents (fully decentralized), to a setting featuring full communication (fully centralized). To formalize our setting, we define a new class of multi-agent partially observable Markov decision processes (POMDPs) that we name hybrid-POMDPs, which explicitly models a communication process between the agents. We contribute MARO, an approach that combines an autoregressive predictive model to estimate missing agents' observations, and a dropout-based RL training scheme that simulates different communication levels during the centralized training phase. We evaluate MARO on standard scenarios and extensions of previous benchmarks tailored to emphasize the negative impact of partial observability in MARL. Experimental results show that our method consistently outperforms baselines, allowing agents to act with faulty communication while successfully exploiting shared information.
    Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets. (arXiv:2210.05958v1 [cs.CV])
    There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets, which is attributed to the lack of inductive bias. In this paper, we further consider this problem and point out two weaknesses of ViTs in inductive biases, namely spatial relevance and diverse channel representation. First, on the spatial aspect, objects are locally compact and relevant, so fine-grained features need to be extracted from a token and its neighbors; yet the lack of data hinders ViTs from attending to spatial relevance. Second, on the channel aspect, representations exhibit diversity across different channels, but scarce data prevent ViTs from learning representations strong enough for accurate recognition. To this end, we propose the Dynamic Hybrid Vision Transformer (DHVT) as a solution to strengthen these two inductive biases. On the spatial aspect, we adopt a hybrid structure, in which convolution is integrated into the patch embedding and multi-layer perceptron modules, forcing the model to capture token features as well as their neighboring features. On the channel aspect, we introduce a dynamic feature aggregation module in the MLP and a brand-new "head token" design in the multi-head self-attention module to help re-calibrate channel representations and make different channel-group representations interact with each other. The fusion of weak channel representations forms a representation strong enough for classification. With this design, we successfully eliminate the performance gap between CNNs and ViTs, and our DHVT achieves a series of state-of-the-art results with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters, and 82.3% on ImageNet-1K with 24.0M parameters. Code is available at https://github.com/ArieSeirack/DHVT.
    ControlVAE: Model-Based Learning of Generative Controllers for Physics-Based Characters. (arXiv:2210.06063v1 [cs.GR])
    In this paper, we introduce ControlVAE, a novel model-based framework for learning generative motion control policies based on variational autoencoders (VAE). Our framework can learn a rich and flexible latent representation of skills and a skill-conditioned generative control policy from a diverse set of unorganized motion sequences, which enables the generation of realistic human behaviors by sampling in the latent space and allows high-level control policies to reuse the learned skills to accomplish a variety of downstream tasks. In the training of ControlVAE, we employ a learnable world model to realize direct supervision of the latent space and the control policy. This world model effectively captures the unknown dynamics of the simulation system, enabling efficient model-based learning of high-level downstream tasks. We also learn a state-conditional prior distribution in the VAE-based generative control policy, which generates a skill embedding that outperforms the non-conditional priors in downstream tasks. We demonstrate the effectiveness of ControlVAE using a diverse set of tasks, which allows realistic and interactive control of the simulated characters.
    Two-stream Network for ECG Signal Classification. (arXiv:2210.06293v1 [eess.SP])
    Electrocardiography (ECG), a technique for medical monitoring of cardiac activity, is an important method for identifying cardiovascular disease. However, analyzing the growing quantity of ECG data consumes substantial medical resources. This paper explores an effective algorithm for automatic classification of multiple heartbeat types based on ECG. Most neural network-based methods target individual heartbeats, ignoring the information embedded in the temporal sequence. Moreover, the ECG signal exhibits temporal variation and unique individual characteristics, meaning that the same type of ECG signal varies among patients under different physical conditions. This paper adopts a two-stream architecture and presents an enhanced version of ECG recognition based on it. The architecture achieves classification of both the holistic ECG signal and individual heartbeats, and incorporates identified and temporal stream networks: identified networks extract features of individual heartbeats, while temporal networks extract temporal correlations between heartbeats. Results on the MIT-BIH Arrhythmia Database demonstrate that the proposed algorithm achieves an accuracy of 99.38\%. In addition, the proposed algorithm reaches 88.07\% positive accuracy on large-scale real-life data, showing that it can efficiently categorize different classes of heartbeats with high diagnostic performance.
    Classification by estimating the cumulative distribution function for small data. (arXiv:2210.05953v1 [cs.LG])
    In this paper, we study the classification problem by estimating the conditional probability function of the given data. Different from the traditional expected risk estimation theory on empirical data, we calculate the probability via a Fredholm equation, which leads to estimating the distribution of the data. Based on the Fredholm equation, a new expected risk estimation theory based on estimating the cumulative distribution function is presented. The main characteristic of the new expected risk estimation is that it measures the risk on the distribution of the input space. The corresponding empirical risk estimation is also presented, and an $\varepsilon$-insensitive $L_{1}$ cumulative support vector machine ($\varepsilon$-$L_{1}$VSVM) is proposed by introducing an insensitive loss. It is worth mentioning that the classification models and classification evaluation indicators based on the new mechanism differ from the traditional ones. Experimental results show the effectiveness of the proposed $\varepsilon$-$L_{1}$VSVM and the corresponding cumulative distribution function indicator in terms of validity and interpretability for small-data classification.
    Learning to Optimize Quasi-Newton Methods. (arXiv:2210.06171v1 [cs.LG])
    We introduce a novel machine learning optimizer called LODO, which online meta-learns an implicit inverse Hessian of the loss as a subroutine of quasi-Newton optimization. Our optimizer merges Learning to Optimize (L2O) techniques with quasi-Newton methods to learn neural representations of symmetric matrix vector products, which are more flexible than those in other quasi-Newton methods. Unlike other L2O methods, ours does not require any meta-training on a training task distribution, and instead learns to optimize on the fly while optimizing on the test task, adapting to the local characteristics of the loss landscape while traversing it. Theoretically, we show that our optimizer approximates the inverse Hessian in noisy loss landscapes and is capable of representing a wide range of inverse Hessians. We experimentally verify our algorithm's performance in the presence of noise, and show that simpler alternatives for representing the inverse Hessians worsen performance. Lastly, we use our optimizer to train a semi-realistic deep neural network with 95k parameters, and obtain competitive results against standard neural network optimizers.
    On Divergence Measures for Bayesian Pseudocoresets. (arXiv:2210.06205v1 [cs.LG])
    A Bayesian pseudocoreset is a small synthetic dataset for which the posterior over parameters approximates that of the original dataset. While promising, the scalability of Bayesian pseudocoresets is not yet validated in realistic problems such as image classification with deep neural networks. On the other hand, dataset distillation methods similarly construct a small dataset such that the optimization using the synthetic dataset converges to a solution with performance competitive with optimization using full data. Although dataset distillation has been empirically verified in large-scale settings, the framework is restricted to point estimates, and their adaptation to Bayesian inference has not been explored. This paper casts two representative dataset distillation algorithms as approximations to methods for constructing pseudocoresets by minimizing specific divergence measures: reverse KL divergence and Wasserstein distance. Furthermore, we provide a unifying view of such divergence measures in Bayesian pseudocoreset construction. Finally, we propose a novel Bayesian pseudocoreset algorithm based on minimizing forward KL divergence. Our empirical results demonstrate that the pseudocoresets constructed from these methods reflect the true posterior even in high-dimensional Bayesian inference problems.
    Maximum entropy exploration in contextual bandits with neural networks and energy based models. (arXiv:2210.06302v1 [cs.LG])
    Contextual bandits can solve a huge range of real-world problems. However, current popular algorithms to solve them either rely on linear models, or unreliable uncertainty estimation in non-linear models, which are required to deal with the exploration-exploitation trade-off. Inspired by theories of human cognition, we introduce novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces. We present two classes of models, one with neural networks as reward estimators, and the other with energy based models, which model the probability of obtaining an optimal reward given an action. We evaluate the performance of these models in static and dynamic contextual bandit simulation environments. We show that both techniques outperform well-known standard algorithms, where energy based models have the best overall performance. This provides practitioners with new techniques that perform well in static and dynamic settings, and are particularly well suited to non-linear scenarios with continuous action spaces.
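    The simplest instance of the idea for discrete actions is Boltzmann (softmax) sampling over estimated rewards. A sketch with a generic reward estimator; the temperature value is illustrative, and the paper's energy-based variant and continuous-action setting are beyond this snippet:

        import numpy as np

        def boltzmann_action(reward_net, context, actions, temperature=0.5, seed=None):
            """Sample an action with probability proportional to
            exp(estimated_reward / temperature): maximum-entropy exploration."""
            rng = np.random.default_rng(seed)
            r = np.array([reward_net(context, a) for a in actions])
            p = np.exp((r - r.max()) / temperature)   # subtract max for stability
            p /= p.sum()
            return rng.choice(len(actions), p=p)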
    JukeDrummer: Conditional Beat-aware Audio-domain Drum Accompaniment Generation via Transformer VQ-VAE. (arXiv:2210.06007v1 [cs.SD])
    This paper proposes a model that generates a drum track in the audio domain to play along to a user-provided drum-free recording. Specifically, using paired data of drumless tracks and the corresponding human-made drum tracks, we train a Transformer model to improvise the drum part of an unseen drumless recording. We combine two approaches to encode the input audio. First, we train a vector-quantized variational autoencoder (VQ-VAE) to represent the input audio with discrete codes, which can then be readily used in a Transformer. Second, using an audio-domain beat tracking model, we compute beat-related features of the input audio and use them as embeddings in the Transformer. Instead of generating the drum track directly as waveforms, we use a separate VQ-VAE to encode the mel-spectrogram of a drum track into another set of discrete codes, and train the Transformer to predict the sequence of drum-related discrete codes. The output codes are then converted to a mel-spectrogram with a decoder, and then to the waveform with a vocoder. We report both objective and subjective evaluations of variants of the proposed model, demonstrating that the model with beat information generates drum accompaniment that is rhythmically and stylistically consistent with the input audio.
    Differentially Private Bootstrap: New Privacy Analysis and Inference Strategies. (arXiv:2210.06140v1 [stat.ML])
    Differentially private (DP) mechanisms protect individual-level information by introducing randomness into the statistical analysis procedure. While there are now many DP tools for various statistical problems, there is still a lack of general techniques to understand the sampling distribution of a DP estimator, which is crucial for uncertainty quantification in statistical inference. We analyze a DP bootstrap procedure that releases multiple private bootstrap estimates to infer the sampling distribution and construct confidence intervals. Our privacy analysis includes new results on the privacy cost of a single DP bootstrap estimate, applicable to arbitrary DP mechanisms, and identifies some misuses of the bootstrap in the existing literature. We show that the release of $B$ DP bootstrap estimates from mechanisms satisfying $(\mu/\sqrt{(2-2/\mathrm{e})B})$-Gaussian DP asymptotically satisfies $\mu$-Gaussian DP as $B$ goes to infinity. We also develop a statistical procedure based on the DP bootstrap estimates to correctly infer the sampling distribution using techniques related to the deconvolution of probability measures, an approach that is novel in analyzing DP procedures. From our density estimate, we construct confidence intervals and compare them to existing methods through simulations and real-world experiments using the 2016 Canada Census Public Use Microdata. The coverage of our private confidence intervals achieves the nominal confidence level, while other methods fail to meet this guarantee.
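    A sketch of the release step for the mean of bounded data: draw B bootstrap resamples and privatize each mean with Gaussian noise, scaling the per-release budget as in the abstract. The sensitivity calibration below is a naive simplification; the paper's privacy accounting for bootstrap resampling is more careful:

        import numpy as np

        def dp_bootstrap_means(x, B=100, mu=1.0, lo=0.0, hi=1.0, seed=0):
            """Release B noisy bootstrap means of data clipped to [lo, hi]."""
            rng = np.random.default_rng(seed)
            n = len(x)
            xc = np.clip(x, lo, hi)
            delta = (hi - lo) / n                    # naive sensitivity of the mean
            mu_b = mu / np.sqrt((2 - 2 / np.e) * B)  # per-estimate GDP budget
            return [xc[rng.integers(0, n, n)].mean() + rng.normal(0, delta / mu_b)
                    for _ in range(B)]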
    Language Models are Realistic Tabular Data Generators. (arXiv:2210.06280v1 [cs.LG])
    Tabular data is among the oldest and most ubiquitous forms of data. However, the generation of synthetic samples with the original data's characteristics remains a significant challenge for tabular data. While many generative models from the computer vision domain, such as autoencoders or generative adversarial networks, have been adapted for tabular data generation, less research has been directed towards recent transformer-based large language models (LLMs), which are also generative in nature. To this end, we propose GReaT (Generation of Realistic Tabular data), which exploits an auto-regressive generative LLM to sample synthetic and yet highly realistic tabular data. Furthermore, GReaT can model tabular data distributions by conditioning on any subset of features; the remaining features are sampled without additional overhead. We demonstrate the effectiveness of the proposed approach in a series of experiments that quantify the validity and quality of the produced data samples from multiple angles. We find that GReaT maintains state-of-the-art performance across many real-world data sets with heterogeneous feature types.
    An Energy-Efficient Spiking Neural Network for Finger Velocity Decoding for Implantable Brain-Machine Interface. (arXiv:2210.06287v1 [eess.SP])
    Brain-machine interfaces (BMIs) are promising for motor rehabilitation and mobility augmentation. High-accuracy and low-power algorithms are required to achieve implantable BMI systems. In this paper, we propose a novel spiking neural network (SNN) decoder for implantable BMI regression tasks. The SNN is trained with enhanced spatio-temporal backpropagation to fully leverage its ability in handling temporal problems. The proposed SNN decoder achieves the same level of correlation coefficient as the state-of-the-art ANN decoder in offline finger velocity decoding tasks, while it requires only 6.8% of the computation operations and 9.4% of the memory access.
    ECG for high-throughput screening of multiple diseases: Proof-of-concept using multi-diagnosis deep learning from population-based datasets. (arXiv:2210.06291v1 [eess.SP])
    Electrocardiogram (ECG) abnormalities are linked to cardiovascular diseases, but may also occur in other non-cardiovascular conditions such as mental, neurological, metabolic and infectious conditions. However, most of the recent success of deep learning (DL) based diagnostic predictions in selected patient cohorts has been limited to a small set of cardiac diseases. In this study, we use a population-based dataset of >250,000 patients with >1000 medical conditions and >2 million ECGs to identify a wide range of diseases that could be accurately diagnosed from the patient's first in-hospital ECG. Our DL models uncovered 128 diseases and 68 disease categories with strong discriminative performance.
    Guaranteed Conservation of Momentum for Learning Particle-based Fluid Dynamics. (arXiv:2210.06036v1 [cs.LG])
    We present a novel method for guaranteeing linear momentum in learned physics simulations. Unlike existing methods, we enforce conservation of momentum with a hard constraint, which we realize via antisymmetrical continuous convolutional layers. We combine these strict constraints with a hierarchical network architecture, a carefully constructed resampling scheme, and a training approach for temporal coherence. In combination, the proposed method allows us to increase the physical accuracy of the learned simulator substantially. In addition, the induced physical bias leads to significantly better generalization performance and makes our method more reliable in unseen test cases. We evaluate our method on a range of different, challenging fluid scenarios. Among others, we demonstrate that our approach generalizes to new scenarios with up to one million particles. Our results show that the proposed algorithm can learn complex dynamics while outperforming existing approaches in generalization and training performance. An implementation of our approach is available at https://github.com/tum-pbs/DMCF.
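    The hard constraint rests on a simple fact: if pairwise interactions come from an antisymmetrized kernel, then f_ij = -f_ji and the total momentum change cancels exactly. A small sketch of that mechanism (the paper realizes it inside continuous convolution layers, not the naive double loop below):

        import numpy as np

        def antisymmetric_interactions(positions, kernel):
            """Velocity updates from pairwise antisymmetrized kernel evaluations;
            the per-particle updates sum to zero by construction."""
            n = len(positions)
            dv = np.zeros_like(positions)
            for i in range(n):
                for j in range(n):
                    if i != j:
                        r = positions[j] - positions[i]
                        dv[i] += 0.5 * (kernel(r) - kernel(-r))  # f_ij = -f_ji
            return dv

        pts = np.random.rand(8, 2)
        dv = antisymmetric_interactions(pts, np.tanh)
        assert np.allclose(dv.sum(axis=0), 0.0)  # linear momentum conserved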
    Towards Mining Creative Thinking Patterns from Educational Data. (arXiv:2210.06118v1 [cs.IR])
    Creativity, i.e., the process of generating and developing fresh and original ideas or products that are useful or effective, is a valuable skill in a variety of domains. Creativity is called an essential 21st-century skill that should be taught in schools. The use of educational technology to promote creativity is an active study field, as evidenced by several studies linking creativity in the classroom to beneficial learning outcomes. Despite the burgeoning body of research on adaptive technology for education, mining creative thinking patterns from educational data remains a challenging task. In this paper, to address this challenge, we take a first step toward formalizing educational knowledge by constructing a domain-specific Knowledge Base that captures the essential concepts, facts, and assumptions needed to identify creative patterns. We then introduce a pipeline to contextualize the raw educational data, such as assessments and class activities. Finally, we present a rule-based approach to learning from the Knowledge Base, which facilitates mining creative thinking patterns from contextualized data and knowledge. We evaluate our approach with real-world datasets and highlight how the proposed pipeline can help instructors understand creative thinking patterns from students' activities and assessment tasks.
    Transfer Learning on Heterogeneous Feature Spaces for Treatment Effects Estimation. (arXiv:2210.06183v1 [cs.LG])
    Consider the problem of improving the estimation of conditional average treatment effects (CATE) for a target domain of interest by leveraging related information from a source domain with a different feature space. This heterogeneous transfer learning problem for CATE estimation is ubiquitous in areas such as healthcare where we may wish to evaluate the effectiveness of a treatment for a new patient population for which different clinical covariates and limited data are available. In this paper, we address this problem by introducing several building blocks that use representation learning to handle the heterogeneous feature spaces and a flexible multi-task architecture with shared and private layers to transfer information between potential outcome functions across domains. Then, we show how these building blocks can be used to recover transfer learning equivalents of the standard CATE learners. On a new semi-synthetic data simulation benchmark for heterogeneous transfer learning we not only demonstrate performance improvements of our heterogeneous transfer causal effect learners across datasets, but also provide insights into the differences between these learners from a transfer perspective.
    Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations. (arXiv:2210.05955v1 [stat.ML])
    Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, e.g., identifiability and asymptotic properties of statistical estimation, are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotically normal with an $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, i.e., inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.
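    As a rough illustration of why equally spaced, noise-free observations identify such a system (a toy sketch, not the paper's NLS estimator; the system matrix, step size, and trajectory length are arbitrary choices here): fit the one-step transition matrix $M \approx e^{A\Delta t}$ by least squares and recover $A$ via the matrix logarithm, which is well defined when the eigenvalues of $A\Delta t$ stay within the principal branch.

    ```python
    import numpy as np
    from scipy.linalg import expm, logm

    rng = np.random.default_rng(0)
    A_true = np.array([[-0.5, 1.0], [-1.0, -0.5]])   # hypothetical system matrix
    dt = 0.1
    M = expm(A_true * dt)                            # one-step transition e^{A dt}

    # noise-free, equally spaced observations from a single trajectory
    x = [rng.normal(size=2)]
    for _ in range(200):
        x.append(M @ x[-1])
    X = np.array(x)

    # least squares fits B with X[:-1] @ B = X[1:], i.e. B = M^T
    B, *_ = np.linalg.lstsq(X[:-1], X[1:], rcond=None)
    A_hat = (logm(B.T) / dt).real                    # principal matrix log recovers A
    print(np.round(A_hat, 3))                        # close to A_true
    ```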
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v1 [stat.ML])
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.
    SpecRNet: Towards Faster and More Accessible Audio DeepFake Detection. (arXiv:2210.06105v1 [cs.SD])
    Audio DeepFakes are utterances generated with the use of deep neural networks. They are highly misleading and pose a threat due to their use in fake news, impersonation, or extortion. In this work, we focus on increasing the accessibility of audio DeepFake detection methods by providing SpecRNet, a neural network architecture characterized by a quick inference time and low computational requirements. Our benchmark shows that SpecRNet, requiring up to about 40% less time to process an audio sample, provides performance comparable to LCNN architecture - one of the best audio DeepFake detection models. Such a method can not only be used by online multimedia services to verify the large volume of content uploaded daily but also, thanks to its low requirements, by average citizens to evaluate materials on their devices. In addition, we provide benchmarks in three unique settings that confirm the correctness of our model. They reflect scenarios of low-resource datasets, detection on short utterances, and a limited-attacks benchmark in which we take a closer look at the influence of particular attacks on given architectures.
    Boosting Graph Neural Networks via Adaptive Knowledge Distillation. (arXiv:2210.05920v1 [cs.LG])
    Graph neural networks (GNNs) have shown remarkable performance on diverse graph mining tasks. Although different GNNs can be unified as the same message passing framework, they learn complementary knowledge from the same graph. Knowledge distillation (KD) is developed to combine the diverse knowledge from multiple models. It transfers knowledge from high-capacity teachers to a lightweight student. However, to avoid oversmoothing, GNNs are often shallow, which deviates from the setting of KD. In this context, we revisit KD by separating its benefits from model compression and emphasizing its power of transferring knowledge. To this end, we need to tackle two challenges: how to transfer knowledge from compact teachers to a student with the same capacity; and, how to exploit student GNN's own strength to learn knowledge. In this paper, we propose a novel adaptive KD framework, called BGNN, which sequentially transfers knowledge from multiple GNNs into a student GNN. We also introduce an adaptive temperature module and a weight boosting module. These modules guide the student to the appropriate knowledge for effective learning. Extensive experiments have demonstrated the effectiveness of BGNN. In particular, we achieve up to 3.05% improvement for node classification and 7.67% improvement for graph classification over vanilla GNNs.
    The Role of Exploration for Task Transfer in Reinforcement Learning. (arXiv:2210.06168v1 [cs.LG])
    The exploration--exploitation trade-off in reinforcement learning (RL) is a well-known and much-studied problem that balances greedy action selection with novel experience, and the study of exploration methods is usually only considered in the context of learning the optimal policy for a single learning task. However, in the context of online task transfer, where there is a change to the task during online operation, we hypothesize that exploration strategies that anticipate the need to adapt to future tasks can have a pronounced impact on the efficiency of transfer. As such, we re-examine the exploration--exploitation trade-off in the context of transfer learning. In this work, we review reinforcement learning exploration methods, define a taxonomy with which to organize them, analyze these methods' differences in the context of task transfer, and suggest avenues for future investigation.
    Generalised Mutual Information for Discriminative Clustering. (arXiv:2210.06300v1 [stat.ML])
    Over the last decade, successes in deep clustering have largely relied on mutual information (MI) as an unsupervised objective for training neural networks, combined with increasingly strong regularisations. While the quality of these regularisations has been widely discussed, little attention has been dedicated to the relevance of MI as a clustering objective. In this paper, we first highlight how the maximisation of MI does not lead to satisfying clusters, and identify the Kullback-Leibler divergence as the main reason for this behaviour. Hence, we generalise mutual information by changing its core distance, introducing the generalised mutual information (GEMINI): a set of metrics for unsupervised neural network training. Unlike MI, some GEMINIs do not require regularisation during training. Some of these metrics are geometry-aware thanks to distances or kernels in the data space. Finally, we highlight that GEMINIs can automatically select a relevant number of clusters, a property that has received little study in the deep clustering context, where the number of clusters is a priori unknown.
    Digital twins of nonlinear dynamical systems. (arXiv:2210.06144v1 [nlin.AO])
    We articulate the design imperatives for machine-learning based digital twins for nonlinear dynamical systems subject to external driving, which can be used to monitor the ``health'' of the target system and anticipate its future collapse. We demonstrate that, with single or parallel reservoir computing configurations, the digital twins are capable of challenging forecasting and monitoring tasks. Employing prototypical systems from climate, optics and ecology, we show that the digital twins can extrapolate the dynamics of the target system to certain parameter regimes never experienced before, make continual forecasting/monitoring with sparse real-time updates under non-stationary external driving, infer hidden variables and accurately predict their dynamical evolution, adapt to different forms of external driving, and extrapolate the global bifurcation behaviors to systems of some different sizes. These features make our digital twins appealing in significant applications such as monitoring the health of critical systems and forecasting their potential collapse induced by environmental changes.
    Privacy of federated QR decomposition using additive secure multiparty computation. (arXiv:2210.06163v1 [cs.CR])
    Federated learning (FL) is a privacy-aware data mining strategy that keeps the private data on the owners' machines, thereby preserving confidentiality. The clients compute local models and send them to an aggregator, which computes a global model. In hybrid FL, the local parameters are additionally masked using secure aggregation, such that only the global aggregated statistics become available in clear text, not the client-specific updates. Federated QR decomposition has not been studied extensively in the context of cross-silo federated learning. In this article, we investigate the suitability of three QR decomposition algorithms for cross-silo FL and suggest a privacy-aware QR decomposition scheme based on the Gram-Schmidt algorithm that does not blatantly leak raw data. We apply the algorithm to compute linear regression in a federated manner.
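    To make the masking idea concrete, here is a minimal sketch of additive secure aggregation (an illustration of the general technique, not the paper's protocol; the values and mask distribution are placeholders): pairwise masks with $r_{ij} = -r_{ji}$ cancel in the sum, so the aggregator learns only the aggregate of the local quantities, such as the inner products a federated Gram-Schmidt step needs.

    ```python
    import numpy as np

    def masked_shares(local_values, rng):
        """Each client adds pairwise masks r_ij with r_ij = -r_ji, so every
        per-client share looks random but the shares still sum to the true total."""
        n = len(local_values)
        masks = np.zeros(n)
        for i in range(n):
            for j in range(i + 1, n):
                r = rng.normal()
                masks[i] += r
                masks[j] -= r
        return [v + m for v, m in zip(local_values, masks)]

    rng = np.random.default_rng(1)
    local_dots = [3.2, -1.1, 0.7]   # e.g. local <x, q> terms in a Gram-Schmidt step
    shares = masked_shares(local_dots, rng)
    assert np.isclose(sum(shares), sum(local_dots))
    ```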
    Resolving the Approximability of Offline and Online Non-monotone DR-Submodular Maximization over General Convex Sets. (arXiv:2210.05965v1 [cs.DS])
    In recent years, the maximization of DR-submodular continuous functions has become an important research field, with many real-world applications in the domains of machine learning, communication systems, operations research and economics. Most works in this field study maximization subject to down-closed convex set constraints due to an inapproximability result by Vondr\'ak (2013). However, Durr et al. (2021) showed that one can bypass this inapproximability by proving approximation ratios that are functions of $m$, the minimum $\ell_{\infty}$-norm of any feasible vector. Given this observation, it is possible to get results for maximizing a DR-submodular function subject to general convex set constraints, which has led to multiple works on this problem, the most recent of which is a polynomial-time $\tfrac{1}{4}(1 - m)$-approximation offline algorithm due to Du (2022). However, only a sub-exponential-time $\tfrac{1}{3\sqrt{3}}(1 - m)$-approximation algorithm is known for the corresponding online problem. In this work, we present a polynomial-time online algorithm matching the $\tfrac{1}{4}(1 - m)$-approximation of the state-of-the-art offline algorithm. We also present an inapproximability result showing that our online algorithm and Du's (2022) offline algorithm are both optimal in a strong sense. Finally, we study the empirical performance of our algorithm and the algorithm of Du (which was only studied theoretically before), and show that they consistently outperform previously suggested algorithms on revenue maximization, location summarization and quadratic programming applications.
    Deep Counterfactual Estimation with Categorical Background Variables. (arXiv:2210.05811v1 [cs.LG])
    Referred to as the third rung of the causal inference ladder, counterfactual queries typically ask the "What if?" question retrospectively. The standard approach to estimating counterfactuals resides in using a structural equation model that accurately reflects the underlying data generating process. However, such models are seldom available in practice and one usually wishes to infer them from observational data alone. Unfortunately, the correct structural equation model is in general not identifiable from the observed factual distribution. Nevertheless, in this work, we show that under the assumption that the main latent contributors to the treatment responses are categorical, the counterfactuals can still be reliably predicted. Building upon this assumption, we introduce CounterFactual Query Prediction (CFQP), a novel method to infer counterfactuals from continuous observations when the background variables are categorical. We show that our method significantly outperforms previously available deep-learning-based counterfactual methods, both theoretically and empirically on time series and image data. Our code is available at https://github.com/edebrouwer/cfqp.
    Generative Adversarial Nets: Can we generate a new dataset based on only one training set?. (arXiv:2210.06005v1 [cs.LG])
    A generative adversarial network (GAN) is a class of machine learning frameworks designed by Goodfellow et al. in 2014. In the GAN framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample is from the model distribution or the data distribution. A GAN generates new samples from the same distribution as the training set. In this work, we aim to generate a new dataset that has a different distribution from the training set. In addition, the Jensen-Shannon divergence between the distributions of the generated and training datasets can be controlled by some target $\delta \in [0, 1]$. Our work is motivated by applications in generating new kinds of rice with characteristics similar to those of good rice.
    BORA: Bayesian Optimization for Resource Allocation. (arXiv:2210.05977v1 [cs.LG])
    Optimal resource allocation is gaining renewed interest due to its relevance as a core problem in managing, over time, cloud and high-performance computing facilities. Semi-Bandit Feedback (SBF) is the reference method for efficiently solving this problem. In this paper we propose (i) an extension of optimal resource allocation to a more general class of problems, specifically with resource availability changing over time, and (ii) Bayesian Optimization as a more efficient alternative to SBF. Three algorithms for Bayesian Optimization for Resource Allocation, collectively named BORA, are presented, working on allocation decisions represented as numerical vectors or distributions. The second option requires the Wasserstein distance as a more suitable metric within one of the BORA algorithms. Results on (i) the original SBF case study proposed in the literature, and (ii) a real-life application (i.e., the optimization of multi-channel marketing) empirically prove that BORA is a more efficient and effective learning-and-optimization framework than SBF.
    Short-term prediction of stream turbidity using surrogate data and a meta-model approach. (arXiv:2210.05821v1 [stat.ML])
    Many water-quality monitoring programs aim to measure turbidity to help guide effective management of waterways and catchments, yet distributing turbidity sensors throughout networks is typically cost prohibitive. To this end, we built and compared the ability of dynamic regression (ARIMA), long short-term memory neural nets (LSTM), and generalized additive models (GAM) to forecast stream turbidity one step ahead, using surrogate data from relatively low-cost in-situ sensors and publicly available databases. We iteratively trialled combinations of four surrogate covariates (rainfall, water level, air temperature and total global solar exposure) selecting a final model for each type that minimised the corrected Akaike Information Criterion. Cross-validation using a rolling time-window indicated that ARIMA, which included the rainfall and water-level covariates only, produced the most accurate predictions, followed closely by GAM, which included all four covariates. We constructed a meta-model, trained on time-series features of turbidity, to take advantage of the strengths of each model over different time points and predict the best model (that with the lowest forecast error one-step prior) for each time step. The meta-model outperformed all other models, indicating that this methodology can yield high accuracy and may be a viable alternative to using measurements sourced directly from turbidity-sensors where costs prohibit their deployment and maintenance, and when predicting turbidity across the short term. Our findings also indicated that temperature and light-associated variables, for example underwater illuminance, may hold promise as cost-effective, high-frequency surrogates of turbidity, especially when combined with other covariates, like rainfall, that are typically measured at coarse levels of spatial resolution.
    Efficient Adversarial Training without Attacking: Worst-Case-Aware Robust Reinforcement Learning. (arXiv:2210.05927v1 [cs.LG])
    Recent studies reveal that a well-trained deep reinforcement learning (RL) policy can be particularly vulnerable to adversarial perturbations on input observations. Therefore, it is crucial to train RL agents that are robust against any attacks with a bounded budget. Existing robust training methods in deep RL either treat correlated steps separately, ignoring the robustness of long-term rewards, or train the agents and RL-based attacker together, doubling the computational burden and sample complexity of the training process. In this work, we propose a strong and efficient robust training framework for RL, named Worst-case-aware Robust RL (WocaR-RL), that directly estimates and optimizes the worst-case reward of a policy under bounded $\ell_p$ attacks without requiring extra samples for learning an attacker. Experiments on multiple environments show that WocaR-RL achieves state-of-the-art performance under various strong attacks, and obtains significantly higher training efficiency than prior state-of-the-art robust training methods. The code of this work is available at https://github.com/umd-huang-lab/WocaR-RL.
    Few-shot Backdoor Attacks via Neural Tangent Kernels. (arXiv:2210.05929v1 [cs.LG])
    In a backdoor attack, an attacker injects corrupted examples into the training set. The goal of the attacker is to cause the final trained model to predict the attacker's desired target label when a predefined trigger is added to test inputs. Central to these attacks is the trade-off between the success rate of the attack and the number of corrupted training examples injected. We pose this attack as a novel bilevel optimization problem: construct strong poison examples that maximize the attack success rate of the trained model. We use neural tangent kernels to approximate the training dynamics of the model being attacked and automatically learn strong poison examples. We experiment on subclasses of CIFAR-10 and ImageNet with WideResNet-34 and ConvNeXt architectures on periodic and patch trigger attacks and show that NTBA-designed poisoned examples achieve, for example, an attack success rate of 90% with ten times fewer injected poison examples than the baseline. We provide an interpretation of the NTBA-designed attacks using the analysis of kernel linear regression. We further demonstrate a vulnerability in overparametrized deep neural networks, which is revealed by the shape of the neural tangent kernel.
    Optimizing Evaluation Metrics for Multi-Task Learning via the Alternating Direction Method of Multipliers. (arXiv:2210.05935v1 [cs.LG])
    Multi-task learning (MTL) aims to improve the generalization performance of multiple tasks by exploiting the shared factors among them. Various metrics (e.g., F-score, Area Under the ROC Curve) are used to evaluate the performance of MTL methods. Most existing MTL methods try to minimize either the misclassification errors for classification or the mean squared errors for regression. In this paper, we propose a method to directly optimize the evaluation metrics for a large family of MTL problems. The formulation of MTL that directly optimizes evaluation metrics is the combination of two parts: (1) a regularizer defined on the weight matrix over all tasks, in order to capture the relatedness of these tasks; and (2) a sum of multiple structured hinge losses, each corresponding to a surrogate of some evaluation metric on one task. This formulation is challenging to optimize because both of its parts are non-smooth. To tackle this issue, we propose a novel optimization procedure based on the alternating direction scheme of multipliers, where we decompose the whole optimization problem into a sub-problem corresponding to the regularizer and another sub-problem corresponding to the structured hinge losses. For a large family of MTL problems, the first sub-problem has closed-form solutions. To solve the second sub-problem, we propose an efficient primal-dual algorithm via coordinate ascent. Extensive evaluation results demonstrate that, in a large family of MTL problems, directly optimizing evaluation metrics yields superior performance over the corresponding baseline methods.
    When are Local Queries Useful for Robust Learning?. (arXiv:2210.06089v1 [cs.LG])
    Distributional assumptions have been shown to be necessary for the robust learnability of concept classes when considering the exact-in-the-ball robust risk and access to random examples by Gourdeau et al. (2019). In this paper, we study learning models where the learner is given more power through the use of local queries, and give the first distribution-free algorithms that perform robust empirical risk minimization (ERM) for this notion of robustness. The first learning model we consider uses local membership queries (LMQ), where the learner can query the label of points near the training sample. We show that, under the uniform distribution, LMQs do not increase the robustness threshold of conjunctions and any superclass, e.g., decision lists and halfspaces. Faced with this negative result, we introduce the local equivalence query (LEQ) oracle, which returns whether the hypothesis and target concept agree in the perturbation region around a point in the training sample, as well as a counterexample if it exists. We show a separation result: on one hand, if the query radius $\lambda$ is strictly smaller than the adversary's perturbation budget $\rho$, then distribution-free robust learning is impossible for a wide variety of concept classes; on the other hand, the setting $\lambda=\rho$ allows us to develop robust ERM algorithms. We then bound the query complexity of these algorithms based on online learning guarantees and further improve these bounds for the special case of conjunctions. We finish by giving robust learning algorithms for halfspaces with margins on both $\{0,1\}^n$ and $\mathbb{R}^n$.
    SlotFormer: Unsupervised Visual Dynamics Simulation with Object-Centric Models. (arXiv:2210.05861v1 [cs.CV])
    Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer -- a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the unsupervised SlotFormer's dynamics model can be used to improve the performance on supervised downstream tasks, such as Visual Question Answering (VQA), and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high quality visual generation. Besides, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, which is competitive with methods designed specifically for such tasks.
    Equal Experience in Recommender Systems. (arXiv:2210.05936v1 [cs.LG])
    We explore the fairness issue that arises in recommender systems. Biased data due to inherent stereotypes of particular groups (e.g., male students' average rating on mathematics is often higher than that on humanities, and vice versa for females) may yield a limited scope of suggested items to a certain group of users. Our main contribution lies in the introduction of a novel fairness notion (that we call equal experience), which can serve to regulate such unfairness in the presence of biased data. The notion captures the degree of the equal experience of item recommendations across distinct groups. We propose an optimization framework that incorporates the fairness notion as a regularization term, as well as introduce computationally-efficient algorithms that solve the optimization. Experiments on synthetic and benchmark real datasets demonstrate that the proposed framework can indeed mitigate such unfairness while exhibiting a minor degradation of recommendation accuracy.
    TetGAN: A Convolutional Neural Network for Tetrahedral Mesh Generation. (arXiv:2210.05735v1 [cs.CV])
    We present TetGAN, a convolutional neural network designed to generate tetrahedral meshes. We represent shapes using an irregular tetrahedral grid which encodes an occupancy and displacement field. Our formulation enables defining tetrahedral convolution, pooling, and upsampling operations to synthesize explicit mesh connectivity with variable topological genus. The proposed neural network layers learn deep features over each tetrahedron and learn to extract patterns within spatial regions across multiple scales. We illustrate the capabilities of our technique to encode tetrahedral meshes into a semantically meaningful latent-space which can be used for shape editing and synthesis. Our project page is at https://threedle.github.io/tetGAN/.
    SARAH-based Variance-reduced Algorithm for Stochastic Finite-sum Cocoercive Variational Inequalities. (arXiv:2210.05994v1 [math.OC])
    Variational inequalities are a broad formalism that encompasses a vast number of applications. Motivated by applications in machine learning and beyond, stochastic methods are of great importance. In this paper we consider the problem of stochastic finite-sum cocoercive variational inequalities. For this class of problems, we investigate the convergence of the method based on the SARAH variance reduction technique. We show that for strongly monotone problems it is possible to achieve linear convergence to a solution using this method. Experiments confirm the importance and practical applicability of our approach.
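    For reference, here is the SARAH recursion itself, sketched on a toy finite-sum minimization problem (the paper applies the same estimator to cocoercive variational inequality operators; the quadratic objective, step size, and epoch count here are illustrative assumptions):

    ```python
    import numpy as np

    def sarah(grads, x0, lr=0.05, epochs=20, rng=np.random.default_rng(0)):
        """SARAH variance-reduced updates for min (1/n) sum_i f_i(x).
        grads: list of per-sample gradient (or operator) evaluations."""
        n, x = len(grads), x0.copy()
        for _ in range(epochs):
            v = sum(g(x) for g in grads) / n            # full gradient at epoch start
            x_prev, x = x, x - lr * v
            for _ in range(n):
                i = rng.integers(n)
                v = grads[i](x) - grads[i](x_prev) + v  # recursive SARAH estimator
                x_prev, x = x, x - lr * v
        return x

    # toy problem: f_i(x) = 0.5 * (a_i^T x - b_i)^2
    rng = np.random.default_rng(1)
    A, b = rng.normal(size=(8, 3)), rng.normal(size=8)
    grads = [lambda x, a=A[i], bi=b[i]: a * (a @ x - bi) for i in range(8)]
    x_star = sarah(grads, np.zeros(3))
    ```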
    C-Mixup: Improving Generalization in Regression. (arXiv:2210.05775v1 [cs.LG])
    Improving the generalization of deep networks is an important open challenge, particularly in domains without plentiful data. The mixup algorithm improves generalization by linearly interpolating a pair of examples and their corresponding labels. These interpolated examples augment the original training set. Mixup has shown promising results in various classification tasks, but systematic analysis of mixup in regression remains underexplored. Using mixup directly on regression labels can result in arbitrarily incorrect labels. In this paper, we propose a simple yet powerful algorithm, C-Mixup, to improve generalization on regression tasks. In contrast with vanilla mixup, which picks training examples for mixing with uniform probability, C-Mixup adjusts the sampling probability based on the similarity of the labels. Our theoretical analysis confirms that C-Mixup with label similarity obtains a smaller mean square error in supervised regression and meta-regression than vanilla mixup and using feature similarity. Another benefit of C-Mixup is that it can improve out-of-distribution robustness, where the test distribution is different from the training distribution. By selectively interpolating examples with similar labels, it mitigates the effects of domain-associated information and yields domain-invariant representations. We evaluate C-Mixup on eleven datasets, ranging from tabular to video data. Compared to the best prior approach, C-Mixup achieves 6.56%, 4.76%, 5.82% improvements in in-distribution generalization, task generalization, and out-of-distribution robustness, respectively. Code is released at https://github.com/huaxiuyao/C-Mixup.
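    A minimal sketch of the sampling rule (the bandwidth, Beta parameter, and toy data are assumptions; the implementation at the linked repository is more complete): each example's mixing partner is drawn with probability proportional to a Gaussian kernel on label distance, after which standard mixup interpolation applies.

    ```python
    import numpy as np

    def c_mixup_batch(X, y, sigma=1.0, alpha=2.0, rng=np.random.default_rng(0)):
        """For each example, sample a mixing partner with probability
        proportional to a Gaussian kernel on label distance, then mix."""
        n = len(y)
        X_mix, y_mix = np.empty_like(X), np.empty_like(y)
        for i in range(n):
            w = np.exp(-((y - y[i]) ** 2) / (2 * sigma**2))
            w[i] = 0.0                        # never mix an example with itself
            j = rng.choice(n, p=w / w.sum())
            lam = rng.beta(alpha, alpha)
            X_mix[i] = lam * X[i] + (1 - lam) * X[j]
            y_mix[i] = lam * y[i] + (1 - lam) * y[j]
        return X_mix, y_mix

    X, y = np.random.randn(64, 10), np.random.randn(64)
    Xm, ym = c_mixup_batch(X, y)
    ```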
    Efficient Offline Policy Optimization with a Learned Model. (arXiv:2210.05980v1 [cs.LG])
    MuZero Unplugged presents a promising approach for offline policy learning from logged data. It conducts Monte-Carlo Tree Search (MCTS) with a learned model and leverages the Reanalyze algorithm to learn purely from offline data. For good performance, MCTS requires accurate learned models and a large number of simulations, thus incurring substantial computation time. This paper investigates several hypotheses on why MuZero Unplugged may not work well under offline RL settings, including 1) learning with limited data coverage; 2) learning from offline data of stochastic environments; 3) improperly parameterized models given the offline data; and 4) a low compute budget. We propose to use a regularized one-step look-ahead approach to tackle the above issues. Instead of planning with the expensive MCTS, we use the learned model to construct an advantage estimation based on a one-step rollout. Policy improvement moves in the direction that maximizes the estimated advantage, regularized towards the dataset. We conduct extensive empirical studies with BSuite environments to verify the hypotheses and then run our algorithm on the RL Unplugged Atari benchmark. Experimental results show that our proposed approach achieves stable performance even with an inaccurate learned model. On the large-scale Atari benchmark, the proposed method outperforms MuZero Unplugged by 43%. Most significantly, it uses only 5.6% of the wall-clock time (i.e., 1 hour) compared to MuZero Unplugged (i.e., 17.8 hours) to achieve a 150% IQM normalized score with the same hardware and software stacks.
    FasterRisk: Fast and Accurate Interpretable Risk Scores. (arXiv:2210.05846v1 [cs.LG])
    Over the last century, risk scores have been the most popular form of predictive model used in healthcare and criminal justice. Risk scores are sparse linear models with integer coefficients; often these models can be memorized or placed on an index card. Typically, risk scores have been created either without data or by rounding logistic regression coefficients, but these methods do not reliably produce high-quality risk scores. Recent work used mathematical programming, which is computationally slow. We introduce an approach for efficiently producing a collection of high-quality risk scores learned from data. Specifically, our approach produces a pool of almost-optimal sparse continuous solutions, each with a different support set, using a beam-search algorithm. Each of these continuous solutions is transformed into a separate risk score through a "star ray" search, where a range of multipliers are considered before rounding the coefficients sequentially to maintain low logistic loss. Our algorithm returns all of these high-quality risk scores for the user to consider. This method completes within minutes and can be valuable in a broad variety of applications.
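    As a rough, simplified sketch of the multiplier-then-round step (the paper's beam search over support sets and "star ray" search are considerably more sophisticated; the multiplier grid and coordinate order below are assumptions): scale the continuous coefficients by each candidate multiplier, round them one coordinate at a time keeping whichever of floor/ceil yields the lower logistic loss, and keep the best-scoring integer model.

    ```python
    import numpy as np

    def logistic_loss(w, X, y):                      # labels y in {-1, +1}
        return np.log1p(np.exp(-y * (X @ w))).mean()

    def round_with_multipliers(w_cont, X, y, multipliers=(1, 2, 3, 5)):
        """Scale by each multiplier, round sequentially to keep logistic
        loss low, and return the best integer score and its multiplier."""
        best_w, best_m, best_loss = None, None, np.inf
        for m in multipliers:
            w = w_cont * m
            for k in range(len(w)):
                cands = (np.floor(w[k]), np.ceil(w[k]))
                losses = []
                for c in cands:
                    trial = w.copy(); trial[k] = c
                    losses.append(logistic_loss(trial / m, X, y))
                w[k] = cands[int(np.argmin(losses))]
            loss = logistic_loss(w / m, X, y)
            if loss < best_loss:
                best_w, best_m, best_loss = w.astype(int), m, loss
        return best_w, best_m

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = np.sign(X @ np.array([1.5, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=200))
    w_cont = np.array([1.4, -2.1, 0.6, 0.05])        # e.g. from logistic regression
    print(round_with_multipliers(w_cont, X, y))
    ```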
    Dynamic Ensemble Size Adjustment for Memory Constrained Mondrian Forest. (arXiv:2210.05704v1 [cs.LG])
    Supervised learning algorithms generally assume the availability of enough memory to store data models during the training and test phases. However, this assumption is unrealistic when data comes in the form of infinite data streams, or when learning algorithms are deployed on devices with reduced amounts of memory. Such memory constraints impact the model behavior and assumptions. In this paper, we show that under memory constraints, increasing the size of a tree-based ensemble classifier can worsen its performance. In particular, we experimentally show the existence of an optimal ensemble size for a memory-bounded Mondrian forest on data streams and we design an algorithm to guide the forest toward that optimal number by using an estimation of overfitting. We tested different variations for this algorithm on a variety of real and simulated datasets, and we conclude that our method can achieve up to 95% of the performance of an optimally-sized Mondrian forest for stable datasets, and can even outperform it for datasets with concept drifts. All our methods are implemented in the OrpailleCC open-source library and are ready to be used on embedded systems and connected objects.
    Hate-CLIPper: Multimodal Hateful Meme Classification based on Cross-modal Interaction of CLIP Features. (arXiv:2210.05916v1 [cs.CL])
    Hateful memes are a growing menace on social media. While the image and its corresponding text in a meme are related, they do not necessarily convey the same meaning when viewed individually. Hence, detecting hateful memes requires careful consideration of both visual and textual information. Multimodal pre-training can be beneficial for this task because it effectively captures the relationship between the image and the text by representing them in a similar feature space. Furthermore, it is essential to model the interactions between the image and text features through intermediate fusion. Most existing methods either employ multimodal pre-training or intermediate fusion, but not both. In this work, we propose the Hate-CLIPper architecture, which explicitly models the cross-modal interactions between the image and text representations obtained using Contrastive Language-Image Pre-training (CLIP) encoders via a feature interaction matrix (FIM). A simple classifier based on the FIM representation is able to achieve state-of-the-art performance on the Hateful Memes Challenge (HMC) dataset with an AUROC of 85.8, which even surpasses the human performance of 82.65. Experiments on other meme datasets such as Propaganda Memes and TamilMemes also demonstrate the generalizability of the proposed approach. Finally, we analyze the interpretability of the FIM representation and show that cross-modal interactions can indeed facilitate the learning of meaningful concepts. The code for this work is available at https://github.com/gokulkarthik/hateclipper.
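    A minimal sketch of a feature-interaction-matrix head (the dimensions and projections are assumptions, and random tensors stand in for actual CLIP embeddings): project the image and text embeddings, take their outer product per sample, and classify the flattened interaction map.

    ```python
    import torch
    import torch.nn as nn

    class FIMClassifier(nn.Module):
        """Sketch of a feature-interaction-matrix head: project image and
        text embeddings, take their outer product, classify the flattened map."""
        def __init__(self, dim_in=512, dim=64, n_classes=2):
            super().__init__()
            self.proj_img = nn.Linear(dim_in, dim)
            self.proj_txt = nn.Linear(dim_in, dim)
            self.head = nn.Linear(dim * dim, n_classes)

        def forward(self, img_emb, txt_emb):
            u = self.proj_img(img_emb)                 # (B, d)
            v = self.proj_txt(txt_emb)                 # (B, d)
            fim = torch.einsum("bi,bj->bij", u, v)     # (B, d, d) interaction matrix
            return self.head(fim.flatten(1))

    model = FIMClassifier()
    logits = model(torch.randn(4, 512), torch.randn(4, 512))  # CLIP-feature stand-ins
    ```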
    Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. (arXiv:2210.05918v1 [cs.LG])
    We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From this analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.
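    A minimal sketch of tail-averaged TD(0) with linear function approximation (the features, step size, burn-in fraction, and synthetic transitions are illustrative assumptions, not the paper's setup): run the usual TD update, then average only the iterates produced after a burn-in point.

    ```python
    import numpy as np

    def tail_averaged_td(steps, phi, theta0, lr=0.05, gamma=0.99, burn_in=0.5):
        """TD(0) with linear features; returns the average of the iterates
        produced after a burn-in fraction (the tail average)."""
        theta, iterates = theta0.copy(), []
        for s, r, s_next in steps:
            td_err = r + gamma * phi(s_next) @ theta - phi(s) @ theta
            theta = theta + lr * td_err * phi(s)
            iterates.append(theta.copy())
        k = int(burn_in * len(iterates))
        return np.mean(iterates[k:], axis=0)

    phi = lambda s: np.eye(3)[s]                     # tabular features, 3 states
    rng = np.random.default_rng(0)
    states = rng.integers(0, 3, size=2001)           # toy transition stream
    steps = [(states[t], 1.0, states[t + 1]) for t in range(2000)]
    theta_bar = tail_averaged_td(steps, phi, np.zeros(3))
    ```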
    Vote'n'Rank: Revision of Benchmarking with Social Choice Theory. (arXiv:2210.05769v1 [cs.LG])
    The development of state-of-the-art systems in different applied areas of machine learning (ML) is driven by benchmarks, which have shaped the paradigm of evaluating generalisation capabilities from multiple perspectives. Although the paradigm is shifting towards more fine-grained evaluation across diverse tasks, the delicate question of how to aggregate the performances has received particular interest in the community. In general, benchmarks follow unspoken utilitarian principles, where the systems are ranked based on their mean average score over task-specific metrics. Such an aggregation procedure has been viewed as a sub-optimal evaluation protocol, which may have created the illusion of progress. This paper proposes Vote'n'Rank, a framework for ranking systems in multi-task benchmarks under the principles of social choice theory. We demonstrate that our approach can be efficiently utilised to draw new insights on benchmarking in several ML sub-fields and identify the best-performing systems in research and development case studies. The Vote'n'Rank procedures are more robust than the mean average, while being able to handle missing performance scores and determine conditions under which the system becomes the winner.
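    As one concrete member of the social-choice family such a framework draws on, here is a plain Borda count over per-task rankings (granting no points for a missing score is an assumption for illustration, not necessarily the paper's treatment of missing values):

    ```python
    import numpy as np

    def borda(scores):
        """scores: (n_systems, n_tasks) array, NaN = missing.
        On each task, rank r (0 = best) among valid systems earns
        n_valid - 1 - r points; totals give the aggregate ranking."""
        n, m = scores.shape
        points = np.zeros(n)
        for t in range(m):
            col = scores[:, t]
            valid = ~np.isnan(col)
            order = np.argsort(-col[valid])            # best first
            idx = np.flatnonzero(valid)[order]
            points[idx] += np.arange(valid.sum() - 1, -1, -1)
        return points                                  # higher = better overall

    scores = np.array([[0.9, 0.4, np.nan],
                       [0.8, 0.6, 0.7],
                       [0.7, 0.5, 0.9]])
    print(borda(scores))
    ```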
    Contrastive introspection (ConSpec) to rapidly identify invariant steps for success. (arXiv:2210.05845v1 [cs.LG])
    Reinforcement learning (RL) algorithms have achieved notable success in recent years, but still struggle with fundamental issues in long-term credit assignment. It remains difficult to learn in situations where success is contingent upon multiple critical steps that are distant in time from each other and from a sparse reward, as is often the case in real life. Moreover, how RL algorithms assign credit in these difficult situations is typically not coded in a way that can rapidly generalize to new situations. Here, we present an approach using offline contrastive learning, which we call contrastive introspection (ConSpec), that can be added to any existing RL algorithm and addresses both issues. In ConSpec, a contrastive loss is used during offline replay to identify invariances among successful episodes. This takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon than it is to prospectively predict reward at every step taken in the environment. ConSpec stores this knowledge in a collection of prototypes summarizing the intermediate states required for success. During training, arrival at any state that matches these prototypes generates an intrinsic reward that is added to any external rewards. Moreover, the reward shaping provided by ConSpec can be made to preserve the optimal policy of the underlying RL agent. The prototypes in ConSpec provide two key benefits for credit assignment: (1) they enable rapid identification of all the critical states, and (2) they do so in a readily interpretable manner, enabling out-of-distribution generalization when sensory features are altered. In summary, ConSpec is a modular system that can be added to any existing RL algorithm to improve its long-term credit assignment.
    Boosting the Transferability of Adversarial Attacks with Reverse Adversarial Perturbation. (arXiv:2210.05968v1 [cs.CV])
    Deep neural networks (DNNs) have been shown to be vulnerable to adversarial examples, which can produce erroneous predictions by injecting imperceptible perturbations. In this work, we study the transferability of adversarial examples, which is significant due to its threat to real-world applications where the model architecture or parameters are usually unknown. Many existing works reveal that adversarial examples are likely to overfit the surrogate model that they are generated from, limiting their transfer attack performance against different target models. To mitigate the overfitting of the surrogate model, we propose a novel attack method, dubbed reverse adversarial perturbation (RAP). Specifically, instead of minimizing the loss of a single adversarial point, we advocate seeking an adversarial example located in a region of uniformly low loss, by injecting the worst-case perturbation (the reverse adversarial perturbation) at each step of the optimization procedure. The adversarial attack with RAP is formulated as a min-max bi-level optimization problem. By integrating RAP into the iterative process for attacks, our method can find more stable adversarial examples that are less sensitive to changes of the decision boundary, mitigating the overfitting of the surrogate model. Comprehensive experimental comparisons demonstrate that RAP can significantly boost adversarial transferability. Furthermore, RAP can be naturally combined with many existing black-box attack techniques to further boost transferability. When attacking a real-world image recognition system, the Google Cloud Vision API, we obtain a 22% performance improvement of targeted attacks over the compared method. Our codes are available at https://github.com/SCLBD/Transfer_attack_RAP.
    AMICO: Amodal Instance Composition. (arXiv:2210.05828v1 [cs.CV])
    Image composition aims to blend multiple objects to form a harmonized image. Existing approaches often assume precisely segmented and intact objects. Such assumptions, however, are hard to satisfy in unconstrained scenarios. We present Amodal Instance Composition for compositing imperfect -- potentially incomplete and/or coarsely segmented -- objects onto a target image. We first develop object shape prediction and content completion modules to synthesize the amodal contents. We then propose a neural composition model to blend the objects seamlessly. Our primary technical novelty lies in using separate foreground/background representations and blending mask prediction to alleviate segmentation errors. Our results show state-of-the-art performance on public COCOA and KINS benchmarks and attain favorable visual results across diverse scenes. We demonstrate various image composition applications such as object insertion and de-occlusion.
    Comparison of Soft and Hard Target RNN-T Distillation for Large-scale ASR. (arXiv:2210.05793v1 [cs.LG])
    Knowledge distillation is an effective machine learning technique to transfer knowledge from a teacher model to a smaller student model, especially with unlabeled data. In this paper, we focus on knowledge distillation for the RNN-T model, which is widely used in state-of-the-art (SoTA) automatic speech recognition (ASR). Specifically, we compared using soft and hard target distillation to train large-scale RNN-T models on the LibriSpeech/LibriLight public dataset (60k hours) and our in-house data (600k hours). We found that hard targets are more effective when the teacher and student have different architectures, such as a large teacher and a small streaming student. On the other hand, soft target distillation works better in self-training scenarios such as iterative large teacher training. For a large model with 0.6B weights, we achieve a new SoTA word error rate (WER) on LibriSpeech (8% relative improvement on dev-other) using Noisy Student Training with soft target distillation. It also allows our production teacher to adapt to new data domains continuously.
    Multi-Content Time-Series Popularity Prediction with Multiple-Model Transformers in MEC Networks. (arXiv:2210.05874v1 [cs.LG])
    Coded/uncoded content placement in Mobile Edge Caching (MEC) has evolved as an efficient solution to meet the significant growth of global mobile data traffic by boosting the content diversity in the storage of caching nodes. To meet the dynamic nature of the historical request patterns of multimedia contents, the main focus of recent research has shifted to developing data-driven and real-time caching schemes. In this regard, and with the assumption that users' preferences remain unchanged over a short horizon, the Top-K popular contents are identified as the output of the learning model. Most existing data-driven popularity prediction models, however, are not suitable for coded/uncoded content placement frameworks. On the one hand, in coded/uncoded content placement, in addition to classifying contents into two groups, i.e., popular and non-popular, the probability of a content request is required to identify which content should be stored partially/completely, and this information is not provided by existing data-driven popularity prediction models. On the other hand, the assumption that users' preferences remain unchanged over a short horizon only works for content with a smooth request pattern. To tackle these challenges, we develop a Multiple-model (hybrid) Transformer-based Edge Caching (MTEC) framework with higher generalization ability, suitable for various types of content with different time-varying behavior, which can be adapted to coded/uncoded content placement frameworks. Simulation results corroborate the effectiveness of the proposed MTEC caching framework in comparison to its counterparts in terms of the cache-hit ratio, classification accuracy, and the transferred byte volume.
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v5 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows complex computations to be expressed by composing elementary ones in creative ways, and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver, and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow us to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
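    The mechanism can be illustrated by hand on ridge regression (a self-contained numerical sketch of the implicit function theorem, not the paper's library): the optimality condition $F(x, \theta) = (A^\top A + \theta I)x - A^\top b = 0$ gives $dx/d\theta = -(\partial F/\partial x)^{-1}\,\partial F/\partial\theta = -(A^\top A + \theta I)^{-1}x$, which can be verified against finite differences.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    A, b = rng.normal(size=(20, 5)), rng.normal(size=20)

    def solve(theta):
        """Ridge solution x*(theta) of the optimality condition F(x, theta) = 0."""
        return np.linalg.solve(A.T @ A + theta * np.eye(5), A.T @ b)

    def implicit_grad(theta):
        x = solve(theta)
        dF_dx = A.T @ A + theta * np.eye(5)          # Jacobian of F w.r.t. x
        dF_dtheta = x                                # Jacobian of F w.r.t. theta
        return -np.linalg.solve(dF_dx, dF_dtheta)    # dx/dtheta via the IFT

    eps = 1e-6
    fd = (solve(1.0 + eps) - solve(1.0 - eps)) / (2 * eps)
    assert np.allclose(implicit_grad(1.0), fd, atol=1e-5)
    ```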
    A Robust Initialization of Residual Blocks for Effective ResNet Training without Batch Normalization. (arXiv:2112.12299v2 [cs.LG] UPDATED)
    Batch Normalization is an essential component of all state-of-the-art neural network architectures. However, since it introduces many practical issues, much recent research has been devoted to designing normalization-free architectures. In this paper, we show that weight initialization is key to training ResNet-like normalization-free networks. In particular, we propose a slight modification to the summation operation of a block output to the skip-connection branch, so that the whole network is correctly initialized. We show that this modified architecture achieves competitive results on CIFAR-10, CIFAR-100 and ImageNet without further regularization or algorithmic modifications.
    Stochastic Constrained DRO with a Complexity Independent of Sample Size. (arXiv:2210.05740v1 [cs.LG])
    Distributionally Robust Optimization (DRO), as a popular method to train robust models against distribution shift between training and test sets, has received tremendous attention in recent years. In this paper, we propose and analyze stochastic algorithms that apply to both non-convex and convex losses for solving the Kullback-Leibler divergence constrained DRO problem. Compared with existing methods for this problem, our stochastic algorithms not only enjoy a competitive, if not better, complexity independent of the sample size, but also require only a constant batch size at every iteration, which is more practical for broad applications. We establish a nearly optimal complexity bound for finding an $\epsilon$-stationary solution for non-convex losses and an optimal complexity for finding an $\epsilon$-optimal solution for convex losses. Empirical studies demonstrate the effectiveness of the proposed algorithms for solving non-convex and convex constrained DRO problems.
    Parameter estimation of the homodyned K distribution based on neural networks and trainable fractional-order moments. (arXiv:2210.05833v1 [cs.LG])
    The homodyned K (HK) distribution has been widely used to describe scattering phenomena arising in various research fields, such as ultrasound imaging or optics. In this work, we propose a machine learning based approach to the estimation of the HK distribution parameters. We develop neural networks that estimate the HK distribution parameters based on the signal-to-noise ratio, skewness and kurtosis calculated using fractional-order moments. In contrast to previous approaches, we treat the orders of the moments as trainable variables that can be optimized along with the network weights using the back-propagation algorithm. Networks are trained on samples generated from the HK distribution. The obtained results demonstrate that the proposed method can accurately estimate the HK distribution parameters.
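    A minimal sketch of the trainable-moment idea (the layer sizes, initial orders, and random input are assumptions; the paper feeds statistics such as the SNR built from these moments rather than the raw moments): the moment orders are registered as parameters, so gradients from the parameter-regression loss flow into them alongside the network weights.

    ```python
    import torch
    import torch.nn as nn

    class FractionalMomentNet(nn.Module):
        """Sketch: fractional moment orders are trainable parameters,
        optimized jointly with the MLP weights by backpropagation."""
        def __init__(self, n_moments=4, hidden=32, n_out=2):
            super().__init__()
            self.orders = nn.Parameter(torch.linspace(0.5, 2.0, n_moments))
            self.mlp = nn.Sequential(nn.Linear(n_moments, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_out))

        def forward(self, x):                          # x: (B, N) envelope samples
            amp = x.abs().clamp_min(1e-6).unsqueeze(-1)       # (B, N, 1)
            moments = (amp ** self.orders).mean(dim=1)        # (B, n_moments) E|x|^p
            return self.mlp(moments)                   # e.g. regressed HK parameters

    net = FractionalMomentNet()
    pred = net(torch.randn(8, 1000))                   # gradients flow into net.orders
    ```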
    Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression. (arXiv:2005.14458v3 [stat.ML] UPDATED)
    Random Forest (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as Python and R packages drf.
    iMedBot: A Web-based Intelligent Agent for Healthcare Related Prediction and Deep Learning. (arXiv:2210.05671v1 [cs.LG])
    Background: Breast cancer is a multifactorial disease; genetic and environmental factors affect its incidence probability. Breast cancer metastasis is one of the main causes of breast cancer related deaths reported by the American Cancer Society (ACS). Method: iMedBot is a web application that we developed using the Python Flask web framework and deployed on Amazon Web Services. It contains a frontend and a backend. The backend is supported by a Python program we developed using the Keras and scikit-learn packages, which can be used to learn deep feedforward neural network (DFNN) models. Result: iMedBot provides two main services: 1. it can predict 5-, 10-, or 15-year breast cancer metastasis based on a set of clinical information provided by a user, using a set of pretrained DFNN models; and 2. it can train DFNN models for a user on a user-provided dataset. The trained model is evaluated using the AUC, and both the AUC value and the ROC curve are provided. Conclusion: The iMedBot web application provides a user-friendly interface for user-agent interaction in conducting personalized prediction and model training. It is an initial attempt to convert results of deep learning research into an online tool that may stimulate further research interest in this direction. Keywords: Deep learning, Breast cancer, Web application, Model training.
    Toward Sustainable Continual Learning: Detection and Knowledge Repurposing of Similar Tasks. (arXiv:2210.05751v1 [cs.CV])
    Most existing works on continual learning (CL) focus on overcoming the catastrophic forgetting (CF) problem, with dynamic models and replay methods performing exceptionally well. However, since current works tend to assume exclusivity or dissimilarity among learning tasks, these methods require constantly accumulating task-specific knowledge in memory for each task. This results in the eventual prohibitive expansion of the knowledge repository if we consider learning from a long sequence of tasks. In this work, we introduce a paradigm where the continual learner gets a sequence of mixed similar and dissimilar tasks. We propose a new continual learning framework that uses a task similarity detection function that does not require additional learning, with which we analyze whether there is a specific task in the past that is similar to the current task. We can then reuse previous task knowledge to slow down parameter expansion, ensuring that the CL system expands the knowledge repository sublinearly to the number of learned tasks. Our experiments show that the proposed framework performs competitively on widely used computer vision benchmarks such as CIFAR10, CIFAR100, and EMNIST.
    Linkless Link Prediction via Relational Distillation. (arXiv:2210.05801v1 [cs.LG])
    Graph Neural Networks (GNNs) have been widely used on graph data and have shown exceptional performance in the task of link prediction. Despite their effectiveness, GNNs often suffer from high latency due to non-trivial neighborhood data dependency in practical deployments. To address this issue, researchers have proposed methods based on knowledge distillation (KD) to transfer the knowledge from teacher GNNs to student MLPs, which are known to be efficient even with industrial scale data, and have shown promising results on node classification. Nonetheless, using KD to accelerate link prediction is still unexplored. In this work, we start with exploring two direct analogs of traditional KD for link prediction, i.e., predicted logit-based matching and node representation-based matching. Upon observing direct KD analogs do not perform well for link prediction, we propose a relational KD framework, Linkless Link Prediction (LLP). Unlike simple KD methods that match independent link logits or node representations, LLP distills relational knowledge that is centered around each (anchor) node to the student MLP. Specifically, we propose two matching strategies that complement each other: rank-based matching and distribution-based matching. Extensive experiments demonstrate that LLP boosts the link prediction performance of MLPs with significant margins, and even outperforms the teacher GNNs on 6 out of 9 benchmarks. LLP also achieves a 776.37x speedup in link prediction inference compared to GNNs on the large scale OGB-Citation2 dataset.
    Statistical Modeling of Soft Error Influence on Neural Networks. (arXiv:2210.05876v1 [cs.LG])
    Soft errors in large VLSI circuits have a dramatic influence on computing- and memory-intensive neural network (NN) processing. Understanding this influence is critical for protecting against soft errors and ensuring reliable NN processing. Prior work mainly relies on fault simulation to analyze the influence of soft errors on NN processing. Fault simulation is accurate but usually specific to limited configurations of errors and NN models, due to the prohibitively slow simulation speed, especially for large NN models and datasets. Observing that the influence of soft errors propagates across a large number of neurons and accumulates, we propose to characterize the soft-error-induced data disturbance on each neuron with a normal distribution model, following the central limit theorem, and develop a series of statistical models to analyze the behavior of NN models under soft errors in general. The statistical models reveal not only the correlation between soft errors and NN model accuracy, but also how NN parameters such as quantization and architecture affect the reliability of NNs. The proposed models are compared with fault simulation and verified comprehensively. In addition, we observe that the statistical models can also be used to predict fault simulation results in many cases, and we explore their use to accelerate fault simulations of NNs. According to our experiments, the accelerated fault simulation achieves almost two orders of magnitude speedup with negligible accuracy loss over baseline fault simulations.
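    As a toy illustration of this statistical view (a sketch under assumptions, not the paper's models): the accumulated per-neuron disturbance is approximated by injected Gaussian noise in a hypothetical two-layer MLP, and prediction agreement with the clean forward pass is measured as the noise scale grows. All sizes and weights below are illustrative.

        import numpy as np

        rng = np.random.default_rng(0)

        def relu(x):
            return np.maximum(x, 0.0)

        def forward(x, W1, W2, sigma=0.0):
            # 'sigma' injects the CLT-motivated Gaussian disturbance that
            # stands in for accumulated soft-error noise on each neuron
            h = relu(x @ W1)
            if sigma > 0:
                h = h + rng.normal(0.0, sigma, size=h.shape)
            return (h @ W2).argmax(axis=1)

        W1 = rng.normal(size=(16, 64))
        W2 = rng.normal(size=(64, 10))
        x = rng.normal(size=(1000, 16))
        clean = forward(x, W1, W2)
        for sigma in [0.1, 0.5, 1.0]:
            print(sigma, (forward(x, W1, W2, sigma) == clean).mean())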
    Robustify Transformers with Robust Kernel Density Estimation. (arXiv:2210.05794v1 [cs.LG])
    Recent advances in Transformer architecture have empowered its empirical success in various tasks across different domains. However, existing works mainly focus on improving standard accuracy and computational cost, without considering robustness to contaminated samples. Existing work has shown that the self-attention mechanism, the centerpiece of the Transformer architecture, can be viewed as a non-parametric estimator based on well-known kernel density estimation (KDE). This motivates us to leverage robust kernel density estimation (RKDE) in the self-attention mechanism, alleviating data contamination by down-weighting bad samples in the estimation process. The modified self-attention mechanism can be incorporated into different Transformer variants. Empirical results on language modeling and image classification tasks demonstrate the effectiveness of this approach.
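    To make the KDE view concrete, the following numpy sketch treats attention weights as kernel weights and iteratively down-weights keys that receive little support under the current density estimate. This is an illustrative robust-reweighting scheme, not the paper's exact RKDE estimator; the specific damping rule is an assumption.

        import numpy as np

        def softmax(z, axis=-1):
            z = z - z.max(axis=axis, keepdims=True)
            e = np.exp(z)
            return e / e.sum(axis=axis, keepdims=True)

        def robust_attention(Q, K, V, n_iter=3):
            d = Q.shape[-1]
            logits = Q @ K.T / np.sqrt(d)        # Gaussian-kernel affinities
            w = np.ones(K.shape[0])              # per-key robustness weights
            for _ in range(n_iter):              # IRLS-flavoured reweighting
                A = softmax(logits + np.log(w), axis=-1)
                mass = A.mean(axis=0)            # mass each key receives
                w = mass / (mass + mass.mean())  # damp low-support keys (heuristic)
            return A @ V

        rng = np.random.default_rng(0)
        Q, K, V = (rng.normal(size=(8, 16)) for _ in range(3))
        print(robust_attention(Q, K, V).shape)   # (8, 16)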
    Matching Pursuit Based Scheduling for Over-the-Air Federated Learning. (arXiv:2206.06679v2 [cs.IT] UPDATED)
    This paper develops a class of low-complexity device scheduling algorithms for over-the-air federated learning via the method of matching pursuit. The proposed scheme closely tracks the near-optimal performance achieved by difference-of-convex programming and significantly outperforms the well-known benchmark algorithms based on convex relaxation. Compared to the state-of-the-art, the proposed scheme poses a drastically lower computational load on the system: for $K$ devices and $N$ antennas at the parameter server, the benchmark complexity scales with $\left(N^2+K\right)^3 + N^6$ while the complexity of the proposed scheme scales with $K^p N^q$ for some $0 < p,q \leq 2$. The efficiency of the proposed scheme is confirmed via numerical experiments on the CIFAR-10 dataset.
    A Unified Framework for Alternating Offline Model Training and Policy Learning. (arXiv:2210.05922v1 [cs.LG])
    In offline model-based reinforcement learning (offline MBRL), we learn a dynamic model from historically collected data, and subsequently utilize the learned model and fixed datasets for policy learning, without further interacting with the environment. Offline MBRL algorithms can improve the efficiency and stability of policy learning over model-free algorithms. However, in most of the existing offline MBRL algorithms, the learning objectives for the dynamic models and the policies are isolated from each other. Such an objective mismatch may lead to inferior performance of the learned agents. In this paper, we address this issue by developing an iterative offline MBRL framework, where we maximize a lower bound of the true expected return by alternating between dynamic-model training and policy learning. With the proposed unified model-policy learning framework, we achieve competitive performance on a wide range of continuous-control offline reinforcement learning datasets. Source code is publicly released.
    Visual Language Maps for Robot Navigation. (arXiv:2210.05714v1 [cs.RO])
    Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world. VLMaps can be autonomously built from video feed on robots using standard exploration approaches and enables natural language indexing of the map without additional labeled data. Specifically, when combined with large language models (LLMs), VLMaps can be used to (i) translate natural language commands into a sequence of open-vocabulary navigation goals (which, beyond prior work, can be spatial by construction, e.g., "in between the sofa and TV" or "three meters to the right of the chair") directly localized in the map, and (ii) can be shared among multiple robots with different embodiments to generate new obstacle maps on-the-fly (by using a list of obstacle categories). Extensive experiments carried out in simulated and real world environments show that VLMaps enable navigation according to more complex language instructions than existing methods. Videos are available at https://vlmaps.github.io.
    Real-world-robustness of tree-based classifiers. (arXiv:2208.10354v2 [cs.LG] UPDATED)
    The concept of trustworthy AI has gained widespread attention lately. One of the aspects relevant to trustworthy AI is robustness of ML models. In this study, we show how to exactly compute the recently introduced measure of real-world-robustness - a measure for robustness against naturally occurring distortions of input data - for tree-based classifiers under the assumption that the natural distortions are given as probability distributions. The idea is to extract the decision rules of a trained tree-based classifier, separate the feature space into non-overlapping regions and determine the probability that a data sample with distortion returns its predicted label. The original method works for all black box classifiers, but is only an approximation and only works if the input dimension is not too high, whereas our proposed method returns exact results.
    On RKHS Choices for Assessing Graph Generators via Kernel Stein Statistics. (arXiv:2210.05746v1 [stat.ML])
    Score-based kernelised Stein discrepancy (KSD) tests have emerged as a powerful tool for goodness-of-fit testing, especially in high dimensions; however, the test performance may depend on the choice of kernel in the underlying reproducing kernel Hilbert space (RKHS). Here we assess the effect of RKHS choice for KSD tests of random network models, developed for exponential random graph models (ERGMs) in Xu and Reinert (2021) and for synthetic graph generators in Xu and Reinert (2022). We investigate the power performance and the computational runtime of the test in different scenarios, including both dense and sparse graph regimes. Experimental results on kernel performance for model assessment tasks are shown and discussed on synthetic and real-world network applications.
    Gradient-Guided Importance Sampling for Learning Binary Energy-Based Models. (arXiv:2210.05782v1 [cs.LG])
    Learning energy-based models (EBMs) is known to be difficult, especially on discrete data where gradient-based learning strategies cannot be applied directly. Although ratio matching is a sound method for learning discrete EBMs, it suffers from expensive computation and excessive memory requirements, making it difficult to learn EBMs on high-dimensional data. Motivated by these limitations, we propose ratio matching with gradient-guided importance sampling (RMwGGIS). In particular, we use the gradient of the energy function w.r.t. the discrete data space to approximately construct the provably optimal proposal distribution, which is subsequently used by importance sampling to efficiently estimate the original ratio matching objective. We perform experiments on density modeling over synthetic discrete data, graph generation, and training Ising models to evaluate our proposed method. The experimental results demonstrate that our method can significantly alleviate the limitations of ratio matching, perform more effectively in practice, and scale to high-dimensional problems. Our implementation is available at https://github.com/divelab/RMwGGIS.
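    A minimal sketch of the gradient-guided proposal idea on a toy Ising-like energy (the exact proposal form in RMwGGIS may differ; the first-order Taylor estimate and temperature below are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)
        d = 32
        J = rng.normal(scale=0.1, size=(d, d)); J = (J + J.T) / 2
        b = rng.normal(size=d)

        def grad_energy(x):
            # dE/dx for E(x) = -(x^T J x + b^T x), treating x as continuous
            return -(2 * J @ x + b)

        def flip_proposal(x, temp=2.0):
            # first-order estimate of the energy change of each bit flip:
            # E(flip_i(x)) - E(x) ~ (1 - 2 x_i) * dE/dx_i
            delta = (1.0 - 2.0 * x) * grad_energy(x)
            z = -delta / temp
            z -= z.max()
            p = np.exp(z)
            return p / p.sum()              # proposal over flip positions

        x = (rng.random(d) < 0.5).astype(float)
        q = flip_proposal(x)
        i = rng.choice(d, p=q)              # importance-sample a neighbor
        print(i, q[i])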
    Unsupervised detection of structural damage using Variational Autoencoder and a One-Class Support Vector Machine. (arXiv:2210.05674v1 [cs.LG])
    In recent years, Artificial Neural Networks (ANNs) have been introduced in Structural Health Monitoring (SHM) systems. An unsupervised, data-driven method allows training the ANN on data acquired from the undamaged structural condition in order to detect structural damage. In standard approaches, a decision rule for detecting anomalous data is manually defined after the training stage. However, this process can be made automatic using machine learning methods, whose performance is maximised using hyperparameter optimization techniques. This paper proposes an unsupervised, data-driven method to detect structural anomalies. The methodology consists of: (i) a Variational Autoencoder (VAE) to approximate the undamaged data distribution and (ii) a One-Class Support Vector Machine (OC-SVM) to discriminate different health conditions using damage-sensitive features extracted from the VAE's signal reconstruction. The method is applied to a scale steel structure that was tested in nine damage scenarios by the IASC-ASCE Structural Health Monitoring Task Group.
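    The second stage maps naturally onto scikit-learn's OneClassSVM; the sketch below trains it on hypothetical damage-sensitive features from the undamaged condition only, then flags departures. The VAE that would produce these reconstruction-error features is omitted, and the synthetic features are purely illustrative.

        import numpy as np
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import OneClassSVM

        rng = np.random.default_rng(0)
        healthy = rng.normal(0.0, 1.0, size=(500, 8))        # undamaged features
        test = np.vstack([rng.normal(0.0, 1.0, size=(100, 8)),
                          rng.normal(2.0, 1.5, size=(100, 8))])  # shifted = "damaged"

        scaler = StandardScaler().fit(healthy)
        ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
        ocsvm.fit(scaler.transform(healthy))                 # one-class training

        pred = ocsvm.predict(scaler.transform(test))         # +1 healthy, -1 anomaly
        print("flagged as damaged:", (pred == -1).mean())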
    Shapley Head Pruning: Identifying and Removing Interference in Multilingual Transformers. (arXiv:2210.05709v1 [cs.CL])
    Multilingual transformer-based models demonstrate remarkable zero- and few-shot transfer across languages by learning and reusing language-agnostic features. However, as a fixed-size model acquires more languages, its performance across all languages degrades, a phenomenon termed interference. Often attributed to limited model capacity, interference is commonly addressed by adding additional parameters, despite evidence that transformer-based models are overparameterized. In this work, we show that it is possible to reduce interference by instead identifying and pruning language-specific parameters. First, we use Shapley Values, a credit allocation metric from coalitional game theory, to identify attention heads that introduce interference. Then, we show that removing identified attention heads from a fixed model improves performance for a target language on both sentence classification and structural prediction, seeing gains as large as 24.7\%. Finally, we provide insights on language-agnostic and language-specific attention heads using attention visualization.
    Transfer Learning with Joint Fine-Tuning for Multimodal Sentiment Analysis. (arXiv:2210.05790v1 [cs.LG])
    Most existing methods focus on sentiment analysis of textual data. However, recently there has been a massive use of images and videos on social platforms, motivating sentiment analysis from other modalities. Current studies show that exploring other modalities (e.g., images) increases sentiment analysis performance. State-of-the-art multimodal models, such as CLIP and VisualBERT, are pre-trained on datasets with the text paired with images. Although the results obtained by these models are promising, pre-training and sentiment analysis fine-tuning tasks of these models are computationally expensive. This paper introduces a transfer learning approach using joint fine-tuning for sentiment analysis. Our proposal achieved competitive results using a more straightforward alternative fine-tuning strategy that leverages different pre-trained unimodal models and efficiently combines them in a multimodal space. Moreover, our proposal allows flexibility when incorporating any pre-trained model for texts and images during the joint fine-tuning stage, being especially interesting for sentiment classification in low-resource scenarios.
    Trading Off Resource Budgets for Improved Regret Bounds. (arXiv:2210.05789v1 [cs.LG])
    In this work we consider a variant of adversarial online learning where in each round one picks $B$ out of $N$ arms and incurs cost equal to the $\textit{minimum}$ of the costs of each arm chosen. We propose an algorithm called Follow the Perturbed Multiple Leaders (FPML) for this problem, which we show (by adapting the techniques of Kalai and Vempala [2005]) achieves expected regret $\mathcal{O}(T^{\frac{1}{B+1}}\ln(N)^{\frac{B}{B+1}})$ over time horizon $T$ relative to the $\textit{single}$ best arm in hindsight. This introduces a trade-off between the budget $B$ and the single-best-arm regret, and we proceed to investigate several applications of this trade-off. First, we observe that algorithms which use standard regret minimizers as subroutines can sometimes be adapted by replacing these subroutines with FPML, and we use this to generalize existing algorithms for Online Submodular Function Maximization [Streeter and Golovin, 2008] in both the full feedback and semi-bandit feedback settings. Next, we empirically evaluate our new algorithms on an online black-box hyperparameter optimization problem. Finally, we show how FPML can lead to new algorithms for Linear Programming which require stronger oracles at the benefit of fewer oracle calls.
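    A compact numpy sketch of the FPML idea in the full-feedback setting (the exponential perturbation and its scale are assumptions in the FTPL tradition, not necessarily the paper's exact choices): perturb the cumulative costs, play the $B$ arms with the smallest perturbed totals, and pay the minimum cost among them.

        import numpy as np

        def fpml(costs, B, eta, rng):
            T, N = costs.shape
            cum = np.zeros(N)
            total = 0.0
            for t in range(T):
                perturbed = cum - eta * rng.exponential(size=N)
                chosen = np.argsort(perturbed)[:B]   # the B "leaders"
                total += costs[t, chosen].min()      # pay the min over the set
                cum += costs[t]                      # full-feedback update
            return total

        rng = np.random.default_rng(0)
        T, N, B = 1000, 20, 3
        costs = rng.random((T, N))
        print(fpml(costs, B, eta=np.sqrt(T), rng=rng),
              costs.sum(axis=0).min())               # vs. best single arm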
    Synthetic Power Analyses: Empirical Evaluation and Application to Cognitive Neuroimaging. (arXiv:2210.05835v1 [cs.CV])
    In the experimental sciences, statistical power analyses are often used before data collection to determine the required sample size. However, traditional power analyses can be costly when data are difficult or expensive to collect. We propose synthetic power analyses: a framework for estimating statistical power at various sample sizes, and we empirically explore the performance of synthetic power analyses for sample size selection in cognitive neuroscience experiments. To this end, brain imaging data is synthesized using an implicit generative model conditioned on observed cognitive processes. Further, we propose a simple procedure to modify the statistical tests so that they yield conservative statistics. Our empirical results suggest that synthetic power analysis could be a low-cost alternative to pilot data collection when the proposed experiments share cognitive processes with previously conducted experiments.
    Towards Consistency and Complementarity: A Multiview Graph Information Bottleneck Approach. (arXiv:2210.05676v1 [cs.LG])
    Empirical studies of Graph Neural Networks (GNNs) broadly take the original node features and adjacency relationships as single-view input, ignoring the rich information of multiple graph views. To circumvent this issue, multiview graph analysis frameworks have been developed to fuse graph information across views. How to model and integrate shared (i.e., consistency) and view-specific (i.e., complementarity) information is a key issue in multiview graph analysis. In this paper, we propose a novel Multiview Variational Graph Information Bottleneck (MVGIB) principle to maximize the agreement for common representations and the disagreement for view-specific representations. Under this principle, we formulate the common and view-specific information bottleneck objectives across multiple views by using constraints from mutual information. However, these objectives are hard to optimize directly since mutual information is computationally intractable. To tackle this challenge, we derive variational lower and upper bounds of the mutual information terms, and instead optimize these variational bounds to find approximate solutions for the information objectives. Extensive experiments on graph benchmark datasets demonstrate the superior effectiveness of the proposed method.
    Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits. (arXiv:2110.08627v2 [cs.LG] UPDATED)
    We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To this end, we design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm. Complementarily, by establishing lower bounds on the regret achievable by any algorithm with a given BAI failure probability, we show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives, and (ii) BoBW-lil'UCB$(\gamma)$ achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Our work elucidates the trade-off more precisely by showing how the constants in previous works depend on certain hardness parameters. Finally, we show that BoBW-lil'UCB outperforms a close competitor UCB$_\alpha$ (Degenne et al., 2019) in terms of the time complexity and the regret on diverse datasets such as MovieLens and Published Kinase Inhibitor Set.
    BabyNet: A Lightweight Network for Infant Reaching Action Recognition in Unconstrained Environments to Support Future Pediatric Rehabilitation Applications. (arXiv:2208.04950v2 [cs.CV] UPDATED)
    Action recognition is an important component to improve autonomy of physical rehabilitation devices, such as wearable robotic exoskeletons. Existing human action recognition algorithms focus on adult applications rather than pediatric ones. In this paper, we introduce BabyNet, a light-weight (in terms of trainable parameters) network structure to recognize infant reaching action from off-body stationary cameras. We develop an annotated dataset that includes diverse reaches performed while in a sitting posture by different infants in unconstrained environments (e.g., in home settings, etc.). Our approach uses the spatial and temporal connection of annotated bounding boxes to interpret onset and offset of reaching, and to detect a complete reaching action. We evaluate the efficiency of our proposed approach and compare its performance against other learning-based network structures in terms of capability of capturing temporal inter-dependencies and accuracy of detection of reaching onset and offset. Results indicate our BabyNet can attain solid performance in terms of (average) testing accuracy that exceeds that of other larger networks, and can hence serve as a light-weight data-driven framework for video-based infant reaching action recognition.
    Adaptive Dual Channel Convolution Hypergraph Representation Learning for Technological Intellectual Property. (arXiv:2210.05947v1 [cs.IR])
    In the age of big data, the demand for mining hidden information from technological intellectual property is increasing in many countries. Accordingly, a considerable number of graph learning algorithms for technological intellectual property have been proposed. The goal is to model technological intellectual property entities and their relationships through a graph structure and to use neural network algorithms to extract the hidden structural information in the graph. However, most existing graph learning algorithms merely focus on mining binary relations in technological intellectual property, ignoring the higher-order information hidden in non-binary relations. Therefore, a hypergraph neural network model based on dual-channel convolution is proposed. For the hypergraph constructed from technological intellectual property data, the hypergraph channel and the line-expansion graph channel of the hypergraph are used to learn the hypergraph, and an attention mechanism is introduced to adaptively fuse the output representations of the two channels. The proposed model outperforms existing approaches on a variety of datasets.
    STable: Table Generation Framework for Encoder-Decoder Models. (arXiv:2206.04045v2 [cs.CL] UPDATED)
    The output structure of database-like tables, consisting of values structured in horizontal rows and vertical columns identifiable by name, can cover a wide range of NLP tasks. Following this observation, we propose a framework for text-to-table neural models applicable to problems such as extraction of line items, joint entity and relation extraction, or knowledge base population. The permutation-based decoder of our proposal is a generalized sequential method that comprehends information from all cells in the table. The training maximizes the expected log-likelihood for a table's content across all random permutations of the factorization order. During content inference, we exploit the model's ability to generate cells in any order by searching over possible orderings to maximize the model's confidence and avoid the substantial error accumulation that other sequential models are prone to. Experiments demonstrate a high practical value of the framework, which establishes state-of-the-art results on several challenging datasets, outperforming previous solutions by up to 15%.
    Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics. (arXiv:2210.06226v1 [stat.ML])
    Several algorithms involving the Variational R\'enyi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
    FCT-GAN: Enhancing Table Synthesis via Fourier Transform. (arXiv:2210.06239v1 [cs.LG])
    Synthetic tabular data emerges as an alternative for sharing knowledge while adhering to restrictive data access regulations, e.g., the European General Data Protection Regulation (GDPR). Mainstream state-of-the-art tabular data synthesizers draw methodologies from Generative Adversarial Networks (GANs), which are composed of a generator and a discriminator. While convolutional neural networks have been shown to be a better architecture than fully connected networks for tabular data synthesis, two key properties of tabular data are overlooked: (i) the global correlation across columns, and (ii) invariance of synthesis to column permutations of the input data. To address these problems, we propose a Fourier conditional tabular generative adversarial network (FCT-GAN). We introduce feature tokenization and Fourier networks to construct a transformer-style generator and discriminator, capturing both local and global dependencies across columns. The tokenizer captures local spatial features and transforms original data into tokens. Fourier networks transform tokens to the frequency domain and multiply them element-wise by a learnable filter. Extensive evaluation on benchmarks and real-world data shows that FCT-GAN can synthesize tabular data with high machine learning utility (up to 27.8% better than state-of-the-art baselines) and high statistical similarity to the original data (up to 26.5% better), while maintaining the global correlation across columns, especially on high-dimensional datasets.
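    The Fourier-network component can be sketched in a few lines of PyTorch: move tokens to the frequency domain along the column axis, multiply element-wise by a learnable complex filter (so every column can influence every other in one step), and transform back. Shapes and initialization here are assumptions, not FCT-GAN's exact configuration.

        import torch
        import torch.nn as nn

        class FourierFilterLayer(nn.Module):
            def __init__(self, n_tokens, d_model):
                super().__init__()
                # one complex weight per (frequency, channel)
                self.filt = nn.Parameter(torch.randn(n_tokens, d_model, 2) * 0.02)

            def forward(self, x):                     # x: (batch, tokens, d_model)
                f = torch.fft.fft(x, dim=1)           # over the token (column) axis
                f = f * torch.view_as_complex(self.filt)
                return torch.fft.ifft(f, dim=1).real  # back to the token domain

        layer = FourierFilterLayer(n_tokens=10, d_model=32)
        print(layer(torch.randn(4, 10, 32)).shape)    # torch.Size([4, 10, 32])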
    Transformers generalize differently from information stored in context vs in weights. (arXiv:2210.05675v1 [cs.CL])
    Transformer models can use two fundamentally different kinds of information: information stored in weights during training, and information provided ``in-context'' at inference time. In this work, we show that transformers exhibit different inductive biases in how they represent and generalize from the information in these two sources. In particular, we characterize whether they generalize via parsimonious rules (rule-based generalization) or via direct comparison with observed examples (exemplar-based generalization). This is of important practical consequence, as it informs whether to encode information in weights or in context, depending on how we want models to use that information. In transformers trained on controlled stimuli, we find that generalization from weights is more rule-based whereas generalization from context is largely exemplar-based. In contrast, we find that in transformers pre-trained on natural language, in-context learning is significantly rule-based, with larger models showing more rule-basedness. We hypothesise that rule-based generalization from in-context information might be an emergent consequence of large-scale training on language, which has sparse rule-like structure. Using controlled stimuli, we verify that transformers pretrained on data containing sparse rule-like structure exhibit more rule-based generalization.
    Exploration via Elliptical Episodic Bonuses. (arXiv:2210.05805v1 [cs.LG])
    In recent years, a number of reinforcement learning (RL) methods have been proposed to explore complex environments which differ across episodes. In this work, we show that the effectiveness of these methods critically relies on a count-based episodic term in their exploration bonus. As a result, despite their success in relatively simple, noise-free settings, these methods fall short in more realistic scenarios where the state space is vast and prone to noise. To address this limitation, we introduce Exploration via Elliptical Episodic Bonuses (E3B), a new method which extends count-based episodic bonuses to continuous state spaces and encourages an agent to explore states that are diverse under a learned embedding within each episode. The embedding is learned using an inverse dynamics model in order to capture controllable aspects of the environment. Our method sets a new state-of-the-art across 16 challenging tasks from the MiniHack suite, without requiring task-specific inductive biases. E3B also matches existing methods on sparse reward, pixel-based VizDoom environments, and outperforms existing methods in reward-free exploration on Habitat, demonstrating that it can scale to high-dimensional pixel-based observations and realistic environments.
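    The elliptical bonus itself is a few lines of numpy: for a state embedding phi, the bonus is phi^T C^{-1} phi, where C accumulates outer products of the embeddings visited so far in the episode (plus a ridge term), so revisiting similar embeddings earns less. A Sherman-Morrison update keeps this cheap per step. The embedding network (inverse-dynamics features) is assumed to exist elsewhere; this is a sketch of the bonus computation, not the full E3B agent.

        import numpy as np

        class EllipticalBonus:
            def __init__(self, dim, lam=0.1):
                self.inv_C = np.eye(dim) / lam       # (lam * I)^{-1} at episode start

            def bonus(self, phi):
                b = float(phi @ self.inv_C @ phi)    # elliptical bonus
                v = self.inv_C @ phi                 # Sherman-Morrison rank-one update
                self.inv_C -= np.outer(v, v) / (1.0 + phi @ v)
                return b

        rng = np.random.default_rng(0)
        eb = EllipticalBonus(dim=8)
        phi = rng.normal(size=8)
        for t in range(4):
            print(t, round(eb.bonus(phi), 3))        # same phi: bonus shrinks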
    Explaining Online Reinforcement Learning Decisions of Self-Adaptive Systems. (arXiv:2210.05931v1 [cs.LG])
    Design time uncertainty poses an important challenge when developing a self-adaptive system. As an example, defining how the system should adapt when facing a new environment state requires understanding the precise effect of an adaptation, which may not be known at design time. Online reinforcement learning, i.e., employing reinforcement learning (RL) at runtime, is an emerging approach to realizing self-adaptive systems in the presence of design time uncertainty. By using Online RL, the self-adaptive system can learn from actual operational data and leverage feedback only available at runtime. Recently, Deep RL is gaining interest. Deep RL represents learned knowledge as a neural network whereby it can generalize over unseen inputs, as well as handle continuous environment states and adaptation actions. A fundamental problem of Deep RL is that learned knowledge is not explicitly represented. For a human, it is practically impossible to relate the parametrization of the neural network to concrete RL decisions, and thus Deep RL essentially appears as a black box. Yet, understanding the decisions made by Deep RL is key to (1) increasing trust and (2) facilitating debugging. Such debugging is especially relevant for self-adaptive systems, because the reward function, which quantifies the feedback to the RL algorithm, must be explicitly defined by developers, thus introducing a potential for human error. To explain Deep RL for self-adaptive systems, we enhance and combine two existing explainable RL techniques from the machine learning literature. The combined technique, XRL-DINE, overcomes the respective limitations of the individual techniques. We present a proof-of-concept implementation of XRL-DINE, as well as qualitative and quantitative results of applying XRL-DINE to a self-adaptive system exemplar.
    Finding and Listing Front-door Adjustment Sets. (arXiv:2210.05816v1 [stat.ME])
    Identifying the effects of new interventions from data is a significant challenge found across a wide range of the empirical sciences. A well-known strategy for identifying such effects is Pearl's front-door (FD) criterion (Pearl, 1995). The definition of the FD criterion is declarative, only allowing one to decide whether a specific set satisfies the criterion. In this paper, we present algorithms for finding and enumerating possible sets satisfying the FD criterion in a given causal diagram. These results are useful in facilitating the practical applications of the FD criterion for causal effects estimation and helping scientists to select estimands with desired properties, e.g., based on cost, feasibility of measurement, or statistical power.
    Neural Importance Sampling for Rapid and Reliable Gravitational-Wave Inference. (arXiv:2210.05686v1 [gr-qc])
    We combine amortized neural posterior estimation with importance sampling for fast and accurate gravitational-wave inference. We first generate a rapid proposal for the Bayesian posterior using neural networks, and then attach importance weights based on the underlying likelihood and prior. This provides (1) a corrected posterior free from network inaccuracies, (2) a performance diagnostic (the sample efficiency) for assessing the proposal and identifying failure cases, and (3) an unbiased estimate of the Bayesian evidence. By establishing this independent verification and correction mechanism we address some of the most frequent criticisms against deep learning for scientific inference. We carry out a large study analyzing 42 binary black hole mergers observed by LIGO and Virgo with the SEOBNRv4PHM and IMRPhenomXPHM waveform models. This shows a median sample efficiency of $\approx 10\%$ (two orders-of-magnitude better than standard samplers) as well as a ten-fold reduction in the statistical uncertainty in the log evidence. Given these advantages, we expect a significant impact on gravitational-wave inference, and for this approach to serve as a paradigm for harnessing deep learning methods in scientific applications.
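    The verification step is generic importance sampling, so it can be sketched directly (a toy 1-D example, not the authors' code): draw samples from the neural proposal q, attach weights proportional to prior times likelihood over q, and read off the sample efficiency and log-evidence.

        import numpy as np

        def is_diagnostics(log_prior, log_lik, log_q):
            log_w = log_prior + log_lik - log_q                   # unnormalized log weights
            log_Z = np.logaddexp.reduce(log_w) - np.log(len(log_w))
            w = np.exp(log_w - log_w.max()); w /= w.sum()
            eff = 1.0 / (len(w) * np.sum(w ** 2))                 # ESS / N
            return eff, log_Z

        rng = np.random.default_rng(0)
        n = 10_000
        theta = rng.normal(0.2, 1.1, size=n)                      # proposal q = N(0.2, 1.1^2)
        log_q = -0.5 * ((theta - 0.2) / 1.1) ** 2 - np.log(1.1 * np.sqrt(2 * np.pi))
        log_prior = -0.5 * theta ** 2 - 0.5 * np.log(2 * np.pi)   # prior N(0, 1)
        log_lik = -0.5 * (theta - 1.0) ** 2                       # unnormalized likelihood
        eff, log_Z = is_diagnostics(log_prior, log_lik, log_q)
        print(f"sample efficiency: {eff:.1%}, log evidence: {log_Z:.3f}")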
    A Self-attention Guided Multi-scale Gradient GAN for Diversified X-ray Image Synthesis. (arXiv:2210.06334v1 [eess.IV])
    Imbalanced image datasets are commonly available in the domain of biomedical image analysis. Biomedical images contain diversified features that are significant in predicting targeted diseases. Generative Adversarial Networks (GANs) are utilized to address the data limitation problem via the generation of synthetic images. Training challenges such as mode collapse, non-convergence, and instability degrade a GAN's performance in synthesizing diversified and high-quality images. In this work, SAMGAN, an attention-guided multi-scale gradient GAN architecture, is proposed to model long-range dependencies among biomedical image features and to improve training performance using a flow of multi-scale gradients at multiple resolutions in the layers of the generator and discriminator models. The intent is to reduce the impact of mode collapse and stabilize GAN training using an attention mechanism with multi-scale gradient learning for diversified X-ray image synthesis. The Multi-scale Structural Similarity Index Measure (MS-SSIM) and Frechet Inception Distance (FID) are used to identify the occurrence of mode collapse and evaluate the diversity of the generated synthetic images. The proposed architecture is compared with the multi-scale gradient GAN (MSG-GAN) to assess the diversity of generated synthetic images. Results indicate that SAMGAN outperforms MSG-GAN in synthesizing diversified images, as evidenced by the MS-SSIM and FID scores.
    Betting the system: Using lineups to predict football scores. (arXiv:2210.06327v1 [cs.LG])
    This paper aims to reduce randomness in football by analysing the role of lineups in final scores using machine learning prediction models we have developed. Football clubs invest millions of dollars in lineups, and knowing how individual statistics translate to better outcomes can optimise investments. Moreover, sports betting is growing exponentially, and being able to predict the future is profitable and desirable. We use machine learning models and historical player data from the English Premier League (2020-2022) to predict scores and to understand how individual performance can improve the outcome of a match. We compared different prediction techniques to maximise the possibility of finding useful models. We created heuristic and machine learning models predicting football scores to compare different techniques. We used different sets of features and show that goalkeeper stats are more important than attacker stats for predicting goals scored. We applied a broad evaluation process to assess the efficacy of the models in real-world applications. We correctly predicted all relegated teams after forecasting 100 consecutive matches. We show that Support Vector Regression outperformed other techniques in predicting final scores and that lineups do improve predictions. Finally, our model was profitable (42% return) when emulating a betting system using real-world odds data.
    What can we learn about a generated image corrupting its latent representation?. (arXiv:2210.06257v1 [cs.CV])
    Generative adversarial networks (GANs) offer an effective solution to the image-to-image translation problem, thereby allowing for new possibilities in medical imaging. They can translate images from one imaging modality to another at a low cost. For unpaired datasets, they rely mostly on cycle loss. Despite its effectiveness in learning the underlying data distribution, it can lead to a discrepancy between input and output data. The purpose of this work is to investigate the hypothesis that we can predict image quality based on its latent representation in the GAN's bottleneck. We achieve this by corrupting the latent representation with noise and generating multiple outputs. The degree of differences between them is interpreted as the strength of the representation: the more robust the latent representation, the fewer changes in the output image the corruption causes. Our results demonstrate that our proposed method has the ability to i) predict uncertain parts of synthesized images, and ii) identify samples that may not be reliable for downstream tasks, e.g., a liver segmentation task.
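    The probing procedure is simple enough to sketch end-to-end with a stand-in decoder (the real one would be the trained generator's decoding path; the noise scale and sample count are assumptions):

        import numpy as np

        rng = np.random.default_rng(0)

        def decoder(z):
            # stand-in for the generator's bottleneck-to-image path
            W = np.sin(np.outer(np.arange(z.size), np.arange(64 * 64)) * 1e-3)
            return (z @ W).reshape(64, 64)

        def latent_uncertainty(z, n_samples=20, noise_scale=0.1):
            # corrupt the latent code, decode each copy, and use per-pixel
            # std across decodings as the uncertainty map
            outs = np.stack([decoder(z + rng.normal(0, noise_scale, z.shape))
                             for _ in range(n_samples)])
            return outs.std(axis=0)

        z = rng.normal(size=128)
        umap = latent_uncertainty(z)
        print(umap.shape, float(umap.mean()))   # (64, 64) uncertainty heat-map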
    Distilling Knowledge from Language Models for Video-based Action Anticipation. (arXiv:2210.05991v1 [cs.CV])
    Anticipating future actions in a video is useful for many autonomous and assistive technologies. Prior action anticipation work mostly treats this as a vision modality problem, where the models learn the task information primarily from the video features in the target action anticipation datasets. In this work, we propose a method to make use of the text modality that is available during training, to bring in complementary information that is not present in the target action anticipation datasets. In particular, we leverage pre-trained language models to build a text-modality teacher that is able to predict future actions based on text labels of the past actions extracted from the input video. To further adapt the teacher to the target domain (cooking), we also pretrain the teacher on textual instructions from a recipes dataset (Recipe1M). Then, we distill the knowledge gained by the text-modality teacher into a vision-modality student to further improve its performance. We empirically evaluate this simple cross-modal distillation strategy on two video datasets, EGTEA-GAZE+ and EPIC-KITCHEN 55. Distilling this text-modality knowledge into a strong vision model (Anticipative Vision Transformer) yields consistent gains across both datasets, 3.5% relative improvement on top-1 class mean recall for EGTEA-GAZE+ and 7.2% on top-5 many-shot class mean recall for EPIC-KITCHEN 55, and achieves new state-of-the-art results.
    A Comparative Study on 1.5T-3T MRI Conversion through Deep Neural Network Models. (arXiv:2210.06362v1 [eess.IV])
    In this paper, we explore the capabilities of a number of deep neural network models in generating whole-brain 3T-like MR images from clinical 1.5T MRIs. The models include a fully convolutional network (FCN) method and three state-of-the-art super-resolution solutions, ESPCN [26], SRGAN [17] and PRSR [7]. The FCN solution, U-Convert-Net, carries out mapping of 1.5T-to-3T slices through a U-Net-like architecture, with 3D neighborhood information integrated through a multi-view ensemble. The pros and cons of the models, as well as the associated evaluation metrics, are examined through experiments and discussed in depth. To the best of our knowledge, this study is the first work to evaluate multiple deep learning solutions for whole-brain MRI conversion, as well as the first attempt to utilize an FCN/U-Net-like structure for this purpose.
    fAux: Testing Individual Fairness via Gradient Alignment. (arXiv:2210.06288v1 [stat.ML])
    Machine learning models are vulnerable to biases that result in unfair treatment of individuals from different populations. Recent work that aims to test a model's fairness at the individual level either relies on domain knowledge to choose metrics, or on input transformations that risk generating out-of-domain samples. We describe a new approach for testing individual fairness that does not have either requirement. We propose a novel criterion for evaluating individual fairness and develop a practical testing method based on this criterion which we call fAux (pronounced fox). This is based on comparing the derivatives of the predictions of the model to be tested with those of an auxiliary model, which predicts the protected variable from the observed data. We show that the proposed method effectively identifies discrimination on both synthetic and real-world datasets, and has quantitative and qualitative advantages over contemporary methods.
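    The gradient-comparison step can be sketched as follows (the cosine-similarity form is an illustrative assumption; see the paper for the actual criterion): if the input gradient of the model under test aligns with that of the auxiliary protected-attribute predictor, the prediction likely leans on protected information.

        import numpy as np

        def finite_diff_grad(f, x, eps=1e-4):
            g = np.zeros_like(x)
            for i in range(x.size):
                d = np.zeros_like(x); d[i] = eps
                g[i] = (f(x + d) - f(x - d)) / (2 * eps)
            return g

        def alignment(model, aux, x):
            gm = finite_diff_grad(model, x)   # gradient of tested model
            ga = finite_diff_grad(aux, x)     # gradient of protected-attribute model
            return float(gm @ ga / (np.linalg.norm(gm) * np.linalg.norm(ga) + 1e-12))

        # toy linear models: 'model' partly reuses the protected direction
        w_m = np.array([1.0, 0.5, 2.0]); w_a = np.array([0.0, 0.1, 2.0])
        print(alignment(lambda x: float(w_m @ x),
                        lambda x: float(w_a @ x), np.ones(3)))  # near 1 = suspect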
    UTTS: Unsupervised TTS with Conditional Disentangled Sequential Variational Auto-encoder. (arXiv:2206.02512v3 [eess.AS] UPDATED)
    In this paper, we propose a novel unsupervised text-to-speech (UTTS) framework which does not require text-audio pairs for the TTS acoustic modeling (AM). UTTS is a multi-speaker speech synthesizer that supports zero-shot voice cloning; it is developed from the perspective of disentangled speech representation learning. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. We leverage recent advancements in self-supervised speech representation learning as well as speech synthesis front-end techniques for system development. Specifically, we employ our recently formulated Conditional Disentangled Sequential Variational Auto-encoder (C-DSVAE) as the backbone UTTS AM, which offers well-structured content representations given unsupervised alignment (UA) as the condition during training. For UTTS inference, we utilize a lexicon to map input text to the phoneme sequence, which is expanded to the frame-level forced alignment (FA) with a speaker-dependent duration model. Then, we develop an alignment mapping module that converts FA to UA. Finally, the C-DSVAE, serving as the self-supervised TTS AM, takes the predicted UA and a target speaker embedding to generate the mel spectrogram, which is ultimately converted to waveform with a neural vocoder. We show how our method enables speech synthesis without using a paired TTS corpus. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations. Audio samples are available at our demo page https://neurtts.github.io/utts_demo.
    ZeroC: A Neuro-Symbolic Model for Zero-shot Concept Recognition and Acquisition at Inference Time. (arXiv:2206.15049v3 [cs.LG] UPDATED)
    Humans have the remarkable ability to recognize and acquire novel visual concepts in a zero-shot manner. Given a high-level, symbolic description of a novel concept in terms of previously learned visual concepts and their relations, humans can recognize novel concepts without seeing any examples. Moreover, they can acquire new concepts by parsing and communicating symbolic structures using learned visual concepts and relations. Endowing these capabilities in machines is pivotal in improving their generalization capability at inference time. In this work, we introduce Zero-shot Concept Recognition and Acquisition (ZeroC), a neuro-symbolic architecture that can recognize and acquire novel concepts in a zero-shot way. ZeroC represents concepts as graphs of constituent concept models (as nodes) and their relations (as edges). To allow inference time composition, we employ energy-based models (EBMs) to model concepts and relations. We design ZeroC architecture so that it allows a one-to-one mapping between a symbolic graph structure of a concept and its corresponding EBM, which for the first time, allows acquiring new concepts, communicating its graph structure, and applying it to classification and detection tasks (even across domains) at inference time. We introduce algorithms for learning and inference with ZeroC. We evaluate ZeroC on a challenging grid-world dataset which is designed to probe zero-shot concept recognition and acquisition, and demonstrate its capability.
    Don't Copy the Teacher: Data and Model Challenges in Embodied Dialogue. (arXiv:2210.04443v2 [cs.LG] UPDATED)
    Embodied dialogue instruction following requires an agent to complete a complex sequence of tasks from a natural language exchange. The recent introduction of benchmarks (Padmakumar et al., 2022) raises the question of how best to train and evaluate models for this multi-turn, multi-agent, long-horizon task. This paper contributes to that conversation by arguing that imitation learning (IL) and related low-level metrics are misleading, do not align with the goals of embodied dialogue research, and may hinder progress. We provide empirical comparisons of metrics, analysis of three models, and make suggestions for how the field might best progress. First, we observe that models trained with IL take spurious actions during evaluation. Second, we find that existing models fail to ground query utterances, which are essential for task completion. Third, we argue evaluation should focus on higher-level semantic goals.
    Meta-Learning Dynamics Forecasting Using Task Inference. (arXiv:2102.10271v5 [cs.LG] UPDATED)
    Current deep learning models for dynamics forecasting struggle with generalization. They can only forecast in a specific domain and fail when applied to systems with different parameters, external forces, or boundary conditions. We propose a model-based meta-learning method called DyAd which can generalize across heterogeneous domains by partitioning them into different tasks. DyAd has two parts: an encoder which infers the time-invariant hidden features of the task with weak supervision, and a forecaster which learns the shared dynamics of the entire domain. The encoder adapts and controls the forecaster during inference using adaptive instance normalization and adaptive padding. Theoretically, we prove that the generalization error of such a procedure is related to the task relatedness in the source domain, as well as the domain differences between source and target. Experimentally, we demonstrate that our model outperforms state-of-the-art approaches on both turbulent flow and real-world ocean data forecasting tasks.
    On the Representation Collapse of Sparse Mixture of Experts. (arXiv:2204.09179v3 [cs.CL] UPDATED)
    Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead. It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations. However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse. In this work, we propose to estimate the routing scores between tokens and experts on a low-dimensional hypersphere. We conduct extensive experiments on cross-lingual language model pre-training and fine-tuning on downstream tasks. Experimental results across seven multilingual benchmarks show that our method achieves consistent gains. We also present a comprehensive analysis on the representation and routing behaviors of our models. Our method alleviates the representation collapse issue and achieves more consistent routing than the baseline mixture-of-experts methods.
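    The hypersphere routing step can be sketched as follows (projection size and temperature are assumed hyperparameters, not the paper's exact values): project token representations to a low dimension, L2-normalize both tokens and expert embeddings, and route by scaled cosine similarity instead of raw dot products.

        import numpy as np

        def hypersphere_routing(h, expert_emb, proj, tau=0.07):
            z = h @ proj                                       # down-project tokens
            z /= np.linalg.norm(z, axis=-1, keepdims=True)     # onto the hypersphere
            e = expert_emb / np.linalg.norm(expert_emb, axis=-1, keepdims=True)
            scores = z @ e.T / tau                             # cosine routing scores
            return scores.argmax(axis=-1)                      # top-1 expert per token

        rng = np.random.default_rng(0)
        h = rng.normal(size=(16, 512))                         # token hidden states
        proj = rng.normal(size=(512, 32)) * 0.02               # learned projection
        expert_emb = rng.normal(size=(8, 32))                  # 8 expert embeddings
        print(hypersphere_routing(h, expert_emb, proj))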
    Dilated FCN: Listening Longer to Hear Better. (arXiv:1907.11956v1 [cs.SD] CROSS LISTED)
    Deep neural network solutions have emerged as a new and powerful paradigm for speech enhancement (SE). The capabilities to capture long context and extract multi-scale patterns are crucial to design effective SE networks. Such capabilities, however, are often in conflict with the goal of maintaining compact networks to ensure good system generalization. In this paper, we explore dilation operations and apply them to fully convolutional networks (FCNs) to address this issue. Dilations equip the networks with greatly expanded receptive fields, without increasing the number of parameters. Different strategies to fuse multi-scale dilations, as well as to install the dilation modules are explored in this work. Using Noisy VCTK and AzBio sentences datasets, we demonstrate that the proposed dilation models significantly improve over the baseline FCN and outperform the state-of-the-art SE solutions.
    Double Bubble, Toil and Trouble: Enhancing Certified Robustness through Transitivity. (arXiv:2210.06077v1 [cs.LG])
    In response to subtle adversarial examples flipping classifications of neural network models, recent research has promoted certified robustness as a solution. There, invariance of predictions to all norm-bounded attacks is achieved through randomised smoothing of network inputs. Today's state-of-the-art certifications make optimal use of the class output scores at the input instance under test: no better radius of certification (under the $L_2$ norm) is possible given only these scores. However, it is an open question as to whether such lower bounds can be improved using local information around the instance under test. In this work, we demonstrate how today's "optimal" certificates can be improved by exploiting both the transitivity of certifications and the geometry of the input space, giving rise to what we term Geometrically-Informed Certified Robustness. By considering the smallest distance to points on the boundary of a set of certifications, this approach improves certifications for more than $80\%$ of Tiny-Imagenet instances, yielding an average $5\%$ increase in the associated certification. When incorporating training time processes that enhance the certified radius, our technique shows even more promising results, with a uniform $4$ percentage point increase in the achieved certified radius.
    Modular Flows: Differential Molecular Generation. (arXiv:2210.06032v1 [cs.LG])
    Generating new molecules is fundamental to advancing critical applications such as drug discovery and material synthesis. Flows can generate molecules effectively by inverting the encoding process, however, existing flow models either require artifactual dequantization or specific node/edge orderings, lack desiderata such as permutation invariance or induce discrepancy between the encoding and the decoding steps that necessitates {\em post hoc} validity correction. We circumvent these issues with novel continuous normalizing E(3)-equivariant flows, based on a system of node ODEs coupled as a graph PDE, that repeatedly reconcile locally toward globally aligned densities. Our models can be cast as message-passing temporal networks, and result in superlative performance on the tasks of density estimation and molecular generation. In particular, our generated samples achieve state-of-the-art on both the standard QM9 and ZINC250K benchmarks.
    On the Importance of Gradient Norm in PAC-Bayesian Bounds. (arXiv:2210.06143v1 [cs.LG])
    Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform bounds assumptions by using on-average bounded loss and on-average bounded gradient norm assumptions. Following this relaxation, we propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities. These inequalities add an additional loss-gradient norm term to the generalization bound, which is intuitively a surrogate of the model complexity. We apply the proposed bound on Bayesian deep nets and empirically analyze the effect of this new loss-gradient norm term on different neural architectures.
    Learning on Arbitrary Graph Topologies via Predictive Coding. (arXiv:2201.13180v3 [cs.LG] UPDATED)
    Training with backpropagation (BP) in standard deep learning consists of two main steps: a forward pass that maps a data point to its prediction, and a backward pass that propagates the error of this prediction back through the network. This process is highly effective when the goal is to minimize a specific objective function. However, it does not allow training on networks with cyclic or backward connections. This is an obstacle to reaching brain-like capabilities, as the highly complex heterarchical structure of the neural connections in the neocortex is potentially fundamental for its effectiveness. In this paper, we show how predictive coding (PC), a theory of information processing in the cortex, can be used to perform inference and learning on arbitrary graph topologies. We experimentally show how this formulation, called PC graphs, can be used to flexibly perform different tasks with the same network by simply stimulating specific neurons, and investigate how the topology of the graph influences the final performance. We conclude by comparing against simple baselines trained with BP.
    Reinforcement learning for automatic quadrilateral mesh generation: a soft actor-critic approach. (arXiv:2203.11203v2 [cs.LG] UPDATED)
    This paper proposes, implements, and evaluates a reinforcement learning (RL)-based computational framework for automatic mesh generation. Mesh generation plays a fundamental role in numerical simulations in the area of computer aided design and engineering (CAD/E). It is identified as one of the critical issues in the NASA CFD Vision 2030 Study. Existing mesh generation methods suffer from high computational complexity, low mesh quality in complex geometries, and speed limitations. These methods and tools, including commercial software packages, are typically semi-automatic and need input or help from human experts. By formulating mesh generation as a Markov decision process (MDP) problem, we are able to use a state-of-the-art RL algorithm called "soft actor-critic" to automatically learn a mesh generation policy from trials. The implementation of this RL algorithm for mesh generation allows us to build a fully automatic mesh generation system without human intervention and any extra clean-up operations, which fills the gap in the existing mesh generation tools. In experiments comparing with two representative commercial software packages, our system demonstrates promising performance with respect to scalability, generalizability, and effectiveness.
    E3Bind: An End-to-End Equivariant Network for Protein-Ligand Docking. (arXiv:2210.06069v1 [q-bio.BM])
    In silico prediction of the ligand binding pose to a given protein target is a crucial but challenging task in drug discovery. This work focuses on blind flexible self-docking, where we aim to predict the positions, orientations and conformations of docked molecules. Traditional physics-based methods usually suffer from inaccurate scoring functions and high inference cost. Recently, data-driven methods based on deep learning techniques are attracting growing interest thanks to their efficiency during inference and promising performance. These methods usually either adopt a two-stage approach by first predicting the distances between proteins and ligands and then generating the final coordinates based on the predicted distances, or directly predicting the global roto-translation of ligands. In this paper, we take a different route. Inspired by the resounding success of AlphaFold2 for protein structure prediction, we propose E3Bind, an end-to-end equivariant network that iteratively updates the ligand pose. E3Bind models the protein-ligand interaction through careful consideration of the geometric constraints in docking and the local context of the binding site. Experiments on standard benchmark datasets demonstrate the superior performance of our end-to-end trainable model compared to traditional and recently-proposed deep learning methods.
    Exploring Efficient-tuning Methods in Self-supervised Speech Models. (arXiv:2210.06175v1 [eess.AS])
    In this study, we aim to explore efficient tuning methods for speech self-supervised learning. Recent studies show that self-supervised learning (SSL) can learn powerful representations for different speech tasks. However, fine-tuning pre-trained models for each downstream task is parameter-inefficient since SSL models are notoriously large, with millions of parameters. Adapters are lightweight modules commonly used in NLP to solve this problem. In downstream tasks, the parameters of SSL models are frozen, and only the adapters are trained. Given the lack of studies generally exploring the effectiveness of adapters for self-supervised speech tasks, we intend to fill this gap by adding various adapter modules to pre-trained speech SSL models. We show that performance parity can be achieved with over 90% parameter reduction, and we discuss the pros and cons of efficient tuning techniques. This is the first comprehensive investigation of various adapter types across speech tasks.
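    The adapter modules in question are typically small bottleneck blocks; a generic PyTorch sketch follows (dimensions are illustrative, and the exact placement inside the SSL model varies by adapter type):

        import torch
        import torch.nn as nn

        class BottleneckAdapter(nn.Module):
            # down-project, nonlinearity, up-project, residual; with the
            # backbone frozen, only these few parameters are trained
            def __init__(self, d_model, bottleneck=64):
                super().__init__()
                self.down = nn.Linear(d_model, bottleneck)
                self.up = nn.Linear(bottleneck, d_model)
                nn.init.zeros_(self.up.weight)   # start as an identity mapping
                nn.init.zeros_(self.up.bias)

            def forward(self, x):
                return x + self.up(torch.relu(self.down(x)))

        adapter = BottleneckAdapter(d_model=768)
        print(sum(p.numel() for p in adapter.parameters()))  # ~0.1M per adapter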
    Large Language Models Can Be Strong Differentially Private Learners. (arXiv:2110.05679v5 [cs.LG] UPDATED)
    Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension), empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers.
    CLEEGN: A Convolutional Neural Network for Plug-and-Play Automatic EEG Reconstruction. (arXiv:2210.05988v1 [eess.SP])
    Human electroencephalography (EEG) is a brain monitoring modality that senses cortical neuroelectrophysiological activity in high-temporal resolution. One of the greatest challenges posed in applications of EEG is the unstable signal quality susceptible to inevitable artifacts during recordings. To date, most existing techniques for EEG artifact removal and reconstruction are applicable solely to offline analysis, or require individualized training data to facilitate online reconstruction. We propose CLEEGN, a novel convolutional neural network for plug-and-play automatic EEG reconstruction. CLEEGN is based on a subject-independent pre-trained model using existing data and can operate on a new user without any further calibration. The performance of CLEEGN was validated using multiple evaluations including waveform observation, reconstruction error assessment, and decoding accuracy on well-studied labeled datasets. The results of simulated online validation suggest that, even without any calibration, CLEEGN can largely preserve inherent brain activity and outperforms leading online/offline artifact removal methods in the decoding accuracy of reconstructed EEG data. In addition, visualization of model parameters and latent features exhibits the model behavior and reveals explainable insights related to existing knowledge of neuroscience. We foresee pervasive applications of CLEEGN in prospective work on online plug-and-play EEG decoding and analysis.
    Indoor Localization with Robust Global Channel Charting: A Time-Distance-Based Approach. (arXiv:2210.06294v1 [eess.SP])
    Fingerprinting-based positioning significantly improves the indoor localization performance in non-line-of-sight-dominated areas. However, its deployment and maintenance are cost-intensive as it needs ground-truth reference systems for both the initial training and the adaptation to environmental changes. In contrast, channel charting (CC) works without explicit reference information and only requires the spatial correlations of channel state information (CSI). While CC has shown promising results in modelling the geometry of the radio environment, a deeper insight into CC for localization using multi-anchor large-bandwidth measurements is still pending. We contribute a novel distance metric for time-synchronized single-input/single-output CSIs that approaches a linear correlation to the Euclidean distance. This allows the environment's global geometry to be learned without annotations. To efficiently optimize the global channel chart we approximate the metric with a Siamese neural network. This enables full CC-assisted fingerprinting and positioning using only a linear transformation from the chart to the real-world coordinates. We compare our approach to the state of the art in CC on two different real-world data sets recorded with a 5G and UWB radio setup. Our approach outperforms the others, with localization accuracies of 0.69 m for the UWB and 1.4 m for the 5G setup. We show that CC-assisted fingerprinting enables highly accurate localization and reduces (or eliminates) the need for annotated training data.
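    A minimal sketch of the Siamese approximation step, with stand-in data: the pairwise target distances below are random and merely play the role of the paper's CSI-based time-distance metric:

        import torch
        import torch.nn as nn

        # Stand-in data: n CSI feature vectors and a precomputed pairwise
        # dissimilarity matrix D (a random placeholder for the paper's metric).
        n, feat_dim, chart_dim = 512, 64, 2
        csi = torch.randn(n, feat_dim)
        D = torch.cdist(torch.randn(n, 3), torch.randn(n, 3))

        encoder = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                nn.Linear(128, chart_dim))
        opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)

        for step in range(200):
            i = torch.randint(0, n, (256,))
            j = torch.randint(0, n, (256,))
            zi, zj = encoder(csi[i]), encoder(csi[j])   # shared weights: Siamese
            loss = (((zi - zj).norm(dim=1) - D[i, j]) ** 2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
        # A linear map from the learned chart to real-world coordinates would
        # then complete the CC-assisted positioning step.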
    Understanding Cross-Domain Few-Shot Learning Based on Domain Similarity and Few-Shot Difficulty. (arXiv:2202.01339v3 [cs.LG] UPDATED)
    Cross-domain few-shot learning (CD-FSL) has drawn increasing attention for handling large differences between the source and target domains--an important concern in real-world scenarios. To overcome these large differences, recent works have considered exploiting small-scale unlabeled data from the target domain during the pre-training stage. This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain. In this paper, we empirically investigate which pre-training is preferred based on domain similarity and few-shot difficulty of the target domain. We discover that the performance gain of self-supervised pre-training over supervised pre-training becomes large when the target domain is dissimilar to the source domain, or the target domain itself has low few-shot difficulty. We further design two pre-training schemes, mixed-supervised and two-stage learning, that improve performance. In this light, we present six findings for CD-FSL, which are supported by extensive experiments and analyses on three source and eight target benchmark datasets with varying levels of domain similarity and few-shot difficulty. Our code is available at https://github.com/sungnyun/understanding-cdfsl.
    Aergia: Leveraging Heterogeneity in Federated Learning Systems. (arXiv:2210.06154v1 [cs.LG])
    Federated Learning (FL) is a popular approach for distributed deep learning that prevents the pooling of large amounts of data in a central server. FL relies on clients to update a global model using their local datasets. Classical FL algorithms use a central federator that, for each training round, waits for all clients to send their model updates before aggregating them. In practical deployments, clients might have different computing powers and network capabilities, which might lead slow clients to become performance bottlenecks. Previous works have suggested using a deadline for each learning round so that the federator ignores the late updates of slow clients, or so that clients send partially trained models before the deadline. To speed up the training process, we instead propose Aergia, a novel approach where slow clients (i) freeze the part of their model that is the most computationally intensive to train; (ii) train the unfrozen part of their model; and (iii) offload the training of the frozen part of their model to a faster client that trains it using its own dataset. The offloading decisions are orchestrated by the federator based on the training speed that clients report and on the similarities between their datasets, which are privately evaluated thanks to a trusted execution environment. We show through extensive experiments that Aergia maintains high accuracy and significantly reduces the training time under heterogeneous settings by up to 27% and 53% compared to FedAvg and TiFL, respectively.
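    A minimal sketch of the freeze-and-train split on a slow client (a hypothetical toy model; the federator's matching logic and the trusted offloading are omitted):

        import torch.nn as nn

        # Hypothetical slow-client model: freeze the compute-heavy early block
        # and keep training only the light head locally.
        model = nn.Sequential(
            nn.Sequential(nn.Linear(784, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU()),   # heavy block
            nn.Linear(512, 10),                              # light head
        )
        heavy = model[0]
        for p in heavy.parameters():
            p.requires_grad = False   # frozen on the slow client

        trainable = [p for p in model.parameters() if p.requires_grad]
        # `trainable` is optimized locally; `heavy` would be shipped to a
        # faster client (matched by the federator) and trained on its data.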
    Reinforcement Learning with Automated Auxiliary Loss Search. (arXiv:2210.06041v1 [cs.LG])
    A good state representation is crucial to solving complicated reinforcement learning (RL) challenges. Many recent works focus on designing auxiliary losses for learning informative representations. Unfortunately, these handcrafted objectives rely heavily on expert knowledge and may be sub-optimal. In this paper, we propose a principled and universal method for learning better representations with auxiliary loss functions, named Automated Auxiliary Loss Search (A2LS), which automatically searches for top-performing auxiliary loss functions for RL. Specifically, based on the collected trajectory data, we define a general auxiliary loss space of size $7.5 \times 10^{20}$ and explore the space with an efficient evolutionary search strategy. Empirical results show that the discovered auxiliary loss (namely, A2-winner) significantly improves the performance on both high-dimensional (image) and low-dimensional (vector) unseen tasks with much higher efficiency, showing promising generalization ability to different settings and even different benchmark domains. We conduct a statistical analysis to reveal the relations between patterns of auxiliary losses and RL performance.
    Clustering Embedding Tables, Without First Learning Them. (arXiv:2210.05974v1 [cs.LG])
    To work with categorical features, machine learning systems employ embedding tables. These tables can become exceedingly large in modern recommendation systems, necessitating the development of new methods for fitting them in memory, even during training. Some of the most successful methods for table compression are Product- and Residual Vector Quantization (Gray & Neuhoff, 1998). These methods replace table rows with references to k-means clustered "codewords." Unfortunately, this means they must first know the table before compressing it, so they can only save memory during inference, not training. Recent work has used hashing-based approaches to minimize memory usage during training, but the compression obtained is inferior to that obtained by "post-training" quantization. We show that the best of both worlds may be obtained by combining techniques based on hashing and clustering. By first training a hashing-based "sketch", then clustering it, and then training the clustered quantization, our method achieves compression ratios close to those of post-training quantization with the training time memory reductions of hashing-based methods. We show experimentally that our method provides better compression and/or accuracy than previous methods, and we prove that our method always converges to the optimal embedding table for least-squares training.
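    A rough sketch of the hash-then-cluster pipeline under simplifying assumptions (random data, a single hash function, and omitting the final stage of training the clustered quantization):

        import numpy as np
        from sklearn.cluster import KMeans

        # Hypothetical sizes: 100k vocabulary rows share a 4k-row hashed
        # "sketch" during training, which is then clustered into codewords.
        vocab, dim, sketch_rows, n_codewords = 100_000, 16, 4_096, 256
        rng = np.random.default_rng(0)

        # 1) Training-time memory saving: rows share storage via hashing.
        hash_idx = rng.integers(0, sketch_rows, size=vocab)  # fixed hash
        sketch = rng.normal(size=(sketch_rows, dim))         # trained normally

        # 2) Post-hoc clustering of the sketch yields the codewords.
        km = KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(sketch)

        # 3) Compressed table: each id -> hashed row -> nearest codeword.
        codebook = km.cluster_centers_
        row_to_code = km.labels_[hash_idx]                   # ~1 byte per id

        def embed(token_id: int) -> np.ndarray:
            return codebook[row_to_code[token_id]]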
    Outlier-Insensitive Kalman Filtering Using NUV Priors. (arXiv:2210.06083v1 [eess.SP])
    The Kalman filter (KF) is a widely-used algorithm for tracking the latent state of a dynamical system from noisy observations. For systems that are well-described by linear Gaussian state space models, the KF minimizes the mean-squared error (MSE). However, in practice, observations are corrupted by outliers, severely impairing the KF's performance. In this work, an outlier-insensitive KF is proposed, where robustness is achieved by modeling each potential outlier as a normally distributed random variable with unknown variance (NUV). The NUV variances are estimated online, using both expectation-maximization (EM) and alternating maximization (AM). The former was previously proposed for the task of smoothing with outliers and is adapted here to filtering. While EM and AM attain the same performance and both outperform the other algorithms, the AM approach is less complex and thus requires 40% less run-time. Our empirical study demonstrates that our proposed outlier-insensitive KF outperforms previously proposed algorithms in MSE, and that for data clean of outliers, it reverts to the classic KF, i.e., MSE optimality is preserved.
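    A heavily simplified scalar sketch of the NUV mechanism: the outlier variance that maximizes the innovation likelihood has a closed form, inflating the effective observation noise only for large residuals (the paper's EM/AM estimators and multivariate treatment differ):

        import numpy as np

        def oikf_step(m, P, y, A=1.0, Q=0.01, r=0.1):
            """One scalar KF step with a NUV-modeled outlier on the observation.

            Type-II ML of the unknown outlier variance q given innovation e has
            the closed form q* = max(e^2 - (P_pred + r), 0)."""
            m_pred, P_pred = A * m, A * A * P + Q      # predict
            e = y - m_pred                             # innovation
            q = max(e * e - (P_pred + r), 0.0)         # NUV variance estimate
            S = P_pred + r + q                         # inflated innovation var
            K = P_pred / S                             # gain shrinks for outliers
            return m_pred + K * e, (1 - K) * P_pred

        m, P = 0.0, 1.0
        print(oikf_step(m, P, y=0.3))    # behaves like the standard KF update
        print(oikf_step(m, P, y=25.0))   # gain collapses, state barely moves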
    Predictive Event Segmentation and Representation with Neural Networks: A Self-Supervised Model Assessed by Psychological Experiments. (arXiv:2210.05710v1 [q-bio.NC])
    People segment complex, ever-changing and continuous experience into basic, stable and discrete spatio-temporal experience units, called events. The event segmentation literature investigates the mechanisms that allow people to extract events. Event segmentation theory points out that people predict ongoing activities and observe prediction error signals to find event boundaries that keep events apart. In this study, we investigated the mechanism giving rise to this ability using a computational model and accompanying psychological experiments. Inspired by event segmentation theory and predictive processing, we introduced a self-supervised model of event segmentation. This model consists of neural networks that predict the sensory signal in the next time-step to represent different events, and a cognitive model that regulates these networks on the basis of their prediction errors. To verify the ability of our model to segment events, learn them during passive observation, and represent them in its internal representational space, we prepared a video that depicts human behaviors represented by point-light displays. We compared the event segmentation behaviors of participants and our model on this video at two hierarchical event segmentation levels. Using the point-biserial correlation technique, we demonstrated that the event segmentation decisions of our model correlated with the responses of participants. Moreover, by approximating the representation space of participants with a similarity-based technique, we showed that our model formed a representation space similar to that of participants. The results suggest that a model that tracks prediction error signals can produce human-like event boundaries and event representations. Finally, we discuss our contribution to the literature on event cognition and to our understanding of how event segmentation is implemented in the brain.
    Travel the Same Path: A Novel TSP Solving Strategy. (arXiv:2210.05906v1 [cs.LG])
    In this paper, we provide a novel strategy for solving the Traveling Salesman Problem (TSP), a famous combinatorial optimization problem studied intensely in the TCS community. In particular, we consider the imitation learning framework, which helps a deterministic algorithm make good choices whenever it needs to, yielding a speed-up while maintaining the exactness of the solution and avoiding unpredictability and potentially large deviations. Furthermore, we demonstrate a strong generalization ability of a graph neural network trained under the imitation learning framework. Specifically, the model is capable of solving large TSP instances faster than the baseline despite having seen only small TSP instances during training.
    Pathology Steered Stratification Network for Subtype Identification in Alzheimer's Disease. (arXiv:2210.05880v1 [q-bio.QM])
    Alzheimer's disease (AD) is a heterogeneous, multifactorial neurodegenerative disorder characterized by beta-amyloid, pathologic tau, and neurodegeneration. The massive heterogeneity between neurobiological examinations and clinical assessment is the current biggest challenge in the early diagnosis of Alzheimer's disease, calling for a comprehensive stratification of the aging population that is defined by reliable neurobiological biomarkers and closely associated with clinical outcomes. However, existing statistical inference approaches in neuroimaging studies of AD subtype identification fail to take into account the neuropathological domain knowledge, which could lead to ill-posed results that are sometimes inconsistent with neurological principles. To fill this knowledge gap, we propose a novel pathology steered stratification network (PSSN) that integrates mainstream AD pathology with multimodal longitudinal neuroimaging data to categorize the aging population. By combining theory-based biological modeling and data-driven deep learning, this cross-disciplinary approach can not only generate long-term biomarker prediction consistent with the end-state of individuals but also stratifies subjects into fine-grained subtypes with distinct neurological underpinnings, where aging brains within the same subtype share common biological behaviors that emerge as similar trajectories of cognitive decline. Our stratification outperforms K-means and SuStaIn in both inter-cluster heterogeneity and intra-cluster homogeneity of various clinical scores. Importantly, we identify six subtypes spanning the AD spectrum, where each subtype exhibits a distinctive biomarker pattern that is consistent with its clinical outcome. A disease evolutionary graph is further provided by quantifying subtype transition probabilities, which may assist pre-symptomatic diagnosis and guide therapeutic treatments.
    Context-aware Bayesian choice models. (arXiv:2210.05737v1 [stat.ML])
    The mixed multinomial logit (MMNL) model assumes constant preference parameters of a decision-maker throughout different choice situations, which may be considered too strong for certain choice modelling applications. This paper proposes an effective approach to model context-dependent intra-respondent heterogeneity and introduces the Context-aware Bayesian Mixed Multinomial Logit (C-MMNL) model, where a neural network maps contextual information to shifts in the preference parameters of each individual in each choice occasion. The proposed model offers several key advantages. First, it supports both continuous and discrete variables, as well as complex non-linear interactions between both types of variables. Second, each specification of the context is considered jointly as a whole by the neural network rather than each variable being considered independently. Finally, since the parameters of the neural network are shared across all decision-makers, it can leverage information from other decision-makers and use it to infer the effect of a particular context. Even though the C-MMNL model allows for flexible interactions between attributes, there is hardly any increase in model complexity or computation time compared to the MMNL model. We present two real-world case studies from the travel behaviour domain - a travel mode choice model and a bicycle route choice model. The bicycle route choice model is based on a large-scale, crowdsourced dataset of GPS trajectories including 110,083 trips made by 8,555 cyclists.
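    A minimal sketch of the core idea, with hypothetical dimensions: individual-level base preferences plus a shared network that maps context to parameter shifts:

        import torch
        import torch.nn as nn

        # Hypothetical dimensions; a sketch of the idea, not the paper's code.
        n_attrs, ctx_dim, n_people = 5, 8, 1000
        base_beta = nn.Parameter(torch.zeros(n_people, n_attrs))   # MMNL prefs
        shift_net = nn.Sequential(nn.Linear(ctx_dim, 32), nn.ReLU(),
                                  nn.Linear(32, n_attrs))          # shared net

        def utilities(person_ids, context, alternatives):
            # alternatives: (batch, n_alts, n_attrs); context: (batch, ctx_dim)
            beta = base_beta[person_ids] + shift_net(context)  # context shift
            return (alternatives * beta.unsqueeze(1)).sum(-1)  # (batch, n_alts)

        # Choice probabilities are a softmax over utilities (multinomial logit).
        probs = torch.softmax(utilities(torch.tensor([0, 1]),
                                        torch.randn(2, ctx_dim),
                                        torch.randn(2, 3, n_attrs)), dim=-1)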
    Application of Deep Learning on Single-Cell RNA-sequencing Data Analysis: A Review. (arXiv:2210.05677v1 [q-bio.GN])
    Single-cell RNA-sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, has helped elucidate biological processes, such as those occurring during the development of complex organisms, and has improved our understanding of disease states, such as cancer, diabetes, and COVID, among others. Deep learning, a recent advance in artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has the capacity to extract informative, compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analysis tools. Finally, we summarize the challenges faced by current deep learning approaches to scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
    Social-Group-Agnostic Word Embedding Debiasing via the Stereotype Content Model. (arXiv:2210.05831v1 [cs.CL])
    Existing word embedding debiasing methods require social-group-specific word pairs (e.g., "man"-"woman") for each social attribute (e.g., gender), which cannot be used to mitigate bias for other social groups, making it impractical or costly to extend these methods to understudied social groups. We propose that the Stereotype Content Model (SCM) -- a theoretical framework developed in social psychology for understanding the content of stereotypes, which structures stereotype content along two psychological dimensions, "warmth" and "competence" -- can help debiasing efforts become social-group-agnostic by capturing the underlying connection between bias and stereotypes. Using only pairs of terms for warmth (e.g., "genuine"-"fake") and competence (e.g., "smart"-"stupid"), we perform debiasing with established methods and find that, across gender, race, and age, SCM-based debiasing performs comparably to group-specific debiasing.
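    A toy sketch of how the warmth and competence seed pairs could drive a generic projection-based debiasing (random embeddings; established methods such as hard-debias differ in their details):

        import numpy as np

        rng = np.random.default_rng(0)
        emb = {w: rng.normal(size=50) for w in
               ["genuine", "fake", "smart", "stupid", "nurse", "engineer"]}

        def direction(pairs):
            # First principal direction of the seed-pair differences.
            diffs = np.stack([emb[a] - emb[b] for a, b in pairs])
            _, _, vt = np.linalg.svd(diffs, full_matrices=False)
            return vt[0]

        warmth = direction([("genuine", "fake")])
        competence = direction([("smart", "stupid")])

        def debias(v, dirs):
            for d in dirs:
                v = v - (v @ d) * d / (d @ d)   # project out each direction
            return v

        debiased_nurse = debias(emb["nurse"], [warmth, competence])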
    Match Cutting: Finding Cuts with Smooth Visual Transitions. (arXiv:2210.05766v1 [cs.CV])
    A match cut is a transition between a pair of shots that uses similar framing, composition, or action to fluidly bring the viewer from one scene to the next. Match cuts are frequently used in film, television, and advertising. However, finding shots that work together is a highly manual and time-consuming process that can take days. We propose a modular and flexible system to efficiently find high-quality match cut candidates starting from millions of shot pairs. We annotate and release a dataset of approximately 20k labeled pairs that we use to evaluate our system, using both classification and metric learning approaches that leverage a variety of image, video, audio, and audio-visual feature extractors. In addition, we release code and embeddings for reproducing our experiments at github.com/netflix/matchcut.
    Annihilation of Spurious Minima in Two-Layer ReLU Networks. (arXiv:2210.06088v1 [cs.LG])
    We study the optimization problem associated with fitting two-layer ReLU neural networks with respect to the squared loss, where labels are generated by a target network. Use is made of the rich symmetry structure to develop a novel set of tools for studying the mechanism by which over-parameterization annihilates spurious minima. Sharp analytic estimates are obtained for the loss and the Hessian spectrum at different minima, and it is proved that adding neurons can turn symmetric spurious minima into saddles; minima of lesser symmetry require more neurons. Using Cauchy's interlacing theorem, we prove the existence of descent directions in certain subspaces arising from the symmetry structure of the loss function. This analytic approach uses techniques, new to the field, from algebraic geometry, representation theory and symmetry breaking, and confirms rigorously the effectiveness of over-parameterization in making the associated loss landscape accessible to gradient-based methods. For a fixed number of neurons and inputs, the spectral results remain true under symmetry breaking perturbation of the target.
    Multimodality Multi-Lead ECG Arrhythmia Classification using Self-Supervised Learning. (arXiv:2210.06297v1 [eess.SP])
    The electrocardiogram (ECG) signal is one of the most effective sources of information mainly employed for the diagnosis and prediction of cardiovascular diseases (CVDs) connected with abnormalities in heart rhythm. Clearly, a single ECG modality (i.e., the time series) cannot convey its complete characteristics; thus, both time and time-frequency modalities, in the form of time-series data and spectrograms, need to be exploited. Leveraging the cutting-edge self-supervised learning (SSL) technique on unlabeled data, we propose SSL-based multimodality ECG classification. Our proposed network follows the SSL paradigm and consists of two modules corresponding to the pre-stream and down-stream tasks, respectively. In the SSL pre-stream task, we utilize self-knowledge distillation (KD) techniques with no labeled data, on various transformations and in both the time and frequency domains. In the down-stream task, which is trained on labeled data, we propose a gate fusion mechanism to fuse multimodal information. To evaluate the effectiveness of our approach, ten-fold cross validation on the 12-lead PhysioNet 2020 dataset has been conducted.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v2 [cs.LG] UPDATED)
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume a certain amount of resource from each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    A composable machine-learning approach for steady-state simulations on high-resolution grids. (arXiv:2210.05837v1 [cs.LG])
    In this paper we show that our Machine Learning (ML) approach, CoMLSim (Composable Machine Learning Simulator), can simulate PDEs on highly-resolved grids with higher accuracy and generalization to out-of-distribution source terms and geometries than traditional ML baselines. Our unique approach combines key principles of traditional PDE solvers with local-learning and low-dimensional manifold techniques to iteratively simulate PDEs on large computational domains. The proposed approach is validated on more than 5 steady-state PDEs across different PDE conditions on highly-resolved grids, and comparisons are made with the commercial solver Ansys Fluent, as well as 4 other state-of-the-art ML methods. The numerical experiments show that our approach outperforms ML baselines in terms of 1) accuracy across quantitative metrics and 2) generalization to out-of-distribution conditions as well as domain sizes. Additionally, we provide results for a large number of ablation experiments conducted to highlight components of our approach that strongly influence the results. We conclude that our local-learning and iterative-inferencing approach reduces the challenge of generalization that most ML models face.
    Deterioration Prediction using Time-Series of Three Vital Signs and Current Clinical Features Amongst COVID-19 Patients. (arXiv:2210.05881v1 [cs.LG])
    Unrecognized patient deterioration can lead to high morbidity and mortality. Most existing deterioration prediction models require a large amount of clinical information, typically collected in hospital settings, such as medical images or comprehensive laboratory tests. This is infeasible for telehealth solutions and highlights a gap in deterioration prediction models that are based on minimal data, which can be recorded at a large scale in any clinic, nursing home, or even at the patient's home. In this study, we propose and develop a prognostic model that predicts whether a patient will experience deterioration in the forthcoming 3-24 hours. The model sequentially processes routine triadic vital signs: (a) oxygen saturation, (b) heart rate, and (c) temperature. The model is also provided with basic patient information, including sex, age, vaccination status, vaccination date, and status of obesity, hypertension, or diabetes. We train and evaluate the model using data collected from 37,006 COVID-19 patients at NYU Langone Health in New York, USA. The model achieves an area under the receiver operating characteristic curve (AUROC) of 0.808-0.880 for 3-24 hour deterioration prediction. We also conduct occlusion experiments to evaluate the importance of each input feature, where the results reveal the significance of continuously monitoring the variations of the vital signs. Our results show the prospect of accurate deterioration forecast using a minimum feature set that can be relatively easily obtained using wearable devices and self-reported patient information.
    Mathematical Theory of Bayesian Statistics for Unknown Information Source. (arXiv:2206.05630v3 [cs.LG] UPDATED)
    In statistical inference, uncertainty is unknown and all models are wrong. That is to say, a person who makes a statistical model and a prior distribution is simultaneously aware that both are fictional candidates. To study such cases, statistical measures have been constructed, such as cross validation, information criteria, and marginal likelihood; however, their mathematical properties have not yet been completely clarified when statistical models are under- and over-parametrized. We introduce a mathematical theory of Bayesian statistics for unknown uncertainty, which clarifies general properties of cross validation, information criteria, and marginal likelihood, even if an unknown data-generating process is unrealizable by a model or the posterior distribution cannot be approximated by any normal distribution. Hence it gives a helpful standpoint for a person who cannot believe in any specific model and prior. This paper consists of three parts. The first is a new result, whereas the second and third are well-known previous results with new experiments. We show that there exists a more precise estimator of the generalization loss than leave-one-out cross validation, that there exists a more accurate approximation of the marginal likelihood than BIC, and that the optimal hyperparameters for generalization loss and marginal likelihood differ.
    Regularized Graph Structure Learning with Semantic Knowledge for Multi-variates Time-Series Forecasting. (arXiv:2210.06126v1 [cs.LG])
    Multivariate time-series forecasting is a critical task for many applications, and graph time-series networks are widely studied due to their capability to capture spatial-temporal correlations simultaneously. However, most existing works focus on learning with the explicit prior graph structure, while ignoring potential information from the implicit graph structure, yielding incomplete structure modeling. Some recent works attempt to learn the intrinsic or implicit graph structure directly, while lacking a way to combine the explicit prior structure with the implicit structure. In this paper, we propose the Regularized Graph Structure Learning (RGSL) model to incorporate both explicit prior structure and implicit structure, and to learn the forecasting deep networks along with the graph structure. RGSL consists of two innovative modules. First, we derive an implicit dense similarity matrix through node embedding, and learn the sparse graph structure using Regularized Graph Generation (RGG) based on the Gumbel Softmax trick. Second, we propose a Laplacian Matrix Mixed-up Module (LM3) to fuse the explicit and implicit graphs. We conduct experiments on three real-world datasets. Results show that the proposed RGSL model outperforms existing graph forecasting algorithms by a notable margin, while learning meaningful graph structure simultaneously. Our code and models are made publicly available at https://github.com/alipay/RGSL.git.
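    A minimal sketch of the graph-generation step as we read it: a dense similarity from node embeddings, then differentiable edge sampling via the Gumbel-Softmax trick (sizes and names are illustrative, not the paper's code):

        import torch
        import torch.nn.functional as F

        n_nodes, emb_dim, tau = 20, 16, 0.5
        node_emb = torch.randn(n_nodes, emb_dim, requires_grad=True)

        sim = node_emb @ node_emb.T                  # implicit dense similarity
        # Per-edge two-class logits: (edge present, edge absent).
        logits = torch.stack([sim, -sim], dim=-1)
        edge_onehot = F.gumbel_softmax(logits, tau=tau, hard=True)
        adj = edge_onehot[..., 0]                    # discrete yet differentiable
        adj.sum().backward()                         # gradients reach node_emb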
  • Open

    Distributional Random Forests: Heterogeneity Adjustment and Multivariate Distributional Regression. (arXiv:2005.14458v3 [stat.ML] UPDATED)
    Random Forest (Breiman, 2001) is a successful and widely used regression and classification algorithm. Part of its appeal and reason for its versatility is its (implicit) construction of a kernel-type weighting function on training data, which can also be used for targets other than the original mean estimation. We propose a novel forest construction for multivariate responses based on their joint conditional distribution, independent of the estimation target and the data model. It uses a new splitting criterion based on the MMD distributional metric, which is suitable for detecting heterogeneity in multivariate distributions. The induced weights define an estimate of the full conditional distribution, which in turn can be used for arbitrary and potentially complicated targets of interest. The method is very versatile and convenient to use, as we illustrate on a wide range of examples. The code is available as the Python and R packages drf.
    Scalable particle-based alternatives to EM. (arXiv:2204.12965v2 [stat.CO] UPDATED)
    (Neal and Hinton, 1998) recast the problem tackled by EM as the minimization of a free energy functional $F$ on an infinite-dimensional space and EM itself as coordinate descent applied to $F$. Here, we explore alternative ways to optimize the functional. In particular, we identify various gradient flows associated with $F$ and show that their limits coincide with $F$'s stationary points. By discretizing the flows, we obtain three practical particle-based algorithms for maximum likelihood estimation in broad classes of latent variable models. The novel algorithms scale well to high-dimensional settings and outperform existing state-of-the-art methods in experiments.
    A Characterization of Semi-Supervised Adversarially-Robust PAC Learnability. (arXiv:2202.05420v2 [cs.LG] UPDATED)
    We study the problem of learning an adversarially robust predictor to test time attacks in the semi-supervised PAC model. We address the question of how many labeled and unlabeled examples are required to ensure learning. We show that having enough unlabeled data (the size of a labeled sample that a fully-supervised method would require), the labeled sample complexity can be arbitrarily smaller compared to previous works, and is sharply characterized by a different complexity measure. We prove nearly matching upper and lower bounds on this sample complexity. This shows that there is a significant benefit in semi-supervised robust learning even in the worst-case distribution-free model, and establishes a gap between the supervised and semi-supervised label complexities which is known not to hold in standard non-robust PAC learning.
    fAux: Testing Individual Fairness via Gradient Alignment. (arXiv:2210.06288v1 [stat.ML])
    Machine learning models are vulnerable to biases that result in unfair treatment of individuals from different populations. Recent work that aims to test a model's fairness at the individual level either relies on domain knowledge to choose metrics, or on input transformations that risk generating out-of-domain samples. We describe a new approach for testing individual fairness that does not have either requirement. We propose a novel criterion for evaluating individual fairness and develop a practical testing method based on this criterion which we call fAux (pronounced fox). This is based on comparing the derivatives of the predictions of the model to be tested with those of an auxiliary model, which predicts the protected variable from the observed data. We show that the proposed method effectively identifies discrimination on both synthetic and real-world datasets, and has quantitative and qualitative advantages over contemporary methods.
    Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers. (arXiv:2210.06313v1 [cs.LG])
    This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP. Moreover, larger Transformers with more layers and wider MLP hidden dimensions are sparser as measured by the percentage of nonzero entries. Through extensive experiments we demonstrate that the emergence of sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks, on both training and evaluation data, for Transformers of various configurations, at layers of all depth levels, as well as for other architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also emerges using training datasets with random labels, or with random inputs, or with infinite amount of data, demonstrating that sparsity is not a result of a specific family of datasets. We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers. Moreover, we demonstrate perhaps surprisingly that enforcing an even sparser activation via Top-k thresholding with a small value of k brings a collection of desired but missing properties for Transformers, namely less sensitivity to noisy training data, more robustness to input corruptions, and better calibration for their prediction confidence.
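    A small sketch of the two measurements involved: the fraction of nonzero post-ReLU MLP activations, and Top-k thresholding (random weights here, so the low sparsity levels reported for trained models, e.g. 3.0%/6.3%, will not appear):

        import torch
        import torch.nn as nn

        d_model, d_ff, k = 64, 256, 16
        mlp_in = nn.Linear(d_model, d_ff)
        x = torch.randn(128, d_model)
        h = torch.relu(mlp_in(x))                      # the "activation map"

        frac_nonzero = (h > 0).float().mean().item()   # ~50% for random weights
        print(f"nonzero entries: {frac_nonzero:.1%}")

        # Top-k: keep only the k largest entries per token, zero the rest.
        topk_vals, topk_idx = h.topk(k, dim=-1)
        h_sparse = torch.zeros_like(h).scatter_(-1, topk_idx, topk_vals)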
    Alpha-divergence Variational Inference Meets Importance Weighted Auto-Encoders: Methodology and Asymptotics. (arXiv:2210.06226v1 [stat.ML])
    Several algorithms involving the Variational R\'enyi (VR) bound have been proposed to minimize an alpha-divergence between a target posterior distribution and a variational distribution. Despite promising empirical results, those algorithms resort to biased stochastic gradient descent procedures and thus lack theoretical guarantees. In this paper, we formalize and study the VR-IWAE bound, a generalization of the Importance Weighted Auto-Encoder (IWAE) bound. We show that the VR-IWAE bound enjoys several desirable properties and notably leads to the same stochastic gradient descent procedure as the VR bound in the reparameterized case, but this time by relying on unbiased gradient estimators. We then provide two complementary theoretical analyses of the VR-IWAE bound and thus of the standard IWAE bound. Those analyses shed light on the benefits or lack thereof of these bounds. Lastly, we illustrate our theoretical claims over toy and real-data examples.
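    For orientation, with i.i.d. samples $z_i \sim q$ and importance weights $w_i = p(x, z_i)/q(z_i)$, the two bounds are commonly written as follows (our notation; a sketch, not a quote from the paper):

        $$ \mathcal{L}_{\mathrm{IWAE}}^{(N)} = \mathbb{E}\left[\log \frac{1}{N}\sum_{i=1}^{N} w_i\right], \qquad \mathcal{L}_{\mathrm{VR\text{-}IWAE}}^{(N,\alpha)} = \frac{1}{1-\alpha}\,\mathbb{E}\left[\log \frac{1}{N}\sum_{i=1}^{N} w_i^{1-\alpha}\right], $$

    so that $\alpha = 0$ recovers the IWAE bound, while letting $N \to \infty$ recovers the behavior of the VR bound.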
    Interaction-Grounded Learning with Action-inclusive Feedback. (arXiv:2206.08364v2 [cs.LG] UPDATED)
    Consider the problem setting of Interaction-Grounded Learning (IGL), in which a learner's goal is to optimally interact with the environment with no explicit reward to ground its policies. The agent observes a context vector, takes an action, and receives a feedback vector, using this information to effectively optimize a policy with respect to a latent reward function. Previously analyzed approaches fail when the feedback vector contains the action, which significantly limits IGL's success in many potential scenarios such as Brain-computer interface (BCI) or Human-computer interface (HCI) applications. We address this by creating an algorithm and analysis which allows IGL to work even when the feedback vector contains the action, encoded in any fashion. We provide theoretical guarantees and large-scale experiments based on supervised datasets to demonstrate the effectiveness of the new approach.
    Learning sparse features can lead to overfitting in neural networks. (arXiv:2206.12314v2 [stat.ML] UPDATED)
    It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.
    On RKHS Choices for Assessing Graph Generators via Kernel Stein Statistics. (arXiv:2210.05746v1 [stat.ML])
    Score-based kernelised Stein discrepancy (KSD) tests have emerged as a powerful tool for goodness-of-fit testing, especially in high dimensions; however, the test performance may depend on the choice of kernels in an underlying reproducing kernel Hilbert space (RKHS). Here we assess the effect of RKHS choice for KSD tests of random network models, developed for exponential random graph models (ERGMs) in Xu and Reinert (2021) and for synthetic graph generators in Xu and Reinert (2022). We investigate the power performance and the computational runtime of the test in different scenarios, including both dense and sparse graph regimes. Experimental results on kernel performance for model assessment tasks are shown and discussed on synthetic and real-world network applications.
    On the Implicit Bias in Deep-Learning Algorithms. (arXiv:2208.12591v2 [cs.LG] UPDATED)
    Gradient-based deep-learning algorithms exhibit remarkable performance in practice, but it is not well-understood why they are able to generalize despite having more parameters than training examples. It is believed that implicit bias is a key factor in their ability to generalize, and hence it was widely studied in recent years. In this short survey, we explain the notion of implicit bias, review main results and discuss their implications.
    Counterfactual harm. (arXiv:2204.12993v4 [cs.AI] UPDATED)
    To act safely and ethically in the real world, agents must be able to reason about harm and avoid harmful actions. However, to date there is no statistical method for measuring harm and factoring it into algorithmic decisions. In this paper we propose the first formal definition of harm and benefit using causal models. We show that any factual definition of harm must violate basic intuitions in certain scenarios, and show that standard machine learning algorithms that cannot perform counterfactual reasoning are guaranteed to pursue harmful policies following distributional shifts. We use our definition of harm to devise a framework for harm-averse decision making using counterfactual objective functions. We demonstrate this framework on the problem of identifying optimal drug doses using a dose-response model learned from randomized control trial data. We find that the standard method of selecting doses using treatment effects results in unnecessarily harmful doses, while our counterfactual approach allows us to identify doses that are significantly less harmful without sacrificing efficacy.
    Scalable Sensitivity and Uncertainty Analysis for Causal-Effect Estimates of Continuous-Valued Interventions. (arXiv:2204.10022v4 [cs.LG] UPDATED)
    Estimating the effects of continuous-valued interventions from observational data is a critically important task for climate science, healthcare, and economics. Recent work focuses on designing neural network architectures and regularization functions to allow for scalable estimation of average and individual-level dose-response curves from high-dimensional, large-sample data. Such methodologies assume ignorability (observation of all confounding variables) and positivity (observation of all treatment levels for every covariate value describing a set of units), assumptions problematic in the continuous treatment regime. Scalable sensitivity and uncertainty analyses to understand the ignorance induced in causal estimates when these assumptions are relaxed are less studied. Here, we develop a continuous treatment-effect marginal sensitivity model (CMSM) and derive bounds that agree with the observed data and a researcher-defined level of hidden confounding. We introduce a scalable algorithm and uncertainty-aware deep models to derive and estimate these bounds for high-dimensional, large-sample observational data. We work in concert with climate scientists interested in the climatological impacts of human emissions on cloud properties using satellite observations from the past 15 years. This problem is known to be complicated by many unobserved confounders.
    Generalization Bounds on Multi-Kernel Learning with Mixed Datasets. (arXiv:2205.07313v2 [cs.LG] UPDATED)
    This paper presents novel generalization bounds for the multi-kernel learning problem. Motivated by applications in sensor networks and spatial-temporal models, we assume that the dataset is mixed where each sample is taken from a finite pool of Markov chains. Our bounds for learning kernels admit $O(\sqrt{\log m})$ dependency on the number of base kernels and $O(1/\sqrt{n})$ dependency on the number of training samples. However, some $O(1/\sqrt{n})$ terms are added to compensate for the dependency among samples compared with existing generalization bounds for multi-kernel learning with i.i.d. datasets.
    Differentially Private Bootstrap: New Privacy Analysis and Inference Strategies. (arXiv:2210.06140v1 [stat.ML])
    Differentially private (DP) mechanisms protect individual-level information by introducing randomness into the statistical analysis procedure. While there are now many DP tools for various statistical problems, there is still a lack of general techniques to understand the sampling distribution of a DP estimator, which is crucial for uncertainty quantification in statistical inference. We analyze a DP bootstrap procedure that releases multiple private bootstrap estimates to infer the sampling distribution and construct confidence intervals. Our privacy analysis includes new results on the privacy cost of a single DP bootstrap estimate, applicable to arbitrary DP mechanisms, and identifies some misuses of the bootstrap in the existing literature. We show that the release of $B$ DP bootstrap estimates from mechanisms satisfying $(\mu/\sqrt{(2-2/\mathrm{e})B})$-Gaussian DP asymptotically satisfies $\mu$-Gaussian DP as $B$ goes to infinity. We also develop a statistical procedure based on the DP bootstrap estimates to correctly infer the sampling distribution using techniques related to the deconvolution of probability measures, an approach which is novel in analyzing DP procedures. From our density estimate, we construct confidence intervals and compare them to existing methods through simulations and real-world experiments using the 2016 Canada Census Public Use Microdata. The coverage of our private confidence intervals achieves the nominal confidence level, while other methods fail to meet this guarantee.
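    A toy sketch of the release step under strong simplifications: data bounded in $[0,1]$ and the non-bootstrap mean sensitivity $1/n$; the paper's analysis of the true bootstrap sensitivity and its deconvolution-based inference are omitted:

        import numpy as np

        rng = np.random.default_rng(0)
        x = rng.uniform(size=1000)                  # data assumed in [0, 1]
        n, B, mu = len(x), 200, 1.0

        mu_each = mu / np.sqrt((2 - 2 / np.e) * B)  # per-estimate GDP budget
        sigma = (1 / n) / mu_each                   # Gaussian mech: sigma = Delta/mu
        boot_means = np.array([rng.choice(x, size=n, replace=True).mean()
                               for _ in range(B)])
        private_boot = boot_means + rng.normal(scale=sigma, size=B)
        # `private_boot` approximates the sampling distribution; the added
        # noise would then be deconvolved before building confidence intervals.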
    Group Meritocratic Fairness in Linear Contextual Bandits. (arXiv:2206.03150v2 [stat.ML] UPDATED)
    We study the linear contextual bandit problem where an agent has to select one candidate from a pool and each candidate belongs to a sensitive group. In this setting, candidates' rewards may not be directly comparable between groups, for example when the agent is an employer hiring candidates from different ethnic groups and some groups have a lower reward due to discriminatory bias and/or social injustice. We propose a notion of fairness that states that the agent's policy is fair when it selects a candidate with highest relative rank, which measures how good the reward is when compared to candidates from the same group. This is a very strong notion of fairness, since the relative rank is not directly observed by the agent and depends on the underlying reward model and on the distribution of rewards. Thus we study the problem of learning a policy which approximates a fair policy under the condition that the contexts are independent between groups and the distribution of rewards of each group is absolutely continuous. In particular, we design a greedy policy which at each round constructs a ridge regression estimate from the observed context-reward pairs, and then computes an estimate of the relative rank of each candidate using the empirical cumulative distribution function. We prove that, despite its simplicity and the lack of an initial exploration phase, the greedy policy achieves, up to log factors and with high probability, a fair pseudo-regret of order $\sqrt{dT}$ after $T$ rounds, where $d$ is the dimension of the context vectors. The policy also satisfies demographic parity at each round when averaged over all possible information available before the selection. Finally, we use simulated settings and experiments on the US census data to show that our policy achieves sub-linear fair pseudo-regret also in practice.
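    A loose toy of the greedy relative-rank policy (not the paper's exact estimator): ridge estimates of rewards, then within-group empirical-CDF ranks, selecting the candidate with the highest rank:

        import numpy as np

        rng = np.random.default_rng(0)
        d, T, n_groups = 5, 500, 3
        theta = rng.normal(size=d)                 # unknown reward parameter
        X_hist, y_hist = [], []
        group_ests = {g: [] for g in range(n_groups)}

        for t in range(T):
            contexts = rng.normal(size=(6, d))     # candidate pool
            groups = rng.integers(0, n_groups, size=6)
            if X_hist:
                A, b = np.array(X_hist), np.array(y_hist)
                theta_hat = np.linalg.solve(A.T @ A + np.eye(d), A.T @ b)  # ridge
            else:
                theta_hat = np.zeros(d)
            est = contexts @ theta_hat
            # Relative rank: empirical CDF of each estimate within its group.
            ranks = [np.mean(np.array(group_ests[g] or [0.0]) <= e)
                     for e, g in zip(est, groups)]
            i = int(np.argmax(ranks))
            X_hist.append(contexts[i])
            y_hist.append(contexts[i] @ theta + 0.1 * rng.normal())
            group_ests[groups[i]].append(est[i])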
    Adaptive Estimation and Uniform Confidence Bands for Nonparametric Structural Functions and Elasticities. (arXiv:2107.11869v2 [econ.EM] UPDATED)
    We introduce two practical methods for estimation and inference on a nonparametric structural function $h_0$ and its derivatives -- such as elasticities or other marginal effects -- using instrumental variables. The first is a data-driven choice of sieve dimension. The second is a data-driven approach for constructing uniform confidence bands (UCBs) for $h_0$ and its derivatives. Both procedures are simple to implement, have strong theoretical justification, and do not require prior information about the smoothness of $h_0$ or instrument strength. Our first procedure leads to estimators of $h_0$ and its derivatives that converge at the fastest possible (i.e., minimax) rate in sup-norm. Our second procedure yields UCBs for $h_0$ and its derivatives that have correct asymptotic coverage and contract at, or within a logarithmic factor of, the minimax rate. Our UCBs are asymptotically more efficient (i.e., narrower) than UCBs based on the usual approach of undersmoothing. As an application, we estimate the elasticity of the intensive margin of firm exports in a monopolistic competition model of international trade. Simulations illustrate the good performance of our procedures in empirically calibrated designs. Our results provide evidence against common parameterizations of the distribution of unobserved firm heterogeneity.
    Maximum entropy exploration in contextual bandits with neural networks and energy based models. (arXiv:2210.06302v1 [cs.LG])
    Contextual bandits can solve a huge range of real-world problems. However, current popular algorithms to solve them either rely on linear models, or unreliable uncertainty estimation in non-linear models, which are required to deal with the exploration-exploitation trade-off. Inspired by theories of human cognition, we introduce novel techniques that use maximum entropy exploration, relying on neural networks to find optimal policies in settings with both continuous and discrete action spaces. We present two classes of models, one with neural networks as reward estimators, and the other with energy based models, which model the probability of obtaining an optimal reward given an action. We evaluate the performance of these models in static and dynamic contextual bandit simulation environments. We show that both techniques outperform well-known standard algorithms, where energy based models have the best overall performance. This provides practitioners with new techniques that perform well in static and dynamic settings, and are particularly well suited to non-linear scenarios with continuous action spaces.
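    An illustrative sketch of maximum-entropy action selection with a neural reward estimator: a Boltzmann policy samples actions from a softmax over estimated rewards instead of acting greedily (the paper's energy-based variant differs):

        import torch
        import torch.nn as nn

        ctx_dim, n_actions, temp = 8, 4, 0.5
        net = nn.Sequential(nn.Linear(ctx_dim + n_actions, 32), nn.ReLU(),
                            nn.Linear(32, 1))

        def act(context):
            actions = torch.eye(n_actions)
            inp = torch.cat([context.expand(n_actions, -1), actions], dim=1)
            rewards = net(inp).squeeze(-1)               # estimated reward per action
            probs = torch.softmax(rewards / temp, dim=0) # entropy-keeping policy
            return torch.multinomial(probs, 1).item()

        a = act(torch.randn(1, ctx_dim))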
    Trajectory balance: Improved credit assignment in GFlowNets. (arXiv:2201.13259v2 [cs.LG] UPDATED)
    Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object. We find previously proposed learning objectives for GFlowNets, flow matching and detailed balance, which are analogous to temporal difference learning, to be prone to inefficient credit propagation across long action sequences. We thus propose a new learning objective for GFlowNets, trajectory balance, as a more efficient alternative to previously used objectives. We prove that any global minimizer of the trajectory balance objective can define a policy that samples exactly from the target distribution. In experiments on four distinct domains, we empirically demonstrate the benefits of the trajectory balance objective for GFlowNet convergence, diversity of generated samples, and robustness to long action sequences and large action spaces.
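    For reference, for a complete trajectory $\tau = (s_0 \to \dots \to s_n = x)$ the trajectory balance objective is commonly written as (our notation; a sketch, not a quote from the paper):

        $$ \mathcal{L}_{\mathrm{TB}}(\tau) = \left( \log \frac{Z_\theta \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta)}{R(x) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1}; \theta)} \right)^{2}, $$

    where $Z_\theta$ is a learned scalar estimate of the partition function and $P_F$, $P_B$ are the forward and backward policies; at a global minimum the forward policy samples $x$ with probability proportional to $R(x)$.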
    Energy Consumption-Aware Tabular Benchmarks for Neural Architecture Search. (arXiv:2210.06015v1 [cs.LG])
    The demand for large-scale computational resources for Neural Architecture Search (NAS) has been lessened by tabular benchmarks for NAS. Evaluating NAS strategies is now possible on extensive search spaces and at a moderate computational cost. But so far, NAS has mainly focused on maximising performance on some hold-out validation/test set. However, energy consumption is a partially conflicting objective that should not be neglected. We hypothesise that constraining NAS to include the energy consumption of training the models could reveal a sub-space of undiscovered architectures that are more computationally efficient with a smaller carbon footprint. To support the hypothesis, an existing tabular benchmark for NAS is augmented with the energy consumption of each architecture. We then perform multi-objective optimisation that includes energy consumption as an additional objective. We demonstrate the usefulness of multi-objective NAS for uncovering the trade-off between performance and energy consumption as well as for finding more energy-efficient architectures. The updated tabular benchmark, EC-NAS-Bench, is open-sourced to encourage the further exploration of energy consumption-aware NAS.
    Probabilistic Reconciliation of Count Time Series. (arXiv:2207.09322v2 [stat.ME] UPDATED)
    Forecast reconciliation is an important topic of research. Although some recent works studied probabilistic forecast reconciliation, there is currently no method for the probabilistic reconciliation of count time series, which are very common. In this paper we make two main contributions. First we propose a formal definition of coherent and reconciled probabilistic forecast for count variables. Secondly, we propose a principled method for reconciling count time series, based on a generalization of Bayes' rule. We carry out experiments regarding the application of temporal hierarchies to count time series, obtaining major improvements compared to probabilistic reconciliation based on the Gaussian or the truncated Gaussian distribution.
    An Analytical Theory of Curriculum Learning in Teacher-Student Networks. (arXiv:2106.08068v2 [cs.LG] UPDATED)
    In humans and animals, curriculum learning -- presenting data in a curated order -- is critical to rapid learning and effective pedagogy. Yet in machine learning, curricula are not widely used and empirically often yield only moderate benefits. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. Curricula could in principle change both the learning speed and asymptotic performance of a model. To study the former, we provide an exact description of the online learning setting, confirming the long-standing experimental observation that curricula can modestly speed up learning. To study the latter, we derive performance in a batch learning setting, in which a network trains to convergence in successive phases of learning on dataset slices of varying difficulty. With standard training losses, curriculum does not provide generalisation benefit, in line with empirical observations. However, we show that by connecting different learning phases through simple Gaussian priors, curriculum can yield a large improvement in test performance. Taken together, our reduced analytical descriptions help reconcile apparently conflicting empirical results and trace regimes where curriculum learning yields the largest gains. More broadly, our results suggest that fully exploiting a curriculum may require explicit changes to the loss function at curriculum boundaries.
    Deep Probability Estimation. (arXiv:2111.10734v4 [cs.LG] UPDATED)
    Reliable probability estimation is of crucial importance in many real-world applications where there is inherent (aleatoric) uncertainty. Probability-estimation models are trained on observed outcomes (e.g. whether it has rained or not, or whether a patient has died or not), because the ground-truth probabilities of the events of interest are typically unknown. The problem is therefore analogous to binary classification, with the difference that the objective is to estimate probabilities rather than predicting the specific outcome. This work investigates probability estimation from high-dimensional data using deep neural networks. There exist several methods to improve the probabilities generated by these models but they mostly focus on model (epistemic) uncertainty. For problems with inherent uncertainty, it is challenging to evaluate performance without access to ground-truth probabilities. To address this, we build a synthetic dataset to study and compare different computable metrics. We evaluate existing methods on the synthetic data as well as on three real-world probability estimation tasks, all of which involve inherent uncertainty: precipitation forecasting from radar images, predicting cancer patient survival from histopathology images, and predicting car crashes from dashcam videos. We also give a theoretical analysis of a model for high-dimensional probability estimation which reproduces several of the phenomena evinced in our experiments. Finally, we propose a new method for probability estimation using neural networks, which modifies the training process to promote output probabilities that are consistent with empirical probabilities computed from the data. The method outperforms existing approaches on most metrics on the simulated as well as real-world data.
    Non-stationary Bandits with Knapsacks. (arXiv:2205.12427v2 [cs.LG] UPDATED)
    In this paper, we study the problem of bandits with knapsacks (BwK) in a non-stationary environment. The BwK problem generalizes the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm. At each time, the decision maker/player chooses to play an arm, and s/he will receive a reward and consume a certain amount of each of the multiple resource types. The objective is to maximize the cumulative reward over a finite horizon subject to some knapsack constraints on the resources. Existing works study the BwK problem under either a stochastic or adversarial environment. Our paper considers a non-stationary environment which continuously interpolates between these two extremes. We first show that the traditional notion of variation budget is insufficient to characterize the non-stationarity of the BwK problem for a sublinear regret due to the presence of the constraints, and then we propose a new notion of global non-stationarity measure. We employ both non-stationarity measures to derive upper and lower bounds for the problem. Our results are based on a primal-dual analysis of the underlying linear programs and highlight the interplay between the constraints and the non-stationarity. Finally, we also extend the non-stationarity measure to the problem of online convex optimization with constraints and obtain new regret bounds accordingly.
    Empirical Gateaux Derivatives for Causal Inference. (arXiv:2208.13701v3 [stat.ME] UPDATED)
    We study a constructive algorithm that approximates Gateaux derivatives for statistical functionals by finite-differencing, with a focus on causal inference functionals. We consider the case where probability distributions are not known a priori but also need to be estimated from data. These estimated distributions lead to empirical Gateaux derivatives, and we study the relationships between empirical, numerical, and analytical Gateaux derivatives. Starting with a case study of estimating the mean potential outcome (hence average treatment effect), we instantiate the exact relationship between finite-differences and the analytical Gateaux derivative. We then derive requirements on the rates of numerical approximation in perturbation and smoothing that preserve the statistical benefits of one-step adjustments, such as rate-double-robustness. We then study more complicated functionals such as dynamic treatment regimes and the linear-programming formulation for policy optimization in infinite-horizon Markov decision processes. The newfound ability to approximate bias adjustments in the presence of arbitrary constraints illustrates the usefulness of constructive approaches for Gateaux derivatives. We also find that the statistical structure of the functional (rate-double robustness) can permit less conservative rates of finite-difference approximation. This property, however, can be specific to particular functionals, e.g. it occurs for the mean potential outcome (hence average treatment effect) but not the infinite-horizon MDP policy value.
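    For the mean functional, the finite-difference Gateaux derivative can be computed and checked against the analytical influence function in a few lines; the mean is chosen here purely to illustrate the general recipe:

        import numpy as np

        def T_mean(weights, atoms):
            # The mean functional evaluated at a weighted discrete distribution.
            return np.dot(weights, atoms)

        def empirical_gateaux_mean(data, x, eps=1e-4):
            # Finite-difference Gateaux derivative of the mean at the empirical
            # distribution P_n, in the direction of the point mass delta_x:
            # (T((1 - eps) P_n + eps delta_x) - T(P_n)) / eps.
            n = len(data)
            w = np.full(n, 1.0 / n)
            atoms_pert = np.append(data, x)
            w_pert = np.append((1.0 - eps) * w, eps)
            return (T_mean(w_pert, atoms_pert) - T_mean(w, data)) / eps

        rng = np.random.default_rng(1)
        data = rng.normal(size=1000)
        x = 2.5
        print(empirical_gateaux_mean(data, x))  # finite-difference estimate
        print(x - data.mean())                  # analytical influence function x - E[X]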
    On the Importance of Gradient Norm in PAC-Bayesian Bounds. (arXiv:2210.06143v1 [cs.LG])
    Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform boundedness assumptions by using on-average bounded loss and on-average bounded gradient norm assumptions. Following this relaxation, we propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities. These inequalities add an additional loss-gradient norm term to the generalization bound, which is intuitively a surrogate of the model complexity. We apply the proposed bound on Bayesian deep nets and empirically analyze the effect of this new loss-gradient norm term on different neural architectures.
    Policy Evaluation for Temporal and/or Spatial Dependent Experiments in Ride-sourcing Platforms. (arXiv:2202.10887v3 [stat.ME] UPDATED)
    Policy evaluation based on A/B testing has attracted considerable interest in digital marketing, but such evaluation in ride-sourcing platforms (e.g., Uber and Didi) is not well studied, primarily due to the complex structure of their temporal and/or spatial dependent experiments. Motivated by policy evaluation in ride-sourcing platforms, the aim of this paper is to establish a causal relationship between a platform's policies and outcomes of interest under a switchback design. We propose a novel potential outcome framework based on a temporal varying coefficient decision process (VCDP) model to capture the dynamic treatment effects in temporal dependent experiments. We further characterize the average treatment effect by decomposing it as the sum of the direct effect (DE) and the indirect effect (IE). We develop estimation and inference procedures for both DE and IE. Furthermore, we propose a spatio-temporal VCDP to deal with spatio-temporally dependent experiments. For both VCDP models, we establish the statistical properties (e.g., weak convergence and asymptotic power) of our estimation and inference procedures. We conduct extensive simulations to investigate the finite-sample performance of the proposed estimation and inference procedures. We examine how our VCDP models can help improve policy evaluation for various dispatching and dispositioning policies in Didi.
    An $\alpha$-No-Regret Algorithm For Graphical Bilinear Bandits. (arXiv:2206.00466v2 [cs.LG] UPDATED)
    We propose the first regret-based approach to the Graphical Bilinear Bandits problem, where $n$ agents in a graph play a stochastic bilinear bandit game with each of their neighbors. This setting reveals a combinatorial NP-hard problem that prevents the use of any existing regret-based algorithm in the (bi-)linear bandit literature. In this paper, we fill this gap and present the first regret-based algorithm for graphical bilinear bandits using the principle of optimism in the face of uncertainty. Theoretical analysis of this new method yields an upper bound of $\tilde{O}(\sqrt{T})$ on the $\alpha$-regret and evidences the impact of the graph structure on the rate of convergence. Finally, we show through various experiments the validity of our approach.
    Efficient and Modular Implicit Differentiation. (arXiv:2105.15183v5 [cs.LG] UPDATED)
    Automatic differentiation (autodiff) has revolutionized machine learning. It allows one to express complex computations by composing elementary ones in creative ways and removes the burden of computing their derivatives by hand. More recently, differentiation of optimization problem solutions has attracted widespread attention with applications such as optimization layers, and in bi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation has remained difficult to use for practitioners, as it often required case-by-case tedious mathematical derivations and implementations. In this paper, we propose automatic implicit differentiation, an efficient and modular approach for implicit differentiation of optimization problems. In our approach, the user defines directly in Python a function $F$ capturing the optimality conditions of the problem to be differentiated. Once this is done, we leverage autodiff of $F$ and the implicit function theorem to automatically differentiate the optimization problem. Our approach thus combines the benefits of implicit differentiation and autodiff. It is efficient as it can be added on top of any state-of-the-art solver and modular as the optimality condition specification is decoupled from the implicit differentiation mechanism. We show that seemingly simple principles allow one to recover many existing implicit differentiation methods and create new ones easily. We demonstrate the ease of formulating and solving bi-level optimization problems using our framework. We also showcase an application to the sensitivity analysis of molecular dynamics.
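    The implicit function theorem underlying this approach fits in a few lines of NumPy. The sketch below differentiates a ridge-regression solution with respect to its regularization strength; it illustrates only the mathematical principle, not the paper's actual library interface:

        import numpy as np

        rng = np.random.default_rng(0)
        A = rng.normal(size=(50, 10))
        b = rng.normal(size=50)
        I = np.eye(10)

        def solve(theta):
            # Ridge solution x(theta), implicitly defined by the optimality condition
            # F(x, theta) = (A^T A + theta I) x - A^T b = 0.
            return np.linalg.solve(A.T @ A + theta * I, A.T @ b)

        def implicit_grad(theta):
            # Implicit function theorem: dx/dtheta = -(dF/dx)^{-1} dF/dtheta,
            # with dF/dx = A^T A + theta I and dF/dtheta = x(theta).
            x = solve(theta)
            return -np.linalg.solve(A.T @ A + theta * I, x)

        theta = 0.7
        fd = (solve(theta + 1e-6) - solve(theta - 1e-6)) / 2e-6  # finite-difference check
        print(np.allclose(implicit_grad(theta), fd, atol=1e-4))  # True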
    Can Push-forward Generative Models Fit Multimodal Distributions?. (arXiv:2206.14476v2 [stat.ML] UPDATED)
    Many generative models synthesize data by transforming a standard Gaussian random variable using a deterministic neural network. Among these models are the Variational Autoencoders and the Generative Adversarial Networks. In this work, we call them "push-forward" models and study their expressivity. We show that the Lipschitz constant of these generative networks has to be large in order to fit multimodal distributions. More precisely, we show that the total variation distance and the Kullback-Leibler divergence between the generated and the data distribution are bounded from below by a constant depending on the mode separation and the Lipschitz constant. Since constraining the Lipschitz constants of neural networks is a common way to stabilize generative models, there is a provable trade-off between the ability of push-forward models to approximate multimodal distributions and the stability of their training. We validate our findings on one-dimensional and image datasets and empirically show that generative models consisting of stacked networks with stochastic input at each step, such as diffusion models, do not suffer from such limitations.
    Robust Streaming PCA. (arXiv:1902.03223v3 [stat.ML] UPDATED)
    We consider streaming principal component analysis when the stochastic data-generating model is subject to perturbations. While existing models assume a fixed covariance, we adopt a robust perspective where the covariance matrix belongs to a temporal uncertainty set. Under this setting, we provide fundamental limits on the convergence of any algorithm recovering principal components. We analyze the convergence of the noisy power method and Oja's algorithm, both studied for the stationary data-generating model, and argue that the noisy power method is rate-optimal in our setting. Finally, we demonstrate the validity of our analysis through numerical experiments on synthetic and real-world datasets.
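    Oja's algorithm, one of the two methods analyzed, is a one-pass stochastic update; a minimal NumPy sketch (step sizes and the synthetic stream are illustrative choices, not the paper's experimental setup):

        import numpy as np

        def oja_top_pc(stream, dim, lr=0.5):
            # One pass of Oja's algorithm tracking the top principal direction.
            rng = np.random.default_rng(0)
            w = rng.normal(size=dim)
            w /= np.linalg.norm(w)
            for t, x in enumerate(stream, start=1):
                w += (lr / np.sqrt(t)) * x * (x @ w)  # stochastic power-iteration step
                w /= np.linalg.norm(w)                # project back onto the unit sphere
            return w

        # Synthetic stream whose covariance has leading eigenvector e_1.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(5000, 20))
        X[:, 0] *= 3.0
        w = oja_top_pc(X, dim=20)
        print(abs(w[0]))  # close to 1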
    Generalised Mutual Information for Discriminative Clustering. (arXiv:2210.06300v1 [stat.ML])
    Over the last decade, successes in deep clustering have largely involved the mutual information (MI) as an unsupervised objective for training neural networks with increasing regularisations. While the quality of the regularisations has been largely discussed for improvements, little attention has been dedicated to the relevance of MI as a clustering objective. In this paper, we first highlight how the maximisation of MI does not lead to satisfying clusters. We identify the Kullback-Leibler divergence as the main reason for this behaviour. Hence, we generalise the mutual information by changing its core distance, introducing the generalised mutual information (GEMINI): a set of metrics for unsupervised neural network training. Unlike MI, some GEMINIs do not require regularisations when training. Some of these metrics are geometry-aware thanks to distances or kernels in the data space. Finally, we highlight that GEMINIs can automatically select a relevant number of clusters, a property that has been little studied in the deep clustering context, where the number of clusters is a priori unknown.
    Achieving the Pareto Frontier of Regret Minimization and Best Arm Identification in Multi-Armed Bandits. (arXiv:2110.08627v2 [cs.LG] UPDATED)
    We study the Pareto frontier of two archetypal objectives in multi-armed bandits, namely, regret minimization (RM) and best arm identification (BAI) with a fixed horizon. It is folklore that the balance between exploitation and exploration is crucial for both RM and BAI, but exploration is more critical in achieving the optimal performance for the latter objective. To this end, we design and analyze the BoBW-lil'UCB$(\gamma)$ algorithm. Complementarily, by establishing lower bounds on the regret achievable by any algorithm with a given BAI failure probability, we show that (i) no algorithm can simultaneously perform optimally for both the RM and BAI objectives, and (ii) BoBW-lil'UCB$(\gamma)$ achieves order-wise optimal performance for RM or BAI under different values of $\gamma$. Our work elucidates the trade-off more precisely by showing how the constants in previous works depend on certain hardness parameters. Finally, we show that BoBW-lil'UCB outperforms a close competitor UCB$_\alpha$ (Degenne et al., 2019) in terms of the time complexity and the regret on diverse datasets such as MovieLens and Published Kinase Inhibitor Set.
    SCROLLS: Standardized CompaRison Over Long Language Sequences. (arXiv:2201.03533v2 [cs.CL] UPDATED)
    NLP benchmarks have largely focused on short texts, such as sentences and paragraphs, even though long texts comprise a considerable amount of natural language in the wild. We introduce SCROLLS, a suite of tasks that require reasoning over long texts. We examine existing long-text datasets, and handpick ones where the text is naturally long, while prioritizing tasks that involve synthesizing information across the input. SCROLLS contains summarization, question answering, and natural language inference tasks, covering multiple domains, including literature, science, business, and entertainment. Initial baselines, including Longformer Encoder-Decoder, indicate that there is ample room for improvement on SCROLLS. We make all datasets available in a unified text-to-text format and host a live leaderboard to facilitate research on model architecture and pretraining methods.
    Minimax-Optimal Multi-Agent RL in Markov Games With a Generative Model. (arXiv:2208.10458v2 [cs.LG] UPDATED)
    This paper studies multi-agent reinforcement learning in Markov games, with the goal of learning Nash equilibria or coarse correlated equilibria (CCE) sample-optimally. All prior results suffer from at least one of the two obstacles: the curse of multiple agents and the barrier of long horizon, regardless of the sampling protocol in use. We take a step towards settling this problem, assuming access to a flexible sampling mechanism: the generative model. Focusing on non-stationary finite-horizon Markov games, we develop a fast learning algorithm and an adaptive sampling scheme that leverage the optimism principle in online adversarial learning (particularly the Follow-the-Regularized-Leader (FTRL) method). Our algorithm learns an $\varepsilon$-approximate CCE in a general-sum Markov game using $$ \widetilde{O}\bigg( \frac{H^4 S \sum_{i=1}^m A_i}{\varepsilon^2} \bigg) $$ samples, where $m$ is the number of players, $S$ indicates the number of states, $H$ is the horizon, and $A_i$ denotes the number of actions for the $i$-th player. This is minimax-optimal (up to log factor) when the number of players is fixed. When applied to two-player zero-sum Markov games, our algorithm provably finds an $\varepsilon$-approximate Nash equilibrium with minimal samples. Along the way, we derive a refined regret bound for FTRL that makes explicit the role of variance-type quantities, which might be of independent interest.
    Modular Flows: Differential Molecular Generation. (arXiv:2210.06032v1 [cs.LG])
    Generating new molecules is fundamental to advancing critical applications such as drug discovery and material synthesis. Flows can generate molecules effectively by inverting the encoding process, however, existing flow models either require artifactual dequantization or specific node/edge orderings, lack desiderata such as permutation invariance or induce discrepancy between the encoding and the decoding steps that necessitates {\em post hoc} validity correction. We circumvent these issues with novel continuous normalizing E(3)-equivariant flows, based on a system of node ODEs coupled as a graph PDE, that repeatedly reconcile locally toward globally aligned densities. Our models can be cast as message-passing temporal networks, and result in superlative performance on the tasks of density estimation and molecular generation. In particular, our generated samples achieve state-of-the-art on both the standard QM9 and ZINC250K benchmarks.
    Unsupervised Learning of Equivariant Structure from Sequences. (arXiv:2210.05972v1 [cs.LG])
    In this study, we present meta-sequential prediction (MSP), an unsupervised framework to learn symmetry from time sequences of length at least three. Our method leverages the stationary property (e.g. constant velocity, constant acceleration) of the time sequence to learn the underlying equivariant structure of the dataset by simply training the encoder-decoder model to predict the future observations. We demonstrate that, with our framework, the hidden disentangled structure of the dataset naturally emerges as a by-product of applying simultaneous block-diagonalization to the transition operators in the latent space, a procedure commonly used in representation theory to decompose the feature space based on the type of response to group actions. We showcase our method from both empirical and theoretical perspectives. Our result suggests that finding a simple structured relation and learning a model with extrapolation capability are two sides of the same coin. The code is available at https://github.com/takerum/meta_sequential_prediction.
    Selective Classification Via Neural Network Training Dynamics. (arXiv:2205.13532v3 [cs.LG] UPDATED)
    Selective classification is the task of rejecting inputs a model would predict incorrectly on through a trade-off between input space coverage and model accuracy. Current methods for selective classification impose constraints on either the model architecture or the loss function; this inhibits their usage in practice. In contrast to prior work, we show that state-of-the-art selective classification performance can be attained solely from studying the (discretized) training dynamics of a model. We propose a general framework that, for a given test input, monitors metrics capturing the disagreement with the final predicted label over intermediate models obtained during training; we then reject data points exhibiting too much disagreement at late stages in training. In particular, we instantiate a method that tracks when the label predicted during training stops disagreeing with the final predicted label. Our experimental evaluation shows that our method achieves state-of-the-art accuracy/coverage trade-offs on typical selective classification benchmarks.
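    A minimal version of the checkpoint-disagreement idea: given predicted labels recorded at intermediate checkpoints, score each test point by how often late checkpoints disagree with the final prediction and reject high-scoring points. The halving split and threshold below are illustrative choices, not the paper's exact instantiation:

        import numpy as np

        def rejection_scores(checkpoint_preds):
            # checkpoint_preds: (n_checkpoints, n_points) array of predicted labels
            # recorded over training. Score each point by the fraction of late-stage
            # checkpoints whose prediction disagrees with the final one.
            final = checkpoint_preds[-1]
            late = checkpoint_preds[checkpoint_preds.shape[0] // 2:]
            return (late != final).mean(axis=0)

        preds = np.array([[0, 1, 1],
                          [0, 0, 1],
                          [0, 1, 1],
                          [0, 0, 1]])  # 4 checkpoints, 3 test points
        scores = rejection_scores(preds)
        accept = scores <= 0.25  # the threshold trades coverage against accuracy
        print(scores, accept)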
    Generalization Error Bounds on Deep Learning with Markov Datasets. (arXiv:2201.11059v4 [stat.ML] UPDATED)
    In this paper, we derive upper bounds on generalization errors for deep neural networks with Markov datasets. These bounds are developed based on Koltchinskii and Panchenko's approach for bounding the generalization error of combined classifiers with i.i.d. datasets. The development of new symmetrization inequalities in high-dimensional probability for Markov chains is a key element in our extension, where the spectral gap of the infinitesimal generator of the Markov chain serves as a key parameter in these inequalities. We also propose a simple method to convert these bounds and other similar ones in traditional deep learning and machine learning to Bayesian counterparts for both i.i.d. and Markov datasets. Extensions to $m$-order homogeneous Markov chains such as AR and ARMA models and mixtures of several Markov data sources are given.
    C-Mixup: Improving Generalization in Regression. (arXiv:2210.05775v1 [cs.LG])
    Improving the generalization of deep networks is an important open challenge, particularly in domains without plentiful data. The mixup algorithm improves generalization by linearly interpolating a pair of examples and their corresponding labels. These interpolated examples augment the original training set. Mixup has shown promising results in various classification tasks, but systematic analysis of mixup in regression remains underexplored. Using mixup directly on regression labels can result in arbitrarily incorrect labels. In this paper, we propose a simple yet powerful algorithm, C-Mixup, to improve generalization on regression tasks. In contrast with vanilla mixup, which picks training examples for mixing with uniform probability, C-Mixup adjusts the sampling probability based on the similarity of the labels. Our theoretical analysis confirms that C-Mixup with label similarity obtains a smaller mean square error in supervised regression and meta-regression than vanilla mixup and using feature similarity. Another benefit of C-Mixup is that it can improve out-of-distribution robustness, where the test distribution is different from the training distribution. By selectively interpolating examples with similar labels, it mitigates the effects of domain-associated information and yields domain-invariant representations. We evaluate C-Mixup on eleven datasets, ranging from tabular to video data. Compared to the best prior approach, C-Mixup achieves 6.56%, 4.76%, 5.82% improvements in in-distribution generalization, task generalization, and out-of-distribution robustness, respectively. Code is released at https://github.com/huaxiuyao/C-Mixup.
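    The sampling rule is simple to sketch: weight candidate mixing partners by a Gaussian kernel on label distance, then interpolate as in standard mixup. The bandwidth and Beta parameter below are illustrative hyperparameters:

        import numpy as np

        def c_mixup_pair(X, y, i, bandwidth=1.0, alpha=2.0, rng=None):
            # Sample a mixing partner for example i with probability proportional to a
            # Gaussian kernel on label distance, then interpolate as in vanilla mixup.
            rng = rng or np.random.default_rng()
            p = np.exp(-((y - y[i]) ** 2) / (2.0 * bandwidth ** 2))
            p[i] = 0.0  # never mix an example with itself
            p /= p.sum()
            j = rng.choice(len(y), p=p)
            lam = rng.beta(alpha, alpha)
            return lam * X[i] + (1 - lam) * X[j], lam * y[i] + (1 - lam) * y[j]

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))
        y = X[:, 0] + 0.1 * rng.normal(size=100)
        x_mix, y_mix = c_mixup_pair(X, y, i=0, rng=rng)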
    Trading Off Resource Budgets for Improved Regret Bounds. (arXiv:2210.05789v1 [cs.LG])
    In this work we consider a variant of adversarial online learning where in each round one picks $B$ out of $N$ arms and incurs cost equal to the $\textit{minimum}$ of the costs of each arm chosen. We propose an algorithm called Follow the Perturbed Multiple Leaders (FPML) for this problem, which we show (by adapting the techniques of Kalai and Vempala [2005]) achieves expected regret $\mathcal{O}(T^{\frac{1}{B+1}}\ln(N)^{\frac{B}{B+1}})$ over time horizon $T$ relative to the $\textit{single}$ best arm in hindsight. This introduces a trade-off between the budget $B$ and the single-best-arm regret, and we proceed to investigate several applications of this trade-off. First, we observe that algorithms which use standard regret minimizers as subroutines can sometimes be adapted by replacing these subroutines with FPML, and we use this to generalize existing algorithms for Online Submodular Function Maximization [Streeter and Golovin, 2008] in both the full feedback and semi-bandit feedback settings. Next, we empirically evaluate our new algorithms on an online black-box hyperparameter optimization problem. Finally, we show how FPML can lead to new algorithms for Linear Programming which require stronger oracles at the benefit of fewer oracle calls.
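    A sketch of the FPML idea in NumPy: each round, perturb the cumulative costs and play the $B$ arms with the smallest perturbed totals, paying only the minimum cost among them. The exponential perturbation below is an assumption in the spirit of follow-the-perturbed-leader, not necessarily the paper's exact distribution:

        import numpy as np

        def fpml(costs, B, eta=1.0, seed=0):
            # Follow the Perturbed Multiple Leaders: each round, perturb the
            # cumulative costs and play the B arms with the smallest perturbed
            # totals, incurring the minimum cost among the arms played.
            rng = np.random.default_rng(seed)
            T, N = costs.shape
            cum = np.zeros(N)
            total = 0.0
            for t in range(T):
                perturbed = cum - eta * rng.exponential(size=N)
                play = np.argsort(perturbed)[:B]
                total += costs[t, play].min()
                cum += costs[t]
            return total

        rng = np.random.default_rng(1)
        costs = rng.uniform(size=(1000, 10))
        print(fpml(costs, B=3))          # algorithm's total cost
        print(costs.sum(axis=0).min())   # single best arm in hindsight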
    Context-aware Bayesian choice models. (arXiv:2210.05737v1 [stat.ML])
    The mixed multinomial logit (MMNL) model assumes constant preference parameters of a decision-maker throughout different choice situations, which may be considered too strong for certain choice modelling applications. This paper proposes an effective approach to model context-dependent intra-respondent heterogeneity and introduces the idea of the Context-aware Bayesian Mixed Multinomial Logit (C-MMNL) Model, where a neural network maps contextual information to shifts in the preference parameters of each individual in each choice occasion. The proposed model offers several key advantages. First, it supports both continuous and discrete variables, as well as complex non-linear interactions between both types of variables. Second, each specification of the context is considered jointly as a whole by the neural network, rather than each variable being considered independently. Finally, since the parameters of the neural network are shared across all decision-makers, the model can leverage information from other decision-makers and use it to infer the effect of a particular context. Even though the C-MMNL model allows for flexible interactions between attributes, there is hardly any increase in the complexity of the model or in computation time compared to the MMNL model. We present two real-world case studies from the travel behaviour domain - a travel mode choice model and a bicycle route choice model. The bicycle route choice model is based on a large-scale, crowdsourced dataset of GPS trajectories including 110,083 trips made by 8,555 cyclists.
    Machine Learning and Deep Learning -- A review for Ecologists. (arXiv:2204.05023v3 [q-bio.QM] UPDATED)
    1. The popularity of Machine learning (ML), Deep learning (DL), and Artificial intelligence (AI) has risen sharply in recent years. Despite this spike in popularity, the inner workings of ML and DL algorithms are often perceived as opaque, and their relationship to classical data analysis tools remains debated. 2. Although it is often assumed that ML and DL excel primarily at making predictions, ML and DL can also be used for analytical tasks traditionally addressed with statistical models. Moreover, most recent discussions and reviews on ML focus mainly on DL, missing out on synthesizing the wealth of ML algorithms with different advantages and general principles. 3. Here, we provide a comprehensive overview of the field of ML and DL, starting by summarizing its historical developments, existing algorithm families, differences to traditional statistical tools, and universal ML principles. We then discuss why and when ML and DL models excel at prediction tasks and where they could offer alternatives to traditional statistical methods for inference, highlighting current and emerging applications for ecological problems. Finally, we summarize emerging trends such as scientific and causal ML, explainable AI, and responsible AI that may significantly impact ecological data analysis in the future. 4. We conclude that ML and DL are powerful new tools for predictive modeling and data analysis. The superior performance of ML and DL algorithms compared to statistical models can be explained by their higher flexibility and automatic data-dependent complexity optimization. However, their use for causal inference is still disputed as the focus of ML and DL methods on predictions creates challenges for the interpretation of these models. Nevertheless, we expect ML and DL to become an indispensable tool in E&E, comparable to other traditional statistical tools.
    Fundamental limits and algorithms for sparse linear regression with sublinear sparsity. (arXiv:2101.11156v5 [cs.IT] UPDATED)
    We establish exact asymptotic expressions for the normalized mutual information and minimum mean-square-error (MMSE) of sparse linear regression in the sub-linear sparsity regime. Our result is achieved by a generalization of the adaptive interpolation method in Bayesian inference from linear regimes to sub-linear ones. A modification of the well-known approximate message passing algorithm to approach the MMSE fundamental limit is also proposed, and its state evolution is rigorously analyzed. Our results show that the traditional assumption of linear scaling between the signal dimension and the number of observations in the replica and adaptive interpolation methods is not necessary for sparse signals. They also show how to modify the existing well-known AMP algorithms for linear regimes to sub-linear ones.
    Finite Sample Analysis Of Dynamic Regression Parameter Learning. (arXiv:1906.05591v4 [cs.LG] UPDATED)
    We consider the dynamic linear regression problem, where the predictor vector may vary with time. This problem can be modeled as a linear dynamical system, with a non-constant observation operator, where the parameters that need to be learned are the variance of both the process noise and the observation noise. While variance estimation for dynamic regression is a natural problem, with a variety of applications, existing approaches to this problem either lack guarantees altogether, or only have asymptotic guarantees without explicit rates. In particular, the existing literature does not provide any clues to the following fundamental question: in terms of data characteristics, what does the convergence rate depend on? In this paper we study the global system operator -- the operator that maps the noise vectors to the output. We obtain estimates on its spectrum, and as a result derive the first known variance estimators with finite sample complexity guarantees. The proposed bounds depend on the shape of a certain spectrum related to the system operator, and thus provide the first known explicit geometric parameter of the data that can be used to bound estimation errors. In addition, the results hold for arbitrary sub-Gaussian distributions of noise terms. We evaluate the approach on synthetic and real-world benchmarks.
    Adversarial random forests for density estimation and generative modelling. (arXiv:2205.09435v2 [stat.ML] UPDATED)
    We propose methods for density estimation and data synthesis using a novel form of unsupervised random forests. Inspired by generative adversarial networks, we implement a recursive procedure in which trees gradually learn structural properties of the data through alternating rounds of generation and discrimination. The method is provably consistent under minimal assumptions. Unlike existing tree-based alternatives, our approach provides smooth unconditional densities and allows for fully synthetic data generation. We achieve comparable or superior performance to state-of-the-art deep learning models on various tabular data benchmarks while executing about two orders of magnitude faster on average. All algorithms are implemented in easy-to-use $\texttt{R}$ and Python packages.
    Spectral Algorithms Optimally Recover Planted Sub-structures. (arXiv:2203.11847v2 [cs.DS] UPDATED)
    Spectral algorithms are an important building block in machine learning and graph algorithms. We are interested in studying when such algorithms can be applied directly to provide optimal solutions to inference tasks. Previous works by Abbe, Fan, Wang and Zhong (2020) and by Dhara, Gaudio, Mossel and Sandon (2022) showed the optimality for community detection in the Stochastic Block Model (SBM), as well as in a censored variant of the SBM. Here we show that this optimality is somewhat universal as it carries over to other planted substructures such as the planted dense subgraph problem and submatrix localization problem, as well as to a censored version of the planted dense subgraph problem.
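    The flavour of such spectral recovery is easy to demonstrate for the planted dense subgraph problem: centre the adjacency matrix and threshold the leading eigenvector. The toy parameters below are arbitrary, and the estimator is a textbook illustration rather than the papers' exact procedure:

        import numpy as np

        rng = np.random.default_rng(0)
        n, k, p, q = 400, 80, 0.2, 0.6  # background density p, planted density q > p
        planted = rng.choice(n, size=k, replace=False)
        P = np.full((n, n), p)
        P[np.ix_(planted, planted)] = q
        A = (rng.uniform(size=(n, n)) < P).astype(float)
        A = np.triu(A, 1)
        A = A + A.T  # symmetric adjacency matrix, no self-loops

        # Spectral step: centre by the background mean, take the eigenvector of the
        # largest-magnitude eigenvalue, and keep its k largest entries in magnitude.
        vals, vecs = np.linalg.eigh(A - p)
        v = vecs[:, np.argmax(np.abs(vals))]
        guess = np.argsort(np.abs(v))[-k:]
        print(len(set(guess) & set(planted)) / k)  # fraction of planted nodes recovered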
    uGLAD: Sparse graph recovery by optimizing deep unrolled networks. (arXiv:2205.11610v2 [cs.LG] UPDATED)
    Probabilistic Graphical Models (PGMs) are generative models of complex systems. They rely on conditional independence assumptions between variables to learn sparse representations which can be visualized in the form of a graph. Such models are used for domain exploration and structure discovery in poorly understood domains. This work introduces a novel technique to perform sparse graph recovery by optimizing deep unrolled networks. Assuming that the input data $X\in\mathbb{R}^{M\times D}$ comes from an underlying multivariate Gaussian distribution, we apply a deep model on $X$ that outputs the precision matrix $\hat{\Theta}$, which can also be interpreted as the adjacency matrix. Our model, uGLAD, builds upon and extends the state-of-the-art model GLAD to the unsupervised setting. The key benefits of our model are (1) uGLAD automatically optimizes sparsity-related regularization parameters, leading to better performance than existing algorithms, and (2) we introduce a multi-task-learning-based `consensus' strategy for robust handling of missing data in an unsupervised setting. We evaluate model results on synthetic Gaussian data and non-Gaussian data generated from Gene Regulatory Networks, and present a case study in anaerobic digestion.
    Flowification: Everything is a Normalizing Flow. (arXiv:2205.15209v2 [cs.LG] CROSS LISTED)
    We develop a method that can be used to calculate the likelihood contribution of linear and convolutional layers, allowing multi-layer perceptrons and convolutional networks to be converted into normalizing flows. We term this process flowification. In some cases flowification requires the addition of uncorrelated noise to the model, but in the simplest case it requires no additional parameters. The technique we develop can be applied to a broad range of architectures, allowing them to be used for a wide range of tasks. Our models also allow existing density estimation techniques to be combined with high performance feature extractors. In contrast to standard density estimation techniques that require specific architectures and specialized knowledge, our approach can leverage design knowledge from different domains and is a step closer to the realization of general purpose architectures. We investigate the efficacy of linear and convolutional layers for the task of density estimation on standard datasets.
    Sampling in Constrained Domains with Orthogonal-Space Variational Gradient Descent. (arXiv:2210.06447v1 [cs.LG])
    Sampling methods, as important inference and learning techniques, are typically designed for unconstrained domains. However, constraints are ubiquitous in machine learning problems, such as those on safety, fairness, robustness, and many other properties that must be satisfied to apply sampling results in real-life applications. Enforcing these constraints often leads to implicitly-defined manifolds, making efficient sampling with constraints very challenging. In this paper, we propose a new variational framework with a designed orthogonal-space gradient flow (O-Gradient) for sampling on a manifold $\mathcal{G}_0$ defined by general equality constraints. O-Gradient decomposes the gradient into two parts: one decreases the distance to $\mathcal{G}_0$ and the other decreases the KL divergence in the orthogonal space. While most existing manifold sampling methods require initialization on $\mathcal{G}_0$, O-Gradient does not require such prior knowledge. We prove that O-Gradient converges to the target constrained distribution at rate $\widetilde{O}(1/K)$, where $K$ is the number of iterations, under mild conditions. Our proof relies on a new Stein characterization of conditional measure which could be of independent interest. We implement O-Gradient through both Langevin dynamics and Stein variational gradient descent and demonstrate its effectiveness in various experiments, including Bayesian deep neural networks.
    Information bottleneck theory of high-dimensional regression: relevancy, efficiency and optimality. (arXiv:2208.03848v2 [cs.IT] UPDATED)
    Avoiding overfitting is a central challenge in machine learning, yet many large neural networks readily achieve zero training loss. This puzzling contradiction necessitates new approaches to the study of overfitting. Here we quantify overfitting via residual information, defined as the bits in fitted models that encode noise in training data. Information efficient learning algorithms minimize residual information while maximizing the relevant bits, which are predictive of the unknown generative models. We solve this optimization to obtain the information content of optimal algorithms for a linear regression problem and compare it to that of randomized ridge regression. Our results demonstrate the fundamental trade-off between residual and relevant information and characterize the relative information efficiency of randomized regression with respect to optimal algorithms. Finally, using results from random matrix theory, we reveal the information complexity of learning a linear map in high dimensions and unveil information-theoretic analogs of double and multiple descent phenomena.
    Graph Neural Network Bandits. (arXiv:2207.06456v2 [cs.LG] UPDATED)
    We consider the bandit optimization problem with the reward function defined over graph-structured data. This problem has important applications in molecule design and drug discovery, where the reward is naturally invariant to graph permutations. The key challenges in this setting are scaling to large domains, and to graphs with many nodes. We resolve these challenges by embedding the permutation invariance into our model. In particular, we show that graph neural networks (GNNs) can be used to estimate the reward function, assuming it resides in the Reproducing Kernel Hilbert Space of a permutation-invariant additive kernel. By establishing a novel connection between such kernels and the graph neural tangent kernel (GNTK), we introduce the first GNN confidence bound and use it to design a phased-elimination algorithm with sublinear regret. Our regret bound depends on the GNTK's maximum information gain, which we also provide a bound for. While the reward function depends on all $N$ node features, our guarantees are independent of the number of graph nodes $N$. Empirically, our approach exhibits competitive performance and scales well on graph-structured domains.
    Contrastive Neural Ratio Estimation. (arXiv:2210.06170v1 [stat.ML])
    Likelihood-to-evidence ratio estimation is usually cast as either a binary (NRE-A) or a multiclass (NRE-B) classification task. In contrast to the binary classification framework, the current formulation of the multiclass version has an intrinsic and unknown bias term, making otherwise informative diagnostics unreliable. We propose a multiclass framework free from the bias inherent to NRE-B at optimum, leaving us in the position to run diagnostics that practitioners depend on. It also recovers NRE-A in one corner case and NRE-B in the limiting case. For fair comparison, we benchmark the behavior of all algorithms in both familiar and novel training regimes: when jointly drawn data is unlimited, when data is fixed but prior draws are unlimited, and in the commonplace fixed data and parameters setting. Our investigations reveal that the highest performing models are distant from the competitors (NRE-A, NRE-B) in hyperparameter space. We make a recommendation for hyperparameters distinct from the previous models. We suggest a bound on the mutual information as a performance metric for simulation-based inference methods, without the need for posterior samples, and provide experimental results.
    A New Family of Generalization Bounds Using Samplewise Evaluated CMI. (arXiv:2210.06422v1 [cs.LG])
    We present a new family of information-theoretic generalization bounds, in which the training loss and the population loss are compared through a jointly convex function. This function is upper-bounded in terms of the disintegrated, samplewise, evaluated conditional mutual information (CMI), an information measure that depends on the losses incurred by the selected hypothesis, rather than on the hypothesis itself, as is common in probably approximately correct (PAC)-Bayesian results. We demonstrate the generality of this framework by recovering and extending previously known information-theoretic bounds. Furthermore, using the evaluated CMI, we derive a samplewise, average version of Seeger's PAC-Bayesian bound, where the convex function is the binary KL divergence. In some scenarios, this novel bound results in a tighter characterization of the population loss of deep neural networks than previous bounds. Finally, we derive high-probability versions of some of these average bounds. We demonstrate the unifying nature of the evaluated CMI bounds by using them to recover average and high-probability generalization bounds for multiclass classification with finite Natarajan dimension.
    SGDA with shuffling: faster convergence for nonconvex-P{\L} minimax optimization. (arXiv:2210.05995v1 [math.OC])
    Stochastic gradient descent-ascent (SGDA) is one of the main workhorses for solving finite-sum minimax optimization problems. Most practical implementations of SGDA randomly reshuffle components and sequentially use them (i.e., without-replacement sampling); however, there are few theoretical results on this approach for minimax algorithms, especially outside the easier-to-analyze (strongly-)monotone setups. To narrow this gap, we study the convergence bounds of SGDA with random reshuffling (SGDA-RR) for smooth nonconvex-nonconcave objectives with Polyak-{\L}ojasiewicz (P{\L}) geometry. We analyze both simultaneous and alternating SGDA-RR for nonconvex-P{\L} and primal-P{\L}-P{\L} objectives, and obtain convergence rates faster than with-replacement SGDA. Our rates also extend to mini-batch SGDA-RR, recovering known rates for full-batch gradient descent-ascent (GDA). Lastly, we present a comprehensive lower bound for two-time-scale GDA, which matches the full-batch rate for primal-P{\L}-P{\L} case.
    Near-Minimax Optimal Estimation With Shallow ReLU Neural Networks. (arXiv:2109.08844v3 [stat.ML] UPDATED)
    We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared data-fitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (mean-squared error) of these neural network estimators when the data-generating function belongs to the second-order Radon-domain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.
    Learning Energy Networks with Generalized Fenchel-Young Losses. (arXiv:2205.09589v2 [cs.LG] UPDATED)
    Energy-based models, a.k.a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs and outputs. To learn the parameters of the energy function, the solution to that optimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction for learning energy networks. Our losses enjoy many desirable properties and their gradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concave energies. We demonstrate our losses on multilabel classification and imitation learning tasks.
    Outlier-Insensitive Kalman Filtering Using NUV Priors. (arXiv:2210.06083v1 [eess.SP])
    The Kalman filter (KF) is a widely-used algorithm for tracking the latent state of a dynamical system from noisy observations. For systems that are well-described by linear Gaussian state space models, the KF minimizes the mean-squared error (MSE). However, in practice, observations are corrupted by outliers, severely impairing the KF's performance. In this work, an outlier-insensitive KF is proposed, where robustness is achieved by modeling each potential outlier as a normally distributed random variable with unknown variance (NUV). The NUV variances are estimated online, using both expectation-maximization (EM) and alternating maximization (AM). The former was previously proposed for the task of smoothing with outliers and is adapted here to filtering. While EM and AM attain the same performance, the AM approach is less complex and thus requires roughly 40% less run-time. Our empirical study demonstrates that our proposed outlier-insensitive KF outperforms previously proposed algorithms in terms of MSE, and that for data clean of outliers, it reverts to the classic KF, i.e., MSE optimality is preserved.  ( 2 min )
    Equal Experience in Recommender Systems. (arXiv:2210.05936v1 [cs.LG])
    We explore the fairness issue that arises in recommender systems. Biased data due to inherent stereotypes of particular groups (e.g., male students' average rating on mathematics is often higher than that on humanities, and vice versa for females) may yield a limited scope of suggested items to a certain group of users. Our main contribution lies in the introduction of a novel fairness notion (that we call equal experience), which can serve to regulate such unfairness in the presence of biased data. The notion captures the degree of the equal experience of item recommendations across distinct groups. We propose an optimization framework that incorporates the fairness notion as a regularization term, as well as introduce computationally-efficient algorithms that solve the optimization. Experiments on synthetic and benchmark real datasets demonstrate that the proposed framework can indeed mitigate such unfairness while exhibiting a minor degradation of recommendation accuracy.  ( 2 min )
    Finite time analysis of temporal difference learning with linear function approximation: Tail averaging and regularisation. (arXiv:2210.05918v1 [cs.LG])
    We study the finite-time behaviour of the popular temporal difference (TD) learning algorithm when combined with tail-averaging. We derive finite-time bounds on the parameter error of the tail-averaged TD iterate under a step-size choice that does not require information about the eigenvalues of the matrix underlying the projected TD fixed point. Our analysis shows that tail-averaged TD converges at the optimal $O\left(1/t\right)$ rate, both in expectation and with high probability. In addition, our bounds exhibit a sharper rate of decay for the initial error (bias), which is an improvement over averaging all iterates. We also propose and analyse a variant of TD that incorporates regularisation. From this analysis, we conclude that the regularised version of TD is useful for problems with ill-conditioned features.  ( 2 min )
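    A compact sketch of tail-averaged TD(0) with linear function approximation: run the standard TD updates, keep the iterates, and average only the tail. The step size, tail fraction, and random transitions below are illustrative, not the paper's analysed setting:

        import numpy as np

        def tail_averaged_td(features, rewards, next_features, gamma=0.9,
                             step=0.05, tail_frac=0.5):
            # TD(0) with linear function approximation; return the average of the
            # final tail_frac fraction of the iterates instead of the last iterate.
            theta = np.zeros(features.shape[1])
            iterates = []
            for phi, r, phi_next in zip(features, rewards, next_features):
                td_error = r + gamma * (phi_next @ theta) - phi @ theta
                theta = theta + step * td_error * phi
                iterates.append(theta.copy())
            tail = iterates[int(len(iterates) * (1.0 - tail_frac)):]
            return np.mean(tail, axis=0)

        rng = np.random.default_rng(0)
        n, d = 10_000, 4
        phi = rng.normal(size=(n, d))
        phi_next = rng.normal(size=(n, d))
        r = phi @ np.ones(d) + 0.1 * rng.normal(size=n)  # toy reward signal
        theta_hat = tail_averaged_td(phi, r, phi_next)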
    Generalised correlated batched bandits via the ARC algorithm with application to dynamic pricing. (arXiv:2102.04263v2 [math.OC] UPDATED)
    The Asymptotic Randomised Control (ARC) algorithm provides a rigorous approximation to the optimal strategy for a wide class of Bayesian bandits, while retaining low computational complexity. In particular, the ARC approach provides nearly optimal choices even when the payoffs are correlated or when more than the reward is observed. The algorithm is guaranteed to asymptotically optimise the expected discounted payoff, with error depending on the initial uncertainty of the bandit. In this paper, we extend the ARC framework to consider a batched bandit problem where observations arrive from a generalised linear model. In particular, we develop a large-sample approximation to allow correlated and generally distributed observations. We apply this to a classic dynamic pricing problem based on a Bayesian hierarchical model and demonstrate that the ARC algorithm outperforms alternative approaches.  ( 2 min )
    Concentration of the exponential mechanism and differentially private multivariate medians. (arXiv:2210.06459v1 [math.ST])
    We prove concentration inequalities for the output of the exponential mechanism about the maximizer of the population objective function. This bound applies to objective functions that satisfy a mild regularity condition. To illustrate our result, we study the problem of differentially private multivariate median estimation. We present novel finite-sample performance guarantees for differentially private multivariate depth-based medians which are essentially sharp. Our results cover commonly used depth functions, such as the halfspace (or Tukey) depth, spatial depth, and the integrated dual depth. We show that under Cauchy marginals, the cost of heavy-tailed location estimation outweighs the cost of privacy. We demonstrate our results numerically using a Gaussian contamination model in dimensions up to $d = 100$, and compare them to a state-of-the-art private mean estimation algorithm.  ( 2 min )
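    For intuition, the exponential mechanism for a one-dimensional private median can be sketched directly: score each candidate value by minus its rank distance to the sample median (a utility with sensitivity 1) and sample with probability proportional to exp(epsilon * utility / 2). The candidate grid below is an illustrative discretisation, not the paper's depth-based construction:

        import numpy as np

        def private_median(x, epsilon, grid, seed=0):
            # Exponential mechanism over a finite candidate grid: utility is minus
            # the rank distance to the median, which has sensitivity 1.
            rng = np.random.default_rng(seed)
            n = len(x)
            ranks = (x[:, None] < grid[None, :]).sum(axis=0)
            utility = -np.abs(ranks - n / 2.0)
            logits = epsilon * utility / 2.0
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            return rng.choice(grid, p=probs)

        rng = np.random.default_rng(1)
        x = rng.standard_cauchy(1001)  # heavy-tailed sample with true median 0
        grid = np.linspace(-10.0, 10.0, 2001)
        print(private_median(x, epsilon=1.0, grid=grid))  # close to the sample median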
    Identifiability and Asymptotics in Learning Homogeneous Linear ODE Systems from Discrete Observations. (arXiv:2210.05955v1 [stat.ML])
    Ordinary Differential Equations (ODEs) have recently gained a lot of attention in machine learning. However, the theoretical aspects, e.g., identifiability and asymptotic properties of statistical estimation, are still obscure. This paper derives a sufficient condition for the identifiability of homogeneous linear ODE systems from a sequence of equally-spaced error-free observations sampled from a single trajectory. When observations are disturbed by measurement noise, we prove that under mild conditions, the parameter estimator based on the Nonlinear Least Squares (NLS) method is consistent and asymptotically normal with an $n^{-1/2}$ convergence rate. Based on the asymptotic normality property, we construct confidence sets for the unknown system parameters and propose a new method to infer the causal structure of the ODE system, i.e., inferring whether there is a causal link between system variables. Furthermore, we extend the results to degraded observations, including aggregated and time-scaled ones. To the best of our knowledge, our work is the first systematic study of the identifiability and asymptotic properties in learning linear ODE systems. We also construct simulations with various system dimensions to illustrate the established theoretical results.  ( 2 min )
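    In the error-free, equally-spaced setting, the problem reduces to recovering the one-step map $e^{A\Delta t}$ by least squares and taking a matrix logarithm; the sketch below illustrates this reduction (the specific $A$ is chosen so the principal log branch applies, echoing the role of the identifiability condition):

        import numpy as np
        from scipy.linalg import expm, logm

        # Homogeneous linear ODE x' = A x, observed without error at spacing dt.
        A_true = np.array([[0.0, 1.0, 0.0],
                           [-1.0, 0.0, 0.0],
                           [0.0, 0.0, -0.5]])  # rotation plus decay; principal log applies
        dt, n = 0.1, 200
        M_true = expm(A_true * dt)
        x = np.zeros((n, 3))
        x[0] = np.array([1.0, 0.5, 2.0])
        for t in range(n - 1):
            x[t + 1] = M_true @ x[t]  # equally-spaced observations from one trajectory

        # Least-squares estimate of the one-step map, then a matrix logarithm.
        X0, X1 = x[:-1], x[1:]
        M_hat = np.linalg.lstsq(X0, X1, rcond=None)[0].T  # X1 = X0 @ M^T
        A_hat = logm(M_hat).real / dt
        print(np.max(np.abs(A_hat - A_true)))  # essentially zero in the noise-free case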
    Self-Supervised Equivariant Regularization Reconciles Multiple Instance Learning: Joint Referable Diabetic Retinopathy Classification and Lesion Segmentation. (arXiv:2210.05946v1 [eess.IV])
    Lesion appearance is a crucial clue for medical providers to distinguish referable diabetic retinopathy (rDR) from non-referable DR. Most existing large-scale DR datasets contain only image-level labels rather than pixel-based annotations. This motivates us to develop algorithms to classify rDR and segment lesions via image-level labels. This paper leverages self-supervised equivariant learning and attention-based multi-instance learning (MIL) to tackle this problem. MIL is an effective strategy to differentiate positive and negative instances, helping us discard background regions (negative instances) while localizing lesion regions (positive ones). However, MIL only provides coarse lesion localization and cannot distinguish lesions located across adjacent patches. In contrast, a self-supervised equivariant attention mechanism (SEAM) generates a segmentation-level class activation map (CAM) that can guide patch extraction of lesions more accurately. Our work aims at integrating both methods to improve rDR classification accuracy. We conduct extensive validation experiments on the Eyepacs dataset, achieving an area under the receiver operating characteristic curve (AUROC) of 0.958, outperforming current state-of-the-art algorithms.  ( 2 min )
    Short-term prediction of stream turbidity using surrogate data and a meta-model approach. (arXiv:2210.05821v1 [stat.ML])
    Many water-quality monitoring programs aim to measure turbidity to help guide effective management of waterways and catchments, yet distributing turbidity sensors throughout networks is typically cost prohibitive. To this end, we built and compared the ability of dynamic regression (ARIMA), long short-term memory neural nets (LSTM), and generalized additive models (GAM) to forecast stream turbidity one step ahead, using surrogate data from relatively low-cost in-situ sensors and publicly available databases. We iteratively trialled combinations of four surrogate covariates (rainfall, water level, air temperature and total global solar exposure), selecting a final model for each type that minimised the corrected Akaike Information Criterion. Cross-validation using a rolling time-window indicated that ARIMA, which included the rainfall and water-level covariates only, produced the most accurate predictions, followed closely by GAM, which included all four covariates. We constructed a meta-model, trained on time-series features of turbidity, to take advantage of the strengths of each model over different time points and predict the best model (that with the lowest forecast error one step prior) for each time step. The meta-model outperformed all other models, indicating that this methodology can yield high accuracy and may be a viable alternative to using measurements sourced directly from turbidity sensors where costs prohibit their deployment and maintenance, and when predicting turbidity across the short term. Our findings also indicated that temperature and light-associated variables, for example underwater illuminance, may hold promise as cost-effective, high-frequency surrogates of turbidity, especially when combined with other covariates, like rainfall, that are typically measured at coarse levels of spatial resolution.  ( 3 min )
    Probabilistic Inverse Modeling: An Application in Hydrology. (arXiv:2210.06213v1 [cs.LG])
    The astounding success of machine-learning methods has made it imperative to obtain more explainable and trustworthy estimates from these models. In hydrology, basin characteristics can be noisy or missing, impacting streamflow prediction. For solving inverse problems in such applications, ensuring explainability is pivotal for tackling issues relating to data bias and large search space. We propose a probabilistic inverse model framework that can reconstruct robust hydrology basin characteristics from dynamic weather-driver inputs and streamflow response data. We address two aspects of building more explainable inverse models, uncertainty estimation and robustness, which can help improve the trust of water managers, the handling of noisy data, and costs. We propose an uncertainty-based learning method that offers a 6\% improvement in $R^2$ for streamflow prediction (forward modeling) from inverse-model-inferred basin characteristic estimates, a 17\% reduction in uncertainty (40\% in the presence of noise), and a 4\% higher coverage rate for basin characteristics.  ( 2 min )
    Trajectory Inference via Mean-field Langevin in Path Space. (arXiv:2205.07146v4 [math.OC] UPDATED)
    Trajectory inference aims at recovering the dynamics of a population from snapshots of its temporal marginals. To solve this task, a min-entropy estimator relative to the Wiener measure in path space was introduced by Lavenant et al. arXiv:2102.09204, and shown to consistently recover the dynamics of a large class of drift-diffusion processes from the solution of an infinite dimensional convex optimization problem. In this paper, we introduce a grid-free algorithm to compute this estimator. Our method consists in a family of point clouds (one per snapshot) coupled via Schr\"odinger bridges which evolve with noisy gradient descent. We study the mean-field limit of the dynamics and prove its global convergence to the desired estimator. Overall, this leads to an inference method with end-to-end theoretical guarantees that solves an interpretable model for trajectory inference. We also present how to adapt the method to deal with mass variations, a useful extension when dealing with single cell RNA-sequencing data where cells can branch and die.  ( 2 min )

  • Open

    [D] Wide Attention Is The Way Forward For Transformers
    "On average, across 4 NLP tasks and 10 attention types, single layer wide models perform 0.3% better than their deep counterparts" Discussions of some wide attention results https://twitter.com/andrey_kurenkov/status/1579474438822985728 submitted by /u/SuchOccasion457 [link] [comments]  ( 127 min )
    [Project] I've built an Auto Subtitled Video Generator using Streamlit and OpenAI Whisper, hosted on HuggingFace spaces.
    All you have to do is input a YouTube video link and get a video with subtitles (along with .txt, .vtt, and .srt files). Whisper can translate 98 different languages to English. If you want to give it a try, here is the link to the app: https://huggingface.co/spaces/BatuhanYilmaz/Auto-Subtitled-Video-Generator submitted by /u/Batuhan_Y  ( 139 min )
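    For reference, the Whisper side of such a pipeline is only a few lines; a minimal sketch (the video path is a placeholder, and the YouTube download step is omitted):

        # pip install openai-whisper  (the YouTube download step is omitted here)
        import whisper

        model = whisper.load_model("base")           # model size is a quality/speed choice
        result = model.transcribe("video.mp4",       # placeholder path to the downloaded video
                                  task="translate")  # "translate" targets English
        print(result["text"])                        # full transcript
        for seg in result["segments"]:               # per-segment timestamps for .srt/.vtt
            print(f"{seg['start']:.2f} --> {seg['end']:.2f} {seg['text']}")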
    [D] How to render a frame-wise NeRF into a video NeRF?
    I have made a NeRF model of a video, where I can pause on any particular frame and render it from novel viewpoints, covering almost the entire range of viewing angles. Now I want to render in a video format; that is, I don't want to pause: the video keeps playing while I render it from novel viewpoints as it progresses. Is there any repository that handles this task? I can think of a brute-force way: define a camera trajectory over the frames, then render and stitch the frames. I was wondering if there are existing/better ways to do this. submitted by /u/Top-Pitch-3253 [link] [comments]  ( 127 min )
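    A sketch of the brute-force route mentioned in the post, with hypothetical stub functions standing in for the trajectory and the per-frame NeRF renderer (writing the mp4 requires the imageio-ffmpeg backend):

        import numpy as np
        import imageio.v2 as imageio

        def camera_pose_at(t, num_frames):
            # Hypothetical trajectory: an orbit angle as a stand-in for a full pose.
            return 2 * np.pi * t / num_frames

        def render_view(t, pose):
            # Hypothetical stand-in for your per-frame NeRF renderer.
            return np.zeros((128, 128, 3), dtype=np.uint8)

        num_frames = 90
        frames = [render_view(t, camera_pose_at(t, num_frames)) for t in range(num_frames)]
        imageio.mimwrite("novel_view.mp4", frames, fps=30)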
    [News] Lightning AI Open Sources Stable Diffusion App "Muse"
    The new app is a fully functioning text-to-image generation app that is free to use and completely open source. It's available on the Lightning AI website. submitted by /u/LightningAI_Main [link] [comments]  ( 125 min )
    [D] We now have "prompt2img" and "img2prompt", therefore "prompt2prompt". Could we do something similar like ResNet for language/image-generation models?
    The famous ResNet architecture adds skip connections, in part to tackle the vanishing gradient problem (with some success): information either skips or goes through a layer. Could we use a similar architecture in language models? Text (or representations thereof) is transformed to images and back to text, to increase the quality of the in-between images (which should make them easier to transform back to text). Very hand-wavy, I know, but maybe you get the idea. submitted by /u/kyzratik500 [link] [comments]  ( 126 min )
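    To make the ResNet analogy concrete, here is a minimal residual ("skip") block in PyTorch; the text-to-image-to-text idea above would of course need far more machinery:

        import torch
        import torch.nn as nn

        class ResidualBlock(nn.Module):
            def __init__(self, dim):
                super().__init__()
                self.layer = nn.Sequential(
                    nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

            def forward(self, x):
                # Information both skips and goes through the layer.
                return x + self.layer(x)

        x = torch.randn(8, 64)
        print(ResidualBlock(64)(x).shape)  # torch.Size([8, 64])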
    [D] Train From Scratch vs Use Previous Checkpoint
    Hi everyone, I encountered this problem at work today and was wondering if I might get some advice here. In a nutshell, I have a (DL) model that I trained on a (quite large) dataset a few months ago. Now we have collected more data (2-3x the initial amount) and are thinking about retraining the model on it. The thing I am trying to figure out is whether we should train a new model from scratch with fresh weights or use the current weights for initialization. My guess is that the latter option would save us training time (we are using paid cloud resources and we are a startup, so money is tight), but I am not sure how this would affect final model performance. Any ideas? submitted by /u/Personal-Trainer-541 [link] [comments]  ( 127 min )
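    A hedged sketch of the warm-start option in PyTorch (the checkpoint path and the smaller learning rate are assumptions, and the one-layer model stands in for the actual architecture):

        import torch
        import torch.nn as nn

        model = nn.Linear(10, 1)  # stand-in for your actual architecture
        # Initialize from the earlier run; assumes a saved state dict.
        model.load_state_dict(torch.load("old_checkpoint.pt"))
        # A lower learning rate than the original run is a common choice when
        # continuing on an enlarged dataset, so early updates don't wipe out
        # the warm-started weights.
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)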
    [D] Image labeling from excel with URL
    Hi, I am trying to label traffic-sign images from an Excel file. The labels of the different kinds of traffic signs are in column A and the URLs to the images are in another column. I have downloaded all the images, but they are saved under the URL name without the labels. Is there a way to automatically assign all the labels to the corresponding images, or do I have to label everything by hand? I use Python for this project but I'm still learning. All help would be appreciated! Linked is an example of what the Excel file looks like; the whole file is 2500+ images. https://preview.redd.it/c12qaww30et91.png?width=947&format=png&auto=webp&s=f4dc2ad942c13beaa8a88e8d3695216fc783e481 submitted by /u/Savings_Scratch9149 [link] [comments]  ( 125 min )
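    One possible sketch of the automatic labeling, assuming the sheet has label and URL columns and that each file was saved under the last path segment of its URL (column names, file name, and folders are assumptions; adjust to your spreadsheet):

        import os
        from urllib.parse import urlparse
        import pandas as pd

        df = pd.read_excel("signs.xlsx")   # assumed columns: "label", "url"
        os.makedirs("labeled", exist_ok=True)
        for _, row in df.iterrows():
            # e.g. "https://example.com/imgs/abc123.png" -> "abc123.png"
            filename = os.path.basename(urlparse(row["url"]).path)
            src = os.path.join("images", filename)
            if os.path.exists(src):
                # Prefix the label so each file carries its class.
                dst = os.path.join("labeled", f"{row['label']}_{filename}")
                os.rename(src, dst)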
    [Project] On simplifying MLOps stack
    Running ML workflows involves several hurdles. You connect to a machine through SSH, install the CUDA driver, fetch your code, copy the data, build a Docker image, run the script, watch the process, etc. Finally, if the machine is a cloud instance, you stop it. The alternative is to use end-to-end platforms, whether open-source or enterprise. In an attempt to simplify this, we open-sourced a tool that lets you launch ML workflows from the CLI while they actually run in the cloud, taking care of provisioning infrastructure, setting up the environment, and so on. We would be glad to get your feedback on the project [github.com/dstackai/dstack]. See the link in the comment. Many thanks submitted by /u/Kaudinya [link] [comments]  ( 125 min )
  • Open

    Large and Fully Charged: Polestar 3 Sets New Standard for Premium Electric SUVs
    The age of electric vehicles has arrived and, with it, an entirely new standard for premium SUVs. Polestar, the performance EV brand spun out from Volvo Cars, launched its third model today in Copenhagen. With the Polestar 3, the automaker has taken SUV design back to the drawing board, building a vehicle as innovative as […] The post Large and Fully Charged: Polestar 3 Sets New Standard for Premium Electric SUVs appeared first on NVIDIA Blog.  ( 5 min )
    What Is Green Computing?
    Everyone wants green computing. Mobile users demand maximum performance and battery life. Businesses and governments increasingly require systems that are powerful yet environmentally friendly. And cloud services must respond to global demands without making the grid stutter. For these reasons and more, green computing has evolved rapidly over the past three decades, and it’s here […] The post What Is Green Computing? appeared first on NVIDIA Blog.  ( 9 min )
  • Open

    Presenting a project's supporting material
    Hello! I am entering the research phase of my MSc in AI. The main body of work will be presented in a research paper of around 8 pages. For the material that we could not or did not put into the paper, we are to submit what the department calls "supporting material". They do not specify what form this material can take. How would you gather and present this supporting material? Initially I am thinking about a GitHub repo to present the project, with Jupyter notebooks and the like, but I am interested in how others might do this. It seems previous people on the course have just written a longer report and submitted that, but that doesn't seem like a good way to show what you did. submitted by /u/LateThree1 [link] [comments]  ( 119 min )
    AlphaZero: Bigger is better, say AI researchers
    submitted by /u/much_successes [link] [comments]  ( 113 min )
    How do we still feed the world while reducing chemical use and CO2? Simon Aspinall, CEO, Ecorobotix
    submitted by /u/chelsea_bear [link] [comments]  ( 111 min )
    I let an AI add nearly 1000 pieces of new content to my kingdom management rpg (from character dialogues and faction names, to jester jokes and random loot items)
    submitted by /u/Huw2k8 [link] [comments]  ( 108 min )
    A recent tune of mine, with an alt version I created using AI
    The original tune's melody is a slide guitar; the AI changed that to a human-like voicing. The AI version is a rough cut, but EQ and processing cleaned it up a bit. The AI added whispers throughout, in addition to changing the slide guitar. Original: https://soundcloud.com/iamlazerkat/lost-in-the-fabric-of-space-time AI: https://soundcloud.com/iamlazerkat/ai-voiced-lost-in-the-fabric-of-space-time submitted by /u/IamLazerKat [link] [comments]  ( 108 min )
    What are some interesting topics I can talk about in my research paper?
    Title submitted by /u/Careless-Yogurt-7871 [link] [comments]  ( 108 min )
    Join the rebellion!
    submitted by /u/Angry_Grandpa_ [link] [comments]  ( 111 min )
    Another Midjourney creation! Which one do you prefer?
    submitted by /u/SS-AI [link] [comments]  ( 107 min )
    Just launched Synesthetic.ai, search and remix 10M+ Stable Diffusion images
    submitted by /u/notrealAI [link] [comments]  ( 111 min )
    Creating Full Body Deepfakes by Combining Multiple NeRFs
    submitted by /u/magenta_placenta [link] [comments]  ( 112 min )
  • Open

    Customize business rules for intelligent document processing with human review and BI visualization
    A massive number of business documents are processed daily across industries. Many of these documents are paper-based, scanned into your system as images, or in an unstructured format like PDF. Each company may apply unique rules, tied to its business context, while processing these documents. How to extract information accurately and process them flexibly is […]  ( 8 min )
  • Open

    A more direct approach to series solutions
    In the previous post we found a solution using operator calculus, i.e. treating the differential operator D like a number and doing tricks with it. See the earlier post for a justification of why we can get away with unorthodox manipulations. We can generalize the method of the previous post to say that a […] A more direct approach to series solutions first appeared on John D. Cook.
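    To make "treating D like a number" concrete, here is a small worked example in that spirit (an illustration of the technique, not necessarily an example from the post). To solve $y' - y = x$, write $(D-1)y = x$ and invert the operator as a geometric series:

    $$ y = (D-1)^{-1}x = -(1-D)^{-1}x = -(1 + D + D^2 + \cdots)\,x = -(x+1). $$

    Checking: $y = -x - 1$ gives $y' - y = -1 + x + 1 = x$, as required (plus any homogeneous solution $Ce^{x}$).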
    Operator calculus
    Students who take a course in differential equations don’t really learn how to solve differential equations. The problems whose solutions they reproduce were solved over 300 years ago. The methods taught in undergraduate ODE classes are in some sense mnemonics, a way to remember a solution discovered long ago. Abandon hope of originality The more […] Operator calculus first appeared on John D. Cook.  ( 7 min )
  • Open

    Are there any papers or theories on why SAC is better for continuous control tasks than on-policy methods?
    I'm curious whether there are any papers or research on why SAC performs better than PPO for continuous control tasks. submitted by /u/DolantheMFWizard [link] [comments]  ( 117 min )
  • Open

    Training the Transformer Model
    We have put together the complete Transformer model, and now we are ready to train it for neural machine translation. We shall use a training dataset for this purpose, which contains short English and German sentence pairs. We will also revisit the role of masking in computing the accuracy and loss metrics during the training […] The post Training the Transformer Model appeared first on Machine Learning Mastery.  ( 27 min )
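    To make the masking point concrete, here is a hedged sketch of a padding-masked loss in Keras (assuming padding token id 0 and softmax outputs; the tutorial's actual implementation may differ in details):

        import tensorflow as tf

        def masked_loss(y_true, y_pred):
            # Exclude padded target positions (id 0, by assumption) from the average.
            mask = tf.cast(tf.not_equal(y_true, 0), tf.float32)
            loss = tf.keras.losses.sparse_categorical_crossentropy(y_true, y_pred)
            return tf.reduce_sum(loss * mask) / tf.reduce_sum(mask)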
  • Open

    how to choose the number of Dense layers and their sizes?
    I am making a neural network program to convert Celsius to Fahrenheit. The inputs are integers.

        model.add(Dense(512, input_dim=1, activation='relu'))
        model.add(Dense(512, activation='relu'))
        model.add(Dense(256, activation='relu'))
        model.add(Dense(128, activation='relu'))
        model.add(Dense(1, activation='sigmoid'))

    Will this be too much for just 6 integer inputs? submitted by /u/PureForWhite [link] [comments]  ( 116 min )
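    For comparison, a minimal sketch (assuming Keras; the data and hyperparameters are illustrative). The conversion F = C * 9/5 + 32 is linear, so a single Dense(1) unit with the default linear activation can represent it exactly; note also that a sigmoid output confines predictions to (0, 1), which cannot reach Fahrenheit values like 212.

        import numpy as np
        import tensorflow as tf

        # Six integer training examples, as in the question.
        celsius = np.array([-40, -10, 0, 8, 15, 22], dtype=float)
        fahrenheit = celsius * 9 / 5 + 32

        # One linear unit is enough for a linear mapping.
        model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
        model.compile(optimizer=tf.keras.optimizers.Adam(0.1), loss="mse")
        model.fit(celsius, fahrenheit, epochs=500, verbose=0)
        print(model.predict(np.array([100.0])))  # approximately [[212.]]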

  • Open

    My family's unlikely homeschooling journey
    My husband Jeremy and I never intended to homeschool, and yet we have now, unexpectedly, committed to homeschooling long-term. Prior to the pandemic, we both worked full-time in careers that we loved and found meaningful, and we sent our daughter to a full-day Montessori school. Although I struggled with significant health issues, I felt unbelievably lucky and fulfilled in both my family life and my professional life. The pandemic upended my careful balance. Every family is different, with different needs, circumstances, and constraints, and what works for one may not work for others. My intention here is primarily to share the journey of my own (very privileged) family. Our unplanned introduction to homeschooling For the first year of the pandemic, most schools in California, where …  ( 7 min )

  • Open

    The Jupyter+git problem is now solved
    Jupyter notebooks don’t work with git by default. With nbdev2, the Jupyter+git problem has been totally solved. It provides a set of hooks which provide clean git diffs, solve most git conflicts automatically, and ensure that any remaining conflicts can be resolved entirely within the standard Jupyter notebook environment. To get started, follow the directions on Git-friendly Jupyter. The Jupyter+git problem: Jupyter notebooks are a powerful tool for scientists, engineers, technical writers, students, teachers, and more. They provide an ideal notebook environment for interact…  ( 7 min )
2022-11-11T01:09:20.450Z osmosfeed 1.15.1